Rossmann Store Sales Prediction from Kaggle

Problem:

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

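In this case we use XGBoost.

XGBoost is short for "eXtreme Gradient Boosting", a method released in 2014.

It combines gradient boosting, a form of ensemble learning, with decision trees, and is known for very strong generalization performance.

Ensemble learning combines multiple weak learners (models that are individually not very accurate) to produce an overall result. There are two main types: bagging and boosting.

Bagging can be thought of as using weak learners in parallel; combining decision trees with bagging gives a random forest. Boosting instead trains the weak learners one after another, each one correcting the errors of those before it. The task here is a machine-learning process that predicts a company's sales: a regression model (for example, a linear regression line) is fitted, and values such as future monthly sales are predicted from it.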
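To make the idea of boosting concrete, here is a minimal sketch written with NumPy only. It is a toy gradient boosting loop for squared error that uses one-split decision stumps as weak learners on synthetic data; it is not XGBoost's actual algorithm (XGBoost adds regularization and second-order gradient information, among other things). The function names fit_stump, predict_stump, and boost are made up for this illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on 1-D x that best fits the residual (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each on the residual of the current ensemble."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction  # negative gradient of the squared error
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction

# Synthetic data (an assumption for the example, not the Rossmann data)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

base, stumps, fitted = boost(x, y)
mse_constant = np.mean((y - y.mean()) ** 2)  # error of predicting the mean everywhere
mse_boosted = np.mean((y - fitted) ** 2)     # training error after boosting
print(mse_constant, mse_boosted)
```

Each stump on its own is a very weak model, but adding them sequentially with a small learning rate lowers the training error step by step, which is the behaviour the description of boosting refers to.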

Import the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sb

Load the data

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
store = pd.read_csv('store.csv')

Data fields

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

Id - an Id that represents a (Store, Date) duple within the test set
Store - a unique Id for each store
Sales - the turnover for any given day (this is what you are predicting)
Customers - the number of customers on a given day
Open - an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public schools
StoreType - differentiates between 4 different store models: a, b, c, d
Assortment - describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance - distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] - gives the approximate year and month of the time the nearest competitor was opened
Promo - indicates whether a store is running a promo on that day
Promo2 - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
Promo2Since[Year/Week] - describes the year and calendar week when the store started participating in Promo2
PromoInterval - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

Check the data

print('Training Data Shape : ',train.shape)
print('Test Data Shape : ',test.shape)
print('Store Data Shape : ',store.shape)

Training Data Shape :  (1017209, 9)
Test Data Shape :  (41088, 8)
Store Data Shape :  (1115, 10)

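pandas head() returns the first n rows of a DataFrame.

Rows with Open == 0 are days on which the store was closed, so only rows for open stores are used here. The processing further below therefore removes the rows for stores that were not open.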

train.head()

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	SchoolHoliday
0	1	5	2015-07-31	5263	555	1	1	1
1	2	5	2015-07-31	6064	625	1	1	1
2	3	5	2015-07-31	8314	821	1	1	1
3	4	5	2015-07-31	13995	1498	1	1	1
4	5	5	2015-07-31	4822	559	1	1	1

test.head()

	Id	Store	DayOfWeek	Date	Open	Promo
0	1	1	4	2015-09-17	1.0	1
1	2	3	4	2015-09-17	1.0	1
2	3	7	4	2015-09-17	1.0	1
3	4	8	4	2015-09-17	1.0	1
4	5	9	4	2015-09-17	1.0	1

store.head()

	Store	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	c	a	1270.0	9.0	2008.0	0	NaN	NaN	NaN
1	2	a	a	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct
2	3	a	a	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct
3	4	c	c	620.0	9.0	2009.0	0	NaN	NaN	NaN
4	5	a	a	29910.0	4.0	2015.0	0	NaN	NaN	NaN

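not_open holds the rows of train where Open == 0 but Sales is not 0. A store that is not open should have zero sales, so the condition (train['Sales'] != 0) picks out closed-store rows that nevertheless report sales.

This checks for inconsistent records: stores that were closed but have sales, and stores that were open but have no sales. Neither should normally occur, so we confirm whether such rows are mixed into the data.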

not_open = train[(train['Open'] == 0) & (train['Sales'] != 0)]
print("No closed store with sales: " + str(not_open.size == 0))
no_sales = train[(train['Open'] == 1) & (train['Sales'] <= 0)]
print("No open store with no sales: " + str(no_sales.size == 0))

No closed store with sales: True
No open store with no sales: False
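The same two checks can be tried on a tiny hand-made table. The sketch below uses a hypothetical four-row DataFrame (toy, not the real train.csv) and the DataFrame.empty attribute, which expresses the same test as size == 0.

```python
import pandas as pd

# Hypothetical miniature of train.csv with only the columns the check needs
toy = pd.DataFrame({
    'Store': [1, 2, 3, 4],
    'Open':  [1, 0, 1, 1],
    'Sales': [5263, 0, 0, 6064],
})

# Closed but reporting sales: should be empty in consistent data
closed_with_sales = toy[(toy['Open'] == 0) & (toy['Sales'] != 0)]
# Open but reporting no sales
open_without_sales = toy[(toy['Open'] == 1) & (toy['Sales'] <= 0)]

print(closed_with_sales.empty)               # True
print(open_without_sales['Store'].tolist())  # [3]
```

In this toy table store 3 is open with zero sales, which corresponds to the False printed by the second check on the real data.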

Rows whose Sales value is 0 are dropped: train.loc[train['Sales'] > 0] keeps only the rows with Sales greater than 0, and the result is reassigned to train.

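# .copy() makes train an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
train = train.loc[train['Sales'] > 0].copy()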

After dropping the rows with zero Sales, print the shape of train again to see the size of the resulting dataset.

Getting the number of rows and columns: the shape attribute of a pandas.DataFrame returns a tuple (rows, columns).

As shown below, train now has 844,338 rows and 9 columns.

print('New Training Data Shape : ',train.shape)

New Training Data Shape :  (844338, 9)

Sorting the Date column and taking the first and last values gives the period covered by the data.

dates = pd.to_datetime(train['Date']).sort_values()
dates = dates.unique()
start_date = dates[0]
end_date = dates[-1]
print("Start date: ", start_date)
print("End Date: ", end_date)
date_range = pd.date_range(start_date, end_date).values

Start date:  2013-01-01T00:00:00.000000000
End Date:  2015-07-31T00:00:00.000000000
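The code above also builds date_range, the full list of calendar days between the first and last date. One possible use, sketched here on a hypothetical handful of dates rather than the real data, is to check whether any day in the period is missing from the dataset.

```python
import pandas as pd

# Hypothetical dates with a gap on 2015-07-03 and one duplicate
toy_dates = pd.to_datetime(pd.Series(['2015-07-01', '2015-07-02', '2015-07-04', '2015-07-02']))

observed = pd.DatetimeIndex(toy_dates.unique()).sort_values()
full_range = pd.date_range(observed[0], observed[-1])  # every calendar day in the period
missing = full_range.difference(observed)              # days with no rows at all

print(len(full_range))                        # 4
print(missing.strftime('%Y-%m-%d').tolist())  # ['2015-07-03']
```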

Visualization

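Visualizing the data

matplotlib.pyplot.subplots

The page "Matplotlibの使い方④（plt.subplots、plt.title、plt.legend）｜Pythonによる可視化入門 #4" may explain this more clearly, but plt.subplots is used to draw several plots together in a single figure. Here it draws seven plots at once, one for each day of the week.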

plt.rcParams['figure.figsize'] = (15.0, 12.0)

f, ax = plt.subplots(7, sharex=True, sharey=True)
for i in range(1, 8):
    data = train[train['DayOfWeek'] == i]
    ax[i - 1].set_title("Day {0}".format(i))
    ax[i - 1].scatter(data['Customers'], data['Sales'], label=i)

plt.legend()
plt.xlabel('Customers')
plt.ylabel('Sales')
plt.tight_layout()
plt.show()

[Figure: scatter plots of Sales vs. Customers, one panel per day of the week (output_11_0.png)]

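A general correlation between customers and sales can be observed in the plot above.

The scatter plot below puts train['Customers'] on the x-axis and train['Sales'] on the y-axis, colored by day of the week, to show whether the day of the week is related to sales. Points with the same color belong to the same day of the week, so it is possible to see on which days sales tend to be higher.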

# plotting customers vs sales, colored by day of week
# (a colormap name is passed directly; plt.cm.get_cmap was removed in recent matplotlib releases)
plt.scatter(train['Customers'], train['Sales'], c=train['DayOfWeek'], alpha=0.6, cmap='YlGn')

plt.xlabel('Customers')
plt.ylabel('Sales')
plt.show()

[Figure: scatter plot of Sales vs. Customers, colored by DayOfWeek (output_13_0.png)]

Here we check whether SchoolHoliday (school holidays) affects sales.

for i in [0, 1]:
    data = train[train['SchoolHoliday'] == i]
    if (len(data) == 0):
        continue
    plt.scatter(data['Customers'], data['Sales'], label=i)

plt.legend()
plt.xlabel('Customers')
plt.ylabel('Sales')
plt.show()

[Figure: scatter plot of Sales vs. Customers, grouped by SchoolHoliday (output_14_0.png)]

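As the data shows, SchoolHoliday is a 0/1 flag: 0 means it is not a school holiday and 1 means it is. This plot colors holiday and non-holiday days differently to see whether either is related to sales (the points overlap so heavily that it is hard to tell from the plot). Judging from it, whether or not it is a school holiday has little relationship with sales. The plot was meant to show whether students tend to go shopping on holidays, but there does not seem to be much of a connection.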

Next we look at how sales relate to whether a promotion is running. As the plot shows, when a promotion (such as a discount sale) is running, the orange points tend to cluster higher up.

for i in [0, 1]:
    data = train[train['Promo'] == i]
    if (len(data) == 0):
        continue
    plt.scatter(data['Customers'], data['Sales'], label=i)

plt.legend()
plt.xlabel('Customers')
plt.ylabel('Sales')
plt.show()

[Figure: scatter plot of Sales vs. Customers, grouped by Promo (output_15_0.png)]

Next, the data is reshaped slightly by adding features such as SalesPerCustomer (sales per customer) to make it easier to analyze.

train.head()

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	SchoolHoliday
0	1	5	2015-07-31	5263	555	1	1	1
1	2	5	2015-07-31	6064	625	1	1	1
2	3	5	2015-07-31	8314	821	1	1	1
3	4	5	2015-07-31	13995	1498	1	1	1
4	5	5	2015-07-31	4822	559	1	1	1

Computing train['Sales'] / train['Customers'] adds a column to train holding the sales amount per customer. In addition, for each store we compute the average number of customers, the average sales, and the average sales per customer.

train['SalesPerCustomer'] = train['Sales'] / train['Customers']

avg_store = train.groupby('Store')[['Sales', 'Customers', 'SalesPerCustomer']].mean()
avg_store.rename(columns=lambda x: 'Avg' + x, inplace=True)
store = pd.merge(avg_store.reset_index(), store, on='Store')
store.head()

	Store	AvgSales	AvgCustomers	AvgSalesPerCustomer	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	4759.096031	564.049936	8.393038	c	a	1270.0	9.0	2008.0	0	NaN	NaN	NaN
1	2	4953.900510	583.998724	8.408443	a	a	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct
2	3	6942.568678	750.077022	9.117599	a	a	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct
3	4	9638.401786	1321.752551	7.249827	c	c	620.0	9.0	2009.0	0	NaN	NaN	NaN
4	5	4676.274711	537.340180	8.611229	a	a	29910.0	4.0	2015.0	0	NaN	NaN	NaN

avg_store.head()

	AvgSales	AvgCustomers	AvgSalesPerCustomer
Store
1	4759.096031	564.049936	8.393038
2	4953.900510	583.998724	8.408443
3	6942.568678	750.077022	9.117599
4	9638.401786	1321.752551	7.249827
5	4676.274711	537.340180	8.611229
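The groupby, rename, and merge steps above can be followed more easily on a small table. The sketch below uses hypothetical toy_train and toy_store DataFrames with made-up numbers; only the column names are taken from the post.

```python
import pandas as pd

# Hypothetical daily rows for two stores
toy_train = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'Sales': [100, 200, 300, 500],
    'Customers': [10, 20, 30, 50],
})
toy_train['SalesPerCustomer'] = toy_train['Sales'] / toy_train['Customers']

# One row per store with the mean of each column, then prefix the column names with 'Avg'
toy_avg = toy_train.groupby('Store')[['Sales', 'Customers', 'SalesPerCustomer']].mean()
toy_avg = toy_avg.rename(columns=lambda name: 'Avg' + name)

# Attach the per-store averages to the store table (Store is the index of toy_avg, so reset it first)
toy_store = pd.DataFrame({'Store': [1, 2], 'StoreType': ['c', 'a']})
toy_store = pd.merge(toy_avg.reset_index(), toy_store, on='Store')
print(toy_store)
```

Store 1 ends up with AvgSales 150.0 and AvgCustomers 15.0, and store 2 with 400.0 and 40.0, which is easy to verify by hand.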

store.StoreType.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

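This plot splits the stores by StoreType to see how the store type relates to sales. Type b has only a few points, so there are few stores of that type, and it is presumably a format that is rarely seen in towns. Type a has many points, many of them with high customer counts, which suggests that there are many stores of this type.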

for i in store.StoreType.unique():
    data = store[store['StoreType'] == i]
    if (len(data) == 0):
        continue
    plt.scatter(data['AvgCustomers'], data['AvgSales'], label=i)

plt.legend()
plt.xlabel('Average Customers')
plt.ylabel('Average Sales')
plt.show()

[Figure: scatter plot of Average Sales vs. Average Customers, grouped by StoreType (output_21_0.png)]

store.Assortment.unique()

array(['a', 'c', 'b'], dtype=object)

for i in store.Assortment.unique():
    data = store[store['Assortment'] == i]
    if (len(data) == 0):
        continue
    plt.scatter(data['AvgCustomers'], data['AvgSales'], label=i)

plt.legend()
plt.xlabel('Average Customers')
plt.ylabel('Average Sales')
plt.show()

[Figure: scatter plot of Average Sales vs. Average Customers, grouped by Assortment (output_23_0.png)]

store.Promo2.unique()

array([0, 1], dtype=int64)

The relationship with the promotion (Promo2) is examined in the same way.

for i in store.Promo2.unique():
    data = store[store['Promo2'] == i]
    if (len(data) == 0):
        continue
    plt.scatter(data['AvgCustomers'], data['AvgSales'], label=i)

plt.legend()
plt.xlabel('Average Customers')
plt.ylabel('Average Sales')
plt.show()

[Figure: scatter plot of Average Sales vs. Average Customers, grouped by Promo2 (output_25_0.png)]

Feature Engineering

Checking for and counting missing values (NaN) with pandas: this outputs how many null values each column contains.

store.isnull().sum()

Store                          0
AvgSales                       0
AvgCustomers                   0
AvgSalesPerCustomer            0
StoreType                      0
Assortment                     0
CompetitionDistance            3
CompetitionOpenSinceMonth    354
CompetitionOpenSinceYear     354
Promo2                         0
Promo2SinceWeek              544
Promo2SinceYear              544
PromoInterval                544
dtype: int64

Replacing the missing CompetitionDistance values with -1 removes the nulls from that column.

CompetitionDistance - distance in meters to the nearest competitor store

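As described above, CompetitionDistance is the distance to the nearest competitor. The data shows that a large share of stores have a competitor nearby, and stores with a nearby competitor tend to have higher average sales.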

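# fill NaN values
# Note: fillna returns a new Series and the result is not assigned here, so store
# itself is unchanged; assign it back to the column to make the replacement stick.
store["CompetitionDistance"].fillna(-1)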


plt.scatter(store['CompetitionDistance'], store['AvgSales'])

plt.xlabel('CompetitionDistance')
plt.ylabel('Average Sales')
plt.show()

[Figure: scatter plot of Average Sales vs. CompetitionDistance (output_28_0.png)]
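One detail of fillna is worth seeing in isolation: it returns a new Series rather than modifying the column in place, so the result has to be assigned back. The sketch below demonstrates this on a hypothetical three-row toy_store table.

```python
import numpy as np
import pandas as pd

# Hypothetical store table with one missing distance
toy_store = pd.DataFrame({'CompetitionDistance': [1270.0, np.nan, 570.0]})

toy_store['CompetitionDistance'].fillna(-1)             # result is discarded
print(toy_store['CompetitionDistance'].isnull().sum())  # 1

toy_store['CompetitionDistance'] = toy_store['CompetitionDistance'].fillna(-1)
print(toy_store['CompetitionDistance'].isnull().sum())  # 0
print(toy_store['CompetitionDistance'].tolist())        # [1270.0, -1.0, 570.0]
```

Using -1 as a sentinel keeps "distance unknown" distinguishable from a real distance of 0, which is a reasonable choice for tree-based models.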

store.head()

	Store	AvgSales	AvgCustomers	AvgSalesPerCustomer	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	4759.096031	564.049936	8.393038	c	a	1270.0	9.0	2008.0	0	NaN	NaN	NaN
1	2	4953.900510	583.998724	8.408443	a	a	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct
2	3	6942.568678	750.077022	9.117599	a	a	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct
3	4	9638.401786	1321.752551	7.249827	c	c	620.0	9.0	2009.0	0	NaN	NaN	NaN
4	5	4676.274711	537.340180	8.611229	a	a	29910.0	4.0	2015.0	0	NaN	NaN	NaN

store['StoreType'] = store['StoreType'].astype('category').cat.codes
store['Assortment'] = store['Assortment'].astype('category').cat.codes
train["StateHoliday"] = train["StateHoliday"].astype('category').cat.codes
store.head()

	Store	AvgSales	AvgCustomers	AvgSalesPerCustomer	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	4759.096031	564.049936	8.393038	2	0	1270.0	9.0	2008.0	0	NaN	NaN	NaN
1	2	4953.900510	583.998724	8.408443	0	0	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct
2	3	6942.568678	750.077022	9.117599	0	0	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct
3	4	9638.401786	1321.752551	7.249827	2	2	620.0	9.0	2009.0	0	NaN	NaN	NaN
4	5	4676.274711	537.340180	8.611229	0	0	29910.0	4.0	2015.0	0	NaN	NaN	NaN

Comparing these tables shows that StoreType, Assortment, and StateHoliday, which were categorized by strings such as a, b, c, d, have been converted to numeric codes so that they are easier to use in computation.
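How astype('category').cat.codes assigns the numbers can be checked on a short Series. This is a small sketch on hypothetical values; the categories are sorted alphabetically and numbered from 0, which is consistent with c becoming 2 and a becoming 0 in the table above.

```python
import pandas as pd

# Hypothetical StoreType values
store_type = pd.Series(['c', 'a', 'a', 'd', 'b'])

as_category = store_type.astype('category')
print(list(as_category.cat.categories))  # ['a', 'b', 'c', 'd']
print(as_category.cat.codes.tolist())    # [2, 0, 0, 3, 1]

# Keep the code-to-label mapping so the codes can be decoded later
mapping = dict(enumerate(as_category.cat.categories))
print(mapping)  # {0: 'a', 1: 'b', 2: 'c', 3: 'd'}
```

Note that the codes impose an arbitrary order on the categories. Tree-based models such as XGBoost can usually cope with that, whereas a linear model would typically need one-hot encoding instead.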

train.head()

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday	SalesPerCustomer
0	1	5	2015-07-31	5263	555	1	1	1	1	9.482883
1	2	5	2015-07-31	6064	625	1	1	1	1	9.702400
2	3	5	2015-07-31	8314	821	1	1	1	1	10.126675
3	4	5	2015-07-31	13995	1498	1	1	1	1	9.342457
4	5	5	2015-07-31	4822	559	1	1	1	1	8.626118

This is a LEFT JOIN. It does roughly the same thing as

select * from train left join store on train.store_id = store.store_id

joining the rows that share the same Store value, with Store acting as the unique key.

merged = pd.merge(train, store, on='Store', how='left')
merged.head()

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday	SalesPerCustomer	...	AvgSalesPerCustomer	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	5	2015-07-31	5263	555	1	1	1	1	9.482883	...	8.393038	2	0	1270.0	9.0	2008.0	0	NaN	NaN	NaN
1	2	5	2015-07-31	6064	625	1	1	1	1	9.702400	...	8.408443	0	0	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct
2	3	5	2015-07-31	8314	821	1	1	1	1	10.126675	...	9.117599	0	0	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct
3	4	5	2015-07-31	13995	1498	1	1	1	1	9.342457	...	7.249827	2	2	620.0	9.0	2009.0	0	NaN	NaN	NaN
4	5	5	2015-07-31	4822	559	1	1	1	1	8.626118	...	8.611229	0	0	29910.0	4.0	2015.0	0	NaN	NaN	NaN

5 rows × 22 columns

merged.shape

(844338, 22)
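The behaviour of the left join can be seen on a small example. In the sketch below, toy_train and toy_store are hypothetical tables; store 3 is deliberately missing from toy_store to show what a left join does with unmatched rows. The validate argument is an optional extra that makes pandas raise an error if Store is not unique on the right side.

```python
import pandas as pd

# Hypothetical daily rows (left) and store attributes (right)
toy_train = pd.DataFrame({'Store': [1, 1, 2, 3], 'Sales': [5263, 5020, 6064, 8314]})
toy_store = pd.DataFrame({'Store': [1, 2], 'StoreType': ['c', 'a']})

toy_merged = pd.merge(toy_train, toy_store, on='Store', how='left', validate='many_to_one')

print(len(toy_merged))                         # 4: every row of the left table is kept
print(toy_merged['StoreType'].isnull().sum())  # 1: store 3 has no match, so it gets NaN
```

This is why merged keeps the same 844,338 rows as train: a left join never drops rows from the left table, and with a unique key on the right it never duplicates them either.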

merged.isnull().sum()

Store                             0
DayOfWeek                         0
Date                              0
Sales                             0
Customers                         0
Open                              0
Promo                             0
StateHoliday                      0
SchoolHoliday                     0
SalesPerCustomer                  0
AvgSales                          0
AvgCustomers                      0
AvgSalesPerCustomer               0
StoreType                         0
Assortment                        0
CompetitionDistance            2186
CompetitionOpenSinceMonth    268600
CompetitionOpenSinceYear     268600
Promo2                            0
Promo2SinceWeek              423292
Promo2SinceYear              423292
PromoInterval                423292
dtype: int64

This replaces the null values with 0.

# remove NaNs
merged.fillna(0, inplace=True)

The Date column is converted to datetime and reassigned. While it is a string it cannot be processed as a date, so it is converted to a datetime type.

merged['Date'] = pd.to_datetime(merged['Date'])
merged.dtypes

Store                                 int64
DayOfWeek                             int64
Date                         datetime64[ns]
Sales                                 int64
Customers                             int64
Open                                  int64
Promo                                 int64
StateHoliday                           int8
SchoolHoliday                         int64
SalesPerCustomer                    float64
AvgSales                            float64
AvgCustomers                        float64
AvgSalesPerCustomer                 float64
StoreType                              int8
Assortment                             int8
CompetitionDistance                 float64
CompetitionOpenSinceMonth           float64
CompetitionOpenSinceYear            float64
Promo2                                int64
Promo2SinceWeek                     float64
Promo2SinceYear                     float64
PromoInterval                        object
dtype: object

Using the datetime conversion above, new Year, Month, Day, and Week columns are created and filled from the date.

merged['Year'] = merged.Date.dt.year
merged['Month'] = merged.Date.dt.month
merged['Day'] = merged.Date.dt.day
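# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
merged['Week'] = merged.Date.dt.isocalendar().week.astype(int)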
merged.head()

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday	SalesPerCustomer	...	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval	Year	Month	Day	Week
0	1	5	2015-07-31	5263	555	1	1	1	1	9.482883	...	9.0	2008.0	0	0.0	0.0	0	2015	7	31	31
1	2	5	2015-07-31	6064	625	1	1	1	1	9.702400	...	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct	2015	7	31	31
2	3	5	2015-07-31	8314	821	1	1	1	1	10.126675	...	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct	2015	7	31	31
3	4	5	2015-07-31	13995	1498	1	1	1	1	9.342457	...	9.0	2009.0	0	0.0	0.0	0	2015	7	31	31
4	5	5	2015-07-31	4822	559	1	1	1	1	8.626118	...	4.0	2015.0	0	0.0	0.0	0	2015	7	31	31

5 rows × 26 columns
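The date decomposition can be checked on two dates taken from the period covered by the data. This is a minimal sketch on a hypothetical two-row frame; it uses dt.isocalendar() for the week number, which is available in pandas 1.1 and later.

```python
import pandas as pd

toy = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2013-01-01'])})

toy['Year'] = toy['Date'].dt.year
toy['Month'] = toy['Date'].dt.month
toy['Day'] = toy['Date'].dt.day
toy['Week'] = toy['Date'].dt.isocalendar().week.astype(int)  # ISO 8601 week number

print(toy)
```

2015-07-31 falls in ISO week 31, matching the Week column in the table above, and 2013-01-01 falls in ISO week 1.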

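Here we compute how long the nearest competitor has been open, in months: 12 times the difference in years plus the difference in months between the date of the row and the date the competitor opened.

Rows where CompetitionOpenSinceYear is 0 are those whose opening date was missing (the NaN values were replaced with 0 above). The formula is meaningless for them, so loc sets MonthsCompetitionOpen to 0 for these rows to keep the data consistent.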

# Number of months that competition has existed for
merged['MonthsCompetitionOpen'] = 12 * (merged['Year'] - merged['CompetitionOpenSinceYear']) + (merged['Month'] - merged['CompetitionOpenSinceMonth'])
merged.loc[merged['CompetitionOpenSinceYear'] == 0, 'MonthsCompetitionOpen'] = 0
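As a worked example of the formula, take store 1 on 2015-07-31, whose competitor opened in September 2008: 12 * (2015 - 2008) + (7 - 9) = 84 - 2 = 82 months. The helper below is a hypothetical scalar version of the two vectorized lines above, written only to make the arithmetic and the zero guard explicit.

```python
def months_competition_open(year, month, since_year, since_month):
    """Months between the competitor's opening and the given year/month; 0 if unknown."""
    if since_year == 0:  # opening date was missing and had been filled with 0
        return 0
    return 12 * (year - since_year) + (month - since_month)

print(months_competition_open(2015, 7, 2008, 9))  # 82
print(months_competition_open(2015, 7, 0, 0))     # 0
print(12 * (2015 - 0) + (7 - 0))                  # 24187, the value without the guard
```

The last line shows why the guard matters: without it, rows with an unknown opening date would get a meaningless value of more than 24,000 months.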

The same processing is applied to WeeksPromoOpen.

# Number of weeks that promotion has existed for (52 weeks per year)
merged['WeeksPromoOpen'] = 52 * (merged['Year'] - merged['Promo2SinceYear']) + (merged['Week'] - merged['Promo2SinceWeek'])
merged.loc[merged['Promo2SinceYear'] == 0, 'WeeksPromoOpen'] = 0

merged.dtypes

Store                                 int64
DayOfWeek                             int64
Date                         datetime64[ns]
Sales                                 int64
Customers                             int64
Open                                  int64
Promo                                 int64
StateHoliday                           int8
SchoolHoliday                         int64
SalesPerCustomer                    float64
AvgSales                            float64
AvgCustomers                        float64
AvgSalesPerCustomer                 float64
StoreType                              int8
Assortment                             int8
CompetitionDistance                 float64
CompetitionOpenSinceMonth           float64
CompetitionOpenSinceYear            float64
Promo2                                int64
Promo2SinceWeek                     float64
Promo2SinceYear                     float64
PromoInterval                        object
Year                                  int64
Month                                 int64
Day                                   int64
Week                                  int64
MonthsCompetitionOpen               float64
WeeksPromoOpen                      float64
dtype: object

These columns are converted to int.

toInt = [
        'CompetitionOpenSinceMonth',
        'CompetitionOpenSinceYear',
        'Promo2SinceWeek', 
        'Promo2SinceYear', 
        'MonthsCompetitionOpen', 
        'WeeksPromoOpen']

merged[toInt] = merged[toInt].astype(int)

Here we simply compute the median of Sales, Customers, and SalesPerCustomer for each store and merge the values into the store and merged DataFrames.

med_store = train.groupby('Store')[['Sales', 'Customers', 'SalesPerCustomer']].median()
med_store.rename(columns=lambda x: 'Med' + x, inplace=True)

store = pd.merge(med_store.reset_index(), store, on='Store')

store.head()

	Store	MedSales	MedCustomers	MedSalesPerCustomer	AvgSales	AvgCustomers	AvgSalesPerCustomer	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	4647.0	550.0	8.362376	4759.096031	564.049936	8.393038	2	0	1270.0	9.0	2008.0	0	NaN	NaN	NaN
1	2	4783.0	575.5	8.313092	4953.900510	583.998724	8.408443	0	0	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct
2	3	6619.0	744.0	9.123440	6942.568678	750.077022	9.117599	0	0	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct
3	4	9430.5	1301.5	7.215175	9638.401786	1321.752551	7.249827	2	2	620.0	9.0	2009.0	0	NaN	NaN	NaN
4	5	4616.0	564.0	8.584677	4676.274711	537.340180	8.611229	0	0	29910.0	4.0	2015.0	0	NaN	NaN	NaN

merged = pd.merge(med_store.reset_index(), merged, on='Store')
merged.head()

	Store	MedSales	MedCustomers	MedSalesPerCustomer	DayOfWeek	Date	Sales	Customers	Open	Promo	...	Year	Month	Day	Week	MonthsCompetitionOpen
0	1	4647.0	550.0	8.362376	5	2015-07-31	5263	555	1	1	...	2015	7	31	31	82.0
1	1	4647.0	550.0	8.362376	4	2015-07-30	5020	546	1	1	...	2015	7	30	31	82.0
2	1	4647.0	550.0	8.362376	3	2015-07-29	4782	523	1	1	...	2015	7	29	31	82.0
3	1	4647.0	550.0	8.362376	2	2015-07-28	5011	560	1	1	...	2015	7	28	31	82.0
4	1	4647.0	550.0	8.362376	1	2015-07-27	6102	612	1	1	...	2015	7	27	31	82.0

5 rows × 31 columns

merged.columns

Index(['Store', 'MedSales', 'MedCustomers', 'MedSalesPerCustomer', 'DayOfWeek',
       'Date', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday',
       'SchoolHoliday', 'SalesPerCustomer', 'AvgSales', 'AvgCustomers',
       'AvgSalesPerCustomer', 'StoreType', 'Assortment', 'CompetitionDistance',
       'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2',
       'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'Year', 'Month',
       'Day', 'Week', 'MonthsCompetitionOpen', 'WeeksPromoOpen'],
      dtype='object')

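Drawing histograms with matplotlib

Looking at the histograms, AvgCustomers does not follow a normal distribution.

AvgSalesPerCustomer is close to a Gaussian (normal) distribution.

MedSales and MedCustomers do not follow a normal distribution either.

MedSalesPerCustomer is also close to a normal distribution.

The value to predict is Sales, and the histogram shows that Sales does not follow a Gaussian distribution either. The Sales values therefore need to be transformed so that they are closer to Gaussian, because approximately Gaussian values are easier to work with in machine learning. For example, linear regression models generally assume normally distributed errors.

So, in the modeling section, Sales is log-transformed to bring it closer to a normal distribution.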

merged.hist(figsize=(20,20))
plt.show()

[Figure: histograms of the numeric columns of merged (output_47_0.png)]
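The effect of a log transform on a right-skewed variable can be illustrated with synthetic data. The sketch below assumes log-normally distributed values as a stand-in for Sales (this is an assumption for the example, not a property verified on the Rossmann data) and compares the sample skewness before and after np.log. The skewness helper is written here by hand so that only NumPy is needed.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed "sales" (assumed log-normal for illustration)
rng = np.random.default_rng(0)
sales = rng.lognormal(mean=8.5, sigma=0.5, size=10_000)

log_sales = np.log(sales)
print(skewness(sales), skewness(log_sales))  # clearly positive vs. close to 0

# Predictions made on the log scale are mapped back with np.exp
restored = np.exp(log_sales)
print(np.allclose(restored, sales))  # True
```

np.log requires strictly positive values, which is one reason the rows with Sales equal to 0 have to be removed before the transform is applied.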

# X is the list of feature columns defined in the Model Building and Evaluation section below
merged[X].head()

	Store	Customers	CompetitionDistance	Promo	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	StateHoliday	...	AvgCustomers	AvgSalesPerCustomer	MedSales	MedCustomers	MedSalesPerCustomer	DayOfWeek	Week	Day	Month	Year
0	1	555	1270.0	1	9.0	2008.0	1	...	564.049936	8.393038	4647.0	550.0	8.362376	5	31	31	7	2015
1	1	546	1270.0	1	9.0	2008.0	1	...	564.049936	8.393038	4647.0	550.0	8.362376	4	31	30	7	2015
2	1	523	1270.0	1	9.0	2008.0	1	...	564.049936	8.393038	4647.0	550.0	8.362376	3	31	29	7	2015
3	1	560	1270.0	1	9.0	2008.0	1	...	564.049936	8.393038	4647.0	550.0	8.362376	2	31	28	7	2015
4	1	612	1270.0	1	9.0	2008.0	1	...	564.049936	8.393038	4647.0	550.0	8.362376	1	31	27	7	2015

5 rows × 23 columns

Model Building and Evaluation

Main part

# 'Store', 'MedSales', 'MedCustomers', 'MedSalesPerCustomer', 'DayOfWeek',
#        'Date', 'Sales', 'Customers', 'Open', 'Promo', 'StateHoliday',
#        'SchoolHoliday', 'SalesPerCustomer', 'AvgSales', 'AvgCustomers',
#        'AvgSalesPerCustomer', 'StoreType', 'Assortment', 'CompetitionDistance',
#        'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2',
#        'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'Year', 'Month',
#        'Day', 'Week', 'MonthsCompetitionOpen', 'WeeksPromoOpen'],

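When making predictions, the more features there are that are strongly related to the target, the more accurate the prediction becomes, so it helps to preprocess the data and add as many relevant features as possible.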

from sklearn.model_selection import train_test_split
X = [
    'Store', 
    'Customers',
    'CompetitionDistance', 

    'Promo', 
    'Promo2', 

    'CompetitionOpenSinceMonth',
    'CompetitionOpenSinceYear',
    'Promo2SinceWeek',
    'Promo2SinceYear',

    
    'StateHoliday',
    'StoreType',
    'Assortment',

    'AvgSales',
    'AvgCustomers',
    'AvgSalesPerCustomer',
    
    'MedSales',
    'MedCustomers',
    'MedSalesPerCustomer',

    'DayOfWeek',
    'Week',
    'Day',
    'Month',
    'Year',

]
X_data = merged[X]
Y_data = np.log(merged['Sales'])
X_train, X_test, y_train, y_test = train_test_split(X_data, Y_data, test_size=0.20, random_state=10)

from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer,mean_squared_error



def plot_importance(model):
    k = list(zip(X, model.feature_importances_))
    k.sort(key=lambda tup: tup[1])

    labels, vals = zip(*k)
    
    plt.barh(np.arange(len(X)), vals, align='center')
    plt.yticks(np.arange(len(X)), labels)

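We use XGBoost.

neg_mean_squared_error is used to score the predictions. The lower the mean squared error, the better the model fits the data; scikit-learn negates it so that a higher score is always better.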

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

param ={
            'n_estimators': [100,500, 1000,1500],
            'max_depth':[2,4,6,8]
        }

xgboost_tree = xgb.XGBRegressor(
    learning_rate = 0.1,       # called eta in the native XGBoost API
    min_child_weight = 2,
    subsample = 0.8,
    colsample_bytree = 0.8,
    tree_method = 'exact',
    reg_alpha = 0.05,
    verbosity = 1,             # replaces the deprecated silent flag
    random_state = 1023
)

grid = GridSearchCV(estimator=xgboost_tree,param_grid=param,cv=5,  verbose=1, n_jobs=-1,scoring='neg_mean_squared_error')
   
    

    
grid_result = grid.fit(X_train, y_train)
best_params = grid_result.best_params_

print('Best Params :',best_params)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed: 26.8min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed: 186.8min finished
C:\Users\user\Anaconda3\lib\site-packages\xgboost\core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
  if getattr(data, 'base', None) is not None and \


[06:06:54] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Best Params : {'max_depth': 8, 'n_estimators': 1500}
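Since the full search above took about three hours, it can help to see the same GridSearchCV pattern on something that runs in a second. The sketch below is an illustration under assumptions: it swaps XGBRegressor for scikit-learn's DecisionTreeRegressor and uses synthetic data, so the numbers it prints say nothing about the Rossmann model.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for the example)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0.0, 6.0, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=300)

toy_grid = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': [1, 2, 4, 8]},  # 4 candidates x 5 folds = 20 fits
    cv=5,
    scoring='neg_mean_squared_error',        # higher (closer to 0) is better
)
toy_grid.fit(X_toy, y_toy)

print(toy_grid.best_params_)   # the max_depth with the best mean CV score
print(-toy_grid.best_score_)   # flip the sign to read it as a mean squared error
```

The number of fits is the number of parameter combinations times the number of folds, which is how the 4 x 4 grid with cv=5 above arrives at 80 fits.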

from math import sqrt

pred = grid_result.predict(X_test)
print('Root Mean squared error {}'.format(sqrt(mean_squared_error(np.exp(y_test), np.exp(pred)))))

Root Mean squared error 351.13062643133986
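Because the model is trained on log(Sales), the cross-validation score is a (negated) mean squared error on the log scale, while the RMSE printed above is computed after mapping both the targets and the predictions back with np.exp. The sketch below reproduces both calculations with NumPy on three hypothetical prediction pairs; the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical true and predicted sales, expressed on the log scale as in training
y_true_log = np.log(np.array([5263.0, 6064.0, 8314.0]))
y_pred_log = np.log(np.array([5000.0, 6500.0, 8000.0]))

# What scoring='neg_mean_squared_error' measures during the grid search
mse_log = np.mean((y_true_log - y_pred_log) ** 2)
neg_mse_log = -mse_log  # negated so that a larger score means a better model

# RMSE in the original sales units, after undoing the log transform
rmse_sales = np.sqrt(np.mean((np.exp(y_true_log) - np.exp(y_pred_log)) ** 2))
print(neg_mse_log, rmse_sales)  # rmse_sales is about 345.38
```

The errors here are 263, -436, and 314, so the RMSE is the square root of (263^2 + 436^2 + 314^2) / 3, which is about 345.38 in the same units as Sales.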

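# sklearn.metrics.SCORERS was removed in scikit-learn 1.3; get_scorer_names() (scikit-learn 1.0+) lists the available scoring names
from sklearn.metrics import get_scorer_names
sorted(get_scorer_names())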

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'brier_score_loss',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'max_error',
 'mutual_info_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'v_measure_score']


Conclusion

We were able to reduce the error and get quite good results.
What's next?
- try other algorithms
  - CatBoost, GBM, neural network
- try finer feature engineering
