FILE INPUTS AND OUTPUTS CHEATSHEET
--------------------------------------------------------------
SCRIPT: EDA.py
Exploratory Data Analysis of Kaggle's Predict Future Sales Dataset.
This script produces a PDF file with different types of plots of data.
FILES IN:
"sales_train.csv"
"items.csv"
"item_categories.csv"
"shops.csv"
FILES OUT:
'EDA.pdf'
-----------------------------------------------------------------------------------------------------------------------------------------------
SCRIPT: preprocessing.py
Preprocesses data for the Future Sales Predictions
FILES IN:
"sales_train.csv",
"items.csv",
"item_categories.csv",
"test.csv",
"shops.csv"
FILES OUT:
'items_preprocessed.pickle',
'shops_preprocessed.pickle',
'categories_preprocessed.pickle',
'sales_train_preprocessed.pickle',
'test_preprocessed.pickle'
#############################
# DETAILS ABOUT OUTPUTS #
#############################
SALES TRAINSET 'sales_train_preprocessed.pickle',
Int64Index: 2916896 entries, 0 to 2935844
Data columns (total 6 columns):
# Column Dtype
--- ------ -----
0 date object
1 date_block_num int64
2 shop_id int64
3 item_id int64
4 item_price float64
5 item_cnt_day float64
ITEMS 'items_preprocessed.pickle',
Int64Index: 22166 entries, 0 to 22169
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 item_name 22166 non-null object
1 item_id 22166 non-null int64
2 item_category_id 22166 non-null int64
SHOPS 'shops_preprocessed.pickle',
Int64Index: 53 entries, 2 to 59
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 shop_id 53 non-null int64
1 shop_name 53 non-null object
CATEGORIES 'categories_preprocessed.pickle',
Int64Index: 80 entries, 0 to 83
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 item_category_id 80 non-null int64
1 item_category_name 80 non-null object
TEST 'test_preprocessed.pickle'
RangeIndex: 214200 entries, 0 to 214199
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 214200 non-null int64
1 shop_id 214200 non-null int64
2 item_id 214200 non-null int64
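The pickled outputs can be read back and checked against the listings above before use. A minimal sketch, assuming the files are written with pandas to_pickle; the three-row frame and the temporary path are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real test set (3 rows instead of 214200).
test = pd.DataFrame({
    "ID": [0, 1, 2],
    "shop_id": [5, 5, 6],
    "item_id": [5037, 5320, 5233],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_preprocessed.pickle")
    test.to_pickle(path)           # what a preprocessing step might write
    loaded = pd.read_pickle(path)  # what a downstream script would read

# Compare against the documented schema before using the data.
expected_columns = ["ID", "shop_id", "item_id"]
schema_ok = list(loaded.columns) == expected_columns
print(schema_ok, loaded.dtypes.to_dict())
```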
-----------------------------------------------------------------------------------------------------------------------------------------------
SCRIPT: nlp.py
NLP data for the Future Sales Predictions
FILES IN:
"items_translated.csv",
"item_categories.csv",
"shops.csv"
FILES OUT:
'tenNN_items.pickle'
'items_nlp.pickle'
'items_clusters.pickle'
'shops_nlp.pickle'
'threeNN_shops.pickle'
-----------------------------------------------------------------------------------------------------------------------------------------------
SCRIPT: feature_eng_0.py
Feature Engineering Part 0 (Basics): adds basic categorical features.
Examples: "shop_category_id", "shop_location_id", "item_broader_category_name_id",
"item_name_p2", "item_name_p3"
FILES IN:
'items_preprocessed.pickle'
'shops_preprocessed.pickle',
'categories_preprocessed.pickle'
FILES OUT:
'shops_feature_eng_0.pickle',
'categories_feature_eng_0.pickle',
'items_feature_eng_0.pickle'
#############################
# DETAILS ABOUT OUTPUTS #
#############################
CATEGORIES - 'categories_feature_eng_0.pickle',
Int64Index: 80 entries, 0 to 83
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 item_category_id 80 non-null int64
1 item_broader_category_id 80 non-null int64 (formerly subtype_code)
2 item_general_category_id 80 non-null int64 (formerly type_code)
ITEMS 'items_feature_eng_0.pickle'
Int64Index: 22166 entries, 0 to 22169
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 item_id 22166 non-null int64
1 item_category_id 22166 non-null int64
2 item_name_p2_id 22166 non-null int64
3 item_name_p3_id 22166 non-null int64
SHOPS 'shops_feature_eng_0.pickle',
Int64Index: 53 entries, 2 to 59
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 shop_id 53 non-null int64
1 shop_category_id 53 non-null int64
2 shop_location_id 53 non-null int64
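An integer feature such as "shop_location_id" can be derived by encoding a text column with pandas.factorize. A minimal sketch, assuming the location is the first word of the shop name; the shop names below are made up, and the actual script may derive the IDs differently.

```python
import pandas as pd

# Hypothetical shops; the real names come from 'shops_preprocessed.pickle'.
shops = pd.DataFrame({
    "shop_id": [2, 3, 4, 5],
    "shop_name": ["Moscow Mall A", "Kazan Center", "Moscow Mall B", "Online Store"],
})

# Assumption: the first word of the shop name is the location.
location = shops["shop_name"].str.split(" ").str[0]

# factorize assigns 0, 1, 2, ... in order of first appearance.
shops["shop_location_id"], location_names = pd.factorize(location)
print(shops[["shop_id", "shop_location_id"]])
```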
-----------------------------------------------------------------------------------------------------------------------------------------------
SCRIPT: feature_eng_1.py
!! IMPORTANT STEP !! Puts the training data into an expanded dataframe.
For each time block we identify all items that were sold in that month and
every shop that registered a sale in that time block,
and create a row for every combination of item and shop.
This script also adds time block 34 to the test and train data and
calculates monthly sales for each item, shop and time block. This feature is used later
to create many of the lag features.
Example:
If in Jan2013 item1 was sold in shopA,
item2 was sold in shopB,
item1 was sold in shopC
Then we get:
jan2013 item1 shopA
jan2013 item1 shopB
jan2013 item1 shopC
jan2013 item2 shopA
jan2013 item2 shopB
jan2013 item2 shopC
Cartesian product of the existing options in a given month.
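The expansion in the example above can be sketched with itertools.product. This is an illustration in plain Python using the example's own labels, not the script's actual implementation.

```python
from itertools import product

# Sales from the example above: (month, item, shop).
sales = [
    ("jan2013", "item1", "shopA"),
    ("jan2013", "item2", "shopB"),
    ("jan2013", "item1", "shopC"),
]

rows = []
for month in sorted({m for m, _, _ in sales}):
    # Items and shops that registered at least one sale in this month.
    items = sorted({i for m, i, _ in sales if m == month})
    shops = sorted({s for m, _, s in sales if m == month})
    # One row per (item, shop) combination, whether or not that pair had a sale.
    rows.extend((month, item, shop) for item, shop in product(items, shops))

for row in rows:
    print(row)
```

Two items and three shops give six rows for the month, matching the listing above.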
FILES IN:
'items_preprocessed.pickle'
'categories_preprocessed.pickle'
'shops_preprocessed.pickle'
'sales_train_preprocessed.pickle'
'test_preprocessed.pickle'
FILES OUT:
'main_data_feature_eng_1.pickle'
#############################
# DETAILS ABOUT OUTPUTS #
#############################
MAIN TRAINING DATA 'main_data_feature_eng_1.pickle'
Int64Index: 10925650 entries, 0 to 10925649
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 date_block_num int16
1 shop_id int64
2 item_id int64
3 item_cnt_month float16
4 item_category_id float64
dtypes: float16(1), float64(1), int16(1), int64(2)
memory usage: 375.1 MB
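The "item_cnt_month" column could be produced by summing daily counts per time block, shop and item, then joining the result onto the expanded grid. A sketch with made-up rows; filling missing combinations with 0 and the exact downcasting steps are assumptions, not confirmed details of the script.

```python
import pandas as pd

# Hypothetical daily sales (same columns as 'sales_train_preprocessed.pickle').
sales = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [2, 2, 3, 2],
    "item_id": [10, 10, 11, 10],
    "item_cnt_day": [1.0, 2.0, 1.0, 4.0],
})

# Hypothetical expanded grid: every (shop, item) pair per time block.
grid = pd.DataFrame({
    "date_block_num": [0, 0, 0, 0, 1],
    "shop_id": [2, 2, 3, 3, 2],
    "item_id": [10, 11, 10, 11, 10],
})

keys = ["date_block_num", "shop_id", "item_id"]
monthly = (
    sales.groupby(keys, as_index=False)["item_cnt_day"].sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

main = grid.merge(monthly, on=keys, how="left")
# Assumption: pairs with no recorded sale count as 0 sold.
main["item_cnt_month"] = main["item_cnt_month"].fillna(0).astype("float16")
# Smaller dtypes reduce memory use on the ~10.9M-row frame.
main["date_block_num"] = main["date_block_num"].astype("int16")
print(main)
```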
-----------------------------------------------------------------------------------------------------------------------------------------------
SCRIPT: feature_eng_2.py
Creates 5 basic lag features for the data:
LAG ON ITEM COUNT "item_cnt_month"
LAG ON ITEMS AVERAGE COUNT "date_avg_item_cnt"
LAG ON Specific ITEM AVERAGE COUNT "date_item_avg_item_cnt"
LAG on sales of any items per SHOP "date_shop_avg_item_cnt"
LAG on sales of SPEC items per SHOP "date_shop_item_avg_item_cnt"
FILES IN:
'main_data_feature_eng_1.pickle'
FILES OUT:
'main_data_feature_eng_2.pickle'
#############################
# DETAILS ABOUT OUTPUTS #
#############################
Int64Index: 10925650 entries, 0 to 10925649
Data columns (total 45 columns):
# Column Dtype
--- ------ -----
0 date_block_num int16
1 shop_id int64
2 item_id int64
3 item_cnt_month float16
4 item_cnt_month_lag_1 float16
5 item_cnt_month_lag_2 float16
6 item_cnt_month_lag_3 float16
7 item_cnt_month_lag_4 float16
8 item_cnt_month_lag_5 float16
9 item_cnt_month_lag_6 float16
10 item_cnt_month_lag_7 float16
11 item_cnt_month_lag_8 float16
12 item_cnt_month_lag_9 float16
13 item_cnt_month_lag_10 float16
14 date_avg_item_cnt_lag_1 float16
15 date_item_avg_item_cnt_lag_1 float16
16 date_item_avg_item_cnt_lag_2 float16
17 date_item_avg_item_cnt_lag_3 float16
18 date_item_avg_item_cnt_lag_4 float16
19 date_item_avg_item_cnt_lag_5 float16
20 date_item_avg_item_cnt_lag_6 float16
21 date_item_avg_item_cnt_lag_7 float16
22 date_item_avg_item_cnt_lag_8 float16
23 date_item_avg_item_cnt_lag_9 float16
24 date_item_avg_item_cnt_lag_10 float16
25 date_shop_avg_item_cnt_lag_1 float16
26 date_shop_avg_item_cnt_lag_2 float16
27 date_shop_avg_item_cnt_lag_3 float16
28 date_shop_avg_item_cnt_lag_4 float16
29 date_shop_avg_item_cnt_lag_5 float16
30 date_shop_avg_item_cnt_lag_6 float16
31 date_shop_avg_item_cnt_lag_7 float16
32 date_shop_avg_item_cnt_lag_8 float16
33 date_shop_avg_item_cnt_lag_9 float16
34 date_shop_avg_item_cnt_lag_10 float16
35 date_shop_item_avg_item_cnt_lag_1 float16
36 date_shop_item_avg_item_cnt_lag_2 float16
37 date_shop_item_avg_item_cnt_lag_3 float16
38 date_shop_item_avg_item_cnt_lag_4 float16
39 date_shop_item_avg_item_cnt_lag_5 float16
40 date_shop_item_avg_item_cnt_lag_6 float16
41 date_shop_item_avg_item_cnt_lag_7 float16
42 date_shop_item_avg_item_cnt_lag_8 float16
43 date_shop_item_avg_item_cnt_lag_9 float16
44 date_shop_item_avg_item_cnt_lag_10 float16
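Lag columns such as "item_cnt_month_lag_1" can be built by shifting the time block forward and merging the shifted copy back onto the data. A minimal sketch on made-up rows; the helper name add_lag is hypothetical and the actual script may work differently.

```python
import pandas as pd

# Hypothetical slice of the main data: one shop/item pair over three months.
main = pd.DataFrame({
    "date_block_num": [0, 1, 2],
    "shop_id": [2, 2, 2],
    "item_id": [10, 10, 10],
    "item_cnt_month": [3.0, 0.0, 4.0],
})


def add_lag(df, lags, col):
    """Add <col>_lag_<N> columns holding the value of col from N months earlier."""
    keys = ["date_block_num", "shop_id", "item_id"]
    for lag in lags:
        shifted = df[keys + [col]].copy()
        # A value observed in month t becomes visible in month t + lag.
        shifted["date_block_num"] += lag
        shifted = shifted.rename(columns={col: f"{col}_lag_{lag}"})
        df = df.merge(shifted, on=keys, how="left")
    return df


main = add_lag(main, [1, 2], "item_cnt_month")
print(main)
```

Rows with no earlier month to look back on get NaN in the lag column.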
-----------------------------------------------------------------------------------------------------------------------------------------------
SCRIPT: feature_eng_3.py
Creates 3 lag features focused on broader category and shop location:
LAG on sales of each broader category per SHOP 'date_shop_subtype_avg_item_cnt'
LAG on sales by month per SHOP location (city) 'date_city_avg_item_cnt'
LAG on sales of each item by month per SHOP location (city) 'date_item_city_avg_item_cnt'
Adds 2 first sale focused features
"item_shop_first_sale"
"item_first_sale"
FILES IN:
'main_data_feature_eng_1.pickle'
'categories_preprocessed_0.pickle'
'shops_preprocessed_0.pickle'
'items_preprocessed_0.pickle'
FILES OUT:
'main_data_feature_eng_3.pickle'
#############################
# DETAILS ABOUT OUTPUTS #
#############################
MAIN TRAIN DATA 'main_data_feature_eng_3.pickle'
Int64Index: 10925650 entries, 0 to 10925649
Data columns (total 22 columns):
# Column Dtype
--- ------ -----
0 date_block_num int16
1 shop_id int64
2 item_id int64
3 item_cnt_month float16
4 item_category_id float64
5 item_broader_category_id float64
6 item_general_category_id float64
7 shop_category_id int64
8 shop_location_id int64
9 date_shop_broad_avg_item_cnt_lag_1 float16
10 date_city_avg_item_cnt_lag_1 float16
11 date_item_city_avg_item_cnt_lag_1 float16
12 month_shopcat_item_cnt_mean_lag_1 float16
13 month_shopcat_cnt_mean_lag_1 float16
14 month_shop_cat_cnt_mean_lag_1 float16
15 month_shop_shopcat_cnt_mean_lag_1 float16
16 month_cat_cnt_mean_lag_1 float16
17 month_broad_itemcat_cnt_mean_lag_1 float16
18 month_type_cnt_mean_lag_1 float16
19 month int16
20 item_shop_first_sale int16
21 item_first_sale int16
dtypes: float16(11), float64(3), int16(4), int64(4)
memory usage: 979.4 MB
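A group-average lag such as "date_city_avg_item_cnt_lag_1" can be sketched in two steps: average the monthly count per group, then shift the result by one month before merging. The rows below are made up, and the grouping keys are an assumption based on the column names.

```python
import pandas as pd

# Hypothetical rows: two shops in location 0, one shop in location 1.
main = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1, 1, 1],
    "shop_id": [2, 3, 4, 2, 3, 4],
    "shop_location_id": [0, 0, 1, 0, 0, 1],
    "item_cnt_month": [2.0, 4.0, 1.0, 0.0, 6.0, 5.0],
})

group_keys = ["date_block_num", "shop_location_id"]

# Step 1: average monthly count per (month, location).
avg = (
    main.groupby(group_keys, as_index=False)["item_cnt_month"].mean()
    .rename(columns={"item_cnt_month": "date_city_avg_item_cnt_lag_1"})
)

# Step 2: shift by one month so each row only sees the previous month's average.
avg["date_block_num"] += 1
main = main.merge(avg, on=group_keys, how="left")
print(main)
```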
-----------------------------------------------------------------------------------------------------------------------------------------------
SCRIPT: feature_eng_4.py
Adds holiday and sales-history features to the data (names taken from the output columns below):
Holiday counts per month "holidays_cnt_month", "holidays_cnt_month_fut_1" to "holidays_cnt_month_fut_3"
Shop opening features "shop_months_since_opening", "shop_opening"
Item release features "item_month_id_of_release", "item_months_since_release", "item_new"
First and last sale of an item in a shop "item_months_since_first_sale_in_shop", "item_months_since_last_sale_in_shop"
Also includes the 2 first sale focused features
"item_shop_first_sale"
"item_first_sale"
FILES IN:
'main_data_feature_eng_1.pickle'
'holidays.csv'
FILES OUT:
'main_data_feature_eng_4.pickle'
#############################
# DETAILS ABOUT OUTPUTS #
#############################
Int64Index: 10925650 entries, 0 to 10925649
Data columns (total 27 columns):
# Column Dtype
--- ------ -----
0 date_block_num int16
1 shop_id int64
2 item_id int64
3 item_cnt_month float16
4 item_category_id float64
5 holidays_cnt_month float64
6 holidays_cnt_month_fut_1 float64
7 holidays_cnt_month_fut_2 float64
8 holidays_cnt_month_fut_3 float64
9 month int16
10 item_shop_first_sale int16
11 item_first_sale int16
12 shop_months_since_opening int16
13 shop_opening bool
14 item_month_id_of_release int16
15 item_month_of_release int16
16 item_months_since_release int16
17 item_new bool
18 item_month_id_of_first_sale_in_shop int16
19 item_month_of_first_sale_in_shop int16
20 item_months_since_first_sale_in_shop int16
21 item_never_sold_in_shop_before bool
22 item_never_sold_in_shop_before_but_not_new bool
23 item_seniority int8
24 item_sold_in_shop bool
25 item_month_id_of_last_sale_in_shop float64
26 item_months_since_last_sale_in_shop float64
dtypes: bool(5), float16(1), float64(7), int16(11), int64(2), int8(1)
memory usage: 1.1 GB
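Columns such as "holidays_cnt_month_fut_1" and "item_months_since_release" can be sketched as simple lookups on the time block. The layout of 'holidays.csv' is not described here, so the per-block holiday counts below are made up. Reading "fut_N" as "N months ahead" and taking the release month as the first block in which an item appears are both assumptions.

```python
import pandas as pd

# Hypothetical holiday counts per time block (not taken from 'holidays.csv').
holidays_per_block = pd.Series({0: 8, 1: 1, 2: 1, 3: 0})

main = pd.DataFrame({
    "date_block_num": [0, 0, 1, 2],
    "item_id": [10, 11, 10, 11],
})

main["holidays_cnt_month"] = main["date_block_num"].map(holidays_per_block)
# Assumption: "fut_N" is the holiday count N months after the current block.
for n in (1, 2):
    main[f"holidays_cnt_month_fut_{n}"] = (main["date_block_num"] + n).map(holidays_per_block)

# Assumption: an item's release month is the first block in which it appears.
release = main.groupby("item_id")["date_block_num"].transform("min")
main["item_months_since_release"] = main["date_block_num"] - release
print(main)
```

Blocks beyond the known holiday range map to NaN, which would be consistent with the float64 dtype of the holiday columns.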
-----------------------------------------------------------------------------------------------------------------------------------------------
SCRIPT: feature_eng_5.py
Creates 3 price and revenue focused lag features for the data:
LAG on item price change "delta_price_lag"
LAG on item price change per SHOP "delta_shop_price_lag"
LAG on revenue change "delta_revenue_lag_1"
FILES IN:
'main_data_feature_eng_1.pickle'
'categories_preprocessed_0.pickle'
'shops_preprocessed_0.pickle'
'items_preprocessed_0.pickle'
FILES OUT:
'main_data_feature_eng_5.pickle'
#############################
# DETAILS ABOUT OUTPUTS #
#############################
Int64Index: 10925650 entries, 0 to 10925649
Data columns (total 8 columns):
# Column Dtype
--- ------ -----
0 date_block_num int16
1 shop_id int64
2 item_id int64
3 item_cnt_month float16
4 item_category_id float64
5 delta_price_lag float64
6 delta_shop_price_lag float64
7 delta_revenue_lag_1 float64
dtypes: float16(1), float64(4), int16(1), int64(2)
memory usage: 625.2 MB
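A column like "delta_price_lag" could express how far an item's price in the previous month was from its average price. The sketch below works under that assumption only; the definition and the data are hypothetical, and the actual script may compute something different.

```python
import pandas as pd

# Hypothetical monthly average prices for one item.
prices = pd.DataFrame({
    "date_block_num": [0, 1, 2],
    "item_id": [10, 10, 10],
    "item_price": [100.0, 80.0, 120.0],
})

# Average price per item over all months (here: (100 + 80 + 120) / 3 = 100).
overall = prices.groupby("item_id")["item_price"].transform("mean")

# Previous month's price for the same item; rows must be sorted by month for shift.
prices = prices.sort_values(["item_id", "date_block_num"])
prev_price = prices.groupby("item_id")["item_price"].shift(1)

# Relative deviation of last month's price from the item's average price.
prices["delta_price_lag"] = (prev_price - overall) / overall
print(prices)
```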
-----------------------------------------------------------------------------------------------------------------------------------------------