This folder contains all the classification methods we used. To reproduce our best results, follow the steps below.
Note that your results may not match ours exactly, because of differences between systems and randomness in the models.
- Install Anaconda and a multithreaded build of xgboost (local setup). (We found that the single-threaded build gave lower scores with the same model.)
- xgboost cannot be installed through anaconda or pip, so it has to be built manually on your local drive by following the instructions in its docs (note that the documentation is slightly tricky):
https://github.com/dmlc/xgboost/blob/master/doc/build.md
- Create a 'Data' folder and store the unzipped data files in it. Also create a 'pickle' folder inside the 'Data' folder.
- Run code for label encoding:
python preprocessing/preprocessing_label_encoding.py --data_directory <file-path-to-Data-directory>
Note: Before running the above code, download the data from Kaggle and store it in a directory called 'Data'. Extract the files and do not change the names.
- Run DimensionalityReduction_with17304_removal.ipynb
- Run PreprocessingInterpolation.ipynb
- Run Preprocessing_dateFeatures.ipynb
- Run group_outcome_change.ipynb
- Run Preprocessing_merging.ipynb. Note: Preprocessing_merging.ipynb requires you to check that all categorical variables are one-hot-encodable, i.e. that there is no inconsistency between test and train in the number of unique values of any one-hot-encoded (OHE) column. An additional row may be added to make them consistent. See the comment block in the file.
- Run xgboost.ipynb
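
The consistency check mentioned for Preprocessing_merging.ipynb can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code from the notebook. The function name category_mismatches, the column name char_1, and the sample rows are all hypothetical.

```python
def category_mismatches(train_rows, test_rows, columns):
    """Return, per column, the categories that appear in only one of the two sets.

    train_rows / test_rows: lists of dicts, one dict per data row.
    columns: names of the categorical columns to compare.
    """
    mismatches = {}
    for col in columns:
        train_values = {row[col] for row in train_rows}
        test_values = {row[col] for row in test_rows}
        only_train = train_values - test_values
        only_test = test_values - train_values
        # Record the column only if the two sets of categories differ.
        if only_train or only_test:
            mismatches[col] = {
                "only_in_train": sorted(only_train),
                "only_in_test": sorted(only_test),
            }
    return mismatches


# Hypothetical sample rows, for illustration only.
train = [{"char_1": "type 1"}, {"char_1": "type 2"}, {"char_1": "type 3"}]
test = [{"char_1": "type 1"}, {"char_1": "type 2"}]

print(category_mismatches(train, test, ["char_1"]))
```

An empty result means the listed columns have the same categories in train and test. Any column that is reported would need the kind of fix described in the notebook's comment block before one-hot encoding.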
Additional Note: Depending on the path to your data folder, you may need to change file paths in the ipynb files. These are always present at the beginning of each notebook.
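
As a sketch of how the path definitions at the top of a notebook might look, the snippet below keeps the data directory in one place and creates the 'pickle' subfolder if it is missing. The function name prepare_directories and the variable names DATA_DIR and PICKLE_DIR are assumptions for illustration and do not come from the notebooks.

```python
from pathlib import Path


def prepare_directories(data_dir):
    """Check that the data directory exists and create its 'pickle' subfolder."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Data directory not found: {data_dir}")
    pickle_dir = data_dir / "pickle"
    # exist_ok=True makes this safe to run repeatedly.
    pickle_dir.mkdir(exist_ok=True)
    return data_dir, pickle_dir


# Example use at the top of a notebook (the path is an assumption;
# change it to wherever your 'Data' folder lives):
# DATA_DIR, PICKLE_DIR = prepare_directories("Data")
```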