This is a data science project that performs binary classification on all US stocks. The goal is to compare a traditional machine learning approach with a deep learning approach for time series classification.
A stock ticker is labeled "1" (positive) if its price has "fallen to the ground", an intentionally loose notion, and "0" (negative) otherwise. Such low-priced stocks are very risky. But price patterns can be very diverse, so it is difficult to filter such stocks with any fixed rule. We manually labeled 414 positive tickers and 616 negative tickers, randomly chosen from a pool of 6000+ tickers, for a total of 1030 labeled tickers. Labels are stored in the label.json file.
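To give a sense of how the labels might be consumed, here is a minimal sketch, assuming label.json maps ticker symbols to 0/1 (the file's actual layout may differ):

```python
import json

# Load the manual labels. Assumed layout: {"AAPL": 0, "XYZ": 1, ...};
# check label.json itself for the real structure.
with open("label.json") as f:
    labels = json.load(f)

positives = [t for t, y in labels.items() if y == 1]
negatives = [t for t, y in labels.items() if y == 0]
print(f"{len(positives)} positive, {len(negatives)} negative tickers")
```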
Download the repository, create a new Python environment, and install the dependencies.
pip install -r requirements.txt
① Download historical day-level price data for all US stocks. The list of tickers (data/tickets.txt) comes from Nasdaq and may not be 100% complete or up to date.
cd data
python download.py
Each ticker history will be downloaded as a separate csv file in the data/csv folder. You can use the --num=1000 flag to download data for only a certain number of tickers at a time.
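As a quick sanity check after downloading, one of the per-ticker files can be inspected with pandas; the file name and column names below (Date, Close, etc.) are assumptions about the CSV layout:

```python
import pandas as pd

# Inspect one downloaded ticker history. Column names are assumed;
# open an actual file in data/csv to confirm them.
df = pd.read_csv("data/csv/AAPL.csv", parse_dates=["Date"])
print(df.head())
print(f"{len(df)} trading days from {df['Date'].min().date()} to {df['Date'].max().date()}")
```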
② Plot all the data.
python plot.py
The plots will be saved in the data/plots folder.
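To plot a single ticker outside of plot.py, a minimal matplotlib sketch could look like this (again assuming Date/Close columns and AAPL as an example file):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot the closing price of one ticker; assumes Date and Close columns exist.
df = pd.read_csv("data/csv/AAPL.csv", parse_dates=["Date"])
plt.figure(figsize=(10, 4))
plt.plot(df["Date"], df["Close"])
plt.title("AAPL closing price")
plt.xlabel("Date")
plt.ylabel("Close")
plt.tight_layout()
plt.savefig("data/plots/AAPL.png")  # assumes data/plots already exists
```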
① Extract features from the data
cd ml
python extract_features.py
The output will be two csv tables in the ml folder, one for labeled tickers and one for unlabeled tickers.
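The actual feature definitions live in extract_features.py; purely as an illustration, features that summarize a "fallen to the ground" shape might look like the hypothetical ones below (not the repository's real feature set):

```python
import pandas as pd

def example_features(df: pd.DataFrame) -> dict:
    """Toy features computed from one ticker's price history.
    Illustrative only; see extract_features.py for the real features."""
    close = df["Close"]
    return {
        "last_over_max": close.iloc[-1] / close.max(),        # how far below the all-time high
        "last_over_mean": close.iloc[-1] / close.mean(),      # how far below the historical average
        "recent_volatility": close.pct_change().tail(60).std(),  # volatility over the last ~60 sessions
    }
```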
② Fit a model
python fit.py --model=lr
Available models:
--model=lr for logistic regression (default)
--model=tree for decision trees
--model=boost for gradient boosting
You should be able to get around 95% test accuracy. The model file will be saved as model.pkl in the same folder.
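fit.py presumably wraps scikit-learn estimators; a minimal sketch of the lr/tree/boost choice, with hypothetical file and column names, could look like:

```python
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical file/column names; the real table comes from extract_features.py.
data = pd.read_csv("labeled_features.csv")
X, y = data.drop(columns=["ticker", "label"]), data["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "lr": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5),
    "boost": GradientBoostingClassifier(),
}
model = models["lr"]  # corresponds to --model=lr
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```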
③ Classify the 5000+ unlabeled tickers
python pred.py
Predictions will be saved to a prediction.json file.
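A prediction step in this style typically just reloads the pickled model and writes out a ticker-to-label mapping; a sketch with assumed file and column names:

```python
import json
import pickle
import pandas as pd

# Assumed names: unlabeled_features.csv with a "ticker" column plus feature columns.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

data = pd.read_csv("unlabeled_features.csv")
X = data.drop(columns=["ticker"])
preds = model.predict(X)

with open("prediction.json", "w") as f:
    json.dump(dict(zip(data["ticker"], map(int, preds))), f, indent=2)
```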
① Prepare the data for training with
cd cnn
python prepare.py
The output is two csv files, "training_data.csv" and "unlabeled_data.csv", in the cnn folder.
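prepare.py has to turn variable-length price histories into fixed-size network inputs; one plausible scheme (an assumption, not necessarily what the script does) is to resample each series to a fixed length and normalize it:

```python
import numpy as np
import pandas as pd

def to_fixed_length(close: pd.Series, length: int = 256) -> np.ndarray:
    """Resample one price series to a fixed length and scale it to [0, 1].
    This is an assumed preprocessing scheme, not necessarily what prepare.py does."""
    values = close.to_numpy(dtype=float)
    # Linear interpolation onto a fixed grid so every ticker gets the same length.
    grid = np.linspace(0, len(values) - 1, length)
    resampled = np.interp(grid, np.arange(len(values)), values)
    # Min-max normalize so the network sees the shape, not the absolute price level.
    span = resampled.max() - resampled.min()
    return (resampled - resampled.min()) / span if span > 0 else np.zeros(length)
```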
② Train the model with
python train.py
wandb is used for logging under the project name "stock-cls", so before training please create an account and log in. Otherwise, you can comment out the logger variable in train.py. After training, model weights will be saved to the cnn/stock-cls/[some-name]/checkpoints/ folder.
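The checkpoint path suggests a PyTorch Lightning setup; if so, the wandb logger is typically wired in roughly like this (the model and data module are placeholders for whatever train.py actually defines):

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# Assumes train.py is built on PyTorch Lightning; adjust to the real training script.
logger = WandbLogger(project="stock-cls")   # comment this out to skip wandb logging
trainer = pl.Trainer(max_epochs=50, logger=logger)  # logger=False disables logging entirely
# trainer.fit(model, datamodule=dm)  # model / dm come from train.py
```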
③ Classify the unlabeled tickers with
python pred.py
The output will be a prediction.json file in the cnn folder.
Point 0. Defining the problem is often hard.
If you have a precise definition of your problem, then you have already solved it. There are many known and unknown variations that we consider to belong to the same category, which is why we label data instead of writing fixed rules. If you are a client-facing consultant, you will find that clients often don't know what they want until you show them your work. In such situations, it is important to encourage clients to clarify their needs early on.
Point 1. No single feature can perfectly distinguish the classes.
Otherwise, that single feature could be used as the classifier on its own. For consistency, it is recommended that all features share the same scale, for example a fixed range like [0, 1].
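For instance, min-max scaling puts features with very different natural ranges onto one common [0, 1] scale:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy features with very different natural ranges (a ratio and a volume).
X = np.array([[0.92, 1.3e6],
              [0.15, 4.0e4],
              [0.60, 7.5e5]])
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)  # every column now spans [0, 1]
```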
Point 2. The machine learning approach is very explainable.
During feature selection, it is even possible to discover labeling errors in the data by examining individual features, something that is hardly possible with the deep learning approach. When the ML model works well, you know exactly why it works: you solved the problem with human intelligence, and the whole process is transparent.
Point 3. Deep learning is powerful but hard to control.
Deep learning models have powerful representations; you can reach high accuracy almost immediately without going through feature engineering. However, setting up and training neural networks is a heavy process, which makes it less flexible when you want to update or change something later. Training can be unstable and volatile, and model performance is sensitive to hyperparameters, yet hyperparameter tuning won't give you much insight into your original problem.