Too many GitHub notifications, too little time. The obvious answer is to spend lots of time creating a hyper-personalised prediction engine that can tell me what I'm interested in, and to learn a whole bunch of stuff along the way. This is a tongue-in-cheek experiment, which resulted in the realisation that I'm pretty unpredictable.
See it in action (predictions for the chillu GitHub user):
http://github-issue-ml-relevancy.herokuapp.com
- Collect GitHub events from each repo the viewer has previously interacted with
- Score each issue and pull request based on the number of interactions (if any)
- Train a neural network with both categorical and continuous data, with a regression learner
- Provide a prediction service for this user
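The scoring step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual logic: the flat event dictionaries (`repo`, `number`, `actor`, `type`) and the per-event weights in `EVENT_WEIGHTS` are assumptions made up for the example; only the event type names are real GitHub event types.

```python
# Hypothetical weights per GitHub event type; the real scoring may differ.
EVENT_WEIGHTS = {
    "IssuesEvent": 2.0,
    "PullRequestEvent": 2.0,
    "IssueCommentEvent": 1.0,
    "PullRequestReviewCommentEvent": 1.0,
}


def score_issues(events, viewer_login):
    """Sum the viewer's weighted interactions per (repo, number) pair."""
    scores = {}
    for event in events:
        key = (event["repo"], event["number"])
        # Register every issue, so untouched ones get an explicit 0.0 score.
        scores.setdefault(key, 0.0)
        if event["actor"] == viewer_login:
            scores[key] += EVENT_WEIGHTS.get(event["type"], 0.0)
    return scores


# Made-up sample events in the assumed flat shape.
events = [
    {"repo": "myorg/myrepo", "number": 1, "actor": "chillu", "type": "IssuesEvent"},
    {"repo": "myorg/myrepo", "number": 1, "actor": "chillu", "type": "IssueCommentEvent"},
    {"repo": "myorg/myrepo", "number": 2, "actor": "someone-else", "type": "IssueCommentEvent"},
]
scores = score_issues(events, "chillu")
```

Issues the viewer never touched keep a score of zero, which gives the regression learner negative examples as well as positive ones.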
The approach was also presented in Sept 2020 at the virtual StripeConEU conference - see talk recording.
The input parameters are sourced from https://githubarchive.org, a ~6TB data set of every GitHub event ever created. The data is accessible via Google BigQuery. We're only interested in events related to repositories that the user has previously interacted with. In my case, this reduced the training data set to about 20k rows.
See notebook/learn.ipynb for the BigQuery queries run to retrieve the parameters.
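The filtering idea (keep only events from repositories the user has touched) is expressed in BigQuery in the notebook. As a rough plain-Python equivalent, assuming events have already been flattened into dictionaries with hypothetical `repo` and `actor` keys, it might look like this:

```python
def repos_touched_by(events, viewer_login):
    """Return the set of repos in which the viewer triggered at least one event."""
    return {e["repo"] for e in events if e["actor"] == viewer_login}


def relevant_events(events, viewer_login):
    """Keep all events (by anyone) from repos the viewer has interacted with."""
    repos = repos_touched_by(events, viewer_login)
    return [e for e in events if e["repo"] in repos]


# Made-up sample data in the assumed flat shape.
all_events = [
    {"repo": "myorg/myrepo", "actor": "chillu"},
    {"repo": "myorg/myrepo", "actor": "someone-else"},
    {"repo": "other/unrelated", "actor": "someone-else"},
]
filtered = relevant_events(all_events, "chillu")
```

Note that events by other users in the same repos are kept on purpose: they describe the issues the viewer did not interact with.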
Training runs in Python 3 on the fast.ai framework, which builds on awesome libraries like PyTorch, scikit-learn and pandas. We're training both a neural network and a random forest.
See notebook/learn.ipynb for a (non-interactive) snapshot of the training process.
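Mixing categorical and continuous inputs usually means encoding the former as integer codes and normalising the latter before training. fast.ai's tabular tooling takes care of this kind of preprocessing; the sketch below is only a plain-Python illustration of the idea, and the column names (`repo`, `comment_count`) are hypothetical rather than taken from the notebook.

```python
from statistics import mean, pstdev


def encode_categorical(values):
    """Map each distinct value to an integer code, starting at 1 (0 is left free for unknowns)."""
    vocab = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [vocab[v] for v in values], vocab


def normalise_continuous(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero for constant columns
    return [(v - mu) / sigma for v in values]


# Made-up rows with one categorical and one continuous column.
rows = [
    {"repo": "a/b", "comment_count": 1},
    {"repo": "c/d", "comment_count": 2},
    {"repo": "a/b", "comment_count": 3},
]
repo_codes, repo_vocab = encode_categorical([r["repo"] for r in rows])
comment_norm = normalise_continuous([r["comment_count"] for r in rows])
```

The integer codes would typically feed an embedding layer in the neural network, while the normalised values go in directly.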
The frontend is a Flask web application served by Gunicorn, running on Python 3. It's hosted on Heroku.
Installation
pip3 install -r requirements.txt
Configuration (populate vars in new file)
cp .env.default .env
Start local server
FLASK_ENV=development FLASK_APP=app.py flask run
Deploy to Heroku
# Initialise new Heroku project
heroku create
# Set up config vars
heroku config:set GITHUB_API_TOKEN="..."
heroku config:set VIEWER_LOGIN="..."
# Deploy
git push heroku master
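Whether the settings come from a local .env file or from Heroku config vars, the application ends up reading them from the environment. A minimal sketch of a fail-fast check is shown below; it assumes the two variables set in the Heroku step are the only required ones, and `load_config` is a hypothetical helper, not a function from this repo.

```python
import os

# Assumed to be the full list of required settings (taken from the Heroku step).
REQUIRED_VARS = ("GITHUB_API_TOKEN", "VIEWER_LOGIN")


def load_config(environ=None):
    """Return the required settings, or raise if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing config vars: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


# Example with an explicit mapping instead of the real process environment.
config = load_config({"GITHUB_API_TOKEN": "example-token", "VIEWER_LOGIN": "chillu"})
```

Passing the mapping explicitly keeps the check easy to test; calling `load_config()` with no arguments reads `os.environ`.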
The CLI tool is an alternative to the web frontend.
Installation
pip3 install -r requirements.txt
Configuration
cp .env.default .env
Run
python3 cli.py https://github.com/myorg/myrepo/issues/999
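The CLI takes a GitHub issue or pull request URL as its argument. How cli.py parses it is not shown here; one possible approach, using only the standard library, is sketched below. `parse_issue_url` is a hypothetical helper and it assumes the standard GitHub URL shapes `/<owner>/<repo>/issues/<n>` and `/<owner>/<repo>/pull/<n>`.

```python
from urllib.parse import urlparse


def parse_issue_url(url):
    """Split a GitHub issue or pull request URL into (owner, repo, number)."""
    parsed = urlparse(url)
    if parsed.netloc != "github.com":
        raise ValueError("Not a github.com URL: " + url)
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 4 or parts[2] not in ("issues", "pull") or not parts[3].isdigit():
        raise ValueError("Not an issue or pull request URL: " + url)
    owner, repo, _kind, number = parts
    return owner, repo, int(number)


owner, repo, number = parse_issue_url("https://github.com/myorg/myrepo/issues/999")
```

Rejecting malformed URLs up front gives a clearer error than failing later during the API lookup or prediction.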
- Error when using repo that data set hasn't been trained on
- Lock Python deps via pipenv