To run the application, create a file called backend/backend/.env
with the following values (without quotation marks):
NTA_DEVELOPER_KEY=YOUR_KEY
AWS_MYSQL_PASSWORD=AWS_MYSQL_PASSWORD
LOCAL_MYSQL_PASSWORD=LOCAL_MYSQL_PASSWORD
SOCIAL_SECRET=SOCIAL_SECRET
GOOGLE_CLIENT_ID=GOOGLE_CLIENT_ID
By default the Django backend will use the MySQL instance hosted by AWS.
To switch to a local MySQL instance instead you can set the following environment variable:
export DB='local_mysql'
This requires that the MySQL database is setup correctly on the local host.
To switch to a local SQLite instance you can set the following environment variable:
export DB='local_sqlite3'
Before running the setup_db.sh script, the GTFS_STATIC_DIR environmental variable must be set, which specifies the path to where the GTFS static files are located e.g.
export GTFS_STATIC_DIR='/home/team13/UCD-DublinBus/backend/dublinbus/scripts/gtfs_static'
One way to frame the prediction problem is to match all pairs of adjacent bus stops where it is possible to get a bus between them. A separate model is then trained for each of these pairs. The training data can come from any bus route that travels between the two stops at any time. There are two stages to this adjacent stops approach:
- Matching Adjacent Pairs of Stops: This step matches up pairs of adjacent stops
for each day in the 2018 data and saves them as separate parquet files. Run it from the
data-analytics
folder as follows:
nohup python -u -m preprocessing.run_preprocessing create_adjacent_stop_pairs &
It saves the output to /home/team13/data/adjacent_stop_pairs/
. Timestampted logs are available in
/home/team13/logs/preprocessing/
.
- Feature Engineering: This stage takes the input of the previous stage and combines all the
parquet files for a particular stop pair together and adds features. Run it from the
data-analytics
folder as follows:
nohup python -u -m preprocessing.run_preprocessing features &
It saves the output to /home/team13/data/adjacent_stop_pairs_with_features/
. Timestampted logs are available in
/home/team13/logs/preprocessing/
. The features are:
bank_holiday
: A binary variable that is 1 if the date was a bank holiday or 0 otherwisecos_time
: The cosine of the number of seconds since midnightsin_time
: The sine of the number of seconds since midnightcos_day
: The cosine of the numeric day of the week (0 for Monday, 1 for Tuesday, etc.)sin_day
: The sine of the numeric day of the week (0 for Monday, 1 for Tuesday, etc.)is_weekend
: A binary variable equal to 1 if it's a weekend or 0 otherwiseday_1-day_6
: One-hot encoding of the days of the week.rain
: Amount of precipitation in mm in the current hourlagged_rain
: Amount of precipitation in mm in the previous hourtemp
: Air temperature in degrees celsius
Only one of the cos_day
and sin_day
, the is_weekend
, or the day_1-day_6
features should be used at a time.
To run a model run a command such as:
nohup python -u -m models.run_model linear-regression train &
nohup python -u -m models.run_model random-forest train &
from the data-analytics
folder.
The saved models and the associated metrics will be saved to /home/team13/model_output/
. It is also possible to create learning curves for a particular stoppair by using the command:
python -u -m models.linear_regression learning-curve 618_to_619
The learning curves will be saved to /home/team13/model_output/
.
An important consideration in predicting the time taken for a bus journey is whether our predictions will include the time that the bus may spend waiting at a particular bus stop which is known as "dwell time". One approach could involve simply ignoring this and assuming that it will be an insignificant factor in the overall prediction. This would involve using the time that a bus leaves a stop and arrives at the next stop as the target value. Alternatively, we could try to account for dwell time by using the time that a bus leaes a stop and departs from the next stop as the target value. This target should then include any time spent by the bus at the stop. One disadvantage of this method is that the prediction might slightly overestimate the time for journeys between adjacent stops. However, for longer journeys involving many stops it seems better to try to include this dwell time because omitting it could result in a significant underestimation of true journey times.
An alternative way to frame the problem is to make a time prediction for the full route in a particular direction and then split this prediction up based on the historical averages as requested by the user.