-
Ferry uses VSCode's Dev Container for the dev environment. All environment setup is handled through the standard Dev Container pipeline. For a quick start, use the
Dev Containers: Reopen in Container
command in VSCode. -
MacOS: If you want to develop outside of the Dev Container, please install
pygraphviz
first before runningpip install -e .
pip install -U --no-cache-dir \ --config-settings="--global-option=build_ext" \ --config-settings="--global-option=-I$(brew --prefix graphviz)/include/" \ --config-settings="--global-option=-L$(brew --prefix graphviz)/lib/" \ pygraphviz
Ferry uses a submodule for the data repository. To clone the submodule, run:
git submodule update --init --depth 1
Any time in the future, you can update the submodule with:
cd data
git fetch --depth 1 origin master
git reset --hard origin/master
Then, you can commit the new submodule link to the main repo. It's not necessary to sync the latest link all the time, but feel free to whenever you are committing other changes.
Make sure you have all dependencies installed or are using the Dev Container.
python main.py -f config/dev_fetch.yml
python main.py -f config/dev_sync_db_courses.yml
Options can be passed either through the command line, through a YAML config file, or through environment variables, in decreasing order of precedence.
Note: because we use argparse
, you can provide just the prefix of each argument, such as --config
instead of --config-file
.
CLI flag | Config option | Env key | Default | Description |
---|---|---|---|---|
--cas-cookie |
cas_cookie |
CAS_COOKIE |
None ; prompt if crawl_evals |
Only used for fetching evals; see below |
-f , --config-file |
N/A | N/A | None |
Path to YAML config file, relative to PWD; if unspecified, all options are read from command |
--data-dir |
data_dir |
N/A | data |
Directory to load/store parsed data. This is usually where the ferry-data is cloned. |
--database-connect-string |
database_connect_string |
POSTGRES_URI |
None ; prompt if sync_db_courses or sync_db_evals |
Postgres connection string; for dev, see dev_sync_db_courses.yml |
-d , --debug |
debug |
N/A | False |
Enable debug logging |
-r , --release |
release |
N/A | False |
Run in release mode; see below |
-s , --seasons |
seasons |
N/A | None |
A list of seasons to fetch; see below |
--sentry-url |
sentry_url |
SENTRY_URL |
None ; prompt if release |
Sentry URL for error reporting; required in release mode, ignored in dev mode |
--use-cache |
use_cache |
N/A | False ; always False if release |
Use cached data instead of fetching fresh data. Even if not using cache, cache will still be updated. |
You can selectively run certain stages of Ferry. Options below are in the order of the stages. All stages are False
by default, in which case Ferry basically does nothing.
CLI flag | Config option | Description |
---|---|---|
--save-config |
save_config |
Save the parsed config options to config_file ; does nothing if config_file is unspecified. |
--crawl-seasons |
crawl_seasons |
Run the seasons crawler. If False and no cached data is available, then --seasons must be provided and all be valid seasons. |
--crawl-classes |
crawl_classes |
Run the class crawler. |
--crawl-evals |
crawl_evals |
Run the eval crawler. |
--transform |
transform |
Run the transformer. Always True if sync_db_courses , sync_db_evals , or snapshot_tables . |
--snapshot-tables |
snapshot_tables |
Generate CSV files capturing data that would be written to DB. |
--sync-db-courses |
sync_db_courses |
Sync the transformed data to the database. |
--sync-db-evals |
sync_db_evals |
Sync the transformed data to the database. |
--rewrite |
rewrite |
Use old sync db function to write to database instead of incremental write. Use this during the first run to set up the database. Only has an effect if sync_db_courses . |
--generate-diagram |
generate_diagram |
Generate a DB visualization diagram to docs/db_diagram.pdf |
For example, to test the transformer in isolation, you may find this handy:
python main.py --transform --snapshot-tables
In release mode:
use_cache
is alwaysFalse
; we always fetch fresh data.sentry_url
is required; we will send error reports to Sentry.
In non-release mode:
- We never initialize Sentry and
sentry_url
is ignored.
To get a valid CAS cookie to connect to OCE, first log into https://oce.app.yale.edu/ocedashboard/studentViewer. Refresh with the network inspector tab open, and find the first HTTP request. Copy the Cookie
header from the request and paste into the prompt or the cas_cookie
option. It should look like:
JSESSIONID=...; ...
- If the
seasons
option is unspecified, we default to fetching ALL viable seasons. - If the
seasons
option contains a single value calledLATEST_n
, we fetch the latestn
seasons. - Otherwise, the
seasons
option should be a list of seasons to fetch, using the standard season code format: e.g.--seasons 202301 202303
.
In almost all cases, it is sufficient to only fetch the last 3 seasons in dev. In fact, ferry
will not work when fetching and syncing to DB all seasons in dev due to professor_id
mapping requiring legacy seasons. Please clone ferry-data
if working with all seasons is necessary.
Eval data is fetched and committed using an entirely separate pipeline, because it's only run once a semester.
python main.py -f config/dev_evals.yml
python main.py --sync-db-evals
# Enter the prod database URL when prompted
black .
pyright
- Ensure that parallel evals is accurate
- Rearchitect GitHub action to create Postgres dump, upload with VC (enable lifecycle management) to Azure Blob, and restart prod
postgres
container to trigger new pull. - Modernize dependencies
- SIGINT handler
- Lint and format CI