Orchestrate data pipelines for ML dataset creation
In order to train and evaluate an ML model, datasets must be created consistently and reproducibly.
Forecasting renewable energy generation depends on weather data at large timescales: Numerical Weather Prediction (NWP) data, satellite imagery, and atmospheric quality data. Dagster helps to keep these datasets organised and up to date.
This repository contains the Dagster definitions that orchestrate the creation of these datasets.
The repository is packaged as a Docker image that can be used as a Dagster code server:
$ docker pull ghcr.io/openclimatefix/dagster-dags
To add as a code location in an existing Dagster setup:
$ docker run -d \
-p 4266:4266 \
-e DAGSTER_CURRENT_IMAGE=ghcr.io/openclimatefix/dagster-dags \
ghcr.io/openclimatefix/dagster-dags
# $DAGSTER_HOME/workspace.yaml
load_from:
- grpc_server:
host: localhost
port: 4266
location_name: "dagster-dags" # Name of the module
Note
Setting the DAGSTER_CURRENT_IMAGE
environment variable is necessary to tell Dagster
to spawn jobs using the specified container image. Since the Containerfile already
includes all the dependencies required by the user code, it makes sense to point the variable at the image itself.
To deploy the entire Dagster multi-container stack:
$ docker compose -f infrastructure/docker-compose.yml up
Note
This will start a full Dagster setup with a web UI, a Postgres database, and a QueuedRunCoordinator. This might be overkill for some setups.
The repository is split into folders covering the basic concepts of Dagster:
- The top-level Definitions object for the code location lives in
src/dagster_dags/definitions.py
- Assets are in
src/dagster_dags/assets
- Resources are in
src/dagster_dags/resources
They are then subdivided by module into data-type-specific folders.
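Concretely, the layout looks something like this (the data-type folder names shown are illustrative, not an exact listing):

```
src/dagster_dags/
├── definitions.py      # top-level Definitions for the code location
├── assets/
│   ├── nwp/            # e.g. NWP dataset assets
│   └── sat/            # e.g. satellite imagery assets
└── resources/
    └── ...             # shared resources
```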
To run a development Dagster server, install the required dependencies in a virtual environment, activate it, and run the server:
$ cd src && dagster dev --module-name=dagster_dags
This should spawn a UI at localhost:3000
where you can interact with the Dagster webserver.
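Assuming the project declares its dependencies in pyproject.toml with a dev extra (an assumption; check the repository's packaging files), the full sequence might look like:

```
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -e ".[dev]"
$ cd src && dagster dev --module-name=dagster_dags
```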
This project uses MyPy for static type checking and Ruff for linting. Installing the development dependencies makes them available in your virtual environment.
Use them via:
$ python -m mypy .
$ python -m ruff check .
Be sure to do this periodically while developing to catch errors early and prevent headaches with the CI pipeline. It may seem like a hassle at first, but it stops a whole class of bugs from creeping in.
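Both tools can read their configuration from pyproject.toml. A hedged example follows; the specific options shown are illustrative, not the project's actual settings:

```toml
[tool.mypy]
strict = true
ignore_missing_imports = true

[tool.ruff]
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I"]  # pycodestyle, pyflakes, import sorting
```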
Run the unittests with:
$ python -m unittest discover -s tests
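As an illustration of the unittest style, here is a hypothetical test for a small pure helper. The helper function is made up for this example; the real tests live under tests/:

```python
import unittest
from datetime import datetime, timedelta


def expected_init_times(
    start: datetime, end: datetime, cadence_hours: int = 6
) -> list[datetime]:
    """Hypothetical helper: list the model init times between two datetimes."""
    times: list[datetime] = []
    t = start
    while t <= end:
        times.append(t)
        t += timedelta(hours=cadence_hours)
    return times


class TestExpectedInitTimes(unittest.TestCase):
    def test_six_hourly_cadence(self) -> None:
        times = expected_init_times(datetime(2024, 1, 1), datetime(2024, 1, 2))
        # 00, 06, 12, 18 on day one, plus 00 on day two
        self.assertEqual(len(times), 5)
        self.assertEqual(times[-1], datetime(2024, 1, 2))
```

Tests written in this shape are picked up automatically by `unittest discover`.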
For more on running your own gRPC code server as a code location in Dagster:
- Dagster's guide on running a gRPC server.
- Creating a gRPC code server container as part of a multi-container Dagster stack.
- PRs are welcome! See the Organisation Profile for details on contributing
- Find out about our other projects in the OCF Meta Repo
- Check out the OCF blog for updates
- Follow OCF on LinkedIn
Part of the Open Climate Fix community.