Dagster Dags

Orchestrate data pipelines for ML dataset creation

In order to train and evaluate an ML model, datasets must be created consistently and reproducibly.

Forecasting renewable energy generation depends on large-timescale weather data: Numerical Weather Prediction (NWP) data, satellite imagery, and atmospheric quality data. Dagster helps to keep these datasets organised and up to date.

This repository contains the Dagster definitions that orchestrate the creation of these datasets.

Installation

The repository is packaged as a Docker image that can be used as a Dagster code server:

$ docker pull ghcr.io/openclimatefix/dagster-dags

Example Usage

To add as a code location in an existing Dagster setup:

$ docker run -d \
    -p 4266:4266 \
    -e DAGSTER_CURRENT_IMAGE=ghcr.io/openclimatefix/dagster-dags \
    ghcr.io/openclimatefix/dagster-dags

# $DAGSTER_HOME/workspace.yaml

load_from:
  - grpc_server:
      host: localhost
      port: 4266
      location_name: "dagster-dags" # Name of the module

Note

Setting the DAGSTER_CURRENT_IMAGE environment variable is necessary to tell Dagster which container image to use when spawning jobs. Since the Containerfile already includes all the dependencies required by the user code, it makes sense to point the variable at the image itself.
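
Before reloading the workspace, it can help to confirm that the container is actually listening on the published port. The following is a minimal sketch using only the Python standard library; the helper name `is_port_open` is hypothetical and not part of this repository. It only checks that a TCP connection can be opened, not that the gRPC server behind it is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # Host and port match the workspace.yaml entry above.
    reachable = is_port_open("localhost", 4266)
    print("code server reachable" if reachable else "code server not reachable")
```

If the check fails, the usual suspects are a container that has exited or a port mapping that differs from the one in workspace.yaml.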

To deploy the entire Dagster multi-container stack:

$ docker compose -f infrastructure/docker-compose.yml up

Note

This will start a full Dagster setup with a web UI, a Postgres database, and a QueuedRunCoordinator. This might be overkill for some setups.

Documentation

The repository is split into folders covering the basic concepts of Dagster:

  • The top-level Definitions object for the code location is in src/dagster_dags/definitions.py
  • Assets are in src/dagster_dags/assets
  • Resources are in src/dagster_dags/resources

They are then subdivided by module into data-type-specific folders.

Development

To run a development Dagster server, install the required dependencies in a virtual environment, activate it, and run the server:

$ cd src && dagster dev --module-name=dagster_dags

This should spawn a UI at localhost:3000 where you can interact with the Dagster webserver.

Linting and static type checking

This project uses MyPy for static type checking and Ruff for linting. Installing the development dependencies makes them available in your virtual environment.

Use them via:

$ python -m mypy .
$ python -m ruff check .

Be sure to do this periodically while developing to catch any errors early and prevent headaches with the CI pipeline. It may seem like a hassle at first, but it prevents accidental creation of a whole suite of bugs.

Running the test suite

Run the unit tests with:

$ python -m unittest discover -s tests
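
By default, unittest discovery picks up files matching test*.py inside the given start directory. As a rough sketch of what such a file might contain, the example below tests a small helper. Both the helper `hours_between` and the test class are hypothetical, included only to show the shape of a discoverable test.

```python
import unittest


def hours_between(start_hour: int, end_hour: int) -> int:
    """Hypothetical helper standing in for real project code.

    Returns the number of hours from start_hour to end_hour on a
    24-hour clock, wrapping past midnight if needed.
    """
    return (end_hour - start_hour) % 24


class TestHoursBetween(unittest.TestCase):
    """Collected automatically because the class subclasses TestCase."""

    def test_same_day(self) -> None:
        self.assertEqual(hours_between(6, 18), 12)

    def test_wraps_past_midnight(self) -> None:
        self.assertEqual(hours_between(22, 4), 6)
```

Saved under a name such as tests/test_example.py (an assumed filename), this would be collected by the command above without further configuration.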

Further Reading

On running your own gRPC code server as a code location in Dagster:


Contributing and community


Part of the Open Climate Fix community.