This developer guide is for data engineers, data scientists and developers in the OS-Climate community who want to use the OS-Climate Data Commons to build data ingestion and processing pipelines, as well as AI / ML pipelines. It shows, step by step, how to configure your development environment, structure projects, and manage data and code in a way that complies with our Architecture Blueprint.
Need Help?
- Outage / System failure: File a Linux Foundation (LF) outage ticket (note: select OS-Climate from project list)
- New infrastructure request (e.g. software upgrade): File an LF ticket (note: select OS-Climate from project list)
- General infrastructure support: Get help on OS-Climate Slack Data Commons channel
- Data Commons developer support: Get help on OS-Climate Slack Developers channel
Pipeline development relies on a number of tools provided by Data Commons. The table below gives an overview of the key technologies involved, with links to development instances where available:
Technology | Description | Link |
---|---|---|
GitHub | Version control tool used to maintain the pipelines as code | OS-Climate GitHub |
GitHub Projects | Project tracking tool that integrates issues and pull requests | Data Commons Project Board |
JupyterHub | Self-service environment for Jupyter notebooks used to develop pipelines | JupyterHub Development Instance |
Kubeflow Pipelines | MLOps tool to support model development, training, serving and automated machine learning | |
Trino | Distributed SQL Query Engine for big data, used for data ingestion and distributed queries | Trino Console |
CloudBeaver | Web-based database GUI tool that provides a rich web interface to Trino | CloudBeaver Development Instance |
Pachyderm | Data-driven pipeline management tool for machine learning, providing version control for data | |
dbt | SQL-based data transformation tool providing git-enabled version control of data transformation pipelines | |
Great Expectations | Data quality tool providing git-enabled management of data quality pipelines | |
OpenMetadata | Centralized metadata store providing data discovery, data collaboration, metadata versioning and data lineage | OpenMetadata Development Instance |
Airflow | Workflow management platform for data engineering pipelines | Airflow Development Instance |
Apache Superset | Data exploration and visualization platform | Superset Development Instance |
Grafana | Analytics and interactive visualization platform | Grafana Development Instance |
INCEpTION | Text-annotation environment primarily used by OS-C for machine learning-based data extraction | INCEpTION Development Instance |
Developers, including data scientists, commonly use Git and GitOps practices to store and share code on development platforms such as GitHub. GitOps best practices make projects reproducible and traceable. For this reason, we have adopted a GitOps approach to managing the platform, the data pipeline code, and the data and related artifacts.
One of the most important requirements to ensure data quality through reproducibility is dependency management. Having dependencies clearly managed in audited configuration artifacts allows portability of notebooks, so they can be shared safely with others and reused in other projects.
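As a rough illustration of what "clearly managed" dependencies can mean in practice, the sketch below flags requirement lines that are not pinned to an exact version. It assumes a pip-style requirements file; the templates may use a different dependency format, and the function name, sample packages and version numbers here are illustrative only.

```python
import re

# Matches "name==version" with optional extras, e.g. "pandas==2.1.4"
# or "trino[sqlalchemy]==0.328.0". Lines with environment markers
# ("; python_version < ...") are not handled and will be reported.
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned with '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):
            # Skip blank lines and pip options such as "-r base.txt".
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


# Illustrative input only; the version numbers are made up for the example.
sample = """\
pandas==2.1.4
trino[sqlalchemy]==0.328.0
great-expectations>=0.18
# a comment
dbt-trino
"""

print(unpinned_requirements(sample))
```

A check like this could run in continuous integration so that a notebook shared with another project always resolves to the same package versions.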
We use two project templates as starting points for new repositories:
- A project template for data pipelines, specific to OS-Climate Data Commons, can be found here: Data Pipelines Template
- A project template specifically for AI/ML pipelines can be found here: Data Science Template
Used together, these templates connect the needs of data scientists (e.g. notebooks, models) with those of data engineers (e.g. data and metadata pipelines). A consistent project structure ensures that all the pieces required for the Data and MLOps lifecycles are present and easily discoverable.
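One way to keep a repository aligned with its template is a small presence check. The sketch below is a minimal example under stated assumptions: the entry names in `EXPECTED_ENTRIES` are hypothetical placeholders, not the actual layout of either template, and should be replaced with whatever the chosen template defines.

```python
from pathlib import Path
import tempfile

# Hypothetical layout for illustration; the real templates define their own
# directories and files.
EXPECTED_ENTRIES = ["README.md", "notebooks", "src", "tests"]


def missing_entries(project_root, expected=EXPECTED_ENTRIES):
    """Return the expected top-level entries that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in expected if not (root / name).exists()]


# Demonstrate on a throwaway directory that contains only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "README.md").write_text("# demo\n")
    (demo_root / "notebooks").mkdir()
    print(missing_entries(demo_root))
```

The check only looks at top-level names, which keeps it cheap to run as a pre-commit or CI step.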