Airflow dependencies

Airflow is not a standard Python project. Most Python projects fall into one of two types: application or library. As described in this StackOverflow question, the decision whether to pin (freeze) dependency versions for a Python project depends on the type. For applications, dependencies should be pinned, but for libraries, they should be open.

For applications, pinning the dependencies keeps the installation stable over time, because new releases of dependencies (even transitive ones) might otherwise cause the installation to fail. For libraries, the dependencies should be open so that several different libraries with the same requirements can be installed at the same time.

The problem is that Apache Airflow is a bit of both - an application to install and a library to use when you are developing your own operators and DAGs.

This - seemingly unsolvable - puzzle is solved by having pinned constraints files.

Note

Only pip installation is officially supported.

While it is possible to install Airflow with tools like Poetry or pip-tools, they do not share the same workflow as pip - especially when it comes to constraint vs. requirements management. Installing via Poetry or pip-tools is not currently supported.

There are known issues with Bazel that might lead to circular dependencies when using it to install Airflow. Please switch to pip if you encounter such problems. The Bazel community added support for cycles in this PR, so newer versions of Bazel might handle it.

If you wish to install Airflow using these tools, you should use the constraint files and convert them to the format and workflow that your tool requires.

By default, when you install the apache-airflow package, the dependencies are as open as possible while still allowing the apache-airflow package to install. This means that the apache-airflow package might fail to install when a direct or transitive dependency is released that breaks the installation. In that case, when installing apache-airflow, you might need to provide additional constraints (for example pip install "apache-airflow==1.10.2" "Werkzeug<1.0.0"). The quotes keep the shell from treating < as a redirection.
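When more than one extra pin is needed, the additional constraints can also be kept in a local file. The following is a minimal sketch, assuming pip's --constraint option is given a local path; the file name extra-constraints.txt is hypothetical, and the pip command is only printed, not executed.

```shell
#!/usr/bin/env bash
# Keep ad-hoc pins in a local constraints file (hypothetical file name).
WORK_DIR="$(mktemp -d)"
EXTRA_CONSTRAINTS="${WORK_DIR}/extra-constraints.txt"

# The quoted heredoc delimiter keeps the shell from interpreting "<".
cat > "${EXTRA_CONSTRAINTS}" <<'EOF'
Werkzeug<1.0.0
EOF

# Print the command instead of running it, so the sketch has no side effects.
echo "pip install \"apache-airflow==1.10.2\" --constraint \"${EXTRA_CONSTRAINTS}\""
```

Keeping the pins in a file makes it easier to review and drop them once the breaking release is fixed upstream.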

There are several sets of constraints we keep:

  • "constraints" - these are constraints generated by matching the current Airflow version from sources with providers that are installed from PyPI. These are the constraints used by users who want to install Airflow with pip. They are named constraints-<PYTHON_MAJOR_MINOR_VERSION>.txt.
  • "constraints-source-providers" - these are constraints generated by using providers installed from current sources. While adding new providers their dependencies might change, so this set of constraints is the current set of constraints for Airflow and providers from the current main sources. Those constraints are used by the CI system to keep a "stable" set of constraints. They are named constraints-source-providers-<PYTHON_MAJOR_MINOR_VERSION>.txt.
  • "constraints-no-providers" - these are constraints generated from only Apache Airflow, without any providers. If you want to manage Airflow separately and then add providers individually, you can use them. Those constraints are named constraints-no-providers-<PYTHON_MAJOR_MINOR_VERSION>.txt.

The first two can be used as constraint files when installing Apache Airflow in a repeatable way. For example, from the PyPI package:

pip install "apache-airflow[google,amazon,async]==2.2.5" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.5/constraints-3.8.txt"

The last one can be used to install Airflow in "minimal" mode - i.e. when bare Airflow is installed without extras.

When you install Airflow from sources (in editable mode), you should use "constraints-source-providers" instead (this accounts for the case when some providers have not yet been released and have conflicting requirements).

pip install -e ".[devel]" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-main/constraints-source-providers-3.8.txt"

This also works with extras - for example:

pip install ".[ssh]" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-main/constraints-source-providers-3.8.txt"

There is a different set of fixed constraint files for each Python major/minor version, and you should use the file that matches your Python version.
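One way to avoid picking the wrong file is to assemble the URL from its parts. This is a minimal sketch that follows the URL pattern used in the commands above; the function names constraints_url and current_python_version are hypothetical helpers, not part of Airflow or pip.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a constraints URL from its parts.
#   $1 - Airflow reference, e.g. "2.2.5" or "main"
#   $2 - Python major.minor version, e.g. "3.8"
#   $3 - optional flavour: "source-providers" or "no-providers"
constraints_url() {
  local airflow_ref="$1"
  local python_version="$2"
  local flavour="${3:-}"
  local name="constraints"
  if [ -n "${flavour}" ]; then
    name="constraints-${flavour}"
  fi
  echo "https://raw.githubusercontent.com/apache/airflow/constraints-${airflow_ref}/${name}-${python_version}.txt"
}

# Hypothetical helper: major.minor version of the python3 found on PATH.
current_python_version() {
  python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")'
}

# Print the URLs used in the examples of this document.
constraints_url 2.2.5 3.8
constraints_url main 3.8 source-providers
```

In practice the second argument could be "$(current_python_version)", so that the file always matches the interpreter pip installs into.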

If you want to update just the Airflow dependencies, without paying attention to providers, you can do it using constraints-no-providers constraint files as well.

pip install . --upgrade \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-main/constraints-no-providers-3.8.txt"

The constraints-<PYTHON_MAJOR_MINOR_VERSION>.txt and constraints-no-providers-<PYTHON_MAJOR_MINOR_VERSION>.txt files are automatically regenerated by a CI job every time pyproject.toml is updated and pushed, provided the tests are successful.
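Because the files are regenerated, the version pinned for a given package can change between runs. The sketch below shows one way to look up a pin in a downloaded constraints file. It assumes the file holds one name==version entry per line; the helper pinned_version and the sample entries are made up for illustration and are not real Airflow pins.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the version a constraints file pins for a package.
# Assumes one "name==version" entry per line; the name match is case-insensitive.
pinned_version() {
  local package="$1"
  local constraints_file="$2"
  awk -F'==' -v pkg="${package}" 'tolower($1) == tolower(pkg) { print $2 }' "${constraints_file}"
}

# Made-up sample standing in for a downloaded constraints file.
SAMPLE_CONSTRAINTS="$(mktemp)"
cat > "${SAMPLE_CONSTRAINTS}" <<'EOF'
# Illustrative entries only, not real pins.
example-lib==1.2.3
another-example==4.5.6
EOF

pinned_version Example-Lib "${SAMPLE_CONSTRAINTS}"
```

A package that is not listed produces no output, which makes the helper easy to use in a shell condition.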

There are a number of extras that can be specified when installing Airflow. Those extras can be specified after the usual pip install - for example pip install -e ".[ssh]" for an editable installation. Note that there are two kinds of extras: regular extras, used when you install Airflow as a user, and extras available only in editable mode, namely devel extras (necessary if you want to run Airflow locally for testing) and doc extras (which install the tools needed to build the documentation).

This is the full list of these extras:

The devel extras are not available in the released packages. They are only available when you install Airflow from sources in editable installation - i.e. the one you usually use to contribute to Airflow. They provide tools such as pytest and mypy for general-purpose development and testing. Some providers also have their own development-related extras that install the provider-specific tools necessary to run their tests.

devel, devel-all, devel-all-dbs, devel-ci, devel-debuggers, devel-devscripts, devel-duckdb, devel-hadoop, devel-mypy, devel-sentry, devel-static-checks, devel-tests

The doc extras are not available in the released packages. They are only available when you install Airflow from sources in editable installation - i.e. the one you usually use to contribute to Airflow. They provide the tools needed to build the Airflow documentation (note that you also need the devel extras installed for Airflow and providers in order to build documentation for Airflow and provider packages respectively). The doc extra is enough to build the regular documentation, whereas doc-gen is needed to generate the ER diagram that describes our database.

doc, doc-gen

Those extras are available as regular Airflow extras and are intended for Airflow users and contributors to select the features of Airflow they want to use. They might install additional providers or just install the dependencies that are necessary to enable the feature.

aiobotocore, airbyte, alibaba, all, all-core, all-dbs, amazon, apache-atlas, apache-beam, apache-cassandra, apache-drill, apache-druid, apache-flink, apache-hdfs, apache-hive, apache-impala, apache-kafka, apache-kylin, apache-livy, apache-pig, apache-pinot, apache-spark, apache-webhdfs, apprise, arangodb, asana, async, atlas, atlassian-jira, aws, azure, cassandra, celery, cgroups, cloudant, cncf-kubernetes, cohere, common-io, common-sql, crypto, databricks, datadog, dbt-cloud, deprecated-api, dingding, discord, docker, druid, elasticsearch, exasol, fab, facebook, ftp, gcp, gcp_api, github, github-enterprise, google, google-auth, graphviz, grpc, hashicorp, hdfs, hive, http, imap, influxdb, jdbc, jenkins, kerberos, kubernetes, ldap, leveldb, microsoft-azure, microsoft-mssql, microsoft-psrp, microsoft-winrm, mongo, mssql, mysql, neo4j, odbc, openai, openfaas, openlineage, opensearch, opsgenie, oracle, otel, pagerduty, pandas, papermill, password, pgvector, pinecone, pinot, postgres, presto, pydantic, qdrant, rabbitmq, redis, s3, s3fs, salesforce, samba, saml, segment, sendgrid, sentry, sftp, singularity, slack, smtp, snowflake, spark, sqlite, ssh, statsd, tableau, tabular, telegram, teradata, trino, vertica, virtualenv, weaviate, webhdfs, winrm, yandex, zendesk
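Several regular extras can be combined in one requirement by separating them with commas, as in the PyPI example earlier in this document. The following sketch assembles such a requirement string; airflow_requirement is a hypothetical helper, and the pip command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build "apache-airflow[extra1,extra2]==<version>".
#   $1  - Airflow version, e.g. "2.2.5"
#   $2+ - zero or more extras from the list above
airflow_requirement() {
  local version="$1"
  shift
  local IFS=","          # "$*" joins the remaining arguments with commas
  local extras="$*"
  if [ -n "${extras}" ]; then
    echo "apache-airflow[${extras}]==${version}"
  else
    echo "apache-airflow==${version}"
  fi
}

REQUIREMENT="$(airflow_requirement 2.2.5 celery postgres ssh)"
# Quote the requirement: square brackets are glob characters in most shells.
echo "pip install \"${REQUIREMENT}\" --constraint \"https://raw.githubusercontent.com/apache/airflow/constraints-2.2.5/constraints-3.8.txt\""
```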


You can now check how to update Airflow's metadata database if you need to update the structure of the DB.