airflow-local

About

This project provides an Ubuntu (16.04) Vagrant Virtual Machine (VM) with Airflow, a data workflow management system from Airbnb.

There are Ansible scripts that automatically install the software when the VM is started.

1. Run Airflow

Connect to the VM

  1. To start the virtual machine (VM), type

    vagrant up
    
  2. Copy the private keys (run on the host)

    copy-private-keys.bat
    
  3. Connect to the VM

    vagrant ssh airflow-master
    
  4. Set up the private keys inside the VM

    /vagrant/scripts/setup-private-keys.sh
    

Initialize Airflow

  1. Initialize Airflow home

    export AIRFLOW_HOME=~/airflow
    
  2. Setup the sqlite database

    airflow initdb
    
  3. Change to the Airflow directory

    cd $AIRFLOW_HOME
    
  4. Reduce the load on the Airflow system by setting values in airflow.cfg

    parallelism = 4
    dag_concurrency = 2
    celeryd_concurrency = 4
    
  5. Start the web server

    airflow webserver -p 8080
    
  6. Open a web browser to the UI at http://192.168.33.10:8080

Run a task

  1. List DAGs

    airflow list_dags
    
  2. List tasks for example_bash_operator DAG

    airflow list_tasks example_bash_operator
    
  3. List tasks for example_bash_operator in a tree view

    airflow list_tasks example_bash_operator -t
    
  4. Run the runme_0 task on the example_bash_operator DAG today

    airflow run example_bash_operator runme_0 `date +%Y-%m-%d`
    
  5. Backfill a DAG

    export START_DATE=$(date -d "-2 days" "+%Y-%m-%d")
    airflow backfill -s $START_DATE example_bash_operator
    
  6. Clear the history of DAG runs

    airflow clear example_bash_operator
    

Add a new task

  1. Go to the Airflow config directory

    cd ~/airflow
    
  2. Set the Airflow DAGs directory in airflow.cfg by changing the line:

    dags_folder = /vagrant/airflow/dags
    
  3. Disable loading the example DAGs

    load_examples = False
    
  4. Restart the web server

    airflow webserver -p 8080
    
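A DAG file dropped into /vagrant/airflow/dags follows the usual Airflow 1.x pattern. The sketch below is a minimal, hypothetical example (the file name, dag_id, and task are illustrative; it is not the dynamic_dags definition shipped with this repository):

    # minimal_dag.py - hypothetical minimal DAG for Airflow 1.x
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id='hello_dag',                 # shown by `airflow list_dags`
        start_date=datetime(2017, 1, 1),
        schedule_interval='@daily',
    )

    # A single task that prints the date on the worker
    print_date = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag,
    )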

Run the new DAG

  1. List the DAGs to confirm that dynamic_dags is present

    airflow list_dags
    
  2. Trigger the DAG

    airflow trigger_dag dynamic_dags
    
  3. Run the scheduler to actually execute the DAG

    airflow scheduler
    

Disable logging

  1. Change to the airflow directory

    cd /vagrant/airflow
    
  2. Set airflow environment

    source set_airflow_env.sh
    
  3. Run airflow without any logging messages

Setup airflow dags directory

  1. Edit file ~/airflow/airflow.cfg

  2. Set the following:

    dags_folder = /vagrant/airflow/dags
    load_examples = False
    
  3. Start the webserver & scheduler by running the following

    tmuxp load /vagrant/scripts/tmux-webserver-scheduler.yaml
    

Setup Airflow in Pseudo-distributed mode using Local Executor

Follow the instructions here

Setup Airflow in distributed mode using Celery Executor

Follow the instructions here

2. Run RabbitMQ

Start RabbitMQ

  1. Start RabbitMQ in a Docker container

    export RMQ_IMG=rabbitmq:3.6.10-management
    docker run -d --rm --hostname airflow-rmq \
        --name airflow-rmq -p 192.168.33.10:15672:15672 -p 5672:5672 $RMQ_IMG
    
  2. Display the list of running Docker instances

    docker ps
    
  3. Go to the RabbitMQ dashboard at http://192.168.33.10:15672/

  4. Log in using guest/guest

Stop RabbitMQ

  1. Get the ID of the RabbitMQ Docker container

    export RMQ=$(docker ps -aq --filter name=airflow-rmq)
    
  2. List queues

    docker exec -ti $RMQ rabbitmqctl list_queues
    
  3. Stop RabbitMQ

    docker stop $RMQ
    

3. Connect to RabbitMQ using Python

Connect to RabbitMQ using only Python (no Celery)

The RabbitMQ web site demonstrates how to connect using Python and the Pika library.

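For reference, a minimal send script in the spirit of rmq-send.py might look like the sketch below, which follows the standard Pika "hello world" tutorial; the actual scripts under /vagrant/scripts may differ:

    # rmq_send_sketch.py - hypothetical minimal Pika producer
    import pika

    # Connect to the RabbitMQ broker exposed by the Docker container
    connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()

    # Declare the queue (idempotent) and publish one message to it
    channel.queue_declare(queue='hello')
    channel.basic_publish(exchange='', routing_key='hello', body='Hello World!')

    connection.close()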
  1. List queues

    docker exec -ti $RMQ rabbitmqctl list_queues
    
  2. Send a message to a RabbitMQ queue called hello

    python rmq-send.py
    
  3. Receive a message from the RabbitMQ queue called hello

    python rmq-receive.py
    
  4. List queues; the hello queue is displayed

    docker exec -ti $RMQ rabbitmqctl list_queues
    
  5. Stop the app

    docker exec -ti $RMQ rabbitmqctl stop_app
    
  6. Start the app

    docker exec -ti $RMQ rabbitmqctl start_app
    
  7. List queues; the hello queue is no longer displayed (the non-durable queue did not survive the broker restart)

    docker exec -ti $RMQ rabbitmqctl list_queues
    

Connect to RabbitMQ using Celery

  1. Start the Celery worker

    export PYTHONPATH=/vagrant/scripts
    celery -A tasks worker --loglevel=info
    
  2. Call the task

    export PYTHONPATH=/vagrant/scripts
    python -c "from tasks import add; add.delay(2, 3)"
    
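The tasks module imported above is not reproduced here; a minimal Celery app of that shape might look like the following sketch (the broker URL is an assumption based on the RabbitMQ container started in section 2 with the default guest/guest credentials):

    # tasks.py - hypothetical minimal Celery app matching the commands above
    from celery import Celery

    # Point the app at the RabbitMQ broker from section 2
    app = Celery('tasks', broker='amqp://guest:guest@localhost:5672//')

    @app.task
    def add(x, y):
        # Runs on the Celery worker when add.delay(2, 3) is called
        return x + y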

4. Run Postgres

Start Postgres

  1. Start Postgres in a Docker container named airflow-pg

    export PG_IMG=postgres:9.6.3
    export PGPASSWORD=airflow_pg_pass
    docker run -d --rm --name airflow-pg -p 0.0.0.0:5432:5432 \
        -e POSTGRES_PASSWORD=$PGPASSWORD $PG_IMG
    export PG=$(docker ps -aq --filter name=airflow-pg)
    

Stop Postgres

  1. List the Docker container

    docker ps --filter id=$PG
    
  2. Stop Postgres

    docker stop $PG
    

5. Connect to Postgres using Psycopg2 and SQLAlchemy

Start the Postgres database before running these steps.

  1. Connect to the database using psql and create the database test

    docker exec -ti $PG psql -U postgres -c "create database test"
    
  2. Create the table test in the test database using only Psycopg2 (a sketch of such a script appears after this list)

    export PGHOST=localhost
    python /vagrant/scripts/pg-psycopg2.py
    
  3. Connect to database test using Psycopg2 and SQLAlchemy

    python pg-sqlalchemy-read.py
    
  4. Connect to the postgres database again

    docker exec -ti $PG psql -U postgres
    
  5. List the databases

    \l
    
  6. Connect to the test database

    \c test
    
  7. List objects in the test database

    \d
    
  8. Select all rows from the test table

    select id, num, data from test;
    
  9. Quit the psql utility

    \q
    
  10. Drop database test

    docker exec -ti $PG psql -U postgres -c "drop database test"
    
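The script /vagrant/scripts/pg-psycopg2.py is not reproduced here; a table-creation script of that kind typically looks like the sketch below (it assumes the test database and the airflow_pg_pass password from the steps above):

    # pg_psycopg2_sketch.py - hypothetical table-creation script
    import psycopg2

    # Connect to the test database in the airflow-pg container
    conn = psycopg2.connect(host='localhost', dbname='test',
                            user='postgres', password='airflow_pg_pass')
    cur = conn.cursor()

    # Create the test table and insert one row
    cur.execute("CREATE TABLE IF NOT EXISTS test "
                "(id serial PRIMARY KEY, num integer, data varchar)")
    cur.execute("INSERT INTO test (num, data) VALUES (%s, %s)", (100, 'abc'))
    conn.commit()

    cur.close()
    conn.close()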

6. Setup Airflow with Postgres

1. Using the LocalExecutor

  1. Change the executor in the ~/airflow/airflow.cfg file to the following value

    executor = LocalExecutor
    
  2. Change the sql_alchemy_conn in the ~/airflow/airflow.cfg file to the following value

    # Change the meta db configuration
    sql_alchemy_conn = postgresql+psycopg2://postgres:airflow_pg_pass@localhost/test
    
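Before pointing Airflow at the new metadata database, the connection string can be sanity-checked with a few lines of SQLAlchemy (a sketch; it assumes the test database and password created in the earlier Postgres steps):

    # check_meta_db.py - hypothetical connectivity check for the Airflow meta DB
    from sqlalchemy import create_engine, text

    engine = create_engine(
        'postgresql+psycopg2://postgres:airflow_pg_pass@localhost/test')

    # A trivial query proves the URL, credentials and driver all work
    with engine.connect() as conn:
        print(conn.execute(text('select 1')).scalar())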

2. Using the CeleryExecutor with Redis

  1. Change the executor in the ~/airflow/airflow.cfg file to the following value

    executor = CeleryExecutor
    
  2. Set the following two values in the ~/airflow/airflow.cfg file

    broker_url = redis://localhost:6379/0
    celery_result_backend = redis://localhost:6379/0
    
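The broker and result backend above can be checked from Python before restarting the Airflow daemons (a sketch; it assumes the redis Python package is installed and Redis is listening on its default port):

    # check_redis.py - hypothetical check of the Celery broker/result backend
    import redis

    # Database 0 matches redis://localhost:6379/0 in airflow.cfg
    r = redis.Redis(host='localhost', port=6379, db=0)
    print(r.ping())  # True if the broker is reachable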

3. Initialize the database with Airflow data

  1. Initialize the test database as the Airflow metadata database

    airflow initdb
    

4. Start airflow daemons with Tmux

    tmuxp load /vagrant/scripts/tmux-airflow-daemons.yaml

5. View the Airflow web server at http://192.168.33.10:8080

6. View the Airflow flower server at http://192.168.33.10:5555

7. Setup Airflow worker machine

  1. Copy Airflow configuration

    rsync -zvh airflow/airflow*.cfg worker:~/airflow/
    
  2. Initialize Airflow home

    export AIRFLOW_HOME=~/airflow

  3. Run the Airflow worker

    airflow worker

  4. Stop Redis

    sudo systemctl stop redis

  5. Modify the Redis configuration

    sudo vim /etc/redis/redis.conf

  6. Change the bind line to the following

    bind 0.0.0.0

  7. Start Redis

    sudo systemctl start redis

8. Start airflow daemons with supervisord

  1. Create the Postgres Docker container
  2. Create the test database
  3. Run airflow initdb
  4. Run sudo supervisord
  5. Set up logging in airflow-scheduler.conf, airflow-webserver.conf, and airflow-worker.conf

9. Setup netdata for monitoring

  1. Edit netdata configuration

    vim /opt/netdata/etc/netdata/netdata.conf
    
  2. Show status of netdata

    sudo systemctl status netdata
    
  3. Start netdata

    sudo systemctl start netdata
    
  4. View the netdata at http://192.168.33.10:19999/

  5. Stop netdata

    sudo systemctl stop netdata
    
  6. Netdata stores data in memory and updates every second. To store hours of data without using up too much memory, add the following to netdata.conf

    [global]
    update every = 10
    

Documentation

  1. Main documentation

  2. Videos on Airflow

  3. Slides

  4. Airflow reviews

  5. Airflow tips and tricks

Requirements

The following software is needed to get the code from GitHub and run Vagrant to set up the Python development environment. The Git environment also provides an SSH client for Windows.
