Skip to content

Latest commit

 

History

History
556 lines (391 loc) · 23.5 KB

README.md

File metadata and controls

556 lines (391 loc) · 23.5 KB

Superside challenge resolution

Kubernetes Terraform

Airbyte Apache Airflow Python 3.10.12 dbt-version

Ruff Code style: black Imports: isort

Conventional Commits pre-commit

Resolution of the Superside challenge for the Lead Data Engineer role.

Directories structure

This is the structure of the project.

.
├── .dockerignore
├── .env.dbt.local.example
├── .env.services.local.example
├── .github
│   └── workflows
│       └── pull-request.workflow.yaml
├── .gitignore
├── .markdownlint.json
├── .pre-commit-config.yaml
├── .python-version
├── .sqlfluffignore
├── .vscode
│   ├── extensions.json
│   └── settings.json
├── CONTRIBUTING.md
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── dags
│   ├── .airflowignore
│   ├── settings.py
│   └── transformations.py
├── dbt
│   └── superside
│       ├── .sqlfluff
│       ├── README.md
│       ├── analyses
│       │   └── .gitkeep
│       ├── dbt_project.yml
│       ├── logs
│       ├── macros
│       │   ├── .gitkeep
│       │   ├── generate_schema_name.sql
│       │   ├── intermediate
│       │   │   ├── extract_number.sql
│       │   │   └── map_categories.sql
│       │   └── macros.yml
│       ├── models
│       │   ├── intermediate
│       │   │   ├── _intermediate__models.yml
│       │   │   └── int_engagement_metrics.sql
│       │   ├── marts
│       │   │   ├── _marts__models.yml
│       │   │   ├── fct_engagement_metrics.sql
│       │   │   └── project_engagement.sql
│       │   └── staging
│       │       ├── _clients__models.yml
│       │       ├── _clients__sources.yml
│       │       └── stg_clients__engagement_metrics.sql
│       ├── packages.yml
│       ├── profiles.yml
│       ├── seeds
│       │   ├── .gitkeep
│       │   ├── marts
│       │   │   └── dim_project.csv
│       │   └── seeds.yml
│       ├── snapshots
│       │   └── .gitkeep
│       └── tests
│           └── .gitkeep
├── dbt_packages
├── diagrams
│   ├── airbyte.png
│   ├── diagram.py
│   └── kubernetes.png
├── docker-compose.yml
├── images
│   ├── airbyte_ui.png
│   ├── cluster.png
│   ├── make.png
│   ├── transformations.png
│   ├── uk.png
│   └── workspace.png
├── infra
│   ├── .terraform.lock.hcl
│   ├── airbyte-values.yml
│   ├── airbyte.tf
│   ├── airflow-values.yml
│   ├── airflow.tf
│   ├── providers.tf
│   └── variables.tf
├── mypy.ini
├── noxfile.py
├── poetry.lock
├── pyproject.toml
├── registries.yaml
├── scripts
│   ├── clients_postgres_init.sh
│   └── warehouse_postgres_init.sh
└── source_data
    └── engagement_metrics_raw.csv

26 directories, 68 files

What you'll need

📙 It's recommended to take your time and read this whole guide before running any command

This solution runs in a local kubernetes cluster, so is containerized. You'll need the following mandatory tools in your local machine:

  • k3d for the local k8s cluster
  • kubectl to manage the k8s cluster through the CLI
  • Docker and docker-compose
    • Beware that you'll need around 10 GB of RAM available to allocate (check this link to see how in Docker Desktop)
  • Terraform
  • GNU Make
  • poetry to handle python dependencies
    • There's an useful make rule for this one, so you can skip its installation

Depending on your OS, the installation process will be different. If you're in macOS then run:

brew install k3d kubectl docker docker-compose tfenv
tfenv install 1.5.6
tfenv use 1.5.6

There are other optional dependencies:

  • Lens to easily manage the k8s cluster
  • DBeaver as a desktop SQL client
  • The recommended VS Code extensions

Architecture overview

The selected data stack is as follows:

  • Airbyte for data movement
    • The Airbyte metadata DB is external to the cluster
  • Airflow for workflow orchestration
    • The Airflow cluster is deployed with the CeleryExecutor and a Redis database working as a message broker between Celery and the worker
    • The Airflow metadata DB is external to the cluster
  • dbt for data modeling
    • The Airflow and dbt integration was made through cosmos
  • Postgres for data storage
    • This DB was selected just for simplicity
  • A local registry for storing and distributing container images

Airbyte and Airflow are installed in the kubernetes cluster via Helm through its Terraform provider. Also, the Airbyte source, destination, and connection are handled by Terraform.

This is a simplified diagram of the architecture:

cluster

Container orchestration

Each platform runs in its own node and namespace. These nodes are labeled with the component: [platform] label, where platform can be either airbyte or airflow. Then, the nodeSelector property is set to component: [platform] in each platform's values files.

Both platforms will run its jobs in ephemeral pods, which will be scheduled in a third node with label component: jobs. This is convenient for these reasons:

  • If using a node provisioner like karpenter, this architectue allows to provide ephemeral nodes just to run this workloads an then remove them, saving costs.
  • As the pods runs in isolated environments, any kind of disruption won't affect the other platform's components.
  • The nodes and its pods' resources, requests, and limits can be managed separately.
  • The ephemeral pods' resources can be modified through Airflow variables, as I've used the kubernetesPodOperator in the transformations DAG, making it easier to manage them.

Moreover, the databases were deployed as external services (outside the cluster) in order to ensure the statelessness of the cluster and seamlessly persist the platforms' states.

Data flow

The data flow is as follows (the provided raw data is in the source_data directory):

  1. The raw engagement_metrics_raw.csv data is loaded into the clients DB through the scripts/clients_postgres_init.sh script. This DB is considered as a source.
    • This was done to better emulate a production environment, and to allow me to use Airbyte, because otherwise, I would need these credentials which I don't have, in order to sync the data directly from the Google Sheets.
    • Another option could have been to create a Python script that runs on a DAG in Airflow and loads the data into the DB. I considered Airbyte a much more robust, resilient, and practical option.
  2. Once Airbyte runs its sync, the raw data is moved to the warehouse DB, which is the destination. You'll find the data in the clients.engagement_metrics landing table.
  3. Then, Airflow triggers the dbt transformations and tests, and the models are materialized in the warehouse DB, in separate schemas:
    • staging: materialized as a view, where simple casting and renaming is done, and has a 1-1 relation with the landing table.
    • intermediate: materialized as a view, where more complex transformations are done to normalize and prepare data for downstream consumption.
    • marts: materialized as a table, where the dim_project.csv data is loaded as a seed, and then joined with the fct_engagement_metrics table in a model named project_engagement.

Please go ahead and check the red arrows in the architecture diagram.

Setup

Before you begin, please disconnect any VPN you might be connected to, if any.

Please, take into account that the whole process will take at least 30 minutes, depending if you have previously pulled the Docker images, and your internet connectivity.

1. Generate the environment variables

Open a shell in your machine, and navigate to this directory. Then run:

make generate-dotenv

This will generate two .env files with predefined values. Please, go ahead and open it! If you want to modify some values, just take into account that this may break some things.

If you want to run any dbt command without the --target prod option, please fill the DBT_SCHEMA variable in the .env.dbt.local file.

2. Install the project dependencies

Run these commands in this sequence (beware if you've poetry already installed in your machine, as version 1.6.1 will be installed by default):

make install-poetry
make install-project
make dbt-install-pkgs

Optionally, if you've cloned the repo, you can run:

make install-pre-commit

To install the pre-commit hooks and play around with them.

3. Create an empty k8s cluster in your machine

⚠️ Remember to assign the 10 GB of RAM in Docker Desktop.

Run this command and wait a while for the cluster to be ready:

make cluster-create

You can monitor its state with Lens. Anyway, you can check the pods status from the terminal:

watch -d kubectl get pods -A

Wait until they're in the Running state.

Also, this command creates some useful services which you can check that are running both with Docker Desktop or by running:

docker ps

4. Deploy the platforms

Each time you run any command related to the cluster through a make rule, the current context is switched to the local cluster, to avoid any conflicts with other ones you may have in your ~/.kube/config file.

To deploy Airbyte and Airflow in the cluster, run:

make cluster-install-apps

This will take a while (10-15 minutes, depending on your machine), but you can monitor the state the same way you did before. Please wait until this process finishes before continuing.

5. Setup Airbyte

Go ahead and port-forward the following services to these local ports (verify that you aren't using them already):

  • Airbyte web server: 8085
  • Airbyte API server: 8001
  • Airflow web server: 8090

You can do this manually with Lens, or by running:

make cluster-port-forward

Verify that you can access the web servers by going to http://localhost:8085 and http://localhost:8090.

Then, please complete the Airbyte's initial setup.

airbyte_ui

Once done, please copy the workspace id from the Airbyte UI.

workspace

Or by running:

curl -u airbyte:airbyte http://localhost:8001/api/public/v1/workspaces

Also, just if you want, go to settings and upgrade all the sources and destinations connectors' versions.

Then, please fill the workspace_id in the infra/variables.tf file and run:

make cluster-setup-airbyte

This will trigger some Airbyte jobs that will run in some pods, so it will take a while to complete (around 5 minutes).

Once finished, go to the Airbyte's connections, and you'll see a new one named Clients. Please trigger a sync manually and wait until it finishes (around 5-10 minutes). You can trigger the sync as many times as you want, since the sync mode is Full Refresh | Overwrite.

6. Run the dbt models with Airflow

As the dbt models will run in ephemeral pods via the kubernetesPodOperator, you'll need to provide an image to the containers. To create it, please run:

make cluster-local-image

💡 Any time you change something in a dbt model, you need to re-run this command to update the image.

Go to http://localhost:8090, and login with the default credentials airflow:airflow.

Then, unpause the transformations DAG. You should see how the dbt models and tests are running in the ephemeral pods (scheduled in the node with label component=jobs). Please check this with Lens, or by running:

watch -d kubectl get pods \
  -n airflow \
  --field-selector spec.nodeName=$(
      kubectl get nodes \
        -l component=jobs \
        -o jsonpath='{.items[0].metadata.name}' \
    )

Then, wait around 3 minutes until the models and tests run. The DAG looks as follows:

transformations

7. Check the results in the warehouse

Open your SQL client and connect to the warehouse. These are the credentials:

  • User: warehouse
  • Password: warehouse
  • Database: warehouse
  • Host: localhost
  • Port: 5470

Then run the following query to fetch the results of the joined data:

SELECT * FROM marts.project_engagement;

Please go ahead and check the tables and views in the others schemas.

Other credentials

There are 3 more DDBB:

  • Database: Clients

    • User: clients
    • Password: clients
    • Host: localhost
    • Port: 5460
  • Database: Airflow Metadata

    • User: airflow
    • Password: airflow
    • Host: localhost
    • Port: 5450
  • Database: Airbyte Metadata

    • User: airbyte
    • Password: airbyte
    • Host: localhost
    • Port: 5440

8. Handle the platforms and external services

Please run make help for more information.

The following command will stop both the cluster and the external services:

make cluster-stop

If you want to destroy these resources, you can run directly:

make cluster-delete

If you want to uninstall the deployed platforms, please run:

make cluster-uninstall-apps

Data exploration

Before starting to build the models, I explored the raw data to understand its nature and relation with the dim_project (which I verified is in a clean format, ready to be consumed as is). To do this, I loaded the raw data in a source table in the Clients DB, engagement_metrics table. Some things that came up from the analysis are in the following section.

If you want to check the data and you have few resources in your machine, you don't need to keep the k8s cluster up and running for this. You can stop it with:

make cluster-stop

And then just start the external services (DDBB and registry) with:

make up

This will set up a much lighter environment that you can use for data exploration.

Presence of duplicates

I realized that the table's unique key is (project_id, engagement_id):

uk

I check these 2 cases with this query:

SELECT *
FROM engagement_metrics
WHERE project_id IN ('0bf50700-a93a-4e7d-8a04-30a6116acbeb', 'e34525ca-b73d-41b0-8e83-4ba9e983555d')
    AND engagement_id IN ('035ae529-0ace-4a6b-b0bf-95c85ff5af03', '11089782-22e7-43fd-8ace-221538ea812a')

And noticed that:

  • The row with customer_name = 'Customer_305' is duplicated, since all its columns has the same values.
  • The row with customer_name = 'Customer_561' has all its columns with the same values, except for the service, which differs between Consulting and Design, and Strateby and Frontend in the sub_service column.

Because of this, I've decided to deduplicate with the ROW_NUMBER() window function, because both cases seems duplicates to me.

Data normalization

All these transformations happen in the intermediate layer.

After further exploring the data, I noticed that the dates are in various formats, which is another issue that needs to be fixed.

Something similar happened with the names of the clients. They had typos and inconsistencies which were fixed.

On the other hand, the employee_count column contained numbers but in 2 cases it contained the words fifty and hundred, so they were replaced by their associated numbers because I thought that these values would be more useful as integers rather than varchar.

All the columns associated with monetary values ​​had the same characteristics. In addition to the number, they contained the currency symbol, and in some cases the suffix k. I assumed that:

  • All currencies were the same
  • The suffix k meant 103
  • These values would need to be converted to integers

So I adjusted this data accordingly.

Levenshtein distance

The most interesting part was that almost all columns seem to contain some categories, but these contain spelling errors in different positions of the sequences. To fix this issue more cleanly, I've used the Levenshtein distance.

As dbt doesn't provide an out-of-the-box package or method for this, I've installed the fuzzystrmatch extension in the warehouse.

After some tests, I found that all the misspelled categories were within a Levenshtein distance of less than 2 from their correct categories. You can find all these categories in the dbt project-level variables declared in the dbt_project.yml file.

You can check the results of this process in the warehouse.marts.project_engament table, for example, by running (analogous for the other columns):

categories

dbt macros

All this logic related to data normalization was the same for almost all the columns, so I've encapsulated it in some macros. Please go ahead and check the dbt/superside/macros directory.

Relationships

I checked that all the values in the warehouse.clients.engagement_metrics.project_id column corresponds to a value in the warehouse.marts.dim_project.project_id column, and vice versa. On the other hand, if I use this criterion to join them, some values look odd. For example, in some cases, the engagement_date was greater than the date_project_ended column, which is confusing.

Another thing that I found is that, in the joined table, the customer_id differs from the one in the dim_project column for the same project_id. If I add customer_id as a join condition, it happens that only ~90 rows match it. This also seemed weird to me, so I left it without this condition.

After further analysis, and as there was no source documentation provided, I wasn't able to determine if there was a problem with the data or if I was missing something, so I decided to perform the join just for project_id.

Final comments and issues found

I tried to enforce contracts both in the staging and marts layer, but it turned out that dbt 1.5 has a bug with this feature when the data types are UUID.

I have implemented tests to ensure unique keys, non-nullity, and categories in the intermediate and final models. I can implement more, but I consider that my understanding of the subject is demonstrated.

dbt documentation

Do you want to know more about the dbt project? Please go ahead and serve its documentation in you machine:

make dbt-docs-generate
make dbt-docs-serve port="9000"

This will open the documentation in your browser. Adjust the port number if needed.

Please take a look at the scripts/hide_dbt_resources_from_docs.py file. It contains an useful CLI to hide dbt resources from the docs that can't be hidden because of some bugs with dbt.

cli

CI Pipeline

I've deployed a simple CI pipeline to run the pre-commit hooks in a Github runner. Pleas go ahead and check the workflow file, and the previous runs in the repository.

More commands and help

If you're struggling with some commands, please run make help to get all the available commands and its usage examples.

make

About the development tools

I've used poetry to manage the project's dependencies. If you want to install it in your local machine, please run:

make install-poetry

And then run:

make install-project

Then you'll have all the dependencies installed, and a virtual environment created in this very directory. This is useful, for example, if you're using VS Code and want to explore the code. Also, you might want to use pyenv to install Python 3.10.12.

All the code in this project has been linted and formatted with these tools:

Just cloned the repo and want to play around with the pre-commit framework? Then run:

make nox-hooks

Curious about nox? Check out the nox docs.