trannhatnguyen2/NYC_Taxi_Data_Pipeline

In today's data-driven landscape, analyzing extensive datasets is essential for deriving business insights. Our Taxi Data Analytics application leverages Airflow, Spark, Delta Lake, Debezium, Kafka, DBT, and Great Expectations to convert raw taxi trip data into actionable intelligence.

💥 Challenge: Data Integration Issues

Initially, our operations struggled to integrate data from multiple sources. The diversity of systems and formats made it difficult to consolidate and thoroughly analyze trip data.

This fragmentation led to incomplete insights and hindered our ability to make effective, data-driven decisions. We therefore needed a robust solution to consolidate our data sources and streamline the analysis process.

📕 Table Of Contents

  • 🌟 System Architecture
  • 📁 Repository Structure
  • 🚀 Getting Started
  • 🔍 How to Guide
  • 📌 References

🌟 System Architecture

System Architecture

πŸ“ Repository Structure

.
    β”œβ”€β”€ airflow/                                    /* airflow folder including dags,.. /*
    β”œβ”€β”€ batch_processing/
    β”‚   └── datalake_to_dw.py                           /* ETL data from datalake to staging area /*
    β”œβ”€β”€ configs/                                    /* contain config files /*
    β”‚   β”œβ”€β”€ spark.yaml
    β”‚   └── datalake.yaml
    β”œβ”€β”€ data/                                       /* contain dataset /*
    β”‚   β”œβ”€β”€ 2020/
    β”‚   β”œβ”€β”€ 2021/
    β”‚   β”œβ”€β”€ 2022/
    β”‚       β”œβ”€β”€ green_tripdata_2022-01.parquet
    β”‚       β”œβ”€β”€ green_tripdata_2022-02.parquet
    β”‚       β”œβ”€β”€ green_tripdata_2022-03.parquet
    β”‚       β”œβ”€β”€ ...
    β”‚       β”œβ”€β”€ yellow_tripdata_2022-01.parquet
    β”‚       β”œβ”€β”€ yellow_tripdata_2022-02.parquet
    β”‚       β”œβ”€β”€ yellow_tripdata_2022-03.parquet
    β”‚       └── ...
    β”‚   β”œβ”€β”€ 2023/
    β”‚   └── 2024/
    β”œβ”€β”€ data_validation/                            /* validate data before loading data warehouse /*
    β”‚   β”œβ”€β”€ gx/
    β”‚       β”œβ”€β”€ checkpoints/
    β”‚       β”œβ”€β”€ expectations/
    β”‚       β”œβ”€β”€ ...
    β”‚       └── great_expections.yml
    β”‚   β”œβ”€β”€ full_flow.ipynb
    β”‚   └── reload_and_validate.ipynb
    β”œβ”€β”€ dbt_nyc/                                    /* data transformation folder /*
    β”œβ”€β”€ debezium/                                   /* CDC folder /*
    β”‚    β”œβ”€β”€ configs/
    β”‚       └──  taxi-nyc-cdc-json                           /* file config to connect between database and kafka through debezium  /*
    β”‚    └── run.sh                                     /* run create connector */
    β”œβ”€β”€ imgs/
    β”œβ”€β”€ jars/                                       /* JAR files for Spark version 3.5.1 */
    β”œβ”€β”€ scripts/
    β”‚   β”œβ”€β”€ data/
    β”‚       └── taxi_lookup.csv                             /* CSV file to look up latitude and longitude */
    β”‚   β”œβ”€β”€ extract_load.py                             /* upload data from local to 'raw' bucket (MinIO) */
    β”‚   β”œβ”€β”€ transform_data.py                           /* transform data to 'processed' bucket (MinIO) */
    β”‚   └── convert_to_delta.py                         /* convert data parquet file from 'processed' to 'delta' bucket (MinIO) */
    β”œβ”€β”€ streaming_processing/
    β”‚    β”œβ”€β”€ read_parquet_streaming.py
    β”‚    β”œβ”€β”€ schema_config.json
    β”‚    └── streaming_to_datalake.py               /* read data stream in kafka topic and write to 'raw' bucket (Minio) */
    β”œβ”€β”€ trino/
    β”‚    β”œβ”€β”€ catalog/
    β”‚       └──  datalake.properties
    β”‚    β”œβ”€β”€ etc/
    β”‚       β”œβ”€β”€ config.properties
    β”‚       β”œβ”€β”€ jvm.config
    β”‚       └── node.properties
    β”œβ”€β”€ utils/                                     /* functions /*
    β”‚    β”œβ”€β”€ create_schema.py
    β”‚    β”œβ”€β”€ create_table.py
    β”‚    β”œβ”€β”€ postgresql_client.py                       /* PostgreSQL Client: create connect, execute query, get columns in bucket /*
    β”‚    β”œβ”€β”€ helper.py
    β”‚    β”œβ”€β”€ minio_utils.py                             /* Minio Client: create connect, create bucket, list parquet files in bucket /*
    β”‚    β”œβ”€β”€ streaming_data_json.py                     /* stream data json format into kafka */
    β”‚    β”œβ”€β”€ streaming_data_db.py                       /* stream data into database */
    β”‚    └── trino_db_scripts_generate.py
    β”œβ”€β”€ .env
    β”œβ”€β”€ .gitignore
    β”œβ”€β”€ airflow-docker-compose.yaml
    β”œβ”€β”€ docker-compose.yaml
    β”œβ”€β”€ Makefile
    β”œβ”€β”€ README.md
    β”œβ”€β”€ requirements.txt
    └── stream-docker-compose.yaml

πŸš€ Getting Started

  1. Clone the repository:

    git clone https://github.com/trannhatnguyen2/NYC_Taxi_Data_Pipeline
  2. Start all infrastructure services:

    make run_all

    This command will download the necessary Docker images, create containers, and start the services in detached mode.

  3. Set up the environment:

    conda create -n bigdata python=3.9 -y
    conda activate bigdata
    pip install -r requirements.txt

    Create and activate the conda environment, then install the required packages.

  4. Access the Services:

    • Postgres is accessible on the default port 5432.
    • Kafka Control Center is accessible at http://localhost:9021.
    • Debezium is accessible at http://localhost:8085.
    • MinIO is accessible at http://localhost:9001.
    • Airflow is accessible at http://localhost:8080.
  5. Download Dataset: You can download the dataset here: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

  6. Download JAR files for Spark:

    mkdir jars
    cd jars
    curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
    curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
    curl -O https://repo1.maven.org/maven2/org/postgresql/postgresql/42.4.3/postgresql-42.4.3.jar
    curl -O https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.12/3.2.1/spark-sql-kafka-0-10_2.12-3.2.1.jar
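
These JAR files let Spark talk to MinIO (S3A), PostgreSQL, and Kafka. As a minimal sketch (not the repository's exact code), a SparkSession wired against MinIO could be configured as follows; the endpoint and credentials are assumptions and should come from configs/datalake.yaml and .env:

    from pyspark.sql import SparkSession

    # Assumed MinIO settings -- adjust to match configs/datalake.yaml and .env
    MINIO_ENDPOINT = "http://localhost:9000"
    MINIO_ACCESS_KEY = "minio_access_key"
    MINIO_SECRET_KEY = "minio_secret_key"

    spark = (
        SparkSession.builder.appName("nyc-taxi-pipeline")
        # Pick up the JARs downloaded into jars/
        .config("spark.jars", "jars/hadoop-aws-3.3.4.jar,jars/aws-java-sdk-bundle-1.12.262.jar,"
                              "jars/postgresql-42.4.3.jar")
        # Point the S3A filesystem at MinIO instead of AWS S3
        .config("spark.hadoop.fs.s3a.endpoint", MINIO_ENDPOINT)
        .config("spark.hadoop.fs.s3a.access.key", MINIO_ACCESS_KEY)
        .config("spark.hadoop.fs.s3a.secret.key", MINIO_SECRET_KEY)
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )
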

πŸ” How to Guide

I. Batch Processing

  1. Push the data (parquet format) from local storage to the raw bucket in the data lake (MinIO):
    python scripts/extract_load.py

Pushed the data to MinIO successfully
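
Under the hood, this step is a plain MinIO upload. A minimal sketch using the minio Python client (the bucket name, endpoint, and credentials are assumptions; the repository's own MinIO helpers live in utils/minio_utils.py):

    import os
    from glob import glob
    from minio import Minio

    # Assumed connection details -- the real values live in .env / configs/datalake.yaml
    client = Minio("localhost:9000", access_key="minio_access_key",
                   secret_key="minio_secret_key", secure=False)

    # Create the 'raw' bucket if it does not exist yet
    if not client.bucket_exists("raw"):
        client.make_bucket("raw")

    # Upload every local parquet file into the 'raw' bucket, keyed by file name
    for path in glob("data/2022/*.parquet"):
        client.fput_object("raw", os.path.basename(path), path)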

  2. Process the data from the raw bucket into the processed bucket (MinIO):
    python scripts/transform_data.py

Processed the data successfully

  3. Convert the data into Delta Lake format:
    python scripts/convert_to_delta.py

Converted the data successfully
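
Conceptually, the conversion reads the parquet files and rewrites them as a Delta table. A minimal PySpark sketch of that idea (the bucket paths and delta-spark setup are illustrative assumptions, not the repository's exact code):

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.appName("processed-to-delta")
        # Delta Lake needs its SQL extension and catalog registered;
        # the S3A/MinIO settings from the earlier sketch are also required here
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Read the processed parquet files and rewrite them as a Delta table
    df = spark.read.parquet("s3a://processed/")
    df.write.format("delta").mode("overwrite").save("s3a://delta/nyc_taxi")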

  4. Create the staging and production schemas and the staging.nyc_taxi table in PostgreSQL:
    python utils/create_schema.py
    python utils/create_table.py
  5. Run Spark to read and process the data from the data lake (MinIO) and write it to the staging area:
    python batch_processing/datalake_to_dw.py

This command may take some time to complete.

Queried the data after executing Spark
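
At its core, datalake_to_dw.py reads from the data lake and writes into the staging table over JDBC. A minimal sketch of that pattern (the table path, database name, and credentials are assumptions; the real values live in .env):

    # Assumes the MinIO-configured SparkSession from the earlier sketch
    df = spark.read.format("delta").load("s3a://delta/nyc_taxi")

    (
        df.write.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/postgres")  # Postgres on its default port
        .option("dbtable", "staging.nyc_taxi")
        .option("user", "postgres")       # assumed credentials -- see .env
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .mode("overwrite")
        .save()
    )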

  6. Validate the data in the staging area:
    cd data_validation
    great_expectations init
    Y

Then, run the full_flow.ipynb notebook.

Validated the data using Great Expectations
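
The notebook builds an expectation suite against the staging data. The snippet below is only a minimal sketch using the legacy pandas API of Great Expectations (the notebook may use a newer API; the column names come from the NYC taxi schema and the thresholds are assumptions):

    import pandas as pd
    import great_expectations as ge

    # Wrap a sample of the data in a Great Expectations dataset
    df = ge.from_pandas(pd.read_parquet("data/2022/yellow_tripdata_2022-01.parquet"))

    # A few illustrative expectations -- the real suite lives under data_validation/gx/expectations/
    df.expect_column_values_to_not_be_null("tpep_pickup_datetime")
    df.expect_column_values_to_be_between("trip_distance", min_value=0, max_value=1000)
    df.expect_column_values_to_be_between("passenger_count", min_value=0, max_value=10)

    print(df.validate())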

  7. Use dbt to transform the data and create a star schema in the data warehouse:
    cd dbt_nyc

Read README.md for the next steps

Data Warehouse - Star Schema

  8. (Optional) Check the data in the Data Warehouse

II. Stream Processing

  1. Register the PostgreSQL connector with Debezium:
    cd debezium/
    bash run.sh register_connector configs/taxi-nyc-cdc.json

Created Debezium Connector successfully
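
run.sh posts the connector definition to Debezium's Kafka Connect REST API. An equivalent minimal sketch in Python (the endpoint assumes the Debezium service exposed at http://localhost:8085, as listed in Getting Started):

    import json
    import requests

    # Load the connector definition shipped with the repository
    with open("configs/taxi-nyc-cdc.json") as f:
        connector_config = json.load(f)

    # Register the connector via the Kafka Connect REST API
    resp = requests.post(
        "http://localhost:8085/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector_config),
    )
    print(resp.status_code, resp.text)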

  2. Create an empty table in PostgreSQL and insert new records into it:
    cd ..
    python utils/create_schema.py
    python utils/create_table.py
    python utils/streaming_data_db.py

Open localhost:9021 (Kafka Control Center) to check the data stream in the device.iot.taxi_nyc_time_series topic.

Data stream in `device.iot.taxi_nyc_time_series` Kafka Topic
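
streaming_data_db.py keeps inserting rows into the tracked table so that Debezium captures the changes and publishes them to Kafka. A minimal sketch of that idea with psycopg2 (the table, columns, and credentials are assumptions inferred from the topic name; the repository wraps connections in utils/postgresql_client.py):

    import random
    import time

    import psycopg2

    # Assumed connection settings -- the real values live in .env
    conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                            user="postgres", password="postgres")
    conn.autocommit = True

    with conn.cursor() as cur:
        while True:
            # Insert a synthetic record; Debezium streams it to the Kafka topic
            cur.execute(
                "INSERT INTO iot.taxi_nyc_time_series "
                "(passenger_count, trip_distance, total_amount) VALUES (%s, %s, %s)",
                (random.randint(1, 4),
                 round(random.uniform(0.5, 20.0), 2),
                 round(random.uniform(5.0, 80.0), 2)),
            )
            time.sleep(1)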

  3. Read the data stream and write it to the 'raw' bucket in MinIO:
    python streaming_processing/streaming_to_datalake.py

Data Stream stored into 'raw' bucket in MinIO
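
This script is a Spark Structured Streaming job: it subscribes to the CDC topic and continuously writes the records into the 'raw' bucket. A minimal sketch (the broker address, value parsing, and checkpoint path are assumptions; it presumes the MinIO-configured SparkSession and the spark-sql-kafka package from the setup steps):

    from pyspark.sql.functions import col

    # Read the CDC stream from Kafka (topic name taken from the step above)
    stream_df = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
        .option("subscribe", "device.iot.taxi_nyc_time_series")
        .option("startingOffsets", "earliest")
        .load()
    )

    # Keep the message value as a string; the real job parses it against schema_config.json
    parsed = stream_df.select(col("value").cast("string").alias("value"))

    # Continuously append the stream into the 'raw' bucket as parquet files
    query = (
        parsed.writeStream.format("parquet")
        .option("path", "s3a://raw/stream")
        .option("checkpointLocation", "s3a://raw/stream_checkpoint")
        .start()
    )
    query.awaitTermination()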

  4. (Optional) Read the streamed data in MinIO

After the files land in MinIO, exec into the Trino container with the following command:

docker exec -ti datalake-trino bash
trino

Then run the following statements to register a schema and an external table for the data:

    CREATE SCHEMA IF NOT EXISTS datalake.stream
    WITH (location = 's3://raw/');

    CREATE TABLE IF NOT EXISTS datalake.stream.nyc_taxi(
        VendorID                INT,
        tpep_pickup_datetime    TIMESTAMP,
        tpep_dropoff_datetime   TIMESTAMP,
        passenger_count         DOUBLE,
        trip_distance           DOUBLE,
        RatecodeID              DOUBLE,
        store_and_fwd_flag      VARCHAR,
        PULocationID            INT,
        DOLocationID            INT,
        payment_type            INT,
        fare_amount             DOUBLE,
        extra                   DOUBLE,
        mta_tax                 DOUBLE,
        tip_amount              DOUBLE,
        tolls_amount            DOUBLE,
        improvement_surcharge   DOUBLE,
        total_amount            DOUBLE,
        congestion_surcharge    DOUBLE,
        airport_fee             DOUBLE
    ) WITH (
        external_location = 's3://raw/stream',
        format = 'PARQUET'
    );
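
Once the table is registered, the streamed records can be queried from the trino CLI or programmatically. A minimal sketch with the trino Python client (the host, port, and user are assumptions; check the Trino service definition in docker-compose.yaml for the actual port):

    from trino.dbapi import connect

    # Assumed connection details for the datalake-trino container
    conn = connect(host="localhost", port=8084, user="trino",
                   catalog="datalake", schema="stream")
    cur = conn.cursor()

    # Count the rows streamed into the raw bucket so far
    cur.execute("SELECT COUNT(*) FROM nyc_taxi")
    print(cur.fetchone())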

III. Airflow - Data Orchestration

   cd airflow/

Read README.md for the next steps

Airflow Result


πŸ“Œ References

[1] NYC Taxi Trip Dataset


© 2024 NhatNguyen