
Fake Streaming Pipeline

This project is designed to simulate a real-time data pipeline.

It generates fake data in real time, modeled on Spotify users and songs, sends the data as a stream through a Kafka server, processes it in real time with PySpark (Structured Streaming), writes the processed data to a Cassandra database, and visualizes the results in Grafana.

The simulation includes data for the user (such as name, address, age, and nationality), the song being played (including artist, album name, track name, track duration, and genre), and the time of play.
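As a rough illustration of the record described above, the sketch below models one play event with dataclasses and serializes it to JSON bytes, the kind of payload that could be sent as a Kafka message value. All class names, field names, and the choice of JSON are assumptions made for this example; the project's actual schema may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class User:
    # Hypothetical field names mirroring the user attributes listed above.
    name: str
    address: str
    age: int
    nationality: str


@dataclass
class Song:
    # Hypothetical field names mirroring the song attributes listed above.
    artist: str
    album_name: str
    track_name: str
    duration_ms: int
    genre: str


@dataclass
class PlayEvent:
    user: User
    song: Song
    played_at: str  # time of play as an ISO 8601 string (an assumption)


def encode_event(event: PlayEvent) -> bytes:
    """Serialize an event to UTF-8 JSON bytes."""
    # asdict() converts nested dataclasses into nested dictionaries.
    return json.dumps(asdict(event)).encode("utf-8")


def decode_event(payload: bytes) -> dict:
    """Parse UTF-8 JSON bytes back into a nested dictionary."""
    return json.loads(payload.decode("utf-8"))
```

Keeping the user and song nested makes it straightforward for a downstream consumer to select only the fields it needs.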

The data processing includes selecting certain fields, transforming some of the data, and computing the average age of listeners for each artist.

Structure

diagram

Why Fake Data?

In a real-world scenario, you would likely have a consistent stream of data coming from a source such as a web app, IoT devices, or logs. However, setting up such a source for the purpose of demonstrating or testing a data pipeline can be challenging, especially when you want a large volume of data. You might not have access to enough real data, or there could be privacy issues with using real user data.

This is where Faker comes in. Faker is a Python library that generates fake data mimicking a variety of real-world data types. With it, you can easily generate a large volume of realistic-looking data for your data pipeline.

Benefits

  • Control over the volume of data

    You can generate as much or as little data as you need, simply by running the Faker function more or fewer times.

  • Privacy

    Since the data is all fake, there are no privacy concerns.

  • Variety

    Faker can generate many data types (names, addresses, dates, and more), so you can simulate a wide range of real-world scenarios.

  • Consistency

    The data generated by Faker is consistent in format, making it easier to process.
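The benefits above can be sketched with a small generator. This is a minimal example, not the project's own code: the function names `fake_listeners` and `main`, the dictionary keys, and the age range of 18 to 80 are assumptions chosen for illustration. The Faker calls used (`name`, `address`, `country`, `random_int`, and `Faker.seed`) are part of the library's standard API.

```python
from typing import Any, Dict, Iterator


def fake_listeners(fake: Any, count: int) -> Iterator[Dict[str, Any]]:
    """Yield `count` fake listener records from a Faker-like generator.

    `count` controls the volume of data; every record has the same keys,
    which keeps the format consistent for downstream processing.
    """
    for _ in range(count):
        yield {
            "name": fake.name(),
            "address": fake.address(),
            "age": fake.random_int(min=18, max=80),  # assumed age range
            "nationality": fake.country(),
        }


def main() -> None:
    # Imported here so the generator above can also be used with a stub.
    from faker import Faker  # third-party package: Faker

    Faker.seed(42)  # seeding makes the fake data reproducible between runs
    fake = Faker()
    for listener in fake_listeners(fake, count=3):
        print(listener)
```

Calling `main()` prints three fake listener records; raising `count` produces as much data as the pipeline needs.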

Getting Started

This project is set up to run in a dockerized environment, making it easy to get up and running.

Prerequisites

Before starting, ensure you have the following installed:

  • Python 3.10
  • Pipenv
  • Docker and Docker Compose

Installation

  1. Clone the repository and change into the project directory

    git clone https://github.com/OZOOOOOH/Fake-Streaming-Pipeline.git
    
    cd Fake-Streaming-Pipeline
  2. Set up Python virtual environment

    2-1. Install all necessary dependencies

    pipenv install

    2-2. Activate the new virtual environment

    pipenv shell
  3. Start Docker containers

    docker-compose up -d
  4. Run the application

    python src/main.py

Usage

Once you have completed the steps above, you can monitor the Kafka server and the Cassandra database to see the data in real time. You can also query the Cassandra database to analyze the processed data.

How to

To monitor the Kafka server

diagram

To watch the Grafana dashboard

diagram

Data Processing Details

The SpotifyStreamingProcessor class in process.py reads the data from the Kafka stream, processes it, and writes it to Cassandra.

The processing involves selecting certain fields, transforming some of the data, and computing the average listener age for each artist. The processed data and the average age of listeners by artist are then written to Cassandra.
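The average-age aggregation can be illustrated in plain Python. This sketch is not the PySpark code in process.py; it only shows the underlying idea of keeping a running sum and count per artist and updating them as each batch of events arrives. The class name `AverageAgeByArtist` and the event keys `artist` and `age` are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable


class AverageAgeByArtist:
    """Keep a running average listener age per artist across batches."""

    def __init__(self) -> None:
        self._age_sum: Dict[str, int] = defaultdict(int)
        self._plays: Dict[str, int] = defaultdict(int)

    def update(self, events: Iterable[dict]) -> None:
        """Fold one batch of play events into the running totals."""
        for event in events:
            artist = event["artist"]  # assumed key
            self._age_sum[artist] += event["age"]  # assumed key
            self._plays[artist] += 1

    def averages(self) -> Dict[str, float]:
        """Return the current average listener age for every artist seen."""
        return {
            artist: self._age_sum[artist] / self._plays[artist]
            for artist in self._age_sum
        }
```

Storing the sum and the count rather than the average itself lets each new batch update the result without revisiting earlier events, which is the same reason streaming engines keep aggregation state between micro-batches.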

Roadmap

  • Add batch processing pipelines.
  • Enhance the structure of the application to handle larger volumes of data.
  • Adapt the application to run on cloud services such as AWS.
  • Extend the application to handle data from multiple sources.
  • Monitor the Cassandra and Kafka containers using Grafana.

Acknowledgments

Projects

Libraries

Dataset