This project implements an MLOps framework built on ZenML that automates the real-time processing of stock data. The pipeline ingests, processes, and stores financial data, applying data transformations, handling missing values, and performing feature engineering.
The pipeline is composed of several steps:
- Fetch Data: Retrieve stock data for a given ticker symbol.
- Ingest Data: Process the raw data and prepare it for the downstream steps.
- Handle Missing Values: Identify and handle any missing values in the stock data.
- Feature Engineering: Derive new features from the stock data (a sketch of these transformations follows the list).
- Store Processed Data: Store the processed data in a PostgreSQL database.
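The actual transformations live in the src folder; as a rough illustration of what the missing-value and feature-engineering steps might look like (the function names, column names, and window sizes below are assumptions for the sketch, not the repository's real code, and presume OHLCV data in a pandas DataFrame):

```python
import pandas as pd

def handle_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """Forward-fill price gaps, then drop rows that are still incomplete."""
    df = df.sort_index()
    price_cols = ["Open", "High", "Low", "Close"]
    df[price_cols] = df[price_cols].ffill()
    df["Volume"] = df["Volume"].fillna(0)
    return df.dropna()

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add simple rolling-window features derived from the close price."""
    df["return_1d"] = df["Close"].pct_change()
    df["sma_20"] = df["Close"].rolling(window=20).mean()
    df["volatility_20"] = df["return_1d"].rolling(window=20).std()
    return df
```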
Requirements:
- Python 3.x
- ZenML
- PostgreSQL (for storing processed data)
- A virtual environment (for managing dependencies)
Clone the repository and install the necessary dependencies:
git clone https://github.com/AyanArshad02/HFT_RealTime_DataPipeline.git
cd HFT_RealTime_DataPipeline
pip install -r requirements.txt
If you're using Conda, create and activate a virtual environment first:
conda create --name myenv python=3.8
conda activate myenv
Configure PostgreSQL with the necessary parameters to store processed stock data. You can modify the database connection settings in utils/config.py.
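The repository's actual settings layout isn't reproduced here; a minimal sketch of what utils/config.py might contain (all names and values below are placeholders, assuming a SQLAlchemy-style connection string):

```python
# utils/config.py -- hypothetical layout; replace the values with your own.
POSTGRES = {
    "host": "localhost",
    "port": 5432,
    "dbname": "stock_data",
    "user": "postgres",
    "password": "change-me",  # prefer reading this from an environment variable
}

def connection_url() -> str:
    """Build a SQLAlchemy-style PostgreSQL connection URL from the settings."""
    return (
        f"postgresql://{POSTGRES['user']}:{POSTGRES['password']}"
        f"@{POSTGRES['host']}:{POSTGRES['port']}/{POSTGRES['dbname']}"
    )
```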
To run the pipeline manually:
python pipelines/run_pipeline.py
This will execute the entire data pipeline, from data ingestion to storing the processed data.
To automate the pipeline to run daily at 10 PM, use the provided setup_daily_pipeline.sh script. It will set up a cron job for you:
bash setup_daily_pipeline.sh
The cron job will run the pipeline every day at 10 PM, using the specified virtual environment. Logs will be stored in pipeline_cronjob.log.
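For reference, the installed crontab entry should be equivalent to something like the following (the paths are illustrative; the script fills in your actual virtual environment and project location). The schedule field `0 22 * * *` means 22:00 every day:

```
0 22 * * * /path/to/venv/bin/python /path/to/HFT_RealTime_DataPipeline/pipelines/run_pipeline.py >> /path/to/HFT_RealTime_DataPipeline/pipeline_cronjob.log 2>&1
```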
- src: Contains the core functionality, including data ingestion, feature engineering, and data storage.
- steps: Defines individual steps in the ZenML pipeline.
- pipelines: Contains the pipeline definition and logic for the data pipeline.
- setup.py: For packaging the project.
- setup_daily_pipeline.sh: A script to schedule the pipeline to run daily.
- tests: Unit tests covering the modules in the src folder.
ZenML is used to manage and orchestrate the steps of the pipeline. Each step is defined using the @step decorator, and the entire pipeline is orchestrated using the @pipeline decorator.
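A condensed sketch of that pattern is shown below. The step bodies are stand-ins rather than the repository's real implementations, and the top-level imports assume a recent ZenML release where both decorators are importable from the `zenml` package:

```python
import pandas as pd
from zenml import pipeline, step

@step
def fetch_data(ticker: str) -> pd.DataFrame:
    # Stand-in: the real step pulls live market data for `ticker`.
    return pd.DataFrame({"Close": [100.0, None, 101.5, 102.0]})

@step
def handle_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    # Forward-fill gaps, then drop anything still missing.
    return df.ffill().dropna()

@step
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    # Derive a simple daily-return feature from the close price.
    df["return_1d"] = df["Close"].pct_change()
    return df

@step
def store_data(df: pd.DataFrame) -> None:
    # Stand-in: the real step writes the frame to PostgreSQL.
    print(df.tail())

@pipeline
def stock_data_pipeline(ticker: str = "AAPL"):
    raw = fetch_data(ticker)
    clean = handle_missing_values(raw)
    features = engineer_features(clean)
    store_data(features)

if __name__ == "__main__":
    stock_data_pipeline()
```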
This pipeline allows for automated, real-time processing of stock data, leveraging ZenML for MLOps workflow automation. You can easily extend the pipeline by adding more steps or integrating additional data sources.