This is a basic ETL pipeline that loads transaction data into a data warehouse, implemented in a Jupyter notebook.
- transactions-etl.ipynb: the pipeline itself. It populates the data warehouse from the transaction file and contains the extraction, transformation, and loading stages
- transactions-task-tables: simple queries to view the tables
- transactions-task-reports: a data warehouse usage example
In the model folder, you will find a star schema design for the data warehouse, with two fact tables and their dimensions.
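To make the three stages and the star schema layout more concrete, here is a minimal sketch. It is not the notebook's code: it uses only the standard library, with an in-memory SQLite database standing in for Postgres, and a single fact table for brevity. The sample data, the column names (customer, product, amount) and the table names (dim_customer, dim_product, fact_sales) are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample of a transactions file (the real file's columns may differ).
RAW = """customer,product,amount
alice,widget,10.50
bob,gadget,20.00
alice,gadget,5.25
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: assign surrogate keys to dimension values and build fact rows."""
    customers, products, facts = {}, {}, []
    for row in rows:
        # setdefault gives each new value the next integer key, starting at 1.
        customer_id = customers.setdefault(row["customer"], len(customers) + 1)
        product_id = products.setdefault(row["product"], len(products) + 1)
        facts.append((customer_id, product_id, float(row["amount"])))
    return customers, products, facts


def load(conn, customers, products, facts):
    """Load: create the dimension and fact tables and insert the rows."""
    conn.executescript(
        """
        CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_sales (
            customer_id INTEGER REFERENCES dim_customer(customer_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            amount REAL
        );
        """
    )
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                     [(key, name) for name, key in customers.items()])
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                     [(key, name) for name, key in products.items()])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", facts)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the Postgres warehouse
load(conn, *transform(extract(RAW)))

# Example report: total amount per customer, joining the fact table to a dimension.
totals = conn.execute(
    """
    SELECT c.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c USING (customer_id)
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

The fact table holds only keys and measures, while descriptive values live in the dimension tables, so a report is a join plus an aggregate.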
- Python3 (pandas, Jupyter notebook)
- Postgres
- Docker / Postgres installation
- Python3
- pip package manager
- Python dependencies. All Python dependencies can be installed from requirements.txt:
pip install -r requirements.txt
- Postgres install. Postgres can be run from a container, for example:
docker pull postgres
docker run --name postgresql -e POSTGRES_USER=myusername -e POSTGRES_PASSWORD=mypassword -p 5432:5432 -v /data:/var/lib/postgresql/data -d postgres
- From the project root, open Jupyter Notebook:
jupyter notebook
- Update the connection string URL from:
connection = sqlalchemy.create_engine(f'postgresql+psycopg2://[email protected]:5432/postgres')
to your connection string
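One way to avoid hard-coding credentials in the notebook is to assemble the URL from environment variables. The sketch below is an assumption, not part of the project: the helper build_postgres_url and the variable names (PG_USER, PG_PASSWORD, PG_HOST, PG_PORT, PG_DB) are hypothetical, and the defaults simply mirror the docker run example above.

```python
import os
from urllib.parse import quote


def build_postgres_url(env=os.environ):
    """Assemble a postgresql+psycopg2 URL from (hypothetical) environment variables."""
    user = env.get("PG_USER", "myusername")
    password = env.get("PG_PASSWORD", "mypassword")
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    database = env.get("PG_DB", "postgres")
    # Percent-encode the credentials so characters like '@' or ':' do not break the URL.
    return (
        f"postgresql+psycopg2://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


url = build_postgres_url()
# In the notebook, this string would replace the hard-coded one:
# connection = sqlalchemy.create_engine(url)
```

Any variable that is not set falls back to its default, so with the container started as shown earlier the function should work without extra configuration.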
- Start with the transactions-etl.ipynb notebook, which creates the data.
I think that's it!