An end-to-end data pipeline built with Python, Prefect, GCP, and dbt Cloud.
- Python
- Prefect
- GCP
- dbt Cloud
- Terraform
Set up a Python venv and install the required packages:
- Python packages: install the required packages from the requirements file (see the sketch after this list).
- Configure the gcloud SDK on your machine and set up Terraform.
- Create a terraform directory to hold the Terraform files.
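A minimal sketch of the environment setup, assuming a `requirements.txt` at the repo root:

```bash
# create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate

# install the project's Python dependencies
pip install -r requirements.txt
```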
```bash
# download and set up terraform
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform
```
```bash
# set credentials for GCP
export GOOGLE_APPLICATION_CREDENTIALS="yourkeys.json"

# refresh the token/session and verify authentication
gcloud auth application-default login
```

```bash
# initialize the state file (.tfstate)
terraform init

# check changes against the new infra plan
terraform plan -var="project=<your-gcp-project-id>"

# create the new infra
terraform apply -var="project=<your-gcp-project-id>"
```
- Configure variables.tf and main.tf before running the plan and apply commands (a sketch of both files follows).
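A minimal sketch of the two Terraform files, assuming the pipeline provisions a GCS data-lake bucket and a BigQuery dataset; the resource names, bucket name, and dataset ID here are illustrative, not the project's actual values:

```hcl
# variables.tf -- input variables for the deployment
variable "project" {
  description = "Your GCP project ID"
}

variable "region" {
  description = "Region for GCP resources"
  default     = "us-central1"
}

# main.tf -- provider plus the storage and warehouse resources
provider "google" {
  project = var.project
  region  = var.region
}

# data lake bucket the flows write raw files to (hypothetical name)
resource "google_storage_bucket" "data_lake" {
  name     = "${var.project}-data-lake"
  location = var.region
}

# BigQuery dataset the flows load into (hypothetical dataset ID)
resource "google_bigquery_dataset" "raw" {
  dataset_id = "raw_data"
  location   = var.region
}
```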
- Create a prefect directory and add flows and blocks folders inside it.
- Start the Prefect server.
- Set up Kaggle API access.
```bash
# set up kaggle API access: the kaggle CLI looks for the token in ~/.kaggle
mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json  # restrict permissions, as the kaggle CLI recommends
```
Run the flows to ingest the data and load it into BigQuery:
```bash
# start the prefect (orion) server
prefect server start

# register your custom block
prefect block register --file my_block.py

# run your flows
python3 ./prefect/flows/data_to_gcs.py
python3 ./prefect/flows/gcs_to_bq.py
```
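For orientation, a minimal sketch of what a flow like data_to_gcs.py might contain, assuming Prefect 2.x with the prefect-gcp collection, a GcsBucket block registered as "gcs-bucket", and a local CSV at data/mentalhealth.csv (the block name and file path are illustrative):

```python
from pathlib import Path

import pandas as pd
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket


@task(retries=3)
def fetch(csv_path: Path) -> pd.DataFrame:
    """Read the raw Kaggle CSV into a DataFrame."""
    return pd.read_csv(csv_path)


@task
def write_gcs(df: pd.DataFrame, path: Path) -> None:
    """Write the data locally as parquet, then upload it to the GCS bucket."""
    df.to_parquet(path, compression="gzip")
    gcs_block = GcsBucket.load("gcs-bucket")  # hypothetical block name
    gcs_block.upload_from_path(from_path=path, to_path=str(path))


@flow
def data_to_gcs() -> None:
    """Ingest the mental health CSV and land it in the data lake."""
    csv_path = Path("data/mentalhealth.csv")  # hypothetical file path
    df = fetch(csv_path)
    write_gcs(df, csv_path.with_suffix(".parquet"))


if __name__ == "__main__":
    data_to_gcs()
```

gcs_to_bq.py would follow the same pattern in reverse: pull the parquet file down from the bucket and load it into the BigQuery dataset.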
- Set up a dbt Cloud account.
- Clone your GitHub repo and initialize the dbt project.
- Set up database credentials in dbt Cloud for BigQuery.
- Create and configure dbt_project.yml, macros, and models accordingly (sketches of each follow).
- Run the builds in development mode.
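A minimal sketch of dbt_project.yml, assuming staging models are materialized as views and core models as tables; the project name, profile, and folder layout are illustrative:

```yaml
name: 'mentalhealth'          # hypothetical project name
version: '1.0.0'
config-version: 2

profile: 'default'

model-paths: ["models"]
macro-paths: ["macros"]

models:
  mentalhealth:
    staging:
      +materialized: view     # development models
    core:
      +materialized: table    # production models
```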
Example of a macro, get_gender_properties.sql:
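The actual macro body isn't reproduced in this section; a plausible sketch, assuming it normalizes the survey's free-text gender column into a small set of labels:

```sql
-- macros/get_gender_properties.sql (illustrative body)
{% macro get_gender_properties(gender) -%}
    case
        when lower({{ gender }}) in ('m', 'male', 'man') then 'Male'
        when lower({{ gender }}) in ('f', 'female', 'woman') then 'Female'
        else 'Other'
    end
{%- endmacro %}
```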
Example of a staging model (for development), stag_mentalhealth_data.sql:
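Again a hedged sketch rather than the project's real model, assuming a dbt source named staging and the macro above applied to the gender column; the source, table, and column names are assumptions:

```sql
-- models/staging/stag_mentalhealth_data.sql (illustrative body)
{{ config(materialized='view') }}

select
    cast(timestamp as timestamp) as survey_timestamp,
    cast(age as integer) as age,
    {{ get_gender_properties('gender') }} as gender,
    country,
    treatment
from {{ source('staging', 'mentalhealth_data') }}
```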
Example of a core model (for production), dim_employee.sql:
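A sketch of a core model built on the staging layer, assuming the dbt_utils package is installed; the surrogate-key columns and selected fields are assumptions:

```sql
-- models/core/dim_employee.sql (illustrative body)
{{ config(materialized='table') }}

select
    {{ dbt_utils.generate_surrogate_key(['survey_timestamp', 'country']) }} as employee_id,
    age,
    gender,
    country,
    treatment
from {{ ref('stag_mentalhealth_data') }}
```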
- This environment runs the jobs that load data into the production tables.
- Set up the deployment environment.
- Add job runs and schedule them.
- Looker Studio is used to build the dashboard for the analysis.
- Link: Looker Studio dashboard of the mental health analysis.
- IMPORTANT: configure the deployment environment with the correct target database/dataset for BigQuery.
You can customize this project in the following ways:
- Run the flows in Prefect Cloud.
- Enhance the deployment by adding triggers (e.g. on pull request).
Contributions are welcome! If you have any ideas, improvements, or bug fixes, please open an issue or submit a pull request.
This project is licensed under the MIT License.