Dear User,

Imagine you have a toy box where you keep all your favorite toys. A local data platform is like that toy box, but for storing and organizing important information instead of toys. Just like your toy box, a local data platform keeps all your data (pictures, documents, and other information) in one place so you can easily find, use, and manage it.
It's really handy for keeping everything organized and in one spot! 🌟📦
Got it? What else are you curious about?
Vision: Local Data Platform is a Python library for learning about and operating a data lakehouse locally.

Mission: Develop a Python package that provides solutions for every stage of data organisation, from ingestion to reporting, so that you can build a data pipeline locally, test it, and easily scale it up to the cloud.

By 2025, local-data-platform will be a Python package that uses open-source tools to orchestrate data platform operations locally for development and testing.
| Question | Answer |
|---|---|
| What? | A local data platform that can scale up to the cloud |
| Why? | Save costs on cloud infrastructure and development time |
| When? | At the start of the product development life cycle |
| Where? | Local first |
| Who? | Businesses that want a product data platform that runs locally and scales up when the time comes |
The directory structure below will help you understand how to read the repository.

Note: Users install the package; developers import the library.
- .github/: Hidden folder
  - ISSUE-TEMPLATE/: Samples
    - bug_report.md: Report bugs here
    - custom.md: Report ad hoc issues here
    - feature_request.md: Request a new feature here
  - pull_request_template.md: Raise a pull request on the repo
- docs/: Documentation for Read the Docs
- local-data-platform/: Package
  - local_data_platform/: Library
    - hello_world.py: Module
      - hello_world: Function that prints 'Hello, world!'
- samples/: Tutorials
  - bigQueryTutorial.py: Demo BigQuery compatibility here
- .gitignore: Mention files to ignore in your PR
- .readthedocs.yaml: Configuration for Read the Docs
- LICENSE: For legal purposes
- lumache.py: Template used in Sphinx projects for Read the Docs
- pyproject.toml: Template configuration
- README.md: How to understand the repo
- README.rst: Configuration for Read the Docs
- Check the directory structure: `ls`
- Change directory into the package: `cd local-data-platform`
- Install the dependencies listed in pyproject.toml: `poetry install`
- Execute the test suite to ensure everything is working as expected: `poetry run pytest`
- Run the hello world script: `poetry run python hello_world.py`
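Once `poetry install` has finished, the same check can be run from your own Python session. The import path below is an assumption based on the layout shown above; adjust it if the published module path differs:

```python
# Assumed import path: local_data_platform/hello_world.py exposes hello_world()
from local_data_platform.hello_world import hello_world

hello_world()  # expected output: Hello, world!
```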
- local-data-platform: Package
  - dist: Package distribution files
  - docs: Documentation
  - local_data_platform: Library
    - catalog: Catalog your data
      - local: Catalog your data locally
        - iceberg: Catalog your data in an Iceberg SQLite db
          - export.py: Export your catalog data to CSV
    - cloud: Interact with cloud service providers
      - gcp: Interact with Google Cloud Platform
        - login: Log in to GCP to get API credentials
    - engine: Underlying processing tech
    - format: Supported formats for storage
      - csv: Supports Google Sheets and Excel sheets
      - iceberg: Supports Apache Iceberg
      - parquet: Supports Apache Parquet
    - issue: GitHub Issues
    - pipeline: Data pipelines
      - egression: Downstream pipelines
        - csv_to_iceberg: Raw to Silver Layer
        - iceberg_to_csv: Silver Layer to Gold Layer
      - ingestion: Upstream pipelines
        - bigquery_to_csv: Source to Raw
        - csv_to_iceberg: Raw to Silver Layer
        - paraquet_to_iceberg: Raw to Silver Layer
      - scraper: HTML to CSV
    - store: Data store
      - source: Source data class
        - gcp: GCP Storage
          - bigquery: GCP service
        - json: Local JSON file
        - near: NEAR Data Lake
        - parquet: Local Parquet file
      - target: Target data class
        - iceberg: Local data lakehouse
    - etl.py: Sample pipeline
    - exceptions.py: Known limitations
    - hello_world.py: Test feature
    - is_function.py: Query library functions
    - logger.py: Library logger
  - real_world_use_cases: User test cases
    - near_data_lake: NEAR coin transactions
      - config: Pipeline configurations
        - sample_queries: NEAR Data Lake transaction table
          - near_transaction.json: Query list
        - egression.json: Loads data into the local data lakehouse
        - ingestion.json: Extracts data from the NEAR data lake
      - data: Target path
        - near_transactions.db: Local data lakehouse
          - transactions: Iceberg table
            - data: Table records
            - metadata: Iceberg table metadata
        - near_transactions_catalog.db: Iceberg local data catalog
      - reports: Production analysis
        - get_data.py: Get insights
        - put_data.py: Refresh the Gold Layer
      - near_transactions.csv: Output
    - nyc_yello_taxi_dataset: NYC Yellow Taxi rides
      - config: Pipeline configurations
        - egression.json: Loads data into the local data lakehouse
        - egression_payments.json: Loads the payments report into the Gold Layer
        - ingestion.json: Extracts data from a local Parquet file
      - data: Target path
        - nyc_yello_taxi_dataset.db: Local data lakehouse
          - rides: Iceberg table
            - data: Table records
            - metadata: Iceberg table metadata
        - nyc_yellow_taxi_dataset_catalog.db: Iceberg local data catalog
        - nyc_yellow_taxi_rides.csv: Output
      - reports: Production analysis
        - export_catalog.py: Saves the local Iceberg catalog as CSV
        - get_data.py: Creates the Gold Layer
        - get_report.py: Updates the Gold Layer
        - put_data.py: Refreshes the Gold Layer
      - monthly_reporting.md: Monthly report in Markdown
  - tests: PyTest unit testing
    - test_gcp_connection.py: Tests GCP login
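To make the config-driven layout above concrete, here is a hedged sketch of how a script like etl.py could wire an ingestion step from a JSON config. The file path follows the tree above, but the config keys (`source`, `target`) and the helper functions are illustrative placeholders, not the library's actual API:

```python
import json
from pathlib import Path

# Path follows the tree above; the config keys used below are hypothetical.
CONFIG = Path("real_world_use_cases/near_data_lake/config/ingestion.json")


def load_pipeline_config(path: Path) -> dict:
    """Read a pipeline configuration file."""
    with path.open() as handle:
        return json.load(handle)


def run_ingestion(config: dict) -> None:
    """Dispatch an ingestion step based on the configured source and target.

    In the real library this is where an ingestion pipeline such as
    local_data_platform.pipeline.ingestion would take over; the print
    below just stands in for that call.
    """
    source = config.get("source", "unknown source")
    target = config.get("target", "unknown target")
    print(f"Ingesting from {source} into {target}")


if __name__ == "__main__":
    run_ingestion(load_pipeline_config(CONFIG))
```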
| Milestone | Epic | Target Date | Delivery Date | Comment |
|---|---|---|---|---|
| 0.1.0 | HelloWorld | 1st Oct 24 | 1st Oct 24 | Good Start |
| 0.1.1 | Ingestion | 31st Oct 24 | 5th Nov 24 | First Release: Completed in 2 Sprints |
| 0.1.2 | Warehousing | 15th Nov 24 | TBD | Coming Soon |
| 0.1.3 | Orchestration | 29th Nov 24 | TBD | Coming Soon |
| 0.1.4 | Self Serving Gold Layer | 29th Nov 24 | TBD | Coming Soon |
| 0.1.5 | Monitoring | 29th Nov 24 | TBD | Coming Soon |
| 0.1.6 | BI Reporting Dashboard | 31st Dec 24 | TBD | Coming Soon |
| 0.1.7 | Data Science Insights | 31st Dec 24 | TBD | Coming Soon |
| 0.1.8 | LLM | 31st Dec 24 | TBD | Coming Soon |
| 0.1.9 | Launch Documentation | 30th Nov 24 | TBD | Coming Soon |
| 1.0.0 | Ready for Production | 1st Nov 24 | TBD | End Game |
- 0.1.0: Done - Published library on PyPI
- 0.1.1: In Progress - Demo BigQuery compatibility
- 0.1.1: Done - Documentation: updated README to clearly explain the problem and the plan of execution
- 0.1.2: To-do - Warehousing: DuckDB, Iceberg, DBT
- 0.1.3: To-do - Orchestration
- 0.1.4: To-do - Self Serving Gold Layer
- 0.1.5: To-do - Monitoring
- 0.1.6: To-do - Business Intelligence Reporting Dashboard
- 0.1.7: To-do - Data Science Insights
- 0.1.8: To-do - LLM
- 0.1.9: To-do - Launch Documentation
- 0.2.0: To-do - Cloud Integration
- 1.0.0: To-do - Product