
Commit

fix: paths
jaanphare committed Apr 10, 2024
1 parent 11afd12 commit 2d6ed41
Showing 2 changed files with 18 additions and 18 deletions.
26 changes: 13 additions & 13 deletions README.md
@@ -93,14 +93,14 @@ Example plot of this data: https://s13.gifyu.com/images/SCGH2.gif (code here: ht

Example visualization: live demo here - https://jaanli.github.io/american-community-survey/ (visualization code [here](https://github.com/jaanli/american-community-survey/))

-![image](https://github.com/jaanli/exploring_american_community_survey_data/assets/5317244/0428e121-c4ec-4a97-826f-d3f944bc7bf2)
+![image](https://github.com/jaanli/exploring_data_processing_data/assets/5317244/0428e121-c4ec-4a97-826f-d3f944bc7bf2)

## Requirements

Clone the repo; create and activate a virtual environment:
```
-git clone https://github.com/jaanli/exploring_american_community_survey_data.git
-cd exploring_american_community_survey_data
+git clone https://github.com/jaanli/american-community-survey.git
+cd american-community-survey
python3 -m venv .venv
source .venv/bin/activate
```
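The setup above can be sketched with the standard-library `venv` module, which also shows where the `activate` script actually lives (inside `.venv/bin/` on POSIX systems); the temporary directory is only there to make the sketch self-contained:

```python
import pathlib
import tempfile
import venv

# Create a virtual environment the same way `python3 -m venv .venv` does,
# here inside a temporary directory so the example leaves no files behind.
project_dir = pathlib.Path(tempfile.mkdtemp())
env_dir = project_dir / ".venv"
venv.create(env_dir, with_pip=False)

# On POSIX the activation script is .venv/bin/activate,
# i.e. the shell step is `source .venv/bin/activate`.
print((env_dir / "bin" / "activate").exists())
```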
@@ -124,7 +124,7 @@ brew install duckdb

To retrieve the list of URLs from the Census Bureau's server and download and extract the archives for all of the 50 states' PUMS files, run the following:
```
-cd american_community_survey
+cd data_processing
dbt run --exclude "public_use_microdata_sample.generated+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}'
```
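The long inline `--vars` JSON is easy to mangle by hand; a minimal sketch of assembling it programmatically (URLs and `output_path` copied from the command above, variable names otherwise illustrative):

```python
import json

# Variables passed to dbt via --vars, matching the command above.
dbt_vars = {
    "public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/",
    "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv",
    "output_path": "~/data/american_community_survey",
}

# json.dumps guarantees the shell argument is well-formed JSON.
command = (
    'dbt run --exclude "public_use_microdata_sample.generated+" '
    f"--vars '{json.dumps(dbt_vars)}'"
)
print(command)
```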

@@ -144,7 +144,7 @@ dbt run --select "public_use_microdata_sample.generated+" --vars '{"public_use_m
Inspect the output folder to see what has been created in the `output_path` specified in the previous command:
```
❯ tree -hF -I '*.pdf' ~/data/american_community_survey
-[ 224] /Users/me/data/american_community_survey/
+[ 224] /Users/me/data/data_processing/
├── [ 128] 2022/
│ └── [3.4K] 1-Year/
│ ├── [ 128] csv_hak/
@@ -169,7 +169,7 @@ To see the size of the csv output:

```
❯ du -sh ~/data/american_community_survey/2022
-6.4G /Users/me/data/american_community_survey/2022
+6.4G /Users/me/data/data_processing/2022
```

And the compressed representation size:
@@ -278,12 +278,12 @@ Check that you can execute a SQL query against these files:
```
duckdb -c "SELECT COUNT(*) FROM '~/data/american_community_survey/*individual_people_united_states*2021.parquet'"
```
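The glob in the duckdb query above can be sanity-checked from Python before running SQL against it; the helper name and directory layout are assumptions:

```python
from pathlib import Path

def find_parquet_files(base_dir, pattern="*individual_people_united_states*2021.parquet"):
    """Return the parquet files matching the same glob the duckdb query uses."""
    return sorted(Path(base_dir).expanduser().glob(pattern))

# Lists whatever matches under the output_path used earlier.
for path in find_parquet_files("~/data/american_community_survey"):
    print(path)
```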
-6. Create a data visualization using the compressed parquet files by adding to the `american_community_survey/models/public_use_microdata_sample/figures` directory, and using examples from here https://github.com/jaanli/american-community-survey/ or here https://github.com/jaanli/lonboard/blob/example-american-community-survey/examples/american-community-survey.ipynb
+6. Create a data visualization using the compressed parquet files by adding to the `data_processing/models/public_use_microdata_sample/figures` directory, and using examples from here https://github.com/jaanli/american-community-survey/ or here https://github.com/jaanli/lonboard/blob/example-american-community-survey/examples/american-community-survey.ipynb

-To save time, there is a bash script with these steps in `scripts/process_one_year_of_american_community_survey_data.sh` that can be used as follows:
+To save time, there is a bash script with these steps in `scripts/process_one_year_of_data_processing_data.sh` that can be used as follows:
```
-chmod a+x scripts/process_one_year_of_american_community_survey_data.sh
-./scripts/process_one_year_of_american_community_survey_data.sh 2021
+chmod a+x scripts/process_one_year_of_data_processing_data.sh
+./scripts/process_one_year_of_data_processing_data.sh 2021
```

The argument specifies the year to be downloaded, transformed, compressed, and saved. It takes about 5 minutes per year of data.
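Processing several years is then just a loop over the same script; a hypothetical driver (script path as renamed in this commit, years illustrative):

```python
# Build the per-year invocations of the one-year processing script.
def commands_for_years(years, script="./scripts/process_one_year_of_data_processing_data.sh"):
    return [[script, str(year)] for year in years]

for cmd in commands_for_years([2019, 2020, 2021]):
    # Run each with e.g. subprocess.run(cmd, check=True); budget ~5 minutes per year.
    print(" ".join(cmd))
```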
@@ -564,7 +564,7 @@ dbt run --select "public_use_microdata_sample.microdata_area_shapefile_paths"
```
5. Check that the paths are correct:
```
-❯ duckdb -c "SELECT * FROM '/Users/me/data/american_community_survey/microdata_area_shapefile_paths.parquet';"
+❯ duckdb -c "SELECT * FROM '/Users/me/data/data_processing/microdata_area_shapefile_paths.parquet';"
```
Displays:

@@ -573,11 +573,11 @@
│ shp_path │
│ varchar │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
-│ /Users/me/data/american_community_survey/PUMA5/2010/tl_2010_02_puma10/tl_2010_02_puma10.shp │
+│ /Users/me/data/data_processing/PUMA5/2010/tl_2010_02_puma10/tl_2010_02_puma10.shp           │
│ · │
│ · │
│ · │
-│ /Users/me/data/american_community_survey/PUMA5/2010/tl_2010_48_puma10/tl_2010_48_puma10.shp │
+│ /Users/me/data/data_processing/PUMA5/2010/tl_2010_48_puma10/tl_2010_48_puma10.shp           │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│ 54 rows (40 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
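The paths in the table follow the TIGER naming scheme `tl_2010_<state FIPS>_puma10`; a small sketch of pulling the state FIPS code out of such a path (function name is illustrative):

```python
import re

def state_fips(shp_path):
    """Extract the two-digit state FIPS code from a TIGER PUMA shapefile path."""
    match = re.search(r"tl_2010_(\d{2})_puma10", shp_path)
    return match.group(1) if match else None

path = "/Users/me/data/american_community_survey/PUMA5/2010/tl_2010_02_puma10/tl_2010_02_puma10.shp"
print(state_fips(path))  # → 02
```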
10 changes: 5 additions & 5 deletions data_processing/dbt_project.yml
@@ -1,12 +1,12 @@
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
-name: "american_community_survey"
+name: "data_processing"
version: "1.0.0"
config-version: 2

# This setting configures which "profile" dbt uses for this project.
-profile: "american_community_survey"
+profile: "data_processing"

# Variables that can be changed from the command line using the `--vars` flag:
# example: dbt run --vars 'my_variable: my_value'
@@ -28,8 +28,8 @@ macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets: # directories to be removed by `dbt clean`
-  - "target"
-  - "dbt_packages"
+  - "target"
+  - "dbt_packages"

# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
@@ -38,7 +38,7 @@ clean-targets: # directories to be removed by `dbt clean`
# directory as views. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
-  american_community_survey:
+  data_processing:
# Config indicated by + and applies to all files under models/example/
# example:
+materialized: view
