Commit
update for 2023 data
jaanli committed Oct 17, 2024
1 parent c5aef2f commit ee3de8e
Showing 112 changed files with 271,428 additions and 22 deletions.
24 changes: 11 additions & 13 deletions README.md
@@ -121,25 +121,25 @@ For debugging, the `duckdb` command line tool is available on homebrew:
brew install duckdb
```

-## Usage for 2022 ACS Public Use Microdata Sample (PUMS) Data
+## Usage for 2023 ACS Public Use Microdata Sample (PUMS) Data

To retrieve the list of URLs for all of the 50 states' PUMS files from the Census Bureau's server, run the following:

```
cd data_processing
-dbt run --select "public_use_microdata_sample.list_urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}'
+dbt run --select "public_use_microdata_sample.list_urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}'
```

Then save the URLs:

```
-dbt run --select "public_use_microdata_sample.urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.urls" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```

Then execute the dbt model that downloads and extracts the archives of the microdata (takes ~2 min on a MacBook):

```
-dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```
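The `--threads 8` in the commands above is a fixed choice; a minimal sketch of deriving the thread count from the cores actually available (the cap of 8 is an assumption, not from the repo):

```python
import os

# Sketch only: pick a dbt thread count from the available cores,
# capped at the 8 used in the commands above.
threads = min(8, os.cpu_count() or 1)
flag = f"--threads {threads}"
print(flag)
```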

Then generate the CSV paths:
Expand Down Expand Up @@ -259,26 +259,26 @@ duckdb -c "SELECT * FROM '~/data/american_community_survey/public_use_microdata_
```
2. Download and extract the archives for all of the 50 states' PUMS files (takes about 30 seconds on a gigabit connection):
```
-dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.download_and_extract_archives" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```
Save the paths to the CSV files:
```
-dbt run --select "public_use_microdata_sample.public_use_microdata_sample_csv_paths" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.csv_paths" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```
Check that the CSV files are present:
```
-duckdb -c "SELECT * FROM '~/data/american_community_survey/public_use_microdata_sample_csv_paths.parquet'"
+duckdb -c "SELECT * FROM '~/data/american_community_survey/csv_paths.parquet'"
```

2. Parse the data dictionary:

```bash
-python scripts/parse_data_dictionary.py https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv
+python scripts/parse_data_dictionary.py https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv
```

Then:
```
-dbt run --select "public_use_microdata_sample.public_use_microdata_sample_data_dictionary_path" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.data_dictionary_path" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```
Check that the data dictionary path is displayed correctly:
```
@@ -287,13 +287,11 @@ duckdb -c "SELECT * FROM '~/data/american_community_survey/public_use_microdata_

1. Generate the SQL commands needed to map every state's individual people or housing unit variables to easier-to-use (and more readable) names:
```
-python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py \
-  ~/data/american_community_survey/public_use_microdata_sample_csv_paths.parquet \
-  ~/data/american_community_survey/PUMS_Data_Dictionary_2021.json
+python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py ~/data/american_community_survey/csv_paths.parquet PUMS_Data_Dictionary_2023.json
```
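For orientation, the kind of enum-mapping SQL that generation step produces can be sketched as follows; the `SEX` dictionary entry and the helper below are illustrative only, not the script's actual output (the real script derives everything from the parsed PUMS data dictionary JSON):

```python
# Illustrative sketch: turn a data-dictionary entry (coded values -> labels)
# into a CASE expression that also renames the column.
data_dictionary = {"SEX": {"name": "sex", "values": {"1": "Male", "2": "Female"}}}

def render_mapped_column(code: str) -> str:
    entry = data_dictionary[code]
    cases = " ".join(f"WHEN '{k}' THEN '{v}'" for k, v in entry["values"].items())
    return f'CASE {code} {cases} END AS "{entry["name"]}"'

sql = render_mapped_column("SEX")
print(sql)  # → CASE SEX WHEN '1' THEN 'Male' WHEN '2' THEN 'Female' END AS "sex"
```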
1. Execute these generated SQL queries using 8 threads (adjust this number to match the processor cores available on your system):
```
-dbt run --select "public_use_microdata_sample.generated.2021+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' --threads 8
+dbt run --select "public_use_microdata_sample.generated.2023+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2023.csv", "output_path": "~/data/american_community_survey"}' --threads 8
```
1. **Test** that the compressed parquet files are present and have the expected size:
```
@@ -25,11 +25,12 @@ def model(dbt, session):
    base_url = dbt.config.get('public_use_microdata_sample_url')  # Assuming this is correctly set

    # Fetch URLs from your table or view
-    query = "SELECT * FROM list_urls "
-    result = session.execute(query).fetchall()
-    columns = [desc[0] for desc in session.description]
-    url_df = pd.DataFrame(result, columns=columns)
-
+    # query = "SELECT * FROM ref('list_urls')"
+    # result = session.execute(query).fetchall()
+    # columns = [desc[0] for desc in session.description]
+    # url_df = pd.DataFrame(result, columns=columns)
+    # Load from the parquet file in ~/data/american_community_survey/urls.parquet
+    url_df = pd.read_parquet('~/data/american_community_survey/urls.parquet')
    # Determine the base directory for data storage
    base_path = os.path.expanduser(dbt.config.get('output_path'))
    base_dir = os.path.join(base_path, f'{base_url.rstrip("/").split("/")[-2]}/{base_url.rstrip("/").split("/")[-1]}')
@@ -50,4 +51,6 @@ def model(dbt, session):
    paths_df = pd.DataFrame(extracted_files, columns=['csv_path'])

    # Return the DataFrame with paths to the extracted CSV files
+    # Save the paths to a parquet file
+    paths_df.to_parquet('~/data/american_community_survey/csv_paths.parquet', index=False)
    return paths_df
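The `base_dir` construction in the diff above nests downloads under a `<year>/<period>` subdirectory taken from the end of the PUMS URL; a standalone sketch of that string logic (values illustrative):

```python
import os

# Sketch of the base_dir derivation shown in the diff above: the last two
# URL path segments (year and period) become the download subdirectory.
base_url = "https://www2.census.gov/programs-surveys/acs/data/pums/2023/1-Year/"
output_path = "~/data/american_community_survey"  # matches the --vars output_path

parts = base_url.rstrip("/").split("/")
year_period = f"{parts[-2]}/{parts[-1]}"
base_dir = os.path.join(os.path.expanduser(output_path), year_period)
print(year_period)  # → 2023/1-Year
```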
