This is an implementation of the SynPUF Extract-Transform-Load (ETL) specification designed to generate a set of CDMv5-compatible CSV files that can then be bulk-loaded into your RDBMS of choice.
- Clone the repo and build the Docker image from the provided Dockerfile
- cd into the etl directory
- Type 'python pipeline.py' to run the ETL; pipeline.py is a wrapper script that runs the three ETL steps: extract.py, transform.py, and load.py
- Enter answers to the prompts
The entire pipeline takes about 2 hours to run. If you are just testing, select only one sample to run.
The ETL process comes in a Docker container with all the necessary files and packages. To run the ETL, you need to have Docker installed on your machine; it can be downloaded from the Docker website.
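For example, a typical sequence might be 'docker build -t synpuf-etl .' followed by 'docker run -it synpuf-etl' (the synpuf-etl image tag is illustrative, and the exact run options depend on how you choose to mount your data directories). From inside the container, follow the quick-start steps above: cd into etl and run pipeline.py.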
The original ETL required Python 2.7 and the dotenv library, but the code has since been updated to be compatible with Python 3 and has been tested with Python 3.11.
The SynPUF data is divided into 20 parts (8 files per part), and the files for each part should be saved in the corresponding directories DE_1 through DE_20. They can be downloaded either with a Python utility script (extract.py) or manually, as described in the next two subsections.
In the etl folder, there is a Python program 'extract.py', which can be run to fetch one or more of the 20 SynPUF data sets. Run it as follows:
python extract.py path/to/output/directory <SAMPLE> [<SAMPLE> ...]
Where each SAMPLE is a number from 1 to 20, representing the 20 parts of the CMS data. If you wanted to obtain only samples 4 and 15, you would run:
python extract.py path/to/output/directory 4 15
To obtain all of the data, run:
python extract.py path/to/output/directory 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
OR
python extract.py path/to/output/directory all
Hyperlinks to the 20 parts can be found here: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.html
For example, for DE_1, create a directory called DE_1 and download the following files:
DE1_0_2008_Beneficiary_Summary_File_Sample_1.zip
DE1_0_2008_to_2010_Carrier_Claims_Sample_1A.zip
DE1_0_2008_to_2010_Carrier_Claims_Sample_1B.zip
DE1_0_2008_to_2010_Inpatient_Claims_Sample_1.zip
DE1_0_2008_to_2010_Outpatient_Claims_Sample_1.zip
DE1_0_2008_to_2010_Prescription_Drug_Events_Sample_1.zip
DE1_0_2009_Beneficiary_Summary_File_Sample_1.zip
DE1_0_2010_Beneficiary_Summary_File_Sample_1.zip
Note: If you are downloading the files manually from the CMS website, you need to rename 'DE1_0_2008_to_2010_Carrier_Claims_Sample_11A.csv.zip' to 'DE1_0_2008_to_2010_Carrier_Claims_Sample_11A.zip'. Also, some zipped files have a '.Copy.csv' file inside them; after unzipping, rename those files from '.Copy.csv' to '.csv'. If you use the download script, you do not have to do any of these manual steps; the script takes care of them all.
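For illustration only, those manual fix-ups could be scripted along the following lines (a rough sketch to be run inside a DE_* directory after unzipping; extract.py already does the equivalent for you):

```python
import os

# Rough sketch of the manual fix-ups described above; extract.py handles these automatically.
for name in os.listdir("."):
    if name.endswith(".csv.zip"):
        # e.g. '..._Carrier_Claims_Sample_11A.csv.zip' -> '..._Carrier_Claims_Sample_11A.zip'
        os.rename(name, name[:-len(".csv.zip")] + ".zip")
    elif name.endswith(".Copy.csv"):
        # rename extracted '...Copy.csv' members to plain '.csv'
        os.rename(name, name.replace(".Copy.csv", ".csv"))
```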
- Download vocabulary files from http://www.ohdsi.org/web/athena/, ensuring that you select, at minimum, the following vocabularies: SNOMED, ICD9CM, ICD9Proc, CPT4, HCPCS, LOINC, RxNorm, and NDC.
- Unzip the resulting .zip file to a directory.
- Because the CPT4 vocabulary is not part of the CONCEPT.csv file, it must be downloaded with the provided cpt4.jar program via:
  java -Dumls-apikey=<xxx> -jar cpt4.jar 5 <output-file-name>
  which will append the CPT4 concepts to the CONCEPT.csv file. You will need to pass in your UMLS credentials for this command to work. Only apikey and 5 are required.
- Note: This command works with Java version 10 or below.
- Note: As of 2023-02-07, these files are available in the data/BASE_OMOP_INPUT_DIRECTORY directory of the repo.
Edit the variables in the .env file, which specify the various directories used during the ETL process. The environment file in the repo has already been set up with appropriate directories in the repo's data folder. An example .env is shown after the list below.
- BASE_SYNPUF_INPUT_DIRECTORY = where the downloaded CMS directories are contained, that is, the DE_1 through DE_20 directories.
- BASE_OMOP_INPUT_DIRECTORY = the CDM v5 vocabulary directory, for example: /workspaces/practice/cms-synpuf/Data/BASE_OMOP_INPUT_DIRECTORY.
- BASE_OUTPUT_DIRECTORY = contains all of the output files after running the ETL.
- BASE_ETL_CONTROL_DIRECTORY = contains files used for auto-incrementing record numbers and keeping track of physicians and physician institutions across the 20 parts so that the separate DE_1 through DE_20 directories can be processed sequentially. These files need to be deleted if you want to restart numbering.
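For example, a .env might look like the following (the paths are illustrative; point them at the corresponding directories in your checkout):

```
# Illustrative paths; adjust to your local layout
BASE_SYNPUF_INPUT_DIRECTORY=./data/synpuf_input
BASE_OMOP_INPUT_DIRECTORY=./data/BASE_OMOP_INPUT_DIRECTORY
BASE_OUTPUT_DIRECTORY=./data/output
BASE_ETL_CONTROL_DIRECTORY=./data/etl_control
```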
To process any of the DE_1 to DE_20 folders, run:
python transform.py <sample number>
- Where <sample number> is the number of one of the samples you downloaded from CMS. For example,
  python transform.py 4
  will run the ETL on the SynPUF data in the DE_4 directory.
- The resulting output files should be suitable for bulk loading into a CDM v5 database.
The runs cannot be done in parallel because counters and unique physician and physician institution providers are detected and carried over across multiple runs (saved in BASE_ETL_CONTROL_DIRECTORY). I recommend running them sequentially from 1 through 20 to produce a complete ETL of the approximately 2.33M patients. If you want only 1/20th of the data, you can run only sample number 1 and load the resulting .csv files into your database.
Note: On average, the transform.py program takes approximately 45-60 minutes to process one input file (e.g. DE_1). We executed the program on an Intel Xeon CPU E3-1271 v3 with 16GB of memory, and it took approximately 14 hours to process all 20 DE files.
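If you want to run all 20 samples back to back, a minimal driver could look like the following sketch (illustration only; pipeline.py already orchestrates the full extract/transform/load flow):

```python
import subprocess
import sys

# Run transform.py for samples 1..20 strictly one after another; parallel runs are not
# supported because the counters in BASE_ETL_CONTROL_DIRECTORY are shared across runs.
for sample in range(1, 21):
    print(f"Transforming DE_{sample} ...")
    result = subprocess.run([sys.executable, "transform.py", str(sample)])
    if result.returncode != 0:
        sys.exit(f"transform.py failed on sample {sample}")
```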
All paths are taken from the .env file that was set up previously.
To load the concatenated CSV files, run:
python load.py
This will load the data into a DuckDB database specified in the Data folder.
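As a quick sanity check after load.py finishes, you can query the database with the duckdb Python package (the database file name below is illustrative; use the path your setup actually writes to):

```python
import duckdb

# Connect to the DuckDB file produced by load.py and count the loaded persons.
con = duckdb.connect("Data/synpuf_cdm.duckdb")  # adjust to your actual database path
print(con.execute("SELECT COUNT(*) FROM person").fetchone()[0], "rows in person")
con.close()
```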
a) As per the OHDSI documentation for the observation and measurement tables, the fields 'value_as_string', 'value_as_number', and 'value_as_concept_id' in both tables are not mandatory, but Achilles Heel reports an error when all three of these fields are NULL; it requires at least one of them to have a non-NULL value. To fix this error, the field 'value_as_concept_id' has been populated with '0' in both the measurement and observation output .csv files.
b) The concepts for Unknown Race, Non-white, and Other Race (8552, 9178, and 8522) have been deprecated, so race_concept_id in the person file has been populated with '0' for these deprecated concepts.
c) Only two ethnicity concepts are available: 38003563 (Hispanic) and 38003564 (Non-Hispanic).
d) When a concept id has no mapping in the CONCEPT_RELATIONSHIP table (a sketch of the fallback follows this list):
- If there is no mapping from OMOP (ICD9) to OMOP (SNOMED) for an ICD9 concept id, target_concept_id for that concept id is populated with '0'.
- If there is no self-mapping from OMOP (HCPCS/CPT4) to OMOP (HCPCS/CPT4) for an HCPCS/CPT4 concept id, target_concept_id for that concept id is populated with '0'.
- If there is no mapping from OMOP (NDC) to OMOP (RxNorm) for an NDC concept id, target_concept_id for that concept id is populated with '0'.
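A minimal sketch of this fallback (not the actual ETL code; source_to_standard is assumed to be a dictionary built from the 'Maps to' rows of CONCEPT_RELATIONSHIP.csv):

```python
def map_to_standard(source_concept_id, source_to_standard):
    """Return the mapped standard concept id, or 0 when no 'Maps to' row exists."""
    # The original source_concept_id is still written to the output record either way.
    return source_to_standard.get(source_concept_id, 0)
```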
e) The source data contains concepts that appear in the CONCEPT.csv file but do not have relationship mappings to target vocabularies. For these, we create records with concept_id 0 and include the source_concept_id in the record. Achilles Heel will give warnings about these concepts for the Condition, Observation, Procedure, and Drug tables as follows, when condition_concept_id, observation_concept_id, procedure_concept_id, or drug_concept_id, respectively, is '0':
- WARNING: 400-Number of persons with at least one condition occurrence, by condition_concept_id; data with unmapped concepts
- WARNING: 800-Number of persons with at least one observation occurrence, by observation_concept_id; data with unmapped concepts
- WARNING: 600-Number of persons with at least one procedure occurrence, by procedure_concept_id; data with unmapped concepts
- WARNING: 700-Number of persons with at least one drug exposure, by drug_concept_id; data with unmapped concepts
f) About 6% of the records in the drug_exposure file have either days_supply or quantity or both set to '0' (e.g. days_supply=10 & quantity=0, or quantity=120 & days_supply=0). Though such values are present in the input file, they do not appear to be correct. Because dosage calculations would result in division by zero, effective_drug_dose has not been calculated, and for the same reason the dose_era table has been left empty. The CMS documentation says the following about both quantity and days_supply: "This variable was imputed/suppressed/coarsened as part of disclosure treatment. Analyses using this variable should be interpreted with caution. Analytic inferences to the Medicare population should not be made when using this variable."
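The guard behind that decision amounts to something like the following (illustration only; the real per-day dose formula also depends on drug strength, which is why the field is simply left blank instead):

```python
def effective_drug_dose(quantity, days_supply):
    # ~6% of drug_exposure records have quantity or days_supply equal to 0, which would
    # make any per-day dose calculation divide by zero, so no dose is computed at all.
    if not quantity or not days_supply:
        return None  # the field is left empty in the output
    return quantity / days_supply  # placeholder arithmetic for illustration only
```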
g) The locations provided in the DE_SynPUF data use SSA codes, and we mapped them to 2-letter state codes. However, SSA codes for Puerto Rico ('40') and the Virgin Islands ('48'), as well as other extra-USA locations, have been coded in source and target data as '54', representing Others, where Others = PUERTO RICO, VIRGIN ISLANDS, AFRICA, ASIA OR CALIFORNIA; INSTITUTIONAL PROVIDER OF SERVICES (IPS) ONLY, CANADA & ISLANDS, CENTRAL AMERICA AND WEST INDIES, EUROPE, MEXICO, OCEANIA, PHILIPPINES, SOUTH AMERICA, U.S. POSSESSIONS, AMERICAN SAMOA, GUAM, SAIPAN OR NORTHERN MARIANAS, TEXAS; INSTITUTIONAL PROVIDER OF SERVICES (IPS) ONLY, NORTHERN MARIANAS, GUAM, UNKNOWN.
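The state mapping reduces to a lookup with a catch-all, roughly like this sketch (only a few example SSA codes are shown; the full table lives in the ETL code):

```python
# Partial SSA-to-state mapping for illustration; the real mapping covers all states.
SSA_TO_STATE = {"01": "AL", "05": "CA", "36": "OH"}

def map_ssa_code(ssa_code):
    # Puerto Rico ('40'), the Virgin Islands ('48'), and other extra-USA codes fall
    # through to '54' ("Others"), as described above.
    return SSA_TO_STATE.get(ssa_code, "54")
```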
h) As per OMOP CDMv5 visit_cost documentation, the cost of the visit may contain just board and food, but could also include the entire cost of everything that was happening to the patient during the visit. As the input data doesn't have any specific data for visit cost, we are not writing any information to visit_cost file.
i) There is no specific data in the input DE_SynPUF files for device cost, so only a header line is written to the device_cost file.
j) Because each person can be covered by up to four payer plans, we cannot uniquely assign a payer_plan_period_id to drugs or procedures within the drug_cost and procedure_cost files, so we leave payer_plan_period_id blank in those two files.

The calculation of the fields payer_plan_period_start_date and payer_plan_period_end_date is based on the values of the following 4 fields of the input beneficiary files: BENE_HI_CVRAGE_TOT_MONS, BENE_SMI_CVRAGE_TOT_MONS, BENE_HMO_CVRAGE_TOT_MONS, and PLAN_CVRG_MOS_NUM, corresponding to the number of months a beneficiary was covered under each of up to 4 plans (Medicare Part A, Medicare Part B, HMO, and Medicare Part D). Every beneficiary can thus be covered by up to 4 plans over the three years of data (2008-2010). The CDM requires start and end dates of coverage, which we do not have, so we make some (questionable) assumptions and create payer_plan_period_start_date and payer_plan_period_end_date records for each of the 4 plans using the number of months a beneficiary was covered in a given year, as follows (a sketch of the date arithmetic follows this list):
- if the value of the fields is 12 in 2008, 2009, and 2010, payer_plan_period_start_date is set to '1/1/2008' and payer_plan_period_end_date is set to '12/31/2010'.
- if the value of the fields is 12 in 2008 and 2009 and is less than 12 in 2010, payer_plan_period_start_date is set to '1/1/2008' and payer_plan_period_end_date is set to '12/31/2009 + #months in 2010' (12/31/2009 if #months=0).
- if the value of the fields is 12 in 2008 and 2010 and is less than 12 in 2009, different payer_plan_period_start_date and payer_plan_period_end_date are set for these 3 years. Three payer_plan_period records are created with payer_plan_period_start_date and payer_plan_period_end_date set to: [('1/1/2008', '12/31/2008'), ('1/1/2009', '1/1/2009 + #months' no record is written if #months=0), ('1/1/2010', '12/31/2010')]
- if the value of the fields is 12 in 2008 and is less than 12 in 2009 and 2010, three payer_plan_period records are created with payer_plan_period_start_date and payer_plan_period_end_date set to: [('1/1/2008', '12/31/2008'), ('1/1/2009', '1/1/2009+ #months' no record is written if #months=0), ('1/1/2010', '1/1/2010+ #months' no record is written if #months=0)]
- if the value of the fields is 12 in 2009 and 2010 and is less than 12 in 2008, payer_plan_period_start_date is calculated by subtracting #months from '12/31/2008' and payer_plan_period_end_date is set to '12/31/2010'.
- if the value of the fields is 12 in 2009 and is less than 12 in 2008 and 2010, payer_plan_period_start_date is calculated by subtracting #months from 12/31/2008 and payer_plan_period_end_date is calculated by adding #months to '12/31/2009'
- if the value of the fields is 12 in 2010 and is less than 12 in 2008 and 2009, three payer_plan_period records are created with payer_plan_period_start_date and payer_plan_period_end_date set to: [('1/1/2008', '1/1/2008+ #months' no record is written if #months=0), ('1/1/2009', '1/1/2009+ #months' no record is written if #months=0), ('1/1/2010', '12/31/2010')]
- if the value of the fields is less than 12 in 2008, 2009, and 2010, three payer_plan_period records are created with payer_plan_period_start_date and payer_plan_period_end_date set to - [('1/1/2008', '1/1/2008+ #months' - no record is written if #months=0), ('1/1/2009', '1/1/2009+ #months' - no record is written if #months=0), ('1/1/2010', '1/1/2010+ #months' - no record is written if #months=0)]
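The partial-year date arithmetic above boils down to two cases, sketched below (illustration only, assuming the python-dateutil package is available; the real ETL also merges contiguous fully-covered years into a single period):

```python
from datetime import date
from dateutil.relativedelta import relativedelta

def trailing_partial(year, months):
    """Coverage assumed to start on Jan 1 and run for `months` months ('1/1/YYYY + #months')."""
    if months == 0:
        return None  # no record is written
    start = date(year, 1, 1)
    return start, start + relativedelta(months=months)

def leading_partial(year, months):
    """Coverage assumed to end on Dec 31 and be backdated by `months` months."""
    if months == 0:
        return None  # no record is written
    end = date(year, 12, 31)
    return end - relativedelta(months=months), end
```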
k) Input files (DE_1-DE_20) have some ICD9/HCPCS/NDC codes that are not defined in the concept file; such records are not processed by the program and are instead written to the 'unmapped_code_log.txt' file. This file is opened in append mode so that if more than one input file is processed together, the program appends unmapped codes from all input files instead of overwriting, which means 'unmapped_code_log.txt' must be deleted if you want to rerun the program with the same input file (a small sketch of this append-mode logging follows the table below). We list the unmapped codes that occurred 50 or more times, along with their putative source vocabulary (ICD9/HCPCS/CPT4/NDC), below. Some appear to be typos or invalid entries; others may represent ICD9 codes that are not part of ICD9CM. For instance, the 04.xx codes are listed on some non-US lists of ICD9 codes (see for example 04.22).
Count Vocabulary Code
----- ----------- ----
54271 ICD9 XX000
11697 HCPCS/CPT4 201E5
5293 ICD9 0422
5266 ICD9 0432
5249 ICD9 0440
5240 ICD9 0430
5220 ICD9 0421
5208 ICD9 0429
5206 ICD9 0431
5204 ICD9 0433
5157 ICD9 0439
5119 ICD9 0420
2985 ICD9 OTHER
1773 HCPCS/CPT4 0851
1153 HCPCS/CPT4 99910
1038 ICD9 30513
993 ICD9 30512
925 ICD9 30510
897 ICD9 30511
655 HCPCS/CPT4 90699
406 HCPCS/CPT4 01
327 HCPCS/CPT4 0841
313 NDC OTHER
286 HCPCS/CPT4 0521
270 HCPCS/CPT4 520
234 HCPCS/CPT4 99998
211 ICD9 9
180 HCPCS/CPT4 00000
125 ICD9 30040
119 HCPCS/CPT4 XXXXX
101 HCPCS/CPT4 J2000
99 HCPCS/CPT4 A9170
93 HCPCS/CPT4 X9999
92 HCPCS/CPT4 A9160
80 ICD9 72330
71 HCPCS/CPT4 GOOO8
71 HCPCS/CPT4 GO283
68 ICD9 73930
59 ICD9 4900
54 HCPCS/CPT4 521
50 ICD9 VO48
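The append-mode logging described in (k) amounts to something like this sketch (a hypothetical helper, not the ETL's actual function):

```python
def log_unmapped(vocabulary, code, log_path="unmapped_code_log.txt"):
    # 'a' mode appends across runs, which is why the file must be deleted
    # before reprocessing the same input sample.
    with open(log_path, "a") as log_file:
        log_file.write(f"{vocabulary}\t{code}\n")
```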