EHLC_Data_Harmonization

Repository for holding content from the data harmonization use case. This repository holds:

Crosstab_X_Y.xlsx: manually generated mappings of variables between studies X and Y. The Overview tab in each spreadsheet describes the remaining tabs and their details.

MappingResults.xlsx: holds the counts of manual mappings between pairs of studies.

data_prep.py: code to prepare for analysis, including converting the manually mapped data into an analysis format and generating embeddings for each variable to support the analysis. An OpenAI API key is needed to run this code.

run_analysis.py: code to generate a computer-based mapping of variables between studies (using embedding similarity between pairs of variables) and to compare the computer-based and manual mappings.

gen_heatmap.py: throwaway code used to create a heatmap of cross-study mappings based on embeddings, as a sanity check on the use of embeddings.

overall_best_results.xlsx: holds the results of comparing the manual and computer-based mappings using the OpenAI text-embedding-3-large model.

2016-34-DD-DemoHealth.xlsx: data dictionary for study 2016-34

2016-1407-DD-DemoHealth.xlsx: data dictionary for study 2016-1407

2016-1450-DD-DemoHealth.xlsx: data dictionary for study 2016-1450

2016-1740-DD-DemoHealth.xlsx: data dictionary for study 2016-1740

2017-1945-DD-DemoHealth.xlsx: data dictionary for study 2016-1945
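As a rough illustration of the embedding-based mapping that run_analysis.py is described as performing, the sketch below picks, for each variable in one study, the variable in another study with the highest cosine similarity, then scores the result against a manual mapping. It is not the repository's code: the function names (`cosine_similarity`, `best_matches`, `agreement`), the variable names, and the three-dimensional toy vectors are all hypothetical, and the use of cosine similarity with a single best match per variable is an assumption. Real embeddings would have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_matches(source, target):
    """Map each source variable to its most similar target variable.

    `source` and `target` are dicts of variable name -> embedding vector.
    """
    matches = {}
    for s_name, s_vec in source.items():
        matches[s_name] = max(
            target, key=lambda t_name: cosine_similarity(s_vec, target[t_name])
        )
    return matches


def agreement(computed, manual):
    """Fraction of manually mapped variables where both mappings agree."""
    if not manual:
        return 0.0
    hits = sum(1 for var, tgt in manual.items() if computed.get(var) == tgt)
    return hits / len(manual)


# Hypothetical toy data; these are not variables from the data dictionaries.
study_x = {"age_years": [1.0, 0.0, 0.1], "smoker": [0.0, 1.0, 0.0]}
study_y = {
    "age": [0.9, 0.1, 0.0],
    "smoking_status": [0.1, 0.9, 0.2],
    "zip": [0.0, 0.0, 1.0],
}
manual = {"age_years": "age", "smoker": "smoking_status"}

computed = best_matches(study_x, study_y)
score = agreement(computed, manual)
print(computed, score)
```

A single best match per variable is the simplest choice; a similarity threshold or a top-k list would be natural variations when a variable has no true counterpart in the other study.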

Note: this repository uses a different naming convention for the studies than the publication. The mapping between the names in this repository and those in the paper is:

2016-1450 = Study A

2016-1407 = Study B

2016-1945 = Study C

2016-34 = Study D

2016-1740 = Study E
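When moving between the spreadsheets here and the paper, the naming note above can be kept as a small lookup table. The sketch below is only a convenience illustration; the names `REPO_TO_PAPER`, `PAPER_TO_REPO`, and `paper_name` are hypothetical and do not appear in the repository's scripts.

```python
# Study identifiers used in this repository -> names used in the paper.
REPO_TO_PAPER = {
    "2016-1450": "Study A",
    "2016-1407": "Study B",
    "2016-1945": "Study C",
    "2016-34": "Study D",
    "2016-1740": "Study E",
}

# Reverse lookup: paper name -> repository identifier.
PAPER_TO_REPO = {paper: repo for repo, paper in REPO_TO_PAPER.items()}


def paper_name(repo_id):
    """Return the paper's name for a repository study identifier."""
    return REPO_TO_PAPER[repo_id]


print(paper_name("2016-34"))
print(PAPER_TO_REPO["Study A"])
```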

