
Software Mentions Linking + Disambiguation

The goal of this project is to produce a high quality dataset of software used in the biomedical literature to facilitate analysis of adoption and impact of open-source scientific software. Our overall methodology is the following:

  1. Extract plain-text software mentions from the PMC Open Access (PMC-OA) collection using an NER machine-learning algorithm (developed by Ivana Williams)
  2. Link the software mentions to repositories and generate metadata by querying a number of databases. We link mentions to: PyPI, Bioconductor, CRAN, SciCrunch and GitHub
  3. Disambiguate the software mentions

More detailed descriptions of the linking and disambiguation steps can be found below, together with instructions on how to run the code.

Linking

Linking Task description

  1. We query the following databases, searching for exact matches for plain-text software mentions in our dataset: PyPI, CRAN, Bioconductor, SciCrunch, and GitHub.
  2. We normalize the metadata files to a common schema.

Linking Schema

Metadata files are normalized to the following fields:

| Field | Description |
| --- | --- |
| ID | unique ID of the software mention (generated by us) |
| software_mention | plain-text software mention |
| mapped_to | value the software_mention is being mapped to |
| source | source of the mapping, e.g. Bioconductor Index, GitHub API |
| platform | platform of the software_mention, e.g. PyPI, CRAN |
| package_url | URL linking the software_mention to its source |
| description | description of the software_mention |
| homepage_url | homepage URL of the software_mention |
| other_urls | other related URLs |
| license | software license |
| github_repo | GitHub repository |
| github_repo_license | GitHub repository license |
| exact_match | whether or not this mapping was an exact match |
| RRID | RRID for the software_mention |
| reference | journal articles linked to the software_mention (identified through DOI, PMID, or RRID) |
| scicrunch_synonyms | synonyms for the software_mention, retrieved from SciCrunch |
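For concreteness, a normalized record for one mention might look like the following (a hypothetical example; the values and the ID format are illustrative, not taken from the dataset):

record = {
    "ID": "SM00001234",                      # ID format is hypothetical
    "software_mention": "limma",
    "mapped_to": "limma",
    "source": "Bioconductor Index",
    "platform": "Bioconductor",
    "package_url": "https://bioconductor.org/packages/limma",
    "exact_match": True,
    # ...plus description, homepage_url, other_urls, license, github_repo,
    # github_repo_license, RRID, reference, and scicrunch_synonyms
}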

How to run the Linking code

All the scripts for linking are under the linker folder. Here are the instructions for running the code from scratch.

Step 1: Setup

1. Add the necessary folders and data

  • Create the data folder structure by running:
python initialize.py
  • Download the input data from the Dryad link. Add the input software mentions file (e.g. comm_IDs.tsv.gz) into the data/input_files folder. Do not unzip the file; the scripts assume a .gz extension.

2. Assign IDs for software mentions

This step assigns IDs to the software mentions in the input file. It also generates a mention2ID.pkl file, which contains a mapping from each mention to its ID. It can generate this file from scratch or update an already existing mention2ID file.
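Conceptually, mention2ID.pkl is a plain Python dictionary serialized with pickle; a minimal sketch of building one (the real script reads mentions from the input file and handles more cases):

import pickle

mentions = ["ImageJ", "limma", "BLAST"]  # stand-ins for the unique mentions in the input file

mention2ID = {}
for mention in mentions:
    if mention not in mention2ID:
        mention2ID[mention] = len(mention2ID)  # IDs assigned in insertion order

with open("data/intermediate_files/mention2ID.pkl", "wb") as f:
    pickle.dump(mention2ID, f)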

Generate mention2ID from scratch:

python assign_IDs.py --input-file (your_input_file) --mention2ID-file (your_output_file_for_mention2ID)

Example: python assign_IDs.py --input-file comm.tsv.gz --output-file comm_IDs.tsv.gz
Note: the script assumes that input_file is under data/input_files.

Update an already existing mention2ID file:

python assign_IDs.py --input-file (your_input_file) --mention2ID-file (your_existing_file_for_mention2ID) --mention2ID-updated_file (your_updated_file_for_mention2ID) 

At the end of this step, you should have:

  • mention2ID.pkl file under data/intermediate_files
  • comm_IDs.tsv.gz file under data/input_files

3. Filter comm_IDs.tsv.gz to exclude non-software mentions

This step filters comm_IDs.tsv.gz to exclude mentions that were marked as not-software by our expert bio-curation team.
The curated list of terms to be excluded is under data/curation/curation_top10k_mentions_binary_labels.csv.
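The filtering idea, as a minimal pandas sketch (the label and mention column names in the curation file are assumptions, not the script's actual schema):

import pandas as pd

df = pd.read_csv("data/input_files/comm_IDs.tsv.gz", sep="\t", compression="gzip")
curation = pd.read_csv("data/curation/curation_top10k_mentions_binary_labels.csv")

# drop every mention the curators labeled as not-software (column names assumed)
not_software = set(curation.loc[curation["label"] == "not-software", "mention"])
curated = df[~df["software_mention"].isin(not_software)]
curated.to_csv("data/input_files/comm_curated.tsv.gz", sep="\t", index=False)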

python filter_curated_terms.py --input-file (your_input_file) --output-curated-dataset (your_output_file)

Example: python filter_curated_terms.py --input-file comm_IDs.tsv.gz --output-curated-dataset comm_curated.tsv.gz

At the end of this step, you should have:

  • comm_curated.tsv.gz file under data/input_files - comm_IDs with mentions marked as non-software filtered out
  • comm_with_labels.tsv.gz file under data/input_files - comm_IDs with an additional field giving each mention's curation label ('software', 'not-software', 'unclear', or 'not curated' for mentions that have not been curated)

Step 2: Query databases

This step links software mentions to PyPI, CRAN, Bioconductor, SciCrunch, and GitHub.
Here are the instructions if you would like to compute the metadata files from scratch:

1. Generate keys for accessing the APIs

You will need a number of access keys. You can get them for free from https://libraries.io, https://github.com, and https://sparc.science. Then create a file named keys with the following content:

export GITHUB_USER=...
export GITHUB_TOKEN=...
export SCICRUNCH_TOKEN=...

Source this file with . keys or source keys.
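After sourcing, the tokens are available as environment variables; the scripts would read them along these lines (illustrative, not the repository's exact code):

import os

github_user = os.environ["GITHUB_USER"]
github_token = os.environ["GITHUB_TOKEN"]
scicrunch_token = os.environ["SCICRUNCH_TOKEN"]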

2. Generate metadata files

Each of the commands below generates metadata files from scratch by querying a specific database for linking and generating metadata for the software mentions.
There are a number of command-line parameters that can be tuned; more info is in the scripts themselves.
Metadata files are normalized to a common schema and saved under data/metadata_files/normalized. Raw versions are also saved under data/metadata_files/raw.

Note that these scripts can take a long time to run, especially given the large number of mentions in the dataset. In particular, GitHub API requests are subject to rate limits. We recommend parallelizing or using distributed computing; we used a Spark environment to speed up the process.
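If you adapt the GitHub querying yourself, one common pattern is to watch the rate-limit headers and sleep until the quota resets; a sketch using the requests library (not the repository's code):

import os
import time
import requests

def github_get(url):
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
    resp = requests.get(url, headers=headers)
    # GitHub reports the remaining quota and the reset time in these headers
    if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
        reset = int(resp.headers["X-RateLimit-Reset"])
        time.sleep(max(reset - time.time(), 0) + 1)  # wait out the rate-limit window
        resp = requests.get(url, headers=headers)
    return resp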

python bioconductor_linker.py --input-file comm_IDs.tsv.gz --generate-new
python cran_linker.py --input-file comm_IDs.tsv.gz --generate-new
python pypi_linker.py --input-file comm_IDs.tsv.gz --generate-new
python github_linker.py --input-file comm_IDs.tsv.gz --generate-new
python scicrunch_linker.py --input-file comm_IDs.tsv.gz --generate-new

Sanity-checking: to make sure that everything works the way it should, you can sanity-check the code by running something like:

python bioconductor_linker.py --input-file comm_IDs.tsv.gz --top-k 40

This will only try to link the first 40 mentions and should take a fairly short time (minutes). You can do this for any of the metadata linking scripts.

At the end of this step, you should have:

  • raw metadata files saved under the data/metadata_files/raw directory.
    • pypi_raw_df.csv
    • cran_raw_df.csv
    • bioconductor_raw_df.csv
    • scicrunch_raw_df.csv
    • github_raw_df.csv
  • normalized files (to a common schema) saved under data/metadata_files/normalized.
    • pypi_df.csv
    • cran_df.csv
    • bioconductor_df.csv
    • scicrunch_df.csv
    • github_df.csv

Step 3: Create the master metadata file

Once the individual metadata files are computed, aggregate them together by running:

python generate_metadata_file.py

This step also does some post-processing of the individual metadata files.
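Schematically, the aggregation amounts to concatenating the normalized tables and deduplicating (the real script does additional post-processing):

import pandas as pd

sources = ["pypi_df.csv", "cran_df.csv", "bioconductor_df.csv", "scicrunch_df.csv", "github_df.csv"]
frames = [pd.read_csv(f"data/metadata_files/normalized/{name}") for name in sources]

metadata = pd.concat(frames, ignore_index=True).drop_duplicates()
metadata.to_csv("data/output_files/metadata.csv", index=False)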

At the end of this step, you should have:

  • metadata.csv saved under the data/output_files/ directory

Linking evaluation

We evaluate the linking algorithm using an expert team of biomedical curators. We ask them to label 50 generated software-link pairs as one of: correct, incorrect, or unclear.
The evaluation file is available as evaluation_linking.csv and the script to compute the metrics is evaluation_linking.py. To get the evaluation metrics, run the evaluation script inside the evaluation folder:

python evaluation_linking.py --linking-evaluation-file ../data/curation/evaluation_linking.csv
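The metric computation is essentially a tally over the curators' labels; e.g. (the label column name and values are assumptions about the evaluation file, not its documented schema):

import pandas as pd

ev = pd.read_csv("../data/curation/evaluation_linking.csv")
counts = ev["label"].value_counts()   # 'label' column name is assumed
print(counts)
print("fraction correct:", counts.get("correct", 0) / len(ev))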

Disambiguation

Disambiguation Task description

For the disambiguation task, we use the following methodology:

1. Synonym Generation: we generate synonyms for mentions in our corpus through:

  1. Keywords-based synonym generation
  2. Scicrunch synonyms retrieval
  3. String similarity (Jaro-Winkler) algorithms

2. Similarity matrix generation: based on synonyms generated in the previous step, we build a similarity matrix

3. Cluster Generation

  1. We retrieve the connected components from the similarity matrix
  2. For each connected component, we compute its distance matrix from its similarity matrix
  3. We cluster the connected components by feeding the corresponding distance matrices into the DBSCAN algorithm
  4. We name each cluster after the mention with the highest frequency in our corpus (see the sketch after this list)
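A minimal sketch of steps 2-4, using Jaro-Winkler similarity via the jellyfish package and scikit-learn's DBSCAN (the eps value, the example mentions, and the frequencies are illustrative, not the repository's settings):

import numpy as np
import jellyfish
from sklearn.cluster import DBSCAN

mentions = ["scikit-learn", "sklearn", "scikit learn", "ImageJ", "Image J"]

# pairwise Jaro-Winkler similarity matrix
sim = np.array([[jellyfish.jaro_winkler_similarity(a, b) for b in mentions] for a in mentions])

# DBSCAN expects distances, so convert similarity to distance
dist = 1.0 - sim
labels = DBSCAN(eps=0.25, min_samples=1, metric="precomputed").fit_predict(dist)

# name each cluster after its most frequent mention (frequencies are made up here)
freq = {"scikit-learn": 120, "sklearn": 80, "scikit learn": 5, "ImageJ": 200, "Image J": 10}
for c in set(labels):
    members = [m for m, l in zip(mentions, labels) if l == c]
    print(max(members, key=freq.get), "<-", members)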

How to run the disambiguation code

All the scripts for disambiguation are under the disambiguation folder. Here are the instructions for running the code from scratch.

Step 1: Setup

  • Follow the steps under Linking Setup if you haven't already. In particular, this section requires the mention2ID.pkl file generated or retrieved in a previous step.
  • Generate a frequency dictionary freq_dict.pkl containing mappings from {synonym : frequency} by running:
python generate_freq_dict.py --input-file ../data/input_files/comm_IDs.tsv --output-file ../data/intermediate_files/freq_dict.pkl

This file will be later used in clustering.
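The frequency dictionary is just a pickled mapping from string to corpus frequency; a minimal pandas sketch (the column name is an assumption):

import pickle
import pandas as pd

df = pd.read_csv("../data/input_files/comm_IDs.tsv.gz", sep="\t", compression="gzip")
freq_dict = df["software_mention"].value_counts().to_dict()   # column name assumed

with open("../data/intermediate_files/freq_dict.pkl", "wb") as f:
    pickle.dump(freq_dict, f)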

At the end of this step, you should have:

  • mention2ID.pkl
  • freq_dict.pkl

Step 2: Synonym Generation

Generate the synonym files as follows:

1. Generate keywords-based synonyms

python generate_synonyms_keywords.py

This step assumes that the cran_df.csv, pypi_df.csv, and bioconductor_df.csv files exist under data/metadata_files/normalized and that the mention2ID.pkl file exists under data/intermediate_files.

At the end of this step, you should have:

  • pypi_synonyms.pkl
  • cran_synonyms.pkl
  • bioconductor_synoynms.pkl

2. Generate Scicrunch-based synonyms:

python generate_synonyms_scicrunch.py

This step assumes that the scicrunch_df.csv file exists under data/metadata_files/normalized.

At the end of this step, you should have:

  • scicrunch_synoynms.pkl
  • extra_scicrunch_synonyms.pkl

3. Generate string similarity synonyms:

python generate_synonyms_string_similarity.py

This step assumes that the mention2ID.pkl file exists under data/intermediate_files. It can be time-consuming, so we recommend running it in batches: you can choose an ID_start as well as an ID_end, and a Spark implementation is also available.

python generate_synonyms_string_similarity.py --ID_start 0 --ID_end 100

The start/end IDs refer to the software mention IDs in mention2ID.pkl

After all the batched files are generated, combine all of them in one master file by running:

python generate_string_sim_dict.py

This will generate a string_similarity_dict.pkl
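The merge is a straightforward union of the batched dictionaries; schematically (the batch file naming pattern is assumed):

import glob
import pickle

string_similarity_dict = {}
for path in glob.glob("../data/intermediate_files/string_sim_*.pkl"):   # batch file pattern assumed
    with open(path, "rb") as f:
        string_similarity_dict.update(pickle.load(f))

with open("../data/intermediate_files/string_similarity_dict.pkl", "wb") as f:
    pickle.dump(string_similarity_dict, f)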

At the end of this step, you should have:

  • string_similarity_dict.pkl

Step 3: Combine synonyms from all sources

This step combines all synonym files into a master file and does some post-processing.
It assumes that the following files have already been generated:

  • pypi_synonyms.pkl
  • cran_synonyms.pkl
  • bioconductor_synoynms.pkl
  • scicrunch_synoynms.pkl
  • extra_scicrunch_synonyms.pkl
  • string_similarity_dict.pkl
  • mention2ID.pkl

python combine_all_synonyms.py

At the end of this step, you should have:

  • synonyms.csv file under data/disambiguation

Step 4: Cluster generation

This step involves computing the similarity matrix and clustering the mentions.
Note that this step assumes that the synonyms.csv file has already been computed and contains similarity scores between pairs of strings.
A frequency dictionary freq_dict.pkl is also required, so that the clustering algorithm can assign each cluster the name of its most frequent mention in the corpus. If you haven't generated it yet, create it using the steps under Step 1: Setup above.

python clustering.py --synonyms-file (your_synonyms_file)

Example: python clustering.py --synonyms-file ../data/disambiguation_files/synonyms.csv

Disambiguation Evaluation

We evaluate the disambiguation algorithm using an expert team of biomedical curators. We ask them to label 5885 generated software-synonym pairs as one of: Exact, Narrow, Unclear, or Not Synonym.
The evaluation file is available as evaluation_disambiguation.csv and the script to compute the metrics is evaluation_disambiguation.py. To get the evaluation metrics, run the evaluation script inside the evaluation folder:

python evaluation_disambiguation.py --linking-evaluation-file ../data/curation/evaluation_disambiguation.csv

Notes

  • most scripts assume the software mentions files are gzipped, i.e. in the format comm.tsv.gz or comm_IDs.tsv.gz rather than comm.tsv or comm_IDs.tsv
  • each script has a number of command-line parameters that can be passed in; more info is available inside each file
  • You might get errors when trying to read the publishers_collection files with pandas. If using pandas, we recommend adding the error_bad_lines flag (or on_bad_lines='skip' on pandas 1.3+). This may incorrectly disregard a small number of lines.

import pandas as pd

# skip malformed lines instead of raising an error
# (on pandas 1.3+ use on_bad_lines='skip' instead of error_bad_lines=False)
publishers_collection_df = pd.read_csv('publishers_collections.tsv.gz', sep='\t',
                                       compression='gzip', engine='python',
                                       error_bad_lines=False)