Phenotype comparison tools using semantic similarity.

CarlosBorroto/phenopy

phenopy

phenopy is a Python package for scoring phenotype similarity using semantic similarity. It is a lightweight but highly optimized command line tool and library that efficiently scores the semantic similarity of generic entities annotated with phenotype terms from the Human Phenotype Ontology (HPO).

[Figure: Phenotype Similarity Clustering]

Installation

GitHub

Install from GitHub:

git clone https://github.com/GeneDx/phenopy.git
cd phenopy
python setup.py install

Command Line Usage

Initial setup

phenopy is designed to run with minimal setup from the user. To run phenopy with default parameters (recommended), skip ahead to the Commands overview.

This section provides details about where phenopy stores data resources and config files. The following occurs when you run phenopy for the first time.

  1. phenopy creates a .phenopy/ directory in your home folder and downloads external resources from HPO into the $HOME/.phenopy/data/ directory.
  2. phenopy stores a binary version of the HPO as a networkx graph object here: $HOME/.phenopy/data/hpo_network.pickle.
  3. phenopy creates a $HOME/.phenopy/phenopy.ini config file where users can set variables for phenopy to use at runtime.
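
The locations listed above can be checked from Python. This is a minimal sketch that uses only the paths named in this section; the helper name `phenopy_paths` is hypothetical and is not part of phenopy.

```python
from pathlib import Path


def phenopy_paths(home=None):
    """Return the resource locations described above.

    Hypothetical helper for illustration; not part of the phenopy API.
    """
    base = Path(home) if home is not None else Path.home()
    root = base / ".phenopy"
    return {
        "data_dir": root / "data",
        "hpo_network": root / "data" / "hpo_network.pickle",
        "config": root / "phenopy.ini",
    }


if __name__ == "__main__":
    # Report which resources already exist, e.g. after the first phenopy run.
    for name, path in phenopy_paths().items():
        status = "found" if path.exists() else "missing"
        print(f"{name}: {path} ({status})")
```

If any entry is reported as missing, phenopy has probably not been run yet under the current user account.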

Commands overview

phenopy is primarily used as a command line tool. An entity, as used here, is typically a sample, gene, or disease, but can be any concept that warrants annotation with phenotype terms.

  1. Score similarity of an entity defined by the HPO terms from an input file against all the genes in .phenopy/data/phenotype_to_genes.txt. We provide a test input file in the repo.

    phenopy score tests/data/test.score.txt

    Output:

    #query	gene	score
    SAMPLE	NCBI:10000[AKT3]	0.0252
    SAMPLE	NCBI:10002[NR2E3]	0.0148
    SAMPLE	NCBI:100033413[SNORD116-1]	0.0283
    ...
    
  2. Score similarity of an entity defined by the HPO terms from an input file against a custom list of entities with HPO annotations, referred to as the --records-file.

    phenopy score tests/data/test.score.txt --records-file tests/data/test.score-product.txt

    Output:

    #query	entity_id	score
    SAMPLE	118200	0.0584
    SAMPLE	118210	0.057
    SAMPLE	118220	0.0563
    ...
    
  3. Score pairwise similarity of entities defined in the --records-file.

    phenopy score-product tests/data/test.score-product.txt --threads 4

    Output:

    118200	118200	0.7692
    118200	118300	0.5345
    118200	300905	0.2647
    ...
    
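
The score output shown above can be post-processed with a few lines of Python. The sketch below assumes the output is tab-delimited with three columns, as in the examples; the function name `parse_scores` is hypothetical and not part of phenopy. The sample text reuses the rows from the first example.

```python
import csv
import io


def parse_scores(text):
    """Parse tab-separated score output into (query, target, score) tuples."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip blank lines, the '#query ...' header, and elided '...' lines.
        if len(fields) != 3 or fields[0].startswith("#"):
            continue
        rows.append((fields[0], fields[1], float(fields[2])))
    return rows


sample = (
    "#query\tgene\tscore\n"
    "SAMPLE\tNCBI:10000[AKT3]\t0.0252\n"
    "SAMPLE\tNCBI:10002[NR2E3]\t0.0148\n"
    "SAMPLE\tNCBI:100033413[SNORD116-1]\t0.0283\n"
)

# Rank targets from most to least similar.
ranked = sorted(parse_scores(sample), key=lambda row: row[2], reverse=True)
for query, target, score in ranked:
    print(query, target, score)
```

The same parser also handles the headerless three-column output of `score-product`, since it only skips lines that start with `#` or lack three fields.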

Parameters

For a full list of command arguments use phenopy [subcommand] --help:

phenopy score --help

Output:

    --records_file=RECORDS_FILE
        One record per line, tab delimited. First column: unique record identifier. Second column: pipe-separated list of HPO identifiers (e.g. HP:0000001).
    --query_name=QUERY_NAME
        Unique identifier for the query file.
    --obo_file=OBO_FILE
        OBO file from https://hpo.jax.org/app/download/ontology.
    --pheno2genes_file=PHENO2GENES_FILE
        Phenotypes to genes from https://hpo.jax.org/app/download/annotation.
    --threads=THREADS
        Number of parallel processes to use.
    --agg_score=AGG_SCORE
        The aggregation method to use for summarizing the similarity matrix between two term sets. Must be one of {'BMA', 'maximum'}.
    --no_parents=NO_PARENTS
        If provided, scoring is done by only using the most informative nodes. All parent nodes are removed.
    --hpo_network_file=HPO_NETWORK_FILE
        If provided, phenopy will try to load a cached hpo_network object from file.
    --custom_annotations_file=CUSTOM_ANNOTATIONS_FILE
        A comma-separated list of custom annotation files in the same format as tests/data/test.score-product.txt
    --output_file=OUTPUT_FILE
        File path where the results are stored.
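
The records file format described for `records_file` (one record per line, tab delimited, with a pipe-separated list of HPO identifiers) can be written and read as sketched below. The helper names and the `RECORD_1` identifier are hypothetical and used only for illustration; the sketch assumes exactly the two columns named in the help text.

```python
def format_record(record_id, hpo_terms):
    """Build one records-file line: '<id><TAB><term>|<term>|...'."""
    return f"{record_id}\t{'|'.join(hpo_terms)}"


def parse_record(line):
    """Split one records-file line into (record_id, [hpo_terms])."""
    record_id, terms = line.rstrip("\n").split("\t", 1)
    return record_id, [term for term in terms.split("|") if term]


# 'RECORD_1' is a placeholder identifier, not real annotation data.
line = format_record("RECORD_1", ["HP:0001263", "HP:0000252"])
print(repr(line))
print(parse_record(line))
```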

Library Usage

The phenopy library can be used as a Python module, allowing more control for advanced users.

import os
from phenopy import config
from phenopy.obo import restore
from phenopy.score import Scorer

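# Path to the cached HPO network that phenopy writes on first run.
network_file = os.path.join(config.data_directory, 'hpo_network.pickle')

# Restore the networkx graph and build a scorer from it.
hpo = restore(network_file)
scorer = Scorer(hpo)

# Two sets of HPO terms to compare.
terms_a = ['HP:0001882', 'HP:0011839']
terms_b = ['HP:0001263', 'HP:0000252']

# Print the semantic similarity score between the two term sets.
print(scorer.score(terms_a, terms_b))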

Output:

0.0005
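
A single score like the one above summarizes a matrix of term-to-term similarities, and the `agg_score` parameter selects how that matrix is aggregated ('BMA' or 'maximum'). The sketch below shows one common definition of best-match average (BMA) next to the maximum. It is an illustration under that assumption, not phenopy's own implementation, which may differ in detail; the matrix values are made up.

```python
def best_match_average(matrix):
    """Mean of the best match per row and per column, averaged together.

    One common BMA definition; an assumption, not phenopy's actual code.
    """
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    return (sum(row_best) / len(row_best) + sum(col_best) / len(col_best)) / 2


def maximum(matrix):
    """Largest single term-to-term similarity in the matrix."""
    return max(max(row) for row in matrix)


# Toy similarities (made-up numbers): rows are terms_a, columns are terms_b.
toy = [
    [0.2, 0.6],
    [0.4, 0.1],
]
print(best_match_average(toy))
print(maximum(toy))
```

BMA rewards term sets in which every term has a reasonable counterpart, whereas the maximum reflects only the single best-matching pair.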

Another example uses the library to prune parent phenotypes from the phenotype_to_genes.txt file:

import os
from phenopy import config
from phenopy.obo import restore
from phenopy.util import export_pheno2genes_with_no_parents


network_file = os.path.join(config.data_directory, 'hpo_network.pickle')
phenotype_to_genes_file = os.path.join(config.data_directory, 'phenotype_to_genes.txt')
phenotype_to_genes_no_parents_file = os.path.join(config.data_directory, 'phenotype_to_genes_no_parents.txt')

hpo = restore(network_file)
export_pheno2genes_with_no_parents(phenotype_to_genes_file, phenotype_to_genes_no_parents_file, hpo)
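
To illustrate what pruning parent phenotypes means, the sketch below removes every term that is an ancestor of another term in the same set, so only the most specific terms remain. phenopy works on its networkx HPO graph; this dependency-free version uses a plain child-to-parents dictionary instead, and the function names and toy hierarchy are hypothetical.

```python
def ancestors(term, parents):
    """Collect every ancestor of `term` by walking the child -> parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def drop_parents(terms, parents):
    """Keep only terms that are not an ancestor of another term in the set."""
    terms = set(terms)
    covered = set()
    for term in terms:
        covered |= ancestors(term, parents)
    return sorted(terms - covered)


# Toy hierarchy with placeholder names (not real HPO terms).
toy_parents = {
    "leaf_a": ["mid"],
    "leaf_b": ["mid"],
    "mid": ["root"],
    "other": ["root"],
}
print(drop_parents(["leaf_a", "mid", "other", "root"], toy_parents))
```

Here `mid` and `root` are dropped because they are ancestors of `leaf_a`, while `other` is kept because none of its descendants appear in the set.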

Config

We recommend the default settings for most users, but the config file at $HOME/.phenopy/phenopy.ini can be modified.

IMPORTANT NOTE:
If the config variable hpo_network_file is defined, phenopy will try to load this stored version of the HPO and ignore the following command-line arguments: obo_file and custom_annotations_file.

To run phenopy with a different obo_file or custom_annotations_file, rename or move the HPO network file:

mv $HOME/.phenopy/data/hpo_network.pickle $HOME/.phenopy/data/hpo_network.old.pickle

To run phenopy with a previously stored version of the HPO network, simply set hpo_network_file = /path/to/hpo_network.pickle.
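
Before editing, it can help to inspect the current settings. The sketch below assumes phenopy.ini is a standard INI file readable by Python's `configparser`; it makes no assumption about section names and simply lists whatever is present. The function name `read_phenopy_config` is hypothetical.

```python
import configparser
from pathlib import Path


def read_phenopy_config(path=None):
    """Return the config as {section: {option: value}}.

    Hypothetical helper; defaults to $HOME/.phenopy/phenopy.ini.
    """
    if path is None:
        path = Path.home() / ".phenopy" / "phenopy.ini"
    # Disable interpolation so values containing '%' are returned verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # yields no sections if the file does not exist
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    for section, options in read_phenopy_config().items():
        print(f"[{section}]")
        for key, value in options.items():
            print(f"{key} = {value}")
```

Look for `hpo_network_file` in the printed options to see whether a stored HPO network will take precedence over obo_file and custom_annotations_file.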

Contributing

We welcome contributions from the community. Please follow these steps to set up a local development environment.

pipenv install --dev

To run tests locally:

pipenv shell
coverage run --source=. -m unittest discover --start-directory tests/
coverage report -m

References

The underlying algorithm which determines the semantic similarity for any two HPO terms is based on an implementation of HRSS, published here.
