PaCPaC - Paratope and Clonotype Probing and Clustering
You must have Docker & Docker Compose installed.
git clone https://github.com/aretasg/pacpac.git
cd pacpac
Move csv_dataset
to the /data
folder
docker-compose run pacpac cluster <csv_dataset> <vh_amino_acid_sequence_column_name>
docker-compose run pacpac probe <probe_vh_amino_acid_sequence> <csv_dataset> <vh_amino_acid_sequence_column_name>
Check /data
folder for output
Install conda first
git clone https://github.com/aretasg/pacpac.git
cd pacpac
conda env create -f environment.yml
conda activate pacpac
pip install .
import pandas as pd
from pacpac import pacpac
df = pd.read_csv(<my_data_set.csv>)
df = pacpac.cluster(df, <vh_amino_acid_sequence_column_name>)
df = pacpac.probe(<probe_vh_amino_acid_sequence>, df, <vh_amino_acid_sequence_column_name>)
or alternatively cluster and/or probe using both, VH and VL, sequences
df = pacpac.cluster(df, <vh_amino_acid_sequence_column_name>, <vl_amino_acid_sequence_column_name>)
df = pacpac.probe(
<probe_vh_amino_acid_sequence>,
df,
<vh_amino_acid_sequence_column_name>,
<vl_amino_acid_sequence_column_name>,
<probe_vl_amino_acid_sequence>
)
pacpac cluster <path_to_csv_dataset> <vh_amino_acid_sequence_column_name>
pacpac probe <probe_vh_amino_acid_sequence> <path_to_csv_dataset> <vh_amino_acid_sequence_column_name>
help(pacpac.cluster)
help(pacpac.probe)
pacpac cluster --help
pacpac probe --help
- Sequence annotations operations by ANARCI (Dunbar and Deane, 2015).
- Deep learning model Parapred for paratope predictions (Liberis et al., 2018).
- Clusters using greedy clustering approach.
- Determinism is achieved by sorting the input data set by CDR lengths and paratope length for clonotype and paratope clustering, respectively, and amino acid sequence in a descending order.
- Each cluster has a representitive sequence as indicated by a keyword
seed
. - Clonotyping is done on the amino acid sequence level. Any silent mutations on nucleotide sequence level due to SHM are not taken into an account.
- Paratope probing and clustering provides several clustering options.
- If
structural_equivalence
is set toFalse
matches paratopes of equal CDR lengths only and assumes that CDRs of the same length always have deletions at the same position (Richardson et al., 2021). Useful in fast detection of similar paratopes. - When set to
True
(default) structurally equivalence as assigned by the numbering scheme is used (i.e. numbering residue positions are used for residue matching to allow for a comparison at structuraly equivalent positions) and assumes that CDRs of different lengths can have similar paratopes. Useful in detection of similar binding modes. - Sequence residues can be tokenized (
tokenize=True
) based on residue type groupings as described by Wong et al., 2021.
Task | Time (s) | Notes |
---|---|---|
Annotations using ANARCI | 378 | parallel execution |
Paratope prediction using Parapred | 207 | batch execution without CPU/GPU speed up for TensorFlow |
Clonotype clustering | 13 | on amino acid level |
Paratope clustering | 13 | structural_equivalence=False |
Paratope clustering | 130 | structural_equivalence=True |
Probing | <0.1 | clonotype & paratope |
ANARCI and Parapred can be speed up with more cores and/or CPU/GPU speed up instructions for Tensorflow.
Written by Aretas Gaspariunas. Have a question? You can always ask and I can always ignore.
- Dunbar and Deane, 2015
- Liberis et al., 2018
- Richardson et al., 2021
- Wong et al., 2021
If you found PaCPaC useful for your work please acknowledge it by citing this repository.
@software{aretas_gaspariunas_2021_4470165,
author = {Aretas Gaspariunas},
title = {{aretasg/pacpac: PaCPaC - Python package to probe and cluster antibody VH sequence paratopes and clonotypes}},
month = jan,
year = 2021,
publisher = {Zenodo},
version = {v0.1},
doi = {10.5281/zenodo.4470165},
url = {https://doi.org/10.5281/zenodo.4470165}
}
BSD license.