sparrow: a tool for integrative analysis and prediction from protein sequence data

Major version 0.2.3

Overview

SPARROW (Sequence PARameters for RegiOns in Windows) is our next-generation package for calculating and predicting amino acid sequence features. It is meant as a lightweight object-oriented framework for working with protein sequences that integrates both direct sequence calculations and per-residue predictions from deep learning models trained using PARROT.

Our goal is for SPARROW to be easy for anyone to add new sequence analysis plugins into, creating a simple, extendable framework for de novo sequence feature characterization.

SPARROW is in active development. As of version 0.2 a few things are considered mature - notably the ALBATROSS implementation in SPARROW is robust and is good to be used, as are many of the bulk sequence property calculators.

In general, we recommend using localCIDER for general sequence analysis, but the ALBATROSS predictors in SPARROW are ready for public use.

Use and distribution

SPARROW is still under active development until the first major release; however, the core API has mostly been decided upon.

Installation

Installation can be done via pip directly from GitHub!!!

pip install git+https://[email protected]/idptools/sparrow.git

Usage

General

sparrow gives you a window into protein sequence properties in a lightweight, just-in-time way. Standard pattern for analysis is:

from sparrow import Protein

my_cool_protein = Protein('THISISAMINAACIDSEQWENCE')

The object my_cool_protein is now available and has a large collection of attributes and functions that can be accessed via standard dot notation, e.g.

print(my_cool_protein.FCR)
print(my_cool_protein.NCPR)
print(my_cool_protein.hydrophobicity)

Much more extensive documentation is coming, and for now, to see the functions take a look at the code directly here

In general, SPARROW is written in a Protein-centric way - i.e., all functions emerge from the Protein object.

Reading in FASTA files

One non-obvious thing is if you have a FASTA file you can read it into a dictionary of sparrow.Protein objects using:

from sparrow import read_fasta
protein_dictionary = read_fasta('my_fasta_file.fasta')

protein_dictionary is then a dictionary with key/value pairs for the Protein objects. read_fasta() accepts the same keyword arguments as protfasta.read_fasta so check that out.

ALBATROSS in SPARROW

To use ALBATROSS functionality in SPARROW there are two routes one can take - single protein prediction or batch prediction.

Single protein predictions

For single protein prediction, one can predict specific features using the following syntax:

from sparrow import Protein

P = Protein('MKYLAAYLLLNAAGNTPDATKIKAILESVGIEIEDEKVSSVLSALEGKSVDELITEGNEKLAAVPAAGPASAGGAAAASGDAAAEEEKEEEAAEESDDDMGFGLFD')

print(P.predictor.asphericity())
	
print(P.predictor.radius_of_gyration())
print(P.predictor.radius_of_gyration(use_scaled=True))

print(P.predictor.end_to_end_distance(use_scaled=True))
print(P.predictor.end_to_end_distance(use_scaled=False))

print(P.predictor.scaling_exponent())
print(P.predictor.prefactor())

Note that Rg and Re can be calculated using the use_scaled flag, which if used means we calculate on a network trained on Rg/Sqrt(N) and Re/Sqrt(N) data. We found this works better for shorter sequences.

Batch predictions

ALBATROSS also affords a batch mode which on GPUs enables 1000s of seconds in a few seconds.

Batch prediction can be obtained via

from sparrow.predictors import batch_predict

P = Protein('MKYLAAYLLLNAAGNTPDATKIKAILESVGIEIEDEKVSSVLSALEGKSVDELITEGNEKLAAVPAAGPASAGGAAAASGDAAAEEEKEEEAAEESDDDMGFGLFD')

# dictionary with one sequence, but in general, you'd probably
# want to pass in many...xs
input_seqs = {1:P}

# run batch prediction
return_dict = batch_predict.batch_predict({1:P}, network='re')

The return dict is a dictionary of sequence-to-value mapping, and you can select one of the standard networks for doing batch prediction:

rg
scaled_rg
re
scaled_re
prefactor
scaling_exponent
asphericity

The benefits from parallelization on both GPUs but also on CPUs, i.e. proteome-scale analysis is highly accessible.

NOTE - we are still actively working on the batch predictor so various things, including the format of the return data and/or input data, may well change. As such you may prefer for now to use the single-sequence prediction mode and loop in a for-loop.

If you use ALBATROSS, please cite:

Lotthammer, J. M.; Ginell, G. M.; Griffith, D.; Emenecker, R. J.; Holehouse, A. S. Direct Prediction of Intrinsically Disordered Protein Conformational Properties from Sequence. Nat. Methods 2024, 1–12.

Roadmap

An initial public version of SPARROW was released in June 2022 to enable existing tools developed by the Holehouse lab to use this code. This version is not meant for those outside the Holehouse lab to take and use (see Use and distribution for their own safety!).

A full public release is planned for spring of 2024.

Changelog

June 2024 (version 0.2.3 release)

Cythonized SHD/SCD and IWD clustering sequence parameters

Sept 2023 (version 0.2.2 release)

Updated low complexity domain identification code.
Added ability to grey out sequences in show_sequence() [h/t Garrett!].

May 2023 (version 0.2.1 release)

Update to ALBATROSS v2 networks (all networks use the v2 by default both in individual predictors and batch predictions
Re-wrote much of batch_predict() code. Changes here include
- Implementation of the size-collect algorithm to ensure the passed batchsize does not impact the accuracy of predictions. Batch prediction can now use larger batch sizes, providing better performance on both GPUs and CPUs
- Set default batch size to 32
- Improved robustness of input types batch_predict() can accept. Can now take dictionaries and lists of sparrow.protein.Protein objects or dictionaries/lists of sequences.
- Changed order of input parameters for batch_predict(), such that now the only required options are [0] Input list/dictionary and [2] name of the network to be used.
- Updated return type for batch_predict() such that now the return type by default is a dictionary that maps input IDs (or list positions) to sequence and prediction. The original return behavior (a dictionary that maps sequence to prediction) can be obtained if the return_seq2prediction flag is set to True.
- Wrote much more extensive tests for all batch_predict() code
- Ensure batch_predict() guarantees the same return order as the input order if possible. The only exception is if return_seq2prediction=True and duplicate sequences are found in the input data, in which case only the first occurrence of a given sequence is included. Also order here refers to the order if the return dictionary had dict.values() called.
Added scaled-network for small sequences: In the course of testing the networks we noticed that in both V1 and V2, when sequences are short (<30-40 amino acids) the non-scaled Re and Rg predictors can return non-sensical results. In contrast, the scaled_rg and scaled_re networks show reasonable and reproducible polymeric behavior for these smaller sequences. To address this, in both single sequence predictions and batch predict, by default, even if an rg or re network is requested, if the sequence is less than 35 residues long, we force the scaled_rg or scaled_re networks. This can be over-ridden by setting the 'safe' keyword in either batch_predict() or the single sequence radius_of_gyration() or end_to_end_distance()
Technical change: the end-to-end distance predictor module found under sparrow/predictors was renamed from re to e2e to prevent clashing with Python's regular expression (re) package. This does not introduce any errors, but makes debugging predictors challenging. The actual network name is retained as re.
Default predictor for Protein.predictor.radius_of_gyration() and Protein.predictor.end_to_end_distance() use use_scaled=True as a default, based on accuracy of the V2 networks.

May 2023 (version 0.2 release)

First major 'alpha' release to coincide with ALBATROSS preprint

Feb 2023

Building early version of ALBATROSS

Nov 2022 (0.

Major set of updates to fix errors in how kappa is calculated
Added SCD and SHD functions
Added some tests
Moved scd.py into sparrow.patterning
Updated IWD for bivariate charge clustering (h/t Garrett Ginell)
Fixed some tests(h/t Jeff Lotthammer)
Improved docstrings

July 2022

Moved to idptools! If you had previously cloned sparrow, you can update your git remote location using:
```
  git remote set-url origin [email protected]:idptools/sparrow.git
```
Updated requirement for Python 3.7 or 3.8

Name		Name	Last commit message	Last commit date
Latest commit History 175 Commits
.github		.github
devtools		devtools
docs		docs
examples		examples
sparrow		sparrow
.codecov.yml		.codecov.yml
.gitignore		.gitignore
.lgtm.yml		.lgtm.yml
.readthedocs.yml		.readthedocs.yml
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
readthedocs.yml		readthedocs.yml
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sparrow: a tool for integrative analysis and prediction from protein sequence data

Major version 0.2.3

Overview

Use and distribution

Installation

Usage

General

Reading in FASTA files

ALBATROSS in SPARROW

Single protein predictions

Batch predictions

Roadmap

Changelog

June 2024 (version 0.2.3 release)

Sept 2023 (version 0.2.2 release)

May 2023 (version 0.2.1 release)

May 2023 (version 0.2 release)

Feb 2023

Nov 2022 (0.

July 2022

Copyright

Acknowledgements

About

Releases

Packages

Contributors 3

Languages

License

idptools/sparrow

Folders and files

Latest commit

History

Repository files navigation

sparrow: a tool for integrative analysis and prediction from protein sequence data

Major version 0.2.3

Overview

Use and distribution

Installation

Usage

General

Reading in FASTA files

ALBATROSS in SPARROW

Single protein predictions

Batch predictions

Roadmap

Changelog

June 2024 (version 0.2.3 release)

Sept 2023 (version 0.2.2 release)

May 2023 (version 0.2.1 release)

May 2023 (version 0.2 release)

Feb 2023

Nov 2022 (0.

July 2022

Copyright

Acknowledgements

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages