Skip to content
/ SWAT Public

Size-Wise Analysis of Transfer Learning using ESM2 Embeddings

Notifications You must be signed in to change notification settings

ziul-bio/SWAT

Repository files navigation

author author The Wilke Lab GPLv3 license

Scaling Down for Efficiency: Medium-Sized Transformer Models for Protein Sequence Transfer Learning

plot

About the project:

Abstract

Protein language models such as the transformer-based Evolutionary Scale Modeling 2 (ESM2) can offer deep insights into evolutionary and structural properties of proteins. While larger models, such as ESM2 15B, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and high computational cost. We systematically evaluated the performance of all ESM2 models across many biological datasets to determine the impact of model size on transfer learning. Surprisingly, larger models do not always outperform smaller ones, especially when data is limited. Medium sized models, such as ESM2 650M, exhibited consistent performance, falling only slightly behind the 15B parameter model despite being over 20 times smaller. Additionally, we compared various methods of embedding compression to identify the most effective approach, and we found that mean embeddings consistently outperformed other compression methods. Our results show that ESM2 650M with mean embeddings offers an optimal balance between performance and efficiency, making it a practical and scalable choice for transfer learning in a variety of biological applications.

Significance

This work challenges the common belief that larger language models always yield better results, here in the context of protein biochemistry. By systematically comparing transformer models of different sizes in transfer learning tasks, we demonstrate that medium size models, such as ESM2 650M, frequently perform as well as larger variants, specially when data is limited. These findings provide a more efficient strategy for machine learning-based protein analysis and promote the broader accessibility of AI in biology. Smaller, more efficient models can help democratize advanced machine-learning tools, making them more accessible to researchers with limited computational resources

Keywords: ESM2 | pLM Embeddings | Feature compression | Transfer Learning

Reproducibility

Example of how to reproduce our results

  1. Setting up the enviroment:
# clone this repository
git clone [email protected]:ziul-bio/SWAT.git

# move in
cd SWAT

# create a python3.10 or higher virtual environment
python3.10 -m venv venv

# install our version of the ESMC, modified to garantee reproducibility. See methods.
pip install esm/

# install remaning dependencies
pip install fair-esm
  1. Extract embeddings:
# Once we have all the fasta files and metadata we can extract the embeddings for each fasta.
python scripts/extract.py esm2_t30_150M_UR50D data/DMS_mut_sequences/BLAT_ECOLX_Ostermeier2014_muts.fasta embeddings/DMS/BLAT_ECOLX_Ostermeier2014_esm2_150M --repr_layers 30 --include bos mean per_tok
  1. Compress embeddings:
# Then we can compress the embeddings with the following command
python scripts/compressing_embeddings.py -e "embeddings/DMS/esm2_150M/BLAT_ECOLX_Ostermeier2014/" -o "embeddings/DMS_compressed/esm2_150M/BLAT_ECOLX_Ostermeier2014/" -c mean -l 30
  1. Regression Model:
# with the compressed embedding we can run the regression model, see script for more details
python scripts/run_reg_LassoCV.py -i embeddings/DMS_compressed/esm2_150M/BLAT_ECOLX_Ostermeier2014/embed_layer_30_mean.pkl -m data/DMS_metadata/BLAT_ECOLX_Ostermeier2014_metadata.csv -o results/lassoCV/DMS/esm2_150M/BLAT_ECOLX_Ostermeier2014_esm2_150M_mean.csv

About

Size-Wise Analysis of Transfer Learning using ESM2 Embeddings

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published