Protein language models such as the transformer-based Evolutionary Scale Modeling 2 (ESM2) can offer deep insights into the evolutionary and structural properties of proteins. While larger models, such as ESM2 15B, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and high computational cost. We systematically evaluated the performance of all ESM2 models across many biological datasets to determine the impact of model size on transfer learning. Surprisingly, larger models do not always outperform smaller ones, especially when data is limited. Medium-sized models, such as ESM2 650M, exhibited consistent performance, falling only slightly behind the 15B-parameter model despite being over 20 times smaller. Additionally, we compared various methods of embedding compression to identify the most effective approach, and we found that mean embeddings consistently outperformed other compression methods. Our results show that ESM2 650M with mean embeddings offers an optimal balance between performance and efficiency, making it a practical and scalable choice for transfer learning in a variety of biological applications.
This work challenges the common belief that larger language models always yield better results, here in the context of protein biochemistry. By systematically comparing transformer models of different sizes in transfer learning tasks, we demonstrate that medium-sized models, such as ESM2 650M, frequently perform as well as larger variants, especially when data is limited. These findings provide a more efficient strategy for machine learning-based protein analysis and promote the broader accessibility of AI in biology. Smaller, more efficient models can help democratize advanced machine-learning tools, making them more accessible to researchers with limited computational resources.
Keywords: ESM2 | pLM Embeddings | Feature compression | Transfer Learning
- Setting up the environment:
# clone this repository
git clone [email protected]:ziul-bio/SWAT.git
# move into the repository
cd SWAT
# create a Python 3.10 or higher virtual environment and activate it
python3.10 -m venv venv
source venv/bin/activate
# install our version of the ESMC, modified to guarantee reproducibility. See methods.
pip install esm/
# install remaining dependencies
pip install fair-esm
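To confirm the installation works, the snippet below loads the 150M-parameter ESM2 model through the fair-esm API and embeds a toy sequence. This is only an illustrative sketch; the sequence and variable names are ours, not part of the repository.

import torch
import esm

# load ESM2 150M and its alphabet via fair-esm
model, alphabet = esm.pretrained.esm2_t30_150M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# embed a toy sequence and inspect the final-layer (layer 30) representations
data = [("toy_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[30])
print(out["representations"][30].shape)  # (1, seq_len + 2, 640): BOS + residues + EOS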
- Extract embeddings:
# Once we have all the FASTA files and metadata, we can extract the embeddings for each FASTA file.
python scripts/extract.py esm2_t30_150M_UR50D data/DMS_mut_sequences/BLAT_ECOLX_Ostermeier2014_muts.fasta embeddings/DMS/BLAT_ECOLX_Ostermeier2014_esm2_150M --repr_layers 30 --include bos mean per_tok
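Assuming scripts/extract.py keeps the output layout of the original fair-esm extract.py, each sequence gets its own .pt file holding the requested representations keyed by layer. A minimal sketch of reading one back (the file name is hypothetical):

import torch

# hypothetical per-sequence output written by the extraction step above
emb = torch.load("embeddings/DMS/BLAT_ECOLX_Ostermeier2014_esm2_150M/variant_0.pt")

print(emb["label"])                          # sequence identifier from the FASTA header
per_tok = emb["representations"][30]         # (seq_len, 640) per-residue embeddings
mean_emb = emb["mean_representations"][30]   # (640,) mean over residues
bos_emb = emb["bos_representations"][30]     # (640,) BOS-token embedding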
- Compress embeddings:
# Then we can compress the embeddings with the following command
python scripts/compressing_embeddings.py -e "embeddings/DMS/esm2_150M/BLAT_ECOLX_Ostermeier2014/" -o "embeddings/DMS_compressed/esm2_150M/BLAT_ECOLX_Ostermeier2014/" -c mean -l 30
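The -c mean option reduces each sequence's per-residue embeddings to a single fixed-length vector by averaging over positions, the compression strategy that performed best in our comparisons. Below is a minimal sketch of that idea, assuming fair-esm-style .pt inputs and a simple dict pickle as output (both are assumptions, not necessarily the script's actual format):

import glob
import pickle
import torch

compressed = {}
for path in glob.glob("embeddings/DMS/esm2_150M/BLAT_ECOLX_Ostermeier2014/*.pt"):
    emb = torch.load(path)
    per_tok = emb["representations"][30]                     # (seq_len, 640)
    compressed[emb["label"]] = per_tok.mean(dim=0).numpy()   # (640,) mean embedding

# one pickle with a fixed-length vector per sequence
with open("embed_layer_30_mean.pkl", "wb") as fh:
    pickle.dump(compressed, fh)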
- Regression Model:
# With the compressed embeddings we can run the regression model; see the script for more details.
python scripts/run_reg_LassoCV.py -i embeddings/DMS_compressed/esm2_150M/BLAT_ECOLX_Ostermeier2014/embed_layer_30_mean.pkl -m data/DMS_metadata/BLAT_ECOLX_Ostermeier2014_metadata.csv -o results/lassoCV/DMS/esm2_150M/BLAT_ECOLX_Ostermeier2014_esm2_150M_mean.csv
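The name run_reg_LassoCV.py indicates a scikit-learn LassoCV regression of the DMS fitness scores in the metadata on the compressed embeddings. A minimal sketch of such a model, with hypothetical column names and an assumed dict-of-vectors pickle format:

import pickle
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# compressed embeddings: assumed to map sequence id -> fixed-length vector
with open("embeddings/DMS_compressed/esm2_150M/BLAT_ECOLX_Ostermeier2014/embed_layer_30_mean.pkl", "rb") as fh:
    embeddings = pickle.load(fh)

# metadata with one fitness score per variant; column names here are hypothetical
meta = pd.read_csv("data/DMS_metadata/BLAT_ECOLX_Ostermeier2014_metadata.csv")
X = np.stack([embeddings[i] for i in meta["mutant"]])
y = meta["fitness"].to_numpy()

# fit a cross-validated Lasso and score held-out variants by Spearman correlation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
reg = LassoCV(cv=5, n_alphas=50).fit(X_tr, y_tr)
rho, _ = spearmanr(reg.predict(X_te), y_te)
print(f"Spearman rho on held-out variants: {rho:.3f}")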