SPACE: STRING proteins as complementary embeddings

Code for the manuscript: "SPACE: STRING proteins as complementary embeddings", in which we precalculated:

cross-species network embeddings
ProtT5 sequence embeddings

for all eukaryotic proteins in STRING v12.0.

You can download all the embeddings from the STRING website:

protein.network.embeddings.v12.0.h5
protein.sequence.embeddings.v12.0.h5

Reproduce the results in the manuscript

Please follow this document.

How to Cite

If you use this work in your research, please cite our manuscript

@article {Hu2024.11.25.625140,
	author = {Hu, Dewei and Szklarczyk, Damian and Mering, Christian von and Jensen, Lars Juhl},
	title = {SPACE: STRING proteins as complementary embeddings},
	elocation-id = {2024.11.25.625140},
	year = {2024},
	doi = {10.1101/2024.11.25.625140},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/11/26/2024.11.25.625140},
	eprint = {https://www.biorxiv.org/content/early/2024/11/26/2024.11.25.625140.full.pdf},
	journal = {bioRxiv}
}

and the STRING database.

@article{szklarczyk2023string,
  title={The STRING database in 2023: protein--protein association networks and functional enrichment analyses for any sequenced genome of interest},
  author={Szklarczyk, Damian and Kirsch, Rebecca and Koutrouli, Mikaela and Nastou, Katerina and Mehryary, Farrokh and Hachilif, Radja and Gable, Annika L and Fang, Tao and Doncheva, Nadezhda T and Pyysalo, Sampo and others},
  journal={Nucleic acids research},
  volume={51},
  number={D1},
  pages={D638--D646},
  year={2023},
  publisher={Oxford University Press}
}

Usage of SPACE embeddings

To have the best prediction results, based on our test, it's better to concatenate the cross-species network embeddings and the ProtT5 sequence embeddings. (That is our SPACE embeddings mentioned in the manuscript.)

How to read the embedding files

Install the h5py package to read the single species embedding files. We provide examples of reading the cross-species (aligned) network embeddings. The ProtT5 sequence embedding files have the same format.

The following code reads the cross-species network embedding file 9606.protein.network.embeddings.v12.0.h5.

In Python:

pip install h5py

import h5py

filename = '9606.protein.network.embeddings.v12.0.h5'

with h5py.File(filename, 'r') as f:
    meta_keys = f['metadata'].attrs.keys()
    for key in meta_keys:
        print(key, f['metadata'].attrs[key])

  embedding = f['embeddings'][:]
  proteins = f['proteins'][:]

  # protein names are stored as bytes, convert them to strings
  proteins = [p.decode('utf-8') for p in proteins]

In R:
Install the rhdf5 package to read the embedding files. The following code reads the embedding file 9606.protein.network.embeddings.v12.0.h5.

# Install required packages if not already installed
# install.packages("rhdf5")

# Load the library
library(rhdf5)

filename <- '9606.protein.network.embeddings.v12.0.h5'

metadata <- h5readAttributes(filename, "metadata")
for (key in names(meta_keys)) {
    print(paste(key, meta_keys[[key]]))
}

embeddings <- h5read(filename, "embeddings")
proteins <- h5read(filename, "proteins")

Read the combined network embedding file of all eukaryotes with Python

import h5py

filename = 'protein.network.embeddings.v12.0.h5'

with h5py.File(filename, 'r') as f:
    meta_keys = f['metadata'].attrs.keys()
    for key in meta_keys:
        print(key, f['metadata'].attrs[key])
  
  species = '4932'  # if we check the brewer's yeast
  embeddings = f['species'][species]['embeddings'][:]
  proteins = f['proteins'][species]['embeddings'][:]

  # protein names are stored as bytes, convert them to strings
  proteins = [p.decode('utf-8') for p in proteins]

Read the combined file with R

library(rhdf5)

filename <- 'protein.network.embeddings.v12.0.h5'

meta_keys <- h5attributes(h5file$metadata)
for (key in names(meta_keys)) {
    print(paste(key, meta_keys[[key]]))
}

species <- '4932'  # for brewer's yeast
embeddings <- h5read(filename, paste0('species/', species, '/embeddings'))
proteins <- h5read(filename, paste0('species/', species, '/proteins'))

License

MIT.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
figures		figures
scripts		scripts
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
reproduce.md		reproduce.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SPACE: STRING proteins as complementary embeddings

Reproduce the results in the manuscript

How to Cite

Usage of SPACE embeddings

How to read the embedding files

License

About

Releases

Packages

Languages

License

deweihu96/SPACE

Folders and files

Latest commit

History

Repository files navigation

SPACE: STRING proteins as complementary embeddings

Reproduce the results in the manuscript

How to Cite

Usage of SPACE embeddings

How to read the embedding files

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages