qinchunyuan/clustering-file-reader

 
 

clustering-file-reader

Introduction

The clustering-file-reader is a Java API for processing .clustering files, a result file format for MS/MS-based spectrum clustering. .clustering files are currently used by the spectra-cluster API, the spectra-cluster-hadoop application and the spectra-cluster-cli application.

The spectra-cluster-hadoop application is currently used to create the PRIDE Cluster resource. The complete clustering results created as a basis of PRIDE Cluster are available for download in the .clustering format (ftp location).

The .clustering result file format is a compact text file format that contains all the information related to clusters. This can include the consensus spectrum, precursor details, and spectrum-related details. It is also possible to store the spectra's original peak lists within the .clustering file.

Getting started

Installation

You will need to have Maven installed in order to build and use the clustering-file-reader library.

Add the following snippets to your Maven pom.xml file:

<!-- clustering-file-reader dependency -->
<dependency>
    <groupId>uk.ac.ebi.pride.spectracluster</groupId>
    <artifactId>clustering-file-reader</artifactId>
    <version>${current.version}</version>
</dependency>
<!-- EBI repo -->
<repository>
    <id>pst-release</id>
    <url>http://www.ebi.ac.uk/Tools/maven/repos/content/repositories/pst-release</url>
</repository>

<!-- EBI SNAPSHOT repo -->
<snapshotRepository>
    <id>pst-snapshots</id>
    <url>http://www.ebi.ac.uk/Tools/maven/repos/content/repositories/pst-snapshots</url>
</snapshotRepository>

Running the library

The library supports three methods of reading a .clustering file:

  1. Reading all clusters in at once (only advisable for smaller files)
  2. Reading a .clustering file incrementally (optimised for very large result files)
  3. Random access to indexed .clustering files (also works with very large result files)

/**
 * Example reading a file in at once
 */
File myClusteringFile = new File("/tmp/test.clustering");

// create an instance of a ClusteringFileReader
IClusterSourceReader reader = new ClusteringFileReader(myClusteringFile);

// read all clusters
List<ICluster> clusters = reader.readAllClusters();

/**
 * Example processing a file incrementally.
 */
File myLargeClusteringFile = new File("/tmp/large_clustering_file.clustering");

// To process clusters incrementally, the respective classes must implement
// the IClusterSourceListener interface.
IClusterSourceListener myListener = new MyListener();

// multiple listeners can be added at once (e.g. one
// calculating the average cluster size, another
// writing out all consensus spectra)
List<IClusterSourceListener> listeners = new ArrayList<IClusterSourceListener>(1);
listeners.add(myListener);

// create the ClusteringFileReader
IClusterSourceReader reader = new ClusteringFileReader(myLargeClusteringFile);

// process the clusters incrementally
reader.readClustersIteratively(listeners);

/**
 * Example randomly accessing a file
 */
File myLargeClusteringFile = new File("/tmp/large_clustering_file.clustering");

// First, the file must be indexed. The index can also be saved to a file for faster
// re-use.
IIndexer indexer = new ClusteringFileIndexer();
ClusteringFileIndex index = indexer.indexFile(myLargeClusteringFile);

// create the ClusteringFileReader with the index
ClusteringFileReader reader = new ClusteringFileReader(myLargeClusteringFile, index);
ICluster cluster = reader.readCluster("c8ada97f-094d-409b-8651-3d2efc77dbea");

// save the index to a file
File indexFile = new File("/tmp/large_clustering_file.clustering.index");
index.saveToFile(indexFile);

// the index can now be loaded directly from this file
ClusteringFileIndex loadedIndex = ClusteringFileIndex.loadFromFile(indexFile);

Note on consensus spectra

Since version 1.0.11, the spectra-cluster API also records in the .clustering file how often each consensus peak was observed. For .clustering files created with earlier versions, the function getConsensusCountValues() returns an empty list.

File format specification

The ".clustering" file format is text based.

The first lines contain an optional header specifying properties of the algorithm and the sample set. Each line contains one property where the property's name is separated by an "=" from the value.
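
The header layout described above can be illustrated with a minimal sketch. It does not use the clustering-file-reader API: the class HeaderSketch, the method parseHeader and the property names in main are hypothetical and exist only for illustration. The sketch assumes that the header ends at the first "=Cluster=" line and that lines without a property name can be skipped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the clustering-file-reader API. */
final class HeaderSketch {

    /** Collects name=value lines until the first "=Cluster=" line is reached. */
    static Map<String, String> parseHeader(List<String> lines) {
        Map<String, String> properties = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.equals("=Cluster=")) {
                break; // the header ends where the first cluster starts
            }
            int separator = line.indexOf('=');
            if (separator <= 0) {
                continue; // assumption: blank or malformed lines are ignored
            }
            // split on the first "=" only, so values may contain "=" themselves
            properties.put(line.substring(0, separator), line.substring(separator + 1));
        }
        return properties;
    }

    public static void main(String[] args) {
        // The two header properties below are made up for this example.
        List<String> lines = List.of(
                "example_property=some value",
                "another_property=0.99",
                "=Cluster=",
                "id=197b4666-4e7e-4d61-b1a1-e032b1e15aa7");
        System.out.println(parseHeader(lines));
    }
}
```

Splitting on the first "=" only keeps values intact even if they contain further "=" characters.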

Defining clusters

Clusters start with the line "=Cluster=".

The next lines contain the cluster's properties, one property per line where the property's name is separated by an "=" from the value. Cluster properties are:

  1. id: the cluster's id
  2. av_precursor_mz: the average precursor m/z
  3. av_precursor_intens: the average precursor intensity
  4. consensus_peak_counts: the number of times a peak was observed (only since spectra-cluster version 1.0.11)
  5. sequence: list of sequences of the peptides identified in the cluster in the format "[{sequence}:{count}]"
  6. consensus_mz: ',' delimited m/z values of the consensus spectrum
  7. consensus_intens: ',' delimited intensity values of the consensus spectrum
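
To make the property layout concrete, the following sketch reads the property lines of one cluster into a map and converts the ',' delimited consensus values into arrays. It is a standalone illustration, not the library's implementation: ClusterBlockSketch, parseProperties and parseDoubleList are hypothetical names, and the sketch assumes that the property lines end at the first line starting with "SPEC".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the clustering-file-reader API. */
final class ClusterBlockSketch {

    /** Reads the name=value property lines of a single cluster. */
    static Map<String, String> parseProperties(List<String> clusterLines) {
        Map<String, String> properties = new LinkedHashMap<>();
        for (String line : clusterLines) {
            if (line.equals("=Cluster=")) {
                continue; // marker line that starts the cluster
            }
            if (line.startsWith("SPEC")) {
                break; // assumption: spectrum lines follow the properties
            }
            int separator = line.indexOf('=');
            if (separator > 0) {
                properties.put(line.substring(0, separator), line.substring(separator + 1));
            }
        }
        return properties;
    }

    /** Converts a ',' delimited list such as consensus_mz into a double array. */
    static double[] parseDoubleList(String value) {
        if (value == null || value.isEmpty()) {
            return new double[0];
        }
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .mapToDouble(Double::parseDouble)
                .toArray();
    }

    public static void main(String[] args) {
        // Consensus lists are shortened to three values for this example.
        List<String> lines = List.of(
                "=Cluster=",
                "id=197b4666-4e7e-4d61-b1a1-e032b1e15aa7",
                "av_precursor_mz=357.221",
                "consensus_mz=114.109,115.106,120.076",
                "consensus_intens=292.16,272.41,2241.61");
        Map<String, String> properties = parseProperties(lines);
        double[] mz = parseDoubleList(properties.get("consensus_mz"));
        double[] intensities = parseDoubleList(properties.get("consensus_intens"));
        // m/z and intensity lists describe the same peaks, so their lengths should match
        System.out.println(properties.get("id") + ": " + mz.length + " m/z values, "
                + intensities.length + " intensity values");
    }
}
```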

Defining spectra in clusters

Each spectrum is defined on a single line of 'tab' delimited fields. A spectrum line must start with the term "SPEC". The fields that follow are:

  1. The spectrum's id
     • The spectrum id supports a special format to encode more detailed information about the spectrum's origin: #file=test.mgf#id=index=120#title=The original title. The id= field should contain the spectrum's id according to the PSI convention for formatting ids in peak list files (see mzTab specification as an example).
  2. Whether this spectrum was identified as the most common peptide in the cluster ("true" / "false")
  3. The identified sequence. If multiple ranks are reported, sequences must be sorted by rank and delimited by a ","
  4. The spectrum's precursor m/z
  5. The spectrum's charge
  6. Species (taxid), ',' delimited
  7. Modifications in the format "[position]-[accession]". Multiple modifications must be separated by a ",". If multiple PSMs are reported, these modification groups must be separated by a ";".
  8. The similarity of the spectrum (based on the used similarity metric) to the cluster's consensus spectrum
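
The special spectrum id format can be decoded with a few lines of code. The sketch below is an illustration only and not the library's own parser: SpectrumIdSketch and parseSpectrumId are hypothetical names, and the sketch assumes that "#" does not occur inside any of the values and that ids not starting with "#" are plain ids without encoded fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the clustering-file-reader API. */
final class SpectrumIdSketch {

    /**
     * Splits an id such as "#file=test.mgf#id=index=120#title=The original title"
     * into its fields. Returns an empty map for ids that do not use this format.
     */
    static Map<String, String> parseSpectrumId(String spectrumId) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (!spectrumId.startsWith("#")) {
            return fields; // plain id without encoded fields
        }
        // assumption: "#" never occurs inside a value
        for (String part : spectrumId.substring(1).split("#")) {
            int separator = part.indexOf('=');
            if (separator > 0) {
                // split on the first "=" only, so "id=index=120" yields "index=120"
                fields.put(part.substring(0, separator), part.substring(separator + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields =
                parseSpectrumId("#file=test.mgf#id=index=120#title=The original title");
        System.out.println(fields);
    }
}
```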

It is possible to add a spectrum's peak list to the clustering file. To do this, the spectrum definition line ("SPEC..." line) is followed by a "SPEC_MZ" and "SPEC_INTENS" line. These lines contain the spectrum's m/z and intensity values respectively as ',' separated lists.

Example

=Cluster=
id=197b4666-4e7e-4d61-b1a1-e032b1e15aa7
av_precursor_mz=357.221
av_precursor_intens=1.0
sequence=[GIFAFVK,GIFAFVK:3]
consensus_mz=114.109,115.106,120.076,...
consensus_intens=292.16,272.41,2241.61,...
consensus_peak_counts=1,4,2,4,...
SPEC	PXD000732;MFerrer_PAO1_2013.xml;spectrum=3121	true	GIFAFVK,GIFAFVK	357.22116	3	287	7-MOD:01499,0-MOD:01499;7-MOD:01499,0-MOD:01499	0.9987784157651239
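
As a rough illustration of the field order, the SPEC line from the example above could be split as follows. This is a standalone sketch under stated assumptions, not the way the library itself parses the file: SpecLineSketch and its field names are hypothetical, and the sketch assumes that every SPEC line carries all eight fields after the "SPEC" term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of the clustering-file-reader API. */
final class SpecLineSketch {
    final String id;
    final boolean mostCommonPeptide;
    final List<String> sequences;            // sorted by rank
    final double precursorMz;
    final int charge;
    final List<String> species;              // taxids
    final List<List<String>> modifications;  // one inner list per PSM
    final double similarity;

    private SpecLineSketch(String[] fields) {
        id = fields[1];
        mostCommonPeptide = Boolean.parseBoolean(fields[2]);
        sequences = splitList(fields[3], ",");
        precursorMz = Double.parseDouble(fields[4]);
        charge = Integer.parseInt(fields[5]);
        species = splitList(fields[6], ",");
        modifications = new ArrayList<>();
        for (String psmGroup : splitList(fields[7], ";")) {
            modifications.add(splitList(psmGroup, ","));
        }
        similarity = Double.parseDouble(fields[8]);
    }

    /** Parses one tab-delimited SPEC line; assumes all eight fields are present. */
    static SpecLineSketch parse(String line) {
        String[] fields = line.split("\t", -1); // -1 keeps empty trailing fields
        if (fields.length < 9 || !fields[0].equals("SPEC")) {
            throw new IllegalArgumentException("Not a complete SPEC line: " + line);
        }
        return new SpecLineSketch(fields);
    }

    private static List<String> splitList(String value, String delimiter) {
        if (value.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(value.split(delimiter));
    }

    public static void main(String[] args) {
        SpecLineSketch spec = parse("SPEC\tPXD000732;MFerrer_PAO1_2013.xml;spectrum=3121\ttrue\t"
                + "GIFAFVK,GIFAFVK\t357.22116\t3\t287\t"
                + "7-MOD:01499,0-MOD:01499;7-MOD:01499,0-MOD:01499\t0.9987784157651239");
        System.out.println(spec.id + " charge=" + spec.charge
                + " modifications=" + spec.modifications);
    }
}
```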

Getting help

If you have questions or need additional help, please contact the PRIDE Help desk at the EBI.

email: pride-support at ebi.ac.uk (replace at with @).

Giving your feedback

Please give us your feedback, including error reports, suggestions for improvements, and new feature requests. You can do so by opening a new issue in our issues section.

How to cite

Please cite this library using one of the following publications:

  • Griss J, Foster JM, Hermjakob H, Vizcaíno JA. PRIDE Cluster: building the consensus of proteomics data. Nature methods. 2013;10(2):95-96. doi:10.1038/nmeth.2343.

Contribute

We welcome all contributions submitted as pull requests.

License

This project is available under the Apache 2 open source software (OSS) license.
