Replace data backend and increase flexibility. (#39)
wfondrie authored Oct 10, 2023
1 parent 54f36bf commit 2079903
Showing 25 changed files with 1,836 additions and 1,114 deletions.
17 changes: 17 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -6,6 +6,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

We have completely reworked the data module.
Depthcharge now uses Apache Arrow-based formats instead of HDF5; spectra are either converted to Parquet or streamed with PyArrow, optionally into Lance datasets.

### Breaking Changes
- Mass spectrometry data parsers now function as iterators, yielding batches of spectra as `pyarrow.RecordBatch` objects.
- Parsers can now be told to read arbitrary fields from their respective file formats with the `custom_fields` parameter.
- The parsing functionality of `SpectrumDataset` and its subclasses has been moved to the `spectra_to_*` functions in the data module.
- `SpectrumDataset` and its subclasses now return dictionaries of data rather than a tuple of data. This allows us to incorporate arbitrary additional data.
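
To illustrate the batching behavior described above, here is a minimal sketch of an iterator that yields column-oriented batches. It uses only the standard library: plain dictionaries of lists stand in for `pyarrow.RecordBatch` objects, and the `iter_batches` helper and the field names (`scan_id`, `precursor_mz`, `precursor_charge`) are hypothetical, not part of the Depthcharge API.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def iter_batches(
    spectra: Iterable[Dict[str, Any]], batch_size: int
) -> Iterator[Dict[str, List[Any]]]:
    """Yield column-oriented batches of at most `batch_size` spectra.

    Hypothetical helper: each batch is a dict of lists, a plain-Python
    stand-in for a record batch with one column per field.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    iterator = iter(spectra)
    while True:
        rows = list(islice(iterator, batch_size))
        if not rows:
            return
        # Pivot the row dicts into one list per field.
        yield {key: [row[key] for row in rows] for key in rows[0]}


# Three made-up spectra, each carrying the same (hypothetical) fields.
spectra = [
    {"scan_id": 1, "precursor_mz": 500.25, "precursor_charge": 2},
    {"scan_id": 2, "precursor_mz": 612.80, "precursor_charge": 2},
    {"scan_id": 3, "precursor_mz": 804.77, "precursor_charge": 3},
]

batches = list(iter_batches(spectra, batch_size=2))
```

Because each batch is keyed by field name, an extra field only adds a key; downstream code that looks up the fields it needs keeps working, which is the same reason dictionaries are more flexible than tuples as dataset outputs.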

### Added
- Added the `StreamingSpectrumDataset` for fast inference.
- Added `spectra_to_df`, `spectra_to_parquet`, and `spectra_to_stream` to the `depthcharge.data` module.

### Changed
- Determining the mass spectrometry data file format is now less fragile.
It now looks for known line contents, rather than relying on the extension.
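
Content-based detection of this kind can be sketched as follows. This is an illustration under stated assumptions, not Depthcharge's actual implementation: the `sniff_format` function and the `MARKERS` table are hypothetical, and the markers are simply strings that characteristically appear in mzML, mzXML, and MGF files.

```python
import tempfile
from pathlib import Path

# Substrings that identify each format (hypothetical table for this sketch).
MARKERS = {
    "mzml": "<mzML",
    "mzxml": "<mzXML",
    "mgf": "BEGIN IONS",
}


def sniff_format(path, max_lines=100):
    """Guess a file's format from its first lines, ignoring the extension."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # Only inspect the first `max_lines` lines of the file.
        for _, line in zip(range(max_lines), handle):
            for name, marker in MARKERS.items():
                if marker in line:
                    return name
    raise ValueError(f"Unrecognized file format: {path}")


# An MGF file with a misleading extension is still recognized.
with tempfile.TemporaryDirectory() as tmp:
    mislabeled = Path(tmp) / "spectra.txt"
    mislabeled.write_text("BEGIN IONS\nTITLE=example\n100.0 1.0\nEND IONS\n")
    detected = sniff_format(mislabeled)

print(detected)
```

Limiting the scan to the first lines keeps detection cheap for large files, at the cost of missing a marker that appears unusually late.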

## [v0.3.1] - 2023-08-18
### Added
- Support for fine-tuning the wavelengths used for encoding floating point numbers like m/z and intensity to the `FloatEncoder` and `PeakEncoder`.
1 change: 0 additions & 1 deletion CODE_OF_CONDUCT.md
@@ -120,4 +120,3 @@ version 2.0, available at

[homepage]: https://www.contributor-covenant.org
[v2.0]: https://www.contributor-covenant.org/version/2/0/code_of_conduct.html

8 changes: 8 additions & 0 deletions data/README.md
@@ -0,0 +1,8 @@
# Data for testing and examples

This directory contains real data to be used in tests and examples within the Depthcharge documentation.
Currently, all files originate from [PXD000001](http://central.proteomexchange.org/cgi/GetDataset?ID=PXD000001).

## Notes

- [TMT10-Trial-8.mzML](TMT10-Trial-8.mzML) was modified manually such that one "charge state" CV accession was changed to a "possible charge state" CV accession.
4 changes: 2 additions & 2 deletions data/TMT10-Trial-8.mzML
@@ -83,7 +83,7 @@
</dataProcessingList>
<run id="TMT10-Trial-8" defaultInstrumentConfigurationRef="IC1" startTimeStamp="2018-01-26T22:53:26.1262695Z" defaultSourceFileRef="RAW1">
<spectrumList count="11" defaultDataProcessingRef="pwiz_Reader_conversion">
-<spectrum index="0" id="controllerType=0 controllerNumber=1 scan=500" defaultArrayLength="483">
+<spectrum index="0" id="index=500" defaultArrayLength="483">
<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
<cvParam cvRef="MS" accession="MS:1000579" name="MS1 spectrum" value=""/>
<cvParam cvRef="MS" accession="MS:1000130" name="positive scan" value=""/>
@@ -157,7 +157,7 @@
<selectedIonList count="1">
<selectedIon>
<cvParam cvRef="MS" accession="MS:1000744" name="selected ion m/z" value="804.774963378906" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
-<cvParam cvRef="MS" accession="MS:1000041" name="charge state" value="3"/>
+<cvParam cvRef="MS" accession="MS:1000633" name="possible charge state" value="3"/>
<cvParam cvRef="MS" accession="MS:1000042" name="peak intensity" value="11531.6333007813" unitCvRef="MS" unitAccession="MS:1000131" unitName="number of detector counts"/>
</selectedIon>
</selectedIonList>
39 changes: 25 additions & 14 deletions depthcharge/__init__.py
@@ -1,17 +1,28 @@
"""Initialize the depthcharge package."""
-from . import (
-    data,
-    encoders,
-    feedforward,
-    tokenizers,
-    transformers,
-)
-from .primitives import (
-    MassSpectrum,
-    Molecule,
-    Peptide,
-    PeptideIons,
-)
-from .version import _get_version
+# Ignore a bunch of pkg_resources warnings from dependencies:
+import warnings
+
+with warnings.catch_warnings():
+    for module in ["psims", "pkg_resources"]:
+        warnings.filterwarnings(
+            "ignore",
+            category=DeprecationWarning,
+            module=module,
+        )
+
+    from . import (
+        data,
+        encoders,
+        feedforward,
+        tokenizers,
+        transformers,
+    )
+    from .primitives import (
+        MassSpectrum,
+        Molecule,
+        Peptide,
+        PeptideIons,
+    )
+    from .version import _get_version

__version__ = _get_version()
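
The scoped-filter pattern used in this file can be tried in isolation. The following is a minimal standalone sketch, assuming a hypothetical `legacy_call` function that emits a `DeprecationWarning`; it filters by category only, whereas the package code above also restricts the filter by module name.

```python
import warnings


def legacy_call():
    """Hypothetical function standing in for a noisy dependency."""
    warnings.warn("legacy_call is deprecated", DeprecationWarning, stacklevel=2)
    return 42


def call_and_count(suppress):
    """Call `legacy_call` and count the warnings that were not filtered out."""
    # `record=True` collects emitted warnings in a list instead of printing
    # them; the previous filter state is restored when the block exits.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # start from a known filter state
        if suppress:
            # Inserted ahead of the "always" filter, so it takes precedence.
            warnings.filterwarnings("ignore", category=DeprecationWarning)
        value = legacy_call()
    return value, len(caught)


suppressed = call_and_count(suppress=True)
unsuppressed = call_and_count(suppress=False)
```

Because `catch_warnings` restores the filter list on exit, the "ignore" filter only affects warnings raised inside the `with` block, which is why imports that should stay quiet need to run within it.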
12 changes: 11 additions & 1 deletion depthcharge/data/__init__.py
@@ -1,4 +1,14 @@
"""The Pytorch Datasets."""
from . import preprocessing
+from .arrow import (
+    spectra_to_df,
+    spectra_to_parquet,
+    spectra_to_stream,
+)
+from .fields import CustomField
from .peptide_datasets import PeptideDataset
-from .spectrum_datasets import AnnotatedSpectrumDataset, SpectrumDataset
+from .spectrum_datasets import (
+    AnnotatedSpectrumDataset,
+    SpectrumDataset,
+    StreamingSpectrumDataset,
+)