Migrated documentation, scripts, and building of preclinical datasources from the paper repository. Reduced annotated cell line data to include only those that are clinically or biologically relevant, thus reducing file sizes. Expanded upon documentation for building ExAC and COSMIC.
brendanreardon committed Feb 10, 2022
1 parent 214cb73 commit ccb3c15
Showing 33 changed files with 2,591,245 additions and 37 deletions.
16 changes: 8 additions & 8 deletions moalmanac/config.ini
@@ -55,12 +55,12 @@ lawrence_handle = datasources/lawrence/lawrence_mapped_ontology.txt
additional_matches = almanac.additional.matches.json

[preclinical]
almanac_gdsc_mappings = datasources/preclinical/almanac-gdsc-mappings.json
summary = datasources/preclinical/cell-lines.summary.txt
variants = datasources/preclinical/ccle.variants.evaluated.txt
copynumbers = datasources/preclinical/ccle.copy-numbers.evaluated.txt
fusions = datasources/preclinical/sanger.fusions.evaluated.txt
fusions1 = datasources/preclinical/sanger.fusions.gene1.evaluated.txt
fusions2 = datasources/preclinical/sanger.fusions.gene2.evaluated.txt
gdsc = datasources/preclinical/sanger.gdsc.txt
almanac_gdsc_mappings = datasources/preclinical/formatted/almanac-gdsc-mappings.json
summary = datasources/preclinical/formatted/cell-lines.summary.txt
variants = datasources/preclinical/annotated/cell-lines.somatic-variants.annotated.txt
copynumbers = datasources/preclinical/annotated/cell-lines.copy-numbers.annotated.txt
fusions = datasources/preclinical/annotated/cell-lines.fusions.annotated.txt
fusions1 = datasources/preclinical/annotated/cell-lines.fusions.annotated.gene1.txt
fusions2 = datasources/preclinical/annotated/cell-lines.fusions.annotated.gene2.txt
gdsc = datasources/preclinical/formatted/sanger.gdsc.txt
dictionary = datasources/preclinical/cell-lines.pkl
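
Because every path in `[preclinical]` moved in this commit, it can help to confirm that the configured files are actually present before running the tool. The following is a minimal sketch, not part of MOAlmanac itself: the function name `missing_preclinical_files` and the abbreviated `CONFIG_TEXT` are hypothetical, and it assumes the paths are relative to a single root directory.

```python
import configparser
from pathlib import Path

# Abbreviated copy of the [preclinical] section above (illustration only).
CONFIG_TEXT = """
[preclinical]
almanac_gdsc_mappings = datasources/preclinical/formatted/almanac-gdsc-mappings.json
summary = datasources/preclinical/formatted/cell-lines.summary.txt
variants = datasources/preclinical/annotated/cell-lines.somatic-variants.annotated.txt
dictionary = datasources/preclinical/cell-lines.pkl
"""


def missing_preclinical_files(config_text, root="."):
    """Return the [preclinical] keys whose configured paths do not exist under root."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["preclinical"]
    return sorted(
        key for key, relative in section.items()
        if not (Path(root) / relative).exists()
    )


# Report anything that still needs to be built or downloaded.
for key in missing_preclinical_files(CONFIG_TEXT):
    print(f"missing: {key}")
```

In practice the text would be read from `config.ini` rather than embedded, and `root` would point at the `moalmanac/` directory.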
6 changes: 4 additions & 2 deletions moalmanac/datasources/README.md
@@ -1,13 +1,15 @@
# Datasources used by MOAlmanac
MOAlmanac leverages several datasources to annotate and evaluate molecular features for clinical and biological relevance. **If you are running this software after cloning it from GitHub, you will need to preprocess some datasources before running this tool** because some files exceed the GitHub file size limit of 100 MB. For instructions on how to build the relevant datasources, please refer to their respective folders in this directory.

Some files require download over Google. If you are unable to download through Google or use Docker to access the processed datasources, please contact us and we will work with you to figure something out.

| Name | Immediately ready for use from Github | Included in Docker |
|---------------------------------------------------------|---------------------------------------|--------------------|
| [Cancer cell line data](preclinical/) | :x: | :white_check_mark: |
| [COSMIC](cosmic/) | :x: | :white_check_mark: |
| [ExAC](exac/) | :x: | :white_check_mark: |
| [American College of Medical Genetics v2 (ACMG)](acmg/) | :white_check_mark: | :white_check_mark: |
| [Cancer Gene Census (CGC)](cancergenecensus/) | :white_check_mark: | :white_check_mark: |
| [Cancer cell line data](preclinical/) | :white_check_mark: | :white_check_mark: |
| [Cancer Gene Census (CGC)](cancergenecensus/) | :white_check_mark: | :white_check_mark: |
| [Cancer Hotspots](cancerhotspots/) | :white_check_mark: | :white_check_mark: |
| [ClinVar](clinvar/) | :white_check_mark: | :white_check_mark: |
| [Genes associated with hereditary cancers](hereditary/) | :white_check_mark: | :white_check_mark: |
6 changes: 4 additions & 2 deletions moalmanac/datasources/cosmic/README.md
@@ -5,13 +5,15 @@ The Molecular Oncology Almanac's heuristic utilizes the [Catalogue of Somatic Mu
COSMIC is developed and maintained by the [Wellcome Trust Sanger Institute](http://www.sanger.ac.uk/) for exploring the impact of somatic mutations in human cancer. At the time of this writing, COSMIC v85 contains nearly 30,000 genes and 6 million protein changes.

## Usage: Downloading and formatting COSMIC
Data can be downloaded from [COSMIC's one click data downloads portal](http://cancer.sanger.ac.uk/cosmic/download). The Molecular Oncology Almanac leverages the COSMIC Mutation Data file.
Data can be downloaded from [COSMIC's one click data downloads portal](http://cancer.sanger.ac.uk/cosmic/download). The Molecular Oncology Almanac leverages the `COSMIC Mutation Data` file, labeled as `CosmicMutantExport.tsv.gz`. Download the whole file (4 GB) and uncompress it.

The script `prepare_cosmic.py` is used to extract Genes and Protein Changes for use.
The script `prepare_cosmic.py` is used to extract Genes and Protein Changes for use. Here, we have renamed `CosmicMutantExport.tsv` to `CosmicMutantExport_v85.tsv` to designate the datasource version used in the present study.

```
python prepare_cosmic.py --cosmicMutantExport CosmicMutantExport_v85.tsv
```

This script will produce an output with the suffix `.lite.txt`, which will be used by MOAlmanac.
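
The source of `prepare_cosmic.py` is not shown here, so the following is only a sketch of the kind of reduction described above: keeping unique gene and protein change pairs from a tab-delimited export. The function name is hypothetical, the column names `Gene name` and `Mutation AA` are assumptions about the export's header, and the rows are toy data rather than an excerpt from COSMIC.

```python
import csv
import io


def extract_gene_protein_changes(handle, gene_col="Gene name", protein_col="Mutation AA"):
    """Yield unique (gene, protein change) pairs from a tab-delimited export."""
    seen = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        pair = (row[gene_col], row[protein_col])
        if pair not in seen:
            seen.add(pair)
            yield pair


# Toy input standing in for the uncompressed export (assumed column names).
example = io.StringIO(
    "Gene name\tMutation AA\tPrimary site\n"
    "BRAF\tp.V600E\tskin\n"
    "BRAF\tp.V600E\tthyroid\n"
    "KRAS\tp.G12D\tpancreas\n"
)
pairs = list(extract_gene_protein_changes(example))
```

Streaming the file row by row keeps memory use bounded by the number of unique pairs, which matters for a multi-gigabyte input.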

## References
1. [Forbes SA, Beare D, Boutselakis H, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 2017;45(D1):D777-D783.](https://academic.oup.com/nar/article/45/D1/D777/2605743)
4 changes: 4 additions & 0 deletions moalmanac/datasources/exac/README.md
@@ -10,6 +10,8 @@ All releases of ExAC [are available for download on their webpage](ftp://ftp.bro
The following steps should be performed to prepare ExAC for use with MOAlmanac:
1. Download [ExAC](https://gnomad.broadinstitute.org/downloads#exac-variants) from gnomAD's webpage, titled "All chromosomes VCF" under the "Exomes" section. Download both the VCF and its TBI (the VCF index).
2. Download [GATK](https://gatk.broadinstitute.org/hc/en-us)
   - Their website only has downloads for GATK4 currently. GATK3.8 can be downloaded [from their archives](https://console.cloud.google.com/storage/browser/gatk-software/package-archive/gatk?pli=1)
   - `gs://gatk-software/package-archive/gatk/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2`
3. Download [hg19 reference genome files](https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg19/v0;tab=objects?prefix=&forceOnObjectsSortingFiltering=false) from [gcp-public-data--broad-references](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811). In particular,
   - `gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta`
   - `gs://gcp-public-data--broad-references/hg19/v0/Homo_sapiens_assembly19.fasta.fai`
@@ -26,6 +28,8 @@ After converting ExAC to a tab delimited file, sites with multiple alternate all
python expand_exac.py --exac exac.lite-pass.1.4-r1.txt
```

This script will produce an output named `exac.expanded.r1.txt` that is about 923 MB in size.
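
The expansion step appears to split sites that list several alternate alleles into one row per allele. The sketch below illustrates that idea only; it is not the code in `expand_exac.py`. The function name, the choice of `AC` as a per-allele field, and the example record are all assumptions made for illustration.

```python
def expand_alternate_alleles(record, alt_key="ALT", per_allele_keys=("AC",)):
    """Split a record with comma-separated alternate alleles into one record per allele."""
    alts = record[alt_key].split(",")
    expanded = []
    for index, alt in enumerate(alts):
        row = dict(record)  # copy so site-level fields (e.g. AN) are carried over
        row[alt_key] = alt
        for key in per_allele_keys:
            # Per-allele fields are assumed to be comma-separated in the same order as ALT.
            row[key] = record[key].split(",")[index]
        expanded.append(row)
    return expanded


# Made-up site with two alternate alleles.
site = {"CHROM": "1", "POS": "100", "REF": "A", "ALT": "C,T", "AC": "3,1", "AN": "200"}
rows = expand_alternate_alleles(site)
```

Producing one row per allele lets downstream lookups match on a single reference and alternate allele pair without parsing lists.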

If you do not have access to Google Cloud or Docker and are having trouble building this datasource, please reach out. We are happy to try our best to help figure something out.

## References
2 changes: 1 addition & 1 deletion moalmanac/datasources/moalmanac/README.md
@@ -7,7 +7,7 @@ The Molecular Oncology Almanac attempts to capture the current body of knowledge
Several other services exist within the Molecular Oncology Almanac ecosystem. See [this repository's docs folder](/docs/) for more information.

## Usage: Formatting the database for use
This method uses a document-based format of the database, which is built using the [database repository](https://github.com/vanallenlab/moalmanac-db) and `create_almanac_db.py`.
This method uses a document-based format of the database, which is built using the [database repository](https://github.com/vanallenlab/moalmanac-db) and `create_almanac_db.py`. If MOAlmanac is updated, **please also regenerate [preclinical datasources](../preclinical/)**.

Arguments:
```
48 changes: 36 additions & 12 deletions moalmanac/datasources/preclinical/README.md
@@ -1,17 +1,41 @@
# Directly leveraging cancer cell lines for clinical interpretation
Data processing of cancer cell lines for usage by Molecular Oncology Almanac is discussed in the [MOAlmanac paper Github repository](https://github.com/vanallenlab/moalmanac-paper). This repository contains the files directly utilized by MOAlmanac to test for preclinical efficacy of relationships and to perform patient model matchmaking.
Molecular Oncology Almanac leverages cancer cell lines for clinical interpretation in two ways:
- [To test if relationships between molecular features and therapies show efficacy in cancer cell lines](../../../docs/description-of-outputs.md#preclinical-efficacy)
- [To perform profile-to-cell line matchmaking](../../../docs/description-of-outputs.md#profile-to-cell-line-matchmaking)

So that users do not have to follow the procedure of downloading, annotating, and evaluating raw data, you can download the utilized cell line somatic variants, copy number alterations, and fusions to this directory with `download-files.sh`. This should be run with the repository's virtual environment active as it utilizes the package [gdown](https://github.com/wkentaro/gdown). The files are hosted on a public Google Drive folder through the Broad Institute and are,
- `ccle.copy-numbers.evaluated.txt` (100.5 MB, MD5: a57eee56310a7c96b4a2cfa53aa6a2de), copy number alterations annotated and evaluated with MOAlmanac
- `ccle.variants.evaluated.txt` (53.8 MB, MD5: 970123c95fdbcd3138eba8b6b71047ce), somatic variants annotated and evaluated with MOAlmanac
- `sanger.fusions.evaluated.txt` (527 KB, MD5: df830391473ea1e73fddd7e449216434), fusions with the strongest match per feature from the `gene1` and `gene2` files
- `sanger.fusions.gene1.evaluated.txt` (483 KB, MD5: 59398e768f48a369724ba619ee6dc177), fusions annotated and evaluated with MOAlmanac relative to gene 1
- `sanger.fusions.gene2.evaluated.txt` (477 KB, MD5: 357e374a78c4200fa8fcca66979a3ebb), fusions annotated and evaluated with MOAlmanac relative to gene 2
The following files must be configured for use with MOAlmanac:
- `formatted/almanac.gdsc.mappings.json` (14 KB)
  - Mappings between therapies cataloged in MOAlmanac and therapies utilized in GDSC
- `formatted/cell-lines.summary.txt` (260 KB)
  - A table of cancer cell line names across data sets, which data types they have available, and whether they are used in any analyses
- `cell-lines.copy-numbers.annotated.txt` (3.3 MB)
  - A table of copy number alterations observed in cancer cell lines, annotated with MOAlmanac
- `cell-lines.somatic-variants.annotated.txt` (3.8 MB)
  - A table of somatic variants observed in cancer cell lines, annotated with MOAlmanac
- `cell-lines.fusions.gene1.annotated.txt` (314 KB)
  - A table of fusions observed in cancer cell lines, annotated with MOAlmanac relative to gene 1
- `cell-lines.fusions.gene2.annotated.txt` (311 KB)
  - A table of fusions observed in cancer cell lines, annotated with MOAlmanac relative to gene 2
- `cell-lines.fusions.annotated.txt` (363 KB)
  - A table of fusions observed in cancer cell lines, using annotations of the most clinically and biologically relevant gene per fusion
- `cell-lines.pkl` (7.8 MB)
  - A file containing summary information of the above, used when creating reports

**These files should be reproduced if the underlying [moalmanac/](../moalmanac/) is updated**.

For more details, please refer to the [paper repository](https://github.com/vanallenlab/moalmanac-paper) and/or [the present study](https://www.nature.com/articles/s43018-021-00243-3).
## Usage: downloading and formatting preclinical data
Follow these steps to configure raw data for use with MOAlmanac:
1. Follow instructions under `source/` to download data used in the present study
2. Follow instructions under `formatted/` to format samples and molecular features
3. Follow instructions under `annotated/` to annotate molecular features after formatting
4. Open and execute the notebook `generate-dictionary.ipynb` to produce `cell-lines.pkl`, for easy look up by MOAlmanac
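
The contents of `cell-lines.pkl` are produced by the notebook and are not shown here. As a rough illustration of the final step, the sketch below writes a dictionary of tables to a pickle file and reads it back. The helper names and the dictionary layout are hypothetical, and the records are placeholders rather than real cell line data.

```python
import pickle
import tempfile
from pathlib import Path


def write_dictionary(tables, path):
    """Serialize a dictionary of tables to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(tables, handle)


def read_dictionary(path):
    """Load a dictionary written by write_dictionary. Only unpickle files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Placeholder layout: one entry per data type, keyed by a made-up cell line name.
tables = {
    "summary": {"CELL-LINE-A": {"has_variants": True, "has_fusions": False}},
    "variants": {"CELL-LINE-A": [{"feature": "GENE1", "alteration": "p.X1Y"}]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cell-lines.pkl"
    write_dictionary(tables, path)
    restored = read_dictionary(path)
```

Storing everything in one pickle trades portability for load speed, which fits a file that is regenerated whenever the underlying datasources change.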

## Usage
```bash
bash download-files.sh
```
Many of these files and scripts have been repurposed and modified from the [MOAlmanac paper Github repository](https://github.com/vanallenlab/moalmanac-paper).

## References
1. [Ghandi, M. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019).](https://www.nature.com/articles/s41586-019-1186-3)
2. [Yang, W. et al. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res. 41, D955–61 (2013).](https://academic.oup.com/nar/article/41/D1/D955/1059448)
3. [Sondka, Z. et al. The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Nat. Rev. Cancer 18, 696–705 (2018).](https://www.nature.com/articles/s41568-018-0060-1)
4. [Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11, 333–337 (2014).](https://www.nature.com/articles/nmeth.2810)
5. [Reardon, B., Moore, N.D., Moore, N.S., *et al*. Integrating molecular profiles into clinical frameworks through the Molecular Oncology Almanac to prospectively guide precision oncology. *Nat Cancer* (2021).](https://www.nature.com/articles/s43018-021-00243-3)
6. [Reardon, B. & Van Allen, E. M. Molecular profile to cancer cell line matchmaking. Protocol Exchange.](https://protocolexchange.researchsquare.com/article/pex-1539/v1)