Metadata workflow

Get metadata from Thermo Fisher proteomics raw files using ThermoRawFileParser.

Warning

On Unix systems, make sure that Mono is available, since ThermoRawFileParser.exe is run through mono.

Note

If individual raw files fail, delete them and try to download them again.

Output

  • metadata in both JSON and txt format, written to the jsons and txts folders
  • a combined rawfile_metadata.json (needs to be deleted if files are added)

Config file

Add a files.yaml to the config folder:

out_folder: metadata
out_csv: metadata_rawfiles.csv
thermo_raw_file_parser_exe: mono /projects/rasmussen/people/kzl465/hela_qc_mnt_data/ThermoRawFileParser1.4.4/ThermoRawFileParser.exe
files:
  - 2013_04_03_16_54_Q-Exactive-Orbitrap_1
  - 2013_04_03_17_47_Q-Exactive-Orbitrap_1
ftp_folder: pride/data/archive/2023/12/PXD042233
ftp_prefix: ftp://
ftp_server: ftp.pride.ebi.ac.uk
folder_raw: tmp_rawfiles
excluded:
- 
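
As a rough illustration of how the FTP-related keys fit together, the sketch below joins ftp_prefix, ftp_server and ftp_folder into one download location per entry in files. The helper name build_raw_file_url and the assumption that each entry is a file stem whose remote file carries a .raw suffix are not taken from the workflow itself; the values are copied from the example config above.

```python
from posixpath import join as url_join

# Values copied from the example config; the workflow reads them from config/files.yaml.
config = {
    "ftp_prefix": "ftp://",
    "ftp_server": "ftp.pride.ebi.ac.uk",
    "ftp_folder": "pride/data/archive/2023/12/PXD042233",
    "files": [
        "2013_04_03_16_54_Q-Exactive-Orbitrap_1",
        "2013_04_03_17_47_Q-Exactive-Orbitrap_1",
    ],
}


def build_raw_file_url(config: dict, name: str, suffix: str = ".raw") -> str:
    """Hypothetical helper: compose the remote location of one raw file.

    Assumes every entry in `files` is a file stem and that the remote
    file name is the stem plus `suffix`.
    """
    remote_path = url_join(config["ftp_server"], config["ftp_folder"], name + suffix)
    return config["ftp_prefix"] + remote_path


urls = [build_raw_file_url(config, name) for name in config["files"]]
for url in urls:
    print(url)
```

If the workflow composes the location differently, only the body of the helper would need to change.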

Add files on PRIDE to config file

This has already been done for the PRIDE dataset, but the script can also be used to select a subset of the files.

The list of files is extracted from pride_metadata.csv.

import pandas as pd
import yaml

pd.options.display.max_columns = 80

# Location of the PRIDE project folder and of its metadata table
ftp_folder = 'https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/12/PXD042233'
file = 'pride_metadata.csv'

config_file = 'config/files.yaml'

# The first column is used as index and provides the file names
df = pd.read_csv(f'{ftp_folder}/{file}', index_col=0)

# Load the existing config, replace the list of files and write it back
with open(config_file) as f:
    config = yaml.safe_load(f)

config['files'] = df.index.to_list()

with open(config_file, 'w') as f:
    yaml.dump(config, f)
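
To select only a subset of the files, the list of names can be filtered before it is stored in the config. The sketch below is one possible way to do this with shell-style patterns; select_files is a hypothetical helper, and the third file name is made up purely for illustration.

```python
from fnmatch import fnmatchcase


def select_files(names, pattern):
    """Hypothetical helper: keep the names matching a shell-style pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Stand-in for df.index.to_list(); the last entry is an invented example.
all_files = [
    "2013_04_03_16_54_Q-Exactive-Orbitrap_1",
    "2013_04_03_17_47_Q-Exactive-Orbitrap_1",
    "2014_01_10_09_30_Q-Exactive-Orbitrap_1",
]

# Keep only runs whose name starts with the April 2013 date prefix.
subset = select_files(all_files, "2013_04_*")
print(subset)
```

The resulting list would then be assigned to config['files'] in place of the full index.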

Then invoke the workflow with the list of config files:

# dry-run
snakemake --configfiles config/files.yaml config/excluded.yaml -p -n

Excluded files

Some files might be corrupted and cannot be processed by ThermoRawFileParser. These can be excluded based on the raw files found in the tmp folder:

# check files
echo 'excluded:' > config/excluded_$(date +"%Y%m%d").yaml
find  tmp -name '*.raw*' | awk 'sub(/^.{4}/," ? ")' >> config/excluded_$(date +"%Y%m%d").yaml

# potentially add these to the workflow exclusion files:
find  tmp -name '*.raw*' | awk 'sub(/^.{4}/," ? ")' >> config/excluded.yaml
# rm -r tmp/* # remove excluded files
# add to files.yaml

These files are ignored by the workflow (the exclusion list is configured as a Python set) once they have been added to config/files.yaml.
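
For readers less familiar with the find and awk one-liner, the following is a minimal Python sketch of the same idea: list everything under the tmp folder whose name matches *.raw* and write each path, relative to tmp, as a ` ? ` entry below excluded:. The function name exclusion_yaml is hypothetical, and the entries are sorted here, which the shell command does not do.

```python
from pathlib import Path


def exclusion_yaml(tmp_dir: str = "tmp") -> str:
    """Render leftover raw files as a YAML `excluded:` block.

    Each entry is written as ` ? <path relative to tmp_dir>`, the same shape
    the awk command produces by replacing the leading `tmp/` with ` ? `.
    """
    root = Path(tmp_dir)
    leftovers = []
    if root.is_dir():
        # rglob matches on the file name at any depth, like `find -name`.
        leftovers = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.raw*"))
    lines = ["excluded:"] + [f" ? {name}" for name in leftovers]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(exclusion_yaml("tmp"), end="")
```

In YAML, a `?` entry without a value is a mapping key with an empty value, which is how a set is commonly written.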

Setup

# sudo apt-get update
sudo apt install mono-complete
conda create -n snakemake snakemake
conda activate snakemake
pip install git+https://github.com/RasmussenLab/hela_qc_mnt_data.git # or locally cloned
pip install papermill
snakemake -n  # see job listing

Zip outputs

# could be part of snakemake process
zip -r metadata.zip txt jsons
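
If the zipping step were moved into the Snakemake process, it could be written in Python instead of calling zip. The sketch below shows one way this might look using only the standard library; zip_outputs is a hypothetical function, and the folder names are passed in as arguments because they depend on the configured output layout. Unlike zip -r, it stores files only and no directory entries.

```python
import zipfile
from pathlib import Path


def zip_outputs(archive: str, folders, base: str = ".") -> list:
    """Write every file below the given folders into one zip archive.

    Returns the archive member names, relative to `base`.
    """
    base_path = Path(base)
    members = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for folder in folders:
            folder_path = base_path / folder
            if not folder_path.is_dir():
                continue  # skip folders that were not created
            for path in sorted(folder_path.rglob("*")):
                if path.is_file():
                    name = path.relative_to(base_path).as_posix()
                    zf.write(path, arcname=name)
                    members.append(name)
    return members
```

A call such as zip_outputs("metadata.zip", ["txt", "jsons"]) would mirror the command above.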