Metadata workflow

Get metadata from Thermo Fisher proteomics raw files using ThermoRawFileParser.

Warning

On Unix systems, make sure that mono is available.

Note

If individual raw files fail, delete them and try to download them again.

Output

Both JSON and txt output, written to the jsons and txts folders
A combined rawfile_metadata.json (needs to be deleted if files are added)
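
The combined file can be thought of as a merge of the per-file JSON outputs. The sketch below is an assumption about how such a merge could look, not the workflow's actual rule: the helper name `combine_json` and the choice to key each entry by file stem are hypothetical, and the content of the individual JSON files is treated as opaque.

```python
import json
from pathlib import Path


def combine_json(folder, out_file):
    """Merge all *.json files in `folder` into one JSON file keyed by file stem.

    Hypothetical helper: the real workflow may combine the files differently.
    Keep `out_file` outside of `folder` so it is not picked up on a later run.
    """
    combined = {}
    for path in sorted(Path(folder).glob("*.json")):
        with open(path) as f:
            combined[path.stem] = json.load(f)
    with open(out_file, "w") as f:
        json.dump(combined, f, indent=2)
    return combined
```

A call such as `combine_json("jsons", "rawfile_metadata.json")` captures only the files present at that moment, which is consistent with the need to delete the combined file after adding new raw files.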

Configfile

Add a files.yaml to the config folder:

out_folder: metadata
out_csv: metadata_rawfiles.csv
thermo_raw_file_parser_exe: mono /projects/rasmussen/people/kzl465/hela_qc_mnt_data/ThermoRawFileParser1.4.4/ThermoRawFileParser.exe
files:
  - 2013_04_03_16_54_Q-Exactive-Orbitrap_1
  - 2013_04_03_17_47_Q-Exactive-Orbitrap_1
ftp_folder: pride/data/archive/2023/12/PXD042233
ftp_prefix: ftp://
ftp_server: ftp.pride.ebi.ac.uk
folder_raw: tmp_rawfiles
excluded:
-
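
To make the role of the FTP keys concrete, the sketch below joins them into a download URL for one entry of `files`. This is an assumption about how the keys fit together: the helper `raw_file_url` and the `.raw` suffix are hypothetical and not taken from the workflow.

```python
def raw_file_url(config, name, suffix=".raw"):
    """Build the download URL for one raw file from the config values.

    Hypothetical helper; the `.raw` suffix is an assumption.
    """
    return (
        f"{config['ftp_prefix']}{config['ftp_server']}/"
        f"{config['ftp_folder'].strip('/')}/{name}{suffix}"
    )


# values copied from the example config above
config = {
    "ftp_prefix": "ftp://",
    "ftp_server": "ftp.pride.ebi.ac.uk",
    "ftp_folder": "pride/data/archive/2023/12/PXD042233",
    "files": ["2013_04_03_16_54_Q-Exactive-Orbitrap_1"],
}

urls = [raw_file_url(config, name) for name in config["files"]]
print(urls[0])
```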

Add files on PRIDE to config file

Already done for PRIDE, but could be used to select a subset of the files.

The list of files is extracted from pride_metadata.csv.

import yaml
import pandas as pd

pd.options.display.max_columns = 80

# folder of the PRIDE project, accessed over HTTPS
ftp_folder = 'https://ftp.pride.ebi.ac.uk/pride/data/archive/2023/12/PXD042233'
file = 'pride_metadata.csv'

config_file = 'config/files.yaml'

# the first column is used as index; it provides the list of files
df = pd.read_csv(f'{ftp_folder}/{file}', index_col=0)

with open(config_file) as f:
    config = yaml.safe_load(f)

# replace the list of files in the config with the files listed on PRIDE
config['files'] = df.index.to_list()

with open(config_file, 'w') as f:
    yaml.dump(config, f)

Then invoke the workflow with the list of config files:

# dry-run
snakemake --configfiles config/files.yaml config/excluded.yaml -p -n

Excluded files

Some files might be corrupted and cannot be processed by ThermoRawFileParser. These can be excluded based on the raw files left in the tmp folder:

# check files
echo 'excluded:' > config/excluded_$(date +"%Y%m%d").yaml
find  tmp -name '*.raw*' | awk 'sub(/^.{4}/," ? ")' >> config/excluded_$(date +"%Y%m%d").yaml

# potentially add these to the workflow exclusion files:
find  tmp -name '*.raw*' | awk 'sub(/^.{4}/," ? ")' >> config/excluded.yaml
# rm -r tmp/* # remove excluded files
# add to files.yaml

These files are ignored by the workflow (the exclusions are configured as a Python set) once they have been added to config/files.yaml.
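
The `find | awk` commands above strip the leading `tmp/` from each path and prefix it with ` ? `, which is YAML's explicit-key syntax. For readers who prefer Python, the following sketch produces the same kind of lines. The function names `exclusion_lines` and `write_exclusions` are hypothetical, and the entries are sorted here, whereas `find` makes no ordering guarantee.

```python
from pathlib import Path


def exclusion_lines(tmp_dir="tmp"):
    """Return one ' ? <relative path>' line per entry matching '*.raw*'."""
    root = Path(tmp_dir)
    # search recursively, like `find tmp -name '*.raw*'`
    relative = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.raw*"))
    return [f" ? {name}" for name in relative]


def write_exclusions(tmp_dir, out_file):
    """Write an 'excluded:' header followed by the exclusion lines."""
    lines = ["excluded:"] + exclusion_lines(tmp_dir)
    Path(out_file).write_text("\n".join(lines) + "\n")
```

For example, `write_exclusions("tmp", "config/excluded.yaml")` overwrites the exclusion file instead of appending to it, so existing entries would need to be merged by hand.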

Setup

Download and unzip ThermoRawFileParser
Add the path to the exe to the config

# sudo apt-get update
sudo apt install mono-complete
conda create -n snakemake snakemake
conda activate snakemake
pip install git+https://github.com/RasmussenLab/hela_qc_mnt_data.git # or locally cloned
pip install papermill
snakemake -n  # see job listing

Zip outputs

# could be part of snakemake process
zip -r metadata.zip txt jsons
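
If the archiving step were moved into a Python script, as the comment above suggests, it could look roughly like the sketch below. The helper name `zip_folders` is hypothetical. Unlike `zip -r`, this version stores only files, not directory entries.

```python
import zipfile
from pathlib import Path


def zip_folders(out_zip, folders):
    """Recursively add the files of the given folders to a zip archive."""
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for folder in folders:
            folder = Path(folder)
            for path in sorted(folder.rglob("*")):
                if path.is_file():
                    # store the file under '<folder name>/<relative path>'
                    arcname = path.relative_to(folder.parent).as_posix()
                    zf.write(path, arcname=arcname)
    return out_zip
```

The shell command above would correspond to `zip_folders("metadata.zip", ["txt", "jsons"])`.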
