Do bam files sorted by coordinate take shorter time for rmats? #449

cjkenny25 · 2024-11-07T17:32:08Z

I'm trying to use rmats to get quick PSI estimates and read counts for exons using a few hundred single-replicate RNA-seq samples. I am grouping 8 control samples in b2.txt and ~300 single replicate experimental samples in b1.txt and running the following script:

gtf_path=$(grep -- '--sjdbGTFfile' Log.out | awk -F'--sjdbGTFfile ' '{print $2}' | awk '{print $1}'|uniq)
read_length=$(zcat data/*.gz | head -n 2| awk ' NR==2 ' | sed 's/..$//' | wc -c)

for i in $(ls rmats/); do
    rmats.py --b1 "$WDR/rmats/$i/b1.txt" \
        --b2 "$WDR/rmats/$i/b2.txt" \
        --gtf "$gtf_path" \
        -t paired \
        --readLength "$read_length" \
        --nthread 24 \
        --allow-clipping \
        --od "$WDR/rmats/$i" \
        --tmp "$WDR/rmats/$i/tmp" \
        --variable-read-length \
        --cstat 0.0001 \
        --libType fr-unstranded \
        --tstat 8
done

I usually align with STAR ahead of time and then pass the unsorted BAM files to rmats, which usually works great. This time, however, the job does not finish within 24 hours and times out. Is rmats faster when the reads are sorted by coordinate? I've done an analysis on a similar scale with clinical samples in the past and it took only ~2 hours with sorted BAM files.

I tried to monitor the progress of the job via files being populated in the tmp folder, but the only file that appears is 2024-11-06-19:43:26_535182_0.rmats and it is empty. Any thoughts?

EricKutschera · 2024-11-07T18:33:43Z

Using BAM files that are sorted by coordinate might be a little faster (maybe due to cache performance), but I wouldn't expect a big difference.

rMATS doesn't output much progress information. I don't think there will be any output until it has finished reading through all of the bam files. This post estimates 1 hour per 200 million alignments for the initial step: #323 (comment)
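To put that rule of thumb into numbers, here is a minimal sketch that turns a total alignment count into an hour estimate. The function name estimate_hours is made up for illustration, and the 200 million alignments per hour figure is only the estimate quoted above, so the real time will depend on hardware and thread count.

```shell
# Rough estimate only: assumes ~1 hour per 200 million alignments, the figure
# quoted above. Actual throughput depends on hardware and --nthread.
estimate_hours() {
    # $1 = total number of alignments across all BAM files
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n / 200000000 }'
}

# Hypothetical usage: count alignments with samtools and sum them up.
# total=0
# for bam in /path/to/bams/*.bam; do
#     total=$(( total + $(samtools view -c "$bam") ))
# done
# estimate_hours "$total"
```

For example, if ~308 samples each had 50 million alignments (an invented figure, not taken from this issue), the total would be 15.4 billion alignments, and the estimate would already be well beyond a 24 hour time limit.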

Since you have around 300 samples, you could try splitting the run across many machines: https://github.com/Xinglab/rmats-turbo/tree/v4.3.0?tab=readme-ov-file#running-prep-and-post-separately
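As a sketch of how the input could be divided for that approach, the helper below splits a comma-separated BAM list (the b1.txt format) into smaller list files, one per machine or job. The name split_bam_list, the chunk size, and the output prefix are all hypothetical. The commented commands reflect my reading of the linked README section (--task prep, cp_with_prefix.py, --task post) and should be checked against it before use.

```shell
# Split a comma-separated BAM list into chunk files holding at most $2 BAMs
# each, written as <prefix>_1.txt, <prefix>_2.txt, ...
# $1 = input list (e.g. b1.txt), $2 = BAMs per chunk, $3 = output prefix
split_bam_list() {
    tr ',' '\n' < "$1" | awk -v n="$2" -v p="$3" '
        NF == 0 { next }                          # skip empty entries
        {
            if (count % n == 0) {                 # start a new chunk file
                if (out != "") { printf "\n" > out; close(out) }
                out = p "_" (int(count / n) + 1) ".txt"
            } else {
                printf "," > out                  # separator within a chunk
            }
            printf "%s", $0 > out
            count++
        }
        END { if (out != "") { printf "\n" > out; close(out) } }'
}

# Assumed workflow, one prep job per chunk file (verify against the README):
#   rmats.py --b1 prep_1.txt --gtf "$gtf_path" -t paired \
#       --readLength "$read_length" --od out_dir --tmp tmp_prep_1 --task prep
#   python cp_with_prefix.py prep_1_ tmp_post/ tmp_prep_1/*.rmats
# followed by a single post job over the full b1.txt and b2.txt:
#   rmats.py --b1 b1.txt --b2 b2.txt ... --tmp tmp_post --task post
```

With 300 BAMs and a chunk size of 10, this would write 30 list files, so each prep job only has to read 10 BAM files.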

Also, that .rmats file has a colon (:) in its name, which I think means you are using an rMATS version older than v4.1.2. If that's the case, you may run into a performance issue that was fixed in v4.1.2:
https://github.com/Xinglab/rmats-turbo/releases
#104
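A quick way to apply that filename check to an existing --tmp directory is sketched below. The function name has_colon_rmats is made up, and the colon is only the heuristic described above, not an official version test. If your install supports it, running rmats.py --version is the more direct check.

```shell
# Return success if any .rmats file in the given directory has a colon in its
# name. Per the heuristic above, that hints at an rMATS older than v4.1.2.
# $1 = rMATS --tmp directory
has_colon_rmats() {
    for f in "$1"/*.rmats; do
        [ -e "$f" ] || continue        # glob matched nothing
        case "${f##*/}" in             # look at the file name only
            *:*) return 0 ;;
        esac
    done
    return 1
}

# Hypothetical usage:
# if has_colon_rmats "$WDR/rmats/$i/tmp"; then
#     echo "tmp files contain ':'; consider upgrading rMATS" >&2
# fi
```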
