Do bam files sorted by coordinate take shorter time for rmats? #449

Open
cjkenny25 opened this issue Nov 7, 2024 · 1 comment

cjkenny25 commented Nov 7, 2024

I'm trying to use rMATS to get quick PSI estimates and exon read counts from a few hundred single-replicate RNA-seq samples. I group 8 control samples in b2.txt and ~300 single-replicate experimental samples in b1.txt, then run the following script:

# GTF path as recorded in the STAR log
gtf_path=$(grep -- '--sjdbGTFfile' Log.out | awk -F'--sjdbGTFfile ' '{print $2}' | awk '{print $1}' | uniq)
# Read length derived from the first read in the gzipped FASTQ files
read_length=$(zcat data/*.gz | head -n 2 | awk 'NR==2' | sed 's/..$//' | wc -c)

# One rMATS run per subdirectory of rmats/
for i in $(ls rmats/); do
    rmats.py --b1 "$WDR/rmats/$i/b1.txt" \
        --b2 "$WDR/rmats/$i/b2.txt" \
        --gtf "$gtf_path" \
        -t paired \
        --readLength "$read_length" \
        --nthread 24 \
        --allow-clipping \
        --od "$WDR/rmats/$i" \
        --tmp "$WDR/rmats/$i/tmp" \
        --variable-read-length \
        --cstat 0.0001 \
        --libType fr-unstranded \
        --tstat 8
done

I usually align with STAR ahead of time and pass the unsorted BAM files to rMATS, which normally works well. This time, however, the job had not finished after 24 hours and timed out. Is rMATS faster when the BAM files are sorted by coordinate? A similarly sized analysis of clinical samples that I ran in the past took only ~2 hours with sorted BAM files.

I tried to monitor the progress of the job via files being populated in the tmp folder, but the only file that appears is 2024-11-06-19:43:26_535182_0.rmats and it is empty. Any thoughts?

EricKutschera (Contributor) commented

Using BAM files that are sorted by coordinate might be a little faster (maybe due to cache performance), but I wouldn't expect a big difference.

rMATS doesn't output much progress information. I don't think there will be any output until it has finished reading through all of the bam files. This post estimates 1 hour per 200 million alignments for the initial step: #323 (comment)
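
As a rough sanity check on the 24-hour timeout, the quoted rate can be turned into a quick estimate. This is only a sketch: `estimate_prep_hours` is a hypothetical helper (not part of rMATS), the 50 million alignments per sample is an assumed figure for illustration, and the real rate will depend on hardware and thread count.

```shell
# Hypothetical helper, not part of rMATS.
# Assumes the ~1 hour per 200 million alignments figure quoted above.
# $1 = total number of alignments across all BAM files.
estimate_prep_hours() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n / 200000000 }'
}

# Example: 300 samples at an ASSUMED 50 million alignments each.
estimate_prep_hours $((300 * 50000000))   # prints 75.0
```

The real alignment count for a sample can be obtained with `samtools view -c sample.bam` and summed over all files before calling the helper.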

Since you have around 300 samples you could try running using many machines: https://github.com/Xinglab/rmats-turbo/tree/v4.3.0?tab=readme-ov-file#running-prep-and-post-separately
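
One way to prepare for that is to split the single comma-separated BAM list in b1.txt into smaller lists, one per machine. The sketch below is an illustration only: `split_bam_list` and the `b1_part_` prefix are hypothetical names, not part of rMATS, and the `--task prep` / `--task post` usage mentioned in the comments should be checked against the README section linked above for your installed version.

```shell
# Hypothetical helper, not part of rMATS.
# Splits a comma-separated BAM list into chunk files of at most
# $2 paths each, named <prefix>0.txt, <prefix>1.txt, ...
# Prints the number of chunk files written.
# $1 = input list (e.g. b1.txt), $2 = chunk size, $3 = output prefix
split_bam_list() {
    tr ',' '\n' < "$1" | awk -v size="$2" -v prefix="$3" '
        NF {
            file = prefix int(n / size) ".txt"
            # Comma-join the paths within a chunk.
            printf "%s%s", (n % size ? "," : ""), $0 > file
            n++
            # Chunk is full: end the line and close the file.
            if (n % size == 0) { printf "\n" > file; close(file) }
        }
        END {
            # End the line of a final, partially filled chunk.
            if (n % size != 0) printf "\n" > file
            print int((n + size - 1) / size)
        }'
}

# Each chunk file could then be given to a separate prep run
# (--task prep) with its own --tmp directory, followed by one
# post run (--task post) over the collected .rmats files.
```

For example, `split_bam_list b1.txt 50 b1_part_` would turn a list of ~300 BAM files into six or so lists of 50.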

Also, that .rmats file has a colon (:) in its name, which I think means you are using a version of rMATS older than v4.1.2. If so, you may be hitting a performance issue that was fixed in v4.1.2:
https://github.com/Xinglab/rmats-turbo/releases
#104
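
A quick way to apply that file-name heuristic across many tmp directories is sketched below. `has_colon_rmats` is a hypothetical helper, not part of rMATS, and a colon in the name is only a hint based on the observation above, not a definitive version check.

```shell
# Hypothetical helper, not part of rMATS.
# Succeeds (exit status 0) if any .rmats file directly inside
# directory $1 has a colon in its file name.
has_colon_rmats() {
    for f in "$1"/*.rmats; do
        [ -e "$f" ] || continue      # glob matched nothing
        case "${f##*/}" in
            *:*) return 0 ;;
        esac
    done
    return 1
}
```

For example, `has_colon_rmats "$WDR/rmats/$i/tmp" && echo "possibly older than v4.1.2"`. If your install supports it, `rmats.py --version` is a more direct check.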
