Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Novel isoform not detected #62

Open
stfacc opened this issue Jul 3, 2024 · 3 comments
Open

Novel isoform not detected #62

stfacc opened this issue Jul 3, 2024 · 3 comments

Comments

@stfacc
Copy link

stfacc commented Jul 3, 2024

In my RNAseq there's a novel isoform (extended exon) which is not detected by ESPRESSO. (See BAM file with ESPRESSO gtf below). Only two reads are visible supporting the isoform, but there are actually many more.

Is there any option I can try to identify it?

Many thanks

image

@EricKutschera
Copy link
Contributor

You could add a transcript with that novel splice junction or the full transcript that you see in IGV to your .gtf file. Then ESPRESSO would treat the isoform as annotated and report its abundance

You could try reducing --read_num_cutoff to 1 or 0 (default is 2). That reduces how many "perfect" reads ESPRESSO needs to see for a junction before using it in a novel isoform

If you know the read IDs that you think are for that novel isoform then you could check in the output files to see what ESPRESSO did with those reads. If you ran the Q step with -V compatible_isoform.tsv then you could search for the read IDs in that file to see if ESPRESSO assigned them to another isoform. You could also search for the read IDs in the files like {chr}_read_final.txt. It might show that ESPRESSO marked certain junctions in those reads as "fail" or realigned some of the junctions. If the read IDs aren't in any read_final file then they were filtered out. You could try reducing --mapq_cutoff (default 1) or increasing --inserted_cont_cutoff (default 20) to avoid filtering out those reads

@stfacc
Copy link
Author

stfacc commented Jul 4, 2024

Thank you Eric for your suggestions. I repeated the process with more samples, and including the standard gencode v45 annotation file. No custom SJ annotations.

This time the isoform in my first comment was detected.

There are however two other problems:

  1. An isoform with a 4-bases extended exon is not identified (see figure below)

I added the option --internal_boundary_limit 3, but didn't help.

In file chr12_read_final.txt I see this line for one of the extended reads:

b4a89131-ea7a-4490-b368-18535897f2a4 SJ 1 chr12:21437814:21440363:x 2892 3009 corrected chr12:21437814:21440367:0 2894.5 3006.5 0.000651527191140263 no yes

So the reads with the new SJ are not filtered out but corrected (even with --internal_boundary_limit 3).

image

  1. An intron retention is not detected.

It's weird that ESPRESSO could identify the same intron retention when processing only a subset of the samples.
Not sure how to troubleshoot this issue.

The corresponding reads are marked "ISM" in compatible_isoform.tsv

image

Thanks

@EricKutschera
Copy link
Contributor

For the 4 base different splice junction it looks like the issue is the sequence at the junction. The original alignment is chr12:21437814:21440363:x the :x means that ESPRESSO didn't have that junction annotated and it couldn't match the two bases on the two ends of the junction to what it expects for either the + or - strand. These are the bases ESPRESSO associates with each strand: https://github.com/Xinglab/espresso/blob/v1.4.0/src/ESPRESSO_S.pl#L208

The annotated junction looks like its GT AG. The other junction would be GT AA. ESPRESSO won't create a novel isoform unless all the junctions are either annotated or have the expected junction sequence

For the intron retention, it looks like the intron is between the last 2 exons of the transcript. In that case ESPRESSO would treat the read as ending at the 2nd to last exon but with a different transcript endpoint. When ESPRESSO classifies a read as ISM it's only looking at the splice junctions in the read. Since the intron is more than 1000 bases, even if those reads are ISM they should fail the endpoint check to the annotated isoform they share the junctions with. ESPRESSO could report it as a novel isoform, but only if there are no reads assigned to the annotated isoform that it is ISM in relation to

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants