compare the structures of isoforms within a gene #20

Oliverfeudj · 2024-06-14T08:12:37Z

is there a way to do this classify_isoform_differences.py for all isoforms? not doing individually for each isoform so that I have at the end all isoform in a single tsv file, not individual tsv for each isoform

Thank you in advance for your reply

The text was updated successfully, but these errors were encountered:

EricKutschera · 2024-06-14T14:49:08Z

By default rmats_long.py outputs a separate .tsv for each of the most significant isoforms showing the differences of that isoform to the most significant isoform in that gene with a delta proportion in the opposite direction

If you run rmats_long.py with --compare-all-within-gene then when it calls classify_isoform_differences.py it won't use --second-transcript-id and each .tsv will have the differences of the most significant isoform to all other isoforms in the gene

I added a branch which changes classify_isoform_differences.py to do a pairwise comparison of isoforms in a gene when run without --second-transcript-id: bc2bdfb

Doing a pairwise comparison can take a long time for genes with many isoforms

kwonej0617 · 2024-07-08T17:01:07Z

Hi @EricKutschera

I have used --compare-all-within-gene when running rmats_long.py and I have a question regarding the output.
Here's the command line I used:

python ./rMATS-long/scripts/rmats_long.py --abundance espresso_data/HEK293T-all/HEK293T-all_N2_R0_abundance.esp --updated-gtf espresso_data/HEK293T-all/HEK293T-all_N2_R0_updated.gtf --gencode-gtf ${gtf} --group-1 rMATS-long_data/input/HEK293T-all_group1.txt --group-2 rMATS-long_data/input/HEK293T-all_group2.txt --group-1-name WT --group-2-name KO --out-dir ${output_path} --plot-file-type .png --compare-all-within-gene

In the differential_transcript.tsv, the lines containing ENSG00000004455.17 (just randomly picked) were 10 lines as follows:

grep ENSG00000004455.17 rMATS-long_data/differential_transcripts.tsv 
ENSG00000004455.17	ENST00000354858.11	0.281545888814435	1	0.595689772079638	1	0.0064	0.0129	0.0081	0.0066	0.0073	0.0047	0.0091	0.0062	0.0029
ENSG00000004455.17	ENST00000373449.7	56.7319289594579	1	4.99459681293473e-14	2.34608698795577e-10	0.3385	0.354	0.4015	0.529	0.6199	0.6675	0.3647	0.6055	-0.2408
ENSG00000004455.17	ENST00000467905.5	1.41173459743368	1	0.234768653534498	1	0.0128	0.0102	0.0079	0.0	0.0036	0.0093	0.0103	0.0043	0.006
ENSG00000004455.17	ENST00000480134.5	0.462403563942644	1	0.496502801619222	1	0.1079	0.0688	0.0312	0.0548	0.0578	0.0794	0.0693	0.064	0.0053
ENSG00000004455.17	ENST00000487289.1	12.9953227365513	1	0.000312270022764174	0.0371344396057979	0.0222	0.0203	0.0391	0.0032	0.0	0.0	0.0272	0.0011	0.0261
ENSG00000004455.17	ENST00000548033.5	4.71189004554435	1	0.0299547012700919	0.555825341913299	0.0	0.0025	0.0	0.0064	0.0108	0.0187	0.0008	0.01-0.0111
ENSG00000004455.17	ENST00000550338.5	1.87958598265686	1	0.170381210967297	1	0.019	0.0228	0.0469	0.0128	0.0144	0.014	0.0296	0.0137	0.0159
ENSG00000004455.17	ENST00000629371.2	3.9830445464795	1	0.0459604191091984	0.679867100930521	0.0	0.0	0.0	0.0032	0.0109	0.0094	0.0	0.0079	-0.0079
ENSG00000004455.17	ENST00000672715.1	49.5163324181758	1	1.96727117889871e-12	7.39261163606559e-09	0.4766	0.4792	0.4491	0.3296	0.2418	0.1822	0.4683	0.2512	0.2171
ENSG00000004455.17	ESPRESSO:chr1:376:3	1.94993508482912	1	0.162593846125627	1	0.0165	0.0241	0.0163	0.0544	0.0335	0.0147	0.019	0.0342	-0.0152

I found the structure figure ENSG00000004455.17_structure.png only includes 5 transcripts. I wonder why it only has 5 transcripts, not all transcripts shown in differential_transcript.tsv?

In addition, I have another question. Could you please explain How the number next to the event type (for example, exon skipping 75) are calculated?

Thank you so much for your help.

EricKutschera · 2024-07-10T19:19:23Z

visualize_isoforms.py has a parameter --max-transcripts which defaults to 5: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/visualize_isoforms.py#L82

If you want to plot more than 5 then you'll need to add more colors here: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/visualize_isoforms.py#L17

Here's the code to get the number for each event type:

rMATS-long/scripts/rmats_long.py

Line 517 in 592cb32

def summarize_classification(summary, out_tsv):

It looks at the isoform_differences files and it checks each pair of transcripts to see what splicing events were found between those two transcripts. If there is only 1 event for a pair of transcripts then the count for that event type will increase. If there are multiple events for a pair then it will count as combinatorial

kwonej0617 · 2024-07-22T19:36:43Z

Thank you @EricKutschera!

Also, I have a question. Based on the summary figure, I have 20 differential isoform pairs that has SE event.

However, the number of 'isoform_differences' files are greater than 20. There are 29 files. I wonder why the numbers are different, in other word, what is the logic to keep those 20 SE event cases? Also, is there a way to know what those 20 cases are?

Looking forward to hearing from you. Thank you!

EricKutschera · 2024-07-24T12:24:42Z

If the plot shows 20 SE events then after looking at each of the isoform differences files and adding up all the pairs of transcripts that only had a single SE difference the count should be 20. If you ran with --compare-all-within-gene then each file can have multiple pairs of transcripts. The number of isoform differences files isn't expected to match the number of events

kwonej0617 · 2024-08-15T15:45:30Z

Hi @EricKutschera

I have a question regarding the number of total genes with significant isoform. Based on my summary.txt file, it appears the total genes with significant isoforms are 82, which is much lower than I expected. Considering there are more than 20,000 genes are expressed in the cell and 95% of them undergoes alternative splicing, I thought the number should be higher than 82.
I wonder this number is reasonable or do you think I should change the cutoff (p-value or delta threshold)?

## significant differential transcript usage
total significant isoforms: 117
total genes with significant isoforms: 82
adjusted pvalue threshold: 0.05
delta isoform proportion threshold: 0.1
## alternative splicing classifications between isoform pairs
total classified isoform pairs: 1062
exon skipping: 78
alternative 5'-splice site: 35
alternative 3'-splice site: 29
mutually exclusive exons: 4
intron retention: 49
alternative first exon: 30
alternative last exon: 12
complex: 346
combinatorial: 479

Looking forward to hearing from you.
Thank you!

EricKutschera · 2024-08-15T17:09:41Z

The number of significant genes depends on the data and 82 could be reasonable for your data. rMATS-long is looking for significant splicing differences between the two groups. There might be 20k genes with alternative splicing in your data, but that splicing might not be significantly different between your two groups. It could be that the isoform proportions in the two groups are similar, or it could be that there aren't enough reads for the difference to be reported as significant

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

compare the structures of isoforms within a gene #20

compare the structures of isoforms within a gene #20

Oliverfeudj commented Jun 14, 2024

EricKutschera commented Jun 14, 2024

kwonej0617 commented Jul 8, 2024 •

edited

Loading

EricKutschera commented Jul 10, 2024

kwonej0617 commented Jul 22, 2024

EricKutschera commented Jul 24, 2024

kwonej0617 commented Aug 15, 2024

EricKutschera commented Aug 15, 2024

compare the structures of isoforms within a gene #20

compare the structures of isoforms within a gene #20

Comments

Oliverfeudj commented Jun 14, 2024

EricKutschera commented Jun 14, 2024

kwonej0617 commented Jul 8, 2024 • edited Loading

EricKutschera commented Jul 10, 2024

kwonej0617 commented Jul 22, 2024

EricKutschera commented Jul 24, 2024

kwonej0617 commented Aug 15, 2024

EricKutschera commented Aug 15, 2024

kwonej0617 commented Jul 8, 2024 •

edited

Loading