Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compare the structures of isoforms within a gene #20

Open
Oliverfeudj opened this issue Jun 14, 2024 · 7 comments
Open

compare the structures of isoforms within a gene #20

Oliverfeudj opened this issue Jun 14, 2024 · 7 comments

Comments

@Oliverfeudj
Copy link

Hello @EricKutschera,

is there a way to do this classify_isoform_differences.py for all isoforms? not doing individually for each isoform so that I have at the end all isoform in a single tsv file, not individual tsv for each isoform

Thank you in advance for your reply

@EricKutschera
Copy link
Contributor

By default rmats_long.py outputs a separate .tsv for each of the most significant isoforms showing the differences of that isoform to the most significant isoform in that gene with a delta proportion in the opposite direction

If you run rmats_long.py with --compare-all-within-gene then when it calls classify_isoform_differences.py it won't use --second-transcript-id and each .tsv will have the differences of the most significant isoform to all other isoforms in the gene

I added a branch which changes classify_isoform_differences.py to do a pairwise comparison of isoforms in a gene when run without --second-transcript-id: bc2bdfb

Doing a pairwise comparison can take a long time for genes with many isoforms

@kwonej0617
Copy link

kwonej0617 commented Jul 8, 2024

Hi @EricKutschera

I have used --compare-all-within-gene when running rmats_long.py and I have a question regarding the output.
Here's the command line I used:

python ./rMATS-long/scripts/rmats_long.py --abundance espresso_data/HEK293T-all/HEK293T-all_N2_R0_abundance.esp --updated-gtf espresso_data/HEK293T-all/HEK293T-all_N2_R0_updated.gtf --gencode-gtf ${gtf} --group-1 rMATS-long_data/input/HEK293T-all_group1.txt --group-2 rMATS-long_data/input/HEK293T-all_group2.txt --group-1-name WT --group-2-name KO --out-dir ${output_path} --plot-file-type .png --compare-all-within-gene

In the differential_transcript.tsv, the lines containing ENSG00000004455.17 (just randomly picked) were 10 lines as follows:

grep ENSG00000004455.17 rMATS-long_data/differential_transcripts.tsv 
ENSG00000004455.17	ENST00000354858.11	0.281545888814435	1	0.595689772079638	1	0.0064	0.0129	0.0081	0.0066	0.0073	0.0047	0.0091	0.0062	0.0029
ENSG00000004455.17	ENST00000373449.7	56.7319289594579	1	4.99459681293473e-14	2.34608698795577e-10	0.3385	0.354	0.4015	0.529	0.6199	0.6675	0.3647	0.6055	-0.2408
ENSG00000004455.17	ENST00000467905.5	1.41173459743368	1	0.234768653534498	1	0.0128	0.0102	0.0079	0.0	0.0036	0.0093	0.0103	0.0043	0.006
ENSG00000004455.17	ENST00000480134.5	0.462403563942644	1	0.496502801619222	1	0.1079	0.0688	0.0312	0.0548	0.0578	0.0794	0.0693	0.064	0.0053
ENSG00000004455.17	ENST00000487289.1	12.9953227365513	1	0.000312270022764174	0.0371344396057979	0.0222	0.0203	0.0391	0.0032	0.0	0.0	0.0272	0.0011	0.0261
ENSG00000004455.17	ENST00000548033.5	4.71189004554435	1	0.0299547012700919	0.555825341913299	0.0	0.0025	0.0	0.0064	0.0108	0.0187	0.0008	0.01-0.0111
ENSG00000004455.17	ENST00000550338.5	1.87958598265686	1	0.170381210967297	1	0.019	0.0228	0.0469	0.0128	0.0144	0.014	0.0296	0.0137	0.0159
ENSG00000004455.17	ENST00000629371.2	3.9830445464795	1	0.0459604191091984	0.679867100930521	0.0	0.0	0.0	0.0032	0.0109	0.0094	0.0	0.0079	-0.0079
ENSG00000004455.17	ENST00000672715.1	49.5163324181758	1	1.96727117889871e-12	7.39261163606559e-09	0.4766	0.4792	0.4491	0.3296	0.2418	0.1822	0.4683	0.2512	0.2171
ENSG00000004455.17	ESPRESSO:chr1:376:3	1.94993508482912	1	0.162593846125627	1	0.0165	0.0241	0.0163	0.0544	0.0335	0.0147	0.019	0.0342	-0.0152

I found the structure figure ENSG00000004455.17_structure.png only includes 5 transcripts. I wonder why it only has 5 transcripts, not all transcripts shown in differential_transcript.tsv?
image

In addition, I have another question. Could you please explain How the number next to the event type (for example, exon skipping 75) are calculated?
image

Thank you so much for your help.

@EricKutschera
Copy link
Contributor

visualize_isoforms.py has a parameter --max-transcripts which defaults to 5: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/visualize_isoforms.py#L82

If you want to plot more than 5 then you'll need to add more colors here: https://github.com/Xinglab/rMATS-long/blob/v1.0.0/scripts/visualize_isoforms.py#L17

Here's the code to get the number for each event type:

def summarize_classification(summary, out_tsv):

It looks at the isoform_differences files and it checks each pair of transcripts to see what splicing events were found between those two transcripts. If there is only 1 event for a pair of transcripts then the count for that event type will increase. If there are multiple events for a pair then it will count as combinatorial

@kwonej0617
Copy link

Thank you @EricKutschera!

Also, I have a question. Based on the summary figure, I have 20 differential isoform pairs that has SE event.
image
However, the number of 'isoform_differences' files are greater than 20. There are 29 files. I wonder why the numbers are different, in other word, what is the logic to keep those 20 SE event cases? Also, is there a way to know what those 20 cases are?

Looking forward to hearing from you. Thank you!

@EricKutschera
Copy link
Contributor

If the plot shows 20 SE events then after looking at each of the isoform differences files and adding up all the pairs of transcripts that only had a single SE difference the count should be 20. If you ran with --compare-all-within-gene then each file can have multiple pairs of transcripts. The number of isoform differences files isn't expected to match the number of events

@kwonej0617
Copy link

Hi @EricKutschera

I have a question regarding the number of total genes with significant isoform. Based on my summary.txt file, it appears the total genes with significant isoforms are 82, which is much lower than I expected. Considering there are more than 20,000 genes are expressed in the cell and 95% of them undergoes alternative splicing, I thought the number should be higher than 82.
I wonder this number is reasonable or do you think I should change the cutoff (p-value or delta threshold)?

## significant differential transcript usage
total significant isoforms: 117
total genes with significant isoforms: 82
adjusted pvalue threshold: 0.05
delta isoform proportion threshold: 0.1
## alternative splicing classifications between isoform pairs
total classified isoform pairs: 1062
exon skipping: 78
alternative 5'-splice site: 35
alternative 3'-splice site: 29
mutually exclusive exons: 4
intron retention: 49
alternative first exon: 30
alternative last exon: 12
complex: 346
combinatorial: 479

Looking forward to hearing from you.
Thank you!

@EricKutschera
Copy link
Contributor

The number of significant genes depends on the data and 82 could be reasonable for your data. rMATS-long is looking for significant splicing differences between the two groups. There might be 20k genes with alternative splicing in your data, but that splicing might not be significantly different between your two groups. It could be that the isoform proportions in the two groups are similar, or it could be that there aren't enough reads for the difference to be reported as significant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants