Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Splice junctions strandedness #103

Open
PeterVenhuizen opened this issue Jul 6, 2020 · 1 comment
Open

Splice junctions strandedness #103

PeterVenhuizen opened this issue Jul 6, 2020 · 1 comment

Comments

@PeterVenhuizen
Copy link

PeterVenhuizen commented Jul 6, 2020

I've run into an issue with the likely misclassification of the strandedness of certain splice junctions. I'm working with a strand-specific RNA-Seq library for A. thaliana and generated a BAM supplemented Whippet index using the STAR alignments of said library.

A. thaliana has various cases of overlapping genes on opposing strands, as can be seen in the IGV screenshot below. For this particular case, the STAR alignment identified what I presume to be the correct junctions for the AT4G36050 (on the - strand) and the AT4G36052 (on the + strand) genes. When I look for the specific junction (chr4 17052585 17052653) in the STAR SJ.out.tab, STAR reports it as a junction on the positive strand (which seems to be correct, seeing as it corresponds to the AT4G36052 gene on the plus strand). However, the same junction in the Whippet .jnc.gz output file is reported to be on the minus strand. Subsequently, an RI event using this junction is annotated by Whippet for AT4G36050 and for these particular samples gives a very strong dPSI of >0.6. However, I do believe this is an error, given the origin of the splice junction.

whippet_strandedness_issue

I would like to know how Whippet handles overlapping genes when building the (supplemented) index and when quantifying. Is there any way for me to parse out those events, or potentially fix their quantification?

I've used julia version 0.6.4 and whippet v0.11.1 for my analysis. The --bam-both-novel flag was used when generating the index.

Thanks

@timbitz
Copy link
Owner

timbitz commented Jul 24, 2020

Hi @PeterVenhuizen, if you look in src/bam.jl you'll see it's a pretty simple bit of code that's filtering reads for each gene--

isspliced( rec::BAM.Record ) = ismatch(r"N", BAM.cigar(rec))
strandpos( rec::BAM.Record ) = BAM.flag(rec) & 0x010 == 0

function process_records!( reader::BAM.Reader, seqname::String, range::UnitRange{Int64}, strand::Bool,
                           exons::CoordTree, known, oneknown::Bool,
                           novelacc::Dict{K,V}, noveldon::Dict{K,V} ) where {K,V}

...

         # if is spliced process splice sites
         if isspliced(rec) && strand == strandpos(rec)
            known = process_spliced_record!( novelacc, noveldon, rec, known, oneknown )
         end

So the strand encoded in the fifth bit of the BAM flag (0x10) has to match the strand of the gene, where 0 is + and 1 is - in both. That's not to say there isn't potentially some other bug going on here somewhere, but it would be good if you could give a minimally sized example with reproducible inputs.

On another front, my initial benchmarking (in the paper supplement) suggests that using the --bam-both-novel flag increases the FDR considerably, so I wouldn't recommend this in general. Requiring one of the splice sites to match a known splice-site in the gene is a reasonable approach I think-- also good to note here that the set of "known" splice sites in a gene expands iteratively as new splice junctions are added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants