Missing my favorite organism/source [please add as a comment to this issue] #13

Open
mikelove opened this issue Nov 1, 2018 · 8 comments

Comments

@mikelove
Collaborator

mikelove commented Nov 1, 2018

Please add any organism or source that we are missing that you'd like to be covered by tximeta, and we will consider the best way to fold it in. We want to cover as many use cases as possible, and support and encourage linkedTxome for remaining cases.

@matthewdavidsmith

matthewdavidsmith commented Jul 11, 2019

I wanted to try tximeta out. With Salmon 0.14.1, I built a Salmon index from GENCODE v29 (with decoys), as currently posted on the main Salmon site (using the --gencode flag), and quantified in mapping mode. When I tried to create a SummarizedExperiment, though, tximeta was unable to recognize the transcriptome. Is something wrong, or is GENCODE v29 not implemented?

Thanks!

> se <- tximeta(samples_tximeta)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 

tximeta needs a BiocFileCache directory to access and save TxDb objects.
Do you wish to use the default directory: 'C:\Users\msmit248\AppData\Local\BiocFileCache\BiocFileCache\Cache'?
If not, a temporary directory that is specific to this R session will be used.

You can always change this directory later by running: setTximetaBFC()
Or enter [0] to exit and set this directory manually now. 

1: Yes (use default)
2: No (use temp)

Selection: 2
couldn't find matching transcriptome, returning un-ranged SummarizedExperiment
> se
class: SummarizedExperiment 
dim: 205870 10 
metadata(3): tximetaInfo quantInfo countsFromAbundance
assays(3): counts abundance length
rownames(205870): ENST00000456328.2 ENST00000450305.2 ... ENST00000387460.2 ENST00000387461.2
rowData names(0):
colnames(10): kw01_ifng kw01_veh ... p154_ifng p154_veh
colData names(6): run person ... batch names
> dim(se)
[1] 205870     10

@mikelove
Collaborator Author

I think we’ll need to get the Gencode + decoy hash values from @rob-p. Correct, Rob? We’ll work out a pipeline.

@mikelove
Collaborator Author

Just more information: a short-term fix would be to use a linkedTxome to connect the index to the source yourself.

But we really want these indices to automatically connect to the reference.

@rob-p do you think we should pass the hash of the -t transcripts alone to the JSON files, as a separate hash in addition to the transcripts plus the decoys? I'm not sure how the hashing is currently performed. Both hash values may be useful. Going forward, to connect to the GA4GH API we will need the hash value of the -t transcripts alone.

@mikelove
Collaborator Author

The Gencode + decoy hash was going to break our plans to integrate with GA4GH to support all txomes, because the hash value on the server side wouldn't include the decoy sequences. The next version of Salmon will therefore report the -t hash and the decoy hash separately, so tximeta will still work out of the box. In the meantime, you can explicitly link the txome to the GTF using makeLinkedTxome, as shown in the vignette.

@mikelove
Collaborator Author

This thread made me realize that the above workaround would also be a useful technique for preserving the reference hash value when users want to add non-reference transcripts. For example, users sometimes add ERCC spike-ins, viral sequences, or fusion genes. It may be useful to have a reference hash as well as a hash of the non-reference sequences, and a total hash...

@mikelove changed the title from "You're missing my favorite organism / source [Please add as a comment to this issue]" to "Missing my favorite organism/source [Pls add as a comment to this issue]" on Sep 30, 2019
@jtheorell

ERCC would be great!

@mikelove
Collaborator Author

mikelove commented Sep 30, 2019

Thanks for the feedback, @jtheorell.

So we don't have this working yet, but my thought was that we could have Salmon distinguish between the "primary" reference sequences of interest (e.g. transcripts) and other, perhaps "technical", sequences such as spike-in or decoy sequences. Salmon will quantify against all of these sequences, but for the purpose of txome identification, we'd like to know the hash of the primary seqs as well as the hash of the primary plus the technical seqs. This way we will at least be able to identify the provenance of the primary seqs. Given that the technical seqs may be very idiosyncratic, it's unlikely to be possible to identify primary + technical without the user creating a linkedTxome.

We don't have a formalized mechanism for this now, but it's a sketch of a solution. The current solution would be linkedTxome + Zenodo deposit for FASTA and GTF.
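
The idea of keeping a primary hash alongside a hash of the technical sequences and a total hash can be illustrated with a small sketch. This is not how Salmon or tximeta compute or store their hash values; the hashing scheme (SHA-256 over the concatenated sequences), the function names, the toy sequences, and the lookup table are all assumptions made for illustration only.

```python
import hashlib


def seq_hash(seqs):
    """SHA-256 over the sequences concatenated in order (an assumed scheme)."""
    h = hashlib.sha256()
    for s in seqs:
        h.update(s.encode("ascii"))
    return h.hexdigest()


def txome_hashes(primary, technical):
    """Return separate hashes for primary, technical, and all sequences."""
    return {
        "primary": seq_hash(primary),
        "technical": seq_hash(technical),
        "total": seq_hash(primary + technical),
    }


# Hypothetical toy data: two "transcripts" and one "spike-in".
transcripts = ["ACGTACGT", "GGCCTTAA"]
spike_ins = ["TTTTCCCC"]

plain = txome_hashes(transcripts, [])
augmented = txome_hashes(transcripts, spike_ins)

# Hypothetical table of known reference hashes (primary sequences only).
known = {plain["primary"]: "toy reference v1"}

# The primary hash still identifies the reference after spike-ins are added,
# while the total hash no longer matches anything in the table.
print(known.get(augmented["primary"]))
print(known.get(augmented["total"]))
```

Under these assumptions, adding the spike-in changes the total hash but leaves the primary hash intact, so the provenance of the primary sequences can still be looked up.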

@jtheorell

OK! I'll try as best I can to get it to work for now, then. Thanks for your super rapid response!

@mikelove changed the title from "Missing my favorite organism/source [Pls add as a comment to this issue]" to "Missing my favorite organism/source [please add as a comment to this issue]" on Apr 13, 2021