gene id search is slow in large genome #2803
-
Hi, I set up JBrowse 2 with a few genomes. For the large maize genome (2G), it takes 1-2 min to search for a gene ID, while searching for genes in the other, smaller genomes is almost instant. I have tried putting the whole jbrowse2 folder on an SSD, and also making a small fake GFF file with only the gene ID lines and text-indexing that instead of the large GFF with all the isoform, intron and exon information. But it is still too slow. Any suggestions? Best http://www.epigenome.cuhk.edu.hk/jbrowse2/?session=local-_XVfgIvX5
-
Hi there, I visited this link http://www.epigenome.cuhk.edu.hk/jbrowse2/?session=share-erGje5JdgP&password=2UElo and searched an example gene ID. It downloads about 700kb of data, which is a fair amount, but the results otherwise show up quickly. I think the issue could be that many of the genes have a similar prefix: since the code uses the prefix to do searching (but only up to a limit), it ends up having to download a large amount of data to do the searches. Maybe we can look into how to make the prefix size larger; I think it is hardcoded right now.
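To illustrate why a shared prefix hurts, here is a minimal sketch (not JBrowse's actual code) that groups IDs by a fixed-length prefix, which is roughly the job the prefix index does for the main index. The maize-style gene IDs and both prefix sizes are made-up values chosen for illustration.

```python
from collections import Counter


def bucket_sizes(ids, prefix_size):
    """Count how many IDs fall under each fixed-length prefix."""
    return Counter(gene_id[:prefix_size] for gene_id in ids)


# Hypothetical IDs that all share the 8-character prefix "Zm00001d".
ids = [f"Zm00001d{n:06d}" for n in range(1, 40001)]

short = bucket_sizes(ids, 5)    # prefix is shorter than the shared part
longer = bucket_sizes(ids, 11)  # 8 shared characters + 3 distinguishing digits

print(len(short), max(short.values()))
print(len(longer), max(longer.values()))
```

With these IDs, a prefix size of 5 puts all 40,000 entries into a single bucket, so any search has to scan everything, while a prefix size of 11 spreads them over 41 buckets of at most 1,000 entries each.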
-
I encountered a similar issue #2826. In my experience, gene IDs are extremely likely to share a common prefix identifying the assembly the gene models belong to. For reference, of my 20 plant genomes pulled from Phytozome, the majority have a common prefix of more than 5 characters across all gene models.

I think adjusting the prefix size is probably the best option. In addition to making it user configurable, I think we should try to determine the prefix size dynamically during index creation. This could be something as simple as "choose prefix size K such that the ixx file is as close to L lines as possible", where L is some configurable constant, maybe defaulting to 1000 lines or something like it. I'm not sure where the performance optimum is here.

Long term, the current trix concept may need updating. Even with dynamically set prefix sizes, there are still going to be degenerate cases where the '*.ixx' index fails to meaningfully partition the '*.ix' file into similarly sized buckets. The simplest way around this conceptually is probably to change the '*.ixx' file from being <prefix> -> <start, stop> mappings to being a sorted <object> -> <location> mapping. A search string can then be compared against those objects to identify which pair of objects it lies between, and then that chunk of the '*.ix' file can be searched. The objects can be selected to partition the database into equally sized chunks regardless of the content of the objects.

Also, I still think a larger chunksize for reading the ix file would make sense, though I really have no idea where that is controlled.
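Both ideas can be sketched in a few lines of Python. This is a rough illustration under simplifying assumptions, not how JBrowse or trix is implemented: the function names, the `max_size` cap, and the use of list positions in place of byte offsets into the '*.ix' file are all hypothetical, and the lookup handles exact matches only (a prefix query could span more than one chunk).

```python
from bisect import bisect_right


def choose_prefix_size(keys, target_lines=1000, max_size=40):
    """Pick the prefix size K whose distinct-prefix count is closest to target_lines."""
    best_k, best_diff = 1, None
    for k in range(1, max_size + 1):
        n_prefixes = len({key[:k] for key in keys})
        diff = abs(n_prefixes - target_lines)
        if best_diff is None or diff < best_diff:
            best_k, best_diff = k, diff
    return best_k


def build_sorted_index(sorted_keys, bucket_size):
    """Sample every bucket_size-th key as a (key, position) boundary."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), bucket_size)]


def find_bucket(index, query, total):
    """Return the [start, stop) range of positions that could hold the query."""
    boundaries = [key for key, _ in index]
    j = bisect_right(boundaries, query) - 1
    if j < 0:
        return (0, 0)  # query sorts before the first key, so nothing to scan
    start = index[j][1]
    stop = index[j + 1][1] if j + 1 < len(index) else total
    return (start, stop)


# Hypothetical IDs sharing a long common prefix.
keys = sorted(f"Zm00001d{n:06d}" for n in range(1, 40001))
k = choose_prefix_size(keys, target_lines=1000)
index = build_sorted_index(keys, bucket_size=1000)
start, stop = find_bucket(index, "Zm00001d012345", len(keys))
print(k, len(index), start, stop)
```

The second approach yields evenly sized chunks no matter how the keys are distributed, because the boundaries are taken from the sorted data itself rather than from a fixed number of leading characters.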
-
@teresam856 could you weigh in on this?
-
There is now a --prefixSize argument to |
-
Users can now use the --prefixSize argument to text-index to improve the search index size. In later versions, we may try to auto-calculate this value!