-
Notifications
You must be signed in to change notification settings - Fork 440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Normalize paths for genome fetch and some of the genome indexer data managers, plus additional moderinzation #6489
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work Nate!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this file need to be here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is just a rename from rnastar_index2_versioned.loc
, which was inconsistent with its name in all other tables. An empty .loc
as referenced in tool_data_table_conf.xml.test
must exist in order to be written to by the test.
data_managers/data_manager_star_index_builder/data_manager/rna_star_index_builder.xml
Outdated
Show resolved
Hide resolved
@@ -10,13 +9,12 @@ | |||
<column name="path" output_ref="out_file" > | |||
<move type="directory" relativize_symlinks="True"> | |||
<!-- <source>${path}</source>--> <!-- out_file.extra_files_path is used as base by default --> <!-- if no source, eg for type=directory, then refers to base --> | |||
<target base="${GALAXY_DATA_MANAGER_DATA_PATH}">${dbkey}/bwa_mem_index/${value}</target> | |||
<target base="${GALAXY_DATA_MANAGER_DATA_PATH}">genomes/${dbkey}/bwa_mem_index/v1/${value}</target> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does the "v1" comes from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the proposal, this is because it's version 1 of bwa-mem.
</move> | ||
<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/${dbkey}/bwa_mem2_index/${value}/${path}</value_translation> | ||
<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/genomes/${dbkey}/bwa_mem_index/v2/${value}/${path}</value_translation> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, now I see ... Not sure if the v1/v2 should be under bwa_mem_index. I would assume those are separate tools - separate indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or in oder words, as an admin, I would search for bwa_mem2_index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider that, I ended up with this because under the scheme in the proposal every DM will now contain a version directory. So this will result in:
genomes/${dbkey}/bwa_mem_index/v1/${value}
genomes/${dbkey}/bwa_mem2_index/v2/${value}
Which is kind of redundant and negates the purpose of the version directory in this case. That said I can understand how you would expect to have the directory name match the indexer name, although we already violate that for the other DMs, since the table name (bowtie_indexes
) is plural and the tool ID/directory (bowtie_index
) is not (to say nothing of the bowtie DMs using both "indexes" in the table/directory and "indices" in the loc file name).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to push this if others feel it isn't a good design choice (since the design change is ultimately the motivation for this PR), so if you think it makes sense to change this I'll do so. I am not sure what the version numbers ought to be in this case - probably disconnected from the version of the underlying tool and always v1
until another version is needed?
Converted to draft:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really appreciate that we have more and more DMs that do not require the extra python stuff.
I'm a bit worried about the side effects for admins.
<inputs> | ||
<param name="all_fasta_source" type="select" label="Source FASTA Sequence"> | ||
<options from_data_table="all_fasta"/> | ||
</param> | ||
<param name="sequence_name" type="text" value="" label="Name of sequence" /> | ||
<param name="sequence_id" type="text" value="" label="ID for sequence" /> | ||
<param name="tophat2" type="boolean" truevalue="--data_table_name tophat2_indexes" falsevalue="" checked="True" label="Also make available for TopHat" help="Adds values to tophat2_indexes tool data table" /> | ||
<param name="tophat2" type="boolean" checked="True" label="Also make available for TopHat" help="Adds values to tophat2_indexes tool data table" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool to cover this, but it seems SNAFU anyway, since the tophat2 data table refers to the bowtie2 loc file.
Maybe we just deprecate the tophat2 datatable (and update the tool to use the bowtie2 one)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is probably just legacy. Good idea to update the tophat2 tool although I suppose in the unlikely case an admin has indexes in the tophat2_indexes
table that are not in bowtie2_indexes
then they would disappear. And tophat is of course deprecated itself.
</move> | ||
<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/${dbkey}/bowtie2_index/${value}/${path}</value_translation> | ||
<value_translation>${GALAXY_DATA_MANAGER_DATA_PATH}/genomes/${dbkey}/bowtie_index/v2/${value}/${path}</value_translation> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand why this is done? It will require all admins to restucture reference data, or?
But certainly this would be nicer if we would start from scratch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They don't have to - if you update to the new DMs and run them, they will just place data in a new folder, but the old loc with the old data at the old paths will continue to be loaded. That said, in the proposal I suggested that we recommend admins to specify a new tool_data_path
just for organizational purposes. I also said I'd write a script to restructure the data for anyone who preferred to unify it.
@@ -1,34 +0,0 @@ | |||
<tool id="bowtie_color_space_index_builder_data_manager" name="Bowtie Color index" tool_type="manage_data" version="1.2.1" profile="23.0"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should move these to the deprecated/data_managers/
folder of this repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
- Move colorspace builder to deprecated - Drop python wrapper - Update bowtie version - Add tests
- Drop python wrapper - Update bowtie2 version - Add test of non-default options
- Drop python wrapper - Update bwa version - Add test of non-default options
- Drop python wrapper - Add options to automatically calculate --genomeSAindexNbases and --genomeChrBinNbits
- Drop python wrapper - Update samtools version - Add test of non-default options
This would probably require a lot of tool testing and so I am not going to bother with it in this PR, if we want to pursue it we can do that in a followup. It does make DM bundles quite a bit larger than they need to be for large genomes, but CVMFS deduplicates the resulting data anyway so in my specific use case it's just a build time annoyance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool!
Update the genome fetch and most commonly used indexer DMs to normalize the on-disk layout as proposed in galaxyproject/galaxy#19013.
In addition:
--genomeSAindexNbases
and--genomeChrBinNbits
options as recommended by the manual, to drastically reduce the index size for small genomes.FOR CONTRIBUTOR: