-
Notifications
You must be signed in to change notification settings - Fork 441
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cutadapt fixes and improvements #5954
Conversation
Would named macro tokens be an option, i.e. using Then we could insert the read2 parameters only if needed .. and at the correct positions. |
--minimum-length=$filter_options.minimum_length:$library.minimum_length2 | ||
#else if str($filter_options.minimum_length): | ||
--minimum-length=$filter_options.minimum_length | ||
#if $paired and str($filter_options.minimum_length2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should minimum_length2
be treated the same way as minimum_length
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it shouldn't. That part is a bit confusing, but minimum_length2
should normally be unset, in which case the value of minimum_length
will also be used for rv reads in PE data.
Only if the user sets it explicitly, it special-cases treatment of rv reads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if a user specifies a value for minimum_length2
we need to use the colon syntax to pass both params. Otherwise, we need to pass only minimum_length
unless its set to the command line default anyway.
The old version was buggy because it always ignored minimum_length2
when the user didn't specify minimum_length
.
tools/cutadapt/cutadapt.xml
Outdated
@@ -227,32 +284,19 @@ $read_mod_options.zero_cap | |||
<param argument="--times" type="integer" min="1" value="1" label="Match times" help="Try to remove adapters at most COUNT times. Useful when an adapter gets appended multiple times." /> | |||
<param argument="--overlap" type="integer" min="1" value="3" label="Minimum overlap length" help="Minimum overlap length. If the overlap between the adapter and the sequence is shorter than LENGTH, the read is not modified. This reduces the number of bases trimmed purely due to short random adapter matches." /> | |||
<param argument="--match-read-wildcards" type="boolean" checked="false" truevalue="--match-read-wildcards" falsevalue="" label="Match wilcards in reads" help="Interpret IUPAC wildcards in reads"/> | |||
<param argument="--no-match-adapter-wildcards" type="boolean" checked="true" truevalue="" falsevalue="--no-match-adapter-wildcards" label="Match wilcards in adapters" help="Interpret IUPAC wildcards in adapters."/> | |||
<param argument="--revcomp" type="boolean" checked="false" truevalue="--revcomp" falsevalue="" label="Look for adapters in the reverse complement" help="Check both the read and its reverse complement for adapter matches. If match is on reverse-complemented version, output that one. Default: check only read." /> | |||
<param argument="--no-match-adapter-wildcards" type="boolean" checked="true" truevalue="" falsevalue="-N" label="Match wildcards in adapters" help="Interpret IUPAC wildcards in adapters."/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use the long one? Bit confusing, or?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could sure, but the current vversion just produces unnecessary long command lines, which need to be stored by Galaxy. Not sold on using short option names throughout though. @bgruening what's your opinion here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we use argument, it will be part of the help text. From an educational point of view it would be nice to show the same param in the help and in the CLI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. https://galaxy-iuc-standards.readthedocs.io/en/latest/best_practices/tool_xml.html#parameter-name-argument-and-help says that argument
should use the long form of the wrapped option, which then means that we would always use the long form on the command line, too.
If that's the consensus, then I'll revert the commit introducing all the short options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
<section name="filter_options" title="Read Filtering Options"> | ||
<param argument="--discard-trimmed" type="boolean" checked="false" truevalue="--discard-trimmed" falsevalue="" label="Discard Trimmed Reads" help="Discard reads that contain the adapter instead of trimming them. Use the 'Minimum overlap length' option in order to avoid throwing away too many randomly matching reads!" /> | ||
<param argument="--discard_untrimmed" type="boolean" checked="false" truevalue="--discard-untrimmed" falsevalue="" label="Discard Untrimmed Reads" help="Discard reads that do not contain the adapter." /> | ||
<param argument="--minimum-length" type="integer" min="0" value="1" label="Minimum length (R1)" help="Discard reads that, after processing, are shorter than LENGTH. Note: You can set this parameter to zero to keep empty reads (with zero-length sequence and quality string) in the output, but some downstream tools may have problems with these. Default: 1" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the default of the command line tool 1? I most of the time go with those defaults .. but I see the argument for not doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well spotted :-)
The problem with the 0
default of the command line tool was what triggered this whole PR.
With this setting it's possible that the trimmed output contains empty reads like here:
@prefix:1_13_1259/1 | |
+ | |
@prefix:1_13_1440/1 |
We learnt this week that such output causes failures in RNAStar and picard Fastq2Sam, and floods the stdout of HISAT2 with warnings. Enough reason for us to change the default.
Sorry, wasn't finished yet with an explanation of the PR and didn't expect so fast a review ;-)
These tests need -m 0, which was the old default for the param.
That would require a huge macro though that would span like 2/3 of the whole wrapper if I understand you correctly. Would also change the nesting level of tons of params. Do we want go that far to provide the best user experience? |
@bernt-matthias ready for another round of comments/review :-) |
Switch back to use long options where possible to match param argument values. Also use = as param/value separator because this gives nicer cutadapt json output.
LGTM, @bernt-matthias? |
<option value="both">Both: filtering criteria must apply to both reads in order for a read pair to be discarded. </option> | ||
<option value="first">First: will make a decision about the read pair by inspecting whether the filtering criterion applies to the first read, ignoring the second read.</option> | ||
|
||
<section name="other_trimming_options" title="Other Read Trimming Options"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to require rechecking every parameter within the section changes in workflows, as we can't track that automatically. Is that worth it ? Not saying it isn't, but it's something that looks like a good amount of work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have a point here, which is worth considering.
Personally, I think all this restructuring is worth it because it will make usage of the tool a lot clearer for the average user, but that's just my opinion.
The specific change you've highlighted affects just 3 parameters I think, but, yes, I've moved around quite some others between section as well.
My point would be that it's better to move everything in one update, then hope that we'll be able to keep things constant for a while, than doing lots of incremental changes that each break workflow updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specific change you've highlighted affects just 3 parameters I think,
Got lost trying to read the diff so that's not true. More like 10 params.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very unlikely that any WF deviates from the default for a lot of them at once, but still you need to check whether that's the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we could do as a compromise is to combine things in the existing "filter_options" section, change its title to sth like "Other trimming and filtering options" and simply reorder things properly within that section so that trimming options come before filtering ones. You're right that that would minimize effort during WF updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So now that I looked at this a bit more carefully: the options that I've moved into that new Other Trimming Options section all come from the Read Modifications section, which really mixed up different concepts.
The problem there is that the trimming options (cutting, trimming, shortening) should be presented before the filtering options because trimming also happens before filtering, but the remaining read modification options are really quite exotic and should go last I think. So a split of this section then seems unavoidable.
Currently, the split moves 6 out of 10 options to the new section and leaves 4 in the original one. Also these 6 are more commonly used ones, then the 4 that are left behind.
What I could do is to move the existing section up above filtering and split the 4 more exotic options out into a new bottom section. This way "only" 4 instead of 6 options would have to be rechecked on WF updates.
The downside is that the section that I wanted to name "other_trimming_options" will then be called "read_mod_options", and we'd have to think about a new name for the bottom section.
So al in all, a moderate gain in terms of WF maintainability vs somewhat inadequate internal section names. I'm undecided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do change the parameters I would stick with changing all at once, whether you check 4 or 10 options doesn't make a difference I think.
@mvdbeek any feedback to Wolfgangs latest comments? |
FOR CONTRIBUTOR:
Now here's a list of changes:
1
the default for--minimum-length
because of the reason described here: cutadapt fixes and improvements #5954 (comment)--minimum_length2
would get ignored whenminimum_length
wasn't set--maximum_length2
dataset output filter (library['type'] == 'paired' and 'multiple_output' not in output_selector) failed: argument of type 'NoneType' is not iterable