Thank you for writing and maintaining this excellent workflow. I suggest the following improvements to make profiling even more user-friendly.
1. The grouping of input datasets into batches is not guaranteed to be the same across workflow runs, which prevents retrieval of cached results when resuming a run. Could you sort the datasets prior to batching?
2. The task identifier is just '(db1)', as in name: NFCORE_TAXPROFILER:TAXPROFILER:PROFILING:KRAKENUNIQ_PRELOADEDKRAKENUNIQ (db1). It would be nice to have a more specific identifier that includes the batch number (provided the issue above is resolved) or at least the sample identifier of one dataset in the batch.
3. Might it be better to split the input datasets into batches based on dataset size or read count? This may help balance run times across batches.
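To illustrate suggestion 1, a minimal sketch (plain Python, not taxprofiler code) of why a deterministic sort before chunking keeps batches stable across runs, so a resumed run can hit the cache:

```python
# Hypothetical sketch: sorting samples by a stable key before chunking
# guarantees identical batches on every run, regardless of the order in
# which inputs arrive, which is what resume caching needs.

def make_batches(samples, batch_size):
    """Chunk samples into fixed-size batches after a deterministic sort."""
    ordered = sorted(samples)  # stable, run-independent order
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# The same inputs in any arrival order yield the same batches:
print(make_batches(["s3", "s1", "s2", "s4"], 2))  # [['s1', 's2'], ['s3', 's4']]
print(make_batches(["s4", "s2", "s3", "s1"], 2))  # [['s1', 's2'], ['s3', 's4']]
```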
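And for suggestion 3, one possible approach (a sketch, not a concrete proposal for the pipeline) is a greedy longest-processing-time heuristic: sort items by descending size and always add the next item to the currently smallest batch, which roughly balances total size across a fixed number of batches:

```python
# Hypothetical sketch: greedy size-balanced assignment of (name, size) items
# into a fixed number of batches. Size is only a proxy for run time, but this
# avoids one batch getting all the large datasets.

def balanced_batches(items, n_batches):
    """Assign (name, size) items to n_batches, balancing total size."""
    batches = [[] for _ in range(n_batches)]
    totals = [0] * n_batches
    # Largest items first, each placed into the smallest batch so far.
    for name, size in sorted(items, key=lambda x: x[1], reverse=True):
        i = totals.index(min(totals))
        batches[i].append(name)
        totals[i] += size
    return batches, totals

items = [("a", 9), ("b", 7), ("c", 6), ("d", 5), ("e", 3)]
print(balanced_batches(items, 2))  # ([['a', 'd'], ['b', 'c', 'e']], [14, 16])
```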
For 1 and 2 that should be possible I think, right @Midnighter?
However, forcing an order risks slowing down the pipeline, since it would have to wait for all prior steps to complete.
For 3, it will be more difficult, and I'm not sure it's worth the effort: read count or dataset size doesn't necessarily correspond linearly to run time, since that depends on how many hits are actually valid versus reads that don't match anything.