[proposal] Move some `publishDir` attributes to `process`-level directives #4661

ewels · 2024-01-16T08:32:50Z

Based on a comment originally posted in #4205 (comment)

publishDir supports multiple attributes that control file publishing behaviour. The problem with having them structured this way is that if any need to be overwritten, then entire publishDir statement has to be repeated. This leads to some attributes being repeated verbatim many times in pipeline configs.

This issue is to suggest moving attributes out to new directives, so that they can be set to a single value for the entire pipeline. This reduces duplicated configuration code. Could be all current attributes, but main priorities (most duplicated) are:

path (ideally split into two, with "base" and current path attribute)
mode
saveAs

If possible, this would be under a publishDir scope, though I don't know if this is possible technically.

So ideally, it'd be something like taking the following example:

publishDir = [
    path: { "${params.outdir}/genome" },
    mode: params.publish_dir_mode,
    saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
    enabled: params.save_reference
]

And it becoming:

// Defaults for all processes
process {
    publishDir { 
        base = params.outdir
        mode = params.publish_dir_mode
        saveAs = { filename -> filename.equals('versions.yml') ? null : filename }
    }
}
// Specific process overrides
process withName: "trimData" {
    publishDir.path: "trimmed"
}
process withName: "prepGenome" {
    publishDir.path: "genome",
    publishDir.enabled: params.save_reference
}

The text was updated successfully, but these errors were encountered:

bentsherman · 2024-01-16T17:21:17Z

publishDir defaults are tricky because it is a list of publish "commands" that can target specific outputs. I would suggest something like this:

// process directive
process FASTQC {
    publish 'results', mode: 'copy', saveAs: { ... }

    // ...
}

// config option
process {
    publish.path = 'results'
    publish.mode = 'copy'
    publish.saveAs = { ... }
}

It is essentially giving up the ability to target specific outputs differently, instead the setting would apply to the entire process. It seems to be what people do almost every time, but I would like to investigate it more.

But I do suggest making a new process directive, even something as simple as publishDir -> publish, so that users have to make an explicit transition, because this new directive would be structured differently in the runtime.

ewels · 2024-01-17T09:28:14Z

It is essentially giving up the ability to target specific outputs differently, instead the setting would apply to the entire process.

I don't really follow what the difference is here.. You mean that with the current publishDir syntax, the different options can be set on a per-file basis somehow? Is that not what you're still doing with the saveAs squiggly brackets in the example above?

But I do suggest making a new process directive,

Yup, absolutely - good point 👍🏻

bentsherman · 2024-01-17T14:51:28Z

I was thinking of the pattern option which allows you to select which outputs to publish. But I guess you're right that you can do the same thing with saveAs.

pditommaso · 2024-01-17T14:54:37Z

it's a -1 for me, we are going to deprecate publishDir and keep these info in the (module) output schema

bentsherman · 2024-01-17T16:03:13Z

@pditommaso do you think the publish mode i.e. copy vs symlink should be defined in the output schema?

pditommaso · 2024-01-17T16:08:52Z

I'd see a top level (default) setting, there's no really need to configure task by task

bentsherman · 2024-01-17T16:17:47Z

I could live with that, although I have a feeling there are users out there who are setting it differently per process

ewels · 2024-01-17T16:32:46Z

Definitely, I know of at least one use case for having mode process specific: some tools hate symlinks so you might want to set mode to copy for a single process (not the whole pipeline as that wastes disk space).

The other option in that issue is enabled - this is nearly always process-specific. For example, a process might be preparing a reference genome index - you don't want to save this as an output by default, but users on HPC might want to keep it so they set --save_reference_genome which selectively sets enabled back on again for those processes.

Setting them as defaults at top-level that can be overwritten with process-specific config would be great though. That's basically what this issue is all about ☝🏻 But when it's a single directive, I don't think that individual attributes can be set as default / overwritten like that, which is why I was proposing to split them up. Am I wrong about that?

marcodelapierre · 2024-02-19T13:43:00Z

it's a -1 for me, we are going to deprecate publishDir and keep these info in the (module) output schema

I would see value in also having an upgraded publishDir like configuration, aside of the schema, rather than deprecating completely. As such, I like Ben's suggestion above for instance.

My impression is that having the schema as the sole way to specify outputs (and inputs) can constitute a barrier to adoption for new users. Beginners and simple pipelines would benefit greatly from having a fall-back publishDir-like option . Thoughts?

pditommaso · 2024-02-21T06:47:54Z

@ewels My take is that these directives are needed because you still continue to think in publishDir model in which the pipeline output is defined at process level and it was designed pre-DLS2.

We need to move away from this model where the workflow output is buried in the process definition and then there's a huge config file to control all these settings.

I don't exclude that some of these settings can become a directive, but we have to see case by case to address were specific needs.

@marcodelapierre good point, it should not definitely become a mandatory requirement to create a schema file to be able to have your workflow writing some output file. I agree it would create too much friction.

ewels · 2024-02-21T06:54:24Z

Agreed - the above proposal would basically be a stop-gap: hopefully quite easy to implement / a minor change with immediate benefit. Then it can be replaced by a new system when it's ready.

pditommaso · 2024-02-21T08:29:37Z

A stop-gap is foverever .. We need to move fast with #4670, instead.

bentsherman · 2024-04-15T16:03:11Z

Closing in favor of #4784

ewels mentioned this issue Jan 16, 2024

[proposal] Add local module named configuration template #4422

Closed

bentsherman added the lang/processes label Jan 16, 2024

bentsherman mentioned this issue Jan 17, 2024

Workflow output definitions and schema #4670

Closed

bentsherman closed this as completed Apr 15, 2024

bentsherman mentioned this issue Apr 26, 2024

Workflow output definition #4784

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[proposal] Move some `publishDir` attributes to `process`-level directives #4661

[proposal] Move some `publishDir` attributes to `process`-level directives #4661

ewels commented Jan 16, 2024

bentsherman commented Jan 16, 2024

ewels commented Jan 17, 2024

bentsherman commented Jan 17, 2024

pditommaso commented Jan 17, 2024 •

edited

Loading

bentsherman commented Jan 17, 2024

pditommaso commented Jan 17, 2024

bentsherman commented Jan 17, 2024

ewels commented Jan 17, 2024 •

edited

Loading

marcodelapierre commented Feb 19, 2024 •

edited

Loading

pditommaso commented Feb 21, 2024

ewels commented Feb 21, 2024

pditommaso commented Feb 21, 2024

bentsherman commented Apr 15, 2024

[proposal] Move some publishDir attributes to process-level directives #4661

[proposal] Move some publishDir attributes to process-level directives #4661

Comments

ewels commented Jan 16, 2024

bentsherman commented Jan 16, 2024

ewels commented Jan 17, 2024

bentsherman commented Jan 17, 2024

pditommaso commented Jan 17, 2024 • edited Loading

bentsherman commented Jan 17, 2024

pditommaso commented Jan 17, 2024

bentsherman commented Jan 17, 2024

ewels commented Jan 17, 2024 • edited Loading

marcodelapierre commented Feb 19, 2024 • edited Loading

pditommaso commented Feb 21, 2024

ewels commented Feb 21, 2024

pditommaso commented Feb 21, 2024

bentsherman commented Apr 15, 2024

[proposal] Move some `publishDir` attributes to `process`-level directives #4661

[proposal] Move some `publishDir` attributes to `process`-level directives #4661

pditommaso commented Jan 17, 2024 •

edited

Loading

ewels commented Jan 17, 2024 •

edited

Loading

marcodelapierre commented Feb 19, 2024 •

edited

Loading