Meta make
I'd like to propose three very simple verbs that might get us halfway towards the goal of a DSL (#233). This proposal addresses the low-level technical part, which I think is required for any DSL. This proposal is discussed in #304.
meta_make()
Input: A drake plan. Output: A named list. Example of two equivalent plans:
# Proposed plan
plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = drake_plan(
    a = x,
    b = y
  ),
  results = meta_make(meta_plan)
)
# Equivalent plan that works in the current implementation
plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = NULL,
  results = {
    meta_plan
    list(
      a = x,
      b = y
    )
  }
)
drake::make(plan)
#> target meta_plan
#> target results
drake::readd(results)
#> cache /tmp/RtmpArc3UM/.drake
#> $a
#> [1] 1
#>
#> $b
#> [1] 2
Created on 2018-03-07 by the reprex package (v0.2.0).
The argument to meta_make() can be a target; that's where it becomes really powerful. If meta_make() is called with an up-to-date target and unchanged code, the results remain up to date too.
Subtle difference: The list returned by meta_make() is just a list of pointers, not a list of objects. Therefore, calling loadd() or readd() on such a target won't load all results into memory. See below for an implementation sketch.
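The equivalence between the two plans above can be derived mechanically. The following is only a sketch in base R: it assumes a plan is a data frame with `target` and `command` columns, and `meta_make_command()` is a hypothetical helper name, not part of drake.

```r
# Hypothetical helper: collapse a plan (assumed to be a data frame with
# `target` and `command` columns) into the command of one list-valued target.
meta_make_command <- function(meta_plan) {
  entries <- paste0(meta_plan$target, " = ", meta_plan$command)
  paste0("list(", paste(entries, collapse = ", "), ")")
}

meta_plan <- data.frame(
  target = c("a", "b"),
  command = c("x", "y"),
  stringsAsFactors = FALSE
)

command <- meta_make_command(meta_plan)
command
#> [1] "list(a = x, b = y)"

# Evaluate the generated command in a context where x and y are available
results <- eval(parse(text = command), list(x = 1, y = 2))
```

Unlike the proposed meta_make(), this sketch materializes the whole list; it only illustrates how the dependency on x and y follows from the meta plan.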
unpack()
Input: A named list. For each element, a target is created in the plan. Example of two equivalent plans:
# Proposed plan
plan <- drake::drake_plan(
  results = list(a = 1, b = 2, c = 3),
  unpack(results)
)
# Equivalent plan that works in the current implementation
plan <- drake::drake_plan(
  results = list(a = 1, b = 2, c = 3),
  a = results$a,
  b = results$b,
  c = results$c
)
drake::make(plan)
#> target results
#> target a
#> target b
#> target c
drake::readd(b)
#> cache /tmp/RtmpriGbJu/.drake
#> [1] 2
The arguments to unpack() can be targets; that's where it becomes really powerful. This is related to #283 (multi-file output; and the equivalent for R objects), but I don't think #283 is a prerequisite. If unpack() is called with an up-to-date target and unchanged code, all resulting targets (from the last run) remain up to date too.
The unpacking is a declarative operation, so we don't (necessarily) need to materialize all targets. In particular, if the target is the result of a previous call to meta_make(), the results are already unpacked.
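As a rough illustration of the unpacking step, the rows that unpack(results) would add to a plan can be generated from the element names alone. This is a sketch under assumptions: `unpack_rows()` and its `existing` argument are hypothetical, and a plan is again assumed to be a data frame with `target` and `command` columns.

```r
# Hypothetical helper: build one plan row per list element.
# `existing` lists targets already in the plan, so no duplicates are created.
unpack_rows <- function(source, names, existing = character()) {
  new_names <- setdiff(names, existing)
  data.frame(
    target = new_names,
    command = paste0(source, "$", new_names),
    stringsAsFactors = FALSE
  )
}

rows <- unpack_rows("results", c("a", "b", "c"))
rows$command
#> [1] "results$a" "results$b" "results$c"

# A target that already exists in the plan is skipped
fewer <- unpack_rows("results", c("a", "b", "c"), existing = "b")
```

Only names are needed to build the rows, which is why the operation can stay declarative until a target is actually requested.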
pack()
Semantics identical to tibble::lst(): Construct a list from a set of targets. The main difference is that this is a declarative operation that doesn't physically construct the list yet. It can be used to bundle targets together for use in a subsequent operation. Example of two equivalent plans:
# Proposed plan
plan <- drake::drake_plan(
  a = 1,
  b = 2,
  packed = pack(a, b)
)
# Equivalent plan that works in the current implementation
plan <- drake::drake_plan(
  a = 1,
  b = 2,
  packed = tibble::lst(a, b)
)
drake::make(plan)
#> target a
#> target b
#> target packed
drake::readd(packed)
#> cache /tmp/Rtmp6mdjtc/.drake
#> $a
#> [1] 1
#>
#> $b
#> [1] 2
Essentially the opposite of unpack().
Like with meta_make(), a list of pointers is returned when calling loadd() or readd() for such a target. See below for implementation details.
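To make "a list of pointers" concrete, here is a minimal sketch of a declarative pack(). It is an assumption-laden illustration: `pack_sketch()` is a hypothetical name, and the "drake_pointer" objects here are plain lists holding only a target name.

```r
# Hypothetical sketch: record target names without evaluating the targets.
pack_sketch <- function(...) {
  # Capture the unevaluated arguments; a and b are never looked up
  args <- as.list(substitute(list(...)))[-1]
  targets <- vapply(args, deparse, character(1))
  pointers <- lapply(targets, function(target) {
    structure(list(target = target), class = "drake_pointer")
  })
  structure(setNames(pointers, targets), class = "drake_pack")
}

# Works even though no objects named a or b exist in this session
packed <- pack_sketch(a, b)
names(packed)
#> [1] "a" "b"
```

Because the arguments are captured with substitute(), nothing is loaded into memory when the pack is created.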
We could do meta-make + unpack as a single operation, and not implement pack at all. I'm following the Unix philosophy here, because I feel that we can only gain by exposing these operations separately, if only for testing. From the separate verbs, we can provide a flat_meta_make() (meta-make + unpack) or even a pack + meta-make + unpack verb. These operations feel simple enough to be understood individually and in combination.
These three verbs seem the simplest possible solution to me, but maybe I'm missing a different decomposition into even simpler operations.
- Delayed plan evaluation, possibly a new target state "unknown"
- Visualization: We don't always want to expand the constructed plans when visualizing them
- Storing object hierarchies: When storing x <- list(a = 1, b = 2), we want to be able to access x$a and x$b without loading x
- ...
The new verbs can be implemented in a similar way to dbplyr: When executed, they return a lightweight data structure that contains all the information necessary to assemble the result. (In dbplyr, tbl %>% select(a, b) %>% filter(a > 5) creates an object that has a sql_render() method which composes the corresponding SQL, and only calling collect() will actually run the query.) This means that the objects returned by meta_make() et al. can just be serialized without special treatment.
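The serialization claim can be checked with a small sketch. The constructor `new_meta_make()` and the `commands` field are hypothetical; the point is only that a description object built from plain R data round-trips through base R serialization unchanged.

```r
# Hypothetical constructor: the object describes the work, it does not run it
new_meta_make <- function(commands) {
  structure(list(commands = commands), class = "drake_meta_make")
}

description <- new_meta_make(c(a = "x", b = "y"))

# Plain lists with a class attribute need no special treatment
bytes <- serialize(description, connection = NULL)
restored <- unserialize(bytes)
identical(restored, description)
#> [1] TRUE
```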
The return value of the new verbs could be S3 objects of classes "drake_meta_make", "drake_unpack" and "drake_pack", respectively. When the scheduler sees that a command returned an object of one of these classes, appropriate action is taken:
- For "drake_meta_make", jobs are enqueued to the scheduler
  - results will be stored in a separate storr namespace (one result per meta-target)
  - the class will have $ and [[ methods overridden, its .Names attribute will correspond to the actual target names
  - for now this assumes that all jobs can read from the storr
  - readd() would just return the "drake_meta_make" object
- For "drake_unpack", targets are added to the plan, making sure that no duplicates are created
  - the dependency graph is rewired to account for the new targets
  - if unpack() is called on a "drake_meta_make" object, we avoid copies by creating pointers (S3 objects of yet another class, say, "drake_pointer"), which are handled specially in loadd() and readd()
- For "drake_pack", a list of "drake_pointer" objects is constructed and stored
  - the class will also have $ and [[ methods overridden
  - perhaps "drake_meta_make" can also be just a list of "drake_pointer" objects
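The overridden $ and [[ methods described in the list above could look roughly like this. It is a sketch, not the drake implementation: a plain environment stands in for the storr namespace, and `new_meta_make_result()` is a hypothetical constructor.

```r
# Stand-in for a storr namespace: an environment keyed by target name
store <- new.env()
assign("a", 1, envir = store)
assign("b", 2, envir = store)

# Hypothetical constructor: the object holds target names, not values
new_meta_make_result <- function(targets, store) {
  structure(
    setNames(as.list(targets), targets),
    store = store,
    class = "drake_meta_make"
  )
}

# `[[` resolves one pointer by fetching a single value from the store
`[[.drake_meta_make` <- function(x, i) {
  key <- unclass(x)[[i]]
  get(key, envir = attr(x, "store"), inherits = FALSE)
}

# `$` delegates to `[[`, so results$a loads only target a
`$.drake_meta_make` <- function(x, name) {
  x[[name]]
}

results <- new_meta_make_result(c("a", "b"), store)
names(results)
#> [1] "a" "b"
results$a
#> [1] 1
```

Accessing one element touches only that key in the store, which is the behavior wanted for loadd() and readd() on such targets.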
The examples above use named lists for illustration. This means that names for objects/targets must be strings (just like in the current implementation, so not a restriction).
Ideally I'd prefer arbitrary (multivariate) keys to describe targets, and a nested tibble as data structure. (Let's not discuss this in too much detail for now.) If we support two-column data frames (target + x) from the start, we might be able to support multivariate keys later; I'd prefer this over the named list approach.
Alternatively, we might want to stick with named lists and provide seamless support for the enframe() and deframe() verbs that convert a named list to a two-column tibble and vice versa.
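For reference, the conversion between the two representations might look like the following, assuming the tibble package is installed. The column names target and x follow the two-column layout mentioned above and are otherwise arbitrary.

```r
library(tibble)

results <- list(a = 1, b = 2, c = 3)

# Named list -> two-column tibble (names in `target`, values in list column `x`)
frame <- enframe(results, name = "target", value = "x")
names(frame)
#> [1] "target" "x"

# Two-column tibble -> named list (first column supplies the names)
round_trip <- deframe(frame)
names(round_trip)
#> [1] "a" "b" "c"
```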
With a data-frame-based approach and multivariate keys, the focus of the DSL will be more efficient/elegant/straightforward ways to construct plans, which then are passed on to meta_make().
On the other hand, restricting target names to simple strings may be enough if our DSL adds multivariate keys on top of that. Again, let's postpone discussion on that detail.