Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Currently each operator is planned as its own node in the dataflow graph, with queues or network stack between them. This PR adds a new capability to the dataflow system, the ability to chain operators together in a single node.
This has a number of benefits:
For large and complex pipelines, the effect can be quite large, significantly reducing the memory requirements for the pipeline (particularly at larger queue sizes).
Chaining is implemented with a type of node,
ChainedOperator
. Currently all operators aside from sources are wrapped in a ChainedOperator, which may have one or more elements in the chain. Dataflow between links in a chain is performed via function call, thanks to a new Collector trait and its implementorChainCollector
which when it receives a record batch, immediately calls the handle method of the next operator with the batch. Signal messages are similarly propagated, with care to ensure that the signals are properly interleaved with data messages produced by the signal.Currently an operator will be chained into its predecessor if:
In the future we may relax the restriction on source/sink chaining.
Initially chaining is disabled by default. It can be enabled with the configuration setting
pipeline.chaining.enabled = true