Proposal: Extending merge rollup capabilities #14310
Comments
Thanks @davecromberge for filing this PEP, very well authored. Tagging @swaminathanmanish @Jackie-Jiang
I have done further investigation into how both options might be implemented and have concluded that it might be best to pursue dimensionality reduction in the first pass, and to re-evaluate whether varying aggregate behaviour is necessary.

Dimensionality reduction/erasure

This option requires additional configuration for each time bucket: a list of dimension names to erase. In this context, erasing a dimension refers to looking up the dimension's default value and replacing each stored value with it. The configuration might change to include an additional array configuration field as follows:
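A hypothetical sketch of that field, building on the task's existing per-bucket keys (`mergeType`, `bucketTimePeriod`, `bufferTimePeriod`); the `erasedDimensions` key and the bucket names are invented for illustration:

```jsonc
{
  "MergeRollupTask": {
    "30days.mergeType": "rollup",
    "30days.bucketTimePeriod": "30d",
    "30days.bufferTimePeriod": "2d",
    "90days.mergeType": "rollup",
    "90days.bucketTimePeriod": "90d",
    "90days.bufferTimePeriod": "30d",
    // hypothetical new field: dimensions to erase in this bucket
    "90days.erasedDimensions": ["browser", "os"]
  }
}
```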
In the example above, only the 90-day bucket erases dimension values. Concerning implementation, if the new field is provided, the MergeRollupTask will have to pass a custom record transformer that overwrites the configured dimensions with their default values.
Note: The custom record transformer precedes all existing record transformers. Finally, the rollup process will consolidate all records whose dimensions have matching coordinates. The transformed records should result in a greater degree of rollup, expressed as a fraction of the number of input records.

Varying aggregate behaviour over time (abandoned?)

Varying aggregate behaviour over time introduces complexity for indeterminate gains. Firstly, the configuration for sketch precision would have to be defined for both the different metrics and the different time periods, which makes the current task configuration confusing. Example:
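A hypothetical example of the problem (all key names are invented for illustration): the sketch precision has to be repeated per time bucket even though the aggregation function is already declared on the metric itself:

```jsonc
{
  "MergeRollupTask": {
    "30days.mergeType": "rollup",
    "30days.bucketTimePeriod": "30d",
    // hypothetical per-bucket function configuration
    "30days.sketchCol.aggregationType": "distinctCountThetaSketch",
    "30days.sketchCol.functionParameters": { "lgK": "16" },
    "365days.mergeType": "rollup",
    "365days.bucketTimePeriod": "365d",
    "365days.sketchCol.aggregationType": "distinctCountThetaSketch",
    "365days.sketchCol.functionParameters": { "lgK": "15" },
    // existing-style metric-level declaration of the same function
    "sketchCol.aggregationType": "distinctCountThetaSketch"
  }
}
```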
In the example above, the aggregation function is configured both within the time buckets and on the metrics directly, which might be confusing. Alternatively, the function parameters could be supplied directly on the metrics, which still requires additional time configuration for each parameter. Secondly, and more importantly, varying aggregate behaviour over time can lead to incorrect results. This is because StarTree indexes are constructed using the aggregation function configuration in the table index config, which is fixed for all segments.
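For context, a star-tree index is declared once in the table index config and applies to every segment regardless of age; a sketch of such a declaration (column names illustrative):

```jsonc
// declared once per table, not per time bucket
"starTreeIndexConfigs": [
  {
    "dimensionsSplitOrder": ["country", "browser"],
    "functionColumnPairs": ["DISTINCTCOUNTTHETASKETCH__sketchCol"],
    "maxLeafRecords": 10000
  }
]
```

A query spanning old and new segments would then combine pre-aggregated values produced with different parameters.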
Adds the capability to erase dimension values from a merged segment before rollup to reduce cardinality and increase the degree to which common dimension coordinates are aggregated. This can result in a space saving for some dimensions which are not important in historic data. See: apache#14310
What needs to be done?
Extend the merge-rollup framework to create additional transformations:
Dimensionality reduction/erasure
Eliminate a particular dimension column's values to allow more rows to aggregate as duplicates.
For example:
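A hypothetical illustration (column names and values invented): two rows that differ only in Browser, and the single rolled-up row after Browser is erased to a default value:

```jsonc
// before rollup
[
  { "day": "2024-01-01", "browser": "Chrome",  "country": "US", "views": 120 },
  { "day": "2024-01-01", "browser": "Firefox", "country": "US", "views": 80 }
]
// after erasing browser (set to a default value) and rolling up
[
  { "day": "2024-01-01", "browser": "null", "country": "US", "views": 200 }
]
```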
The above example shows the Browser dimension erased or set to some default value after some time window has passed.
Varying aggregate behaviour over time
Some aggregate values could change precision over time. The multi-level merge functionality can be used to reduce the resolution or precision of aggregates for older segments. This applies primarily to sketches, but could also be used for other binary aggregate types.
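For a rough sense of scale (the key names mirror the hypothetical configuration in the comment above and are not part of any released API): a theta sketch's serialized size grows roughly in proportion to 2^lgK, so decrementing lgK by one as data moves into an older bucket roughly halves the sketch:

```jsonc
// hypothetical per-bucket precision: full precision for recent data,
// half-size sketches once data is a year old
"30days.sketchCol.functionParameters": { "lgK": "16" },
"365days.sketchCol.functionParameters": { "lgK": "15" }
```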
The above example shows a size reduction of 2x on existing sketches, which could be achieved by reducing the lgK value by 1 as data ages. Be aware that this could cause varying precision for queries that span time ranges, where the sketch implementation supports this.
Why the feature is needed (e.g. describing the use case).
The primary justification for such a feature is more aggressive space saving for historic data. As the merge rollup task processes older time windows, users could eliminate non-critical dimensions, which would result in a greater proportion of documents rolling up into a single aggregate. Similarly, users could sacrifice aggregate accuracy for historic queries, trading precision for a smaller storage footprint - especially when dealing with Theta / Tuple sketches, which can be on the order of megabytes at lgK = 16.
Idea on how this may be implemented
Both extensions would require changes to the configuration of the Minion merge rollup task. In particular, the most flexible approach would be a dynamic bag of properties that could apply to each individual aggregation function, interpreted before rolling up or merging the data. The same bag could carry the settings for both extensions (a combined sketch follows the list below):
Dimensionality reduction/erasure
Varying aggregate behaviour over time
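A minimal combined sketch of such a property bag, assuming invented key names (`erasedDimensions`, `functionParameters`):

```jsonc
{
  "MergeRollupTask": {
    "365days.mergeType": "rollup",
    "365days.bucketTimePeriod": "365d",
    // hypothetical: dimensions to erase in this bucket
    "365days.erasedDimensions": ["browser"],
    // hypothetical: per-function parameter bag for this bucket
    "365days.sketchCol.functionParameters": { "lgK": "15" }
  }
}
```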
Note: This issue should be treated as a PEP request.