[Data] Use AggregateFn instead of groupby.count for unique() #49298
Labels
data
Ray Data-related issues
enhancement
Request for new feature and/or capability
P2
Important issue, but not time-critical
Description
I was trying `AggregateFn` instead of `groupby.count` for `unique()` the other day. It was about 10x faster: groupby does a sort (O(n log n / parallelism)), whereas `AggregateFn` should be O(n / parallelism).
I have a draft PR that uses the existing aggregation mechanics; the following example script does essentially the same thing:
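The sketch below is not the script from the draft PR and does not call Ray. It is a library-free illustration of the two strategies being compared, assuming blocks are plain Python lists of hashable values. The helper names `init`, `accumulate_block`, and `merge` are hypothetical and only mirror the general shape of an aggregation; the linear bound also assumes expected O(1) hash inserts and a distinct-value set small enough to merge cheaply.

```python
from functools import reduce
from itertools import groupby


def unique_via_sort(rows):
    """Sort-based baseline: sort all rows, then collapse runs of equal values."""
    # The sort dominates the cost: O(n log n) comparisons.
    return [key for key, _ in groupby(sorted(rows))]


# Hypothetical helpers mirroring an init / accumulate / merge aggregation.
def init():
    """Empty accumulator for one block."""
    return set()


def accumulate_block(acc, block):
    """Fold one block into its accumulator: one hash insert per row."""
    acc.update(block)
    return acc


def merge(left, right):
    """Combine two partial accumulators into one."""
    left |= right
    return left


def unique_via_aggregate(blocks):
    """Hash-based unique: per-block partial sets, then a merge step."""
    # Each partial depends on a single block, so in a distributed setting
    # these could be computed in parallel; here they run sequentially.
    partials = [accumulate_block(init(), block) for block in blocks]
    return reduce(merge, partials, init())


blocks = [[3, 1, 3], [2, 1], [5, 2, 5]]
all_rows = [row for block in blocks for row in block]

# Both strategies agree on the set of distinct values.
assert sorted(unique_via_aggregate(blocks)) == unique_via_sort(all_rows)
```

No sort is needed on the aggregate path, which is where the O(n / parallelism) estimate comes from. The sort-based path returns values in order as a side effect, while the hash-based path returns them unordered.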
Use case
It's useful to get some basic statistics of a dataset very quickly.