[Data] Use AggregateFn instead of groupby.count for unique() #49298

Open
wingkitlee0 opened this issue Dec 17, 2024 · 0 comments · May be fixed by #49296
Labels
data Ray Data-related issues enhancement Request for new feature and/or capability P2 Important issue, but not time-critical

Comments

@wingkitlee0
Copy link
Contributor

Description

I tried AggregateFn instead of groupby.count for unique() the other day, and it was about 10x faster:

Groupby performs a sort, roughly O((n log n) / parallelism), whereas the AggregateFn approach should be O(n / parallelism).
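As an aside, the accumulate/merge pattern behind this can be sketched in plain Python (no Ray) to show why it avoids a sort; `accumulate_block`, `merge`, and `blocks` below are illustrative stand-ins for the AggregateFn hooks and Ray Data blocks, not real Ray APIs:

```python
from functools import reduce

def accumulate_block(acc: set, block: list) -> set:
    # One O(len(block)) pass over each block; no sorting involved.
    return acc | set(block)

def merge(a: set, b: set) -> set:
    # Combining two partial results is linear in their sizes.
    return a | b

# Hypothetical partitioned data, standing in for Ray Data blocks.
blocks = [[1, 2, 3, 2], [3, 4, 5], [5, 6, 1]]

partials = [accumulate_block(set(), block) for block in blocks]
unique_items = sorted(reduce(merge, partials))
print(unique_items)  # [1, 2, 3, 4, 5, 6]
```

Since each block is accumulated independently, this parallelizes naturally, and the final merge only touches the (typically much smaller) sets of distinct values.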

I have a draft PR that uses the existing aggregation mechanics; the following example script does essentially the same thing:

import numpy as np
import ray
import ray.data
from ray.data.aggregate import AggregateFn

def experiment1(random_state=None) -> list:
    """Baseline: Dataset.unique(), currently implemented via groupby."""
    if random_state is None:
        random_state = np.random.default_rng()

    ds = ray.data.from_items(random_state.integers(0, 10000, size=1_000_000))

    return ds.unique("item")

def get_unique_func(col: str) -> AggregateFn:
    return AggregateFn(
        init=lambda x: set(),
        # Keep the accumulators as sets until finalize, so merges stay cheap.
        merge=lambda a, b: a | b,
        accumulate_block=lambda a, block: a | set(block[col]),
        name="unique",
        finalize=lambda a: list(a),
    )

def experiment2(random_state=None) -> list:
    """Alternative: single-pass set-union aggregation via AggregateFn."""
    if random_state is None:
        random_state = np.random.default_rng()

    ds = ray.data.from_items(random_state.integers(0, 10000, size=1_000_000))

    return ds.aggregate(get_unique_func("item"))["unique"]

def main():
    random_state = np.random.default_rng(1234)

    ray.init()

    unique_items = experiment1(random_state)
    print(f"Number of unique items: {len(unique_items)}")
    print(f"{type(unique_items)=}")

    unique_items = experiment2(random_state)
    print(f"Number of unique items: {len(unique_items)}")
    print(f"{type(unique_items)=}")

if __name__ == "__main__":
    main()

Use case

It is useful for quickly getting basic statistics of a dataset.

@wingkitlee0 wingkitlee0 added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 17, 2024
@wingkitlee0 wingkitlee0 changed the title [Data] Using AggregateFn instead of groupby.count for unique() [Data] Use AggregateFn instead of groupby.count for unique() Dec 17, 2024
@wingkitlee0 wingkitlee0 linked a pull request Dec 17, 2024 that will close this issue
8 tasks
@jcotant1 jcotant1 added the data Ray Data-related issues label Dec 17, 2024
@richardliaw richardliaw added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 17, 2024