[BUG] repartition on Dataset removes tags from schema #179

Open
radekosmulski opened this issue Dec 2, 2022 · 4 comments
Labels: bug, P1


radekosmulski commented Dec 2, 2022

[Image attached in the original issue]

Reproducer code:

import cudf
import nvtabular as nvt
from merlin.schema.tags import Tags

# Small example frame with one user, one continuous, and two categorical columns.
purchases = cudf.DataFrame(
    data={'user_id': [0, 1, 2, 2],
          'price': [125.04, 23.07, 101.2, 2.34],
          'color': ['blue', 'blue', 'red', 'yellow'],
          'model': ['deluxe', 'compact', 'regular', 'regular']
})

# Attach tags to every column through workflow ops.
out = ['price'] >> nvt.ops.AddMetadata(tags=[Tags.TARGET])

out += ['price'] >> nvt.ops.AddTags(tags=[Tags.CONTINUOUS])
out += ['user_id'] >> nvt.ops.TagAsUserID()
out += ['color', 'model'] >> nvt.ops.TagAsItemFeatures()
out += ['color', 'model'] >> nvt.ops.AddTags(tags=[Tags.CATEGORICAL])

ds = nvt.Dataset(purchases)
wf = nvt.Workflow(out)

ds_out = wf.fit_transform(ds)
ds_out.schema  # the tags are present here

ds_out = ds_out.repartition(5)

ds_out.schema  # the tags are gone after repartition

radekosmulski added the bug label Dec 2, 2022

karlhigley (Contributor) commented

I think this is because repartition creates a brand new Dataset object which then tries to infer a schema from the raw data all over again, but it shouldn't be too hard to maintain the existing schema in this case.

Does it work if you supply schema=self.schema as a Dataset constructor argument in the definition of repartition?
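
To make the suggestion above concrete, here is a minimal, self-contained sketch of the pattern being described. It does not use Merlin or NVTabular: `ToyDataset`, `ColumnSchema`, `repartition_lossy`, and the string tags are hypothetical stand-ins invented for illustration, and the real `Dataset.repartition` may be structured differently. The sketch only shows why building a new object without forwarding the schema drops tags, and how passing `schema=self.schema` keeps them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ColumnSchema:
    """Hypothetical column description: a name plus a set of tags."""
    name: str
    tags: frozenset = field(default_factory=frozenset)


class ToyDataset:
    """Hypothetical stand-in for a Dataset; not Merlin's real class."""

    def __init__(self, rows, npartitions=1, schema=None):
        self.rows = rows
        self.npartitions = npartitions
        # With no schema supplied, fall back to inference from the raw data.
        # Inference only recovers column names, so every tag is lost.
        self.schema = schema if schema is not None else self._infer_schema(rows)

    @staticmethod
    def _infer_schema(rows):
        names = rows[0].keys() if rows else []
        return {name: ColumnSchema(name) for name in names}

    def repartition_lossy(self, npartitions):
        # Mirrors the suspected bug: the existing schema is not forwarded.
        return ToyDataset(self.rows, npartitions)

    def repartition(self, npartitions):
        # Mirrors the suggested fix: forward the existing schema.
        return ToyDataset(self.rows, npartitions, schema=self.schema)


rows = [{"user_id": 0, "price": 125.04}, {"user_id": 1, "price": 23.07}]
tagged = {
    "user_id": ColumnSchema("user_id", frozenset({"user_id"})),
    "price": ColumnSchema("price", frozenset({"target", "continuous"})),
}

ds = ToyDataset(rows, schema=tagged)
lossy = ds.repartition_lossy(5)  # schema re-inferred, tags dropped
kept = ds.repartition(5)         # schema forwarded, tags preserved
```

In this toy version, `lossy.schema["price"].tags` is empty while `kept.schema["price"].tags` still holds both tags, which is the difference the issue reports.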

rnyak (Contributor) commented Jan 25, 2023

@radekosmulski any update based on Karl's comment above? thanks.

karlhigley (Contributor) commented

I think @sararb fixed this issue in #192

rnyak added the P1 label Feb 1, 2023

rnyak (Contributor) commented Feb 1, 2023

@radekosmulski can you please test this again with the latest branches pulled and see whether this issue was fixed? Sara made a fix, but I'm not sure it solves your issue as well.
