Hold and use nested column metadata #466

Open
5 tasks
hombit opened this issue Oct 25, 2024 · 0 comments
Labels
enhancement New feature or request

hombit (Contributor) commented Oct 25, 2024

We are now getting initial nested-column support in the HATS/LSDB ecosystem. We already have a couple of catalogs (ZTF alerts, SDSS DR7 spectra) whose list-columns represent nested data that we could pack into a single nested column after reading the data.

Today we can nest these list-columns with code like this:

from lsdb import read_hats

raw_catalog = read_hats('https://data.lsdb.io/hats/alerce/')
catalog_with_lc = raw_catalog.nest_lists(
    base_columns=[col for col in raw_catalog.columns if not col.startswith('lc_')],
    name='lc',
)
catalog_with_nondet = catalog_with_lc.nest_lists(
    base_columns=[col for col in catalog_with_lc.columns if not col.startswith('nondet_')],
    name='nondet',
)
catalog = catalog_with_nondet.nest_lists(
    base_columns=[col for col in catalog_with_nondet.columns if not col.startswith('ref_')],
    name='ref',
)

This works, but the user experience is far from ideal. How would a user know which columns can be packed? Here it is done by name prefix, which is ugly and does not scale. And how would a user save the catalog back in its original format when calling to_hats?
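
For reference, the prefix convention can be folded into a small loop. This is only a sketch of the current workaround, not a proposed API: `nest_by_prefixes` is a hypothetical helper, and it assumes nothing beyond the `columns` attribute and the `nest_lists(base_columns=..., name=...)` call used in the snippet above.

```python
def nest_by_prefixes(catalog, names):
    """Chain one nest_lists call per nested-column name (hypothetical helper).

    Assumes list-columns follow the "<name>_" prefix convention and that
    `catalog` exposes `.columns` and `.nest_lists(base_columns=..., name=...)`.
    """
    for name in names:
        prefix = f"{name}_"
        # Everything that does not carry the prefix stays a base column,
        # including nested columns created in earlier iterations.
        base_columns = [col for col in catalog.columns if not col.startswith(prefix)]
        catalog = catalog.nest_lists(base_columns=base_columns, name=name)
    return catalog
```

With the catalog above this would be called as `nest_by_prefixes(raw_catalog, ['lc', 'nondet', 'ref'])`. The user still has to know the prefixes in advance, which is exactly the gap that metadata would close.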

We can solve these issues with better nested-column support across the ecosystem:

  • hats: Parse catalog metadata that specifies which list-columns belong to which nested column, e.g. mag and mjd form lc, while flux and wave form sed.
  • hats-import: Generate and save nested-column metadata.
  • lsdb: read_hats uses the nested-column metadata to pack list-columns into NestedDtype columns. It still allows selecting individual "nested" sub-columns, e.g. if "mag" and "magerr" are selected and "mjd" is not, the first two form an "lc" nested column.
  • lsdb: to_hats splits each nested column into list-columns and writes the appropriate metadata.
  • nested-pandas: Reimplement Parquet I/O according to the HATS plans (lincc-frameworks/nested-pandas#163).
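
The read-side behavior described in the list could be sketched roughly as follows. Everything here is an assumption made for illustration: the metadata layout (nested column name mapped to its list-columns), the `nested_columns` key, and the `plan_nesting` helper are hypothetical, and the actual HATS format is not decided in this issue. The column names follow the examples above.

```python
import json

# Hypothetical metadata layout: nested column name -> its list-columns.
NESTED_COLUMNS = {
    "lc": ["mjd", "mag", "magerr"],
    "sed": ["wave", "flux"],
}


def plan_nesting(nested_columns, selected):
    """Split the selected columns into base columns and nested-column groups."""
    # Reverse lookup: list-column name -> owning nested column.
    owner = {col: name for name, cols in nested_columns.items() for col in cols}
    base, groups = [], {}
    for col in selected:
        if col in owner:
            groups.setdefault(owner[col], []).append(col)
        else:
            base.append(col)
    return base, groups


# The mapping is plain JSON-serializable data, so it survives a round trip.
encoded = json.dumps({"nested_columns": NESTED_COLUMNS})
decoded = json.loads(encoded)["nested_columns"]

# Selecting "mag" and "magerr" without "mjd" still yields an "lc" group.
base, groups = plan_nesting(decoded, ["ra", "dec", "mag", "magerr", "flux"])
print(base)    # ['ra', 'dec']
print(groups)  # {'lc': ['mag', 'magerr'], 'sed': ['flux']}
```

Each group could then feed one `nest_lists` call on read, and the same mapping, rebuilt from the nested columns, is what to_hats would need to write back. A real format would also have to handle list-columns that share a name across nested columns, which this flat lookup does not.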
Projects
Status: Suggested Todo