selectors should support slicing columns #15963

Open
samukweku opened this issue Apr 30, 2024 · 10 comments
Labels: A-selectors (Area: column selectors), enhancement (New feature or an improvement of an existing feature)

Comments

@samukweku

Description

Hi team. I would like to suggest adding a slice method to the selectors class, so that users can select a slice of columns:

import polars as pl
import polars.selectors as cs

data = {'City': ['Houston', 'Austin', 'Hoover'],
 'State': ['Texas', 'Texas', 'Alabama'],
 'Name': ['Aria', 'Penelope', 'Niko'],
 'Mango': [4, 10, 90],
 'Orange': [10, 8, 14],
 'Watermelon': [40, 99, 43],
 'Gin': [16, 200, 34],
 'Vodka': [20, 33, 18]}

df = pl.DataFrame(data)

df

┌─────────┬─────────┬──────────┬───────┬────────┬────────────┬─────┬───────┐
│ City    ┆ State   ┆ Name     ┆ Mango ┆ Orange ┆ Watermelon ┆ Gin ┆ Vodka │
│ ---     ┆ ---     ┆ ---      ┆ ---   ┆ ---    ┆ ---        ┆ --- ┆ ---   │
│ str     ┆ str     ┆ str      ┆ i64   ┆ i64    ┆ i64        ┆ i64 ┆ i64   │
╞═════════╪═════════╪══════════╪═══════╪════════╪════════════╪═════╪═══════╡
│ Houston ┆ Texas   ┆ Aria     ┆ 4     ┆ 10     ┆ 40         ┆ 16  ┆ 20    │
│ Austin  ┆ Texas   ┆ Penelope ┆ 10    ┆ 8      ┆ 99         ┆ 200 ┆ 33    │
│ Hoover  ┆ Alabama ┆ Niko     ┆ 90    ┆ 14     ┆ 43         ┆ 34  ┆ 18    │
└─────────┴─────────┴──────────┴───────┴────────┴────────────┴─────┴───────┘

The slicing syntax can be :

df.select(cs.slice('Mango','Vodka')) # alternative - df.select(cs['Mango':'Vodka'])
shape: (3, 5)
┌───────┬────────┬────────────┬─────┬───────┐
│ Mango ┆ Orange ┆ Watermelon ┆ Gin ┆ Vodka │
│ ---   ┆ ---    ┆ ---        ┆ --- ┆ ---   │
│ i64   ┆ i64    ┆ i64        ┆ i64 ┆ i64   │
╞═══════╪════════╪════════════╪═════╪═══════╡
│ 4     ┆ 10     ┆ 40         ┆ 16  ┆ 20    │
│ 10    ┆ 8      ┆ 99         ┆ 200 ┆ 33    │
│ 90    ┆ 14     ┆ 43         ┆ 34  ┆ 18    │
└───────┴────────┴────────────┴─────┴───────┘
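
Until something like this exists as a selector, the requested behaviour can be approximated on an eager frame by slicing the list of column names. The sketch below is only an illustration: `label_slice` is a hypothetical helper (not part of polars), and it assumes both labels exist and that the end label is inclusive, as in the example output above.

```python
from typing import Sequence


def label_slice(columns: Sequence[str], start: str, stop: str) -> list[str]:
    """Return the column names from `start` to `stop`, both inclusive.

    Hypothetical helper for illustration; not a polars function.
    """
    names = list(columns)
    # list.index raises ValueError if either label is missing.
    lo, hi = names.index(start), names.index(stop)
    if lo > hi:
        raise ValueError(f"{start!r} comes after {stop!r} in the column order")
    return names[lo : hi + 1]


columns = ["City", "State", "Name", "Mango", "Orange", "Watermelon", "Gin", "Vodka"]
print(label_slice(columns, "Mango", "Vodka"))
# ['Mango', 'Orange', 'Watermelon', 'Gin', 'Vodka']
```

With the frame from the description, `df.select(label_slice(df.columns, "Mango", "Vodka"))` should give the five-column result shown above. This only works when the column order is already known, so it is not a substitute for a real selector.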
samukweku added the enhancement label Apr 30, 2024
@aut0clave

If you know what fields you want, why do you need a selector? Why not use a simple .select("Mango","Vodka")? Or the existing cs.by_name("Mango","Vodka")?

@cmdlineluser
Contributor

cmdlineluser commented Apr 30, 2024

@aut0clave They want to extract the "range of columns" Mango .. Vodka

I believe first/last are the only selectors that are "positional"

>>> cs.first().meta.serialize()
'{"Nth":0}'

There is no .nth() selector, but it would be easy to add:

>>> import io
>>> df.select(pl.Expr.deserialize(io.StringIO("""{"Nth":3}""")))
shape: (3, 1)
┌───────┐
│ Mango │
│ ---   │
│ i64   │
╞═══════╡
│ 4     │
│ 10    │
│ 90    │
└───────┘

nth -> column name mapping is done here:

fn replace_nth(expr: Expr, schema: &Schema) -> Expr {

From what I can tell, there is nothing that goes the other way, i.e. column name -> nth - which I think would be needed in order to support this at the selector level?

@samukweku
Author

@cmdlineluser I'd assume there is a way to get the positions of the column names (maybe grab the positions via list.index from Python and pass them to the Rust end). I don't know much about the internal implementation, but I'm happy to learn. I'd also suggest, if the team feels this is a worthwhile addition, that slicing be limited to column names only (numeric positions should not be supported).

@alexander-beedie
Collaborator

alexander-beedie commented Apr 30, 2024

> @cmdlineluser I'd assume there is a way to get the positions of the column names (maybe grab the positions via list.index from Python and pass them to the Rust end).

FYI: until we are actually evaluating a lazy query plan, we may not know the position of all of the columns (e.g. when expanding a struct or evaluating earlier selectors). Consequently, we can't precompute the positions and pass them down, because the answer is only known at the lower level (selectors are dynamic, evaluating internally at the point they are invoked) ;)

Offering index-based selection doesn't seem like a bad idea (we currently only support selection by name/dtype and the special cases of first/last, as noted by @cmdlineluser), but would need some internal additions to be possible 🤔
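
To make the point about lazy plans concrete, here is a toy model in plain Python. It does not use polars internals: `LabelSlice` and `with_column_after` are hypothetical names, and the "schema" is just a list of column names. It compares positions computed when the query is built with labels resolved only once the final schema is known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelSlice:
    """Hypothetical selector: stores labels, resolves positions only when asked."""

    start: str
    stop: str

    def resolve(self, schema: list[str]) -> list[str]:
        lo, hi = schema.index(self.start), schema.index(self.stop)
        return schema[lo : hi + 1]


def with_column_after(schema: list[str], anchor: str, new: str) -> list[str]:
    """Toy plan step: insert a new column right after `anchor`."""
    i = schema.index(anchor) + 1
    return schema[:i] + [new] + schema[i:]


initial = ["City", "State", "Name", "Mango", "Orange", "Watermelon", "Gin", "Vodka"]

# Positions computed up front, against the schema known when the query was built.
precomputed = range(initial.index("Mango"), initial.index("Vodka") + 1)

# An earlier step in the plan changes the schema before the selection runs.
final = with_column_after(initial, "State", "Zip")

stale = [final[i] for i in precomputed]
deferred = LabelSlice("Mango", "Vodka").resolve(final)

print(stale)     # ['Name', 'Mango', 'Orange', 'Watermelon', 'Gin']
print(deferred)  # ['Mango', 'Orange', 'Watermelon', 'Gin', 'Vodka']
```

The precomputed positions silently pick the wrong columns once the schema shifts, which is why the name-to-position lookup would have to happen where the schema is actually known.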

@samukweku
Author

@cmdlineluser so something like cs.by_position, cs.by_range?

@cmdlineluser
Contributor

cmdlineluser commented May 4, 2024

@alexander-beedie is the person to ask. (they created selectors :-D)

@alexander-beedie
Collaborator

> @cmdlineluser so something like cs.by_position, cs.by_range?

Probably cs.by_index, which would take one or more index values, a range, or a slice (as range/slice can be directly expanded into a list of indexes, so internally we just need to handle that). Does need additional low-level support though.
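
As a rough illustration of the "expand everything into a list of indexes" idea, the following plain-Python sketch flattens ints, ranges and slices into concrete column positions. `expand_indices` is a hypothetical helper and is not how polars implements this. It also assumes the column count is available, which an open-ended slice such as `slice(-2, None)` needs before it can be made concrete.

```python
from typing import Iterable, Union

IndexSpec = Union[int, range, slice]


def expand_indices(specs: Iterable[IndexSpec], n_columns: int) -> list[int]:
    """Flatten ints, ranges and slices into non-negative column indexes.

    Hypothetical helper for illustration only.
    """
    out: list[int] = []
    for spec in specs:
        if isinstance(spec, slice):
            # A slice only becomes concrete once the column count is known.
            out.extend(range(*spec.indices(n_columns)))
        elif isinstance(spec, range):
            # Normalise negative positions to count from the end.
            out.extend(i + n_columns if i < 0 else i for i in spec)
        else:
            out.append(spec + n_columns if spec < 0 else spec)
    bad = [i for i in out if not 0 <= i < n_columns]
    if bad:
        raise IndexError(f"column index out of range: {bad}")
    return out


print(expand_indices([0, range(3, 5), slice(-2, None)], n_columns=8))
# [0, 3, 4, 6, 7]
```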

alexander-beedie added the A-selectors label Jun 9, 2024
@alexander-beedie
Collaborator

alexander-beedie commented Jul 3, 2024

FYI, forgot to update this issue, but we do now have a new cs.by_index selector which can take indices and ranges, which gets you some of the way there: #16217

@samukweku
Author

samukweku commented Jul 3, 2024

Thanks @alexander-beedie. Looks good. Safe to assume that slicing with labels may be implemented at a future date?

alexander-beedie self-assigned this Jul 3, 2024
@alexander-beedie
Collaborator

> Thanks @alexander-beedie. Looks good. Safe to assume that slicing with labels may be implemented at a future date?

Probably, but no timeline; the 1.0 (and a few quick point releases to address any related issues) has priority at the moment. And I'm on vacation for the next two weeks ;)

4 participants