Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add index_of() function to Series and Expr #19894

Merged
merged 100 commits into from
Jan 7, 2025

Conversation

itamarst
Copy link
Contributor

@itamarst itamarst commented Nov 20, 2024

Fixes #5503

Categoricals don't work yet; see #20171 and #20318.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Nov 20, 2024
Copy link

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 98.68421% with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.00%. Comparing base (72cd66a) to head (8c3e0d4).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-ops/src/series/ops/index_of.rs 98.76% 1 Missing ⚠️
.../polars-python/src/lazyframe/visitor/expr_nodes.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19894      +/-   ##
==========================================
+ Coverage   78.95%   79.00%   +0.04%     
==========================================
  Files        1564     1566       +2     
  Lines      220882   221035     +153     
  Branches     2510     2510              
==========================================
+ Hits       174407   174619     +212     
+ Misses      45900    45842      -58     
+ Partials      575      574       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@itamarst itamarst marked this pull request as ready for review November 21, 2024 14:13
crates/polars-ops/src/series/ops/index_of.rs Outdated Show resolved Hide resolved
crates/polars-ops/src/series/ops/index_of.rs Outdated Show resolved Hide resolved
crates/polars-ops/src/series/ops/index_of.rs Outdated Show resolved Hide resolved
@itamarst itamarst marked this pull request as draft November 27, 2024 15:00
@itamarst itamarst closed this Dec 2, 2024
@itamarst
Copy link
Contributor Author

itamarst commented Dec 2, 2024

I think I've figured out how to use row encoding, so now I just need to write lots and lots of tests and make sure it actually works beyond the trivial case I've already tested.

@itamarst itamarst reopened this Dec 2, 2024
@itamarst
Copy link
Contributor Author

itamarst commented Dec 2, 2024

Unfortunately categorical and enum don't work (they also don't work for search_sorted(), which would be nice to fix); they ought to work, since e.g. pl.Series(["A", "B"], dtype=pl.Categorical) == "B" works, but I'm not sure how that is different than what I'm doing, so would appreciate any hints.

E.g. for Categorical:

>>> import polars as pl
>>> pl.Series(["a", "b", "a"], dtype=pl.Categorical).index_of("a")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/itamarst/devel/polars/py-polars/polars/series/series.py", line 4771, in index_of
    return F.select(F.lit(self).index_of(element)).item()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/itamarst/devel/polars/py-polars/polars/functions/lazy.py", line 1913, in select
    return pl.DataFrame().select(*exprs, **named_exprs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/itamarst/devel/polars/py-polars/polars/dataframe/frame.py", line 9113, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/itamarst/devel/polars/py-polars/polars/lazyframe/frame.py", line 2029, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: got invalid or ambiguous dtypes: '[cat, str]' in expression 'index_of'

Consider explicitly casting your input types to resolve potential ambiguity.

Resolved plan until failure:

        ---> FAILED HERE RESOLVING 'select' <---
 SELECT [Series.index_of([String(a)])] FROM
  DF []; PROJECT */0 COLUMNS; SELECTION: None

@itamarst itamarst changed the title feat: Start of Series.index_of(), for primitive numeric types feat: Series.index_of() Dec 2, 2024
@coastalwhite
Copy link
Collaborator

My guess is that you are treating a categorical as a string when it goes into the row encoding. If you want to compare the row encoding of a series with the row encoding of another series they need to have been encoded with the exact same dtype (i.e. so the same RevMap as well) otherwise the output is undefined. If search_sorted doesn't do that either, that is a bug and I can look into it.

@itamarst
Copy link
Contributor Author

itamarst commented Dec 4, 2024

@coastalwhite search_sorted() does gets it wrong, yes. And separately if memory serves, you pass in a non-matching pl.lit("a", dtype=pl.Categorical) it doesn't error out with mismatching categoricals, it gives the wrong result.

@itamarst
Copy link
Contributor Author

itamarst commented Dec 4, 2024

@coastalwhite and the question is how/where do I convert to an enum/categorical, my attempts have failed so far.

@itamarst
Copy link
Contributor Author

itamarst commented Jan 2, 2025

Thank you for the new casting logic! I've updated to use it, and addressed the other two comments.

@itamarst itamarst requested a review from ritchie46 January 2, 2025 17:28
@ritchie46
Copy link
Member

Alright, looks great @itamarst. Thanks. I believe we only need docs entries on the python side (so that they end up in the ref guide), then it is good to go.

Copy link
Collaborator

@rodrigogiraoserrao rodrigogiraoserrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need the tiny user-guide page? It's pretty much the same as the docstrings, so I feel like it's enough to have the docstrings.

docs/source/user-guide/expressions/searching.md Outdated Show resolved Hide resolved
docs/source/user-guide/expressions/searching.md Outdated Show resolved Hide resolved
docs/source/user-guide/expressions/searching.md Outdated Show resolved Hide resolved
docs/source/user-guide/expressions/searching.md Outdated Show resolved Hide resolved
@nameexhaustion nameexhaustion changed the title feat: Series.index_of() feat: Add index_of() function to Series and Expr Jan 7, 2025
@itamarst
Copy link
Contributor Author

itamarst commented Jan 7, 2025

OK, I figured out how to add index_of to the API reference guide, and removed the user guide page.

@ritchie46
Copy link
Member

Alright, can you rebase? I believe that that should resolve CI.

@itamarst
Copy link
Contributor Author

itamarst commented Jan 7, 2025

Done.

@ritchie46
Copy link
Member

Alright, thanks @itamarst, looks good!

@ritchie46 ritchie46 merged commit 785bb1e into pola-rs:main Jan 7, 2025
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Find index of item in list
6 participants