Unexpected index out of bounds error for specific dataset and set of operations #16830

Closed
2 tasks done
maxzw opened this issue Jun 9, 2024 · 11 comments · Fixed by #16852
Labels: A-panic, accepted, bug, P-medium, python

Comments

@maxzw

maxzw commented Jun 9, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

With data of shape (872, 10):
repro_data.txt

Note: not all columns have operations performed on them, but they apparently need to be present for the error to occur!

import polars as pl

# `data` holds the rows from the attached repro_data.txt, shape (872, 10).
df = pl.DataFrame(
    data=data,
    schema={
        "group1": pl.Int16,
        "group2": pl.Int32,
        "val1": pl.Boolean,
        "val2": pl.Float64,
        "val3": pl.Boolean,
        "val4": pl.Float64,
        "val5": pl.Float64,
        "val6": pl.Int32,
        "val7": pl.Float64,
        "val8": pl.Float64,
    },
)

df = df.filter(pl.col("val1") | pl.col("val3"))
df = df.with_columns(pl.col("val4").max().over("group1", "group2").fill_null(0).alias("val4"))
df = df.filter(pl.col("val4") > pl.col("val7").sum().over("group1", "group2"))
df.with_columns(pl.col("val4").floor())

Log output

thread '<unnamed>' panicked at crates/polars-core/src/series/mod.rs:213:42:
index out of bounds: the len is 1 but the index is 1
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic_bounds_check
   3: polars_core::series::Series::select_chunk
   4: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
   5: polars_lazy::physical_plan::executors::stack::StackExec::execute_impl
   6: <polars_lazy::physical_plan::executors::stack::StackExec as polars_lazy::physical_plan::executors::executor::Executor>::execute
   7: polars_lazy::frame::LazyFrame::collect
   8: polars::lazyframe::PyLazyFrame::__pymethod_collect__
   9: pyo3::impl_::trampoline::trampoline
  10: polars::lazyframe::_::__INVENTORY::trampoline
  11: _method_vectorcall_VARARGS_KEYWORDS
  12: _call_function
  13: __PyEval_EvalFrameDefault
  14: __PyEval_Vector
  15: _method_vectorcall
  16: _call_function
  17: __PyEval_EvalFrameDefault
  18: __PyEval_Vector
  19: _call_function
  20: __PyEval_EvalFrameDefault
  21: __PyEval_Vector
  22: _builtin_exec
  23: _cfunction_vectorcall_FASTCALL
  24: _call_function
  25: __PyEval_EvalFrameDefault
  26: _gen_send_ex2
  27: __PyEval_EvalFrameDefault
  28: _gen_send_ex2
  29: __PyEval_EvalFrameDefault
  30: _gen_send_ex2
  31: _gen_send
  32: _method_vectorcall_O
  33: _call_function
  34: __PyEval_EvalFrameDefault
  35: __PyEval_Vector
  36: _call_function
  37: __PyEval_EvalFrameDefault
  38: __PyEval_Vector
  39: _call_function
  40: __PyEval_EvalFrameDefault
  41: __PyEval_Vector
  42: _method_vectorcall
  43: _PyVectorcall_Call
  44: __PyEval_EvalFrameDefault
  45: __PyEval_Vector
  46: _method_vectorcall
  47: _call_function
  48: __PyEval_EvalFrameDefault
  49: _gen_send_ex2
  50: __PyEval_EvalFrameDefault
  51: _gen_send_ex2
  52: __PyEval_EvalFrameDefault
  53: _gen_send_ex2
  54: __PyEval_EvalFrameDefault
  55: _gen_send_ex2
  56: __PyEval_EvalFrameDefault
  57: _gen_send_ex2
  58: __PyEval_EvalFrameDefault
  59: _gen_send_ex2
  60: _task_step_impl
  61: _task_step
  62: _task_wakeup
  63: _cfunction_vectorcall_O
  64: __PyObject_VectorcallTstate.4665
  65: _context_run
  66: _cfunction_vectorcall_FASTCALL_KEYWORDS
  67: __PyEval_EvalFrameDefault
  68: __PyEval_Vector
  69: _call_function
  70: __PyEval_EvalFrameDefault
  71: __PyEval_Vector
  72: _call_function
  73: __PyEval_EvalFrameDefault
  74: __PyEval_Vector
  75: _call_function
  76: __PyEval_EvalFrameDefault
  77: __PyEval_Vector
  78: _call_function
  79: __PyEval_EvalFrameDefault
  80: __PyEval_Vector
  81: _call_function
  82: __PyEval_EvalFrameDefault
  83: __PyEval_Vector
  84: _method_vectorcall
  85: _call_function
  86: __PyEval_EvalFrameDefault
  87: __PyEval_Vector
  88: _builtin_exec
  89: _cfunction_vectorcall_FASTCALL
  90: _call_function
  91: __PyEval_EvalFrameDefault
  92: __PyEval_Vector
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Issue description

The operations above panic with an unexpected index-out-of-bounds error.

Converting to pandas and back at any point within these operations works around the issue:

df = df.filter(pl.col("val1") | pl.col("val3"))

df = pl.from_pandas(df.to_pandas(), schema_overrides=df.schema)

df = df.with_columns(pl.col("val4").max().over("group1", "group2").fill_null(0).alias("val4"))
df = df.filter(pl.col("val4") > pl.col("val7").sum().over("group1", "group2"))
df.with_columns(pl.col("val4").floor())

Expected behavior

I expect this error not to occur.

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.10.0 (default, Mar  3 2022, 03:54:28) [Clang 12.0.0 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            0.17.4
fastexcel:            <not installed>
fsspec:               2023.9.0
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.7.2
nest_asyncio:         1.6.0
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               2.1.2
pyarrow:              14.0.1
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.20
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
maxzw added the bug, needs triage, and python labels on Jun 9, 2024
@maxzw
Author

maxzw commented Jun 9, 2024

@ritchie46 here is a repro for #16605

@cmdlineluser
Contributor

Can reproduce.

If it is of use for debugging: It does not seem to happen using the Lazy API.

df = df.lazy()
df = df.filter(pl.col("val1") | pl.col("val3"))
df = df.with_columns(pl.col("val4").max().over("group1", "group2").fill_null(0).alias("val4"))
df = df.filter(pl.col("val4") > pl.col("val7").sum().over("group1", "group2"))
df.with_columns(pl.col("val4").floor()).collect()

# shape: (9, 10)
# ┌────────┬────────┬──────┬──────┬───┬───────┬──────┬──────────┬───────────┐
# │ group1 ┆ group2 ┆ val1 ┆ val2 ┆ … ┆ val5  ┆ val6 ┆ val7     ┆ val8      │
# │ ---    ┆ ---    ┆ ---  ┆ ---  ┆   ┆ ---   ┆ ---  ┆ ---      ┆ ---       │
# │ i64    ┆ i64    ┆ bool ┆ f64  ┆   ┆ f64   ┆ i64  ┆ f64      ┆ f64       │
# ╞════════╪════════╪══════╪══════╪═══╪═══════╪══════╪══════════╪═══════════╡
# │ 1001   ┆ 100004 ┆ true ┆ null ┆ … ┆ 87.0  ┆ 0    ┆ 2.705119 ┆ 40.904418 │
# │ 1001   ┆ 100007 ┆ true ┆ null ┆ … ┆ 173.0 ┆ 0    ┆ 2.6165   ┆ 34.486    │
# │ 1001   ┆ 100009 ┆ true ┆ null ┆ … ┆ 211.0 ┆ 0    ┆ 4.458603 ┆ 77.95037  │
# │ 1001   ┆ 100010 ┆ true ┆ null ┆ … ┆ 178.0 ┆ 0    ┆ 2.3165   ┆ 37.77     │
# │ 1001   ┆ 100011 ┆ true ┆ null ┆ … ┆ 174.0 ┆ 0    ┆ 5.548593 ┆ 71.207139 │
# │ 1001   ┆ 100012 ┆ true ┆ null ┆ … ┆ 196.0 ┆ 0    ┆ 2.1685   ┆ 32.888    │
# │ 1001   ┆ 100015 ┆ true ┆ null ┆ … ┆ 89.0  ┆ 0    ┆ 2.400406 ┆ 39.732588 │
# │ 1003   ┆ 100008 ┆ true ┆ null ┆ … ┆ 238.0 ┆ 0    ┆ 4.913397 ┆ 93.076396 │
# │ 1003   ┆ 100013 ┆ true ┆ null ┆ … ┆ 101.5 ┆ 0    ┆ 2.254043 ┆ 45.486928 │
# └────────┴────────┴──────┴──────┴───┴───────┴──────┴──────────┴───────────┘

stinodego added the A-panic label on Jun 9, 2024
@stinodego
Member

I cannot reproduce this 🤔

@Elvynzs

Elvynzs commented Jun 10, 2024

Surprisingly, I cannot reproduce this using the given data and code, but I have the same issue.

I will try to find the time to put together a minimal repro for my case.

@Elvynzs

Elvynzs commented Jun 10, 2024

Here it is; I was able to cut out a lot of the initial code:

import polars as pl
import numpy as np

df = pl.DataFrame({"index_1": np.repeat(np.arange(100), 10), "index_2": np.repeat(np.arange(100), 10)})
df = pl.concat([df[0:500], df[500:]])
df = df.filter(df["index_1"] == 0)
df = df.with_columns(index_2=pl.Series(values=[0] * 10))
df.set_sorted("index_2")  # Also crashes on write_parquet and some other operations

It crashes for me (Windows 11).

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[9], line 8
      6 df = df.filter(df["index_1"] == 0)
      7 df = df.with_columns(index_2 = pl.Series(values=[0]*10))
----> 8 df.set_sorted("index_2")

File C:\...\polars\dataframe\frame.py:10674, in DataFrame.set_sorted(self, column, descending, *more_columns)
  10653 def set_sorted(
  10654     self,
  10655     column: str | Iterable[str],
  10656     *more_columns: str,
  10657     descending: bool = False,
  10658 ) -> DataFrame:
  10659     """
  10660     Indicate that one or multiple columns are sorted.
  10661 
   (...)
  10669         Whether the columns are sorted in descending order.
  10670     """
  10671     return (
  10672         self.lazy()
  10673         .set_sorted(column, *more_columns, descending=descending)
> 10674         .collect(_eager=True)
  10675     )

File C:\...\polars\lazyframe\frame.py:1967, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1964 # Only for testing purposes atm.
   1965 callback = _kwargs.get("post_opt_callback")
-> 1967 return wrap_df(ldf.collect(callback))

PanicException: index out of bounds: the len is 1 but the index is 1

@stinodego
Member

That one I can reproduce, thanks!

stinodego added the P-medium label and removed the needs triage label on Jun 10, 2024
github-project-automation bot moved this to Ready in Backlog on Jun 10, 2024
@ritchie46
Member

Taking a look.

ritchie46 self-assigned this on Jun 10, 2024
@cmdlineluser
Contributor

Perhaps the original issue could be platform specific? I can reproduce it on macOS (same as @maxzw).

@Elvynzs I can reproduce your example also.

It seems it may be a little different, and have to do with your use of Series.

Changing the filter to use expressions makes the example run for me:

df.filter(pl.col("index_1") == 0)

@maxzw
Author

maxzw commented Jun 10, 2024

@Elvynzs I'm not sure your issue is the same as the one in the description, but I'll check if the fix also works for mine 😃

github-project-automation bot moved this from Ready to Done in Backlog on Jun 10, 2024
@cmdlineluser
Contributor

cmdlineluser commented Jun 10, 2024

The original issue no longer reproduces for me thanks to #16852

@maxzw
Author

maxzw commented Jun 11, 2024

I can confirm as well! Thanks @ritchie46! 💯


6 participants