Automatically create xindex? #9703

Open
max-sixty opened this issue Nov 1, 2024 · 4 comments

@max-sixty
Collaborator

max-sixty commented Nov 1, 2024

Is your feature request related to a problem?

I'm trying to use xindex more. Currently, selecting values with a coordinate that hasn't been explicitly indexed via set_xindex() raises a KeyError:

import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature").assign_coords(lat2=lambda x: x.lat)

ds
# Output:
<xarray.Dataset> Size: 31MB
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
    lat2     (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
Data variables:
    air      (time, lat, lon) float64 31MB ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

# Attempting to select using the unindexed coordinate raises an error:
ds.sel(lat2=75)
# Output:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[20], line 1
----> 1 ds.sel(lat2=75)

File ~/workspace/xarray/xarray/core/dataset.py:3223, in Dataset.sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   3155 """Returns a new dataset with each array indexed by tick labels
   3156 along the specified dimension(s).
   3157
   (...)
   3220
   3221 """
   3222 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 3223 query_results = map_index_queries(
   3224     self, indexers=indexers, method=method, tolerance=tolerance
   3225 )
   3227 if drop:
   3228     no_scalar_variables = {}

File ~/workspace/xarray/xarray/core/indexing.py:186, in map_index_queries(obj, indexers, method, tolerance, **indexers_kwargs)
    183     options = {"method": method, "tolerance": tolerance}
    185 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "map_index_queries")
--> 186 grouped_indexers = group_indexers_by_index(obj, indexers, options)
    188 results = []
    189 for index, labels in grouped_indexers:

File ~/workspace/xarray/xarray/core/indexing.py:145, in group_indexers_by_index(obj, indexers, options)
    143     grouped_indexers[index_id][key] = label
    144 elif key in obj.coords:
--> 145     raise KeyError(f"no index found for coordinate {key!r}")
    146 elif key not in obj.dims:
    147     raise KeyError(
    148         f"{key!r} is not a valid dimension or coordinate for "
    149         f"{obj.__class__.__name__} with dimensions {obj.dims!r}"
    150     )

KeyError: "no index found for coordinate 'lat2'"

After explicitly setting the index, it works as expected:

ds.set_xindex('lat2').sel(lat2=75)
# Output:
<xarray.Dataset> Size: 1MB
Dimensions:  (time: 2920, lon: 53)
Coordinates:
    lat      float32 4B 75.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 23kB 2013-01-01 ... 2014-12-31T18:00:00
    lat2     float32 4B 75.0
Data variables:
    air      (time, lon) float64 1MB ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

It's a bit annoying: I frequently attempt to select something, realize the coordinate doesn't have an index, add the .set_xindex call, and then try to remember to add each index at object creation. It feels like xarray isn't being as helpful as it could be.

Describe the solution you'd like

Could we instead set the xindex automatically when calling .sel?

Possibly we want to force the user to create this once, rather than paying the cost of creating a new index on each call? But OTOH it seems relatively cheap?

%timeit ds.assign_coords(lat2=ds.lat + 2).set_xindex('lat2')

349 µs ± 6.97 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(I guess we could update a cache in place, so that creating a new index from the cache would be very cheap. But that could also be a source of quite confusing behavior if our implementation is in any way wrong, or if people are sharing objects across threads, etc. In other words, the principle of "don't update in place" is useful.)
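
For illustration, here is a rough user-side sketch of the behavior being asked for. The helper name sel_with_auto_index is hypothetical (it is not part of xarray), and the small dataset stands in for the tutorial data. It assumes that only 1-D coordinates without an existing index should get a default index, and it relies on set_xindex returning a new object so the caller's object is never modified:

```python
import numpy as np
import xarray as xr


def sel_with_auto_index(obj, **indexers):
    """Hypothetical helper: behaves like ``obj.sel(**indexers)``, but first
    builds a default index for any 1-D coordinate that lacks one."""
    for name in indexers:
        is_unindexed_coord = name in obj.coords and name not in obj.xindexes
        if is_unindexed_coord and obj.coords[name].ndim == 1:
            # set_xindex returns a new object; the caller's object is untouched.
            obj = obj.set_xindex(name)
    return obj.sel(**indexers)


# Small stand-in dataset: "lat2" is a non-dimension coordinate with no index.
lat = np.array([75.0, 72.5, 70.0])
ds = xr.Dataset(
    {"air": (("time", "lat"), np.arange(6.0).reshape(2, 3))},
    coords={"lat": lat, "time": [0, 1], "lat2": ("lat", lat)},
)

result = sel_with_auto_index(ds, lat2=75.0)
```

Note that this version rebuilds the index on every call, which is the per-call cost discussed above.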

Describe alternatives you've considered

A set_xindex(...) option (i.e. literally passing an ellipsis, ...) that creates every index it can, which folks could call once after creating an object?
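
As a sketch of what that alternative might do, the hypothetical helper below (set_all_xindexes is an invented name, and set_xindex does not currently accept an ellipsis as far as this thread indicates) adds a default index to every 1-D coordinate that doesn't already have one. Treating "all the indexes that it can" as "all unindexed 1-D coordinates" is an assumption:

```python
import numpy as np
import xarray as xr


def set_all_xindexes(obj):
    """Hypothetical stand-in for ``obj.set_xindex(...)``: index every
    1-D coordinate that does not already have an index."""
    result = obj
    for name, coord in obj.coords.items():
        if coord.ndim == 1 and name not in result.xindexes:
            # Each call returns a new object, so ``obj`` is never mutated.
            result = result.set_xindex(name)
    return result


lat = np.array([75.0, 72.5, 70.0])
ds = xr.Dataset(
    {"air": (("time", "lat"), np.arange(6.0).reshape(2, 3))},
    coords={"lat": lat, "time": [0, 1], "lat2": ("lat", lat)},
)

indexed = set_all_xindexes(ds)
selected = indexed.sel(lat2=72.5)
```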

Additional context

No response

@headtr1ck
Collaborator

I seem to remember this coming up about a year ago, but I can't find the issue...

I think that this would be a great addition.

@shoyer
Member

shoyer commented Nov 1, 2024

👍 for automatically creating indexes when needed.

I would not modify the xarray object in place. Users can do this if they need the performance gains.
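
One way to read this: a user who needs the performance can already build the index once and keep the returned object. A small sketch with a stand-in dataset (the variable names are illustrative only):

```python
import numpy as np
import xarray as xr

lat = np.array([75.0, 72.5, 70.0])
ds = xr.Dataset(
    {"air": (("time", "lat"), np.arange(6.0).reshape(2, 3))},
    coords={"lat": lat, "time": [0, 1], "lat2": ("lat", lat)},
)

# Build the index once; `ds` itself is left unchanged.
indexed = ds.set_xindex("lat2")

# Repeated selections reuse the stored index instead of rebuilding it.
means = [float(indexed.sel(lat2=value)["air"].mean()) for value in lat]
```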

@max-sixty
Collaborator Author

One quick thought: should we add them when creating the object?

@headtr1ck
Collaborator

Might be related: #8028
