Modin 0.16.0
This release includes support for pandas 1.5, support for the latest version of dask, and backwards compatibility with python 3.6 and pandas 1.1. Additionally, it includes many performance enhancements, bug fixes, and documentation improvements.
Key Features and Updates
- Stability and Bugfixes
- FIX-#4570: Replace
np.bool
->np.bool_
(#4571) - FIX-#4543: Fix
read_csv
in case skiprows=<0, []> (#4544) - FIX-#4059: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
- FIX-#4589: Pin protobuf<4.0.0 to fix ray (#4590)
- FIX-#4577: Set attribute of Modin dataframe to updated value (#4588)
- FIX-#4411: Fix binary_op between datetime64 Series and pandas timedelta (#4592)
- FIX-#4604: Fix
groupby
+agg
in case when multicolumn can arise (#4642) - FIX-#4582: Inherit custom log layer (#4583)
- FIX-#4639: Fix
storage_options
usage forread_csv
andread_csv_glob
(#4644) - FIX-#4593: Ensure Modin warns when setting columns via attributes (#4621)
- FIX-#4584: Enable pdb debug when running cloud tests (#4585)
- FIX-#4564: Workaround import issues in Ray: auto-import pandas on python start if env var is set (#4603)
- FIX-#4641: Reindex pandas partitions in
df.describe()
(#4651) - FIX-#2064: Fix
iloc
/loc
assignment when dataframe is empty (#4677) - FIX-#4634: Check for FrozenList as
by
indf.groupby()
(#4667) - FIX-#4680: Fix
read_csv
that started defaulting to pandas again in case of reading from a buffer and when a buffer has a non-zero starting position (#4681) - FIX-#4491: Wait for all partitions in parallel in benchmark mode (#4656)
- FIX-#4358: MultiIndex
loc
shouldn't drop levels for full-key lookups (#4608) - FIX-#4658: Expand exception handling for
read_*
functions from s3 storages (#4659) - FIX-#4672: Fix incorrect warning when setting
frame.index
orframe.columns
(#4721) - FIX-#4686: Propagate metadata and drain call queue in unwrap_partitions (#4697)
- FIX-#4652: Support categorical data in
from_dataframe
(#4737) - FIX-#4756: Correctly propagate
storage_options
inread_parquet
(#4764) - FIX-#4657: Use
fsspec
for handling s3/http-like paths instead ofs3fs
(#4710) - FIX-#4676: drain sub-virtual-partition call queues (#4695)
- FIX-#4782: Exclude certain non-parquet files in
read_parquet
(#4783) - FIX-#4808: Set dtypes correctly after column rename (#4809)
- FIX-#4811: Apply dataframe -> not_dataframe functions to virtual partitions (#4812)
- FIX-#4099: Use mangled column names but keep the original when building frames from arrow (#4767)
- FIX-#4838: Bump up modin-spreadsheet to latest master (#4839)
- FIX-#4840: Change modin-spreadsheet version for notebook requirements (#4841)
- FIX-#4835: Handle Pathlike paths in
read_parquet
(#4837) - FIX-#4872: Stop checking the private ray mac memory limit (#4873)
- FIX-#4914:
base_lengths
should be computed frombase_frame
instead ofself
incopartition
(#4915) - FIX-#4848: Fix rebalancing partitions when NPartitions == 1 (#4874)
- FIX-#4927: Fix
dtypes
computation indataframe.filter
(#4928) - FIX-#4907: Implement
radd
for Series and DataFrame (#4908) - FIX-#4945: Fix
_take_2d_positional
that loses indexes due to filtering empty dataframes (#4951) - FIX-#4818, PERF-#4825: Fix where by using the new n-ary operator (#4820)
- FIX-#3983: FIX-#4107: Materialize 'rowid' columns when selecting rows by position (#4834)
- FIX-#4845: Fix KeyError from
__getitem_bool
for single row dataframes (#4845) - FIX-#4734: Handle Series.apply when return type is a DataFrame (#4830)
- FIX-#4983: Set
frac
toNone
in _sample whenn=0
(#4984) - FIX-#4993: Return
_default_to_pandas
indf.attrs
(#4995) - FIX-#5043: Fix
execute
function in ASV utils failed iflen(partitions) == 0
(#5044) - FIX-#4597: Refactor Partition handling of func, args, kwargs (#4715)
- FIX-#4996: Evaluate BenchmarkMode at each function call (#4997)
- FIX-#4022: Fixed empty data frame with index (#4910)
- FIX-#4090: Fixed check if the index is trivial (#4936)
- FIX-#4966: Fix
to_timedelta
to return Series instead of TimedeltaIndex (#5028) - FIX-#5042: Fix series getitem with invalid strings (#5048)
- FIX-#4691: Fix binary operations between virtual partitions (#5049)
- FIX-#5045: Fix ray virtual_partition.wait with duplicate object refs (#5058)
- FIX-#4570: Replace
- Performance enhancements
- PERF-#4182: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
- PERF-#4288: Improve perf of
groupby.mean
for narrow data (#4591) - PERF-#4772: Remove
df.copy
call fromfrom_pandas
since it is not needed for Ray and Dask (#4781) - PERF-#4325: Improve perf of multi-column assignment in
__setitem__
when no new column names are assigning (#4455) - PERF-#3844: Improve perf of
drop
operation (#4694) - PERF-#4727: Improve perf of
concat
operation (#4728) - PERF-#4705: Improve perf of arithmetic operations between
Series
objects with shared.index
(#4689) - PERF-#4703: Improve performance in accessing
ser.cat.categories
,ser.cat.ordered
, andser.__array_priority__
(#4704) - PERF-#4305: Parallelize
read_parquet
over row groups (#4700) - PERF-#4773: Compute
lengths
andwidths
input
method of Dask partition like Ray do (#4780) - PERF-#4732: Avoid overwriting already-evaluated
PandasOnRayDataframePartition._length_cache
andPandasOnRayDataframePartition._width_cache
(#4754) - PERF-#4862: Don't call
compute_sliced_len.remote
whenrow_labels/col_labels == slice(None)
(#4863) - PERF-#4713: Stop overriding the ray MacOS object store size limit (#4792)
- PERF-#4944: Avoid default_to_pandas in
Series.cat.codes
,Series.dt.tz
, andSeries.dt.to_pytimedelta
(#4833) - PERF-#4851: Compute
dtypes
for binary operations that can only return bool type and the right operand is not a Modin object (#4852) - PERF-#4842:
copy
should not trigger any previous computations (#4843) - PERF-#4849: Compute
dtypes
inconcat
also for ROW_WISE case when possible (#4850) - PERF-#4929: Compute
dtype
when usingSeries.dt
accessor (#4930) - PERF-#4892: Compute
lengths
inrebalance_partitions
when possible (#4893) - PERF-#4794: Compute caches in
_propagate_index_objs
(#4888) - PERF-#4860:
PandasDataframeAxisPartition.deploy_axis_func
should be serialized only once (#4861) - PERF-#4890:
PandasDataframeAxisPartition.drain
should be serialized only once (#4891) - PERF-#4870: Avoid index materialization in
__getattribute__
and__getitem__
(4911) - PERF-#4886: Use lazy index and columns evaluation in
query
method (#4887) - PERF-#4866:
iloc
function that used inpartition.mask
should be serialized only once (#4901) - PERF-#4920: Avoid index and cache computations in
take_2d_labels_or_positional
unless they are needed (#4921) - PERF-#4999: don't call
apply
in virtual partition'drain_call_queue
ifcall_queue
is empty (#4975) - PERF-#4268: Implement partition-parallel getitem for bool Series masks (#4753)
- PERF-#5017:
reset_index
shouldn't trigger index materialization if possible (#5018) - PERF-#4963: Use partition
width/length
methods instead of_compute_axis_labels_and_lengths
if index is already known (#4964) - PERF-#4940: Optimize categorical dtype check in
concatenate
(#4953)
- Benchmarking enhancements
- TEST-#5066: Add outer join case for
TimeConcat
benchmark (#5067) - TEST-#5083: Add
merge
op with categorical data (#5084) - FEAT-#4706: Add Modin ClassLogger to PandasDataframePartitionManager (#4707)
- TEST-#5014: Simplify adding new ASV benchmarks (#5015)
- TEST-#5064: Update
TimeConcat
benchmark with new parameterignore_index
(#5065) - TEST-#5068: Add binary op benchmark for Series (#5069)
- TEST-#5066: Add outer join case for
- Refactor Codebase
- REFACTOR-#4530: Standardize access to physical data in partitions (#4563)
- REFACTOR-#4534: Replace logging meta class with class decorator (#4535)
- REFACTOR-#4708: Delete combine dtypes (#4709)
- REFACTOR-#4629: Add type annotations to modin/config (#4685)
- REFACTOR-#4717: Improve PartitionMgr.get_indices() usage (#4718)
- REFACTOR-#4730: make Indexer immutable (#4731)
- REFACTOR-#4774: remove
_build_treereduce_func
call from_compute_dtypes
(#4775) - REFACTOR-#4750: Delete BaseDataframeAxisPartition.shuffle (#4751)
- REFACTOR-#4722: Stop suppressing undefined name lint (#4723)
- REFACTOR-#4832: unify
split_result_of_axis_func_pandas
(#4831) - REFACTOR-#4796: Introduce constant for reduced column name (#4799)
- REFACTOR-#4000: Remove code duplication for
PandasOnRayDataframePartitionManager
(#4895) - REFACTOR-#3780: Remove code duplication for
PandasOnDaskDataframe
(#3781) - REFACTOR-#4530: Unify access to physical data for any partition type (#4829)
- REFACTOR-#4978: Align
modin/core/execution/dask/common/__init__.py
withmodin/core/execution/ray/common/__init__.py
(#4979) - REFACTOR-#4949: Remove code duplication in
default2pandas/dataframe.py
anddefault2pandas/any.py
(#4950) - REFACTOR-#4976: Rename
RayTask
toRayWrapper
in accordance with Dask (#4977) - REFACTOR-#4885: De-duplicated take_2d_labels_or_positional methods (#4883)
- REFACTOR-#5005: Use
finalize
method instead of list comprehension +drain_call_queue
(#5006) - REFACTOR-#5001: Remove
jenkins
stuff (#5002) - REFACTOR-#5026: Change exception names to simplify grepping (#5027)
- REFACTOR-#4970: Rewrite base implementations of a partition'
width/length
(#4971) - REFACTOR-#4942: Remove
call
method in favor ofregister
due to duplication (4943) - REFACTOR-#4922: Helpers for take_2d_labels_or_positional (#4865)
- REFACTOR-#5024: Make
_row_lengths
and_column_widths
public (#5025) - REFACTOR-#5009: Use
RayWrapper.materialize
instead ofray.get
(#5010) - REFACTOR-#4755: Rewrite Pandas version mismatch warning (#4965)
- REFACTOR-#5012: Add mypy checks for singleton files in base modin directory (#5013)
- REFACTOR-#5038: Remove unnecessary _method argument from resamplers (#5039)
- REFACTOR-#5081: Remove
c323f7fe385011ed849300155de07645.db
file (#5082)
- Pandas API implementations and improvements
- OmniSci enhancements
- FEAT-#4913: Enabling pyhdk
- Update testing suite
- TEST-#4508: Reduce test_partition_api pytest threads to deflake it (#4551)
- TEST-#4550: Use much less data in test_partition_api (#4554)
- TEST-#4610: Remove explicit installation of
black
/flake8
for omnisci ci-notebooks (#4609) - TEST-#2564: Add caching and use mamba for conda setups in GH (#4607)
- TEST-#4557: Delete multiindex sorts instead of xfailing (#4559)
- TEST-#4698: Stop passing invalid storage_options param (#4699)
- TEST-#4745: Pin flake8 to <5 to workaround installation conflict (#4752)
- TEST-#4875: XFail tests failing due to file gone missing (#4876)
- TEST-#4879: Use pandas
ensure_clean()
in place ofio_tests_data
(#4881) - TEST-#4562: Use local Ray cluster in CI to resolve flaky
test-compat-win
(#5007) - TEST-#5040: Rework test_series using eval_general() (#5041)
- TEST-#5050: Add black to pre-commit hook (#5051)
- Documentation improvements
- DOCS-#4552: Change default sphinx language to en to fix sphinx >= 5.0.0 build (#4553)
- DOCS-#4628: Add to_parquet partial support notes (#4648)
- DOCS-#4668: Set light theme for readthedocs page, remove theme switcher (#4669)
- DOCS-#4748: Apply the Triage label to new issues (#4749)
- DOCS-#4790: Give all templates issue type and triage labels (#4791)
- DOCS-#4521: Document how to benchmark modin (#5020)
- Dependencies
- New Features
- FEAT-4463: Add experimental fuzzydata integration for testing against a randomized dataframe workflow (#4556)
- FEAT-#4419: Extend virtual partitioning API to pandas on Dask (#4420)
- FEAT-#4147: Add partial compatibility with Python 3.6 and pandas 1.1 (#4301)
- FEAT-#4569: Add error message when
read_
function defaults to pandas (#4647) - FEAT-#4725: Make index and columns lazy in Modin DataFrame (#4726)
- FEAT-#4664: Finalize compatibility support for Python 3.6 (#4800)
- FEAT-#4746: Sync interchange protocol with recent API changes (#4763)
- FEAT-#4733: Support fastparquet as engine for
read_parquet
(#4807) - FEAT-#4766: Support fsspec URLs in
read_csv
andread_csv_glob
(#4898) - FEAT-#4827: Implement
infer_types
dataframe algebra operator (#4871) - FEAT-#4989: Switch pandas version to 1.5 (#5037)
Contributors
@mvashishtha
@NickCrews
@prutskov
@vnlitvinov
@pyrito
@suhailrehman
@RehanSD
@helmeleegy
@anmyachev
@d33bs
@noloerino
@devin-petersohn
@YarShev
@naren-ponder
@jbrockmendel
@ienkovich
@Garra1980
@Billy2551