Skip to content

Modin 0.16.0

Compare
Choose a tag to compare
@mvashishtha mvashishtha released this 05 Oct 19:50
· 874 commits to master since this release
621bc10

This release includes support for pandas 1.5, support for the latest version of dask, and backwards compatibility with python 3.6 and pandas 1.1. Additionally, it includes many performance enhancements, bug fixes, and documentation improvements.

Key Features and Updates

  • Stability and Bugfixes
    • FIX-#4570: Replace np.bool -> np.bool_ (#4571)
    • FIX-#4543: Fix read_csv in case skiprows=<0, []> (#4544)
    • FIX-#4059: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
    • FIX-#4589: Pin protobuf<4.0.0 to fix ray (#4590)
    • FIX-#4577: Set attribute of Modin dataframe to updated value (#4588)
    • FIX-#4411: Fix binary_op between datetime64 Series and pandas timedelta (#4592)
    • FIX-#4604: Fix groupby + agg in case when multicolumn can arise (#4642)
    • FIX-#4582: Inherit custom log layer (#4583)
    • FIX-#4639: Fix storage_options usage for read_csv and read_csv_glob (#4644)
    • FIX-#4593: Ensure Modin warns when setting columns via attributes (#4621)
    • FIX-#4584: Enable pdb debug when running cloud tests (#4585)
    • FIX-#4564: Workaround import issues in Ray: auto-import pandas on python start if env var is set (#4603)
    • FIX-#4641: Reindex pandas partitions in df.describe() (#4651)
    • FIX-#2064: Fix iloc/loc assignment when dataframe is empty (#4677)
    • FIX-#4634: Check for FrozenList as by in df.groupby() (#4667)
    • FIX-#4680: Fix read_csv that started defaulting to pandas again in case of reading from a buffer and when a buffer has a non-zero starting position (#4681)
    • FIX-#4491: Wait for all partitions in parallel in benchmark mode (#4656)
    • FIX-#4358: MultiIndex loc shouldn't drop levels for full-key lookups (#4608)
    • FIX-#4658: Expand exception handling for read_* functions from s3 storages (#4659)
    • FIX-#4672: Fix incorrect warning when setting frame.index or frame.columns (#4721)
    • FIX-#4686: Propagate metadata and drain call queue in unwrap_partitions (#4697)
    • FIX-#4652: Support categorical data in from_dataframe (#4737)
    • FIX-#4756: Correctly propagate storage_options in read_parquet (#4764)
    • FIX-#4657: Use fsspec for handling s3/http-like paths instead of s3fs (#4710)
    • FIX-#4676: drain sub-virtual-partition call queues (#4695)
    • FIX-#4782: Exclude certain non-parquet files in read_parquet (#4783)
    • FIX-#4808: Set dtypes correctly after column rename (#4809)
    • FIX-#4811: Apply dataframe -> not_dataframe functions to virtual partitions (#4812)
    • FIX-#4099: Use mangled column names but keep the original when building frames from arrow (#4767)
    • FIX-#4838: Bump up modin-spreadsheet to latest master (#4839)
    • FIX-#4840: Change modin-spreadsheet version for notebook requirements (#4841)
    • FIX-#4835: Handle Pathlike paths in read_parquet (#4837)
    • FIX-#4872: Stop checking the private ray mac memory limit (#4873)
    • FIX-#4914: base_lengths should be computed from base_frame instead of self in copartition (#4915)
    • FIX-#4848: Fix rebalancing partitions when NPartitions == 1 (#4874)
    • FIX-#4927: Fix dtypes computation in dataframe.filter (#4928)
    • FIX-#4907: Implement radd for Series and DataFrame (#4908)
    • FIX-#4945: Fix _take_2d_positional that loses indexes due to filtering empty dataframes (#4951)
    • FIX-#4818, PERF-#4825: Fix where by using the new n-ary operator (#4820)
    • FIX-#3983: FIX-#4107: Materialize 'rowid' columns when selecting rows by position (#4834)
    • FIX-#4845: Fix KeyError from __getitem_bool for single row dataframes (#4845)
    • FIX-#4734: Handle Series.apply when return type is a DataFrame (#4830)
    • FIX-#4983: Set frac to None in _sample when n=0 (#4984)
    • FIX-#4993: Return _default_to_pandas in df.attrs (#4995)
    • FIX-#5043: Fix execute function in ASV utils failed if len(partitions) == 0 (#5044)
    • FIX-#4597: Refactor Partition handling of func, args, kwargs (#4715)
    • FIX-#4996: Evaluate BenchmarkMode at each function call (#4997)
    • FIX-#4022: Fixed empty data frame with index (#4910)
    • FIX-#4090: Fixed check if the index is trivial (#4936)
    • FIX-#4966: Fix to_timedelta to return Series instead of TimedeltaIndex (#5028)
    • FIX-#5042: Fix series getitem with invalid strings (#5048)
    • FIX-#4691: Fix binary operations between virtual partitions (#5049)
    • FIX-#5045: Fix ray virtual_partition.wait with duplicate object refs (#5058)
  • Performance enhancements
    • PERF-#4182: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
    • PERF-#4288: Improve perf of groupby.mean for narrow data (#4591)
    • PERF-#4772: Remove df.copy call from from_pandas since it is not needed for Ray and Dask (#4781)
    • PERF-#4325: Improve perf of multi-column assignment in __setitem__ when no new column names are assigning (#4455)
    • PERF-#3844: Improve perf of drop operation (#4694)
    • PERF-#4727: Improve perf of concat operation (#4728)
    • PERF-#4705: Improve perf of arithmetic operations between Series objects with shared .index (#4689)
    • PERF-#4703: Improve performance in accessing ser.cat.categories, ser.cat.ordered, and ser.__array_priority__ (#4704)
    • PERF-#4305: Parallelize read_parquet over row groups (#4700)
    • PERF-#4773: Compute lengths and widths in put method of Dask partition like Ray do (#4780)
    • PERF-#4732: Avoid overwriting already-evaluated PandasOnRayDataframePartition._length_cache and PandasOnRayDataframePartition._width_cache (#4754)
    • PERF-#4862: Don't call compute_sliced_len.remote when row_labels/col_labels == slice(None) (#4863)
    • PERF-#4713: Stop overriding the ray MacOS object store size limit (#4792)
    • PERF-#4944: Avoid default_to_pandas in Series.cat.codes, Series.dt.tz, and Series.dt.to_pytimedelta (#4833)
    • PERF-#4851: Compute dtypes for binary operations that can only return bool type and the right operand is not a Modin object (#4852)
    • PERF-#4842: copy should not trigger any previous computations (#4843)
    • PERF-#4849: Compute dtypes in concat also for ROW_WISE case when possible (#4850)
    • PERF-#4929: Compute dtype when using Series.dt accessor (#4930)
    • PERF-#4892: Compute lengths in rebalance_partitions when possible (#4893)
    • PERF-#4794: Compute caches in _propagate_index_objs (#4888)
    • PERF-#4860: PandasDataframeAxisPartition.deploy_axis_func should be serialized only once (#4861)
    • PERF-#4890: PandasDataframeAxisPartition.drain should be serialized only once (#4891)
    • PERF-#4870: Avoid index materialization in __getattribute__ and __getitem__ (4911)
    • PERF-#4886: Use lazy index and columns evaluation in query method (#4887)
    • PERF-#4866: iloc function that used in partition.mask should be serialized only once (#4901)
    • PERF-#4920: Avoid index and cache computations in take_2d_labels_or_positional unless they are needed (#4921)
    • PERF-#4999: don't call apply in virtual partition' drain_call_queue if call_queue is empty (#4975)
    • PERF-#4268: Implement partition-parallel getitem for bool Series masks (#4753)
    • PERF-#5017: reset_index shouldn't trigger index materialization if possible (#5018)
    • PERF-#4963: Use partition width/length methods instead of _compute_axis_labels_and_lengths if index is already known (#4964)
    • PERF-#4940: Optimize categorical dtype check in concatenate (#4953)
  • Benchmarking enhancements
    • TEST-#5066: Add outer join case for TimeConcat benchmark (#5067)
    • TEST-#5083: Add merge op with categorical data (#5084)
    • FEAT-#4706: Add Modin ClassLogger to PandasDataframePartitionManager (#4707)
    • TEST-#5014: Simplify adding new ASV benchmarks (#5015)
    • TEST-#5064: Update TimeConcat benchmark with new parameter ignore_index (#5065)
    • TEST-#5068: Add binary op benchmark for Series (#5069)
  • Refactor Codebase
    • REFACTOR-#4530: Standardize access to physical data in partitions (#4563)
    • REFACTOR-#4534: Replace logging meta class with class decorator (#4535)
    • REFACTOR-#4708: Delete combine dtypes (#4709)
    • REFACTOR-#4629: Add type annotations to modin/config (#4685)
    • REFACTOR-#4717: Improve PartitionMgr.get_indices() usage (#4718)
    • REFACTOR-#4730: make Indexer immutable (#4731)
    • REFACTOR-#4774: remove _build_treereduce_func call from _compute_dtypes (#4775)
    • REFACTOR-#4750: Delete BaseDataframeAxisPartition.shuffle (#4751)
    • REFACTOR-#4722: Stop suppressing undefined name lint (#4723)
    • REFACTOR-#4832: unify split_result_of_axis_func_pandas (#4831)
    • REFACTOR-#4796: Introduce constant for reduced column name (#4799)
    • REFACTOR-#4000: Remove code duplication for PandasOnRayDataframePartitionManager (#4895)
    • REFACTOR-#3780: Remove code duplication for PandasOnDaskDataframe (#3781)
    • REFACTOR-#4530: Unify access to physical data for any partition type (#4829)
    • REFACTOR-#4978: Align modin/core/execution/dask/common/__init__.py with modin/core/execution/ray/common/__init__.py (#4979)
    • REFACTOR-#4949: Remove code duplication in default2pandas/dataframe.py and default2pandas/any.py (#4950)
    • REFACTOR-#4976: Rename RayTask to RayWrapper in accordance with Dask (#4977)
    • REFACTOR-#4885: De-duplicated take_2d_labels_or_positional methods (#4883)
    • REFACTOR-#5005: Use finalize method instead of list comprehension + drain_call_queue (#5006)
    • REFACTOR-#5001: Remove jenkins stuff (#5002)
    • REFACTOR-#5026: Change exception names to simplify grepping (#5027)
    • REFACTOR-#4970: Rewrite base implementations of a partition' width/length (#4971)
    • REFACTOR-#4942: Remove call method in favor of register due to duplication (4943)
    • REFACTOR-#4922: Helpers for take_2d_labels_or_positional (#4865)
    • REFACTOR-#5024: Make _row_lengths and _column_widths public (#5025)
    • REFACTOR-#5009: Use RayWrapper.materialize instead of ray.get (#5010)
    • REFACTOR-#4755: Rewrite Pandas version mismatch warning (#4965)
    • REFACTOR-#5012: Add mypy checks for singleton files in base modin directory (#5013)
    • REFACTOR-#5038: Remove unnecessary _method argument from resamplers (#5039)
    • REFACTOR-#5081: Remove c323f7fe385011ed849300155de07645.db file (#5082)
  • Pandas API implementations and improvements
    • FEAT-#4670: Implement convert_dtypes by mapping across partitions (#4671)
  • OmniSci enhancements
    • FEAT-#4913: Enabling pyhdk
  • Update testing suite
    • TEST-#4508: Reduce test_partition_api pytest threads to deflake it (#4551)
    • TEST-#4550: Use much less data in test_partition_api (#4554)
    • TEST-#4610: Remove explicit installation of black/flake8 for omnisci ci-notebooks (#4609)
    • TEST-#2564: Add caching and use mamba for conda setups in GH (#4607)
    • TEST-#4557: Delete multiindex sorts instead of xfailing (#4559)
    • TEST-#4698: Stop passing invalid storage_options param (#4699)
    • TEST-#4745: Pin flake8 to <5 to workaround installation conflict (#4752)
    • TEST-#4875: XFail tests failing due to file gone missing (#4876)
    • TEST-#4879: Use pandas ensure_clean() in place of io_tests_data (#4881)
    • TEST-#4562: Use local Ray cluster in CI to resolve flaky test-compat-win (#5007)
    • TEST-#5040: Rework test_series using eval_general() (#5041)
    • TEST-#5050: Add black to pre-commit hook (#5051)
  • Documentation improvements
    • DOCS-#4552: Change default sphinx language to en to fix sphinx >= 5.0.0 build (#4553)
    • DOCS-#4628: Add to_parquet partial support notes (#4648)
    • DOCS-#4668: Set light theme for readthedocs page, remove theme switcher (#4669)
    • DOCS-#4748: Apply the Triage label to new issues (#4749)
    • DOCS-#4790: Give all templates issue type and triage labels (#4791)
    • DOCS-#4521: Document how to benchmark modin (#5020)
  • Dependencies
    • FEAT-#4598: Add support for pandas 1.4.3 (#4599)
    • FEAT-#4619: Integrate mypy static type checking (#4620)
    • FEAT-#4202: Allow dask past 2022.2.0 (#4769)
    • FEAT-#4925: Upgrade pandas to 1.4.4 (#4926)
    • TEST-#4998: Add flake8 plugins to dev requirements (#5000)
  • New Features
    • FEAT-4463: Add experimental fuzzydata integration for testing against a randomized dataframe workflow (#4556)
    • FEAT-#4419: Extend virtual partitioning API to pandas on Dask (#4420)
    • FEAT-#4147: Add partial compatibility with Python 3.6 and pandas 1.1 (#4301)
    • FEAT-#4569: Add error message when read_ function defaults to pandas (#4647)
    • FEAT-#4725: Make index and columns lazy in Modin DataFrame (#4726)
    • FEAT-#4664: Finalize compatibility support for Python 3.6 (#4800)
    • FEAT-#4746: Sync interchange protocol with recent API changes (#4763)
    • FEAT-#4733: Support fastparquet as engine for read_parquet (#4807)
    • FEAT-#4766: Support fsspec URLs in read_csv and read_csv_glob (#4898)
    • FEAT-#4827: Implement infer_types dataframe algebra operator (#4871)
    • FEAT-#4989: Switch pandas version to 1.5 (#5037)

Contributors

@mvashishtha
@NickCrews
@prutskov
@vnlitvinov
@pyrito
@suhailrehman
@RehanSD
@helmeleegy
@anmyachev
@d33bs
@noloerino
@devin-petersohn
@YarShev
@naren-ponder
@jbrockmendel
@ienkovich
@Garra1980
@Billy2551