Query: What is the difference between Dask and Modin? #515

jithurjacob · 2019-03-28T10:41:04Z

It may sound silly, but a lot of us are not able to understand the difference between the difference in what is being offered by Dask and Modin. Could you please write a post on this or answer it here for the community?

devin-petersohn · 2019-03-28T18:39:28Z

Hi @jithurjacob, thanks for the question! I have heard this a lot, so it is not silly at all!

I will give a high level summary of the main points, we are planning to discuss this more in future blog posts. The TL;DR is that Modin's API is identical to pandas, whereas Dask's is not.

Note: The projects are fundamentally different in their aims, so a fair comparison is challenging.

Also, I will assume that you are asking about Dask DataFrame, rather than Dask as a whole. Dask's compute engine is more appropriately compared to Ray, which this project uses. See a comparison of Ray vs Dask here.

API

Dask DataFrame

Dask DataFrame does not scale the entire pandas API, and it isn't trying to. See this explained in their documentation here: http://docs.dask.org/en/latest/dataframe.html#common-uses-and-anti-uses

Dask DataFrames API is also different from the pandas API in that it is lazy and needs .compute() to materialize the DataFrame. This makes the API less convenient but allows to do certain query optimizations/rearrangement, which can give speedups in certain situations. We are planning to incorporate similar capabilities into Modin but hope we can do so without having to change the API. We will outline plans for speeding up Modin in an upcoming blog post.

Modin

Modin attempts to parallelize as much of the pandas API as is possible. We have worked through a significant portion of the DataFrame API. It is intended to be used as a drop-in replacement for pandas, such that even if the API is not yet parallelized, it is still defaulting to pandas.

Architecture

Dask DataFrame

Dask DataFrame has row-based partitioning, similar to Spark. This can be seen in their documentation: http://docs.dask.org/en/latest/dataframe.html#design. They also have a custom index object for indexing into the object, which is not pandas compatible. Dask DataFrame seems to treat operations on the DataFrame as MapReduce operations, which is a good paradigm for the subset of the pandas API they have chosen to implement.

Modin

Modin is more of a column-store, which we inherited from modern database systems. We laterally partition the columns for scalability (many systems, such as Google BigTable already did this), so we can scale in both directions and have finer grained partitioning. This is explained at a high level in Modin's documentation: https://modin.readthedocs.io/en/latest/developer/architecture.html. Because we have this finer grained control over the partitioning, we can support a number of operations that are very challenging in MapReduce systems (e.g. transpose, median, quantile).

Modin aims

In the long-term, Modin is planned to become a DataFrame library that supports the popular APIs (SQL, pandas, etc.) and runs on a variety of compute engines and backends. In fact, a group was able to contribute a dask.delayed backend to Modin already in <200 lines of code (PR).

edit: Add tl;dr above

jithurjacob · 2019-03-29T06:04:15Z

Thank you so much 😃 It would be really helpful to others if you could put this info upfront.

devin-petersohn added the question ❓ Questions about Modin label Mar 28, 2019

jithurjacob closed this as completed Mar 29, 2019

devin-petersohn pinned this issue Sep 11, 2019

devin-petersohn mentioned this issue Apr 24, 2020

Questions for a conference talk #1390

Closed

abdulelahsm mentioned this issue Nov 14, 2020

[DOCS] Update ReadMe.md with Modin vs Dask DataFrame explanation #2433

Closed

dchigarev unpinned this issue Mar 28, 2023

dchigarev pinned this issue Mar 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query: What is the difference between Dask and Modin? #515

Query: What is the difference between Dask and Modin? #515

jithurjacob commented Mar 28, 2019

devin-petersohn commented Mar 28, 2019 •

edited

Loading

jithurjacob commented Mar 29, 2019

Query: What is the difference between Dask and Modin? #515

Query: What is the difference between Dask and Modin? #515

Comments

jithurjacob commented Mar 28, 2019

devin-petersohn commented Mar 28, 2019 • edited Loading

API

Dask DataFrame

Modin

Architecture

Dask DataFrame

Modin

Modin aims

jithurjacob commented Mar 29, 2019

devin-petersohn commented Mar 28, 2019 •

edited

Loading