BUG: outofmemory read from big file and dump to a new one #7363

Open
3 tasks done
wanghaisheng opened this issue Aug 7, 2024 · 1 comment
Labels
bug 🦗 Something isn't working Triage 🩹 Issues that need triage

Comments

@wanghaisheng

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd
import os
inputfilepath = "top-domains-1m-in.csv"

# Ray reads this threshold from the environment; the value must be a string
os.environ["RAY_memory_usage_threshold"] = '0.9'

# Read the ~2 GB CSV file
df = pd.read_csv(inputfilepath, encoding="ISO-8859-1")

Issue Description

My file is almost 2 GB. When I try to set os.environ["RAY_memory_usage_threshold"] = 0.9, it fails with an error saying that a float is not supported.
After applying a filter, dumping the result with to_csv gives me a MemoryError.
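
For reference, the float error comes from Python rather than from Modin or Ray: os.environ only accepts string values. A minimal sketch of the failing and the working assignment follows. That the variable should be set before Ray starts is an assumption here, and float_error is just an illustrative name:

```python
import os

# os.environ only accepts str values; assigning a float raises TypeError.
float_error = None
try:
    os.environ["RAY_memory_usage_threshold"] = 0.9
except TypeError as exc:
    float_error = exc

# Pass the threshold as a string instead (assumed: before Ray is started).
os.environ["RAY_memory_usage_threshold"] = "0.9"
```

This matches the quoted '0.9' used in the reproducible example above.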

Expected Behavior

It should use at most 90% of my laptop's memory.

Error Logs

2024-08-07 13:07:41,906	INFO worker.py:1781 -- Started a local Ray instance.
UserWarning: `read_*` implementation has mismatches with pandas:
Data types of partitions are different! Please refer to the troubleshooting section of the Modin documentation to fix this issue.
UserWarning: <function Series.tolist> is not currently supported by PandasOnRay, defaulting to pandas implementation.
Please refer to https://modin.readthedocs.io/en/stable/supported_apis/defaulting_to_pandas.html for explanation.
(_remote_exec_multi_chain pid=20536) 
(_remote_exec_multi_chain pid=20536) Traceback (most recent call last):
(_remote_exec_multi_chain pid=20536)   File "d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\ray\_private\serialization.py", line 423, in deserialize_objects
(_remote_exec_multi_chain pid=20536)     obj = self._deserialize_object(data, metadata, object_ref)
(_remote_exec_multi_chain pid=20536)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_remote_exec_multi_chain pid=20536)   File "d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\ray\_private\serialization.py", line 280, in _deserialize_object
(_remote_exec_multi_chain pid=20536)     return self._deserialize_msgpack_data(data, metadata_fields)
(_remote_exec_multi_chain pid=20536)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_remote_exec_multi_chain pid=20536)   File "d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\ray\_private\serialization.py", line 235, in _deserialize_msgpack_data
(_remote_exec_multi_chain pid=20536)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(_remote_exec_multi_chain pid=20536)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(_remote_exec_multi_chain pid=20536)   File "d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\ray\_private\serialization.py", line 225, in _deserialize_pickle5_data
(_remote_exec_multi_chain pid=20536)     obj = pickle.loads(in_band)
(_remote_exec_multi_chain pid=20536)           ^^^^^^^^^^^^^^^^^^^^^
(_remote_exec_multi_chain pid=20536) MemoryError
(_remote_exec_multi_chain pid=6340) 
(_remote_exec_multi_chain pid=6340)     obj = pickle.loads(in_band, buffers=buffers)
(_remote_exec_multi_chain pid=6340)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
---------------------------------------------------------------------------
RayTaskError(RaySystemError)              Traceback (most recent call last)
Cell In[1], [line 41](vscode-notebook-cell:?execution_count=1&line=41)
     [33](vscode-notebook-cell:?execution_count=1&line=33) # filtered_df = df[df['indexdate'] != 'unk']
     [34](vscode-notebook-cell:?execution_count=1&line=34) # filtered_df = df[df['indexdate'].str.contains('month', case=False, na=False)]
     [35](vscode-notebook-cell:?execution_count=1&line=35) # filtered_df = df[df['indexdate'].str.contains('1 year', case=False, na=False)]
   (...)
     [38](vscode-notebook-cell:?execution_count=1&line=38) # filtered_df = df[df['indexdate'].str.contains('2 years', case=False, na=False)]
     [39](vscode-notebook-cell:?execution_count=1&line=39) # filtered_df = df[df['domain'].str.contains('ai', case=False, na=False)]
     [40](vscode-notebook-cell:?execution_count=1&line=40) filtered_df = df[df['Intheirownwords'].str.contains(' ai ', case=False, na=False)]
---> [41](vscode-notebook-cell:?execution_count=1&line=41) filtered_df.to_csv('domain-ai-in-title.csv')
     [43](vscode-notebook-cell:?execution_count=1&line=43) filtered_df = filtered_df[filtered_df['domain'].isin(rankdomains)]
     [44](vscode-notebook-cell:?execution_count=1&line=44) filtered_df.to_csv('top-4m-domain-ai-in-title.csv')

File d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\modin\logging\logger_decorator.py:144, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    [129](file:///D:/Download/audio-visual/a_ideas/.venv/Lib/site-packages/modin/logging/logger_decorator.py:129) """
    [130](file:///D:/Download/audio-visual/a_ideas/.venv/Lib/site-packages/modin/logging/logger_decorator.py:130) Compute function with logging if Modin logging is enabled.
    [131](file:///D:/Download/audio-visual/a_ideas/.venv/Lib/site-packages/modin/logging/logger_decorator.py:131) 
   (...)
    [141](file:///D:/Download/audio-visual/a_ideas/.venv/Lib/site-packages/modin/logging/logger_decorator.py:141) Any
    [142](file:///D:/Download/audio-visual/a_ideas/.venv/Lib/site-packages/modin/logging/logger_decorator.py:142) """
    [143](file:///D:/Download/audio-visual/a_ideas/.venv/Lib/site-packages/modin/logging/logger_decorator.py:143) if LogMode.get() == "disable":
--> [144](file:///D:/Download/audio-visual/a_ideas/.venv/Lib/site-packages/modin/logging/logger_decorator.py:144)     return obj(*args, **kwargs)
    [146](file:///D:/Download/audio-visual/a_ideas/.venv/Lib/site-packages/modin/logging/logger_decorator.py:146) logger = get_logger()
    [147](file:///D:/Download/audio-visual/a_ideas/.venv/Lib/site-packages/modin/logging/logger_decorator.py:147) logger.log(log_level, start_line)
...
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Download\audio-visual\a_ideas\.venv\Lib\site-packages\ray\_private\serialization.py", line 225, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
          ^^^^^^^^^^^^^^^^^^^^^
MemoryError
Output is truncated. View as a [scrollable element](command:cellOutput.enableScrolling?3eee492b-abf0-439d-872b-e3378420424f) or open in a [text editor](command:workbench.action.openLargeOutput?3eee492b-abf0-439d-872b-e3378420424f). Adjust cell output [settings](command:workbench.action.openSettings?%5B%22%40tag%3AnotebookOutputLayout%22%5D)...

Installed Versions

INSTALLED VERSIONS

commit : c8bbca8
python : 3.12.3.final.0
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : AMD64 Family 23 Model 96 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Chinese (Simplified)_China.936

Modin dependencies

modin : 0.31.0
ray : 2.34.0
dask : 2024.7.1
distributed : None

pandas dependencies

...
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...

@wanghaisheng wanghaisheng added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Aug 7, 2024
@Retribution98
Collaborator

Hi @wanghaisheng

Sorry, I'm not sure I understood you correctly.
Modin is not responsible for the Ray parameter RAY_memory_usage_threshold.

Your reproducer seems to be correct. Please contact Ray for more information on this.

I would also suggest using a Modin configuration variable to limit the memory used:

import modin.config as cfg

cfg.Memory.put(2 * 2**30)

or

import os

os.environ["MODIN_MEMORY"] = "2147483648"
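
Both forms above set the same limit, since 2 * 2**30 bytes is 2147483648. If you want a different size, here is a small sketch for deriving the string value. gib_to_bytes is a hypothetical helper, not part of Modin, and setting the variable before importing modin.pandas is an assumption made to be on the safe side:

```python
import os


def gib_to_bytes(gib: float) -> int:
    """Hypothetical helper: convert a size in GiB to a whole number of bytes."""
    return int(gib * 2**30)


limit = gib_to_bytes(2)

# Environment variables must be strings, so convert the byte count.
os.environ["MODIN_MEMORY"] = str(limit)
```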
