cur.fetch() can be extremely slow #36
How large is the data set you are retrieving? Is this the case for all queries, or just a few?
Hive 0.13.0.2.1.4.0-632. Correction: piping to a file from the CLI takes about 5-8 minutes, not 1 minute. The data is stored in a managed table. The returned data set is 4.7M rows x 13 columns, or about 306 MB (Hive 'describe extended tablename' reports that the table is 878 MB). All queries seem to run slowly through cursor.fetch(). My "feeling" is that something in the Python code is inefficiently allocating and filling memory, or is grabbing only one or a few rows at a time. When I watch the memory usage of the Python process, it climbs slowly but steadily.
Using pyhs2 cursor::fetch() on our cluster, reading a 66-column table (a single 10-character date column, an integer column, and 64 float columns), it reads about 20,000 elements a second (30,000 rows x 66 columns ~= 2,000,000 elements). From this, I'd guess that 300,000 rows would take about 17 minutes and 3,000,000 rows about three hours. I'm running cProfile now...
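For reference, a quick back-of-envelope check of that extrapolation (the 30,000-row measurement and the ~20,000 elements/s rate come from the comment above; the rest is arithmetic):

```python
# Extrapolate fetch time from the observed element throughput.
cols = 66          # columns in the test table
rate = 20000.0     # observed throughput, elements per second
for rows in (30000, 300000, 3000000):
    seconds = rows * cols / rate
    print("{:>9d} rows -> ~{:.0f} s (~{:.1f} min)".format(rows, seconds, seconds / 60.0))
# 30,000 rows ~ 99 s; 300,000 rows ~ 990 s (~17 min); 3,000,000 rows ~ 9,900 s (~2.75 h)
```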
cProfile stats:
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
11753141/11753119  85.004  0.000  279.991  0.000  TTransport.py:54(readAll)
11753118  63.297  0.000  162.265  0.000  thrift_sasl.py:148(read)
643  59.518  0.093  59.518  0.093  {method 'recv' of '_socket.socket' objects}
1980000  49.948  0.000  434.618  0.000  ttypes.py:1773(read)
35261749/35261748  49.587  0.000  49.587  0.000  {len}
7248226  40.663  0.000  189.522  0.000  TBinaryProtocol.py:195(readByte)
7218080  34.901  0.000  307.126  0.000  TBinaryProtocol.py:154(readFieldBegin)
30000  28.844  0.001  478.409  0.016  ttypes.py:1894(read)
11753128  22.575  0.000  22.575  0.000  {method 'read' of 'cStringIO.StringI' objects}
11722981  20.213  0.000  20.213  0.000  {_struct.unpack}
3227507  17.853  0.000  83.466  0.000  TBinaryProtocol.py:200(readI16)
930000  15.269  0.000  121.194  0.000  ttypes.py:1320(read)
1020000  11.462  0.000  72.414  0.000  ttypes.py:1440(read)
4  7.290  1.822  496.383  124.096  cursor.py:219(_fetch)
1980000  7.114  0.000  7.114  0.000  ttypes.py:1764(__init__)
4020468  6.236  0.000  6.236  0.000  {method 'append' of 'list' objects}
3990573  6.000  0.000  6.000  0.000  TBinaryProtocol.py:148(readStructBegin)
3990573  5.984  0.000  5.984  0.000  TBinaryProtocol.py:151(readStructEnd)
1980000  5.471  0.000  5.471  0.000  cursor.py:24(get_value)
932268  5.276  0.000  83.881  0.000  TBinaryProtocol.py:205(readI32)
3227507  4.872  0.000  4.872  0.000  TBinaryProtocol.py:161(readFieldEnd)
4  2.353  0.588  498.736  124.684  cursor.py:71(fetchSet)
1020000  1.922  0.000  1.922  0.000  ttypes.py:1437(__init__)
930000  1.769  0.000  1.769  0.000  ttypes.py:1317(__init__)
314964  1.737  0.000  8.287  0.000  TBinaryProtocol.py:215(readDouble)
30000  0.500  0.000  4.830  0.000  ttypes.py:1500(read)
66  0.403  0.006  0.403  0.006  {pandas.lib.maybe_convert_objects}
4  0.311  0.078  478.833  119.708  ttypes.py:1971(read)
30138  0.187  0.000  1.717  0.000  TBinaryProtocol.py:173(readListBegin)
30149  0.183  0.000  1.545  0.000  TBinaryProtocol.py:220(readString)
33  0.104  0.003  0.104  0.003  {pandas.lib.is_possible_datetimelike_array}
30000  0.061  0.000  0.061  0.000  ttypes.py:1497(__init__)
30000  0.060  0.000  0.060  0.000  ttypes.py:1891(__init__)
1  0.049  0.049  0.049  0.049  {pandas.lib.to_object_array}
30138  0.047  0.000  0.047  0.000  TBinaryProtocol.py:178(readListEnd)
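For anyone who wants to reproduce a profile like this, here is a minimal sketch of how it could be collected with the standard-library cProfile and pstats modules; the host, credentials, table name, and query are placeholders for your own cluster, not values from this thread:

```python
import cProfile
import pstats
import pyhs2

# Placeholder connection details: substitute your own host, credentials, and query.
conn = pyhs2.connect(host='hiveserver2.example.com', port=10000,
                     authMechanism='PLAIN', user='hive', password='',
                     database='default')
cur = conn.cursor()
cur.execute('SELECT * FROM some_large_table')

profiler = cProfile.Profile()
profiler.enable()
rows = cur.fetch()            # the slow call being measured
profiler.disable()

# Print the hottest functions by total time, as in the listing above.
pstats.Stats(profiler).sort_stats('tottime').print_stats(35)

cur.close()
conn.close()
```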
The answer is simple: don't use fetch(). Use one of the standard PEP 249 calls instead: fetchall() (not a good idea for a large data set), fetchone() (for going one row at a time), or fetchmany() (returns a block of records at a time). Plain fetch() might (and probably does) overwhelm your system and should not generally be used.
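As an illustration, a minimal sketch of the block-wise approach suggested here, assuming a pyhs2 connection conn is already open; the query, the 10,000-row block size, and the process function are placeholders:

```python
# Fetch a large result set in blocks instead of all at once.
cur = conn.cursor()
cur.execute('SELECT * FROM some_large_table')

BLOCK_SIZE = 10000                       # tune to your memory budget
while True:
    block = cur.fetchmany(BLOCK_SIZE)    # PEP 249 style: up to BLOCK_SIZE rows
    if not block:                        # empty result means no rows left
        break
    process(block)                       # placeholder: write to disk, aggregate, etc.

cur.close()
```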
@zklopman-rovi which option would you recommend for a large data set? Sorry, I am a little late to the party.
That depends on your needs, and on what you consider "large". In general, you ... Hope this helps, Zachi Klopman
@zklopman Well, I have about 170 million rows and 200 columns, so by big data standards it's not that big. It seems, however, to be far too large to pull in a reasonable amount of time with this module. It takes around 45 seconds per 10k block. Fetching seems to be the bottleneck of the module.
Hi, 170,000,000 x 200 x 5 bytes/col (an assumption; it could be much bigger) comes to roughly 170 GB. As for the interface speed: 10,000 x 200 x 5 bytes, about 10 MB in 45 s, seems slow (roughly 0.2 MB/s). I would expect it ... Hope this helps, Zachi
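A short sketch spelling out that back-of-envelope estimate (the 5 bytes per column figure is the assumption stated above; the other numbers come from the thread):

```python
# Rough size and throughput estimate based on the numbers reported in this thread.
rows, cols, bytes_per_col = 170000000, 200, 5    # 5 bytes/col is an assumption
total_gb = rows * cols * bytes_per_col / 1e9
print("total data volume: ~{:.0f} GB".format(total_gb))                 # ~170 GB

block_bytes = 10000 * cols * bytes_per_col                              # one 10k-row block
print("observed rate: ~{:.2f} MB/s".format(block_bytes / 1e6 / 45.0))   # ~0.22 MB/s
```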
I haven't quite figured out when this is the case, but some queries can cause cursor.fetch() to take several hours to complete, even though just running the same query from a Hive CLI and piping the output to a file only takes about 60 seconds.
Is there anything that I can do to help you diagnose this problem?