Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cur.fetch() can be extremely slow #36

Open
BAM-BAM-BAM opened this issue Jan 26, 2015 · 9 comments
Open

cur.fetch() can be extremely slow #36

BAM-BAM-BAM opened this issue Jan 26, 2015 · 9 comments

Comments

@BAM-BAM-BAM
Copy link

I haven't quite figured out when this is the case, but some queries can cause cursor.fetch() to take several hours to complete even though just running the query from a hive CLI and piping to a file only take 60 seconds.

Is there anything that I can do to help you diagose this problem?

@BradRuderman
Copy link
Owner

how large its the data set you are retrieving? Is this the case for all queries or just a few?

@BAM-BAM-BAM
Copy link
Author

Hive 0.13.0.2.1.4.0-632

Correction: piping to a file from the CLI takes about 5-8 minutes, not 1 minute.

The data is stored in a managed table.

The returned data size is 4.7M rows * 13 columns, or about 306MB (Hive 'describe extended tablename' reports that the size of the table is 878MB).

All queries seem to run slowly using cursor.fetch().

My "feeling" is that something in the python code is inefficiently allocating & filling memory, or maybe just grabbing one or a few rows at a time.

When I watch the memory usage of the python process it climbs slowly but steadily.

@BAM-BAM-BAM
Copy link
Author

Using pyhs2 cursor::fetch() on our cluster, reading a 66 column table (a single 10-character date column, an integer column, and 64 float columns):
fetched 3000 rows in 10.7 secs
fetched 30000 rows in 104.2 secs

So it’s reading about 20,000 elements a second (30000 rows x 66 columns ~= 2,000,000 elements).
A crappy HD reads at around 50MB/s on average. If each element was 2,500 bytes, then it should take this long to read. But each element is a float (let’s just say it’s double-precision, or 8 bytes). So the read speed is 300x slower than it should be.

From this, I’d guess that 300,000 rows takes 17 minutes and 3,000,000 rows takes about three hours

I'm running cProfile now...

@BAM-BAM-BAM
Copy link
Author

cProfile stats:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
11753141/11753119   85.004    0.000  279.991    0.000 TTransport.py:54(readAll)
 11753118   63.297    0.000  162.265    0.000 thrift_sasl.py:148(read)
      643   59.518    0.093   59.518    0.093 {method 'recv' of '_socket.socket' objects}
  1980000   49.948    0.000  434.618    0.000 ttypes.py:1773(read)
35261749/35261748   49.587    0.000   49.587    0.000 {len}
  7248226   40.663    0.000  189.522    0.000 TBinaryProtocol.py:195(readByte)
  7218080   34.901    0.000  307.126    0.000 TBinaryProtocol.py:154(readFieldBegin)
    30000   28.844    0.001  478.409    0.016 ttypes.py:1894(read)
 11753128   22.575    0.000   22.575    0.000 {method 'read' of 'cStringIO.StringI' objects}
 11722981   20.213    0.000   20.213    0.000 {_struct.unpack}
  3227507   17.853    0.000   83.466    0.000 TBinaryProtocol.py:200(readI16)
   930000   15.269    0.000  121.194    0.000 ttypes.py:1320(read)
  1020000   11.462    0.000   72.414    0.000 ttypes.py:1440(read)
        4    7.290    1.822  496.383  124.096 cursor.py:219(_fetch)
  1980000    7.114    0.000    7.114    0.000 ttypes.py:1764(__init__)
  4020468    6.236    0.000    6.236    0.000 {method 'append' of 'list' objects}
  3990573    6.000    0.000    6.000    0.000 TBinaryProtocol.py:148(readStructBegin)
  3990573    5.984    0.000    5.984    0.000 TBinaryProtocol.py:151(readStructEnd)
  1980000    5.471    0.000    5.471    0.000 cursor.py:24(get_value)
   932268    5.276    0.000   83.881    0.000 TBinaryProtocol.py:205(readI32)
  3227507    4.872    0.000    4.872    0.000 TBinaryProtocol.py:161(readFieldEnd)
        4    2.353    0.588  498.736  124.684 cursor.py:71(fetchSet)
  1020000    1.922    0.000    1.922    0.000 ttypes.py:1437(__init__)
   930000    1.769    0.000    1.769    0.000 ttypes.py:1317(__init__)
   314964    1.737    0.000    8.287    0.000 TBinaryProtocol.py:215(readDouble)
    30000    0.500    0.000    4.830    0.000 ttypes.py:1500(read)
       66    0.403    0.006    0.403    0.006 {pandas.lib.maybe_convert_objects}
        4    0.311    0.078  478.833  119.708 ttypes.py:1971(read)
    30138    0.187    0.000    1.717    0.000 TBinaryProtocol.py:173(readListBegin)
    30149    0.183    0.000    1.545    0.000 TBinaryProtocol.py:220(readString)
       33    0.104    0.003    0.104    0.003 {pandas.lib.is_possible_datetimelike_array}
    30000    0.061    0.000    0.061    0.000 ttypes.py:1497(__init__)
    30000    0.060    0.000    0.060    0.000 ttypes.py:1891(__init__)
        1    0.049    0.049    0.049    0.049 {pandas.lib.to_object_array}
    30138    0.047    0.000    0.047    0.000 TBinaryProtocol.py:178(readListEnd)

@zklopman-rovi
Copy link

answer is simple: don't use fetch(). use one of the Python standard PEP 249 calls: fetchall() (not a good idea for a large data set), fetchone() (for going one row at a time) or fetchmany() (return a block of records at a time). just using fetch() might (and probably does) overwhelms your system and should not be generally used.

@anshanno
Copy link

@zklopman-rovi what option would you recommend works best for a large data set? Sorry, I am a little late to the party.

@zklopman
Copy link
Collaborator

that depends on your needs, and what you consider "large". In general, you
should not overwhelm the memory. Assuming a typical 4GB available RAM, I
would day that around a million normal sized (not very wide) records can
still be handled in fetchall(). If you have more than that, bring them over
in blocks (10,000 or 100,000 records) using fetchmany(). If you get blocks,
you could dispatch them to separate threads to be worked upon (if
feasible). If your number of records is way more than that (billions or
more) you should not be exporting them to a lonely python program to be
processed - harness the power of Hive (or other Hadoop technologies) for
your work and trim it down. Exact details are, of course, dependent on the
application and environment.

hope this helps,

Zachi Klopman

On Wed, May 25, 2016 at 1:55 PM, anshanno [email protected] wrote:

@zklopman-rovi https://github.com/zklopman-rovi what option would you
recommend works best for a large data set? Sorry, I am a little late to the
party.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#36 (comment)

@anshanno
Copy link

@zklopman well, I have about 170 million rows and 200 columns so by big data standards, its not that big. It seems however that it is far too large to pull in a reasonable amount of time with this module though. It takes around 45 seconds per 10k block. Fetching seems to be the bottleneck of the module.

@zklopman
Copy link
Collaborator

Hi,

170,000,000 x 200 x 5 bytes/col (assumption - could be much bigger) ~ 170GB
data, (and we do not even consider protocol overhead here). That is usually
way too much to handle on a little laptop or small server, that usually
don't have more than 16GB RAM. For this size, you should do your
calculations on the cluster or use other means to transfer your data.

As for the interface speed:

10,000x200x5 ~ 10MB in 45s seems slow (about 1/2 MB/s). I would expect it
to be at least 10x faster with normal hardware and network speeds. Of
course, if the typical width of the columns is not 5 but 50 bytes, that
will make a huge difference. There are also other issues of network
performance tuning, but those vary wildly between locations.

hope this helps,

Zachi

On Thu, May 26, 2016 at 9:12 AM, anshanno [email protected] wrote:

@zklopman https://github.com/zklopman well, I have about 170 million
rows and 200 columns so by big data standards, its not that big. It seems
however that it is far too large to pull in a reasonable amount of time
with this module though. It takes around 45 seconds per 10k block. Fetching
seems to be the bottleneck of the module.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#36 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants