Numba support #2

MainRo · 2020-06-21T13:49:25Z

From belm0:

Performance wise, I consider pypy as the only option today on a real application.

I'm not sure what this means. Pypy has limitations (it is not a 1:1 replacement for CPython), and there are legacy applications which cannot transition to pypy easily, or which are heavily dependent on numpy. For such applications, a numpy + numba implementation is useful. Having such an implementation does not preclude having a pure Python implementation which supports Pypy. They can exist along side each other.

I do not want to use numpy because on a streaming application it will never allow CPython to close the gap with pypy, an numpy just breaks pypy's jit. However I am interested in numba support,

A numba-only implementation will not perform, because numba does not support fast mode with Python arrays. The only way to get performance on this algorithm with numba is via numpy arrays.

I adapted the distogram bench to streamhist to compare the update function. On CPython distogram is 25% faster, and on pypy distogram is 13 times faster.

I measured my pure Python implementation (no numa or numpy) vs. distogram. It is 20% faster (and less code, but I didn't compare closely). The implementation uses a "maintain cost function array" approach just as distogram does. So distogram appears to have some room for improvement.

My numba+numpy implementation is 20x faster than streamhist (with 64 bins).

MainRo · 2020-06-21T14:02:23Z

To be honest I did not try numba yet but only read the doc and viewed some talks on it. My understanding is that its jit can unroll loops to simd instructions. So I expect this to work with plain python types, but I may be wrong. I this is correct I think that numba will be interesting also with pypy, hence my interest in it.
Concerning your implementation, I did not find it in your streamhist fork. Did you publish it somewhere ?
For numpy, it do not want its support not clutter the code. btw I am interested in why numpy speeds up your implementation while it slows down mine even on CPython (maybe because I did not try to optimize for it by explicitly using numpy functions).

belm0 · 2020-06-22T13:30:04Z

After spending many weeks on this last year, I'm fairly confident saying 1) numba does not operate in its fast mode with python arrays, and 2) numpy by itself will not be fast because this isn't a vectorized algorithm (the data point additions are incremental). Notably, the numpy.insert/delete performance is terrible.

I haven't published my approximate histogram code yet. It isn't derived from streamhist or anything else. I guess I'll finally get around to it since nothing else is filling my needs.

As a quick example, here is timing of stdlib bisect_left vs. your hand-coded one (4x slower) vs. njit of your hand-coded one (not faster).

import numba
import bisect

def _bisect_left(a, x):
    lo = 0
    hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid] < x: lo = mid+1
        else: hi = mid
    return lo

_bisect_left_njit = numba.njit(_bisect_left)

a = list(range(1000))
na = numba.typed.List(a)

%timeit bisect.bisect_left(a, 15)
%timeit _bisect_left(a, 15)
%timeit _bisect_left_njit(na, 15)

381 ns ± 45.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.53 µs ± 69.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.88 µs ± 52.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

belm0 · 2020-06-22T14:24:07Z

shows clearly that numba only works well with numpy arrays:

import numba
import numpy as np

def argmin_diff(a):
    i, i_m = 1, -1
    m = inf
    last_item = a[0]
    for item in a[1:]:
        d = item - last_item
        if d < m:
            m = d
            i_m = i-1
        last_item = item
        i += 1
    return i_m, m

argmin_diff_jit = njit(argmin_diff)

a = list(range(1000))
na = numba.typed.List(a)
npa = np.array(a)
%timeit argmin_diff(a)
%timeit argmin_diff_jit(na)
%timeit argmin_diff_jit(npa)

89.5 µs ± 557 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
65.3 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.68 µs ± 37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

MainRo · 2020-06-22T21:31:02Z

As a quick example, here is timing of stdlib bisect_left vs. your hand-coded one (4x slower)

There is something wrong here, because the bisect code in distogram is a copy of the stdlib. So it is exactly the same code:
https://github.com/python/cpython/blob/master/Lib/bisect.py#L50
Performances should be the same.

belm0 · 2020-06-23T04:27:37Z

There is something wrong here, because the bisect code in distogram is a copy of the stdlib

read below that stdlib function definition:

# Overwrite above definitions with a fast C implementation
try:
    from _bisect import *
except ImportError:
    pass

(if tuple values are used, there is no reason to re-implement stdlib bisect in the first place)

MainRo · 2020-06-24T08:06:13Z

ok thanks, I did not see that previously.
I copied the bisect function so that I can work on bins directly. Otherwise I have to convert the bins to a value-only list. I guess that to be able to use it, i have to split the bin in two value/count lists. But this means two inserts/removal instead of one in some branches of trim. I will check this.

belm0 · 2020-06-24T11:13:20Z

Otherwise I have to convert the bins to a value-only list

In the implementations I've seen, (point, count) tuple works fine and gives the most performance opportunities in pure Python. To update an item, simply write a new tuple.

belm0 · 2020-06-30T13:58:47Z

My ApproximateHistogram in Python using only stdlib is now contained within a performance timer project I've published:

https://github.com/belm0/perf-timer/blob/063ef693702a9a8dc227d7be14b89c096c643873/src/perf_timer/_histogram.py

belm0 · 2020-07-23T07:10:30Z

So using stdlib bisect / tuples made a huge difference, or the old numbers were mistaken?

Interpreter   Operation   Numpy         Req/s
============  ==========  =======  ==========
- CPython 3.7   update      no            65763
- CPython 3.7   update      yes           39277
+ CPython 3.7   update      no           436709
+ CPython 3.7   update      yes          251603

MainRo · 2020-07-23T20:22:21Z

Yes, for CPython this is a huge bump.

belm0 · 2020-07-24T04:10:45Z

It looks like the laptop numbers need to be updated?

On a i7-9800X Intel CPU, performances are:

============  ==========  =======  ==========
Interpreter   Operation   Numpy         Req/s
============  ==========  =======  ==========
CPython 3.7   update      no           436709
CPython 3.7   update      yes          251603
============  ==========  =======  ==========

On a modest 2014 13" macbook pro, performances are:

============  ==========  =======  ==========
Interpreter   Operation   Numpy         Req/s
============  ==========  =======  ==========
CPython 3.7   update      no           107345
CPython 3.7   update      yes           71843
============  ==========  =======  ==========

belm0 · 2021-03-04T22:08:03Z

Yes, for CPython this is a huge bump.

@MainRo would you consider adding me to the README credits as I helped make this package about 10x faster and produce exact histograms when under capacity?

MainRo · 2021-03-06T22:34:41Z

sure, I just updated the credits section of the README.

MainRo mentioned this issue Jun 21, 2020

suggestions #1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Numba support #2

Numba support #2

MainRo commented Jun 21, 2020

MainRo commented Jun 21, 2020

belm0 commented Jun 22, 2020 •

edited

Loading

belm0 commented Jun 22, 2020

MainRo commented Jun 22, 2020

belm0 commented Jun 23, 2020 •

edited

Loading

MainRo commented Jun 24, 2020

belm0 commented Jun 24, 2020

belm0 commented Jun 30, 2020

belm0 commented Jul 23, 2020 •

edited

Loading

MainRo commented Jul 23, 2020

belm0 commented Jul 24, 2020

belm0 commented Mar 4, 2021

MainRo commented Mar 6, 2021 •

edited

Loading

Numba support #2

Numba support #2

Comments

MainRo commented Jun 21, 2020

MainRo commented Jun 21, 2020

belm0 commented Jun 22, 2020 • edited Loading

belm0 commented Jun 22, 2020

MainRo commented Jun 22, 2020

belm0 commented Jun 23, 2020 • edited Loading

MainRo commented Jun 24, 2020

belm0 commented Jun 24, 2020

belm0 commented Jun 30, 2020

belm0 commented Jul 23, 2020 • edited Loading

MainRo commented Jul 23, 2020

belm0 commented Jul 24, 2020

belm0 commented Mar 4, 2021

MainRo commented Mar 6, 2021 • edited Loading

belm0 commented Jun 22, 2020 •

edited

Loading

belm0 commented Jun 23, 2020 •

edited

Loading

belm0 commented Jul 23, 2020 •

edited

Loading

MainRo commented Mar 6, 2021 •

edited

Loading