Binary search #2147

Durman · 2023-01-20T15:38:04Z

Durman
Jan 20, 2023

In short, I'm looking for something like searchsorted in NumPy.

The problem.
I'm trying to implement some operations on polylines. I have two lines:

They are represented by an array like this: [[p1, p2, p3, p4], [p1, p2, p3]]

I also have a series of numbers that represent the positions of some points on the lines, like this: [[1.5, 4.0], [2.0]].

What comes to mind is a binary search with Python loops, but that might be too slow. Is there another solution?

Answered by agoose77


agoose77 · 2023-01-20T16:28:59Z

agoose77
Jan 20, 2023
Collaborator

Hi @Durman!

Awkward Array does not yet have a direct function to compute positions within ragged arrays, which is what you appear to be looking for. However, we do have tools that let you build this yourself.

There are two ways of achieving this that I can think of. One is to use Numba, and the other is to flatten, search over the flattened list, and rebuild. I'll show both, but I would recommend the use of Numba for readability.

1. Numba

The simplest approach that I can think of is to use np.searchsorted in Numba, and then employ ArrayBuilder to re-build the ragged result:

import numba as nb
import awkward as ak
import numpy as np

haystack = ak.Array([[0.0, 1.86, 2.93, 3.62, 4.9], [0.0, 0.83, 1.52, 3.28, 4.08]])
needles = ak.Array([[1.5, 4.0], [2.0]])


@nb.njit
def searchsorted_2d(builder, array, needle):
    for array_i, needle_i in zip(array, needle):
        result_i = np.searchsorted(np.asarray(array_i), np.asarray(needle_i))
        builder.begin_list()
        for x in result_i:
            builder.integer(x)
        builder.end_list()
    return builder


index = searchsorted_2d(ak.ArrayBuilder(), haystack, needles).snapshot()

However, we know exactly how big our result is, so it is wasteful to use ak.ArrayBuilder, which is slower and uses more memory as it does not know up-front how much data (or what kinds of data) it will be working with. We can make this faster by creating a flat NumPy array to contain the result, and unflattening it after we are done:

import numba as nb
import awkward as ak
import numpy as np

haystack = ak.Array([[0.0, 1.86, 2.93, 3.62, 4.9], [0.0, 0.83, 1.52, 3.28, 4.08]])
needles = ak.Array([[1.5, 4.0], [2.0]])


@nb.njit
def searchsorted_2d(flat_result, array, needle):
    j = 0
    for array_i, needle_i in zip(array, needle):
        result_i = np.searchsorted(np.asarray(array_i), np.asarray(needle_i))
        # Copy this sublist's result into its slice of the flat output
        k = j + len(result_i)
        flat_result[j:k] = result_i
        j = k


num = ak.num(needles)

result_length = ak.sum(num)
flat_result = np.zeros(result_length, dtype=np.int64)

# Fill the result
searchsorted_2d(flat_result, haystack, needles)

index = ak.unflatten(flat_result, num)
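
To check either Numba version, it can help to have a slow but obviously correct reference. The following is a minimal sketch using only the Python standard library; the name `expected` is illustrative and not part of the original answer. It relies on `bisect_left` returning the same insertion point as `np.searchsorted` with its default `side="left"`.

```python
from bisect import bisect_left

# Same example data as above, as plain Python lists
haystack = [[0.0, 1.86, 2.93, 3.62, 4.9], [0.0, 0.83, 1.52, 3.28, 4.08]]
needles = [[1.5, 4.0], [2.0]]

# Search each needle in the sublist of the haystack that it belongs to
expected = [
    [bisect_left(hay_i, x) for x in needle_i]
    for hay_i, needle_i in zip(haystack, needles)
]
# expected == [[1, 4], [3]]
```

Comparing `ak.to_list(index)` against `expected` should then confirm that the jitted function agrees with the reference.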

2. Flattened search

In this example, I'm assuming that you have a two dimensional array, and that none of the sublists are empty. If those assumptions aren't true, then we would need to update this example.

The manual way of doing this is to flatten both the "needle" and the "haystack" into one-dimensional arrays, and keep track of the "list boundaries" that separate elements from different sublists. We can then adjust these two arrays so that the haystack is monotonically increasing, i.e. for every input there is exactly one output (or, there are multiple outputs but they're all adjacent on the number line):

# From
[0, 1.86, 2.93, 3.62, 4.9, [boundary], 0.0, 0.83, 1.52, 3.28, 4.08]
# To
[0, 1.86, 2.93, 3.62, 4.9, [boundary], 4.9, 5.73, 6.42, 8.18, 8.98]

To adjust the haystack, we can take the end values and find their cumulative sum. By assuming that each sublist in haystack is monotonically increasing, we can iteratively shift each sublist by the previous list's end value, e.g.

offset = 0
for sublist in lists:
    sublist += offset
    offset = sublist[-1]
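
A runnable version of the loop above, as a sketch on plain Python lists with the example values from this answer. The helper name `shift_sublists` is made up for illustration. It assumes that no sublist is empty and that every sublist starts at a non-negative value.

```python
def shift_sublists(lists):
    """Shift each sorted sublist by the shifted end value of the previous one."""
    shifted = []
    offset = 0.0
    for sublist in lists:
        moved = [x + offset for x in sublist]
        shifted.append(moved)
        offset = moved[-1]  # assumes the sublist is not empty
    return shifted


haystack = [[0.0, 1.86, 2.93, 3.62, 4.9], [0.0, 0.83, 1.52, 3.28, 4.08]]
shifted = shift_sublists(haystack)

# After shifting, the flattened values never decrease,
# so a single binary search over all of them is valid
flat = [x for sublist in shifted for x in sublist]
is_sorted = all(a <= b for a, b in zip(flat, flat[1:]))
```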

In NumPy arrays, we can do this using cumsum:

import awkward as ak
import numpy as np

haystack = ak.Array([[0.0, 1.86, 2.93, 3.62, 4.9], [0.0, 0.83, 1.52, 3.28, 4.08]])
needles = ak.Array([[1.5, 4.0], [2.0]])

The cumulative sum gives us the value_starts by which we shift both the needles and the haystack, so that we can then flatten them:

# To flatten this list so that it is monotonically increasing,
# we assume each sublist is monotonically increasing, and cumulatively
# shift each sublist by the final values of the sublists before it
value_starts = np.zeros(len(haystack))
value_starts[1:] = np.cumsum(haystack[:-1, -1])

When we've computed the result, it is going to be an array of indices into the flat search result. We'll need to adjust these indices to account for the fact that they came from smaller sublists. We can do this by building an array of the cumulative sum of sublist lengths: [0, length_1, length_1+length_2, length_1+length_2+length_3, ...].

# What are the start positions of each sublist in the flattened list?
starts = np.zeros(len(haystack), dtype=np.int64)
starts[1:] = np.cumsum(ak.num(haystack))[:-1]

Now we actually perform the search over the adjusted, flattened array:

# Compute search over flattened, adjusted array
flat_index = np.searchsorted(
    ak.flatten(haystack + value_starts),
    ak.flatten(needles + value_starts),
)

Then we unflatten this result into sublists:

# Restore the sublist structure of the needles
unadjusted_index = ak.unflatten(flat_index, ak.num(needles))

And finally convert the flat positions back into positions within each sublist:

# Adjust the indices back
index = unadjusted_index - starts
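
The steps of the flattened search can also be checked end to end without Awkward Array. This is only a sketch: it stores each ragged array as flat values plus per-list counts (a representation chosen here, not taken from the answer), and uses `np.repeat` in place of Awkward's broadcasting. It assumes that every sublist is sorted, non-empty and starts at a non-negative value, and that each needle lies within the range of its own sublist.

```python
import numpy as np

# Ragged arrays stored as flat values plus per-list counts
hay_flat = np.array([0.0, 1.86, 2.93, 3.62, 4.9, 0.0, 0.83, 1.52, 3.28, 4.08])
hay_counts = np.array([5, 5])
needle_flat = np.array([1.5, 4.0, 2.0])
needle_counts = np.array([2, 1])

# Start position of each sublist in the flattened haystack
starts = np.zeros(len(hay_counts), dtype=np.int64)
starts[1:] = np.cumsum(hay_counts)[:-1]

# Last value of each sublist, then the cumulative shift for each sublist
ends = hay_flat[starts + hay_counts - 1]
value_starts = np.zeros(len(hay_counts))
value_starts[1:] = np.cumsum(ends[:-1])

# Apply each sublist's shift to all of its elements, then search once
flat_index = np.searchsorted(
    hay_flat + np.repeat(value_starts, hay_counts),
    needle_flat + np.repeat(value_starts, needle_counts),
)

# Convert positions in the flat haystack into positions within each sublist
local_index = flat_index - np.repeat(starts, needle_counts)

# Reference: search each sublist separately
hay_lists = np.split(hay_flat, starts[1:])
needle_lists = np.split(needle_flat, np.cumsum(needle_counts)[:-1])
expected = np.concatenate(
    [np.searchsorted(h, n) for h, n in zip(hay_lists, needle_lists)]
)
```

For the example data, `local_index` and `expected` are both `[1, 4, 3]`. One edge case to be aware of: a needle exactly equal to the first value of a later sublist coincides with the shifted end of the previous sublist, so with the default `side="left"` it can come out one position too low.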

5 replies

Durman Jan 23, 2023
Author

Thanks. I also thought about the second approach, but it seemed that it could be too slow if there are a lot of small polylines, because in that case each value is searched for across all of them.

I will try to implement the first suggestion. )

agoose77 Jan 23, 2023
Collaborator

The time complexity of the search should be k*M*log(N) in all solutions here; where k is constant differing between Numba and Python solutions, M is the number of "needles" and N the number of items in the "haystack". For the case where you have an equal number of points and sublists, I get this performance trend:

The cause of this below-n-squared behaviour is the use of a binary search in np.searchsorted. Increasing the number of items to search through (i.e. the main difference between the Numba implementations and the non-jitted version) by searching over the entire haystack for each item only changes the performance by some constant shift: log(cN) = log(c) + log(N).

Note that I've already sorted the lists here!
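
To illustrate the log(cN) = log(c) + log(N) argument, here is a small sketch that counts the iterations of a hand-written binary search. It is not how np.searchsorted is implemented, and the function name `binary_search_steps` is made up for this example. Searching a haystack 100 times larger should add only about log2(100) ≈ 7 iterations per needle.

```python
def binary_search_steps(n, target):
    """Count loop iterations of a left-biased binary search over range(n)."""
    lo, hi, steps = 0, n, 0
    while lo < hi:
        mid = (lo + hi) // 2
        if mid < target:
            lo = mid + 1
        else:
            hi = mid
        steps += 1
    return steps


# Iterations needed for one needle in a small and in a 100x larger haystack
small = binary_search_steps(1_000, 0)
large = binary_search_steps(100_000, 0)
extra = large - small  # grows like log2(100), not like a factor of 100
```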

Durman Jan 23, 2023
Author

Interesting! I would never have guessed that the two solutions have the same complexity. I will start with the second one then, because it seems simpler.

agoose77 Jan 23, 2023
Collaborator

Whatever you prefer! Note that the Numba solution is still faster on the wall time: it's a constant factor faster. You could probably speed it up even further, but this might already be sufficient for your use case.

Durman Jan 25, 2023
Author

I'm now at the proof-of-concept stage. Reaching maximum performance will be a separate step.)
