Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Generating boolean column vector with roaring bitmap. #2113

Open
liurenjie1024 opened this issue Jun 5, 2024 · 9 comments
Open

[FEA] Generating boolean column vector with roaring bitmap. #2113

liurenjie1024 opened this issue Jun 5, 2024 · 9 comments

Comments

@liurenjie1024
Copy link
Collaborator

Is your feature request related to a problem? Please describe.
Deletion vector is a widely used technique used for efficiently filtering data in columnar storage.

Describe the solution you'd like

We need to implement a gpu kernel to generate a boolean vector efficiently, following java code demo the logic in cpu:

public static boolean[] filter(Roaring64Bitmap bitmap, int start, int end) {
  int len = end - start;
  boolean[] ret = new boolean[len];
  for(int i=0; i<len; i++) {
    ret[i] = bitmap.contains(start + i);
  }

  return ret;
}

Describe alternatives you've considered

Currently we use roaring bitmap due to it's efficient for compression and bitmap operations. We can use other gpu optimized bitmaps instead if someone could find one which is more gpu friendly.

@liurenjie1024
Copy link
Collaborator Author

This is related to NVIDIA/spark-rapids#10904

@ttnghia
Copy link
Collaborator

ttnghia commented Jun 6, 2024

I don't think we have any GPU implementation for roaring bitmap yet.

@pmixer
Copy link

pmixer commented Jul 5, 2024

I have the experimental code porting parts of RoaringBitmap to CUDA, not a perfect fit for the need(only support int32 keys instead of int64 ones, the later is more challenging) and not open-sourced yet(still some perf optimization work in progress).

Just want to join in to continue the discussion.

Assuming we have it already, like in below:

public static boolean[] filter(CuRoaring64Bitmap gpu_bmp, int start, int end) {
  boolean[] ret = new boolean[len];
  gpu_bmp.contains_range(start, end, ret); // better to be a batched function call for efficiency
  return ret;
}

there're still some questions -

is 64 bits a hard requirement or 32 bits would also be fine sometimes(by remapping etc.)?

where would be gpu_bmp created and how would it be maintained?

any idea?

@liurenjie1024
Copy link
Collaborator Author

gpu_bmp will be created by other cpu algorithm, and we can pass it to gpu with serialized format.

@pmixer
Copy link

pmixer commented Jul 5, 2024

gpu_bmp will be created by other cpu algorithm, and we can pass it to gpu with serialized format.

if it just serves as a on gpu bitmap based filter, seems sth worth to try by minimal engineering efforts.

pls let me check whether https://github.com/NVIDIA/cuCollections could provide the required functionality or not in next few days.

@liurenjie1024
Copy link
Collaborator Author

if it just serves as a on gpu bitmap based filter, seems sth worth to try by minimal engineering efforts.

Currently we use roaring bitmap because it's highly efficient for compression and OR operation, which is quite important for previouse operation.

@pmixer
Copy link

pmixer commented Jul 5, 2024

if it just serves as a on gpu bitmap based filter, seems sth worth to try by minimal engineering efforts.

Currently we use roaring bitmap because it's highly efficient for compression and OR operation, which is quite important for previouse operation.

RoaringBitmap is popular and highly efficient, well, it's 64bits version seems could be further improved according to my limited knowledge https://mp.weixin.qq.com/s/tJQoNRZ5UDJ_IASZLlhB4Q.

Anyway, it does not matter that much as only a subset of roaring features required for this use case.

@liurenjie1024 could u pls elaborate on expected set size and indices distribution(I mean the raw integers feed into the set), and to be filtered range length(end - start) for the target application?

@mattahrens @ttnghia @winningsix @firestarman @sameerz FYI

@liurenjie1024
Copy link
Collaborator Author

Hi, @pmixer

could u pls elaborate on expected set size and indices distribution(I mean the raw integers feed into the set), and to be filtered range length(end - start) for the target application?

The expected set size depends on actual customer's file size. Simple bitset maybe not practical since 8 million rows requires 1M size, and that's only for one file. We may have thousands of files.

@pmixer
Copy link

pmixer commented Jul 17, 2024

Well, thousands of files means few GBs mem needed, may be a waste but possible to have a try before further optimization.

Just FYI, currently there are https://github.com/NVIDIA/cuCollections/tree/dev/include/cuco/detail/trie/dynamic_bitset in cuCollection and https://nvidia.github.io/cccl/thrust/api/function_group__set__operations_1gacf5edd558fd96eee01b6425b3909c6b4.html in Thrust open-sourced for on-gpu set operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants