Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

About memory alignment of container values #9

Open
Kerollmops opened this issue Jan 29, 2021 · 9 comments
Open

About memory alignment of container values #9

Kerollmops opened this issue Jan 29, 2021 · 9 comments

Comments

@Kerollmops
Copy link
Member

Kerollmops commented Jan 29, 2021

Hey @lemire,

I am currently in the process of greatly improving the deserialization algorithm of the roaring-rs library. The binary serialization format is quite good. It allows the Bitmap types to borrow a reference to the memory from which it deserializes.

However, there is a little problem, a memory alignment problem. In Rust (and other languages), it is undefined behavior to read a type that isn't well aligned in memory, e.g., reading a bitmap, aligned on 64bits, at address 33 (33 % 8 is 1, not 0, therefore unaligned).

I have two solutions to this problem in mind:

  • The easy one could be to backup on copying the bytes of a container into an aligned buffer (a Vec) if those are not valid. This is a little bit frustrating as it could happen a lot.
  • The alternative is more a question: Do you think that I could use the header offsets when serializing to force the containers to be correctly aligned, introducing some garbage bytes between containers? Or do I need to keep no space between contiguous containers?

Thank you!

@lemire
Copy link
Member

lemire commented Jan 29, 2021

However, there is a little problem, a memory alignment problem. In Rust (and other languages), it is undefined behavior to read a type that isn't well aligned in memory

Note that Roaring was first implemented in Java and Go, and later in C. Java and Go have, obviously, strictly memory safety features.

This being said, what you describe is also undefined behaviour in C/C++ in the sense pointer casting sense, so don't do that. In C++, what you do is that you use memcpy (which is safe). In Rust, I'd expect you to use from_le_bytes or the equivalent.

Can you elaborate as to why something like from_le_bytes won't do?

@Kerollmops
Copy link
Member Author

This being said, what you describe is also undefined behavior in C/C++ in the sense pointer casting sense, so don't do that. In C++, what you do is that you use memcpy (which is safe). In Rust, I'd expect you to use from_le_bytes or the equivalent.

Indeed the Rust implementation should maybe use memcpy (copy_from_slice) like in the C++ implementation but the current roaring-rs implementation deserializes from any type that implements io::Read. Copying the memory from a slice then changing the endianness could be way faster than reading one by one.

Can you elaborate as to why something like from_le_bytes won't do?

It is the current implementation, we are using the byteorder library to do that, but that's equivalent.
The real problem is that I am using roaring-rs in one of my projects and I found out that 51.5% of the work done to return a result is to deserialize the bitmaps from slices located in an LMDB database. Also, 12.6% of the time is spent on doing intersections between two bitmaps.

Here, I linked a zip with the SVG, you can search for deserialize_from and intersect_with.
flamegraph.svg.zip

I made some progress on an experiment I am doing with a BorrowedRoaringBitmap type where containers are deserialized only on frist access. In a workload where I deserialize from a slice and intersect multiple bitmaps, the gain is quite impressive; however, doing unions is slower as all containers must be loaded. I linked you many HTML files that you can open and check by yourself. I also need to use real roaring datasets to do correct measurements.

You can open the criterion/report/index.html file.
criterion.zip

@lemire
Copy link
Member

lemire commented Jan 30, 2021

@Kerollmops Roaring bitmaps have a simple format that is easy to deserialize quickly, but if you are constantly grabbing bytes, and copying them into Roaring bitmaps data structures, it is going to add up to something costly, so don't do that !

Assuming that the bitmaps you are deserializing are meant to be immutable, then you can just avoid making copies at all and just map the bitmap in place.

We do it in Go with FromBuffer:

https://github.com/RoaringBitmap/roaring/blob/4f9df8a961a30316c4ea0f9f3f85d310f971a3e3/roaring.go#L95

In Java, we have ImmutableRoaringBitmap instances that work that way (the data is not copied, it is just left in place and pointed to).

Note that both of these solutions (Go and Java) are used in production systems.

I don't think that the issues of value alignment are relevant here as demonstrated by the fact that we have deployed these solutions in both Go and Java which are certainly higher-level languages than Rust.

If we can do it in Go and Java, surely it can be done in Rust.

Of course, there are safety issues: since you are not copying the data, you must ensure that source data remains available for the lifetime of your roaring bitmap or else things will go badly. But it is not rocket science.

@lemire
Copy link
Member

lemire commented Jan 30, 2021

@lemire
Copy link
Member

lemire commented Jan 30, 2021

In C it is called a 'frozen view'. The C function can be called from Go (we have a package that does so).

@Kerollmops
Copy link
Member Author

Kerollmops commented Jan 30, 2021

Note that both of these solutions (Go and Java) are used in production systems.

I don't think that the issues of value alignment are relevant here as demonstrated by the fact that we have deployed these solutions in both Go and Java which are certainly higher-level languages than Rust.

In fact, I read a lot about memory alignment and the fact that it is a source of undefined behavior. In Rust, it is prohibited to read unaligned memory, this is why we use libraries to check and cast bytes for us safely, unfortunately, casts can fail.

In case the cast fails due to bad memory alignment, I can backup on allocating the container instead of keeping a view. Unfortunately, there is a great chance that the cast fails as the file format hasn't been thought for that use case and even if it had been, LMDB doesn't ensure that values are properly aligned. So I am kind of stuck with no real solution for now.

The only solution that I found is the one I described where containers are lazily deserialized, this technique can speed-up the workload I described previously. I will maybe try the new version of sanakirja that has been designed to align keys and values on specific boundaries combined with a serializing format that support the correct alignment of containers data.

If we can do it in Go and Java, surely it can be done in Rust.

Of course, there are safety issues: since you are not copying the data, you must ensure that source data remains available for the lifetime of your roaring bitmap or else things will go badly. But it is not rocket science.

Fortunately, in Rust, it is straightforward to ensure that the bitmap view will never outlive the buffer from which it comes from at compile-time.

Same functionality in C:

https://github.com/RoaringBitmap/CRoaring/blob/140e6a8bd92395d5392b02266b533e178a3e00d2/include/roaring/roaring.h#L521

I looked at the function, and indeed there is a lot of potential unaligned memory reads in it.

Thank you anyway for your time, research work, and help!

@lemire
Copy link
Member

lemire commented Jan 31, 2021

I looked at the function, and indeed there is a lot of potential unaligned memory reads in it.

How confident are you about this statement? The frozen serialization and deserialization in C are tested with sanitizers on. With sanitizers, undefined behaviour, including unaligned reads, are usually identified. It is entirely possible that we have bugs, either because the sanitizers are insufficient, or because our testing is not thorough enough... But I think it should be hard to identify such issues just by reading casually the functions. Maybe you ran some tests? Can you share your findings ? Be mindful that the frozen deserialization function has an alignment check and returns NULL when the proper alignment is not provided.

I read a lot about memory alignment and the fact that it is a source of undefined behavior.

Please refer to my earlier answer: "This being said, what you describe is also undefined behaviour in C/C++ in the sense pointer casting sense, so don't do that. In C++, what you do is that you use memcpy (which is safe). In Rust, I'd expect you to use from_le_bytes or the equivalent." I am quite confident that when calling u32::from_le_bytes, you do not need for the argument to be aligned on a 32-bit boundary. If I am wrong, can you please provide me with a reference or a way to test it out? The following is unsafe *(p as *const u16) unless you know something about p, so, as I wrote, don't do that.

You absolutely can read 4 consecutive bytes anywhere in memory as a 32-bit integer, safely. Except maybe for really high-level languages like JavaScript and Python, this should always be possible.

After that, it becomes a matter of software engineering. In Java, you can do...

ByteBuffer mybuffer = ... // any location
LongBuffer bitmapArray = mybuffer.asLongBuffer();

This is perfectly allowed irrespective of alignment.

In Go, you can do binary.LittleEndian.Uint64 and so forth.

In C, we were a bit limited, since C is such a simple language... and the only standard way to avoid the issue is a memcpy, and we did not want to write memcpy's everywhere or macros, we introduced some padding to alleviate the issue. But even then, it only works if we put some condition on the overall alignment... which is not quite as nice as the Go and Java versions... but it is a software engineering compromise.

Further reading if you are interested in performance implications: Data alignment for speed: myth or reality? (2012 blog post by myself)

@Kerollmops
Copy link
Member Author

Kerollmops commented Jan 31, 2021

How confident are you about this statement? The frozen serialization and deserialization in C are tested with sanitizers on. With sanitizers, undefined behaviour, including unaligned reads, are usually identified. It is entirely possible that we have bugs, either because the sanitizers are insufficient, or because our testing is not thorough enough... But I think it should be hard to identify such issues just by reading casually the functions. Maybe you ran some tests? Can you share your findings ? Be mindful that the frozen deserialization function has an alignment check and returns NULL when the proper alignment is not provided.

I'm sorry if what I said seemed aggressive, it wasn't intended. I was maybe a little too confident in the fact that those two lines were triggering an undefined behavior in the sense that it was interpreting a pointer into a pointer of u16 then dereferencing it, but as you said it is checked by the sanitizer, the order it is dereferenced (count[i]) must be valid.

uint16_t *counts = (uint16_t *)(buf + length - 4 - num_containers * 3);
// ...
run_zone_size += counts[i] * sizeof(rle16_t);

Please refer to my earlier answer: "This being said, what you describe is also undefined behaviour in C/C++ in the sense pointer casting sense, so don't do that. In C++, what you do is that you use memcpy (which is safe). In Rust, I'd expect you to use from_le_bytes or the equivalent." I am quite confident that when calling u32::from_le_bytes, you do not need for the argument to be aligned on a 32-bit boundary. If I am wrong, can you please provide me with a reference or a way to test it out? The following is unsafe *(p as *const u16) unless you know something about p, so, as I wrote, don't do that.

You absolutely can read 4 consecutive bytes anywhere in memory as a 32-bit integer, safely. Except maybe for really high-level languages like JavaScript and Python, this should always be possible.

You are totally right, this is part of the solution I came up with today: I keep a pointer to an array of bytes, and every time an operation needs an u64 or an u16 from this slice I read it from the slice by copying it in memory and returning the whole type.

In C, we were a bit limited, since C is such a simple language... and the only standard way to avoid the issue is a memcpy, and we did not want to write memcpy's everywhere or macros, we introduced some padding to alleviate the issue. But even then, it only works if we put some condition on the overall alignment... which is not quite as nice as the Go and Java versions... but it is a software engineering compromise.

Ok, so what I understand here is that any C library that needs to use this bitmap view system must also be compiled with the memory alignment sanitizer and must realign the memory by itself for the alignment to be valid? Maybe I didn't understand the "we introduced some padding to alleviate the issue" part.

Further reading if you are interested in performance implications: Data alignment for speed: myth or reality? (2012 blog post by myself)

Thank you, will read the blog post, and stop taking more of your time, I love your blog posts but didn't took the time to read them all 😊 I read about memory alignment and the amount of generated instructions in a blog post about some cap'n proto experiments.

@lemire
Copy link
Member

lemire commented Jan 31, 2021

Ok, so what I understand here is that any C library that needs to use this bitmap view system must also be compiled with the memory alignment sanitizer and must realign the memory by itself for the alignment to be valid? Maybe I didn't understand the "we introduced some padding to alleviate the issue" part.

No. The sanitizers are for testing/validation. A sanitizer does not realign. Let me define the terms. So GNU GCC and LLVM provide sanitizers tools that compile into your code checks for undefined behavior, out-of-bound memory addresses and so forth. You should never have them on in production. We use sanitizers to determine whether we have bugs. You write your tests, then you compile them with sanitizers on. The sanitizers will flag undefined behavior. So if you do *((uint64_t*)p) and p does not have the proper alignment, the sanitizer will blow. However, it will happily let you use memcpy.

See my 2016 blog post on the topic:
https://lemire.me/blog/2016/04/20/no-more-leaks-with-sanitize-flags-in-gcc-and-clang/

I focus mostly on memory accesses, but undefined behavior is there too (I give the flag). Undefined behavior is relatively trivial to catch compared to memory accesses (which requires you to map the memory regions).

In C, we use a distinct format (which we refer to as frozen) which provides the necessary alignment assuming that the starting pointer is itself 32-bit aligned.

It is distinct from the format described in the current repository. The format in the current directory is portable, in the sense that it can be used unchanged from C, C++, Go, Java and so forth. It can also be used, unchanged, without copy from Java and Go. In C/C++, you cannot do so, you need the frozen format. Note that there is nothing fundamental that prevents C/C++ from doing the same kind of mapping that we do in Go and Java. Ultimately, it is a matter of how much programming you are willing to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants