Opensource C++ zero-copy API #1896
Comments
Any updates on this feature? |
@jjyao This unfortunately hasn't made it onto our agenda yet. If this feature is useful to you, can you post your use case here and estimate how much it would help? A more concrete use case can help us prioritize it. |
I'm also quite interested in this feature.
@xfxyjwf I'm writing an application-specific database server with gRPC and RocksDB. I want to:
I want this because parsing and serialization currently take ~30% of my total response time and I don't really need them. |
Is this the thing that Cap'n Proto does that makes it faster than protobuf? |
@stellanhaglund No, it's not the main cause of the performance difference. Cap'n Proto is very similar to FlatBuffers, and what I described in #3296 applies to Cap'n Proto as well. |
I am very interested in this feature. I have been suggesting at my work that we adopt something like protobuf for a long time. One of the major pushbacks has been the inability to zero-copy large binary/string values. This matters because we have many applications where an extra copy or two of the data means the processors/memory bus are saturated. Our usual processing stream for data looks a lot like:
Control messages and metadata are small enough that copying is no problem (and in fact encoding as JSON etc. is usually good enough). Typical data is large matrices (think 16+ MB) of complex integers (often 16-bit), complex IEEE binary16 (half), or complex IEEE binary32 (float), while metadata may be 64 bytes in total, encoded as a struct. Note that we often also require the data to be machine-vector aligned (typically 32-byte aligned). A "slow" data rate is 3-5 Gigabit/s. It'd be great if we could encode such data as something like protobuf and not have to manually maintain readers, writers, and representations in multiple languages. We are already making an effort to use protobuf for control data, which IMO it already excels at. |
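To make the alignment requirement above concrete, here is a minimal sketch of the two checks a zero-copy consumer would probably need. `IsAligned` and `PaddingFor` are hypothetical helpers written for this illustration, not protobuf APIs, and the sketch assumes the receive buffer itself starts on an aligned address. The protobuf wire format does not pad fields, so any padding would have to be arranged by the application.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration; neither is a protobuf API.

// True when `p` sits on an `alignment`-byte boundary (alignment must be nonzero).
inline bool IsAligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Number of filler bytes a writer would have to emit so that a payload
// starting at byte `offset` of an aligned buffer lands on the next
// `alignment`-byte boundary.
inline std::size_t PaddingFor(std::size_t offset, std::size_t alignment) {
  return (alignment - offset % alignment) % alignment;
}
```

A reader could call `IsAligned(view.data(), 32)` on an aliased payload and fall back to a copy only when the check fails.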
Perhaps Cord will / could be open sourced as part of the Abseil library. The initial release doesn't include it, although there is a passing mention in |
@arthur-tacca Yep. The Cord type will be included as part of Abseil. And after we migrate to use Abseil, supporting zero-copy ctypes should be straightforward. |
Hello, I ran into some performance problems at my previous HFT job and thought it would be nice to have a zero-copy, heap-free protobuf parser. If I were going to hand-write code that parsed a specific protobuf schema, I'd typically do all my processing on the stack and consume all data in one pass. I could see a functional, template-heavy, low-level C++ decoder giving me the same performance. I would best describe it as: X (name proposals welcome) is to SAX as regular protobuf bindings are to DOM.
On the generation side I could see doing something similar.
Is there interest in this kind of thing? My fear is that C++ guys who really care about performance would avoid protobuf anyway. I guess my target audience is skilled C++ devs worried about performance who are forced to speak protobuf for historical reasons or because of a contract with outside components. Does anyone have a spiffy name? Has anyone seen something like this? I found lots of alternative wire formats with language bindings: SBE, CapNProto, etc. |
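For readers wondering what such a SAX-style decoder might look like, here is a rough sketch under stated assumptions. `ReadVarint`, `VisitFields`, and the `Totals` handler are hypothetical names invented for this illustration, not part of protobuf. The decoder walks the top-level fields once, allocates nothing on the heap, and hands each payload to the handler as a view into the input. Groups (wire types 3 and 4) and nested messages are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Decodes one base-128 varint from the front of `in` and advances past it.
inline bool ReadVarint(std::string_view& in, std::uint64_t& out) {
  out = 0;
  for (int shift = 0; shift < 64 && !in.empty(); shift += 7) {
    const auto byte = static_cast<unsigned char>(in.front());
    in.remove_prefix(1);
    out |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return true;
  }
  return false;  // truncated or overlong varint
}

// Single pass over the top-level fields. `handler` must provide
// OnVarint(field, value), OnFixed(field, bytes) and OnBytes(field, bytes).
template <typename Handler>
bool VisitFields(std::string_view in, Handler& handler) {
  while (!in.empty()) {
    std::uint64_t tag = 0;
    if (!ReadVarint(in, tag)) return false;
    const auto field = static_cast<std::uint32_t>(tag >> 3);
    std::uint64_t n = 0;
    switch (tag & 7) {
      case 0:  // varint
        if (!ReadVarint(in, n)) return false;
        handler.OnVarint(field, n);
        continue;
      case 1: n = 8; break;  // 64-bit fixed
      case 5: n = 4; break;  // 32-bit fixed
      case 2:                // length-delimited
        if (!ReadVarint(in, n)) return false;
        break;
      default:
        return false;  // groups are not handled in this sketch
    }
    if (n > in.size()) return false;
    const std::string_view payload = in.substr(0, static_cast<std::size_t>(n));
    in.remove_prefix(payload.size());
    if ((tag & 7) == 2) handler.OnBytes(field, payload);
    else handler.OnFixed(field, payload);
  }
  return true;
}

// Example handler that keeps all of its state on the stack.
struct Totals {
  std::uint64_t varint_sum = 0;
  std::size_t byte_count = 0;
  int fields = 0;
  void OnVarint(std::uint32_t, std::uint64_t v) { varint_sum += v; ++fields; }
  void OnFixed(std::uint32_t, std::string_view) { ++fields; }
  void OnBytes(std::uint32_t, std::string_view b) { byte_count += b.size(); ++fields; }
};
```

The handler is a template parameter rather than a virtual interface so that the callbacks can be inlined, which is the point of the "template heavy" approach described above.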
I don't think we would switch to using a SAX-like parser except maybe in some very specific circumstances in our project. For us, the overhead of most PBs is negligible (and I expect to be even lower when we switch to using arenas). The main exception is the std::string allocation of lots of tiny strings -- we're stuck on the pre-C++11 ABI, so every string ends up being a heap allocation/free pair. |
I can't use this library without that feature at all! I use an arena, because I store sensitive key material in my protobuf messages and I provided an allocator with safe memory to the arena (sodium_malloc, not swapped out, zeroed out on free, guard pages etc.). Given that the key material is stored in I already halfway ported my code from protobuf-c to protobuf, only now finding out that all my key material completely bypasses the arena. So now it seems like I have to throw that away and stick with protobuf-c (which makes me really unhappy). |
Any updates on this feature? |
I think string_view should be a solid contender to be fully released soon. Cords are a considerably more heavyweight type. Integrating ZCIS with Cords ties our most basic library directly into ABSL. We tread a little more carefully here. |
@gerben-s |
Zero-copy parsing of strings can be achieved by aliasing string_views or Cords with the underlying buffer. Cord is a heavyweight type from the absl lib, which needs to be directly supported by our ZeroCopyInputStream (ZCIS) abstraction. |
@FSMaxB On the level of safety I understand your wishes, but it's hard for us to make any such guarantee about not storing memory on the heap. If you have such stringent security demands, I think C++ protobuf is not the right fit. We are thinking about how to expose aliasing, but we want to be careful and expose the right API. |
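A minimal sketch of what "aliasing a string_view with the underlying buffer" can mean in practice, using C++17 `std::string_view`. `ReadVarint` and `FindBytesField` are hypothetical helpers written for this illustration, not the API protobuf exposes or plans to expose, and only wire types 0 (varint) and 2 (length-delimited) are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Decodes one base-128 varint from the front of `in` and advances past it.
inline std::optional<std::uint64_t> ReadVarint(std::string_view& in) {
  std::uint64_t value = 0;
  for (int shift = 0; shift < 64 && !in.empty(); shift += 7) {
    const auto byte = static_cast<unsigned char>(in.front());
    in.remove_prefix(1);
    value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return value;
  }
  return std::nullopt;  // truncated or overlong varint
}

// Returns a view of the first length-delimited field numbered `field_number`.
// The returned view points into `buffer`; no bytes are copied, so the caller
// must keep the buffer alive for as long as the view is used.
inline std::optional<std::string_view> FindBytesField(std::string_view buffer,
                                                      std::uint32_t field_number) {
  while (!buffer.empty()) {
    const auto tag = ReadVarint(buffer);
    if (!tag) return std::nullopt;
    const auto wire_type = static_cast<std::uint32_t>(*tag & 7);
    if (wire_type == 0) {
      if (!ReadVarint(buffer)) return std::nullopt;  // skip a varint value
    } else if (wire_type == 2) {
      const auto len = ReadVarint(buffer);
      if (!len || *len > buffer.size()) return std::nullopt;
      const std::string_view payload =
          buffer.substr(0, static_cast<std::size_t>(*len));
      if ((*tag >> 3) == field_number) return payload;
      buffer.remove_prefix(payload.size());
    } else {
      return std::nullopt;  // other wire types are out of scope here
    }
  }
  return std::nullopt;
}
```

The lifetime requirement in the comment is the main design cost of aliasing: the parsed message is only valid while the input buffer is.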
That makes sense, especially without |
Is there any update on this feature? This would be very useful. Now that absl has a string_view implementation (https://github.com/abseil/abseil-cpp/blob/master/absl/strings/string_view.h) it seems like that could be used :) |
These changes allow the C++ gRPC arena feature to be used to allocate messages from the same location. The initial motivation was to keep all the message strings in the same region of memory, but the open source version of gRPC doesn't seem to allow it, and since Google keeps this feature internal, it isn't possible to create a PR. Please, Google, if you don't mind, could you spend a little of your engineering team's time to open source it? protocolbuffers/protobuf#1896
I think there are two requests here: (a) Allow ctype = STRING_PIECE The original comment says "(1) opensource ... Cord and its dependencies ... is probably the most difficult part" But surely that's only needed for ctype = CORD? For ctype = STRING_PIECE, a vendored copy of The ctype = STRING_PIECE feature solves the zero-copy problem in the case where the string you want to refer to without copying is contiguous, which is probably enough functionality for many people (e.g. me 😃). So perhaps, rather than waiting for some solution involving Cord, just the string piece functionality could be open sourced? I thought this was already well understood, but reading through the comments it seems it hasn't been mentioned here before. The comments mostly discuss alternative types such as |
That makes sense, thanks. |
There are also standalone |
I also encountered the same problem, have you solved it? |
Yes, by first porting my code back to protobuf-c. Later abandoning the entire project and then never using protobuf ever again in the future. |
Just in case anyone finds it useful, I did a little hacking on a branch that supports storing string buffers in arenas: toddlipcon@00cc310 The above only supports it on the serialization side -- i.e. if you call |
This is a huge deal, especially on mobile. Now that |
I'm very interested in this feature. I have a project where we embed long strings (several KiB) in protobuf messages. It would save significant CPU time if this feature were available. |
Chiming in to say that we use Protobuf in virtually all of our projects here and would love to see this fixed, even if it required upgrading to C++17 (we're only on C++14 for the most part now.) |
I would also love to see this implemented. We could then use protobuf for our data plane APIs as well. |
We would also really like to see std::string_view support as well. Would make arenas actually useful. |
We have a lot of long-term plans that will drive us towards this space; however, the migrations required make it slow going. Expect to see us start breaking ground over the next year. |
Looking forward to this feature. |
Support for |
This would be a vital feature, really. I just wrote a question on Stack Overflow: I seek to fully process the incoming data without copying it, so the intended use is to feed protobuf with the starting address The intended output of array data (strings and raw data) would be For embedded systems, copying yields more heap fragmentation, and with large messages this becomes worse. I really can't see a reason why it's not already built, other than not supporting C++17 onward (though it could be an optional compile-time option). |
I tried this, but it does not work. String buffers are also on the heap. |
We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment. This issue is labeled |
This bugreport is going to school by now. |
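Why tease us with the possibility of a zero-copy API, and then let it hang and dangle like this 😅 |
While the overall issue is about allowing strings to be used without any copy, is there at least a possibility/plan to support storing string contents on an arena? This would still require a copy, but it would save at least the dynamic memory allocation introduced by using std::string. Even when using features.(pb.cpp).string_type = VIEW, a classic std::string is allocated currently. What's missing to support that? |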
That would still require a custom alternative to std::string as the string
type, but it would be a great (and safe) improvement over the status quo.
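As an illustration of why a different string type is needed, here is a small sketch using the C++17 standard library's `std::pmr::string`, which is a distinct type from `std::string` and takes its storage from a caller-supplied memory resource. This is only an analogy: it does not use protobuf's `Arena`, and `StoredInside` and `ArenaStringDemo` are hypothetical names made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory_resource>
#include <string>
#include <string_view>

// True when the characters of `s` lie inside [begin, begin + size).
inline bool StoredInside(std::string_view s, const std::byte* begin,
                         std::size_t size) {
  const auto* first = reinterpret_cast<const std::byte*>(s.data());
  return std::less_equal<>{}(begin, first) &&
         std::less_equal<>{}(first + s.size(), begin + size);
}

inline bool ArenaStringDemo() {
  alignas(std::max_align_t) std::array<std::byte, 1024> storage;
  // Using null_memory_resource() as the upstream means any allocation that
  // does not fit in `storage` throws std::bad_alloc instead of silently
  // falling back to the heap.
  std::pmr::monotonic_buffer_resource arena(storage.data(), storage.size(),
                                            std::pmr::null_memory_resource());
  std::pmr::string value(&arena);
  // Long enough that the small-string optimization cannot hold it inline.
  value.assign(200, 'k');
  return StoredInside(value, storage.data(), storage.size());
}
```

Short strings may still live inside the string object itself because of the small-string optimization, so a type like this controls where the bytes go only for strings above that threshold.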
…On Fri, 22 Nov 2024 at 10:54 Alexander Krabler ***@***.***> wrote:
While the overall issue is for allowing to use strings without any copy,
is there at least a possibility/plan to support storing string contents on
an arena?
This would still require a copy, but it would save at least the dynamic
memory allocation introduced by using std::string.
Even when using features.(pb.cpp).string_type = VIEW, a classic
std::string is allocated currently.
What's missing to support that?
|
We do plan to support |
Protobuf has zero-copy support to avoid copying string/bytes fields when parsing protobuf messages, and it's used pretty much everywhere inside Google, but the feature has never made its way into the open source repo. Now that protobuf 3.0.0 is released, we will probably have more time to look into incremental improvements. The zero-copy API is a good candidate to be included in the next 3.x release.
Opensourcing the zero-copy API will involve:
(1) is probably the most difficult part as that's a large chunk of code and it may not be portable.