-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace Buffer with Uint8Array #452
Conversation
ab64683
to
dc4a675
Compare
Thanks @valadaptive! I'll try to find time to merge #428. |
FYI @valadaptive - #428 is in. |
6f39bde
to
bd70cce
Compare
Working on removing |
@mtth Do you intend to support the ability to roundtrip various Avro types to/from JSON? With the current representation that uses
I lean towards removing the |
It's nice to have, but I'm OK dropping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the slow review. Overall this looks great but I would like more time to cover all the changes. I'm sending a first batch of comments now, mostly questions, but feel free to wait until I send the next batch (hopefully this weekend) to respond.
etc/benchmarks/avro-serialization-implementations/scripts/decode/node-avro-io.js
Outdated
Show resolved
Hide resolved
I believe it works right now in terms of recognizing I can either try to implement code to recognize JSON'd |
Short reference to https://gist.github.com/joscha/d8603a1f0af5b0b055546c792b2a8ff6, which can be updated once this pull request lands. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few minor comments, and I expect this will be good to go.
lib/types.js
Outdated
return RANDOM.nextString( | ||
Math.floor(-Math.log(RANDOM.nextFloat()) * 16) + 1 | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I touched on this in the PR description above, but this is solely for benchmarking purposes--when rewriting and optimizing the string encoding/decoding functions, I wanted to make sure I exercised both the "short string" and "long string" code paths. This is an exponential distribution, which means that shorter strings are more likely but longer ones are possible.
I can revert this, although in the long term it might be better for Type#random
to be moved into the testing code, since I'm not sure what purpose it serves to users of the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the background. This is fine. Agreed that this method would be best moved out of the core types, but this is best done separately.
// The maximum number that a signed varint can store in a single byte is 63. | ||
// The maximum size of a UTF-8 representation of a UTF-16 string is 3 times | ||
// its length, as one UTF-16 character can be represented by up to 3 bytes | ||
// in UTF-8. Therefore, if the string is 21 characters or less, we know that | ||
// its length can be stored in a single byte, which is why we choose 21 as | ||
// the small-string threshold specifically. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the great comments, here and throughout.
return new Tap(buf); | ||
} | ||
|
||
toBuffer () { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Out of scope for this PR.) It's surprising to have "buffer" functions return typed arrays now. It will be good to update the methods throughout the package to have consistent names later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely something that should be done before a breaking 6.0 release. I contemplated doing it as part of this PR, but I'm not sure what terminology would be best. toBinary
sounds like it could refer to a binary string. Maybe toTypedArray
or toBytes
? Maybe rename Type#encode
to Type#encodeInto
a la TextEncoder
and repurpose the old method name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An "Into" suffix for encode
sounds good; I like the consistency with TextEncoder
. binaryEncode
(and jsonEncode
) might be fine, they mirror the wording in the Avro specification.
It seems to be slower than just using Uint8Array.slice, since subarray is so expensive to call.
This should assist in optimizing writeString properly, allowing both long and short string encoding performance to be benchmarked while still prioritizing short strings.
This took some fiddling but it's now *faster* than the previous implementation.
Not sure if Deno/Bun implement Buffer, but latin1Slice borders on an implementation detail since it isn't mentioned in the docs, so we should make sure it exists before using it.
6d6c595
to
3941f74
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is a great step forward - thank you @valadaptive. The performance improvements are particularly impressive.
I'll follow up with a refactor, renaming methods to make them consistent with the new types.
return new Tap(buf); | ||
} | ||
|
||
toBuffer () { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An "Into" suffix for encode
sounds good; I like the consistency with TextEncoder
. binaryEncode
(and jsonEncode
) might be fine, they mirror the wording in the Avro specification.
lib/types.js
Outdated
return RANDOM.nextString( | ||
Math.floor(-Math.log(RANDOM.nextFloat()) * 16) + 1 | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the background. This is fine. Agreed that this method would be best moved out of the core types, but this is best done separately.
I've noticed that this change breaks types in Line 77 in 5db1165
Buffer should become Uint8Array ?
|
Yes. |
Great, see #488 |
I still need to do some cleanups and replace uses of
Buffer.compare
, but I'm putting this PR here so you can benchmark it. Before I can complete this, #428 (minus the version bump) should be merged in, as the services code usesBuffer
everywhere and expects fixed/bytes types to decode toBuffer
s.I've made some tweaks to the benchmarking setup, so comparing this directly to the master branch won't work:
ArrayFloat
benchmark to go along withArrayDouble
.--expose-gc
flag when benchmarking in order to manually trigger garbage collection between benches, hopefully making results more consistent.Tap#writeString
, since it only generated strings up to a length of 32.I've cherry-picked those benchmarking changes into the bench-tweaks branch, which you can use to compare benchmarks.