Replace Buffer with Uint8Array #452

valadaptive · 2024-02-24T00:55:44Z

I still need to do some cleanups and replace uses of Buffer.compare, but I'm putting this PR here so you can benchmark it. Before I can complete this, #428 (minus the version bump) should be merged in, as the services code uses Buffer everywhere and expects fixed/bytes types to decode to Buffers.

I've made some tweaks to the benchmarking setup, so comparing this directly to the master branch won't work:

- I've added an ArrayFloat benchmark to go along with ArrayDouble.
- I use the --expose-gc flag when benchmarking in order to manually trigger garbage collection between benches, hopefully making results more consistent.
- I've changed the length distribution of strings to be exponentially weighted, so shorter strings are still likely but longer strings will now be occasionally generated. The previous code was only benchmarking the manual path of Tap#writeString, since it only generated strings up to a length of 32.
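
To illustrate the --expose-gc tweak: when Node is started with that flag, a gc function becomes available on the global object, and a benchmark loop can call it between runs. This is a minimal sketch and not the PR's perf script; collectGarbage and runBench are hypothetical names.

```javascript
'use strict';

// Hypothetical helper, not taken from etc/scripts/perf.
// `gc` is only defined when Node is started with `node --expose-gc`.
function collectGarbage() {
  if (typeof globalThis.gc === 'function') {
    globalThis.gc();
    return true;
  }
  return false; // Flag not set; the bench still runs, just with noisier results.
}

// Times `fn` over `iterations` calls, collecting garbage first so that
// allocations from the previous bench are less likely to be billed to this one.
function runBench(name, fn, iterations) {
  collectGarbage();
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) {
    fn();
  }
  const elapsedNs = process.hrtime.bigint() - start;
  return {name, iterations, elapsedMs: Number(elapsedNs) / 1e6};
}

const result = runBench('alloc', () => new Uint8Array(1024), 1000);
console.log(result.name, result.elapsedMs.toFixed(3) + 'ms');
```

Run it as `node --expose-gc bench.js` (file name assumed) to get the collection step; without the flag the helper simply returns false.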

I've cherry-picked those benchmarking changes into the bench-tweaks branch, which you can use to compare benchmarks.

mtth · 2024-03-04T00:02:32Z

Thanks @valadaptive! I'll try to find time to merge #428.

mtth · 2024-03-30T17:07:14Z

FYI @valadaptive - #428 is in.

valadaptive · 2024-03-30T21:39:23Z

Working on removing Buffer usage from types.js now. I noticed the isJsonBuffer function, which seems to check if a given object is the JSON representation of a Buffer. Under what circumstances are Avro types directly serialized to JSON and/or parsed back directly? I can't easily polyfill Uint8Array to stringify to a regular array, so I'll probably need to insert a fixup step when stringifying/parsing.

valadaptive · 2024-09-10T23:01:07Z

@mtth Do you intend to support the ability to roundtrip various Avro types to/from JSON? With the current representation that uses Buffers, this works mostly fine, but with Uint8Array, the JSON representation is a lot more bloated:

> JSON.stringify(Buffer.from([1, 2, 3, 4, 5]))
'{"type":"Buffer","data":[1,2,3,4,5]}'
> JSON.stringify(new Uint8Array([1, 2, 3, 4, 5]))
'{"0":1,"1":2,"2":3,"3":4,"4":5}'

I lean towards removing the coerceBuffers option entirely to discourage people from serializing Uint8Arrays to JSON.

mtth · 2024-09-13T04:04:17Z

> Do you intend to support the ability to roundtrip various Avro types to/from JSON?

It's nice to have, but I'm OK dropping coerceBuffers if it adds significant complexity.

mtth

Apologies for the slow review. Overall this looks great but I would like more time to cover all the changes. I'm sending a first batch of comments now, mostly questions, but feel free to wait until I send the next batch (hopefully this weekend) to respond.

valadaptive · 2024-09-13T08:31:41Z

> It's nice to have, but I'm OK dropping coerceBuffers if it adds significant complexity.

I believe it works right now in terms of recognizing Buffers. However, since we now use Uint8Arrays instead of Buffers in types, they'll serialize to much larger objects in JSON (e.g. new Uint8Array([1, 2, 3, 4, 5]) is serialized as '{"0":1,"1":2,"2":3,"3":4,"4":5}').

I can either try to implement code to recognize JSON'd Uint8Arrays, or remove coerceBuffers entirely. Leaving it as-is would be a huge footgun, because you can no longer round-trip Avro types through JSON (it'll serialize Uint8Arrays to a JSON representation it cannot itself recognize as a Uint8Array).
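
For reference, one possible shape for such a fixup step is a JSON.stringify replacer paired with a JSON.parse reviver. This is only a sketch under assumptions: the tagged representation and the names bytesReplacer and bytesReviver are hypothetical and are not part of avsc or of this PR.

```javascript
'use strict';

// Hypothetical tag; avsc does not define this representation.
const TAG = 'Uint8Array';

function bytesReplacer(key, value) {
  // Buffers are converted by Buffer#toJSON before the replacer sees them,
  // so this branch only fires for plain typed arrays.
  if (value instanceof Uint8Array) {
    return {type: TAG, data: Array.from(value)};
  }
  return value;
}

function bytesReviver(key, value) {
  if (
    value !== null && typeof value === 'object' &&
    value.type === TAG && Array.isArray(value.data)
  ) {
    return Uint8Array.from(value.data);
  }
  return value;
}

const record = {id: 1, payload: new Uint8Array([1, 2, 3, 4, 5])};
const json = JSON.stringify(record, bytesReplacer);
const parsed = JSON.parse(json, bytesReviver);
console.log(json);
console.log(parsed.payload instanceof Uint8Array);
```

The tag makes the round trip work, but it has the same ambiguity as the Buffer form: a record whose own fields happen to be named type and data could be misread as bytes.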

joscha · 2024-09-13T08:42:26Z

Short reference to https://gist.github.com/joscha/d8603a1f0af5b0b055546c792b2a8ff6, which can be updated once this pull request lands.

mtth

Just a few minor comments, and I expect this will be good to go.

mtth · 2024-09-16T03:39:57Z

lib/types.js

+    return RANDOM.nextString(
+      Math.floor(-Math.log(RANDOM.nextFloat()) * 16) + 1
+    );


Why this change?

valadaptive · 2024-09-17

I touched on this in the PR description above, but this is solely for benchmarking purposes--when rewriting and optimizing the string encoding/decoding functions, I wanted to make sure I exercised both the "short string" and "long string" code paths. This is an exponential distribution, which means that shorter strings are more likely but longer ones are possible.

I can revert this, although in the long term it might be better for Type#random to be moved into the testing code, since I'm not sure what purpose it serves to users of the library.

mtth · 2024-09-21T15:50:58Z

Thanks for the background. This is fine. Agreed that this method would be best moved out of the core types, but this is best done separately.
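
The length expression from the diff can be tried on its own to see the shape of the distribution. In this sketch, nextFloat stands in for RANDOM.nextFloat() and is assumed to return a value in (0, 1]; the sample size and the 32-character cutoff are only there for illustration.

```javascript
'use strict';

// Mirrors the expression in the diff above. `nextFloat` stands in for
// RANDOM.nextFloat() and is assumed to return a value in (0, 1].
function randomLength(nextFloat) {
  return Math.floor(-Math.log(nextFloat()) * 16) + 1;
}

// Math.random() returns [0, 1); flipping it avoids Math.log(0).
const uniform = () => 1 - Math.random();

const lengths = Array.from({length: 10000}, () => randomLength(uniform));
const longCount = lengths.filter((n) => n > 32).length;
console.log('min:', Math.min(...lengths), 'max:', Math.max(...lengths));
console.log('longer than 32:', longCount, 'of', lengths.length);
```

Most samples stay short, while a minority exceed 32 characters, which is the case the earlier generator never produced.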

mtth · 2024-09-16T03:59:25Z

lib/utils.js

+    // The maximum number that a signed varint can store in a single byte is 63.
+    // The maximum size of a UTF-8 representation of a UTF-16 string is 3 times
+    // its length, as one UTF-16 character can be represented by up to 3 bytes
+    // in UTF-8. Therefore, if the string is 21 characters or less, we know that
+    // its length can be stored in a single byte, which is why we choose 21 as
+    // the small-string threshold specifically.


Thank you for the great comments, here and throughout.
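
The arithmetic in that comment can be checked directly. The sketch below is independent of the PR's code: zigzagVarintSize is a hypothetical helper that counts the bytes a zigzag-encoded varint needs, and the euro sign is used as an example of a single UTF-16 code unit that takes three bytes in UTF-8.

```javascript
'use strict';

// Number of bytes a zigzag-encoded varint needs for a 32-bit signed integer.
function zigzagVarintSize(n) {
  let z = ((n << 1) ^ (n >> 31)) >>> 0; // zigzag: 0, -1, 1, -2 -> 0, 1, 2, 3
  let size = 1;
  while (z > 0x7f) {
    z >>>= 7;
    size++;
  }
  return size;
}

const encoder = new TextEncoder();
// '\u20ac' (the euro sign) is one UTF-16 code unit and three UTF-8 bytes,
// so 21 of them hit the worst case for a 21-character string.
const worstCase = encoder.encode('\u20ac'.repeat(21));

console.log(worstCase.length);                   // 63
console.log(zigzagVarintSize(worstCase.length)); // 1
console.log(zigzagVarintSize(64));               // 2
```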

mtth · 2024-09-16T04:06:25Z

lib/utils.js

+    return new Tap(buf);
+  }
+
+  toBuffer () {


(Out of scope for this PR.) It's surprising to have "buffer" functions return typed arrays now. It will be good to update the methods throughout the package to have consistent names later on.

valadaptive · 2024-09-17

Definitely something that should be done before a breaking 6.0 release. I contemplated doing it as part of this PR, but I'm not sure what terminology would be best. toBinary sounds like it could refer to a binary string. Maybe toTypedArray or toBytes? Maybe rename Type#encode to Type#encodeInto a la TextEncoder and repurpose the old method name?

mtth · 2024-09-21T15:37:28Z

An "Into" suffix for encode sounds good; I like the consistency with TextEncoder. binaryEncode (and jsonEncode) might be fine, they mirror the wording in the Avro specification.
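
For readers unfamiliar with the TextEncoder naming being referenced: encode allocates and returns a new array, while encodeInto writes into a caller-supplied one. The sketch below only demonstrates the standard TextEncoder API; it says nothing about how the renamed avsc methods would eventually behave.

```javascript
'use strict';

const encoder = new TextEncoder();

// encode() allocates and returns a new Uint8Array.
const fresh = encoder.encode('h\u00e9llo');

// encodeInto() writes into a caller-supplied array and reports how many
// UTF-16 code units were read and how many bytes were written.
const target = new Uint8Array(8);
const {read, written} = encoder.encodeInto('h\u00e9llo', target);

console.log(fresh.length);  // 6 ('\u00e9' takes two bytes in UTF-8)
console.log(read, written); // 5 6
```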

mtth

This PR is a great step forward - thank you @valadaptive. The performance improvements are particularly impressive.

I'll follow up with a refactor, renaming methods to make them consistent with the new types.

joscha · 2024-10-09T13:51:49Z

I've noticed that this change breaks types, for example in avsc/types/index.d.ts (line 77 in 5db1165):

    type Codec = (buffer: Buffer, callback: Callback<Buffer>) => void;

Before I update the types, is it safe to assume that all occurrences of Buffer should become Uint8Array?

refs #128 and #479

mtth · 2024-10-13T16:05:38Z

> is it safe to assume that all occurrences of Buffer should become Uint8Array

Yes.

joscha · 2024-10-17T12:50:36Z

> > is it safe to assume that all occurrences of Buffer should become Uint8Array
>
> Yes.

Great, see #488

valadaptive force-pushed the debufferify branch from ab64683 to dc4a675 Compare February 24, 2024 00:56

valadaptive force-pushed the debufferify branch 3 times, most recently from 6f39bde to bd70cce Compare March 30, 2024 21:17

mtth reviewed Sep 13, 2024

mtth reviewed Sep 16, 2024

valadaptive added 17 commits September 16, 2024 10:22

Replace Tap constructor with factory methods

e1ba8ed

Remove utils.bufferFrom

8a1379f

Remove utils.newBuffer

e92e52c

Replace Buffer.slice with Buffer.subarray

e968a0a

Encapsulate Tap.buf

9a2db96

Loosen isBuffer checks to accept Uint8Array

84edd1e

WIP debufferify

7daba22

Add float benchmark

9e6b842

Finally fix perf woes

a0be1cc

Manually trigger GC in perf

61a1623

Improve text encode/decode performance

eac545a

Remove shared buffer pool

a902cff

It seems to be slower than just using Uint8Array.slice, since subarray is so expensive to call.
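
As background for this commit message, the two methods differ in what they return: subarray creates a view over the same memory, while Uint8Array's slice allocates a copy. The sketch shows only that semantic difference and makes no attempt to reproduce the performance observation.

```javascript
'use strict';

const source = new Uint8Array([1, 2, 3, 4]);

const view = source.subarray(1, 3); // shares source's ArrayBuffer
const copy = source.slice(1, 3);    // allocates a new ArrayBuffer

source[1] = 99;

console.log(view[0]); // 99: the view sees the write
console.log(copy[0]); // 2: the copy does not
console.log(view.buffer === source.buffer); // true
console.log(copy.buffer === source.buffer); // false
```

Note that Node's Buffer#slice returns a view rather than a copy, unlike Uint8Array#slice, which is why code that moves from one type to the other has to pick the intended behavior explicitly.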

Remove BufferPool entirely

1195470

Use exponential distribution for random strings

c733678

This should assist in optimizing writeString properly, allowing both long and short string encoding performance to be benchmarked while still prioritizing short strings.

Redo writeString without Buffer.byteLength

5a6176a

This took some fiddling but it's now *faster* than the previous implementation.

Polyfill Buffer.compare in browsers

6d56968

Remove Buffer usage from (un)packLongBytes

f1e6db8

valadaptive added 17 commits September 16, 2024 10:22

Remove buffer imports

756e79f

Return Uint8Array from Lcg#nextBuffer

b490a2f

Remove Buffer usage from containers.js

048e9fb

Remove Buffer usage from index.js

d3b8973

Remove Buffer usage from Type#fingerprint

2017980

Replace Buffer.compare with utils.bufCompare

61cbab2

Polyfill BytesType._copy Buffer API usages

898ce75

Update comments in types.js

108f177

Fix isBufferLike import

23c9bea

Add bufEqual helper

42e97f6

Cache text encoding subarrays on-demand

005b94b

Further optimize string decoding

50a1e67

Guard Buffer.prototype.latin1Slice as well

f5c6858

Not sure if Deno/Bun implement Buffer, but latin1Slice borders on an implementation detail since it isn't mentioned in the docs, so we should make sure it exists before using it.
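
A guard of this kind might look like the following. This is a sketch under assumptions rather than the code from the commit: decodeLatin1 is a hypothetical helper name, and the fallback loop is just one simple way to decode Latin-1 without Buffer.

```javascript
'use strict';

// Feature-detect rather than assume: Buffer may be missing entirely (browsers)
// or may lack the undocumented latin1Slice method.
const hasLatin1Slice =
  typeof Buffer === 'function' &&
  typeof Buffer.prototype.latin1Slice === 'function';

// Hypothetical helper name; decodes bytes[start, end) as Latin-1.
function decodeLatin1(bytes, start, end) {
  if (hasLatin1Slice) {
    // Wrap the same memory without copying, then use the fast path.
    return Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength)
      .latin1Slice(start, end);
  }
  // Portable fallback: each Latin-1 byte maps to the same code point.
  let str = '';
  for (let i = start; i < end; i++) {
    str += String.fromCharCode(bytes[i]);
  }
  return str;
}

console.log(decodeLatin1(new Uint8Array([72, 105, 0xe9]), 0, 3));
```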

Store cached fingerprint as string again

b00b88d

Add comments on Buffer function specializations

3ae2c4d

Fix setting this.pos in Tap#writeString

2f8a010

Make Tap#length a getter

3941f74

valadaptive force-pushed the debufferify branch from 6d6c595 to 3941f74 Compare September 17, 2024 11:52

Revert StringType#random change

62b8e93

valadaptive requested a review from mtth September 17, 2024 12:04

valadaptive marked this pull request as ready for review September 19, 2024 01:14

mtth reviewed Sep 21, 2024

mtth merged commit c80c670 into mtth:master Sep 21, 2024
3 checks passed

This was referenced Sep 24, 2024

Migrate to Uint8Array #410

Open

Using seprately declared enum in union in record. #461

Closed

mtth mentioned this pull request Sep 28, 2024

feat: allow named types in unions #469

Merged

joscha mentioned this pull request Oct 17, 2024

fix(types): Buffer -> Uint8Array #488

Merged
