Parse bytes directly #356

Open
robsmith11 opened this issue Apr 20, 2023 · 5 comments

Comments

@robsmith11

It would be nice if JSON.Parser.parse could be passed a vector of bytes and parse it assuming UTF-8 encoding without having to manually allocate a new String. My most common use case (probably for many other people too?) is downloading a JSON file with HTTP.get("...").body, which returns bytes.
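For reference, the manual workaround described above might look roughly like the following. This is a minimal sketch: it assumes JSON.jl is installed, and it uses a literal byte vector as a stand-in for the result of `HTTP.get("...").body` so that it does not need a network connection.

```julia
using JSON  # JSON.jl, assumed to be installed

# Stand-in for the bytes returned by HTTP.get("...").body
body = collect(codeunits("{\"name\": \"JSON.jl\", \"issue\": 356}"))

# String(body) may reuse the buffer and leaves `body` empty afterwards;
# use String(copy(body)) instead if the bytes are needed again.
data = JSON.parse(String(body))

println(data["name"])
println(data["issue"])
```

The explicit `String(...)` call is the step the request asks JSON.jl to perform on the caller's behalf.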

@KristofferC
Member

You could maybe use https://github.com/JuliaStrings/StringViews.jl.

@robsmith11
Author

StringViews.jl does look good for use in projects, but would it make sense for more casual interactive use to have JSON.jl do something automatically when passed bytes?

@KristofferC
Member

One issue is that, arguably, anything that accepts a string should then also accept a byte buffer. The best way to do that would probably be to take StringViews as a dependency and wrap the bytes in it. So the result would be roughly equivalent, except that every function would have to define this instead of the caller just doing it.
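A rough sketch of the wrapping described above, done on the caller's side. It assumes JSON.jl and StringViews.jl are both installed and that `JSON.parse` accepts any `AbstractString`; `parse_bytes` is a hypothetical helper name used only for illustration, not part of either package.

```julia
using JSON         # assumed installed
using StringViews  # assumed installed

# Hypothetical helper, not part of JSON.jl: wrap the bytes in a
# StringView and hand that to the existing string-based parser.
parse_bytes(bytes::AbstractVector{UInt8}; kwargs...) =
    JSON.parse(StringView(bytes); kwargs...)

bytes = collect(codeunits("[1, 2, 3]"))
result = parse_bytes(bytes)

println(result)
```

The `StringView` wrapper itself neither copies nor empties `bytes`. Whether the parser makes its own copy internally is a separate question that this sketch does not settle.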

@kpa28-git

I've noticed that using StringViews instead of String does not improve performance for me (it is actually slightly slower and allocates more). The following is from the docs for String (Julia 1.8.5). If I understand correctly, strings produced from UTF-8 bytes already act like views.

String(v::AbstractVector{UInt8})
Create a new String object from a byte vector v containing UTF-8 encoded characters.
...
When possible, the memory of v will be used without copying when the String object is
created. This is guaranteed to be the case for byte vectors returned by take! on a writable
IOBuffer and by calls to read(io, nb). This allows zero-copy conversion of I/O data to
strings. In other cases, Vector{UInt8} data may be copied, but v is truncated anyway to
guarantee consistent behavior.
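The truncation behavior in the quoted docstring can be checked with Base alone. The sketch below only shows what is observable from the outside (the source vector ends up empty); it does not show whether a copy was made internally, which is the "when possible" part of the docstring.

```julia
# A byte vector holding UTF-8 encoded JSON text.
bytes = collect(codeunits("{\"a\": 1}"))

# String(v) empties v, whether or not the memory could be reused.
s = String(bytes)
println(s)
println(isempty(bytes))

# To keep the bytes around, convert a copy instead.
kept = collect(codeunits("[1, 2]"))
s2 = String(copy(kept))
println(length(kept))

# take! on a writable IOBuffer is one of the cases the docstring
# says is guaranteed to convert without copying.
io = IOBuffer()
write(io, "{\"b\": 2}")
s3 = String(take!(io))
println(s3)
```

Because `String(v)` consumes `v`, it is not a drop-in view: code that needs the bytes afterwards has to pay for `copy`.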

@KristofferC
Member

KristofferC commented May 5, 2023

"When possible"

This is not often the case; the array needs to have been allocated in a special way for this.

And copying a chunk of memory like a string tends to be quite fast, so it's quite possible that you wouldn't notice it. And maybe StringViews has some issue that makes it slower than it should be.
