Adds "decode all" option #92

Open
wants to merge 2 commits into
base: master

Conversation

rjp

@rjp rjp commented Jan 17, 2022

Fixes #70 (implicitly), #23. May also have an impact on the
"high memory usage" issues but I'm doing more testing there.

Adds: `-a`, `-all` flag which means "decode all the objects,
pretending it's a JSON stream even if it's not actually."

Rationale: `gron` only decodes the first object, and `gron -s`
requires a "correctly" formatted JSON stream (one object per
line), but it's not uncommon to get multiple objects per line
from tools that don't support JSON stream formatting.

This does require a positionable stream, however, since the
JSON decoder can read past the end of an object to be sure it's
parsed correctly. `io.Seeker` doesn't work, unfortunately,
because whilst we know where we want to be (`d.InputOffset()`),
we don't actually know where we currently are, which precludes
the use of `io.SeekCurrent` and, bizarrely, it turns out that
`io.SeekStart` gets progressively slower as you seek further and
further into your (in this case) `bytes.Buffer`.
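
To illustrate the read-ahead described above, here is a minimal sketch (not code from this PR; the helper name `offsets` is hypothetical). It assumes a small in-memory input and compares the decoder's logical position, `d.InputOffset()`, with how much the underlying reader still holds after a single `Decode`:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// offsets decodes one value from s and reports the decoder's logical
// offset alongside the number of bytes still unread in the source reader.
// (Hypothetical helper, for illustration only.)
func offsets(s string) (decoded int64, unread int, err error) {
	r := strings.NewReader(s)
	d := json.NewDecoder(r)
	var v interface{}
	if err = d.Decode(&v); err != nil {
		return 0, r.Len(), err
	}
	// InputOffset is the end of the value just decoded; r.Len() shows
	// how far the decoder has actually consumed the source.
	return d.InputOffset(), r.Len(), nil
}

func main() {
	decoded, unread, err := offsets(`{"a":1}{"b":2}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded, unread) // 7 0
}
```

For an input this small the decoder buffers everything on its first read, so the source reader is already at the end even though only the first object (7 bytes) has been decoded.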

Thus we keep track of where we want to be (`moved`) and create
a `bytes.NewReader` for each attempted decode at the correct
position. Crufty, definitely, and memory-allocation heavy,
probably, but it works and is surprisingly not that bad even
on large files.
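
The approach described above could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `decodeAll` is a hypothetical name, the whole input is assumed to be in memory already, and only the `moved` variable is borrowed from the description.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// decodeAll decodes every top-level JSON value in data, even when several
// values share a line. (Hypothetical helper, for illustration only.)
func decodeAll(data []byte) ([]interface{}, error) {
	var out []interface{}
	var moved int64 // absolute offset where the next value should start
	for moved < int64(len(data)) {
		// A fresh reader and decoder per value, positioned at moved.
		d := json.NewDecoder(bytes.NewReader(data[moved:]))
		var v interface{}
		if err := d.Decode(&v); err != nil {
			if err == io.EOF {
				break // only trailing whitespace was left
			}
			return out, err
		}
		out = append(out, v)
		// InputOffset is relative to this reader, so accumulate it.
		moved += d.InputOffset()
	}
	return out, nil
}

func main() {
	vals, err := decodeAll([]byte(`{"a":1}{"b":2} [3,4]`))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(vals)) // 3
}
```

Creating a new decoder per value is what makes this allocation-heavy, which matches the caveat above.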

My 85MB single-line JSON test input takes ~64s (x86_64),
~43s (arm64) and ~275M to parse into 1024 objects comprising
1GB of output text. Compare to `jq`: ~25s (x86_64),
~11s (arm64) using ~630M giving 350MB of output.

rjp added 2 commits January 14, 2022 11:48
@milahu

This comment was marked as off-topic.

@rjp
Author

rjp commented Apr 18, 2022

> what else do we need all the (non-option) argv for?

Ah, this is "decode all the objects in the input", not "decode all the objects in the command line arguments": I have tools that output multiple objects in a single file, in non-stream format, and I needed to decode them.

But yes, iterating over the arguments does make sense if only for xargs usage.

@milahu

milahu commented Apr 18, 2022

oops, i confused this issue with #28

> Adds: `-a`, `-all` flag which means "decode all the objects,
> pretending it's a JSON stream even if it's not actually."

now it makes sense to hide this feature behind a flag
as {"a":1}{"b":2} is an invalid json document

Successfully merging this pull request may close these issues.

Scanner error: token too long