
JSON scalar texts in the stream are not captured properly #3

Open
00dani opened this issue Sep 27, 2016 · 2 comments

00dani commented Sep 27, 2016

It is valid for a JSON text to represent only a single scalar value, rather than an object or array - this is supported by Python's json module:

```python
>>> import json
>>> json.loads('true')
True
>>> json.loads('false')
False
>>> json.loads('"an example"')
'an example'
```

However, a stream containing such texts will not be split correctly by splitstream. The keywords true, false, and null are silently dropped, as are numeric literals:

```python
>>> import io; from splitstream import splitfile
>>> split_buf = lambda data: list(splitfile(io.BytesIO(data), format='json'))
>>> split_buf(b'true false null [5] null true false true false {"a": 6}')
[b'[5]', b'{"a": 6}']
>>> split_buf(b'4 5 6 7 []')
[b'[]']
```

Including a string literal produces different, but still incorrect, behaviour. If there are no objects or arrays in the stream, the text is again silently dropped; if an object or array occurs somewhere after the string, however, the entire stream up to and including that object or array is captured as a single buffer.

```python
>>> split_buf(b'"abc" 56 "def"')
[]
>>> split_buf(b'"abc" 56 "def" {} 3 4')
[b'"abc" 56 "def" {}']
>>> split_buf(b'"abc" 56 "def" {} 3 4 "5" 6 7 []')
[b'"abc" 56 "def" {}', b' 3 4 "5" 6 7 []']
```

Attempting to parse these buffers with json.loads, naturally, does not work.
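To make the failure concrete, here is what happens when one of those combined buffers (decoded to `str`) is handed to `json.loads`:

```python
import json

# One of the malformed buffers produced above: several concatenated
# top-level values do not form a single valid JSON text.
buf = '"abc" 56 "def" {}'
try:
    json.loads(buf)
except json.JSONDecodeError as exc:
    print('failed:', exc)  # complains about "Extra data" past the first value
```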

The correct behaviour would be to split the stream on every top-level JSON value, producing a separate buffer for each - in other words:

```python
>>> fixed_split_buf(b'true false null 1 "hello world" ["goodbye", "world"] {"a": 12, "b": [null]}')
[b'true', b'false', b'null', b'1', b'"hello world"', b'["goodbye", "world"]', b'{"a": 12, "b": [null]}']
```
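In the meantime, the desired splitting can be approximated in pure Python with the standard library's `json.JSONDecoder.raw_decode`. The `split_values` helper below is only an illustrative sketch, and it works on `str` rather than the `bytes` stream that `splitstream` consumes:

```python
import json

def split_values(data):
    """Yield each top-level JSON value in *data* as its source text."""
    decoder = json.JSONDecoder()
    idx = 0
    n = len(data)
    while idx < n:
        # Skip whitespace between values.
        while idx < n and data[idx].isspace():
            idx += 1
        if idx >= n:
            break
        # raw_decode parses one value and reports where it ended,
        # instead of insisting the whole string is a single document.
        _, end = decoder.raw_decode(data, idx)
        yield data[idx:end]
        idx = end

print(list(split_values('true false null 1 "hello world" [5] {"a": 6}')))
# ['true', 'false', 'null', '1', '"hello world"', '[5]', '{"a": 6}']
```

This loses `splitstream`'s incremental, file-like behaviour (the whole text must be in memory), so it is a workaround rather than a fix.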
rickardp (Owner) commented
I agree that ideally the stream should be split on every properly terminated (sub)document. This issue is currently documented as a known limitation (only arrays and objects are supported at the top level).

Currently the splitter uses a very basic lexer. Unfortunately, there is no quick fix that maintains the current level of performance.

Pull requests are always welcome :)


amcgregor commented Nov 11, 2021

@00dani I wrote a partial fragment tokenizer for Python statements and expressions as part of my template engine work on cinje. With one interesting edge case (null [5]) this appears to split as you would desire.

Admittedly, these are Python fragments parsed with Python's internal AST representation, not JSON (thus the oddity), but it conveys an approach, @rickardp 😉

```python
>>> splitexpr('True False None [5] None True False True False {"a": 6}')
['True', 'False', 'None [5]', 'None', 'True', 'False', 'True', 'False', '{"a": 6}']
```

Edit, to name this approach: "maximal syntactically valid substring matching". Note also that the JSON spellings (lower-case true, false, &c.) would tokenize just fine: they are valid Python symbols, even if they are not the correct names for those singletons. Further edit: this could implement parsing too, by invoking literal_eval across the isolated fragments. JSON is valid Python, after all. 😜
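The "maximal syntactically valid substring matching" idea can be sketched with the standard library alone. `split_exprs` below is a hypothetical, unoptimized stand-in for `splitexpr`: it greedily emits the longest whitespace-joined run of tokens that `ast.parse` accepts as a single expression (cinje's real tokenizer works differently, but the edge case reproduces):

```python
import ast

def split_exprs(text):
    """Greedily split *text* into maximal syntactically valid fragments."""
    tokens = text.split()
    fragments = []
    i = 0
    while i < len(tokens):
        best = None
        # Find the longest run tokens[i:k] that parses as one expression.
        for k in range(i + 1, len(tokens) + 1):
            candidate = ' '.join(tokens[i:k])
            try:
                ast.parse(candidate, mode='eval')
            except SyntaxError:
                continue
            best = k
        if best is None:
            i += 1  # skip a token that can never start an expression
            continue
        fragments.append(' '.join(tokens[i:best]))
        i = best
    return fragments

print(split_exprs('True False None [5] None True False True False {"a": 6}'))
# ['True', 'False', 'None [5]', 'None', 'True', 'False', 'True', 'False', '{"a": 6}']
```

The `None [5]` oddity falls out naturally: `None [5]` is a syntactically valid subscription expression, so the maximal match swallows both tokens. This sketch is O(n²) parses and tokenizes on whitespace, so it is an illustration of the approach, not a production splitter.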
