
JSON scalar texts in the stream are not captured properly #3

Open
00dani opened this issue Sep 27, 2016 · 2 comments

00dani commented Sep 27, 2016

It is valid for a JSON text to represent only a single scalar value, rather than an object or array - this is supported by Python's json module:

```python
>>> import json
>>> json.loads('true')
True
>>> json.loads('false')
False
>>> json.loads('"an example"')
'an example'
```

However, a stream containing such texts will not be split correctly by splitstream. The keywords true, false, and null are silently dropped, as are numeric literals:

```python
>>> import io; from splitstream import splitfile
>>> split_buf = lambda data: list(splitfile(io.BytesIO(data), format='json'))
>>> split_buf(b'true false null [5] null true false true false {"a": 6}')
[b'[5]', b'{"a": 6}']
>>> split_buf(b'4 5 6 7 []')
[b'[]']
```

Including a string literal produces different, but still incorrect, behaviour. If there are no objects or arrays in the stream, the text is again silently dropped; if an object or array occurs somewhere after the string, however, the entire stream up to and including that object or array is captured as a single buffer.

```python
>>> split_buf(b'"abc" 56 "def"')
[]
>>> split_buf(b'"abc" 56 "def" {} 3 4')
[b'"abc" 56 "def" {}']
>>> split_buf(b'"abc" 56 "def" {} 3 4 "5" 6 7 []')
[b'"abc" 56 "def" {}', b' 3 4 "5" 6 7 []']
```

Attempting to parse these buffers with json.loads, naturally, does not work.
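To make the failure concrete, here is what happens when one of those combined buffers (decoded to `str`) is handed to `json.loads`:

```python
import json

# One of the malformed buffers produced above: several concatenated
# top-level values do not form a single valid JSON text.
buf = '"abc" 56 "def" {}'
try:
    json.loads(buf)
except json.JSONDecodeError as exc:
    print('failed:', exc)  # complains about "Extra data" past the first value
```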

The correct behaviour would be to split the stream on every top-level JSON value, producing a separate buffer for each - in other words:

```python
>>> fixed_split_buf(b'true false null 1 "hello world" ["goodbye", "world"] {"a": 12, "b": [null]}')
[b'true', b'false', b'null', b'1', b'"hello world"', b'["goodbye", "world"]', b'{"a": 12, "b": [null]}']
```
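In the meantime, the desired splitting can be approximated in pure Python with the standard library's `json.JSONDecoder.raw_decode`. The `split_values` helper below is only an illustrative sketch, and it works on `str` rather than the `bytes` stream that `splitstream` consumes:

```python
import json

def split_values(data):
    """Yield each top-level JSON value in *data* as its source text."""
    decoder = json.JSONDecoder()
    idx = 0
    n = len(data)
    while idx < n:
        # Skip whitespace between values.
        while idx < n and data[idx].isspace():
            idx += 1
        if idx >= n:
            break
        # raw_decode parses one value and reports where it ended,
        # instead of insisting the whole string is a single document.
        _, end = decoder.raw_decode(data, idx)
        yield data[idx:end]
        idx = end

print(list(split_values('true false null 1 "hello world" [5] {"a": 6}')))
# ['true', 'false', 'null', '1', '"hello world"', '[5]', '{"a": 6}']
```

This loses `splitstream`'s incremental, file-like behaviour (the whole text must be in memory), so it is a workaround rather than a fix.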
rickardp (Owner) commented
I agree that ideally the stream should be split on every properly terminated (sub)document. This issue is currently documented as a known limitation (only arrays and objects are supported at the top level).

Currently the splitter uses a very basic lexer. Unfortunately, there is no quick fix that maintains the current level of performance.

Pull requests are always welcome :)


amcgregor commented Nov 11, 2021

@00dani I wrote a partial fragment tokenizer for Python statements and expressions as part of my template engine work on cinje. With one interesting edge case (null [5]) this appears to split as you would desire.

Admittedly, these are Python fragments parsed with Python's internal AST representation, not JSON (thus the oddity), but it conveys an approach, @rickardp 😉

```python
>>> splitexpr('True False None [5] None True False True False {"a": 6}')
['True', 'False', 'None [5]', 'None', 'True', 'False', 'True', 'False', '{"a": 6}']
```

Edit, to name this approach: "maximal syntactically valid substring matching". Note also that the JSON spellings (lower-case true, false, &c.) would tokenize just fine: they are valid Python symbols, even if they are not the correct names for those singletons. Further edit: this could implement parsing too, by invoking literal_eval across the isolated fragments. JSON is valid Python, after all. 😜
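The "maximal syntactically valid substring matching" idea can be sketched with the standard library alone. `split_exprs` below is a hypothetical, unoptimized stand-in for `splitexpr`: it greedily emits the longest whitespace-joined run of tokens that `ast.parse` accepts as a single expression (cinje's real tokenizer works differently, but the edge case reproduces):

```python
import ast

def split_exprs(text):
    """Greedily split *text* into maximal syntactically valid fragments."""
    tokens = text.split()
    fragments = []
    i = 0
    while i < len(tokens):
        best = None
        # Find the longest run tokens[i:k] that parses as one expression.
        for k in range(i + 1, len(tokens) + 1):
            candidate = ' '.join(tokens[i:k])
            try:
                ast.parse(candidate, mode='eval')
            except SyntaxError:
                continue
            best = k
        if best is None:
            i += 1  # skip a token that can never start an expression
            continue
        fragments.append(' '.join(tokens[i:best]))
        i = best
    return fragments

print(split_exprs('True False None [5] None True False True False {"a": 6}'))
# ['True', 'False', 'None [5]', 'None', 'True', 'False', 'True', 'False', '{"a": 6}']
```

The `None [5]` oddity falls out naturally: `None [5]` is a syntactically valid subscription expression, so the maximal match swallows both tokens. This sketch is O(n²) parses and tokenizes on whitespace, so it is an illustration of the approach, not a production splitter.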
