Parsing and serialization of control characters #103

Open
j-moeller opened this issue Jul 3, 2024 · 5 comments

@j-moeller

Hello,

We found that json.h parses and serializes raw control characters below 0x20, which technically violates the JSON grammar. We collected a minimal working example here.

@sheredom
Owner

sheredom commented Jul 3, 2024

Cannot see the sample (it 404's for me). Can you point me at the offending JSON grammar language that the lib is violating by any chance? Happy to have this fixed, but just wanna know where it says!

@j-moeller
Author

Sorry, the repository was still set to "private". It should be public now.

I am referencing Section 7 "Strings" of RFC 8259 (https://datatracker.ietf.org/doc/html/rfc8259):

All Unicode characters may be placed within the quotation marks, except for the characters that MUST be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

If my understanding of this is correct, the control characters U+0000 through U+001F must be written as "\u0000" - "\u001f" to be valid JSON (except for U+0008, U+000C, U+000A, U+000D, and U+0009, which may also be written as "\b", "\f", "\n", "\r", and "\t").

json.h behaves as follows:

These raw (unescaped) bytes inside a string are expected to return parsing errors, but json.h accepts them:

  • 0x01 - 0x07 are parsed and serialized back to 0x01 - 0x07
  • 0x08 is parsed and serialized back to "\u0008"
  • 0x0b is parsed and serialized back to 0x0b
  • 0x0c is parsed and serialized back to "\u000c"
  • 0x0e - 0x1f are parsed and serialized back to 0x0e - 0x1f
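
For illustration, a strict parser could reject these inputs with a check along the following lines. This is only a minimal sketch: `find_raw_control_char` is a hypothetical helper, not a function in json.h, and it assumes the caller passes the bytes between the opening and closing quotation marks of a string.

```c
#include <stddef.h>

/* Hypothetical helper, not part of json.h.
 * Scans the body of a JSON string (the bytes between the quotes) and
 * returns the offset of the first raw control byte (0x00-0x1F), or -1
 * if the body contains none. */
static long find_raw_control_char(const char *body, size_t len) {
    size_t i;
    for (i = 0; i < len; i++) {
        /* Cast to unsigned char so bytes >= 0x80 (UTF-8 continuation
         * bytes, for example) are never treated as negative values. */
        if ((unsigned char)body[i] < 0x20) {
            return (long)i;
        }
    }
    return -1;
}
```

An escaped sequence such as the six characters `\u0001` passes this check, because only the raw byte 0x01 is below 0x20.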

These escaped inputs are expected to be parsed and serialized back as "\u00xx", but json.h writes them out as raw bytes:

  • "\u0001" - "\u0007" are parsed and serialized back to 0x01 - 0x07
  • "\u000b" is parsed and serialized back to 0x0b
  • "\u000e" - "\u001f" are parsed and serialized back to 0x0e - 0x1f
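
On the serialization side, the escaping rule from Section 7 could be sketched as follows. This is an assumption about what a fix might look like, not the actual json.h writer: `escape_json_string` is a hypothetical name, and the choice to emit the short forms for the five characters that have them (rather than "\u00xx") is one permitted option among several.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of json.h.
 * Writes the bytes of src into dst as the body of a JSON string,
 * escaping the quotation mark, the reverse solidus, and every control
 * character below 0x20. dst is NUL-terminated on success.
 * Returns 0 on success, or -1 if dst (capacity cap) is too small. */
static int escape_json_string(const char *src, size_t len,
                              char *dst, size_t cap) {
    size_t i, out = 0;
    if (cap == 0) {
        return -1;
    }
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        char buf[7]; /* longest escape is "\uXXXX" plus the NUL */
        const char *piece = buf;
        size_t n;

        switch (c) {
        case '"':  piece = "\\\""; break;
        case '\\': piece = "\\\\"; break;
        case '\b': piece = "\\b";  break;
        case '\f': piece = "\\f";  break;
        case '\n': piece = "\\n";  break;
        case '\r': piece = "\\r";  break;
        case '\t': piece = "\\t";  break;
        default:
            if (c < 0x20) {
                /* No short form exists: use the generic \u00xx escape. */
                snprintf(buf, sizeof buf, "\\u%04x", (unsigned)c);
            } else {
                /* Ordinary byte: copy through unchanged. */
                buf[0] = (char)c;
                buf[1] = '\0';
            }
            break;
        }
        n = strlen(piece);
        if (out + n + 1 > cap) {
            return -1; /* no room for this piece plus the final NUL */
        }
        memcpy(dst + out, piece, n);
        out += n;
    }
    dst[out] = '\0';
    return 0;
}
```

With this sketch, a string holding the single byte 0x0b would be written as the six characters `\u000b` instead of the raw byte.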

Note that because Section 9 "Parsers" also exists, json.h is technically still adhering to the specification. So feel free to decide on the correct way to handle this.

A JSON parser MAY accept non-JSON forms or extensions.

@sheredom
Owner

sheredom commented Jul 4, 2024

Nice summary, thanks! I think we'll fix this - seems worthwhile to err on the side of caution here.

I can take this change up if you wish, but happy to accept a PR if you'd rather do the coding!

@j-moeller
Author

Hi, sorry for the late reply. Unfortunately, I am not that familiar with the code base, so I think it would be better if you implemented the necessary changes.

@sheredom
Owner

Totally fine. When I get the time I'll look into it.
