Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore character data in mixed content XML #10

Open
nichtich opened this issue Jul 15, 2019 · 4 comments
Open

Ignore character data in mixed content XML #10

nichtich opened this issue Jul 15, 2019 · 4 comments
Labels
format:xml kind:bug An existing feature isn't doing something correctly status:help-wanted Accepting PRs

Comments

@nichtich
Copy link
Contributor

As discussed at #7, document-oriented XML requires another JSON serialization anyway. Supporting mixed content XML in the current JSON form is error-prone anyway. Even this simple case seems to be handled wrong (not the additional whitespace after x):

echo '{ "y": { "#text": "x", "z": "1" } }' | oq -o xml .
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <y>x    <z>1</z>
  </y>
</root>

The reason is whitespace handling in mixed content elements requires a more sophisticated algorithm (see this explanation).

Better ignore character data (#text) when there are child elements:

echo '{ "y": { "#text": "x", "z": "1" } }' | oq -o xml .
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <y>
    <z>1</z>
  </y>
</root>
@Blacksmoke16 Blacksmoke16 added format:xml kind:bug An existing feature isn't doing something correctly status:todo labels Jul 15, 2019
@Blacksmoke16
Copy link
Owner

Blacksmoke16 commented Jul 16, 2019

@nichtich This seems related to a bug in libxml2. https://bugzilla.gnome.org/show_bug.cgi?id=760160.

@Blacksmoke16
Copy link
Owner

Blacksmoke16 commented Jul 17, 2019

I checked with libxml2 devs and they said that this is the expected behavior as they do not alter whitespace.

Given I can't think of a way to replicate

<p>  The <emph> cat </emph> ate  the <foreign>grande croissant</foreign>. I didn't </p>

since #text only would allow defining the first The; I think this is worth talking about.

@nichtich Do you think its worth explicitly removing it, versus just leaving that up to the end user? The code to implement this change feels a bit hacky to me. I'll see if I can figure out a better solution.

@nichtich
Copy link
Contributor Author

I checked with libxml2 devs and they said that this is the expected behavior as they do not alter whitespace.

It's a bug if you use libxml2 with indent value other than 0. This works:

echo '{ "y": { "#text": "x ", "z": "1" } }' | oq -o xml --indent 0 .
<?xml version="1.0" encoding="UTF-8"?>
<root><y>x <z>1</z></y></root>

You could ignore character data in mixed content XML unless indent is 0.

@Blacksmoke16
Copy link
Owner

This would require some logic that would see if there are other keys that aren't #text or @*, and if so, don't emit the #text value's node.

But because of #18, I'm now not loading anything into memory and only reading one token a time; which makes this much trickier.

@Blacksmoke16 Blacksmoke16 added status:help-wanted Accepting PRs and removed status:todo labels Oct 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
format:xml kind:bug An existing feature isn't doing something correctly status:help-wanted Accepting PRs
Projects
None yet
Development

No branches or pull requests

2 participants