Ignore character data in mixed content XML #10

nichtich · 2019-07-15T07:46:06Z

As discussed at #7, document-oriented XML requires another JSON serialization anyway. Supporting mixed content XML in the current JSON form is error-prone anyway. Even this simple case seems to be handled wrong (not the additional whitespace after x):

echo '{ "y": { "#text": "x", "z": "1" } }' | oq -o xml .
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <y>x    <z>1</z>
  </y>
</root>

The reason is whitespace handling in mixed content elements requires a more sophisticated algorithm (see this explanation).

Better ignore character data (#text) when there are child elements:

echo '{ "y": { "#text": "x", "z": "1" } }' | oq -o xml .
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <y>
    <z>1</z>
  </y>
</root>

The text was updated successfully, but these errors were encountered:

Blacksmoke16 · 2019-07-16T12:29:48Z

@nichtich This seems related to a bug in libxml2. https://bugzilla.gnome.org/show_bug.cgi?id=760160.

Blacksmoke16 · 2019-07-17T14:20:30Z

I checked with libxml2 devs and they said that this is the expected behavior as they do not alter whitespace.

Given I can't think of a way to replicate

<p>  The <emph> cat </emph> ate  the <foreign>grande croissant</foreign>. I didn't </p>

since #text only would allow defining the first The; I think this is worth talking about.

@nichtich Do you think its worth explicitly removing it, versus just leaving that up to the end user? The code to implement this change feels a bit hacky to me. I'll see if I can figure out a better solution.

nichtich · 2019-07-17T20:00:39Z

I checked with libxml2 devs and they said that this is the expected behavior as they do not alter whitespace.

It's a bug if you use libxml2 with indent value other than 0. This works:

echo '{ "y": { "#text": "x ", "z": "1" } }' | oq -o xml --indent 0 .
<?xml version="1.0" encoding="UTF-8"?>
<root><y>x <z>1</z></y></root>

You could ignore character data in mixed content XML unless indent is 0.

Blacksmoke16 · 2019-07-20T17:14:15Z

This would require some logic that would see if there are other keys that aren't #text or @*, and if so, don't emit the #text value's node.

But because of #18, I'm now not loading anything into memory and only reading one token a time; which makes this much trickier.

Blacksmoke16 added format:xml kind:bug An existing feature isn't doing something correctly status:todo labels Jul 15, 2019

Blacksmoke16 added status:help-wanted Accepting PRs and removed status:todo labels Oct 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ignore character data in mixed content XML #10

Ignore character data in mixed content XML #10

nichtich commented Jul 15, 2019

Blacksmoke16 commented Jul 16, 2019 •

edited

Loading

Blacksmoke16 commented Jul 17, 2019 •

edited

Loading

nichtich commented Jul 17, 2019

Blacksmoke16 commented Jul 20, 2019

Ignore character data in mixed content XML #10

Ignore character data in mixed content XML #10

Comments

nichtich commented Jul 15, 2019

Blacksmoke16 commented Jul 16, 2019 • edited Loading

Blacksmoke16 commented Jul 17, 2019 • edited Loading

nichtich commented Jul 17, 2019

Blacksmoke16 commented Jul 20, 2019

Blacksmoke16 commented Jul 16, 2019 •

edited

Loading

Blacksmoke16 commented Jul 17, 2019 •

edited

Loading