[BUG] XML UTF-8 with BOM fails #330

Open
Kochise opened this issue Jun 20, 2023 · 4 comments
Comments

Kochise commented Jun 20, 2023

You can reproduce this with any XML file that starts with a BOM:

D:\Pyenv310>xml22yaml -i "d:\Pyenv310\TEST\Alarms.xml" -o "d:\Pyenv310\TEST\Alarms.yaml"
Traceback (most recent call last):
  File "D:\Pyenv310\Python\Lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\Pyenv310\Python\Lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Pyenv310\Python\Scripts\xml22yaml.exe\__main__.py", line 7, in <module>
  File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\Pyenv310\Python\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "D:\Pyenv310\Python\lib\site-packages\yaplon\__main__.py", line 701, in xml2yaml
    reader.xml(
  File "D:\Pyenv310\Python\lib\site-packages\yaplon\reader.py", line 71, in xml
    obj = oxml.parse(input.read(), process_namespaces=namespaces)
  File "D:\Pyenv310\Python\lib\site-packages\xmltodict.py", line 378, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 1

Regards.

mpf82 commented Jun 20, 2023

You can specify the encoding in parse(); the default is utf-8.

IANA currently lists 250+ character encodings.

Python natively supports a subset of 109 encodings (plus some Python-specific encodings).

You cannot possibly expect xmltodict to know or to guess which one your input uses.

Kochise commented Jun 20, 2023

set "PYTHONIOENCODING=utf8"

xmltodict shouldn't care about the BOM.

Alarms.xml.txt
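
For context on the environment variable above: as documented for CPython, PYTHONIOENCODING only overrides the encoding of stdin, stdout, and stderr. Files opened with open() and no encoding= argument still use the locale's preferred encoding unless UTF-8 mode is enabled. A minimal sketch for checking both values on a given machine (the variable names are illustrative, not from the thread):

```python
import locale
import sys

# Encoding that open() falls back to when no encoding= argument is given.
open_default = locale.getpreferredencoding(False)

# Encoding of the standard streams; this is the value PYTHONIOENCODING overrides.
stdio_encoding = getattr(sys.stdout, "encoding", None)

print(f"open() default encoding: {open_default}")
print(f"stdout encoding:         {stdio_encoding}")
```

If the two values differ, a file read in text mode may be decoded with a codec other than UTF-8 even though the console is set to UTF-8.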

mpf82 commented Jun 20, 2023

It seems you're right: explicitly passing bytes with a BOM works just fine:

import xmltodict
xml = '''<?xml version="1.0"?><test>123</test>'''
xml = xml.encode("utf-8-sig")
out = xmltodict.parse(xml)
print(out) # {'test': '123'}

So maybe the error is somewhere else? Either the file has a different encoding, or the other libs you're using are modifying the string/bytes somehow.


Edit: these also work:

from io import BytesIO, StringIO

b = BytesIO(b'\xef\xbb\xbf<?xml version="1.0"?><test>123</test>')
print(xmltodict.parse(b.read()))

# Include the BOM bytes so that decoding with utf-8-sig actually strips something.
b = StringIO(b'\xef\xbb\xbf<?xml version="1.0"?><test>123</test>'.decode("utf-8-sig"))
print(xmltodict.parse(b.read()))
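
The examples above suggest the BOM itself is not the problem when the parser receives raw bytes or correctly decoded text. One plausible explanation for the traceback, not verified against the attached file, is that the file was read in text mode with a non-UTF-8 codec, so the three BOM bytes turned into stray characters ahead of the XML declaration. The sketch below uses only the standard-library expat parser (the one xmltodict calls) and assumes cp1252 as the wrong codec; parses_ok is a hypothetical helper:

```python
from xml.parsers import expat

BOM_XML = b'\xef\xbb\xbf<?xml version="1.0"?><test>123</test>'


def parses_ok(data):
    """Return True if expat accepts the input, False if it raises ExpatError."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except expat.ExpatError:
        return False
    return True


# Raw bytes with a BOM: expat recognises and skips the BOM itself.
bytes_ok = parses_ok(BOM_XML)

# Same bytes decoded with the wrong codec (assumed cp1252 here): the BOM
# becomes three ordinary characters placed before the XML declaration.
mojibake_ok = parses_ok(BOM_XML.decode("cp1252"))

# Decoding with utf-8-sig removes the BOM before the parser sees the text.
clean_ok = parses_ok(BOM_XML.decode("utf-8-sig"))

print(bytes_ok, mojibake_ok, clean_ok)
```

Only the mis-decoded variant is rejected, which would point at how the file is opened rather than at the parser.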

Kochise commented Jun 20, 2023

I'm just using https://github.com/twardoch/yaplon:

D:\Pyenv310>xml22yaml -i "d:\Pyenv310\TEST\Alarms.xml" -o "d:\Pyenv310\TEST\Alarms.yaml"

It fails here:

https://github.com/martinblech/xmltodict/blob/master/xmltodict.py#L378

when called from here:

https://github.com/twardoch/yaplon/blob/master/yaplon/reader.py#L71

The issue is probably somewhere around here:

https://github.com/martinblech/xmltodict/blob/master/xmltodict.py#L341
