
large memory usage increase in streaming mode between 0.13.0 and 0.14.0 #365

Open
erikhansenwong opened this issue Nov 4, 2024 · 1 comment

Comments

@erikhansenwong

I recently upgraded from 0.13.0 to 0.14.2 and a job that uses xmltodict went from using about 2.5 GB to about 29 GB of memory. I was able to narrow the difference down to changes between 0.13.0 and 0.14.0.

Here are plots of the memory usage for the exact same code using the two different versions. (Plots created using memory-profiler==0.61.0)

[Plots: memory usage over time for the same code under the two versions]

The file being parsed is an 8.8 GB XML file.

I am unable to include the surrounding code or the raw XML, but the call I make to xmltodict.parse(...) is shown below:

        # "callback" is defined elsewhere in the surrounding (proprietary) code
        with open(xml_path, "rb") as file:
            try:
                xmltodict.parse(
                    file,
                    item_depth=3,
                    item_callback=callback,
                    xml_attribs=True,
                )
            except xmltodict.ParsingInterrupted:
                pass
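For reference, the same bounded-memory streaming pattern can be sketched with the stdlib's xml.etree.ElementTree.iterparse (a comparison sketch only, not part of the original job; the tag names here are illustrative):

```python
import io
import xml.etree.ElementTree as ET


def count_instruments(stream):
    """Stream-parse, counting <instrument> elements and freeing each after use."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "instrument":
            count += 1
            elem.clear()  # drop the finished subtree so memory stays bounded
    return count


xml = io.BytesIO(
    b"<security_master><payload>"
    b'<instrument id="0"><tag_0>A</tag_0></instrument>'
    b'<instrument id="1"><tag_0>B</tag_0></instrument>'
    b"</payload></security_master>"
)
print(count_instruments(xml))  # 2
```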

If this is not enough to go on then please let me know and I will try to produce a simple example with anonymized data.

@erikhansenwong
Author

I was able to create a standalone example without any proprietary code or data. Peak memory usage for this example is:

  • 48 MiB using xmltodict==0.13.0
  • 3768 MiB using xmltodict==0.14.0

import random

import xmltodict


class Counter:
    """item_callback that tallies the value of every non-attribute child tag
    of each <instrument> record."""

    def __init__(self):
        self.counts = {}

    def __call__(self, path, record):
        path_name, path_attribs = path[-1]
        if path_name == "instrument":
            for k, v in record.items():
                if k.startswith("@"):
                    continue  # skip attributes such as @id
                tag_counts = self.counts.get(k, {})
                value = v["#text"]
                tag_counts[value] = tag_counts.get(value, 0) + 1
                self.counts[k] = tag_counts
        return True  # keep parsing


def write_xml_file(path: str, items: int, tags: int, seed: int) -> None:
    """Write a synthetic XML file with `items` <instrument> records of `tags` tags each."""
    random.seed(seed)
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    with open(path, "w") as file:
        file.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        file.write('<security_master v="47">\n')
        file.write("  <header>\n")
        file.write("    <type>initialize</type>\n")
        file.write("    <timestamp>2024-05-02T01:15:05Z</timestamp>\n")
        file.write("  </header>\n")
        file.write("  <payload>\n")
        for i in range(items):
            file.write(f'    <instrument id="{i}">\n')
            for j in range(tags):
                file.write(
                    f'      <tag_{j} type="TYPE_{j}">{random.choice(alphabet)}</tag_{j}>\n'
                )
            file.write("    </instrument>\n")
        file.write("  </payload>\n")
        file.write("</security_master>\n")


def main():
    path = "example.xml"
    items = 1_000_000
    tags = 10
    seed = 0

    write_xml_file(path, items, tags, seed)

    counter = Counter()
    with open(path, "rb") as file:
        try:
            xmltodict.parse(
                file,
                item_depth=3,
                item_callback=counter,
                xml_attribs=True,
            )
        except xmltodict.ParsingInterrupted:
            pass

    print(counter.counts)


if __name__ == "__main__":
    main()
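As a quick cross-check that doesn't require memory-profiler, Python's built-in tracemalloc can report peak memory too (a sketch; tracemalloc counts only Python-level allocations, so its numbers will differ from the RSS-based plots below):

```python
import tracemalloc


def peak_traced_mib(fn, *args, **kwargs):
    """Run fn and return (result, peak Python-allocated memory in MiB)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


# Example: an 8 MiB allocation should show a traced peak of roughly 8 MiB.
buf, peak = peak_traced_mib(bytearray, 8 * 2**20)
print(f"peak ≈ {peak:.1f} MiB")
```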

Memory usage over time

The data for these plots comes from memory-profiler==0.61.0

https://pypi.org/project/memory-profiler/

[Plots: memory usage over time for the standalone example under 0.13.0 and 0.14.0]
