
large memory usage increase in streaming mode between 0.13.0 and 0.14.0 #365

Open
erikhansenwong opened this issue Nov 4, 2024 · 1 comment

Comments

@erikhansenwong

I recently upgraded from 0.13.0 to 0.14.2 and a job that uses xmltodict went from using about 2.5 GB to about 29 GB of memory. I was able to narrow the difference down to changes between 0.13.0 and 0.14.0.

Here are plots of the memory usage for the exact same code using the two different versions. (Plots created using memory-profiler==0.61.0)

[Plots: memory usage over time for the same code under the two versions]

The file being parsed is an 8.8 GB XML file.

I am unable to include the surrounding code or the raw XML, but the call I make to xmltodict.parse(...) is shown below:

        # "callback" is defined elsewhere in the surrounding (proprietary) code
        with open(xml_path, "rb") as file:
            try:
                xmltodict.parse(
                    file,
                    item_depth=3,
                    item_callback=callback,
                    xml_attribs=True,
                )
            except xmltodict.ParsingInterrupted:
                pass
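For reference, the same bounded-memory streaming pattern can be sketched with the stdlib's xml.etree.ElementTree.iterparse (a comparison sketch only, not part of the original job; the tag names here are illustrative):

```python
import io
import xml.etree.ElementTree as ET


def count_instruments(stream):
    """Stream-parse, counting <instrument> elements and freeing each after use."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "instrument":
            count += 1
            elem.clear()  # drop the finished subtree so memory stays bounded
    return count


xml = io.BytesIO(
    b"<security_master><payload>"
    b'<instrument id="0"><tag_0>A</tag_0></instrument>'
    b'<instrument id="1"><tag_0>B</tag_0></instrument>'
    b"</payload></security_master>"
)
print(count_instruments(xml))  # 2
```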

If this is not enough to go on then please let me know and I will try to produce a simple example with anonymized data.

@erikhansenwong
Author

I was able to create a standalone example without any proprietary code or data. Peak memory usage for this example is:

  • 48 MiB using xmltodict==0.13.0
  • 3768 MiB using xmltodict==0.14.0

import random

import xmltodict


class Counter:
    """item_callback that tallies the value of every non-attribute child tag
    of each <instrument> record."""

    def __init__(self):
        self.counts = {}

    def __call__(self, path, record):
        path_name, path_attribs = path[-1]
        if path_name == "instrument":
            for k, v in record.items():
                if k.startswith("@"):
                    continue  # skip attributes such as @id
                tag_counts = self.counts.get(k, {})
                value = v["#text"]
                tag_counts[value] = tag_counts.get(value, 0) + 1
                self.counts[k] = tag_counts
        return True  # keep parsing


def write_xml_file(path: str, items: int, tags: int, seed: int) -> None:
    """Write a synthetic XML file with `items` <instrument> records of `tags` tags each."""
    random.seed(seed)
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    with open(path, "w") as file:
        file.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        file.write('<security_master v="47">\n')
        file.write("  <header>\n")
        file.write("    <type>initialize</type>\n")
        file.write("    <timestamp>2024-05-02T01:15:05Z</timestamp>\n")
        file.write("  </header>\n")
        file.write("  <payload>\n")
        for i in range(items):
            file.write(f'    <instrument id="{i}">\n')
            for j in range(tags):
                file.write(
                    f'      <tag_{j} type="TYPE_{j}">{random.choice(alphabet)}</tag_{j}>\n'
                )
            file.write("    </instrument>\n")
        file.write("  </payload>\n")
        file.write("</security_master>\n")


def main():
    path = "example.xml"
    items = 1_000_000
    tags = 10
    seed = 0

    write_xml_file(path, items, tags, seed)

    counter = Counter()
    with open(path, "rb") as file:
        try:
            xmltodict.parse(
                file,
                item_depth=3,
                item_callback=counter,
                xml_attribs=True,
            )
        except xmltodict.ParsingInterrupted:
            pass

    print(counter.counts)


if __name__ == "__main__":
    main()
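As a quick cross-check that doesn't require memory-profiler, Python's built-in tracemalloc can report peak memory too (a sketch; tracemalloc counts only Python-level allocations, so its numbers will differ from the RSS-based plots below):

```python
import tracemalloc


def peak_traced_mib(fn, *args, **kwargs):
    """Run fn and return (result, peak Python-allocated memory in MiB)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


# Example: an 8 MiB allocation should show a traced peak of roughly 8 MiB.
buf, peak = peak_traced_mib(bytearray, 8 * 2**20)
print(f"peak ≈ {peak:.1f} MiB")
```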

Memory usage over time

The data for these plots comes from memory-profiler==0.61.0

https://pypi.org/project/memory-profiler/

[Plots: memory usage over time for the standalone example under 0.13.0 and 0.14.0]
