Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip newlines when parsing TextDicts to avoid OverflowError #540

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

NeoLegends
Copy link
Member

Closes #539

@NeoLegends NeoLegends self-assigned this Sep 3, 2024
@NeoLegends NeoLegends force-pushed the moritz-line-counter-overflow branch 2 times, most recently from e45ac8a to c646539 Compare September 3, 2024 08:26
@NeoLegends NeoLegends force-pushed the moritz-line-counter-overflow branch from c646539 to d0cf85e Compare September 3, 2024 08:30
@NeoLegends NeoLegends requested a review from Stefanwuu September 3, 2024 08:40
util.py Outdated

def parse_text_dict(path: Union[str, tk.Path]) -> Dict[str, str]:
"""
Loads the text dict at :param:`path` making sure not to trigger line counter overflow.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand this explanation?

  • Put a ref to the issue here: Parsing a large TextDict fails on recent python versions #539
  • Also quote the exception you got: OverflowError: line number table is too long
  • Explain the exception. What does it mean? Why does this function fix it? I don't really understand what the exception means. My guess would be that the dict becomes too big. But then this function here would have the same problem, because as I understand it, you intend to create just the same dict, just in a different way?

util.py Outdated
"""

with uopen(path, "rt") as text_dict_file:
txt = text_dict_file.read()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder a bit: as I understand it, your problem is that the dict is really huge (how huge? how many entries? how big is the file?). But then, isn't it a problem to load it all into memory? Also via the following code (strip, splitlines, etc), you are even creating multiple copies of the data in memory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is my understanding that the memory itself is not an issue, but rather that the file in my case is about 280k lines long, which is where python's line counter overflows for some reason. This is why I'm not worried about memory consumption itself in this module, but rather just the way the string is parsed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the Python line counter? There are no "lines" in a dict? Or you mean for Python source code?

But anyway, you should add this explanation to the code, so that you can understand why we have this code there.

util.py Outdated Show resolved Hide resolved
util.py Outdated
Comment on lines 402 to 404
# parse chunkwise to avoid line counter overflow when the text dict is very large
for chunk in chunks(lines, max(1, len(lines) // 1000))
for k, v in eval("\n".join(["{", *chunk, "}"]), {"nan": float("nan"), "inf": float("inf")}).items()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code does not work in all cases. E.g. consider a dict formatted as this:

{
"a":
42
}

(Yea, this probably will not happen when it was written with some of our tools, but who knows, maybe it comes from elsewhere, and/or there was some black on the generated output, or whatever.)
This should at least be discussed in a code comment.

Copy link
Member

@albertz albertz Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, in RETURNN, we use returnn.util.literal_py_to_pickle.literal_eval, which would solve this problem in another way, which should be faster and more generic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should at least be discussed in a code comment.

Yes, I'll add one.

Note, in RETURNN, we use returnn.util.literal_py_to_pickle.literal_eval, which would solve this problem in another way, which should be faster and more generic.

That implementation contains a native C++ module, I don't think this should be made part of i6_core.


Why did we end up with an implementation based on Python literals anyway, and why haven't we used a proper data format (e.g. JSON) from the get go for this? All this custom code would never have had to be written and the issue here would likely have never occured in the first place with a proper data exchange format.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

native C++ module

Well, you could argue, the Python stdlib is full of native code... I don't really see being "native" a problem here. The main problem is maybe that then RETURNN becomes a dependency for those specific jobs. That's maybe unwanted. But I'm not sure if it's really so much a problem. But then, maybe we can also use this code here for now and keep that in mind. (Maybe add it in a comment, that this is an alternative.)

Why did we end up with an implementation based on Python literals anyway, and why haven't we used a proper data format (e.g. JSON) from the get go for this?

My assumption was, it's faster and simpler. Those assumptions turned out to be wrong though, so yes, JSON would have been better.

Python literals are also more flexible and generic though, but in many cases (like here), this is not really needed, and/or you can work around that. But JSON is really restricted. I think it also does not support inf/nan, so it doesn't properly work for N-best lists (rarely, but sometimes we get inf in the score; it would be annoying if it then rarely crashes, and/or you need a stupid workaround just to clip it to 1e30 or so).

@NeoLegends NeoLegends changed the title Parse TextDicts chunkwise to avoid OverflowError Strip newlines when parsing TextDicts to avoid OverflowError Sep 3, 2024
@NeoLegends
Copy link
Member Author

I've noticed you can also simply strip any newline from the text dict before parsing, this also avoids the error. Newer python versions than 3.10 also don't seem to run into this error anymore.

@albertz
Copy link
Member

albertz commented Sep 3, 2024

I've noticed you can also simply strip any newline from the text dict before parsing,

But this can break the code even more, and even silently? The other complaint I had before would at least fail with an error if the chunking is incorrect. This stripping can corrupt the data silently without error. I definitely would not do that.

Example:

{
"a":
"""
a
b
c
"""
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parsing a large TextDict fails on recent python versions
2 participants