Data corruption when reading from two files in parallel in an uncompressed .zip #127847
Here is a completely trivial test case:

```python
#!/usr/bin/env python3
from io import BytesIO
import zipfile

buf = BytesIO()
with zipfile.ZipFile(buf, mode='w', compression=0) as z:
    with z.open('a.txt', mode='w') as a:
        a.write(b'123')
    with z.open('b.txt', mode='w') as b:
        b.write(b'456')

with zipfile.ZipFile(buf) as z:
    with z.open('a.txt') as a, z.open('b.txt') as b:
        a.read(1)
        b.seek(1)
        data = b.read(1)
        assert data == b'5', data
```

When run with Python 3.9, it finishes successfully. When run with Python 3.12, you get:
Looks like it is getting confused about its position in the archive in the special case of seeking within an uncompressed entry. The code at cpython/Lib/zipfile/__init__.py, line 1170 (commit 8bbd379), should be corrected.

Did you want to make a test and add the fix?
Oh nice, thank you! Well, let me see if I can figure out how to run Python tests...
@danifus could you explain to me why changing the new position fixes this? Going through the code, it seemed as if calling seek on _fileobj would already land at the right place, so after the correction I don't see what changes. Can you explain the flaw in my understanding? I am a newbie trying to work through the CPython internals.
I have a similar kind of issue, where data was getting corrupted when reading files in parallel. The code below works fine with version 3.11 but not with 3.12 or 3.13. ERROR: Error -3 while decompressing data: invalid stored block lengths
@MercMayhem we continued the discussion here: #127856 (comment). @ankasani, this bug is specific to zip entries that aren't compressed, and the error you reported says there was an error decompressing, so I think your problem isn't related to this issue. I'm not sure about the async behaviour of zipfile, but it is probably easier to open the zip inside the function that processes each file.
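The suggestion above (opening the zip inside the per-file function) can be sketched as follows. This is a minimal sketch, not code from this thread; the name `process_file` and its arguments are assumptions based on the discussion:

```python
import zipfile

def process_file(archive_path, member_name):
    # Each call opens its own ZipFile (and its own underlying file
    # handle), so reads from different members never interleave on a
    # shared file position.
    with zipfile.ZipFile(archive_path) as z:
        return z.read(member_name)

# Usage sketch: map member names across a thread pool, e.g.
#   from concurrent.futures import ThreadPoolExecutor
#   with ThreadPoolExecutor() as pool:
#       results = list(pool.map(lambda m: process_file('archive.zip', m),
#                               member_names))
```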
@danifus Thanks for your help! However, if I open the ZIP file inside the process_file function, it could lead to higher memory usage: each task would open a new instance of the ZIP file, potentially loading multiple instances into memory simultaneously, especially if the ZIP contains many files. I'm looking for a solution that minimizes memory usage while still allowing fast, simultaneous execution, so do you have any other suggestions? Maybe I need to move this to a new thread since, as you mentioned, this is a different issue.
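One way to address the memory concern is a sketch like the following (not from this thread): ZipFile parses only the central directory on open, not the member data, so a per-worker instance mostly costs an open handle rather than a copy of the archive. A `threading.local` keeps it to one ZipFile per thread; the helper names are hypothetical, and the sketch assumes a single archive path per process:

```python
import threading
import zipfile

_local = threading.local()

def _thread_zip(path):
    # Lazily open one ZipFile per thread. Instances are never shared
    # across threads, and each stays open for the thread's lifetime.
    z = getattr(_local, 'zip', None)
    if z is None:
        z = zipfile.ZipFile(path)
        _local.zip = z
    return z

def read_member(path, name):
    # Read one member using this thread's private ZipFile.
    return _thread_zip(path).read(name)
```

With a `ThreadPoolExecutor` of N workers this opens at most N ZipFile objects no matter how many members are processed.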
gh-127847: Fix position in the special-cased zipfile seek (GH-127856) (#128226, cherry-picked from commit 7ed6c5c). Co-authored-by: Dima Ryazanov, blurb-it[bot], Peter Bierma, Jason R. Coombs.
PRs merged; closing.
gh-127847: Fix position in the special-cased zipfile seek (GH-127856) (#128225, cherry-picked from commit 7ed6c5c).
Bug report
Bug description:
I ran into a data corruption bug that seems to be triggered by interleaving reads/seeks from different files inside of an uncompressed zip file. As far as I can tell from the docs, this is allowed by zipfile. It works correctly in Python 3.7 and 3.9, but fails in 3.12. I'm attaching a somewhat convoluted testcase (still working on a simpler one). It parses a dBase IV database by reading records from a .dbf file and, for each record, reading a corresponding record from a .dbt file.
When run using Python 3.9, you will see a bunch of data printed out. When run using Python 3.12, you will get an exception: ValueError: Invalid dBase IV block: b'PK\x03\x04\n\x00\x00\x00'. That block does not appear in the input file at all. (Though, when tested with a larger input, I got a block of bytes that appeared in the wrong file.) For some context, here is a workaround I used in my project: I changed it to read the .dbf file first, then the .dbt.
Testcase:
Input file:
notams.zip
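The workaround described above (finish reading one member before opening the next) can be sketched as a small self-contained example. The member names and the 5-byte "header" are illustrative, not the real dBase layout:

```python
from io import BytesIO
import zipfile

# Build a small stored (uncompressed) archive with two members.
buf = BytesIO()
with zipfile.ZipFile(buf, mode='w', compression=zipfile.ZIP_STORED) as z:
    z.writestr('records.dbf', b'record-data')
    z.writestr('memos.dbt', b'memo-data')

with zipfile.ZipFile(buf) as z:
    with z.open('records.dbf') as dbf:   # finish the first member...
        records = dbf.read()
    with z.open('memos.dbt') as dbt:     # ...before opening the second
        dbt.read(5)                      # skip a prefix by reading, not seeking
        memo = dbt.read()

assert records == b'record-data'
assert memo == b'data'
```

Because only one ZipExtFile is open at a time (and the sketch skips bytes by reading rather than seeking), the interleaved-position problem never arises, at the cost of holding each member's data in memory while it is processed.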
CPython versions tested on:
3.9, 3.12
Operating systems tested on:
Linux
Linked PRs