Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-128131: Completely support random read access of uncompressed unencrypted files in ZipFile #128143

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
37 changes: 37 additions & 0 deletions Lib/test/test_zipfile/test_core.py
Original file line number Diff line number Diff line change
Expand Up @@ -3447,6 +3447,43 @@ def test_too_short(self):
self.assertEqual(
b"zzz", zipfile._Extra.strip(b"zzz", (self.ZIP64_EXTRA,)))

class StoredZipExtFileRandomReadTest(unittest.TestCase):
5ec1cff marked this conversation as resolved.
Show resolved Hide resolved
def test_random_read(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a docstring describing what this function is testing. Give an overview of what the problem is and what is being guaranteed by the test. Also, link to the reported issue, where the problem should be described in exquisite detail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this function seems too broad. It's not simply testing a random read, but it's also testing the case where it's following an optimized path for an uncompressed file. Consider test_stored_seek.

from _pyio import BytesIO
class StatIO(BytesIO):
def __init__(self):
super().__init__()
self.bytes_read = 0

def read(self, size=-1):
bs = super().read(size)
self.bytes_read += len(bs)
return bs
5ec1cff marked this conversation as resolved.
Show resolved Hide resolved

sio = StatIO()
# 20000 bytes
txt = b'0123456789' * 2000

with zipfile.ZipFile(sio, "w", compression=zipfile.ZIP_STORED) as zipf:
zipf.writestr("foo.txt", txt)

# check random seek and read on a file
with zipfile.ZipFile(sio, "r") as zipf:
with zipf.open("foo.txt", "r") as fp:
# seek length must be greater than ZipExtFile.MIN_READ_SIZE (4096)
5ec1cff marked this conversation as resolved.
Show resolved Hide resolved
# forward seek
old_count = sio.bytes_read
fp.seek(10002, os.SEEK_CUR)
arr = fp.read(100)
self.assertEqual(arr, txt[10002:10102])
self.assertLessEqual(sio.bytes_read - old_count, 4096, 'Redundant bytes were read during forward seek and read!')
5ec1cff marked this conversation as resolved.
Show resolved Hide resolved

# backward seek
old_count = sio.bytes_read
fp.seek(-5003, os.SEEK_CUR)
picnixz marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fp.seek(-5003, os.SEEK_CUR)
fp.seek(-backward_seek, os.SEEK_CUR)

arr = fp.read(100)
self.assertEqual(arr, txt[5099:5199])
self.assertLessEqual(sio.bytes_read - old_count, 4096, 'Redundant bytes were read during backward seek and read!')
5ec1cff marked this conversation as resolved.
Show resolved Hide resolved

5ec1cff marked this conversation as resolved.
Show resolved Hide resolved
if __name__ == "__main__":
unittest.main()
5 changes: 4 additions & 1 deletion Lib/zipfile/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -1162,13 +1162,16 @@ def seek(self, offset, whence=os.SEEK_SET):
self._offset = buff_offset
read_offset = 0
# Fast seek uncompressed unencrypted file
elif self._compress_type == ZIP_STORED and self._decrypter is None and read_offset > 0:
elif self._compress_type == ZIP_STORED and self._decrypter is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm uneasy about this change, especially because the next block is elif read_offset < 0. This change affects both read_offset == 0 and read_offset < 0. Is that what you intended? I get the feeling what you're really aiming to address is the condition where:

Suggested change
elif self._compress_type == ZIP_STORED and self._decrypter is None:
elif self._compress_type == ZIP_STORED and self._decrypter is None and read_offset >= 0:

This is why it's so important to provide a detailed description of the problem, how you discovered it, and what cases it affects.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, here we consider the case of read_offset < 0, and should exclude the case of read_offset = 0

# disable CRC checking after first seeking - it would be invalid
self._expected_crc = None
# seek actual file taking already buffered data into account
read_offset -= len(self._readbuffer) - self._offset
self._fileobj.seek(read_offset, os.SEEK_CUR)
self._left -= read_offset
self._compress_left -= read_offset
5ec1cff marked this conversation as resolved.
Show resolved Hide resolved
if self._eof and read_offset < 0:
self._eof = False
read_offset = 0
# flush read buffer
self._readbuffer = b''
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Completely support random access of uncompressed unencrypted :class:`!zipfile.ZipExtFile`.
5ec1cff marked this conversation as resolved.
Show resolved Hide resolved
Loading