
Data corruption when reading from two files in parallel in an uncompressed .zip #127847

Closed
dimaryaz opened this issue Dec 12, 2024 · 8 comments

Labels: 3.12 (bugs and security fixes), 3.13 (bugs and security fixes), 3.14 (new features, bugs and security fixes), stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments

dimaryaz (Contributor) commented Dec 12, 2024

Bug report

Bug description:

I ran into a data corruption bug that seems to be triggered by interleaving reads and seeks on different files inside an uncompressed zip file. As far as I can tell from the docs, this is allowed by zipfile. It works correctly in Python 3.7 and 3.9 but fails in 3.12.

I'm attaching a somewhat convoluted test case (I'm still working on a simpler one). It parses a dBase IV database by reading records from a .dbf file and, for each record, reading a corresponding record from a .dbt file.

When run using Python 3.9, you will see a bunch of data printed out. When run using Python 3.12, you will get an exception: ValueError: Invalid dBase IV block: b'PK\x03\x04\n\x00\x00\x00'. That block does not appear in the input file at all; b'PK\x03\x04' is the zip local file header signature, so the read evidently landed on another entry's header within the archive. (Though, when tested with a larger input, I got a block of bytes that belonged to the wrong file.)

For some context, here is a workaround I used in my project: I changed it to read the .dbf file first, then the .dbt.
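For illustration, a minimal sketch of that workaround (it reuses ZIP_PATH and the DbfFile/DbtFile helpers from the test case below; record[3] holds the memo block index, as in main() below; the two-pass structure is the point):

def main_two_pass():
    with zipfile.ZipFile(ZIP_PATH) as z:
        # First pass: read every .dbf record before touching the .dbt,
        # so the two member handles never interleave reads/seeks.
        with z.open('notams.dbf') as dbf_in:
            dbf_header, dbf_fields = DbfFile.read_header(dbf_in)
            records = [DbfFile.read_record(dbf_in, dbf_fields)
                       for _ in range(dbf_header.num_records)]

        # Second pass: read the corresponding .dbt memo blocks.
        with z.open('notams.dbt') as dbt_in:
            dbt_header = DbtFile.read_header(dbt_in)
            for record in records:
                print(record)
                print(DbtFile.read_record(dbt_in, dbt_header, record[3]))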

Test case:

#!/usr/bin/env python3

import datetime
import pathlib
import struct
import zipfile

from dataclasses import dataclass
from typing import Any, BinaryIO, List, Tuple


ZIP_PATH = pathlib.Path(__file__).parent / 'notams.zip'


@dataclass
class DbfHeader:
    SIZE = 32
    VERSION = 3

    info: int
    last_update: datetime.date
    num_records: int
    header_bytes: int
    record_bytes: int

    @classmethod
    def from_bytes(cls, data: bytes):
        # Layout '<4BIHH20x': info byte, last-update Y/M/D bytes, record
        # count (u32), header size (u16), record size (u16), 20 pad bytes.
        info, year, month, day, num_records, header_bytes, record_bytes = struct.unpack('<4BIHH20x', data)
        version = info & 0x3
        if version != cls.VERSION:
            raise ValueError(f"Unsupported DBF version: {version}")

        return cls(info, datetime.date(year + 1900, month, day), num_records, header_bytes, record_bytes)


@dataclass
class DbfField:
    SIZE = 32

    name: str
    type: str
    length: int

    @classmethod
    def from_bytes(cls, data: bytes):
        name, typ, length = struct.unpack('<11sc4xB15x', data)
        return cls(name.rstrip(b'\x00').decode(), typ.decode(), length)


class DbfFile:
    @classmethod
    def read_header(cls, fd: BinaryIO) -> Tuple[DbfHeader, List[DbfField]]:
        header = DbfHeader.from_bytes(fd.read(DbfHeader.SIZE))
        num_fields = (header.header_bytes - 33) // 32  # 32-byte header + N 32-byte field descriptors + 1-byte terminator
        fields = [DbfField.from_bytes(fd.read(DbfField.SIZE)) for _ in range(num_fields)]
        if fd.read(1) != b'\x0D':
            raise ValueError("Missing array terminator")
        return header, fields

    @classmethod
    def read_record(cls, fd: BinaryIO, fields: List[DbfField]) -> List[Any]:
        fd.read(1)  # skip the record's deletion-flag byte

        values = []
        for field in fields:
            data = fd.read(field.length).decode('latin-1').strip(' ')
            if field.type == 'C':
                value = data
            elif field.type == 'D':
                s = data.strip(' ')
                if s:
                    value = datetime.datetime.strptime(data, '%Y%m%d').date()
                else:
                    value = None
            elif field.type == 'L':
                if len(data) != 1:
                    raise ValueError(f"Incorrect length: {data!r}")
                if data in 'YyTt':
                    value = True
                elif data in 'NnFf':
                    value = False
                elif data == '?':
                    value = None
                else:
                    raise ValueError(f"Incorrect boolean: {data!r}")
            elif field.type in ('M', 'N'):
                value = int(data) if data else None
            else:
                raise ValueError(f"Unsupported field: {field.type}")
            values.append(value)
        return values


@dataclass
class DbtHeader:
    SIZE = 512

    next_free_block: int
    dbf_filename: str
    reserved: int
    block_length: int

    @classmethod
    def from_bytes(cls, data: bytes):
        next_free_block, dbf_filename, reserved, block_length = struct.unpack('<I4x8sIH490x', data)
        return cls(next_free_block, dbf_filename.decode('latin-1'), reserved, block_length)


class DbtFile:
    DBT3_BLOCK_SIZE = 512
    DBT4_BLOCK_START = b'\xFF\xFF\x08\x00'

    @classmethod
    def read_header(cls, fd: BinaryIO) -> DbtHeader:
        fd.seek(0)
        block = fd.read(DbtHeader.SIZE)
        return DbtHeader.from_bytes(block)

    @classmethod
    def read_record(cls, fd: BinaryIO, header: DbtHeader, idx: int) -> str:
        fd.seek(header.block_length * idx)  # memo blocks are fixed-size; idx comes from the DBF record
        block_start = fd.read(8)
        if block_start[0:4] != cls.DBT4_BLOCK_START:
            raise ValueError(f"Invalid dBase IV block: {block_start}")
        length = int.from_bytes(block_start[4:8], 'little')
        data = fd.read(length - len(block_start))

        return data.decode('latin-1')


def main():
    with zipfile.ZipFile(ZIP_PATH) as z:
        with z.open('notams.dbf') as dbf_in, z.open('notams.dbt') as dbt_in:
            dbf_header, dbf_fields = DbfFile.read_header(dbf_in)
            dbt_header = DbtFile.read_header(dbt_in)

            for _ in range(dbf_header.num_records):
                record = DbfFile.read_record(dbf_in, dbf_fields)
                print(record)
                memo = DbtFile.read_record(dbt_in, dbt_header, record[3])
                print(memo)


if __name__ == '__main__':
    main()

Input file:
notams.zip

CPython versions tested on:

3.9, 3.12

Operating systems tested on:

Linux

Linked PRs

dimaryaz added the type-bug (An unexpected behavior, bug, or error) label Dec 12, 2024
dimaryaz (Contributor, Author) commented:

Here is a completely trivial test case:

#!/usr/bin/env python3

from io import BytesIO
import zipfile

buf = BytesIO()

with zipfile.ZipFile(buf, mode='w', compression=0) as z:  # 0 == zipfile.ZIP_STORED
    with z.open('a.txt', mode='w') as a:
        a.write(b'123')
    with z.open('b.txt', mode='w') as b:
        b.write(b'456')


with zipfile.ZipFile(buf) as z:
    with z.open('a.txt') as a, z.open('b.txt') as b:
        a.read(1)

        b.seek(1)
        data = b.read(1)
        assert data == b'5', data

When run with Python 3.9, it finishes successfully. When run with Python 3.12, you get:

Traceback (most recent call last):
  File "/home/dima/src/zipfile_bug/./test.py", line 22, in <module>
    assert data == b'5', data
           ^^^^^^^^^^^^
AssertionError: b'K'

(The stray b'K' is presumably the second byte of the b'PK\x03\x04' local file header of the next archive entry: the relative seek was applied from wherever the other handle had left the shared file position.)

danifus (Contributor) commented Dec 12, 2024

Looks like it is getting confused about its position in the archive in the special case of seeking in a ZIP_STORED entry:

self._fileobj.seek(read_offset, os.SEEK_CUR)

should be:

self._fileobj.seek(self._orig_compress_start + read_offset)
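To illustrate the failure mode, here is a toy model (not the actual zipfile internals, just the seek arithmetic): two readers sharing one stream must seek from their own absolute start, because a relative seek trusts wherever the shared pointer happens to be.

from io import BytesIO

shared = BytesIO(b'AAAAA' + b'BBBBB')  # two stored "members" in one stream

class Reader:
    def __init__(self, fileobj, start):
        self.fileobj = fileobj
        self.start = start  # analogous to _orig_compress_start

    def read_at_absolute(self, offset, n):
        # Fixed behaviour: seek from this member's own start.
        self.fileobj.seek(self.start + offset)
        return self.fileobj.read(n)

    def read_at_relative(self, offset, n):
        # Buggy behaviour: seek relative to the shared position,
        # which another reader may have moved in the meantime.
        self.fileobj.seek(offset, 1)  # 1 == os.SEEK_CUR
        return self.fileobj.read(n)

a = Reader(shared, 0)
b = Reader(shared, 5)
a.read_at_absolute(0, 1)         # moves the shared pointer into member A
print(b.read_at_relative(1, 1))  # b'A' -- bytes from the wrong member
print(b.read_at_absolute(1, 1))  # b'B' -- correct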

Did you want to make a test and add the fix?

dimaryaz (Contributor, Author) commented:

Oh nice, thank you!

Well, let me see if I can figure out how to run Python tests...

MercMayhem commented:

@danifus could you explain why changing the new position to the value of self._orig_compress_start + read_offset fixes this issue?

Going through the code, calling seek on _fileobj with read_offset and os.SEEK_CUR as arguments seemed sound to my limited understanding: since we have loaded the buffer into the ZipExtFile object, the pointer in the underlying _SharedFile points to the position of the last byte of the buffer within the shared file.

Hence, after correcting read_offset by subtracting len(self._readbuffer) - self._offset and calling seek with os.SEEK_CUR, shouldn't we land on the correct byte?

Can you explain the flaw in my understanding? I am a newbie trying to work through the CPython internals.

ankasani commented Dec 13, 2024

I have a similar kind of issue, where data gets corrupted when reading files in parallel. The code below works fine with 3.11 but fails with 3.12 and 3.13.

ERROR: Error -3 while decompressing data: invalid stored block lengths

import zipfile
import asyncio

async def process_file(text_file_name: str, zip_file: zipfile.ZipFile):
    try:
        with zip_file.open(text_file_name, mode='r') as text_file:
            try:
                content = await asyncio.to_thread(text_file.read)
                lines = content.decode('utf-8').splitlines()
            except UnicodeDecodeError as e:
                print(f"Error decoding file {text_file_name}: {e}")
                return None
            except Exception as e:
                print(f"Error reading file {text_file_name}: {e}")
                return None
            if not lines:
                return None
            # Process lines here
            return lines
    except Exception as e:
        print(f"Error opening file {text_file_name}: {e}")
        return None

async def main():
    temp_file_path = 'Tests.zip'
    zip_file = zipfile.ZipFile(temp_file_path, 'r')
    tasks = [process_file(text_file_name, zip_file) for text_file_name in zip_file.namelist()]
    await asyncio.gather(*tasks)

asyncio.run(main())

danifus (Contributor) commented Dec 13, 2024

@MercMayhem we continued the discussion here: #127856 (comment). The problem is more likely in _SharedFile.seek().

@ankasani this bug is specific to zipped entries that aren't compressed, and the error you reported says there is an error decompressing, so I think your problem isn't related to this current issue. I'm not sure about the async behaviour of zipfile, but it is probably easier to open the zip inside process_file instead (or as well).
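A hedged sketch of that restructuring (the path and helper shape are carried over from the example above; this is an assumption about what avoids the shared handle, not a verified fix):

import asyncio
import zipfile

async def process_file(zip_path: str, member: str):
    def _read() -> bytes:
        # Each worker opens its own ZipFile, so no file handle (and no
        # seek position) is shared between concurrently running tasks.
        with zipfile.ZipFile(zip_path) as zf:
            with zf.open(member) as f:
                return f.read()
    content = await asyncio.to_thread(_read)
    return content.decode('utf-8').splitlines()

async def main():
    zip_path = 'Tests.zip'
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.namelist()
    await asyncio.gather(*(process_file(zip_path, m) for m in members))

asyncio.run(main())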

picnixz added the stdlib (Python modules in the Lib dir) label Dec 13, 2024
picnixz added the 3.12 (bugs and security fixes), 3.13 (bugs and security fixes), and 3.14 (new features, bugs and security fixes) labels Dec 13, 2024
ankasani commented Dec 14, 2024

@danifus Thanks for your help! However, if I open the ZIP file inside the process_file function, it could lead to higher memory usage: each task would open a new instance of the ZIP file, potentially loading multiple instances into memory simultaneously, especially if the ZIP contains many files. I'm looking for a solution that minimizes memory usage while still allowing fast, simultaneous execution. Do you have any other suggestions?

Maybe I need to move this to a new thread, since, as you mentioned, this is a different issue.
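(For completeness, a sketch of one memory-conscious alternative that was not proposed in the thread: keep the single ZipFile but serialize access with a lock, trading concurrent reads for a single open handle.)

import asyncio
import zipfile

async def process_file(member: str, zf: zipfile.ZipFile, lock: asyncio.Lock):
    # One ZipFile, one underlying handle: the lock guarantees that reads
    # never interleave, at the cost of reading members one at a time.
    async with lock:
        content = await asyncio.to_thread(zf.read, member)
    return content.decode('utf-8').splitlines()

async def main():
    lock = asyncio.Lock()
    with zipfile.ZipFile('Tests.zip') as zf:
        await asyncio.gather(*(process_file(m, zf, lock)
                               for m in zf.namelist()))

asyncio.run(main())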

dimaryaz added a commit to dimaryaz/cpython that referenced this issue Dec 14, 2024
jaraco added a commit that referenced this issue Dec 24, 2024
miss-islington pushed two commits to miss-islington/cpython that referenced this issue Dec 24, 2024 (cherry picked from commit 7ed6c5c)
jaraco added a commit that referenced this issue Dec 24, 2024
gh-127847: Fix position in the special-cased zipfile seek (GH-127856) (#128226)
jaraco (Member) commented Dec 24, 2024

PRs merged; closing.

jaraco closed this as completed Dec 24, 2024
jaraco added a commit that referenced this issue Dec 24, 2024
gh-127847: Fix position in the special-cased zipfile seek (GH-127856) (#128225) (cherry picked from commit 7ed6c5c)
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Jan 8, 2025