
Data corruption when reading from two files in parallel in an uncompressed .zip #127847

Closed
dimaryaz opened this issue Dec 12, 2024 · 8 comments

Labels: 3.12 (bugs and security fixes), 3.13 (bugs and security fixes), 3.14 (new features, bugs and security fixes), stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments

dimaryaz (Contributor) commented Dec 12, 2024

Bug report

Bug description:

I ran into a data corruption bug that seems to be triggered by interleaving reads and seeks on different files inside an uncompressed zip file. As far as I can tell from the docs, this is allowed by zipfile. It works correctly in Python 3.7 and 3.9 but fails in 3.12.

I'm attaching a somewhat convoluted test case (I'm still working on a simpler one). It parses a dBase IV database by reading records from a .dbf file and, for each record, reading a corresponding record from a .dbt file.

When run using Python 3.9, you will see a bunch of data printed out. When run using Python 3.12, you will get an exception: ValueError: Invalid dBase IV block: b'PK\x03\x04\n\x00\x00\x00'. That block does not appear in the input file at all; b'PK\x03\x04' is the zip local file header signature, so the read evidently landed on another entry's header within the archive. (Though, when tested with a larger input, I got a block of bytes that belonged to the wrong file.)

For some context, here is a workaround I used in my project: I changed it to read the .dbf file first, then the .dbt.
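For illustration, a minimal sketch of that workaround (it reuses ZIP_PATH and the DbfFile/DbtFile helpers from the test case below; record[3] holds the memo block index, as in main() below; the two-pass structure is the point):

def main_two_pass():
    with zipfile.ZipFile(ZIP_PATH) as z:
        # First pass: read every .dbf record before touching the .dbt,
        # so the two member handles never interleave reads/seeks.
        with z.open('notams.dbf') as dbf_in:
            dbf_header, dbf_fields = DbfFile.read_header(dbf_in)
            records = [DbfFile.read_record(dbf_in, dbf_fields)
                       for _ in range(dbf_header.num_records)]

        # Second pass: read the corresponding .dbt memo blocks.
        with z.open('notams.dbt') as dbt_in:
            dbt_header = DbtFile.read_header(dbt_in)
            for record in records:
                print(record)
                print(DbtFile.read_record(dbt_in, dbt_header, record[3]))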

Test case:

#!/usr/bin/env python3

import datetime
import pathlib
import struct
import zipfile

from dataclasses import dataclass
from typing import Any, BinaryIO, List, Tuple


ZIP_PATH = pathlib.Path(__file__).parent / 'notams.zip'


@dataclass
class DbfHeader:
    SIZE = 32
    VERSION = 3

    info: int
    last_update: datetime.date
    num_records: int
    header_bytes: int
    record_bytes: int

    @classmethod
    def from_bytes(cls, data: bytes):
        # Layout '<4BIHH20x': info byte, last-update Y/M/D bytes, record
        # count (u32), header size (u16), record size (u16), 20 pad bytes.
        info, year, month, day, num_records, header_bytes, record_bytes = struct.unpack('<4BIHH20x', data)
        version = info & 0x3
        if version != cls.VERSION:
            raise ValueError(f"Unsupported DBF version: {version}")

        return cls(info, datetime.date(year + 1900, month, day), num_records, header_bytes, record_bytes)


@dataclass
class DbfField:
    SIZE = 32

    name: str
    type: str
    length: int

    @classmethod
    def from_bytes(cls, data: bytes):
        name, typ, length = struct.unpack('<11sc4xB15x', data)
        return cls(name.rstrip(b'\x00').decode(), typ.decode(), length)


class DbfFile:
    @classmethod
    def read_header(cls, fd: BinaryIO) -> Tuple[DbfHeader, List[DbfField]]:
        header = DbfHeader.from_bytes(fd.read(DbfHeader.SIZE))
        num_fields = (header.header_bytes - 33) // 32  # 32-byte header + N 32-byte field descriptors + 1-byte terminator
        fields = [DbfField.from_bytes(fd.read(DbfField.SIZE)) for _ in range(num_fields)]
        if fd.read(1) != b'\x0D':
            raise ValueError("Missing array terminator")
        return header, fields

    @classmethod
    def read_record(cls, fd: BinaryIO, fields: List[DbfField]) -> List[Any]:
        fd.read(1)  # skip the record's deletion-flag byte

        values = []
        for field in fields:
            data = fd.read(field.length).decode('latin-1').strip(' ')
            if field.type == 'C':
                value = data
            elif field.type == 'D':
                s = data.strip(' ')
                if s:
                    value = datetime.datetime.strptime(data, '%Y%m%d').date()
                else:
                    value = None
            elif field.type == 'L':
                if len(data) != 1:
                    raise ValueError(f"Incorrect length: {data!r}")
                if data in 'YyTt':
                    value = True
                elif data in 'NnFf':
                    value = False
                elif data == '?':
                    value = None
                else:
                    raise ValueError(f"Incorrect boolean: {data!r}")
            elif field.type in ('M', 'N'):
                value = int(data) if data else None
            else:
                raise ValueError(f"Unsupported field: {field.type}")
            values.append(value)
        return values


@dataclass
class DbtHeader:
    SIZE = 512

    next_free_block: int
    dbf_filename: str
    reserved: int
    block_length: int

    @classmethod
    def from_bytes(cls, data: bytes):
        next_free_block, dbf_filename, reserved, block_length = struct.unpack('<I4x8sIH490x', data)
        return cls(next_free_block, dbf_filename.decode('latin-1'), reserved, block_length)


class DbtFile:
    DBT3_BLOCK_SIZE = 512
    DBT4_BLOCK_START = b'\xFF\xFF\x08\x00'

    @classmethod
    def read_header(cls, fd: BinaryIO) -> DbtHeader:
        fd.seek(0)
        block = fd.read(DbtHeader.SIZE)
        return DbtHeader.from_bytes(block)

    @classmethod
    def read_record(cls, fd: BinaryIO, header: DbtHeader, idx: int) -> str:
        fd.seek(header.block_length * idx)  # memo blocks are fixed-size; idx comes from the DBF record
        block_start = fd.read(8)
        if block_start[0:4] != cls.DBT4_BLOCK_START:
            raise ValueError(f"Invalid dBase IV block: {block_start}")
        length = int.from_bytes(block_start[4:8], 'little')
        data = fd.read(length - len(block_start))

        return data.decode('latin-1')


def main():
    with zipfile.ZipFile(ZIP_PATH) as z:
        with z.open('notams.dbf') as dbf_in, z.open('notams.dbt') as dbt_in:
            dbf_header, dbf_fields = DbfFile.read_header(dbf_in)
            dbt_header = DbtFile.read_header(dbt_in)

            for _ in range(dbf_header.num_records):
                record = DbfFile.read_record(dbf_in, dbf_fields)
                print(record)
                memo = DbtFile.read_record(dbt_in, dbt_header, record[3])
                print(memo)


if __name__ == '__main__':
    main()

Input file:
notams.zip

CPython versions tested on:

3.9, 3.12

Operating systems tested on:

Linux

Linked PRs

dimaryaz added the type-bug (An unexpected behavior, bug, or error) label Dec 12, 2024
dimaryaz (Contributor, Author) commented:

Here is a completely trivial test case:

#!/usr/bin/env python3

from io import BytesIO
import zipfile

buf = BytesIO()

with zipfile.ZipFile(buf, mode='w', compression=0) as z:  # 0 == zipfile.ZIP_STORED
    with z.open('a.txt', mode='w') as a:
        a.write(b'123')
    with z.open('b.txt', mode='w') as b:
        b.write(b'456')


with zipfile.ZipFile(buf) as z:
    with z.open('a.txt') as a, z.open('b.txt') as b:
        a.read(1)

        b.seek(1)
        data = b.read(1)
        assert data == b'5', data

When run with Python 3.9, it finishes successfully. When run with Python 3.12, you get:

Traceback (most recent call last):
  File "/home/dima/src/zipfile_bug/./test.py", line 22, in <module>
    assert data == b'5', data
           ^^^^^^^^^^^^
AssertionError: b'K'

(The stray b'K' is presumably the second byte of the b'PK\x03\x04' local file header of the next archive entry: the relative seek was applied from wherever the other handle had left the shared file position.)

danifus (Contributor) commented Dec 12, 2024

Looks like it is getting confused about its position in the archive in the special case of seeking in a ZIP_STORED entry:

self._fileobj.seek(read_offset, os.SEEK_CUR)

should be:

self._fileobj.seek(self._orig_compress_start + read_offset)
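To illustrate the failure mode, here is a toy model (not the actual zipfile internals, just the seek arithmetic): two readers sharing one stream must seek from their own absolute start, because a relative seek trusts wherever the shared pointer happens to be.

from io import BytesIO

shared = BytesIO(b'AAAAA' + b'BBBBB')  # two stored "members" in one stream

class Reader:
    def __init__(self, fileobj, start):
        self.fileobj = fileobj
        self.start = start  # analogous to _orig_compress_start

    def read_at_absolute(self, offset, n):
        # Fixed behaviour: seek from this member's own start.
        self.fileobj.seek(self.start + offset)
        return self.fileobj.read(n)

    def read_at_relative(self, offset, n):
        # Buggy behaviour: seek relative to the shared position,
        # which another reader may have moved in the meantime.
        self.fileobj.seek(offset, 1)  # 1 == os.SEEK_CUR
        return self.fileobj.read(n)

a = Reader(shared, 0)
b = Reader(shared, 5)
a.read_at_absolute(0, 1)         # moves the shared pointer into member A
print(b.read_at_relative(1, 1))  # b'A' -- bytes from the wrong member
print(b.read_at_absolute(1, 1))  # b'B' -- correct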

Did you want to make a test and add the fix?

dimaryaz (Contributor, Author) commented:

Oh nice, thank you!

Well, let me see if I can figure out how to run Python tests...

MercMayhem commented:

@danifus could you explain why changing the new position to the value of self._orig_compress_start + read_offset fixes this issue?

Going through the code, calling seek on _fileobj with read_offset and os.SEEK_CUR as arguments seemed sound to my limited understanding: since we have loaded the buffer into the ZipExtFile object, the pointer in the underlying _SharedFile points to the position of the last byte of the buffer within the shared file.

Hence, after correcting read_offset by subtracting len(self._readbuffer) - self._offset and calling seek with os.SEEK_CUR, shouldn't we land on the correct byte?

Can you explain the flaw in my understanding? I am a newbie trying to work through the CPython internals.

ankasani commented Dec 13, 2024

I have a similar kind of issue, where data gets corrupted when reading files in parallel. The code below works fine with 3.11 but fails with 3.12 and 3.13.

ERROR: Error -3 while decompressing data: invalid stored block lengths

import zipfile
import asyncio

async def process_file(text_file_name: str, zip_file: zipfile.ZipFile):
    try:
        with zip_file.open(text_file_name, mode='r') as text_file:
            try:
                content = await asyncio.to_thread(text_file.read)
                lines = content.decode('utf-8').splitlines()
            except UnicodeDecodeError as e:
                print(f"Error decoding file {text_file_name}: {e}")
                return None
            except Exception as e:
                print(f"Error reading file {text_file_name}: {e}")
                return None
            if not lines:
                return None
            # Process lines here
            return lines
    except Exception as e:
        print(f"Error opening file {text_file_name}: {e}")
        return None

async def main():
    temp_file_path = 'Tests.zip'
    zip_file = zipfile.ZipFile(temp_file_path, 'r')
    tasks = [process_file(text_file_name, zip_file) for text_file_name in zip_file.namelist()]
    await asyncio.gather(*tasks)

asyncio.run(main())

danifus (Contributor) commented Dec 13, 2024

@MercMayhem we continued the discussion here: #127856 (comment). The problem is more likely in _SharedFile.seek().

@ankasani this bug is specific to zipped entries that aren't compressed, and the error you reported says there is an error decompressing, so I think your problem isn't related to this current issue. I'm not sure about the async behaviour of zipfile, but it is probably easier to open the zip inside process_file instead (or as well).
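A hedged sketch of that restructuring (the path and helper shape are carried over from the example above; this is an assumption about what avoids the shared handle, not a verified fix):

import asyncio
import zipfile

async def process_file(zip_path: str, member: str):
    def _read() -> bytes:
        # Each worker opens its own ZipFile, so no file handle (and no
        # seek position) is shared between concurrently running tasks.
        with zipfile.ZipFile(zip_path) as zf:
            with zf.open(member) as f:
                return f.read()
    content = await asyncio.to_thread(_read)
    return content.decode('utf-8').splitlines()

async def main():
    zip_path = 'Tests.zip'
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.namelist()
    await asyncio.gather(*(process_file(zip_path, m) for m in members))

asyncio.run(main())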

picnixz added the stdlib (Python modules in the Lib dir) label Dec 13, 2024
picnixz added the 3.12 (bugs and security fixes), 3.13 (bugs and security fixes), and 3.14 (new features, bugs and security fixes) labels Dec 13, 2024
ankasani commented Dec 14, 2024

@danifus Thanks for your help! However, if I open the ZIP file inside the process_file function, it could lead to higher memory usage: each task would open a new instance of the ZIP file, potentially loading multiple instances into memory simultaneously, especially if the ZIP contains many files. I'm looking for a solution that minimizes memory usage while still allowing fast, simultaneous execution. Do you have any other suggestions?

Maybe I need to move this to a new thread, since, as you mentioned, this is a different issue.
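(For completeness, a sketch of one memory-conscious alternative that was not proposed in the thread: keep the single ZipFile but serialize access with a lock, trading concurrent reads for a single open handle.)

import asyncio
import zipfile

async def process_file(member: str, zf: zipfile.ZipFile, lock: asyncio.Lock):
    # One ZipFile, one underlying handle: the lock guarantees that reads
    # never interleave, at the cost of reading members one at a time.
    async with lock:
        content = await asyncio.to_thread(zf.read, member)
    return content.decode('utf-8').splitlines()

async def main():
    lock = asyncio.Lock()
    with zipfile.ZipFile('Tests.zip') as zf:
        await asyncio.gather(*(process_file(m, zf, lock)
                               for m in zf.namelist()))

asyncio.run(main())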

dimaryaz added a commit to dimaryaz/cpython that referenced this issue Dec 14, 2024
jaraco added a commit that referenced this issue Dec 24, 2024
miss-islington pushed two commits to miss-islington/cpython that referenced this issue Dec 24, 2024 (cherry picked from commit 7ed6c5c)
jaraco added a commit that referenced this issue Dec 24, 2024
gh-127847: Fix position in the special-cased zipfile seek (GH-127856) (#128226)
jaraco (Member) commented Dec 24, 2024

PRs merged; closing.

jaraco closed this as completed Dec 24, 2024
jaraco added a commit that referenced this issue Dec 24, 2024
gh-127847: Fix position in the special-cased zipfile seek (GH-127856) (#128225) (cherry picked from commit 7ed6c5c)
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Jan 8, 2025