Fix WARC writing bug #23

Open: 3 commits to be merged into base: master

jeffcasavant

I hit this error when writing a WARC record to a gzipped file:

[jeff@lamarzocco warcs]$ ./cleanwarc.py  in.warc.gz filtered.warc.gz
Traceback (most recent call last):
  File "./cleanwarc.py", line 86, in <module>
    main()
  File "./cleanwarc.py", line 82, in main
    filter_warc(args.infile, args.outfile)
  File "./cleanwarc.py", line 61, in filter_warc
    output_warc.write_record(record)
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 268, in write_record
    warc_record.write_to(self.fileobj)
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 161, in write_to
    f.write(self.payload)
  File "/usr/lib/python2.7/site-packages/warc/gzip2.py", line 71, in write
    BaseGzipFile.write(self, data)
  File "/usr/lib/python2.7/gzip.py", line 240, in write
    if len(data) > 0:
AttributeError: FilePart instance has no attribute '__len__'

I added a __len__ method to FilePart to fix this, but then got a different error:

Traceback (most recent call last):
  File "./cleanwarc.py", line 90, in <module>
    main()
  File "./cleanwarc.py", line 86, in main
    filter_warc(args.infile, args.outfile)
  File "./cleanwarc.py", line 65, in filter_warc
    output_warc.write_record(record)
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 268, in write_record
    warc_record.write_to(self.fileobj)
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 161, in write_to
    f.write(self.payload)
  File "/usr/lib/python2.7/site-packages/warc/gzip2.py", line 71, in write
    BaseGzipFile.write(self, data)
  File "/usr/lib/python2.7/gzip.py", line 241, in write
    self.fileobj.write(self.compress.compress(data))
TypeError: must be string or read-only buffer, not instance

This PR fixes both issues by passing the buf attribute of the FilePart (rather than the whole FilePart) to gzip.
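
For readers without the diff at hand, here is a minimal sketch of the idea, not the actual patch. It is written for Python 3, whereas the tracebacks above come from Python 2.7. FakeFilePart is a hypothetical stand-in for the library's FilePart, and the only thing assumed about it is that the payload bytes are reachable through a buf attribute; write_payload and roundtrip are illustrative names as well.

```python
import gzip
import io


class FakeFilePart:
    """Hypothetical stand-in for warc's FilePart (assumption: bytes live in `buf`)."""

    def __init__(self, data):
        self.buf = data


def write_payload(gz, payload):
    """Write raw bytes to gzip, unwrapping a FilePart-like object first."""
    # gzip can only compress bytes-like data, so hand it the underlying
    # buffer rather than the wrapper object itself.
    data = getattr(payload, "buf", payload)
    gz.write(data)


def roundtrip(payload):
    """Compress the payload into memory and return the decompressed bytes."""
    raw = io.BytesIO()
    with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
        write_payload(gz, payload)
    return gzip.decompress(raw.getvalue())


def direct_write_fails(payload):
    """Return True if handing the wrapper object straight to gzip raises TypeError."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(), mode="wb") as gz:
            gz.write(payload)
    except TypeError:
        return True
    return False
```

Passing the wrapper straight to gzip fails with a TypeError on Python 3 too, while going through write_payload round-trips the bytes unchanged. Whether buf always holds the complete payload in the real FilePart is something this sketch does not check.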

@jeffcasavant changed the title from "Fix" to "Fix WARC writing bug" on Jun 10, 2016
@wolfgangmeyers

This would be great to merge. Is the project abandoned?

@jeffcasavant
Author

@wolfgangmeyers I guess? This has seen no attention since I submitted it, getting on a year ago. Figured it would be a no-brainer 😛 Who's the maintainer?
