Error while extracting citations from the dumps #17

Open
kodchi opened this issue May 28, 2019 · 0 comments

kodchi commented May 28, 2019

While extracting citations from the hewiki dumps of 2019/05/01, the following error occurs:

$ mwcites extract /mnt/data/xmldatadumps/public/hewiki/20190501/hewiki-20190501-pages-meta-history*.xml*.bz2 > hewiki-20190501-citations.tsv
Traceback (most recent call last):
  File "/srv/home/bmansurov/venv/mwcites/bin/mwcites", line 11, in <module>
    sys.exit(main())
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/mwcites.py", line 49, in main
    module.main(sys.argv[2:])
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 58, in main
    run(dump_files, extractors)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 65, in run
    for page_id, title, rev_id, timestamp, type, id in cites:
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/map.py", line 87, in map
Failed while processing dump '/mnt/data/xmldatadumps/public/hewiki/20190501/hewiki-20190501-pages-meta-history1.xml-p13702p18009.bz2':
Traceback (most recent call last):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/processor.py", line 35, in run
    for out in self.process_dump(dump, path):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 94, in process_dump
    for cite in extract_cite_history(page, extractors):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 116, in extract_cite_history
    for revision in page:
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/page.py", line 72, in load_revisions
    yield Revision.from_element(sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/revision.py", line 99, in from_element
    values = consume_tags(cls.TAG_MAP, element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/util.py", line 7, in consume_tags
    value_map[tag_name] = tag_map[tag_name](sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/revision.py", line 20, in <lambda>
    'contributor': lambda e: Contributor.from_element(e),
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/contributor.py", line 40, in from_element
    values = consume_tags(cls.TAG_MAP, element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/util.py", line 7, in consume_tags
    value_map[tag_name] = tag_map[tag_name](sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/contributor.py", line 14, in <lambda>
    'id': lambda e: int(e.text),
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
    re_raise(error, path)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/map.py", line 12, in re_raise

    raise error
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Failed while processing dump '/mnt/data/xmldatadumps/public/hewiki/20190501/hewiki-20190501-pages-meta-history1.xml-p6536p13701.bz2':
Traceback (most recent call last):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/processor.py", line 35, in run
    for out in self.process_dump(dump, path):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 94, in process_dump
    for cite in extract_cite_history(page, extractors):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 116, in extract_cite_history
    for revision in page:
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/page.py", line 72, in load_revisions
    yield Revision.from_element(sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/revision.py", line 99, in from_element
    values = consume_tags(cls.TAG_MAP, element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/util.py", line 7, in consume_tags
    value_map[tag_name] = tag_map[tag_name](sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/revision.py", line 20, in <lambda>
    'contributor': lambda e: Contributor.from_element(e),
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/contributor.py", line 40, in from_element
    values = consume_tags(cls.TAG_MAP, element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/util.py", line 7, in consume_tags
    value_map[tag_name] = tag_map[tag_name](sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/contributor.py", line 14, in <lambda>
    'id': lambda e: int(e.text),
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
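
Both per-dump tracebacks end at `'id': lambda e: int(e.text)` in `contributor.py`, which suggests that some `<contributor>` element in these dumps has an `<id>` child with no text, so `e.text` is `None`. The snippet below is only a minimal sketch of that failure mode and of one possible None-tolerant conversion. The XML fragment and the `parse_optional_int` helper are hypothetical illustrations, not taken from the dump or from the `mw` library.

```python
import xml.etree.ElementTree as ET


def parse_optional_int(element):
    """Return int(element.text), or None when the element has no text.

    Hypothetical helper for illustration; it is not part of mw or mwcites.
    """
    text = element.text
    if text is None or not text.strip():
        return None
    return int(text)


# Hypothetical contributor fragment with an empty <id /> element.
# Whether the hewiki dump contains exactly this shape is an assumption.
fragment = ET.fromstring(
    "<contributor><username>Example</username><id /></contributor>"
)
id_element = fragment.find("id")

# An empty element has .text == None, so the conversion shown in the
# traceback (int(e.text)) raises TypeError.
try:
    int(id_element.text)
    reproduced = False
except TypeError:
    reproduced = True

print("TypeError reproduced:", reproduced)
print("Tolerant parse result:", parse_optional_int(id_element))
```

If the cause is confirmed, a conversion along these lines in the `'id'` entry of the tag map would let extraction continue past such revisions, at the cost of contributors with a missing id being reported as `None`.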