Create extract-wikidata.py #45

Open
wants to merge 1 commit into
base: main

vrandezo

Script to turn Wikidata lexemes into a dictionary for Unicode inflections.

Based on #9

Member

grhoten left a comment

I think that this is a good starting point. After this is merged, I can take a crack at this script to integrate some of the logic from dictionary-parser into this script. The code2qid and qid2name should probably be made external to this script, since those are likely to be updated, especially when adding a new language.
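One way the externalization suggested here could look, as a minimal sketch: the two tables live in a single JSON file and are loaded at startup. The file layout, the `languages.json` name, and the `load_language_tables` helper are assumptions for illustration, not part of this PR.

```python
import json

# Assumed layout of a hypothetical languages.json:
# {"en": {"qid": "Q1860", "name": "English"},
#  "ko": {"qid": "Q9176", "name": "Korean"}}

def load_language_tables(path):
    """Load one JSON file and derive both lookup tables from it."""
    with open(path, encoding="utf-8") as handle:
        entries = json.load(handle)
    # Language code -> Wikidata QID of the language.
    code2qid = {code: entry["qid"] for code, entry in entries.items()}
    # Wikidata QID of the language -> display name.
    qid2name = {entry["qid"]: entry["name"] for entry in entries.values()}
    return code2qid, qid2name
```

Keeping both tables in one file means that adding a language is a single data edit, with no change to the script itself.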

A list of all available versions can be found on the Web here:
https://dumps.wikimedia.org/wikidatawiki/entities/

Or you can download the most recent version using the following command:
Contributor

I would put this as the top option, with a comment on how to download a specific one (by date), so essentially flip the order you have now.

formcount = 0
errorcount = 0

code2qid = {
Contributor

I assume these are the permitted ones from above (line 22)?


After that, run this script in the directory in which the lexemes.json.bz2 file is located.
The result will be a dictionary_xxx.lst file for the given language.
By default the language is 'en', but you can use one of the permitted language codes instead.
Contributor

Maybe:
The language can be specified on the command line; the default is 'en'. Allowed languages are listed below in the code2qid object.


dictionary = {}

languagecode = 'en'
Contributor

This is the current best practice for parsing command-line arguments in Python: https://docs.python.org/3/library/argparse.html

I can fix that once you merge (or you can in this round). Let me know.

Contributor

Here's an example of how to fix it:

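import argparse

def main():
  parser = argparse.ArgumentParser(description="Process a language code.")
  parser.add_argument("language_code", nargs="?", default="en", help="The language code (e.g., 'en', 'es', 'fr'). Defaults to 'en'.")
  # Parse sys.argv; without this call, args is undefined.
  args = parser.parse_args()
  # process_language stands in for the script's existing extraction logic.
  process_language(args.language_code)

if __name__ == "__main__":
  main()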

The if __name__ == "__main__": part matters when the file may be run from the command line or imported as a module; it allows both ways of using it. I can refactor once you merge. The important part is the argument parsing.
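As a possible refinement, argparse can also reject codes that are not in code2qid, which ties in with the earlier comment about permitted languages. This is a sketch only: the two-entry code2qid table and the `parse_language` helper are placeholders for illustration, not the script's real contents.

```python
import argparse

# Placeholder table for illustration; the real one is defined in the script.
code2qid = {"en": "Q1860", "ko": "Q9176"}

def parse_language(argv=None):
    """Return the requested language code, defaulting to 'en'."""
    parser = argparse.ArgumentParser(
        description="Extract a dictionary from Wikidata lexemes.")
    parser.add_argument(
        "language_code",
        nargs="?",
        default="en",
        choices=sorted(code2qid),  # argparse exits with an error for other codes
        help="Language code to extract. Defaults to 'en'.")
    # argv=None makes argparse read sys.argv[1:].
    return parser.parse_args(argv).language_code
```

With `choices` set, the list of allowed codes also appears in the `--help` output, so the docstring does not have to repeat it.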

Contributor

Please rename extract-wikidata.py to extract_wikidata.py. Python's import statement can't handle a - in a module name, so this file can't become a module in the future, e.g. you can't:

import extract-wikidata
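A small check that illustrates the point; `is_importable_name` is a hypothetical helper written for this example. The hyphen is parsed as a minus sign, so the import statement fails at compile time, before Python even looks for the file.

```python
import keyword

def is_importable_name(module_name):
    """True if each dotted part is an identifier and not a reserved keyword."""
    return all(part.isidentifier() and not keyword.iskeyword(part)
               for part in module_name.split("."))

print(is_importable_name("extract-wikidata"))  # False
print(is_importable_name("extract_wikidata"))  # True

# The import statement itself is rejected by the compiler.
try:
    compile("import extract-wikidata", "<example>", "exec")
except SyntaxError as error:
    print("SyntaxError:", error.msg)
```

importlib.import_module("extract-wikidata") can still load such a file by its string name, but renaming is the simpler and more conventional fix.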

Member

grhoten commented Jan 14, 2025

FYI this is an important part of the Wikidata structure.

  "type": "lexeme",
  "id": "L3257",
  "lemmas": {
    "en": {
      "language": "en",
      "value": "apple"
    }
  },
  "lexicalCategory": "Q1084",
  "language": "Q1860",
...
  "forms": [
    {
      "id": "L3257-F1",
      "representations": {
        "en": {
          "language": "en",
          "value": "apple"
        }
      },
      "grammaticalFeatures": [
        "Q110786"
      ],
      "claims": {
        "P898": [
          {
            "mainsnak": {
              "snaktype": "value",
              "property": "P898",
              "datavalue": {
                "value": "\u02c8\u00e6p.\u0259l",
                "type": "string"
              },
              "datatype": "string"
            },
            "type": "statement",
            "id": "L3257-F1$6b031033-493d-3667-b9f1-981fbffbdb28",
            "rank": "normal"
          }
        ]
      }
    },
    {
      "id": "L3257-F2",
      "representations": {
        "en": {
          "language": "en",
          "value": "apples"
        }
      },
      "grammaticalFeatures": [
        "Q146786"
      ]
    }
  ]

Based on this information, the following should be created.

dictionary_en.lst

apple: singular vowel-start noun inflection=1
apples: plural noun inflection=1

inflectional_en.xml

    <pattern name="1" words="224929">
        <pos>noun</pos>
        <suffix/>
        <inflections>
            <inflection number="singular"><t><stem/></t></inflection>
            <inflection number="plural"><t><stem/>s</t></inflection>
        </inflections>
    </pattern>
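A minimal sketch of the first half of that mapping, from the lexeme's forms to dictionary lines. It covers only what the JSON above provides directly: the form's representation, its grammatical features, and the lexical category. The vowel-start property and the inflection=1 pattern reference need further analysis and are left out. The `dictionary_lines` function and the two small label tables are assumptions for illustration.

```python
# Labels for the QIDs that occur in the sample lexeme above.
FEATURE_NAMES = {"Q110786": "singular", "Q146786": "plural"}
CATEGORY_NAMES = {"Q1084": "noun"}

def dictionary_lines(lexeme, language_code):
    """Build one 'word: features category' line per form of the lexeme."""
    category_qid = lexeme["lexicalCategory"]
    category = CATEGORY_NAMES.get(category_qid, category_qid)
    lines = []
    for form in lexeme.get("forms", []):
        representation = form["representations"].get(language_code)
        if representation is None:
            continue  # this form has no spelling in the requested language
        features = [FEATURE_NAMES.get(qid, qid)
                    for qid in form.get("grammaticalFeatures", [])]
        lines.append(f"{representation['value']}: {' '.join(features + [category])}")
    return lines

# The sample lexeme, trimmed to the fields used here.
sample_lexeme = {
    "lexicalCategory": "Q1084",
    "forms": [
        {"representations": {"en": {"language": "en", "value": "apple"}},
         "grammaticalFeatures": ["Q110786"]},
        {"representations": {"en": {"language": "en", "value": "apples"}},
         "grammaticalFeatures": ["Q146786"]},
    ],
}

for line in dictionary_lines(sample_lexeme, "en"):
    print(line)
# apple: singular noun
# apples: plural noun
```

Unknown QIDs fall through unchanged, so missing entries in the label tables are easy to spot in the output.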

'Q3438770': 'Karai-karai',
'Q9176': 'Korean',
'Q9237': 'Malay',
'Q25167': 'Norwegian Boksmal',
Member

This is Norwegian Bokmål or Bokmal.

Contributor

nciric commented Jan 14, 2025

dictionary_en.lst

 apple: singular vowel-start noun inflection=1
 apples: plural noun inflection=1

inflectional_en.xml

    <pattern name="1" words="224929">
        <pos>noun</pos>
        <suffix/>
        <inflections>
            <inflection number="singular"><t><stem/></t></inflection>
            <inflection number="plural"><t><stem/>s</t></inflection>
        </inflections>
    </pattern>

How do you generate inflection=1 in dictionary_en.lst, and how do you generate the name="1" in the XML file?

I assume there's some kind of analysis of nouns to slot them into groups with similar or identical grammatical rules for inflection (which is the hard part of the problem). Do we need to run some Java tool? Is the long-term plan to replace it with Python?
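To make the assumption concrete, here is a minimal sketch of one naive way such grouping could work: strip the common prefix from a lexeme's forms and give the same pattern number to every lexeme with the same part of speech and the same suffixes. This is purely illustrative. The thread does not say how the existing tooling assigns patterns, and `suffix_signature` and `assign_patterns` are hypothetical names.

```python
import os

def suffix_signature(forms):
    """forms maps a feature to a word; return (stem, sorted (feature, suffix) pairs)."""
    stem = os.path.commonprefix(list(forms.values()))
    suffixes = tuple(sorted((feature, word[len(stem):])
                            for feature, word in forms.items()))
    return stem, suffixes

def assign_patterns(lexemes):
    """lexemes is a list of (pos, forms); return one pattern number per lexeme."""
    pattern_ids = {}
    assignments = []
    for pos, forms in lexemes:
        _, suffixes = suffix_signature(forms)
        key = (pos, suffixes)
        if key not in pattern_ids:
            pattern_ids[key] = len(pattern_ids) + 1  # number patterns from 1
        assignments.append(pattern_ids[key])
    return assignments, pattern_ids

lexemes = [
    ("noun", {"singular": "apple", "plural": "apples"}),
    ("noun", {"singular": "pear", "plural": "pears"}),
    ("noun", {"singular": "box", "plural": "boxes"}),
]
assignments, patterns = assign_patterns(lexemes)
print(assignments)  # [1, 1, 2]
```

Here apple and pear share pattern 1 (empty suffix for singular, "s" for plural), while box gets pattern 2. A prefix-based stem breaks down on forms such as man/men, which is part of why this is the hard part of the problem.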

3 participants