Create extract-wikidata.py #45

vrandezo · 2025-01-13T21:27:56Z

Script to turn Wikidata lexemes into a dictionary for Unicode inflections.

Based on #9

grhoten

I think this is a good starting point. After this is merged, I can take a crack at integrating some of the logic from dictionary-parser into this script. The code2qid and qid2name tables should probably be moved out of the script, since they are likely to be updated, especially when a new language is added.
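
A minimal sketch of what externalizing those tables could look like, assuming a hypothetical JSON file that holds both mappings. The file layout and the function name load_language_tables are illustrative only and are not part of this PR.

```python
import json
from pathlib import Path


def load_language_tables(path):
    """Load code2qid and derive qid2name from a JSON file (hypothetical layout)."""
    # Assumed layout: {"en": {"qid": "Q1860", "name": "English"}, ...}
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    code2qid = {code: entry["qid"] for code, entry in data.items()}
    qid2name = {entry["qid"]: entry["name"] for entry in data.values()}
    return code2qid, qid2name
```

Keeping one record per language means that adding a language touches a single entry in the data file and leaves the script unchanged.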

nciric · 2025-01-13T23:14:16Z

data/tools/extract-wikidata.py

+A list of all available versions can be found on the Web here:
+https://dumps.wikimedia.org/wikidatawiki/entities/
+
+Or you can download the most recent version using the following command:


I would put this as the top option, with a comment on how to download a specific version (by date). So essentially flip the order you have now.
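
To illustrate the "latest first, dated version as an option" ordering, here is a small sketch that builds the download URL. The base URL comes from the docstring under review; the file names (latest-lexemes.json.bz2 and wikidata-<date>-lexemes.json.bz2) are my assumption about the dump layout and should be checked against the listing page.

```python
BASE_URL = "https://dumps.wikimedia.org/wikidatawiki/entities/"


def lexeme_dump_url(date=None):
    """Return the URL of the lexeme dump; date is a YYYYMMDD string or None."""
    if date is None:
        # Assumed file name of the most recent dump.
        return BASE_URL + "latest-lexemes.json.bz2"
    if not (len(date) == 8 and date.isascii() and date.isdigit()):
        raise ValueError("date must look like YYYYMMDD")
    # Assumed layout of dated dumps: <date>/wikidata-<date>-lexemes.json.bz2
    return f"{BASE_URL}{date}/wikidata-{date}-lexemes.json.bz2"
```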

nciric · 2025-01-13T23:17:27Z

data/tools/extract-wikidata.py

+formcount = 0
+errorcount = 0
+
+code2qid = {


I assume these are the permitted ones from above (line 22)?

nciric · 2025-01-13T23:18:42Z

data/tools/extract-wikidata.py

+
+After that run this script in the dictionary in which the lexemes.json.bz2 file is located.
+The result will be a dictionary_xxx.lst file for the given language.
+Per default the language is 'en', but you can use one of the permitted language code insteads.


Maybe:
The language can be specified on the command line; the default is 'en'. Allowed languages are listed below in the code2qid object.
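
If the allowed languages are the keys of code2qid, the check can be enforced rather than only documented. A rough sketch, assuming an abbreviated stand-in for the real table (the language codes here are illustrative, and parse_language is a hypothetical helper name):

```python
import argparse

# Abbreviated stand-in for the script's code2qid table.
code2qid = {"en": "Q1860", "ko": "Q9176", "ms": "Q9237"}


def parse_language(argv=None):
    """Return the requested language code, rejecting codes not in code2qid."""
    parser = argparse.ArgumentParser(
        description="Extract a dictionary from Wikidata lexemes.")
    parser.add_argument(
        "language_code", nargs="?", default="en",
        choices=sorted(code2qid),
        help="Language code; defaults to 'en'.")
    return parser.parse_args(argv).language_code
```

With choices set, argparse prints the permitted codes and exits when an unknown code is passed, so the help text and the table cannot drift apart.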

nciric · 2025-01-13T23:20:42Z

data/tools/extract-wikidata.py

+
+dictionary = {}
+
+languagecode = 'en'


This is the current best practice for parsing command-line args in Python - https://docs.python.org/3/library/argparse.html.

I can fix that once you merge (or you can in this round). Let me know.

Here's an example of how to fix it:

    import argparse

    def main():
        parser = argparse.ArgumentParser(description="Process a language code.")
        parser.add_argument("language_code", nargs="?", default="en",
                            help="The language code (e.g., 'en', 'es', 'fr'). Defaults to 'en'.")
        args = parser.parse_args()
        # process_language stands in for the script's existing logic.
        process_language(args.language_code)

    if __name__ == "__main__":
        main()

The if __name__ == "__main__": part lets the file be run from the command line or imported as a module; both uses work. I can refactor once you merge; the important part is the argument parsing.

nciric · 2025-01-13T23:24:24Z

data/tools/extract-wikidata.py

Please rename extract-wikidata.py to extract_wikidata.py. A hyphen is not valid in a Python identifier, so the import statement can't name this file, and it can't easily become a module in the future. E.g. this is a syntax error:

import extract-wikidata
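
For completeness, a hyphenated file can still be loaded through importlib, but it is clumsier than a plain import. A small self-contained sketch, using a temporary file as a stand-in for the real script (load_module_from_path is a hypothetical helper):

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(module_name, path):
    """Load a Python source file as a module, whatever its file name is."""
    spec = importlib.util.spec_from_file_location(module_name, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# "extract-wikidata" is not a valid identifier, so `import` cannot name it.
print("extract-wikidata".isidentifier())   # False
print("extract_wikidata".isidentifier())   # True

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "extract-wikidata.py"
    script.write_text("languagecode = 'en'\n", encoding="utf-8")
    module = load_module_from_path("extract_wikidata", script)
    print(module.languagecode)  # en
```

Renaming the file avoids all of this, which is why I'd still prefer the underscore.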

grhoten · 2025-01-14T08:08:49Z

FYI this is an important part of the Wikidata structure.

  "type": "lexeme",
  "id": "L3257",
  "lemmas": {
    "en": {
      "language": "en",
      "value": "apple"
    }
  },
  "lexicalCategory": "Q1084",
  "language": "Q1860",
...
  "forms": [
    {
      "id": "L3257-F1",
      "representations": {
        "en": {
          "language": "en",
          "value": "apple"
        }
      },
      "grammaticalFeatures": [
        "Q110786"
      ],
      "claims": {
        "P898": [
          {
            "mainsnak": {
              "snaktype": "value",
              "property": "P898",
              "datavalue": {
                "value": "\u02c8\u00e6p.\u0259l",
                "type": "string"
              },
              "datatype": "string"
            },
            "type": "statement",
            "id": "L3257-F1$6b031033-493d-3667-b9f1-981fbffbdb28",
            "rank": "normal"
          }
        ]
      }
    },
    {
      "id": "L3257-F2",
      "representations": {
        "en": {
          "language": "en",
          "value": "apples"
        }
      },
      "grammaticalFeatures": [
        "Q146786"
      ]
    }
  ]

Based on this information, the following should be created.

dictionary_en.lst

apple: singular vowel-start noun inflection=1
apples: plural noun inflection=1

inflectional_en.xml

    <pattern name="1" words="224929">
        <pos>noun</pos>
        <suffix/>
        <inflections>
            <inflection number="singular"><t><stem/></t></inflection>
            <inflection number="plural"><t><stem/>s</t></inflection>
        </inflections>
    </pattern>
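
As a rough illustration of the first half of that mapping, here is a sketch that turns the lexeme structure above into word/feature lines. It only covers what Wikidata provides directly; the vowel-start property and the inflection=1 pattern reference are not derived here. The QID_LABELS table and dictionary_lines function are hypothetical names, and the table holds only the QIDs from this one example.

```python
# QIDs taken from the example lexeme; a real run would need a fuller table.
QID_LABELS = {
    "Q1084": "noun",
    "Q110786": "singular",
    "Q146786": "plural",
}


def dictionary_lines(lexeme, languagecode="en"):
    """Yield 'word: features pos' lines for each form of one lexeme."""
    pos = QID_LABELS.get(lexeme.get("lexicalCategory"), "")
    for form in lexeme.get("forms", []):
        representation = form.get("representations", {}).get(languagecode)
        if representation is None:
            continue  # no spelling in the requested language
        features = [QID_LABELS[q] for q in form.get("grammaticalFeatures", [])
                    if q in QID_LABELS]
        yield f"{representation['value']}: {' '.join(features + [pos]).strip()}"


lexeme = {
    "lexicalCategory": "Q1084",
    "forms": [
        {"representations": {"en": {"language": "en", "value": "apple"}},
         "grammaticalFeatures": ["Q110786"]},
        {"representations": {"en": {"language": "en", "value": "apples"}},
         "grammaticalFeatures": ["Q146786"]},
    ],
}
for line in dictionary_lines(lexeme):
    print(line)
# apple: singular noun
# apples: plural noun
```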

grhoten · 2025-01-14T16:24:48Z

data/tools/extract-wikidata.py

+  'Q3438770': 'Karai-karai',
+  'Q9176': 'Korean',
+  'Q9237': 'Malay',
+  'Q25167': 'Norwegian Boksmal',


This should be Norwegian Bokmål (or Bokmal).

nciric · 2025-01-14T16:51:42Z

dictionary_en.lst

 apple: singular vowel-start noun inflection=1
 apples: plural noun inflection=1

inflectional_en.xml

    <pattern name="1" words="224929">
        <pos>noun</pos>
        <suffix/>
        <inflections>
            <inflection number="singular"><t><stem/></t></inflection>
            <inflection number="plural"><t><stem/>s</t></inflection>
        </inflections>
    </pattern>

How do you generate inflection=1 in dictionary_en.lst, and how do you generate name="1" in the XML file?

I assume there's some kind of analysis of nouns to slot them into groups with similar/same grammatical rules for inflection (which is the hard part of the problem). Do we need to run some Java tool? Is the long-term plan to replace it with Python?
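
To make the question concrete, this is the kind of grouping I have in mind. It is purely an illustration of the idea and not a description of how the existing tooling works: words whose forms differ from a shared stem by the same suffixes get the same pattern number. The function names and the sample data are made up.

```python
import os


def suffix_signature(forms):
    """Reduce {'singular': 'apple', 'plural': 'apples'} to its suffix pattern."""
    stem = os.path.commonprefix(list(forms.values()))
    return tuple(sorted((feature, word[len(stem):])
                        for feature, word in forms.items()))


def assign_patterns(lexemes):
    """Give lexemes with the same suffix signature the same pattern number."""
    pattern_ids = {}
    result = {}
    for lemma, forms in lexemes.items():
        signature = suffix_signature(forms)
        # First time a signature is seen, it receives the next free number.
        pattern_id = pattern_ids.setdefault(signature, len(pattern_ids) + 1)
        result[lemma] = pattern_id
    return result


lexemes = {
    "apple": {"singular": "apple", "plural": "apples"},
    "pear": {"singular": "pear", "plural": "pears"},
    "box": {"singular": "box", "plural": "boxes"},
}
print(assign_patterns(lexemes))
# {'apple': 1, 'pear': 1, 'box': 2}
```

Real data would need more than a common-prefix stem (stem changes, irregular forms), which is presumably where the hard part lies.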

Create extract-wikidata.py

7d31333

Script to turn Wikidata lexemes into a dictionary for Unicode inflections. Based on unicode-org#9

grhoten approved these changes Jan 13, 2025

nciric reviewed Jan 13, 2025

nciric approved these changes Jan 13, 2025

grhoten reviewed Jan 14, 2025
