Create extract-wikidata.py #45
Script to turn Wikidata lexemes into a dictionary for Unicode inflections. Based on unicode-org#9
I think that this is a good starting point. After this is merged, I can take a crack at this script to integrate some of the logic from dictionary-parser into this script. The code2qid and qid2name should probably be made external to this script, since those are likely to be updated, especially when adding a new language.
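One way to make code2qid and qid2name external, sketched under assumptions: the JSON file layout and the function name load_language_tables are hypothetical and not part of this PR. The idea is that adding a language then only means editing a data file.

```python
import json
from pathlib import Path


def load_language_tables(path):
    """Load the code -> QID and QID -> name tables from a JSON file.

    Assumed (hypothetical) layout:
    {"code2qid": {"ko": "Q9176"}, "qid2name": {"Q9176": "Korean"}}
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    code2qid = data["code2qid"]
    qid2name = data["qid2name"]
    # Fail early if a language code points at a QID that has no display name.
    missing = sorted(set(code2qid.values()) - set(qid2name))
    if missing:
        raise ValueError(f"QIDs without a name: {missing}")
    return code2qid, qid2name
```

The consistency check is optional, but it catches the common mistake of adding a language code without its name entry.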
A list of all available versions can be found on the Web here:
https://dumps.wikimedia.org/wikidatawiki/entities/

Or you can download the most recent version using the following command:
I would put this as the top option, with a comment on how to download a specific one (by date), so essentially flip the order you have now.
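A small sketch of how the script could document both options. The base URL is the one from the docstring; the file names (latest-lexemes.json.bz2 and wikidata-YYYYMMDD-lexemes.json.bz2) are assumptions about the dump layout and should be checked against the listing, and lexeme_dump_url is a hypothetical helper.

```python
BASE_URL = "https://dumps.wikimedia.org/wikidatawiki/entities/"


def lexeme_dump_url(date=None):
    """Return the URL of a lexeme dump.

    With no date, point at the most recent dump; with a YYYYMMDD string,
    point at the dump for that day. Both file name patterns are assumed.
    """
    if date is None:
        return BASE_URL + "latest-lexemes.json.bz2"
    if len(date) != 8 or not date.isdigit():
        raise ValueError("date must look like YYYYMMDD")
    return f"{BASE_URL}{date}/wikidata-{date}-lexemes.json.bz2"
```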
formcount = 0
errorcount = 0

code2qid = {
I assume these are the permitted ones from above (line 22)?
After that run this script in the dictionary in which the lexemes.json.bz2 file is located.
The result will be a dictionary_xxx.lst file for the given language.
Per default the language is 'en', but you can use one of the permitted language code insteads.
Maybe:
The language can be specified on the command line; the default is 'en'. Allowed languages are listed below in the code2qid object.
dictionary = {}

languagecode = 'en'
The current best practice for parsing command line args in Python is argparse: https://docs.python.org/3/library/argparse.html.
I can fix that once you merge (or you can in this round). Let me know.
Here's an example of how to fix it:
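import argparse

def main():
    parser = argparse.ArgumentParser(description="Process a language code.")
    parser.add_argument("language_code", nargs="?", default="en", help="The language code (e.g., 'en', 'es', 'fr'). Defaults to 'en'.")
    # Parse sys.argv; language_code falls back to 'en' when no argument is given.
    args = parser.parse_args()
    # process_language stands in for the script's existing extraction logic.
    process_language(args.language_code)

if __name__ == "__main__":
    main()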
The if __name__ == "__main__": part helps when this is a program run from the command line vs. being imported as a module; it allows both ways of using it. I can refactor once you merge; the important part is arg parsing.
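A possible extension, sketched under assumptions: argparse can also reject language codes that are not in the table via choices. CODE2QID below is a hypothetical three-entry stand-in for the script's code2qid, and the 'ms' key for Malay is assumed, not taken from the PR.

```python
import argparse

# Hypothetical subset of the script's code2qid table, for illustration only.
CODE2QID = {"en": "Q1860", "ko": "Q9176", "ms": "Q9237"}


def parse_args(argv=None):
    """Parse the optional language code, allowing only known codes."""
    parser = argparse.ArgumentParser(
        description="Extract a dictionary from Wikidata lexemes."
    )
    parser.add_argument(
        "language_code",
        nargs="?",
        default="en",
        choices=sorted(CODE2QID),  # anything else exits with a usage error
        help="Language code to extract. Defaults to 'en'.",
    )
    # argv=None makes argparse read sys.argv[1:]; a list is handy for tests.
    return parser.parse_args(argv)
```

Deriving choices from the table keeps the permitted codes in one place, so the help text never drifts from what the script actually supports.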
Please rename extract-wikidata.py to extract_wikidata.py. Python's import statement can't handle a module name containing a - sign, so this file can't become a module in the future, e.g. you can't:
import extract-wikidata
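To illustrate the point, a minimal self-contained demo (demo_hyphenated_import is a hypothetical name, and the temporary file only stands in for the real script). The import statement fails at parse time because the hyphen reads as a minus sign; importlib.import_module does accept the name as a string, but that is an awkward workaround, so the rename is still the simpler fix.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_hyphenated_import():
    """Return (statement_ok, importlib_ok) for a file named extract-wikidata.py."""
    # The import statement parses "extract-wikidata" as "extract minus wikidata".
    try:
        compile("import extract-wikidata", "<demo>", "exec")
        statement_ok = True
    except SyntaxError:
        statement_ok = False

    # importlib takes the module name as a string, so the hyphen gets through.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "extract-wikidata.py").write_text("VALUE = 42\n", encoding="utf-8")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("extract-wikidata")
            importlib_ok = module.VALUE == 42
        finally:
            # Leave the interpreter state as it was before the demo.
            sys.path.remove(tmp)
            sys.modules.pop("extract-wikidata", None)
    return statement_ok, importlib_ok
```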
FYI this is an important part of the Wikidata structure.
Based on this information, the following should be created: dictionary_en.lst and inflectional_en.xml
'Q3438770': 'Karai-karai',
'Q9176': 'Korean',
'Q9237': 'Malay',
'Q25167': 'Norwegian Boksmal',
This is Norwegian Bokmål or Bokmal.
How do you generate inflection=1 in dictionary_en.lst, and how do you generate the name="1" in xml file? I assume there's some kind of analysis of nouns to slot them into groups of similar/same grammatical rules for inflection (which is the hard part of the problem). Do we need to run some Java tool? Is the long term plan to replace it with Python?