Loading Lexeme by ID doesn't work #37

Closed
soshial opened this issue Jul 4, 2018 · 12 comments

Comments

@soshial

soshial commented Jul 4, 2018

I tried to use function calls, as you mentioned here:

    Analyzer analyzer = new Analyzer(true);
    Lexeme lexeme = analyzer.lexemeByID(43716);
    List<Wordform> wordforms = analyzer.generateInflections(lexeme);

The returned lexeme is null. What did I do wrong? Thank you, @PeterisP !

@PeterisP
Owner

PeterisP commented Jul 4, 2018

From 2.x on, we're switching to a lexicon derived automatically from the tēzaurs.lv data export. The current default lexicon setup (Lexicon_v2.xml) does not have a lexeme with ID 43716.

Explicitly loading the old lexicon with new Analyzer("Lexicon.xml") might work; alternatively, use lexeme IDs from tezaurs_lexemes.json.
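
The point that a lexeme ID is only meaningful within one lexicon file can be sketched roughly as follows. This is a stand-in, not the morphology library: each "lexicon" is a plain map, the class and method names are hypothetical, and the IDs and lemmas are made up for illustration. It only shows the idea of checking which lexicon (if any) knows an ID before relying on the lookup result.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Stand-in sketch: a "lexicon" here is just a map from lexeme ID to lemma.
// None of these names or IDs come from the morphology library itself.
final class LexiconLookupSketch {
    /** Returns the name of the first lexicon that contains the ID, if any. */
    static Optional<String> findLexiconWithId(
            Map<String, Map<Integer, String>> lexicons, int lexemeId) {
        for (Map.Entry<String, Map<Integer, String>> entry : lexicons.entrySet()) {
            if (entry.getValue().containsKey(lexemeId)) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the order in which the lexicons are tried.
        Map<String, Map<Integer, String>> lexicons = new LinkedHashMap<>();
        lexicons.put("Lexicon_v2.xml", Map.of(1, "made-up-lemma-a"));
        lexicons.put("Lexicon.xml", Map.of(2, "made-up-lemma-b"));

        System.out.println(findLexiconWithId(lexicons, 2).orElse("not found"));
        System.out.println(findLexiconWithId(lexicons, 3).orElse("not found"));
    }
}
```

With the real library, the equivalent precaution would be to check the value returned by lexemeByID for null before passing it on.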

@soshial
Author

soshial commented Jul 4, 2018

Thanks for the explanation. I took the lexeme ID from here (this is probably lexicon v2) and tried with both Lexicon.xml and Lexicon_v2.xml. Is this the right workflow for getting a Lexeme by its ID?

@soshial
Author

soshial commented Jul 7, 2018

When will the switch to v2 happen? Does the web functionality run the same version and database that this morphology library currently has?

For example, your /analyze/ API and the Java analyzer.analyze(...) give different results. The web API is missing some simple words such as "kino", "atelje", "spēlētājs".

Also, will this library report whether a verb is perfective or imperfective (pabeigta/nepabeigta veida)?

@lauma
Collaborator

lauma commented Jul 7, 2018

Pabeigtība (perfectivity) in Latvian is not a verb feature that can be easily derived from morphological elements like endings and infixes, so even if it were there (I doubt it), you should not trust it or use it.

@soshial
Author

soshial commented Jul 7, 2018

I clearly understand that. But there is other information that cannot be derived from the word form alone:

  • Īpašvārds vs. sugas vārds (proper vs. common noun)
  • Transitivitāte (transitivity)
  • Īpašības vārda tips (adjective type)
  • Apstākļa vārda tips (adverb type)

Yet this information is in the lexicon. That is why I am asking: does this information exist anywhere at all (in some dictionaries)?

@lauma
Collaborator

lauma commented Jul 8, 2018

Some of the Tēzaurs sources contain transitivity; see "trans." or "intrans." near the head of verb entries. For transitivity there is also a more or less straightforward syntactic check: transitivity basically means that the verb can be used together with an object:

  • dzert alu (to drink beer), lasīt grāmatu (to read a book) - OK,
  • spēlēties ... (to play) - not OK,
  • gulēt ... miegu? ... (to sleep ... a sleep?) - only rarely.

Meanwhile, pabeigtība in Latvian is much fuzzier and vaguer than transitivity, so I don't think it will end up in Tēzaurs. Because some languages have a morphological distinction for it, linguists around the world do speak of such a feature, but for Latvian it is purely semantic.

Types of adverbs and adjectives, I think, mostly come from grammar books, where it usually goes like this: one type can be enumerated and everything else goes into the other type. As the tagset used in korpuss.lv requires this feature, Tēzaurs will probably be augmented with it eventually.

When it comes to proper/common nouns: well, traditional dictionaries like LLVV or MLVV just do not include proper nouns. Tēzaurs sometimes contains the markings "vietv." or "persv.", e.g., http://tezaurs.lv/#/sv/Liepa, but as with everything in Tēzaurs, coverage is partial. As with adjective/adverb type, some augmentation will eventually be done, but I don't know when or to what extent yet. For now, the use of capital letters in the entry word is quite telling: if it contains at least one capital letter anywhere, it is either an abbreviation or a proper noun, but not your average common noun.
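
The capital-letter heuristic above could be sketched in Java along these lines. This is only an illustration of the rule of thumb, not part of Tēzaurs or the morphology library; the class and method names are hypothetical.

```java
// Sketch of the rule of thumb: an entry word with at least one capital
// letter anywhere is an abbreviation or a proper noun, not a common noun.
// Hypothetical helper; not an API of Tezaurs or the morphology library.
final class EntryWordHeuristic {
    /** True if the entry word contains at least one uppercase letter. */
    static boolean looksLikeProperNounOrAbbreviation(String entryWord) {
        // codePoints() plus Character.isUpperCase covers non-ASCII letters too.
        return entryWord.codePoints().anyMatch(Character::isUpperCase);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Liepa", "liepa", "kino", "LLVV"}) {
            System.out.println(word + " -> " + looksLikeProperNounOrAbbreviation(word));
        }
    }
}
```

The check says nothing about which of the two cases (abbreviation or proper noun) applies, and it inherits the partial coverage of the entry list it is run on.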

@soshial
Author

soshial commented Jul 9, 2018

Thank you very much. I didn't know that the information I listed hasn't been fully added to tezaurs. Is it possible to add pabeigtība as a parameter, so that I can try to add it to some verbs? I think I might have an idea how to do that without manual work.

@lauma
Collaborator

lauma commented Jul 10, 2018

Umm, how do you plan to obtain such info? Usually even linguists struggle to assign pabeigtība unambiguously.

@soshial
Author

soshial commented Jul 10, 2018

If we had a Latvian-Russian dictionary in electronic form, we might be able to parse verb entries for the presence of double verbs, for example:

[Image: screenshot taken 2018-07-10 at 14:21:47]

Compare the imp.-perf. pair "встречаться-встретиться" vs. the imp.-only "нравиться". We could mine this data and add it to tezaurs. Without doubt, this would be raw, preliminary data that needs linguists' approval, but couldn't it be a good start?

@lauma
Collaborator

lauma commented Jul 10, 2018

I'm not convinced:

  1. Translation never happens one word <-> one word; it is always about matching some subset of all the possible meanings each word can have in each language. If a word in Russian has a morphologically marked perfect, that does not mean all the meanings of the "corresponding" Latvian verb will denote a finished action semantically.
  2. Various prefixes and verb/participle forms can affect perfect/imperfect.
  3. Generally, perfect/imperfect is a feature Latvian simply does not have, in the same way that Latvian has exactly two grammatical genders while Russian has three, or that Latvian has some verb tenses Russian doesn't, etc. Thus the applicability of such annotation is very limited.

@soshial
Author

soshial commented Oct 6, 2019

Thank you for your explanation. I will try to accept the fact that this parameter is much more ephemeral in Latvian than in Slavic languages. No need to force it into a Procrustean bed =D

But in some cases, like "izdarīt", we can always say that it is used only as perfective, can't we?

@soshial
Author

soshial commented Oct 6, 2019

Returning to the original question: I eventually got it to work (although it is odd that this method requires the lemma explicitly when the lexeme must already contain it):

    List<Wordform> wordforms = analyzer.generateInflections(lexeme, lexeme.getValue("Pamatforma"));

@PeterisP PeterisP closed this as completed Nov 3, 2022