greek_lemmatized_sents.pickle contains a few hundred lemmas that do not begin with an alphabetic character, even though the words they belong to do. In most cases, the stray leading characters are punctuation, or combining characters wrongly placed before the first letter of the word instead of after it.
One example is ἑκάεργον mapping to ʽἑκάεργος, where the latter string is:
U+02BD MODIFIER LETTER REVERSED COMMA
U+1F11 GREEK SMALL LETTER EPSILON WITH DASIA
U+03BA GREEK SMALL LETTER KAPPA
U+03AC GREEK SMALL LETTER ALPHA WITH TONOS
U+03B5 GREEK SMALL LETTER EPSILON
U+03C1 GREEK SMALL LETTER RHO
U+03B3 GREEK SMALL LETTER GAMMA
U+03BF GREEK SMALL LETTER OMICRON
U+03C2 GREEK SMALL LETTER FINAL SIGMA
Another example is Ἄρτεμιν mapping to ῎ἄρτεμις, where the latter string is:
U+1FCE GREEK PSILI AND OXIA
U+1F04 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA
U+03C1 GREEK SMALL LETTER RHO
U+03C4 GREEK SMALL LETTER TAU
U+03B5 GREEK SMALL LETTER EPSILON
U+03BC GREEK SMALL LETTER MU
U+03B9 GREEK SMALL LETTER IOTA
U+03C2 GREEK SMALL LETTER FINAL SIGMA
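Listings like the two above can be produced with the standard `unicodedata` module. The following is a minimal sketch; the helper name `codepoints` is made up for illustration and is not part of CLTK.

```python
import unicodedata


def codepoints(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# First two code points of the faulty lemma from the first example,
# written as escapes so the stray leading character is visible in the source.
for line in codepoints("\u02bd\u1f11"):
    print(line)
```

Writing the string with `\u` escapes avoids any doubt about which code point comes first, which is hard to see when the characters are rendered.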
To get a list of the affected mappings:
>>> import os.path, pickle, unicodedata
>>> with open(os.path.expanduser("~/cltk_data/grc/model/grc_models_cltk/lemmata/backoff/greek_lemmatized_sents.pickle"), "rb") as f:
...     model = pickle.load(f)
>>> len(model)
33555
>>> len([(word, lemma) for sent in model for (word, lemma) in sent if unicodedata.category(word[0]) in ("Ll", "Lu") and not unicodedata.category(lemma[0]) in ("Ll", "Lu")])
417
>>> for x in sorted((word, lemma) for sent in model for (word, lemma) in sent if unicodedata.category(word[0]) in ("Ll", "Lu") and not unicodedata.category(lemma[0]) in ("Ll", "Lu")): print(x)
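To see which kinds of characters are being prepended, the same filter can be wrapped in a function and the leading character of each affected lemma tallied by code point and general category. This is only a sketch: the names `affected_mappings` and `leading_char_tally` are hypothetical, the `sample` data is a hand-built stand-in for the real model, and unlike the one-liners above it skips empty strings instead of raising.

```python
import unicodedata
from collections import Counter

LETTER_CATEGORIES = ("Ll", "Lu")


def affected_mappings(sents):
    """Yield (word, lemma) pairs where the word starts with a letter
    (category Ll or Lu) but the lemma does not."""
    for sent in sents:
        for word, lemma in sent:
            if not word or not lemma:
                continue  # assumption: ignore empty strings rather than fail
            if (unicodedata.category(word[0]) in LETTER_CATEGORIES
                    and unicodedata.category(lemma[0]) not in LETTER_CATEGORIES):
                yield word, lemma


def leading_char_tally(pairs):
    """Count affected lemmas by (code point, general category) of their first character."""
    return Counter(
        (f"U+{ord(lemma[0]):04X}", unicodedata.category(lemma[0]))
        for _, lemma in pairs
    )


# Hypothetical stand-in for the unpickled model: a list of sentences,
# each a list of (word, lemma) tuples.
sample = [[
    ("\u1f11\u03ba\u03ac\u03b5\u03c1\u03b3\u03bf\u03bd",
     "\u02bd\u1f11\u03ba\u03ac\u03b5\u03c1\u03b3\u03bf\u03c2"),  # faulty lemma
    ("\u03ba\u03b1\u03af", "\u03ba\u03b1\u03af"),                # unaffected pair
]]

for key, count in leading_char_tally(affected_mappings(sample)).most_common():
    print(key, count)
```

Passing the unpickled `model` in place of `sample` should give the breakdown for the real data.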
This analysis stems from sasansom/sedes#72. I am using commit 94c04ac.