Breaking changes #43
Hi,
I was thinking of using Tag much more broadly, for example to show roots in
Malay, irregular (broken) plurals in Arabic, voweled and vowelless variants
in Hebrew and so forth. So I don't think it can be replaced by just script.
On Mon, Feb 8, 2021 at 2:19 PM Michael Wayne Goodman wrote:
This issue is meant to collect the changes we would like to make to WN-LMF
but have not because doing so would break backward compatibility. When we
get to a 2.0 version we have a chance for some simplification and
belt-tightening, so it would be a shame if we miss some and have to wait for
the next major version.
For better discussion, these issues could be broken up into separate
issues (maybe with an appropriate label or milestone to group them?).
Deferred Changes
These are changes we would have made in WN-LMF 1.1 if backwards
compatibility were not an issue.
- Remove <SyntacticBehaviour> from <LexicalEntry>; it became a child
of <Lexicon>
- Remove the senses attribute from <SyntacticBehaviour>; these
associations are handled by the subcat attribute on <Sense> elements
- Make the id attribute on <SyntacticBehaviour> required
Proposed Changes
These are new changes that we might consider
- Remove <Tag>? The use case presented in Bond et al. 2020 ("Some Issues
with Building a Multilingual Wordnet") seems more elegantly handled by the
script attribute on <Lemma> and <Form>:
<Lemma writtenForm="头发" partOfSpeech="n" script="Hans" />
<Form writtenForm="頭髮" script="Hant" />
<Form writtenForm="tóufa" script="Latn-pinyin" />
<Form writtenForm="tou2fa5" script="Latn-pinyin-x-numeric" />
<Form writtenForm="toufa" script="Latn-pinyin-x-simple" />
Above, if script were limited to ISO15924 script names, then all 3
pinyin variants would be just "Latn", so I used BCP-47-like tags minus the
language and region names. The "pinyin" variant and private-use tags
"numeric" and "simple" can be used to distinguish them.
--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
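The script values in the quoted proposal (for example "Latn-pinyin-x-numeric") can be taken apart mechanically. The following is a minimal sketch, assuming the layout used in those examples: a four-letter script code first, then optional variant subtags, then an optional "x" that introduces private-use subtags. The function name `split_script_tag` is hypothetical and not part of WN-LMF or any wordnet library.

```python
import re

# ISO 15924 script codes are four letters, conventionally title-cased (e.g. "Hans").
SCRIPT_RE = re.compile(r"^[A-Za-z]{4}$")


def split_script_tag(tag):
    """Split a BCP-47-like script value into (script, variants, private_use).

    Assumption: the value starts with a four-letter script code, followed by
    optional variant subtags, followed by an optional "x" singleton that
    introduces private-use subtags.
    """
    subtags = tag.split("-")
    script = subtags[0]
    if not SCRIPT_RE.match(script):
        raise ValueError(f"expected a four-letter script code, got {script!r}")
    rest = subtags[1:]
    if "x" in rest:
        i = rest.index("x")
        variants, private_use = rest[:i], rest[i + 1:]
    else:
        variants, private_use = rest, []
    return script.title(), variants, private_use


for value in ("Hans", "Latn-pinyin", "Latn-pinyin-x-numeric"):
    print(value, "->", split_script_tag(value))
```

With this split, a consumer that only understands ISO 15924 codes could fall back to the first component, while one that knows the variants could still tell the three pinyin forms apart.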
|
I agree with Francis. I would very much like to keep the Tag to store
flexible annotations on Lemmas and Forms.
These won't be meaningful for OMW (as it reads the LMF) but they can be
displayed as a list of tag-values.
Also, if projects keep using Tag as a flexible layer to store information,
OMW can also better understand what special "tags" could be embedded in the
DTD as 'officially supported' with an agreed upon format/meaning.
|
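@fcbond, @lmorgadodacosta thanks for the context. I haven't seen tags used at all aside from in the paper, so if there's a good and active use case (except the "script" one, for which I stand by my previous statement) then it makes sense to leave it in. For instance, I've been wondering how to distinguish various lemmas+forms in EWN, like stimulus/stimuli. Could be with <Tag>:
<Lemma partOfSpeech="n" writtenForm="stimulus" />
<Form writtenForm="stimuli">
<Tag category="number">PL</Tag>
</Form>
Relatedly, I've been wondering which elements of WN-LMF are meant for modeling a language's wordnet and which are for peripheral annotation tasks or processes. For instance, <Count> doesn't really model something true about a language, but something that can be computed for some corpora, so why is this part of WN-LMF? And <ILIDefinition> is only used when a wordnet is the vehicle by which new ILI candidates are proposed, otherwise those definitions are included with the ILI resource, so it seems like there could be another channel for proposing candidates (e.g., by creating issues at https://github.com/globalwordnet/cili/).
|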
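To illustrate how a consumer could surface such tags as a plain list of category/value pairs, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The stimulus/stimuli fragment is the one from the comment above; the enclosing <LexicalEntry> element, its id value, and the function name `forms_with_tags` are assumptions added only to make the example self-contained.

```python
import xml.etree.ElementTree as ET

# The fragment from the comment, wrapped in a <LexicalEntry> so that it is
# a single well-formed XML document. The id value is made up.
SNIPPET = """
<LexicalEntry id="example-stimulus-n">
  <Lemma partOfSpeech="n" writtenForm="stimulus" />
  <Form writtenForm="stimuli">
    <Tag category="number">PL</Tag>
  </Form>
</LexicalEntry>
"""


def forms_with_tags(entry):
    """Map each form's writtenForm to a list of (category, value) pairs."""
    result = {}
    for form in entry.iter("Form"):
        tags = [
            (tag.get("category"), (tag.text or "").strip())
            for tag in form.findall("Tag")
        ]
        result[form.get("writtenForm")] = tags
    return result


entry = ET.fromstring(SNIPPET)
print(forms_with_tags(entry))
```

Nothing here interprets the category or the value; the sketch only shows that free-form tags can be read and displayed without the reader knowing what "number" or "PL" means.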
Hi,
I think frequency information is a part of knowledge of language. Any
corpus count is only an imperfect sample, but I would rather make available
what we have when we have it.
For the ILI I think we tried to get a balance between purely modelling and
generally useful. We only want candidates that come with a wordnet, and
packaging them together makes this easier to manage.
On Mon, Feb 8, 2021 at 3:57 PM Michael Wayne Goodman wrote:
@fcbond, @lmorgadodacosta thanks for the context. I haven't
seen tags used at all aside from in the paper, so if there's a good and
active use case (except the "script" one, for which I stand by my previous
statement) then it makes sense to leave it in. For instance, I've been
wondering how to distinguish various lemmas+forms in EWN, like *stimulus*/
*stimuli*. Could be with <Tag>:
<Lemma partOfSpeech="n" writtenForm="stimulus" />
<Form writtenForm="stimuli">
<Tag category="number">PL</Tag>
</Form>
Relatedly, I've been wondering which elements of WN-LMF are meant for
modeling a language's wordnet and which are for peripheral annotation tasks
or processes. For instance, <Count> doesn't really model something true
about a language, but something that can be computed for some corpora, so
why is this part of WN-LMF? And <ILIDefinition> is only used when a
wordnet is the vehicle by which new ILI candidates are proposed, otherwise
those definitions are included with the ILI resource, so it seems like
there could be another channel for proposing candidates (e.g., by creating
issues at https://github.com/globalwordnet/cili/).
|
Sorry, I think my "something true" comment wasn't accurate. I was trying to draw a line between "gold", human-added information and the automatically computed information. I think the line is even blurrier because those computed counts are, I think, from human annotations. So you have this information and you'd like to make it available. That's great, but I still think it would be better as a separate resource, similar to how the information-content (IC) data files are distributed separately. It's also easier that way to track where the counts came from, e.g., in a file called
Also, practically, I have not seen any wordnets distributed with this information (I suspect you use it internally for annotation projects), and trying to model it properly in Wn complicates the database schema and code. I guess I'm arguing for a worse-is-better approach.
My position here is essentially the same as my last argument regarding schema/code complexity. It seems like the format has been refitted with a feature that's only relevant for CILI's development and not for modeling a wordnet. A proposed ILI with synset definition
ewn-05698967-n the barrier preventing Blacks from participating in various activities with whites
ewn-05822120-n (plural) something that reminds you of someone or something
...
Furthermore, we cannot express in a DTD that
|
Also note that I've updated the original issue text. I added some attributes as candidates for removal. I understand that they had some original purpose but I don't see evidence of their use, so it's worth discussing whether they can be removed. Generally, though, these attributes are relatively simple to model in the database and they can just not appear in the XML when unused, but they can still cause surprises (e.g., see here). |
This issue is meant to collect the changes we would like to make to WN-LMF but have not because doing so would break backward compatibility. When we get to a 2.0 version we have a chance for some simplification and belt-tightening, so it would be a shame if we miss some and have to wait for the next major version.
For better discussion, these issues could be broken up into separate issues (maybe with an appropriate label or milestone to group them?).
Deferred Changes
These are changes we would have made in WN-LMF 1.1 if backwards compatibility were not an issue.
- Remove <SyntacticBehaviour> from <LexicalEntry>; it became a child of <Lexicon>
- Remove the senses attribute from <SyntacticBehaviour>; these associations are handled by the subcat attribute on <Sense> elements
- Make the id attribute on <SyntacticBehaviour> required
Proposed Changes
These are new changes that we might consider
- Remove <Tag> (edit: in the comments below, a case is made for other uses of <Tag>)
  Original text: The use case presented in Bond et al. 2020 ("Some Issues with Building a Multilingual Wordnet") seems more elegantly handled by the script attribute on <Lemma> and <Form>:
  <Lemma writtenForm="头发" partOfSpeech="n" script="Hans" />
  <Form writtenForm="頭髮" script="Hant" />
  <Form writtenForm="tóufa" script="Latn-pinyin" />
  <Form writtenForm="tou2fa5" script="Latn-pinyin-x-numeric" />
  <Form writtenForm="toufa" script="Latn-pinyin-x-simple" />
  Above, if script were limited to ISO15924 script names, then all 3 pinyin variants would be just "Latn", so I used BCP-47-like tags minus the language and region names. The "pinyin" variant and private-use tags "numeric" and "simple" can be used to distinguish them.
- Remove <Count>? (see comments below)
- Remove <ILIDefinition>? (see comments below)
- Remove (apparently) unused attributes?
  - sourceSense on <Definition>
  - lexicalized on <Sense> and <Synset>
  - status on anything with metadata
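Whether the attributes listed above as removal candidates are really unused could be checked against existing wordnet files. The following is a minimal sketch, assuming a WN-LMF XML file can be loaded with Python's standard `xml.etree.ElementTree`; the sample document is a made-up fragment (not valid against the DTD), and `CANDIDATES` and `count_attribute_use` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Candidate attributes from the list above, keyed to the elements they sit on.
# "status" is listed for "anything with metadata", so None means "any element".
CANDIDATES = {
    "sourceSense": {"Definition"},
    "lexicalized": {"Sense", "Synset"},
    "status": None,
}


def count_attribute_use(root):
    """Count how often each candidate attribute appears, per element name."""
    counts = Counter()
    for elem in root.iter():
        for attr, elements in CANDIDATES.items():
            if attr in elem.attrib and (elements is None or elem.tag in elements):
                counts[(elem.tag, attr)] += 1
    return counts


# Made-up fragment for illustration only.
SAMPLE = """
<LexicalResource>
  <Lexicon id="demo">
    <LexicalEntry id="demo-w1">
      <Lemma writtenForm="dog" partOfSpeech="n" />
      <Sense id="demo-s1" synset="demo-ss1" lexicalized="false" />
    </LexicalEntry>
    <Synset id="demo-ss1" status="unchecked">
      <Definition sourceSense="demo-s1">a toy definition</Definition>
    </Synset>
  </Lexicon>
</LexicalResource>
"""

for (element, attribute), n in sorted(count_attribute_use(ET.fromstring(SAMPLE)).items()):
    print(f"{element}@{attribute}: {n}")
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`. An empty result across the published wordnets would support removal, while any hits would point to projects worth asking first.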