Feature request: Rule-based Annotator for Cardinal/Ordinal #1429
Comments
Is GPL vs ASL an issue for you? If you run a compatible POS tagger before the CoreNlpNamedEntityRecognizer (i.e. the CoreNlpPosTagger), then you can also get e.g. ORDINAL tags:
|
On Thu, Nov 21, 2019 at 9:58 AM Richard Eckart de Castilho < ***@***.***> wrote:
Is GPL vs ASL an issue for you?
Not for me. But it could be an issue for other users of this annotator.
If you run a compatible POS tagger before the CoreNlpNamedEntityRecognizer
(i.e. the CoreNlpPosTagger), then you can also get e.g. ORDINAL tags:
What about Cardinal?
|
Looks like they are simply tagged as NUMBER. |
On Thu, Nov 21, 2019 at 1:03 PM Richard Eckart de Castilho < ***@***.***> wrote:
JCas jcas = runTest("en", "John bought one hundred laptops .");
String[] ne = {
"[ 0, 4]Person(PERSON) (John)",
"[ 12, 15]NamedEntity(NUMBER) (one)",
"[ 16, 23]NamedEntity(NUMBER) (hundred)" };
Looks like they are simply tagged as NUMBER. I'm not sure if CARDINAL is
even produced by CoreNLP - references to it
<https://github.com/stanfordnlp/CoreNLP/search?p=1&q=CARDINAL&type=&utf8=%E2%9C%93>
never seem to be assignments.
Would float numbers like "2.1" in "I bought 2.1 kg of meat" also be tagged
as NUMBER? I am looking for something that would specifically tag Integers.
Alain
|
For this, you could look into RUTA patterns:
https://uima.apache.org/ruta.html
…-Torsten
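To illustrate the kind of rule meant here, below is a minimal plain-Java sketch (not a RUTA script) that marks digit tokens only when they are integers, so "2.1" is skipped while "100" is kept. The class name IntegerSpanFinder and the regular expression are assumptions made for this example; they are not part of DKPro Core, CoreNLP or RUTA. Number words such as "one hundred" and grouped numbers such as "1,000" are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of DKPro Core, CoreNLP or RUTA. */
final class IntegerSpanFinder {
    // A run of digits that is neither preceded by a digit, '.' or ',',
    // nor followed by a digit or by '.'/',' plus a digit.
    private static final Pattern INTEGER =
            Pattern.compile("(?<![\\d.,])\\d+(?![.,]?\\d)");

    /** Returns {begin, end} character offsets of integer-only digit tokens. */
    static List<int[]> find(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = INTEGER.matcher(text);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "I bought 2.1 kg of meat and 100 laptops .";
        for (int[] span : find(text)) {
            // Prints the offsets and covered text of each integer token.
            System.out.println(span[0] + "-" + span[1] + " " + text.substring(span[0], span[1]));
        }
    }
}
```

In a real annotator the offsets would be turned into annotations on the CAS; that step is left out because it depends on the type system in use.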
|
I have committed the extended test which combines the POS tagger and the NER from CoreNLP here: https://github.com/dkpro/dkpro-core/blob/4f8d74fdb003c90fdef8ccff7039a799ab471699/dkpro-core-corenlp-gpl/src/test/java/org/dkpro/core/corenlp/CoreNlpPosTaggerAndNamedEntityRecognizerTest.java Feel free to play around with it and test additional number types (durations, percentages, etc.). If I saw it correctly, the components you implemented work without requiring POS tags. If you would like to contribute them, it would be best if you create a PR. Since the classes depend on CoreNLP, the CoreNLP (GPL) module would be the best place for them. I saw in the CoreNLP code that there is also some support for normalizing quantities. If normalization is also something you are after, we might consider extending the DKPro Core type system with a way of storing such normalizations and transferring them out of components such as CoreNLP which produce them. |
On Fri, Nov 22, 2019 at 4:14 AM Richard Eckart de Castilho <
[email protected]> wrote:
I have committed the extended test which combines the Pos Tagger and the
NER from CoreNLP here:
https://github.com/dkpro/dkpro-core/blob/4f8d74fdb003c90fdef8ccff7039a799ab471699/dkpro-core-corenlp-gpl/src/test/java/org/dkpro/core/corenlp/CoreNlpPosTaggerAndNamedEntityRecognizerTest.java
Feel free to play around with it and test additional number types
(durations, percentages, etc.).
Thx I definitely will!
If I saw it correctly, the components you implemented work without
requiring POS tags.
That is correct. It might be an advantage as my understanding is that POS
tagging is relatively slow and requires a trained model. I will compare
performance of the two to confirm.
If you would like to contribute them, it would be best if you create a PR.
I am interested. What is a PR?
I saw in the CoreNLP code, that there is also some support for normalizing
quantities. If normalization is also something you are after, we might
consider extending the DKPro Core type system with a way of storing such
normalizations and to transfer them out of components such as CoreNLP which
produce them.
Yes! Note that I am interested in having a normalization attribute for all
NamedEntity types not just quantities. I don't know if the UIMA type system
can support that because the type of this normalisation attribute will
depend on the type of NamedEntity. For Cardinal and Ordinal it should be a
Long while for other quantities it should be a Double. For Location,
Person, Date and Time probably a String.
The problem with the UIMA type system is that afaik it would not allow you
to define an Object attribute in NamedEntity which could be overridden to
more specific types in subclasses. This is one of my pet peeves about the
typesystem. I don't know what possessed the good UIMA people to cook up
their own very limited type system instead of just allowing features to be
arbitrary java classes.
|
I'd probably simply use a string feature even for numeric/boolean values... IMHO having an "Object" attribute also isn't a great solution because it would also require type-casting. The equivalent to an "Object" attribute in UIMA would be a "Feature Structure"-type attribute which could then point to e.g. a to-be-defined "DoubleValue" Feature structure which simply has a feature "value" of the type "double". UIMAv3 also has new features to store custom objects in the CAS - but I have never tried this out so far: https://uima.apache.org/d/uimaj-current/version_3_users_guide.html#uv3.custom_java_objects - might be worth a look. As for contributing via a PR - see here: https://dkpro.github.io/contributing/ |
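The string-feature idea can be sketched as follows: the normalization is stored as a plain string and converted by the caller, who knows whether a Long, a Double or a String is expected. This is only an illustration in plain Java under that assumption; the class Normalization and its methods are hypothetical and do not exist in DKPro Core or UIMA.

```java
import java.util.Objects;

/** Hypothetical wrapper around a string-valued normalization feature. */
final class Normalization {
    private final String value;

    Normalization(String value) {
        this.value = Objects.requireNonNull(value);
    }

    /** The raw stored form, suitable for e.g. Person or Location. */
    String asString() {
        return value;
    }

    /** For Cardinal/Ordinal; throws NumberFormatException if not an integer. */
    long asLong() {
        return Long.parseLong(value);
    }

    /** For other quantities; throws NumberFormatException if not numeric. */
    double asDouble() {
        return Double.parseDouble(value);
    }

    public static void main(String[] args) {
        System.out.println(new Normalization("100").asLong());
        System.out.println(new Normalization("2.1").asDouble());
    }
}
```

The conversion happens at read time, so no cast from Object is needed, at the cost of a parse error surfacing only when the value is used.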
I wonder why time expressions, monetary expressions and so on are even considered as named entities / handled by the CoreNLP NER tools. They are not really entities... in particular not named ones. |
They are annotated in Ontonotes at least. |
On Fri, Nov 22, 2019 at 7:10 AM Richard Eckart de Castilho < ***@***.***> wrote:
I'd probably simply use a string feature even for numeric/boolean values...
That would do.
While you are at it, you might want to define another attribute called say,
altNormalizations, which would be a list of Strings. I build lots of NLP
apps that have a human in the loop. In those kinds of apps, it's often
useful to be able to provide the user with alternative plausible
interpretations of a piece of a text.
For example, for a location "Victoria", there are literally dozens of
likely places with that name. Usually you can tell from the application
context or the content of the document which one is referred to. But it is
also useful to be able to provide alternative normalizations. Actually,
even without a human in the loop, alternative location normalization can be
useful. For example, if you process a collection of documents that all have
to do with the Ebola virus, you might conclude that a reference to
"Victoria" refers to a location in Africa, even though that specific
document by itself does not have sufficient information to
conclude that (but the doc collection as a whole does).
Note that alternative normalizations can be useful not only for "proper"
named entities (Location, Person, Org, etc...) but also to quantities. For
example, a relative Date like "next Tuesday" cannot be normalized without
having first established a "reference date" (i.e. the date that you would
get if you replaced "next Tuesday" by "today"). But figuring out the
reference date can be tricky if it was not provided as part of the
document's metadata. Even if a reference date IS provided in the document's
metadata, there are scenarios where the reference date may change in the
course of the document. For example in this case:
"A bunch of things happened last week. On Monday, etc.."
In this case, the reference date for that excerpt is last week, not "this
week" (which would be the reference date that would have been provided in
the document's metadata).
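To make the dependence on the reference date concrete, here is a small sketch using java.time. It assumes that "next Tuesday" means the first Tuesday strictly after the reference date, which is only one possible reading; the class RelativeDateResolver is made up for this example.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

/** Hypothetical helper showing how the reference date drives the result. */
final class RelativeDateResolver {
    /** Resolves "next <day>" as the first such day strictly after referenceDate. */
    static LocalDate next(DayOfWeek day, LocalDate referenceDate) {
        return referenceDate.with(TemporalAdjusters.next(day));
    }

    public static void main(String[] args) {
        // Reference date taken from document metadata (a Friday).
        LocalDate fromMetadata = LocalDate.of(2019, 11, 22);
        // Reference date shifted one week back, as in the "last week" excerpt.
        LocalDate shifted = fromMetadata.minusWeeks(1);

        System.out.println(next(DayOfWeek.TUESDAY, fromMetadata)); // 2019-11-26
        System.out.println(next(DayOfWeek.TUESDAY, shifted));      // 2019-11-19
    }
}
```

The same expression yields two different normalizations, which is why keeping alternatives around can be useful when the reference date is uncertain.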
IMHO having an "Object" attribute also isn't a great solution because it
would also require type-casting.
The equivalent to an "Object" attribute in UIMA would be a "Feature
Structure"-type attribute which could then point to e.g. a to-be-defined
"DoubleValue" Feature structure which simply has a feature "value" of the
type "double".
Very awkward in my opinion. I am sure the UIMA people had a good reason for
inventing their own type system instead of just going with Java's, but I
have never seen an explanation of the rationale.
UIMAv3 also has new features to store custom objects in the CAS - but I
have never tried this out so far:
https://uima.apache.org/d/uimaj-current/version_3_users_guide.html#uv3.custom_java_objects
- might be worth a look.
Cool! I have been craving this ever since I started working with UIMA 4
years ago.
As for contributing via a PR - see here:
https://dkpro.github.io/contributing/
Ah, PR = Pull Request. Yes, I am already familiar with that process.
I probably will get going on that some time in January.
Alain
… |
On Fri, Nov 22, 2019 at 7:27 AM Richard Eckart de Castilho < ***@***.***> wrote:
I wonder why time expressions, monetary expressions and so on are even
considered as named entities / handled by the CoreNLP NER tools. They are
not really entities... in particular not named ones.
Yeah, I have the same issue. It seems to be a common misunderstanding in
the NLP community, not just the Stanford folks.
It seems the meaning of the term "Named Entity" has evolved to encompass
all "things and concepts from the physical world".
Alain
… |
UIMA is supposed to be cross-platform. There is a C++ implementation provided by the Apache UIMA project. There are also some outdated Python bindings and the more recent DKPro Cassis library which implements the CAS in Python. So just using the full Java type system wouldn't really do. |
On Fri, Nov 22, 2019 at 8:29 AM Richard Eckart de Castilho <
[email protected]> wrote:
Very awkward in my opinion. I am sure the UIMA people had a good reason for
inventing their own type system instead of just going with Java's, but I
have never seen an explanation of the rationale.
UIMA is supposed to be cross-platform. There is a C++ implementation
provided by the Apache UIMA project. There are also some outdated Python
bindings and the more recent DKPro Cassis library which implements the CAS
in Python. So just using the full Java type system wouldn't really do.
I thought that might be the reason but I wasn't sure since I had never
heard of a non-Java implementation.
Too bad the Python bindings are outdated. There are lots of excellent
Python NLP frameworks out there (spaCy in particular).
The next thing I wonder is why they didn't just make all
attributes JSON serialization strings instead of forcing devs to learn a
different, UIMA-specific (and in my view awkward) object serialization
framework. I understand that serialization and deserialization incur an
overhead, but you could cache the deserialized version so that this overhead
is only incurred once per attribute.
I have written many annotation wrappers that use this approach and it was a
lot easier to use than the UIMA feature structure system.
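The caching idea can be sketched like this: the attribute keeps its serialized string and deserializes it lazily on first access, then reuses the result. This is not the code of the wrappers mentioned above. CachedAttribute is a hypothetical name, and Long::parseLong stands in for a real JSON parser so that the example needs no external library. The sketch is not thread-safe.

```java
import java.util.function.Function;

/** Hypothetical attribute holder that deserializes its string form only once. */
final class CachedAttribute<T> {
    private final String serialized;
    private final Function<String, T> deserializer;
    private T cached;
    private boolean loaded;

    CachedAttribute(String serialized, Function<String, T> deserializer) {
        this.serialized = serialized;
        this.deserializer = deserializer;
    }

    /** Returns the deserialized value, parsing the string on the first call only. */
    T get() {
        if (!loaded) {
            cached = deserializer.apply(serialized);
            loaded = true;
        }
        return cached;
    }

    /** The string form that would be stored in the annotation. */
    String serialized() {
        return serialized;
    }

    public static void main(String[] args) {
        // Long::parseLong is a stand-in for a JSON deserializer.
        CachedAttribute<Long> count = new CachedAttribute<>("100", Long::parseLong);
        System.out.println(count.get()); // parsed here
        System.out.println(count.get()); // served from the cache
    }
}
```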
|
That's why we have built DKPro Cassis :) We use it amongst other things to connect tools such as spacy to the UIMA-based INCEpTION annotation editor. Wrt. object serialization - the best place to discuss this would be the UIMA user's mailing list. |
While DKPro has UIMA types for Cardinal and Ordinal, it seems there are no annotators that can produce them.
So I implemented my own CardOrdAnnotator for English based on the Stanford NLP QuantifiableEntityNormalizer class.
If you are interested, I could roll that into dkpro-core-api-ner-asl, or whatever module you think is appropriate.
I attach the classes and tests that I wrote for that. Note that you won't be able to run them as they use some utilities that I wrote for myself, but it should give you an idea of how they work.
Basically, the annotator uses a class CardOrdParser, which I wrote based on QuantifiableEntityNormalizer. This means that the annotator would have to be GPLed.
Note that at the moment, the parser is only available for English, but it would probably be relatively easy to implement it for other languages. To do that, however, we would have to rewrite (or extend) QuantifiableEntityNormalizer because its current implementation uses static variables to store the words for cardinals and ordinals (e.g. "first", "one", etc.). As a result, you cannot have different instances of QuantifiableEntityNormalizer for different languages. I guess we could rewrite QuantifiableEntityNormalizer altogether (using its code as "inspiration"). I am not sure whether that would be sufficient to remove the GPL constraint on CardOrdParser.
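As a rough sketch of what a rewrite without static word tables might look like, the class below keeps the cardinal and ordinal words in instance fields, so one instance per language can coexist. All names here (NumberWordLexicon, english(), french()) are hypothetical; this is not how QuantifiableEntityNormalizer or CardOrdParser is implemented, and it says nothing about the licensing question.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; none of these names exist in CoreNLP or DKPro Core. */
final class NumberWordLexicon {
    // Instance fields instead of static ones: each language gets its own tables.
    private final Map<String, Long> cardinals = new HashMap<>();
    private final Map<String, Long> ordinals = new HashMap<>();

    NumberWordLexicon cardinal(String word, long value) {
        cardinals.put(word.toLowerCase(Locale.ROOT), value);
        return this;
    }

    NumberWordLexicon ordinal(String word, long value) {
        ordinals.put(word.toLowerCase(Locale.ROOT), value);
        return this;
    }

    /** Returns the value of a cardinal word, or null if the word is unknown. */
    Long cardinalValue(String word) {
        return cardinals.get(word.toLowerCase(Locale.ROOT));
    }

    /** Returns the value of an ordinal word, or null if the word is unknown. */
    Long ordinalValue(String word) {
        return ordinals.get(word.toLowerCase(Locale.ROOT));
    }

    static NumberWordLexicon english() {
        return new NumberWordLexicon()
                .cardinal("one", 1).cardinal("two", 2)
                .ordinal("first", 1).ordinal("second", 2);
    }

    static NumberWordLexicon french() {
        return new NumberWordLexicon()
                .cardinal("un", 1).cardinal("deux", 2)
                .ordinal("premier", 1);
    }
}
```

A parser would then take a lexicon in its constructor, e.g. one built by NumberWordLexicon.english(), instead of reading shared static state.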
Let me know if you are interested.
CardOrdAnnotator_files.zip