add SCONJ to REMOVE_POS to exclude subordinating conjunction from mention span detection #276

Merged
merged 2 commits into from
Sep 7, 2020

Conversation

noelslice
Contributor

Using the same example input as in #215 (comment), there seems to be a spurious mention "than Shyam", because the subordinating conjunction "than" is not excluded during mention span detection.

This PR adds the SCONJ tag to the REMOVE_POS list.
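
To illustrate the idea, here is a minimal, self-contained sketch of how excluding a POS tag can remove "than" from a candidate span. It is not neuralcoref's actual code: the `trim_span` helper, the `(text, pos)` tuple representation, and the shortened `REMOVE_POS` set are hypothetical and chosen only for this example. It also assumes the tagger labels "than" as SCONJ, as described above.

```python
# Illustrative sketch only, not neuralcoref's implementation.
# REMOVE_POS here is a shortened, hypothetical set of Universal POS tags.
REMOVE_POS = {"CCONJ", "SCONJ", "PUNCT"}


def trim_span(tokens, remove_pos=REMOVE_POS):
    """Drop tokens at both edges of a candidate span whose POS tag is excluded.

    `tokens` is a list of (text, pos) tuples standing in for spaCy tokens.
    """
    start, end = 0, len(tokens)
    # Strip excluded tags from the left edge.
    while start < end and tokens[start][1] in remove_pos:
        start += 1
    # Strip excluded tags from the right edge.
    while end > start and tokens[end - 1][1] in remove_pos:
        end -= 1
    return tokens[start:end]


candidate = [("than", "SCONJ"), ("Shyam", "PROPN")]
print([text for text, _ in trim_span(candidate)])
```

With SCONJ in the set, the candidate "than Shyam" is reduced to "Shyam", which is already a mention, so no separate "than Shyam" span is left over.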

Test case:

from pprint import pprint

import spacy
import neuralcoref

# Load the large English model and add the coreference component
nlp = spacy.load('en_core_web_lg')
neuralcoref.add_to_pipe(nlp, greedyness=0.5)

doc = nlp(u'Ram and Shyam are good boys. Ram is older than Shyam. But, they are not friends.')

# Resolved clusters, then the pairwise scores for every detected mention
print(doc._.coref_clusters)
pprint(doc._.coref_scores)

Current output:

[Ram: [Ram, Ram], Shyam: [Shyam, Shyam]]
{Ram: {Ram: 1.775342583656311},
 Ram and Shyam: {Ram and Shyam: 1.7628642320632935, Ram: -1.576068639755249},
 Shyam: {Ram: -1.5397948026657104,
         Ram and Shyam: -1.5207256078720093,
         Shyam: 1.6105855703353882},
 good boys: {Ram: -1.5992192029953003,
             Ram and Shyam: -1.5002832412719727,
             Shyam: -1.6027263402938843,
             good boys: 1.738552212715149},
 Ram: {Ram: 7.551267623901367,
       Ram and Shyam: -0.8156640529632568,
       Shyam: -1.614872932434082,
       good boys: -1.514532446861267,
       Ram: 1.5904799699783325},
 than Shyam: {Ram: -1.5681246519088745,
              Ram and Shyam: -1.4285391569137573,
              Shyam: -1.5769987106323242,
              good boys: -1.5076005458831787,
              Ram: -1.5768016576766968,
              than Shyam: 1.704783320426941},
 Shyam: {Ram: -1.6349478960037231,
         Ram and Shyam: -1.1569286584854126,
         Shyam: 5.653580665588379,
         good boys: -1.526012897491455,
         Ram: -1.6253626346588135,
         than Shyam: -1.5083305835723877,
         Shyam: 1.242653489112854},
 they: {Ram: -2.0989551544189453,
        Ram and Shyam: -0.7402747869491577,
        Shyam: -2.3023903369903564,
        good boys: -1.5382691621780396,
        Ram: -2.296427011489868,
        than Shyam: -1.0285108089447021,
        Shyam: -2.670758008956909,
        they: 0.07739335298538208},
 friends: {Ram: -1.5777109861373901,
           Ram and Shyam: -1.5296742916107178,
           Shyam: -1.725807785987854,
           good boys: -1.5094072818756104,
           Ram: -1.5740591287612915,
           than Shyam: -1.5106748342514038,
           Shyam: -1.783818006515503,
           they: -1.5725568532943726,
           friends: 2.009723663330078}}

New output ("than Shyam" excluded):

[Ram: [Ram, Ram], Shyam: [Shyam, Shyam]]
{Ram: {Ram: 1.775342583656311},
 Ram and Shyam: {Ram and Shyam: 1.7629910707473755, Ram: -1.5760746002197266},
 Shyam: {Ram: -1.5397844314575195,
         Ram and Shyam: -1.5207990407943726,
         Shyam: 1.6113454103469849},
 good boys: {Ram: -1.5991358757019043,
             Ram and Shyam: -1.5002236366271973,
             Shyam: -1.602735996246338,
             good boys: 1.7384239435195923},
 Ram: {Ram: 7.543191909790039,
       Ram and Shyam: -0.8214647769927979,
       Shyam: -1.6146637201309204,
       good boys: -1.5146090984344482,
       Ram: 1.5892621278762817},
 Shyam: {Ram: -1.578922986984253,
         Ram and Shyam: -0.6316158771514893,
         Shyam: 7.046931266784668,
         good boys: -1.525830626487732,
         Ram: -1.813422441482544,
         Shyam: 1.1222282648086548},
 they: {Ram: -2.0966665744781494,
        Ram and Shyam: -0.29233384132385254,
        Shyam: -2.266399621963501,
        good boys: -1.5540210008621216,
        Ram: -2.2621068954467773,
        Shyam: -2.6278762817382812,
        they: 0.0765305757522583},
 friends: {Ram: -1.5773955583572388,
           Ram and Shyam: -1.5293686389923096,
           Shyam: -1.721515417098999,
           good boys: -1.5099279880523682,
           Ram: -1.5666728019714355,
           Shyam: -1.809272050857544,
           they: -1.5722771883010864,
           friends: 2.0099644660949707}}

The live demo also doesn't display this mention:

[Image: Screenshot from 2020-07-15 13-53-24]

@noelslice
Contributor Author

Disclaimer: I'm still not convinced the logic in extract_mentions_spans and _extract_from_sent is robust. I'm still working on my understanding of the code, and adding some test cases would help.

@svlandeg
Collaborator

svlandeg commented Sep 7, 2020

Thanks for this PR @noelslice! Looks good to me.
There are definitely parts of the code base that could use more test cases - all contributions welcome!

@svlandeg svlandeg merged commit 18c0f4c into huggingface:master Sep 7, 2020
@noelslice
Contributor Author

Thanks for having a look and merging this in @svlandeg !
