brat writer - multiline annotation #1247

Closed
parisni opened this issue Jun 16, 2018 · 7 comments

parisni commented Jun 16, 2018

Hi

Brat actually handles multiline annotations. For example, annotating the bold words ("first part of tokens" and "second part of tokens") in:

foo bar baz first part of tokens
second part of tokens foo bar baz

would result in the following brat annotation:

T205 dict4 959 971;972 974 first part of tokens second part of tokens

That is, the newline character from the original text is removed, and there are two spans separated by a ";".
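To make the line layout concrete, here is a minimal sketch of reading such a line into an ID, a type, a list of begin/end pairs, and the covered text. `BratLineSketch` is a hypothetical helper written for this thread, not a DKPro Core class. It assumes the three fields are separated by tab characters, as in the brat standoff format (the tabs show up as spaces in the example above).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of DKPro Core): parses one brat text-bound annotation line. */
final class BratLineSketch {
    final String id;
    final String type;
    final List<int[]> offsets = new ArrayList<>(); // each entry is {begin, end}
    final String text;

    BratLineSketch(String line) {
        // Assumption: ID, "TYPE OFFSETS" and the covered text are tab-separated.
        String[] fields = line.split("\t", 3);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 tab-separated fields: " + line);
        }
        id = fields[0];

        int firstSpace = fields[1].indexOf(' ');
        if (firstSpace < 0) {
            throw new IllegalArgumentException("Missing offsets: " + line);
        }
        type = fields[1].substring(0, firstSpace);

        // A discontinuous annotation lists several "begin end" pairs separated by ';'.
        for (String pair : fields[1].substring(firstSpace + 1).split(";")) {
            String[] beginEnd = pair.trim().split("\\s+");
            offsets.add(new int[] {
                    Integer.parseInt(beginEnd[0]), Integer.parseInt(beginEnd[1]) });
        }
        text = fields[2];
    }

    public static void main(String[] args) {
        BratLineSketch a = new BratLineSketch(
                "T205\tdict4 959 971;972 974\tfirst part of tokens second part of tokens");
        System.out.println(a.id + " " + a.type + " has " + a.offsets.size() + " spans");
    }
}
```

The sketch does no validation beyond the field count; a real reader would also have to check that each pair has exactly two numbers.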

I am about to implement this behavior in the BratWriter class. Would you be interested in having this feature in dkpro-core? I guess that would make sense.

Thanks

Member

reckart commented Jun 18, 2018

What you are observing in the brat file is really an instance of a discontinuous annotation, not explicit support for multi-line annotations. So when brat splits an annotation at a line boundary into multiple spans, it may not be clear later whether these spans are meant to be separate or whether they should be joined.

Apparently unlike brat, UIMA (DKPro Core) has no problem handling annotation spans that cross lines. However, DKPro Core has no support for discontinuous annotations (#895).

a) So on the side of the BratWriter, it would make sense to split the cross-line annotations up into multiple segments (as you note) so that brat can properly load them.

b) But this also should entail changes on the side of the reader. A sensible approach would probably be to specially handle the case where the discontinuous segments are really adjacent and in that case to merge them.

However, in cases where the segments are not adjacent, we'd either need to continue ignoring them (#1210) or to implement support for discontinuous annotations in DKPro Core (#895) in order to properly handle them.
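As a rough illustration of the reader-side idea in b), the sketch below merges consecutive segments when nothing but whitespace (for example the line break) lies between them in the document text, and leaves all other segments untouched. `SegmentMergeSketch` and `mergeAdjacent` are hypothetical names, and "adjacent" meaning "separated only by whitespace" is an assumption of this sketch, not something the thread defines. Segments are assumed to be sorted by begin offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of DKPro Core): merges adjacent {begin, end} segments. */
final class SegmentMergeSketch {

    /**
     * Returns a new list in which consecutive segments are merged whenever the
     * document text between them is empty or consists only of whitespace.
     */
    static List<int[]> mergeAdjacent(String documentText, List<int[]> segments) {
        List<int[]> merged = new ArrayList<>();
        for (int[] segment : segments) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && segment[0] >= last[1]
                    && isBlank(documentText, last[1], segment[0])) {
                last[1] = segment[1]; // extend the previous segment over the gap
            } else {
                merged.add(new int[] { segment[0], segment[1] });
            }
        }
        return merged;
    }

    /** True if text[begin, end) contains no non-whitespace character. */
    private static boolean isBlank(String text, int begin, int end) {
        for (int i = begin; i < end; i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Segments that remain separate after this step are the non-adjacent case, which still needs one of the two options above.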

Would you like to contribute a patch for a) and b)?

@reckart reckart added ⭐️ Enhancement New feature or request Module-io.brat labels Jun 18, 2018
Author

parisni commented Jun 20, 2018

Thanks for the details. That makes sense. Correct me if I am wrong, but implementing a) and b) would result in:

For the brat reader:

  • when a brat row contains a discontinuous annotation, there are two cases: 1) the segments are adjacent -> merge them into one annotation; 2) the segments are not adjacent -> create two annotations

For the brat writer:

  • when an annotation contains a newline, produce a discontinuous annotation.
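The writer rule can be sketched as follows: walk over the annotated range, close a segment at every newline, and join the resulting begin/end pairs with ";". `NewlineSplitSketch` and its methods are hypothetical and only illustrate the idea; this is not the BratWriter code. The sketch assumes "\n" line breaks only (no "\r\n" handling) and, following the example at the top of the issue, replaces the newline in the covered text with a space.

```java
/** Hypothetical helper (not part of DKPro Core): splits an annotation at newlines. */
final class NewlineSplitSketch {

    /** Builds the brat offset field, e.g. "4 14;15 26", for the range [begin, end). */
    static String toBratOffsets(String documentText, int begin, int end) {
        StringBuilder sb = new StringBuilder();
        int segmentBegin = begin;
        for (int i = begin; i <= end; i++) {
            // Close the current segment at a newline or at the end of the annotation.
            if (i == end || documentText.charAt(i) == '\n') {
                if (i > segmentBegin) { // skip empty segments (e.g. consecutive newlines)
                    if (sb.length() > 0) {
                        sb.append(';');
                    }
                    sb.append(segmentBegin).append(' ').append(i);
                }
                segmentBegin = i + 1;
            }
        }
        return sb.toString();
    }

    /** Covered text for the brat line; assumption: newlines are written as spaces. */
    static String toBratText(String documentText, int begin, int end) {
        return documentText.substring(begin, end).replace('\n', ' ');
    }
}
```

For the text "foo first part\nsecond part bar" and the range 4 to 26, this yields the offsets "4 14;15 26" and the covered text "first part second part".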

If that's ok, I will give it a try within a few weeks.

Thanks

Member

reckart commented Jun 21, 2018

I didn't have option 2 for the reader in mind, but it seems a viable option.

So I think your suggestion is good.

Author

parisni commented Jun 25, 2018

Let's see the reader first:

Right now, BratTextAnnotation.parse returns a single annotation. Since a discontinuous annotation can contain multiple annotations, one approach is to make it return an array of annotations instead of a single annotation.

This has two impacts:

  1. a unique id needs to be generated for each annotation contained in a discontinuous annotation. eg:

T2 12 13;19 20 a b

could produce two annotations with the IDs "T2-1" and "T2-2"

  2. the json writer needs to be modified accordingly (not sure if the json writer is used anywhere). eg:
     // Format: [${ID}, ${TYPE}, [[${START}, ${END}]]]
     // note that range of the offsets are [${START},${END})
     // ['T1', 'Person', [[0, 11],[12, 13], [19, 20]]]

Is that approach correct for now?

Member

reckart commented Jun 25, 2018

A BratTextAnnotation represents a single brat annotation. Brat annotations can be discontinuous, i.e. they can have more than one begin/end pair. So instead of creating multiple BratTextAnnotations from BratTextAnnotation.parse, BratTextAnnotation should be changed to contain more than one begin/end pair. When a BratTextAnnotation is serialized to JSON, these begin/end pairs must be serialized in the same way that they were represented in the brat file that was originally loaded.

Best start by creating a unit test with a discontinuous segment and then try to fix it.
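A minimal sketch of that shape: one annotation object holding a list of begin/end pairs and writing all of them into a single entry. `DiscontinuousAnnotationSketch` is a hypothetical stand-in and does not reflect the actual fields or methods of BratTextAnnotation. The output notation (single quotes, nested offset list) simply copies the format comment quoted earlier in this thread; since the brat JSON is not documented, treat it as an assumption to be verified.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an annotation with several begin/end pairs. */
final class DiscontinuousAnnotationSketch {
    private final String id;
    private final String type;
    private final List<int[]> offsets = new ArrayList<>(); // each entry is {begin, end}

    DiscontinuousAnnotationSketch(String id, String type) {
        this.id = id;
        this.type = type;
    }

    /** Adds one begin/end pair, keeping the order in which pairs were read. */
    DiscontinuousAnnotationSketch addSpan(int begin, int end) {
        offsets.add(new int[] { begin, end });
        return this;
    }

    /** Writes [ID, TYPE, [[BEGIN, END], ...]] with all pairs in one entry (no escaping). */
    String toJson() {
        StringBuilder sb = new StringBuilder();
        sb.append("['").append(id).append("', '").append(type).append("', [");
        for (int i = 0; i < offsets.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append('[').append(offsets.get(i)[0]).append(", ")
                    .append(offsets.get(i)[1]).append(']');
        }
        return sb.append("]]").toString();
    }
}
```

A unit test with a discontinuous segment could then build the annotation with `addSpan(0, 11).addSpan(12, 15)` and compare `toJson()` against the expected entry.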

Author

parisni commented Jul 1, 2018

> When BratTextAnnotation is serialized to JSON, these begin/end pairs must be serialized in the same way that they were represented in the brat file that was originally loaded.

Is this correct for a discontinuous annotation?

Brat : T1 Person 0 11;12 15 John Doe
Json: ['T1', 'Person', [[0, 11], [12, 15]]]

Member

reckart commented Jul 1, 2018

I think that should be right.

But here is what I'd do to verify this: open the original brat in a browser, create a cross-sentence annotation, and then check what JSON the server sends to the browser. Unfortunately, the brat JSON is not documented.

@reckart reckart added this to the 1.10.0 milestone Jul 28, 2018
reckart added a commit to parisni/dkpro-core that referenced this issue Aug 12, 2018
reckart added a commit that referenced this issue Sep 3, 2018
@reckart reckart closed this as completed Sep 3, 2018
reckart added a commit that referenced this issue Sep 23, 2018
* master: (30 commits)
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release de.tudarmstadt.ukp.dkpro.core-1.10.0
  #1281 - RAT failures in doc module
  #1278 - Upgrade dependencies (1.10.0)
  #1278 - Upgrade dependencies (1.10.0)
  #1278 - Upgrade dependencies (1.10.0)
  #1216 - Enable Arabic segmentation with CoreNLP
  #1216 - Enable Arabic segmentation with CoreNLP
  #1277 - Prepare 1.10.0 release
  #1275 - ApplyChangesAnnotator uses print() messages
  #1247 - brat writer - multiline annotation
  #1272 - GUM dataset description license comment is wrong
  #1052 - Add GUM 4.1.0 dataset description
  #1270 - Bad dependency scope in dkpro-core-frequency-asl
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release de.tudarmstadt.ukp.dkpro.core-1.9.3
  Take account of @rckart review
  Fix formatting and and some remarks from @reckart
  Add contributors
  Improve brat writer newline
  ...

% Conflicts:
%	dkpro-core-frequency-asl/pom.xml
reckart added a commit that referenced this issue Oct 9, 2018
* master: (91 commits)
  No issue. Allow running test repeatedly without cleaning the target folder in between.
  #1290 - Upgrade to LanguageTool 4.3
  No issue. Removed duplicate dependency.
  #1281 - RAT failures in doc module
  #1288 - Add support for the CoreNLP CoNLL flavor
  #1286 - Tika 100000 characters Limit
  No issue. Set version to 1.11.0-SNAPSHOT
  #1284 - Upgrade to Apache Tika 1.19
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release de.tudarmstadt.ukp.dkpro.core-1.10.0
  #1281 - RAT failures in doc module
  #1278 - Upgrade dependencies (1.10.0)
  #1278 - Upgrade dependencies (1.10.0)
  #1278 - Upgrade dependencies (1.10.0)
  #1216 - Enable Arabic segmentation with CoreNLP
  #1216 - Enable Arabic segmentation with CoreNLP
  #1277 - Prepare 1.10.0 release
  #1275 - ApplyChangesAnnotator uses print() messages
  #1247 - brat writer - multiline annotation
  #1272 - GUM dataset description license comment is wrong
  ...

% Conflicts:
%	dkpro-core-lbj-asl/pom.xml