brat writer - multiline annotation #1247

parisni · 2018-06-16T09:54:34Z

Hi

Brat actually handles multiline annotations. For example, annotating the words "first part of tokens" and "second part of tokens" (shown in bold in the original) in this text:

foo bar baz first part of tokens
second part of tokens foo bar baz

would result in this brat annotation:

T205 dict4 959 971;972 974 first part of tokens second part of tokens

That is, the newline character is removed from the annotated text, and there are two spans separated by a ";".

I am about to implement this behavior in the BratWriter class. Would you be interested in having this feature in dkpro-core? I guess that would make sense.

Thanks

reckart · 2018-06-18T11:51:12Z

What you are observing in the brat file is really an instance of a discontinuous annotation, not explicit support for multi-line annotations. So when brat splits an annotation at a line boundary into multiple spans, it may not be clear later whether these spans are meant to be separate or whether they should be joined.

Apparently unlike brat, UIMA (DKPro Core) has no problem handling annotation spans that go across lines. However, DKPro Core has no support for discontinuous annotations (#895).

a) So on the side of the BratWriter, it would make sense to split the cross-line annotations up into multiple segments (as you note) so that brat can properly load them.

b) But this also should entail changes on the side of the reader. A sensible approach would probably be to specially handle the case where the discontinuous segments are really adjacent and in that case to merge them.

However, in cases where the segments are not adjacent, we'd either need to continue ignoring them (#1210) or to implement support for discontinuous annotations in DKPro Core (#895) in order to properly handle them.
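The merge rule from b) could be sketched as follows. This is only a minimal illustration, not the actual BratReader code: `SegmentMerger` and `mergeAdjacent` are hypothetical names, segments are assumed to be sorted `{begin, end}` pairs with an exclusive end, and "adjacent" is assumed to mean that only whitespace (such as the line break) lies between two segments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core.
final class SegmentMerger {
    // Merges neighbouring segments whose gap in the document text is empty
    // or whitespace-only. Input segments must be sorted by begin offset.
    static List<int[]> mergeAdjacent(String text, List<int[]> segments) {
        List<int[]> merged = new ArrayList<>();
        for (int[] seg : segments) {
            if (!merged.isEmpty()) {
                int[] last = merged.get(merged.size() - 1);
                // The gap is the document text between the two segments.
                if (seg[0] >= last[1]
                        && text.substring(last[1], seg[0]).trim().isEmpty()) {
                    last[1] = seg[1]; // extend the previous segment over the gap
                    continue;
                }
            }
            // Copy the pair so the caller's arrays are never modified.
            merged.add(new int[] { seg[0], seg[1] });
        }
        return merged;
    }
}
```

Segments that stay separate after this step are the non-adjacent case, which still needs one of the two options described above.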

Would you like to contribute a patch for a) and b)?

parisni · 2018-06-20T21:56:58Z

Thanks for the details. That makes sense. Correct me if I am wrong, but implementing a) and b) would result in:

For the brat reader:

When a brat row contains a discontinuous annotation, there are two cases: 1) the segments are adjacent -> merge them; 2) the segments are not adjacent -> create two annotations.

For the brat writer:

When an annotation contains a newline, produce a discontinuous annotation.
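For the writer side, splitting at line breaks might look roughly like this. It is a sketch under assumptions, not the BratWriter implementation: `NewlineSplitter`, `splitAtNewlines` and `toBratOffsets` are made-up names, offsets are `[begin, end)`, and only the offset column is produced (the covered-text column is not handled here).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core.
final class NewlineSplitter {
    // Splits the span [begin, end) into segments that contain no line breaks.
    static List<int[]> splitAtNewlines(String text, int begin, int end) {
        List<int[]> segments = new ArrayList<>();
        int segBegin = begin;
        for (int i = begin; i < end; i++) {
            char c = text.charAt(i);
            if (c == '\n' || c == '\r') {
                if (i > segBegin) {
                    segments.add(new int[] { segBegin, i }); // close segment before the break
                }
                segBegin = i + 1; // next segment starts after the break
            }
        }
        if (end > segBegin) {
            segments.add(new int[] { segBegin, end });
        }
        return segments;
    }

    // Renders segments in brat's "begin end;begin end" offset notation.
    static String toBratOffsets(List<int[]> segments) {
        StringBuilder sb = new StringBuilder();
        for (int[] s : segments) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(s[0]).append(' ').append(s[1]);
        }
        return sb.toString();
    }
}
```

An annotation without a line break yields a single segment, so the output for ordinary annotations is unchanged.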

If that's OK, I will give it a try within a few weeks.

Thanks

reckart · 2018-06-21T20:56:58Z

I didn't have option 2 for the reader in mind, but it seems a viable option.

So I think your suggestion is good.

parisni · 2018-06-25T21:37:53Z

Let's look at the reader first.

Right now, BratTextAnnotation.parse returns a single annotation. Since a discontinuous annotation can contain multiple annotations, one approach is to make it return an array of annotations instead of a single annotation.

This has two impacts:

1) A unique ID needs to be generated for each annotation contained in a discontinuous annotation, e.g.:

T2 12 13;19 20 a b

could produce two annotations with the IDs "T2-1" and "T2-2".

2) The JSON writer needs to be modified accordingly (not sure if the JSON writer is used anywhere), e.g.:

    // Format: [${ID}, ${TYPE}, [[${START}, ${END}]]]
    // Note that the offset range is [${START}, ${END})
    // ['T1', 'Person', [[0, 11], [12, 13], [19, 20]]]

Is that approach correct for now?

reckart · 2018-06-25T21:43:14Z

A BratTextAnnotation represents a single brat annotation. Brat annotations can be discontinuous, i.e. they can have more than one begin/end pair. So instead of creating multiple BratTextAnnotation instances in BratTextAnnotation.parse, BratTextAnnotation should be changed to contain more than one begin/end pair. When a BratTextAnnotation is serialized to JSON, these begin/end pairs must be serialized in the same way that they were represented in the brat file that was originally loaded.

Best start by creating a unit test with a discontinuous segment and then try to fix it.
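As a starting point for such a test, parsing the offset column into several begin/end pairs could be sketched like this. `OffsetsParser` is a hypothetical stand-alone helper and does not reflect the real BratTextAnnotation API; it only assumes the "begin end;begin end" notation seen in the brat examples above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the actual BratTextAnnotation code.
final class OffsetsParser {
    // Parses e.g. "12 13;19 20" into [[12, 13], [19, 20]].
    static List<int[]> parse(String offsets) {
        List<int[]> pairs = new ArrayList<>();
        for (String segment : offsets.trim().split(";")) {
            String[] parts = segment.trim().split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed segment: " + segment);
            }
            int begin = Integer.parseInt(parts[0]);
            int end = Integer.parseInt(parts[1]);
            if (begin > end) {
                throw new IllegalArgumentException("Begin after end: " + segment);
            }
            pairs.add(new int[] { begin, end });
        }
        return pairs;
    }
}
```

A continuous annotation such as "0 11" simply yields a list with one pair, so the same representation covers both cases.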

parisni · 2018-07-01T09:41:22Z

"When BratTextAnnotation is serialized to JSON, these begin/end pairs must be serialized in the same way that they were represented in the brat file that was originally loaded."

Is this correct for a discontinuous annotation?

Brat: T1 Person 0 11;12 15 John Doe
JSON: ['T1', 'Person', [[0, 11], [12, 15]]]
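A minimal sketch of producing that shape from a list of begin/end pairs is shown below. `BratJsonSketch` is a hypothetical name, the single-quoted output only mirrors the notation used in this thread (it is not strict JSON), and no escaping of the ID or type is attempted.

```java
import java.util.List;

// Hypothetical helper; the real brat JSON layout should be verified separately.
final class BratJsonSketch {
    // Renders [ID, TYPE, [[begin, end], ...]] in the notation used in this thread.
    static String toJson(String id, String type, List<int[]> offsets) {
        StringBuilder sb = new StringBuilder();
        sb.append("['").append(id).append("', '").append(type).append("', [");
        for (int i = 0; i < offsets.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            int[] pair = offsets.get(i);
            sb.append('[').append(pair[0]).append(", ").append(pair[1]).append(']');
        }
        sb.append("]]");
        return sb.toString();
    }
}
```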

reckart · 2018-07-01T09:46:41Z

I think that should be right.

But here is what I'd do to verify this: open the original brat in a browser, create a cross-sentence annotation, and then check what JSON the server sends to the browser. Unfortunately, the brat JSON is not documented.

reckart added the ⭐️ Enhancement and Module-io.brat labels Jun 18, 2018

parisni mentioned this issue Jul 1, 2018

#1247 - brat writer - multiline annotation #1253

Merged

reckart added this to the 1.10.0 milestone Jul 28, 2018

reckart assigned parisni and reckart Jul 28, 2018

reckart added a commit to parisni/dkpro-core that referenced this issue Aug 12, 2018

dkpro#1247 - brat writer - multiline annotation

901a9e4

- Address checkstyle issues

reckart added a commit that referenced this issue Sep 3, 2018

Merge pull request #1253 from parisni/feature/1247-brat-multiline-han…

1cbe01d

…dling #1247 - brat writer - multiline annotation

reckart closed this as completed Sep 3, 2018
