brat writer - multiline annotation #1247
Comments
What you are observing in the brat file is really an instance of a discontinuous annotation, not explicit support for multi-line annotations. So when brat splits an annotation at a line boundary into multiple spans, it may not be clear later whether these spans are meant to be separate or whether they should be joined. Unlike brat, apparently, UIMA (DKPro Core) has no problem handling annotation spans that go across lines. However, DKPro Core has no support for discontinuous annotations (#895).
a) So on the side of the BratWriter, it would make sense to split cross-line annotations into multiple segments (as you note) so that brat can properly load them.
b) But this should also entail changes on the side of the reader. A sensible approach would probably be to specially handle the case where the discontinuous segments are really adjacent, and in that case to merge them. However, in cases where the segments are not adjacent, we'd either need to continue ignoring them (#1210) or implement support for discontinuous annotations in DKPro Core (#895) in order to handle them properly.
Would you like to contribute a patch for a) and b)?
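The merge step in b) could be sketched roughly as follows. This is only an illustration, not DKPro Core code: the class `FragmentMergerSketch` and its methods are hypothetical, fragments are assumed to be sorted `[begin, end)` offset pairs, and "adjacent" is assumed to mean "separated only by whitespace such as a line break".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core or brat.
final class FragmentMergerSketch {

    /**
     * Merges neighbouring fragments when only whitespace lies between them.
     * Each fragment is an int[] {begin, end}; the input is assumed to be
     * sorted by begin offset.
     */
    static List<int[]> mergeAdjacent(String text, List<int[]> fragments) {
        List<int[]> merged = new ArrayList<>();
        for (int[] fragment : fragments) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && onlyWhitespace(text, last[1], fragment[0])) {
                // Adjacent: extend the previous fragment instead of adding a new one.
                last[1] = fragment[1];
            } else {
                // Copy so that the caller's arrays are not modified.
                merged.add(new int[] { fragment[0], fragment[1] });
            }
        }
        return merged;
    }

    /** Returns true if text[from, to) contains nothing but whitespace. */
    private static boolean onlyWhitespace(String text, int from, int to) {
        if (from > to) {
            return false; // overlapping or unsorted input: do not merge
        }
        for (int i = from; i < to; i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

With this assumption, two fragments split at a line break collapse into one span, while fragments with real text between them stay separate and would still need to be ignored (#1210) or handled as discontinuous annotations (#895).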
Thanks for the details. That makes sense. Correct me if I am wrong, but implementing a) and b) would result in: for brat reader:
for brat writer:
If that's OK, I will give it a try within a few weeks. Thanks
I didn't have option 2 for the reader in mind, but it seems a viable option. So I think your suggestion is good.
Let's look at the reader first: right now, BratTextAnnotation.parse returns a single annotation. Since a discontinuous annotation can contain multiple spans, one approach is to make it return an array of annotations instead of a single annotation. Two impacts are:
could produce two annotations with IDs "T2-1" and "T2-2"
Is that approach correct for the moment?
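A rough sketch of that idea, for illustration only: the class `FragmentParserSketch`, its nested `Fragment` type, and the sample line are hypothetical and do not reflect the actual signature of BratTextAnnotation.parse. It assumes the usual tab-separated brat text-bound line (ID, then type and offsets, then covered text) with fragments separated by ";".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real BratTextAnnotation.parse may look different.
final class FragmentParserSketch {

    /** One contiguous span of a (possibly discontinuous) annotation. */
    static final class Fragment {
        final String id;
        final String type;
        final int begin;
        final int end;

        Fragment(String id, String type, int begin, int end) {
            this.id = id;
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Parses a line such as "T2<TAB>Location 0 5;16 23<TAB>North America"
     * into one Fragment per span. A single-span annotation keeps its ID;
     * multi-span annotations get suffixed IDs ("T2-1", "T2-2", ...).
     */
    static List<Fragment> parse(String line) {
        String[] columns = line.split("\t", 3);
        String id = columns[0];
        String[] typeAndOffsets = columns[1].split(" ", 2);
        String type = typeAndOffsets[0];
        String[] spans = typeAndOffsets[1].split(";");

        List<Fragment> result = new ArrayList<>();
        for (int i = 0; i < spans.length; i++) {
            String[] offsets = spans[i].trim().split(" ");
            String fragmentId = spans.length == 1 ? id : id + "-" + (i + 1);
            result.add(new Fragment(fragmentId, type,
                    Integer.parseInt(offsets[0]), Integer.parseInt(offsets[1])));
        }
        return result;
    }
}
```

Deriving the IDs from the original one keeps the fragments traceable to their source annotation, but anything that refers to "T2" (relations, attributes) would then need a rule for which fragment it points to.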
Best to start by creating a unit test with a discontinuous segment and then try to fix it.
Is this correct for a discontinuous annotation?
I think that should be right. But to verify this, I would open the original brat in a browser, create a cross-sentence annotation, and then check what JSON the server sends to the browser. Unfortunately, the brat JSON is not documented.
- Address checkstyle issues
…dling #1247 - brat writer - multiline annotation
Hi
Brat actually handles multiline annotations. For example:
Annotating these bold words
would result in the following in brat:
That is, the newline character is removed from the annotated text, and there are two spans separated by a ";".
I am about to implement this behavior in the BratWriter class. Would you be interested in having this feature as part of dkpro-core? I guess that would make sense.
Thanks
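To make the proposed writer behavior concrete, here is a minimal sketch of splitting a cross-line span into ";"-separated offsets. The class `BratSpanSplitterSketch` and the method `toBratOffsets` are hypothetical names, not existing BratWriter code, and the sketch assumes that fragments should simply exclude line-break characters (`\n` and `\r`).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the existing BratWriter.
final class BratSpanSplitterSketch {

    /**
     * Splits the span [begin, end) at line breaks and renders the pieces
     * in brat's offset notation, with fragments separated by ";".
     */
    static String toBratOffsets(String text, int begin, int end) {
        List<String> fragments = new ArrayList<>();
        int fragmentStart = begin;
        for (int i = begin; i < end; i++) {
            char c = text.charAt(i);
            if (c == '\n' || c == '\r') {
                // Close the current fragment before the line break, if non-empty.
                if (i > fragmentStart) {
                    fragments.add(fragmentStart + " " + i);
                }
                fragmentStart = i + 1;
            }
        }
        // Add the trailing fragment after the last line break.
        if (end > fragmentStart) {
            fragments.add(fragmentStart + " " + end);
        }
        return String.join(";", fragments);
    }
}
```

For the text "these bold\nwords" and the span 6 to 16, this yields "6 10;11 16"; a span without any line break is returned as a single "begin end" pair.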