Text or sentences? #13363

shashko-a · 2024-03-01T13:05:21Z

I'm trying to train a spaCy model on new data to find colors.
I collected a few hundred independent sentences, and during annotation (I used https://tecoholic.github.io/ner-annotator/) I ran into a question: what's the best way to annotate the data and train spaCy on it?
Should I put all my sentences into one giant string and get a single "entities" block in the JSON from the annotator, or would it be better to keep each sentence separate and get a JSON structure like "sentence 1 - its entities, sentence 2 - its entities, ..."?
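
To make the two options concrete, here is a minimal sketch of both layouts in Python. The example sentences, the `COLOR` label, and the helper names `per_sentence` and `merged` are made up for illustration, and the annotator's actual export format may differ.

```python
import json

# Hypothetical records: (sentence, [(start, end, label), ...]) with
# character offsets relative to the start of each sentence.
RECORDS = [
    ("I bought a red car.", [(11, 14, "COLOR")]),
    ("The sky turned dark blue.", [(15, 24, "COLOR")]),
]


def per_sentence(records):
    """Option 2: one text/entities pair per sentence."""
    return [
        {"text": text, "entities": [list(ent) for ent in ents]}
        for text, ents in records
    ]


def merged(records, sep=" "):
    """Option 1: one giant string; every offset must be shifted."""
    parts, entities, offset = [], [], 0
    for text, ents in records:
        entities += [[s + offset, e + offset, label] for s, e, label in ents]
        parts.append(text)
        offset += len(text) + len(sep)
    return {"text": sep.join(parts), "entities": entities}


separate = per_sentence(RECORDS)
single = merged(RECORDS)
print(json.dumps(separate, indent=2))
print(json.dumps(single, indent=2))
```

In the merged layout the offsets only stay valid as long as the separator and sentence order never change, whereas each per-sentence record can be checked on its own.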

Thanks!

svlandeg · 2024-03-19T10:38:11Z

Maintainer

Hi!

The NER model in spaCy will mostly look at local context. For annotators as well, it's usually sufficient to see the local context to do NE annotation - so I think the granularity of a single sentence will probably work best.

Either way - if the sentences are independent and don't come from the same original document, I definitely wouldn't merge them into a single annotation/document, as this may actually confuse ML models trained on such data.
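
As an illustration of the one-sentence-per-document approach, a minimal sketch of converting per-sentence annotations into spaCy's binary training format could look like the following. It assumes the annotations are available as `(text, {"entities": [[start, end, label]]})` pairs, which is an assumption about the annotator's export rather than something confirmed here; the sample records, the `COLOR` label, and the `build_docbin` helper are hypothetical.

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical per-sentence annotations with character offsets.
TRAIN_DATA = [
    ("I bought a red car.", {"entities": [[11, 14, "COLOR"]]}),
    ("The sky turned dark blue.", {"entities": [[15, 24, "COLOR"]]}),
]


def build_docbin(records, lang="en"):
    """Create one Doc per sentence and collect them in a DocBin."""
    nlp = spacy.blank(lang)  # tokenizer only, no trained pipeline needed
    doc_bin = DocBin()
    skipped = []
    for text, annotations in records:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations["entities"]:
            # char_span returns None if the offsets don't match token boundaries
            span = doc.char_span(start, end, label=label)
            if span is None:
                skipped.append((text, start, end, label))
            else:
                spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)
    return doc_bin, skipped


doc_bin, skipped = build_docbin(TRAIN_DATA)
print(f"{len(doc_bin)} docs, {len(skipped)} misaligned spans")
# doc_bin.to_disk("train.spacy")  # uncomment to write the training file
```

A file written this way can be passed to `python -m spacy train` through `--paths.train`. Any entries in `skipped` point at annotations whose offsets do not line up with token boundaries and are worth reviewing by hand.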

Hope that helps!

1 reply

shashko-a Mar 20, 2024
Author

You literally confirmed my guess, it was really helpful, thanks so much!
