This is not a software issue; we're just wondering whether anyone can shed some light on the results we're seeing.
We've been working on an Icelandic named entity recognizer using NeuroNER. Our training corpus contains 200,000 tokens, of which around 7,000 are named entities.
We are seeing huge improvements by incorporating external word embeddings, from F1=61% (without word embeddings) to F1=81%.
This is great news, but we would like to understand why this is happening. Has anyone here experienced such a big jump in performance when incorporating word embeddings?
Could the fact that Icelandic is a morphologically complex language explain why the word embeddings are working so well?
Our first experiment used word embeddings trained on 500,000 Icelandic words, which gave us F1=75%. We then created word embeddings from 500,000,000 words, and F1 went up to 81%.
Looking for ideas, thoughts, stories from anyone who has tried NeuroNER with and without word embeddings.
Best regards!
While the token embeddings capture the semantics of tokens to some degree, they may still suffer from data sparsity. For example, they cannot account for out-of-vocabulary tokens, misspellings, and different noun forms or verb endings. We address this issue by using character-based token embeddings, which incorporate each individual character of a token to generate its vector representation. This approach enables the model to learn sub-token patterns such as morphemes (e.g., suffix or prefix) and roots, thereby capturing out-of-vocabulary tokens, different surface forms, and other information not contained in the token embeddings.
This is also mentioned in their ablation analysis: removing the character embeddings results in a significant drop in the model's performance. This is likely the reason for such a big improvement in a morphologically rich language like Icelandic.
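As a rough illustration of why character-based token embeddings help here (this is not NeuroNER's actual code — the character table, dimension, and mean pooling below are placeholder choices; NeuroNER itself runs a bidirectional LSTM over the character sequence):

```python
import numpy as np

rng = np.random.default_rng(0)
CHAR_DIM = 8  # toy dimension for illustration

# Hypothetical character vector table, assigned lazily so that
# every character, even a previously unseen one, gets a vector.
char_vectors = {}

def char_vec(c):
    if c not in char_vectors:
        char_vectors[c] = rng.standard_normal(CHAR_DIM)
    return char_vectors[c]

def token_embedding(token):
    # Compose a token vector from its characters (mean pooling here,
    # purely for illustration). Because the representation is built
    # from characters, an out-of-vocabulary inflected form still gets
    # a vector, and forms sharing a root share most character vectors.
    return np.mean([char_vec(c) for c in token], axis=0)

# Both the base form and an unseen definite form get representations:
v1 = token_embedding("hestur")     # "horse"
v2 = token_embedding("hesturinn")  # "the horse" — never needs to be in a vocabulary
```

A token-level lookup table would have to map "hesturinn" to a single UNK vector if it never appeared in training; the character-level composition avoids that entirely.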
Alternate approach:
There's another approach that utilizes subword information: https://arxiv.org/abs/1607.04606
Here they create vector embeddings for subwords, and each word embedding is computed as the sum of these subword embeddings.
Quoting from the paper:
Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character level information.
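A minimal sketch of that subword scheme (assumed details: a toy dimension, a plain dict in place of fastText's hashed bucket table, and n-gram lengths 3–6 as used in the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 8  # toy dimension for illustration

def char_ngrams(word, n_min=3, n_max=6):
    # fastText wraps the word in boundary markers and extracts all
    # character n-grams, plus the full word itself as one unit.
    w = f"<{word}>"
    grams = {w[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)
    return grams

# Hypothetical n-gram vector table (real fastText hashes n-grams
# into a fixed number of buckets instead of storing each one).
ngram_vectors = {}

def ngram_vec(g):
    if g not in ngram_vectors:
        ngram_vectors[g] = rng.standard_normal(DIM)
    return ngram_vectors[g]

def word_vector(word):
    # The word vector is the sum of its subword vectors, so inflected
    # forms sharing a stem share most of their components.
    return sum(ngram_vec(g) for g in char_ngrams(word))
```

Because "hestur" and "hesturinn" share many of the same n-grams (e.g. "<he", "hes", "est"), their summed vectors overlap heavily, which is exactly the parameter sharing the quoted passage argues is missing from per-word embeddings.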