
Drastic improvement using word embeddings (+20%) - explanation? #129

svanhvitlilja opened this issue Dec 9, 2018 · 1 comment

This is not a software issue; we're just wondering whether anyone can shed some light on the results we're seeing.

We've been working on an Icelandic named entity recognizer using NeuroNER. Our training corpus contains 200,000 tokens, of which around 7,000 are named entities.

We are seeing a huge improvement from incorporating external word embeddings: from F1=61% (without word embeddings) to F1=81%.

This is great news, but we would like to understand why this is happening. Has anyone here experienced such a big jump in performance when incorporating word embeddings?

I'm wondering whether the fact that Icelandic is a morphologically complex language explains why the word embeddings are working so well.
Our first experiment with word embeddings was done on 500,000 Icelandic words and gave us F1=75%. We then created word embeddings from 500,000,000 words, and F1 went up to 81%.

Looking for ideas, thoughts, stories from anyone who has tried NeuroNER with and without word embeddings.

Best regards!

kaushikacharya commented Jun 14, 2020

@svanhviti16
My understanding is that the improvement is due to the character embeddings.

https://arxiv.org/abs/1606.03475
The paper mentions:

While the token embeddings capture the semantics of tokens to some degree, they may still suffer from data sparsity. For example, they cannot account for out-of-vocabulary tokens, misspellings, and different noun forms or verb endings. We address this issue by using character-based token embeddings, which incorporate each individual character of a token to generate its vector representation. This approach enables the model to learn sub-token patterns such as morphemes (e.g., suffix or prefix) and roots, thereby capturing out-of-vocabulary tokens, different surface forms, and other information not contained in the token embeddings.

This is also mentioned in their ablation analysis.
[Image: ablation analysis table from the paper]

Removing the character embeddings results in a significant drop in the model's performance.

This is likely the reason for such a big improvement in a morphologically rich language like Icelandic.
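To make the idea concrete, here is a minimal sketch of a character-level BiLSTM that builds a token vector from the token's characters. This is not NeuroNER's actual implementation (it is written with PyTorch here, and the class name, vocabulary size, and dimensions are illustrative assumptions); it just shows how a representation can be produced even for out-of-vocabulary or rarely seen inflected forms.

```python
# Minimal sketch (not NeuroNER's actual code): a character-level BiLSTM that
# turns the characters of a token into a fixed-size vector, which can then be
# concatenated with the pretrained word embedding. All names and sizes below
# are illustrative assumptions.
import torch
import torch.nn as nn

class CharTokenEncoder(nn.Module):
    def __init__(self, char_vocab_size=100, char_emb_dim=25, char_hidden_dim=25):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_emb_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_emb_dim, char_hidden_dim,
                                 batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: (num_tokens, max_token_length) integer character indices
        embedded = self.char_emb(char_ids)          # (T, L, char_emb_dim)
        _, (h_n, _) = self.char_lstm(embedded)      # h_n: (2, T, char_hidden_dim)
        # Concatenate the final forward and backward hidden states per token.
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (T, 2 * char_hidden_dim)

# Usage: even an out-of-vocabulary or inflected token gets a vector built from
# its characters, so shared suffixes and prefixes are still captured.
encoder = CharTokenEncoder()
char_ids = torch.randint(1, 100, (4, 12))  # 4 tokens, up to 12 characters each
token_vectors = encoder(char_ids)
print(token_vectors.shape)  # torch.Size([4, 50])
```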

Alternative approach:

There is another approach to utilizing subword information:
https://arxiv.org/abs/1607.04606
Here they create vector embeddings for subwords (character n-grams), and the word embedding is computed as the sum of these subword embeddings.

Quoting from the paper:

Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character level information.
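For illustration, here is a minimal sketch of that subword idea, assuming character n-grams of length 3 to 6 with boundary markers as described in the paper. The bucket count, dimensions, and example words are my own assumptions, and Python's built-in hash() stands in for the hashing scheme the actual library uses; this is not the authors' code.

```python
# Minimal sketch of the subword approach (fastText-style), not the authors'
# actual code: a word vector is the sum of vectors for its character n-grams.
# Bucket count, dimension, and hash function are illustrative assumptions.
import numpy as np

NUM_BUCKETS = 100_000  # hash buckets shared by all n-grams
DIM = 50               # embedding dimension
ngram_table = np.random.default_rng(0).normal(size=(NUM_BUCKETS, DIM))

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

def word_vector(word):
    """Word embedding = sum of its character n-gram embeddings."""
    indices = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]
    return ngram_table[indices].sum(axis=0)

# Two inflected forms of the Icelandic word for "horse" share most of their
# n-grams, so their vectors stay related even if one form is rare in training.
v1 = word_vector("hestur")
v2 = word_vector("hestinum")
print(v1.shape)  # (50,)
```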
