
Drastic improvement using word embeddings (+20%) - explanation? #129

svanhvitlilja opened this issue Dec 9, 2018 · 1 comment

This is not a software issue; we're just wondering whether anyone can shed some light on the results we're seeing.

We've been working on an Icelandic named entity recognizer using NeuroNER. Our training corpus contains 200,000 tokens, of which around 7,000 are named entities.

We are seeing a huge improvement from incorporating external word embeddings: from F1=61% (without word embeddings) to F1=81%.

This is great news, but we would like to understand why this is happening. Has anyone here experienced such a big jump in performance when incorporating word embeddings?

I'm wondering whether the fact that Icelandic is a morphologically complex language explains why the word embeddings are working so well.
Our first experiment with word embeddings was done on 500,000 Icelandic words and gave us F1=75%. We then created word embeddings from 500,000,000 words, and F1 went up to 81%.

Looking for ideas, thoughts, stories from anyone who has tried NeuroNER with and without word embeddings.

Best regards!

kaushikacharya commented Jun 14, 2020

@svanhviti16
My understanding is that the improvement is due to the character embeddings.

https://arxiv.org/abs/1606.03475
The paper mentions:

While the token embeddings capture the semantics of tokens to some degree, they may still suffer from data sparsity. For example, they cannot account for out-of-vocabulary tokens, misspellings, and different noun forms or verb endings. We address this issue by using character-based token embeddings, which incorporate each individual character of a token to generate its vector representation. This approach enables the model to learn sub-token patterns such as morphemes (e.g., suffix or prefix) and roots, thereby capturing out-of-vocabulary tokens, different surface forms, and other information not contained in the token embeddings.

This is also mentioned in their ablation analysis.
[Image: ablation analysis table from the paper]

Removing the character embeddings results in a significant drop in the model's performance.

This is likely the reason for such a big improvement in a morphologically rich language like Icelandic.
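To make the idea concrete, here is a minimal sketch of a character-level BiLSTM that builds a token vector from the token's characters. This is not NeuroNER's actual implementation (it is written with PyTorch here, and the class name, vocabulary size, and dimensions are illustrative assumptions); it just shows how a representation can be produced even for out-of-vocabulary or rarely seen inflected forms.

```python
# Minimal sketch (not NeuroNER's actual code): a character-level BiLSTM that
# turns the characters of a token into a fixed-size vector, which can then be
# concatenated with the pretrained word embedding. All names and sizes below
# are illustrative assumptions.
import torch
import torch.nn as nn

class CharTokenEncoder(nn.Module):
    def __init__(self, char_vocab_size=100, char_emb_dim=25, char_hidden_dim=25):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab_size, char_emb_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_emb_dim, char_hidden_dim,
                                 batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: (num_tokens, max_token_length) integer character indices
        embedded = self.char_emb(char_ids)          # (T, L, char_emb_dim)
        _, (h_n, _) = self.char_lstm(embedded)      # h_n: (2, T, char_hidden_dim)
        # Concatenate the final forward and backward hidden states per token.
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (T, 2 * char_hidden_dim)

# Usage: even an out-of-vocabulary or inflected token gets a vector built from
# its characters, so shared suffixes and prefixes are still captured.
encoder = CharTokenEncoder()
char_ids = torch.randint(1, 100, (4, 12))  # 4 tokens, up to 12 characters each
token_vectors = encoder(char_ids)
print(token_vectors.shape)  # torch.Size([4, 50])
```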

Alternative approach:

There is another approach to utilizing subword information:
https://arxiv.org/abs/1607.04606
Here they create vector embeddings for subwords (character n-grams), and the word embedding is computed as the sum of these subword embeddings.

Quoting from the paper:

Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages, such as Turkish or Finnish. For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character level information.
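For illustration, here is a minimal sketch of that subword idea, assuming character n-grams of length 3 to 6 with boundary markers as described in the paper. The bucket count, dimensions, and example words are my own assumptions, and Python's built-in hash() stands in for the hashing scheme the actual library uses; this is not the authors' code.

```python
# Minimal sketch of the subword approach (fastText-style), not the authors'
# actual code: a word vector is the sum of vectors for its character n-grams.
# Bucket count, dimension, and hash function are illustrative assumptions.
import numpy as np

NUM_BUCKETS = 100_000  # hash buckets shared by all n-grams
DIM = 50               # embedding dimension
ngram_table = np.random.default_rng(0).normal(size=(NUM_BUCKETS, DIM))

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

def word_vector(word):
    """Word embedding = sum of its character n-gram embeddings."""
    indices = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]
    return ngram_table[indices].sum(axis=0)

# Two inflected forms of the Icelandic word for "horse" share most of their
# n-grams, so their vectors stay related even if one form is rare in training.
v1 = word_vector("hestur")
v2 = word_vector("hestinum")
print(v1.shape)  # (50,)
```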
