Entity Linking with Wikipedia and Wikidata on spaCy v3.7 #13049
-
I'm planning to train a custom entity linking model with spaCy using Wikidata. Could you provide guidance on creating a KnowledgeBase, especially since the latest spaCy version requires InMemoryLookupKB? I'm also considering refactoring the code in the current GitHub repository (https://github.com/explosion/projects/tree/master/nel-wikipedia) to match the latest version. Do you think this is a good approach, or is there a simpler way? Any tips or tricks would be welcome.
-
Hi @RistovaIvona! Having said that, some resources that might be useful for you:
In general I'd recommend adjusting the linked EL benchmark project - I think you'll have less work updating that one to your requirements than the (considerably older) project you linked. If you have any specific questions, feel free to drop them here.
-
Both the Wikipedia dump and the Wikidata dump are far too small to be the full dumps (they should be around 21 and 80 GB, respectively). Try downloading them directly:
Yes, that's important for the processing steps. It doesn't explain why you ended up with only a subset of the data though. So as next steps: download the dumps directly and set
-
Hello @rmitsch and @RistovaIvona, I'm following this thread and am currently on the wikid_parse step, which has been running for hours now. I'm at iteration ~30,000,000. How long does this parse usually take? There's no tqdm progress bar, so I'm not sure how far along I am. Here's what I see currently:
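Since the parse step shows no progress bar, one rough way to estimate how far along a long-running read is, is to track how many compressed bytes have been consumed so far. The following is only a sketch under assumptions: it assumes the dump is a bz2-compressed, line-oriented text file, and `iter_lines_with_progress` is a hypothetical helper, not part of wikid.

```python
import bz2
import os
from typing import Iterator, Tuple


def iter_lines_with_progress(path: str) -> Iterator[Tuple[str, float]]:
    """Yield (line, fraction of compressed bytes read) for a .bz2 text file.

    Hypothetical helper for illustration; not taken from wikid.
    """
    total_bytes = os.path.getsize(path)
    # Open the raw file ourselves so its position can be queried while the
    # bz2 layer decompresses on top of it.
    with open(path, "rb") as raw, bz2.open(raw, "rt", encoding="utf-8") as lines:
        for line in lines:
            fraction = raw.tell() / total_bytes if total_bytes else 1.0
            yield line, fraction
```

A caller could print the fraction every million lines or so. The value is coarse, because the decompressor reads the underlying file in chunks, but over tens of gigabytes it gives a usable estimate of the remaining time.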
-
Well, I ran through training, but I'm not really happy with the results. It looks like the model didn't learn much, and training only ran for 1000 steps.
I've now set max_training_steps to 0 and max_epochs is also 0 (which means unlimited for both), so the patience parameter controls early stopping. We'll see whether that helps or whether the trend above just continues over longer training times...
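With both limits at 0, patience is the only stopping criterion left. The idea can be sketched as below, assuming patience counts the steps since the best evaluation score and that 0 disables it. `should_stop` is a hypothetical helper for illustration, not spaCy's training loop, and the scores are made up.

```python
from typing import List, Tuple


def should_stop(history: List[Tuple[int, float]], patience: int) -> bool:
    """Return True once `patience` steps have passed without a new best score.

    `history` holds (step, score) pairs, one per evaluation, in order.
    Hypothetical helper; the real training loop may differ in detail.
    """
    if patience <= 0 or not history:
        return False  # patience 0 is treated as "never stop early"
    # max() keeps the first maximal item, so ties resolve to the earliest step.
    best_step = max(history, key=lambda pair: pair[1])[0]
    last_step = history[-1][0]
    return last_step - best_step >= patience


# Made-up evaluation scores, for illustration only.
evals = [(200, 0.41), (400, 0.47), (600, 0.46), (800, 0.47), (1000, 0.45)]
print(should_stop(evals, patience=600))   # True: best score was at step 400
print(should_stop(evals, patience=1000))  # False
```

If the score plateaus early, as in the made-up numbers, a larger patience only delays the stop; it doesn't by itself make the model learn more.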
-
Great ideas for debugging this, @svlandeg! Here's the distribution of candidate counts across the training corpus's entities:
Results in:
So 8991 of them (39.3% of entities) have only 1 candidate, while a total of 13858 (60.6%) have >1 candidate. This doesn't account for possible duplicate entities in the training corpus, but I would have had to do NEL to write that code to begin with, so this is the best snapshot I can think of ha ha. Ran with
Ran with
Another thing I noticed is that this benchmarking NEL code was written 2 years ago (the commit is dated July 4, 2022), so at best they were developing against spaCy v3.3.1, and I also see in the requirements for the entire repo that spacy should be
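The candidate-count distribution discussed in this comment can be computed along the following lines. This is a sketch with made-up data, not the snippet that was originally posted: `candidates_by_mention` is a hypothetical mapping from each mention in the training corpus to the KB candidates returned for it, and the entity IDs are placeholders.

```python
from collections import Counter
from typing import Dict, List


def candidate_count_distribution(candidates_by_mention: Dict[str, List[str]]) -> Counter:
    """Map 'number of candidates' -> 'number of mentions with that many'."""
    return Counter(len(cands) for cands in candidates_by_mention.values())


def share_ambiguous(distribution: Counter) -> float:
    """Fraction of mentions that have more than one candidate."""
    total = sum(distribution.values())
    ambiguous = sum(n for count, n in distribution.items() if count > 1)
    return ambiguous / total if total else 0.0


# Placeholder mentions and entity IDs, invented for this example.
toy = {
    "mention A": ["E1", "E2"],
    "mention B": ["E3"],
    "mention C": ["E4", "E5", "E6"],
}
dist = candidate_count_distribution(toy)
print(sorted(dist.items()))  # [(1, 1), (2, 1), (3, 1)]
print(share_ambiguous(dist))
```

Mentions with a single candidate leave nothing to disambiguate, so the share of ambiguous mentions is probably the more informative number when judging how much signal the linker has to learn from.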
-
Hello maintainers and contributors, I am following this forum to implement the NEL benchmark and got the following error on the
Looking at the error and examining the code in
The size of my SQLite DB stands at 20.2 GB, which seems to correspond to another person in this forum but not the 12 GB mentioned by rmitsch. I also ran some sanity check queries on the main tables of the DB and everything seems to be in order.
Machine specs: RAM 32 GB, CPUs 16, Storage 500 GB (
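For anyone wanting to repeat the sanity check on the database, a minimal sketch using Python's built-in sqlite3 module is below. It only lists tables and their row counts. The table names in the demo are invented, and the actual schema produced by wikid may look different.

```python
import sqlite3
from typing import Dict


def table_row_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Return {table name: row count} for every table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Names come from sqlite_master itself; quote them as identifiers.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Demo on an in-memory database with made-up tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE aliases (alias TEXT, entity_id TEXT)")
conn.executemany("INSERT INTO entities VALUES (?)", [("E1",), ("E2",)])
conn.execute("INSERT INTO aliases VALUES ('some name', 'E1')")
print(table_row_counts(conn))  # {'aliases': 1, 'entities': 2}
```

On a database of around 20 GB, a full COUNT(*) per table can take a while, so expect this to run for some time against the real file.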
Hi @RistovaIvona!
InMemoryLookupKB is identical to the previous KnowledgeBase class - the renaming was done as part of a refactoring that will make knowledge bases more customizable and capable in spaCy v4.
Having said that, some resources that might be useful for you:
wikid is a spaCy project for pulling Wikidata dumps, preprocessing and storing them in a SQLite database.
The EL benchmark project uses wikid to use Wikidata as knowledge base. The dataset used for training and evaluation is Mewsli-9, but it should be easy enough to adjust the scaffolding to train with any other dataset.
In general I'd recommend adjusting the linked EL benchmark project - I think you'll have less wor…
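To make the point about InMemoryLookupKB concrete, here is a small sketch of building a knowledge base by hand, assuming spaCy v3.5 or later is installed. The entity IDs, frequencies, vectors and prior probabilities are all placeholders invented for this example. A real setup would fill them from Wikidata, for instance via the SQLite database that wikid produces.

```python
import spacy
from spacy.kb import InMemoryLookupKB

# A blank pipeline is enough here; the KB only needs the shared vocab.
nlp = spacy.blank("en")
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)

# Placeholder entities: IDs, frequencies and vectors are invented.
kb.add_entity(entity="Q_CITY", freq=100, entity_vector=[1.0, 0.0, 0.0])
kb.add_entity(entity="Q_PERSON", freq=20, entity_vector=[0.0, 1.0, 0.0])

# An alias maps a surface form to candidate entities with prior
# probabilities, which must not sum to more than 1.
kb.add_alias(alias="Paris", entities=["Q_CITY", "Q_PERSON"], probabilities=[0.8, 0.2])

candidates = kb.get_alias_candidates("Paris")
for candidate in candidates:
    print(candidate.entity_, candidate.prior_prob)
```

The calls are the same ones the old KnowledgeBase class offered, which fits the description of the change as a rename.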