Training NER on Large Dataset #8456
-
Hi! It looks like in the end your approach for dealing with the size limits works. That said, it would probably be easier/faster/less mem-heavy to create and save the DocBin files in smaller batches as you go: in step 2, in the conversion loop, you could keep a counter and save up to 20 documents (or whatever) into one DocBin, then start a new one.
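For example, that could look roughly like this (just a sketch: it assumes `docs` is the sequence of `Doc` objects coming out of your conversion loop, and the output paths and batch size of 20 are placeholders):

```python
from spacy.tokens import DocBin

doc_bin = DocBin()
batch_num = 0

for i, doc in enumerate(docs, start=1):      # `docs`: Doc objects from the conversion loop
    doc_bin.add(doc)
    if i % 20 == 0:                          # every 20 docs (or whatever), write to disk ...
        doc_bin.to_disk(f"./corpus/train_{batch_num}.spacy")
        doc_bin = DocBin()                   # ... and start a fresh DocBin
        batch_num += 1

if len(doc_bin) > 0:                         # don't forget the last partial batch
    doc_bin.to_disk(f"./corpus/train_{batch_num}.spacy")
```

That way you only ever hold one small DocBin in memory rather than one giant one.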
Your approach with the two config files seems correct. What is your motivation for this cold/hot start?

There are a few additional things you can try to speed things up. First, create a labels file ahead of time (see spacy init labels), so the labels don't have to be collected from the training data during initialization. You could also try increasing the learning rate to train faster (but that takes some experimentation).

How large is your dev set? Note that every training iteration, the accuracy scores are calculated on the dev set. So each line that you print out on the console triggers that evaluation. You could consider making your dev set smaller and running a full evaluation when training is done. Or you could evaluate less often by setting eval_frequency to a higher value.

Finally, have a look at the option to stream data with a custom corpus reader instead of loading everything up front.

Hope these tips are useful, let us know if you run into (more) issues!
-
Hi @svlandeg , thank you for your reply!

Your recommendation to break the dataset into smaller DocBin "batches" is what I already implemented in the hot/cold approach. The motivation for the cold/hot start is this: given that I cannot save the entire training dataset in one big file due to the DocBin limits, I break the training dataset into small bits/"batches" that can be saved to file without DocBin issues. Then, at the very beginning of the training process, when the first "batch" goes into training, the config file sets up a new NER component and a new Word2Vec component. However, as of the second "batch", since I do not want to start training from scratch and I want continuity with the training on the previous batch, I set up the hot-start config file to "source" model-last for the NER and Word2Vec components, as I mentioned in my first post. Note that I also set the max epochs of both the cold and hot config files to 1. Once the last "batch" of training data is processed, one true epoch is completed, at which point I return to the first batch using the hot start, so training continues rather than starting from scratch (cold start). I hope that makes sense :)

Thank you for the tips about:

1- Creating a labels file. I will look into it. It seems like a nice approach that saves time during initialization, though that step is not super time consuming. The size (depth and width) of the Word2Vec model is hugely influential on speed: going from depth 4 and width 96 to 6 and 256 slows training down by about 10 times.

2- Playing with the learning rate. I will look into manipulating it to decrease the total number of epochs needed.

3- Dev set size. Under the "batching" approach, each training batch is 10000 records of (text, {'entities' : []}) before it is converted to a DocBin, and each dev set is 1248 records. I will consider making a smaller dev set to see if it has a noticeable impact on speed. I prefer to keep the evaluation frequency as it is (200), since I want to see progress fairly often instead of waiting a long time to find out whether things are progressing.

Here are the follow-up questions:

1- Can you think of any other approach to train on a large dataset besides the cold/hot approach? The only alternative I can think of is using nlp.update via the API because, unlike the CLI mode, it does not require saving the Examples of training data and I can train in one single execution. But for some reason, nlp.update seems slower than the CLI approach. I may be wrong, and maybe I could set up the parameters better when using nlp.update.

2- Does the data-streaming option require saving a DocBin file? If yes, I will have the same issues there and will have to come up with some sort of cold/hot starts again.

3- I never had success using ray, as my training always crashed. Nor did I ever have any luck using GPU with "accuracy" training as opposed to "efficiency" (I am using these terms in accordance with the quickstart widget on the website). In all these cases my training crashes, even when I use a very powerful machine on AWS with lots of RAM and GPU memory. The question I have is whether the length of the text matters in how memory-hungry spaCy is. In each record of my training data, I am training on a 6-7 page text extracted from a PDF. Is there any value, for training efficiency, in cutting the text into smaller bits (the bits that contain the entity labels)? I have researched this topic a lot and understand that for inference the text length does not matter, since spaCy uses a greedy approach, unless I got that wrong. But I wonder whether it is a relevant technique for training.

4- Given that I can only execute my training in "efficiency" mode with GPU OR in "accuracy" mode without GPU (with any other combination, with or without ray, my training crashes), what parameters can I play with to improve accuracy? For example, is there any way to change the architecture of the convolutional NN that spaCy uses? I read a lot of documentation but could not figure out whether that part is configurable. I understand that accuracy is supposed to be higher in "accuracy" mode given the higher dimensions of the Word2Vec model in the config file, but execution is SUPER slow in that mode, partly because I cannot use the GPU (it crashes).

Thank you for your support,
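P.S. For reference, what I mean in question 1 by using nlp.update via the API is roughly the standard v3 training loop below (the labels, batch size, dropout and epoch count are just placeholders, and `TRAIN_DATA` is the v2-style list of `(text, annotations)` tuples):

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for label in ["LABEL_A", "LABEL_B"]:          # placeholder entity labels
    ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(10):                       # placeholder epoch count
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        examples = [Example.from_dict(nlp.make_doc(text), annots)
                    for text, annots in batch]
        nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)
    print(epoch, losses)
```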
-
I think most of your problems will be solved by providing the labels in advance and streaming your train corpus. I would get the initial training working with those changes, and then if necessary, as a second step, try splitting your documents into smaller units like paragraphs or sections, especially for training on GPU. The CLI handles a lot of defaults better than a custom nlp.update loop.

Streaming does not require saving a DocBin: you can write a custom corpus reader that yields Example objects directly from your data.

For NER, paragraph-sized documents are often a good size for training. The local context used by the model to disambiguate entities is typically within the same paragraph. If you have them, you want to include sections without entities, too, since that's how the model learns what to label as O (not part of any entity).

If you still run into memory issues (most likely on GPU), be sure to look at the batch sizes used for training ([training.batcher]) and for evaluation ([nlp.batch_size]).
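A streaming corpus reader can be sketched roughly like this (the registry name, file layout and path are assumptions for illustration, not spaCy built-ins):

```python
# corpus_reader.py -- point the config at it with:
#   [corpora.train]
#   @readers = "stream_ner_corpus.v1"
#   path = "./train.jsonl"
# and pass --code corpus_reader.py to `spacy train` so the registered function is found.
import json
from typing import Callable, Iterator

import spacy
from spacy.language import Language
from spacy.training import Example


@spacy.registry.readers("stream_ner_corpus.v1")
def stream_ner_corpus(path: str) -> Callable[[Language], Iterator[Example]]:
    def read_examples(nlp: Language) -> Iterator[Example]:
        # yield one Example at a time instead of loading a whole DocBin into memory
        with open(path, encoding="utf8") as infile:
            for line in infile:
                text, annotations = json.loads(line)   # assumed JSONL layout: [text, {"entities": [...]}]
                yield Example.from_dict(nlp.make_doc(text), annotations)
    return read_examples
```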
-
Let me convert this to a discussion. This issue will be locked but the new discussion thread linked here will be open.
-
@adrianeboyd Thank you for the reply, and for your input on streaming, the custom corpus reader, text size, etc. All very helpful! I will play with the training batch size ([training.batcher]), the evaluation batch size ([nlp.batch_size]), and the custom corpus reader for streaming.

I have one last note for this conversation, no reply needed :) As feedback for consideration in future versions of spaCy: currently there is no way to allocate multiple GPUs to training. The reason I mention this is not efficiency/speed. Regardless of speed, the issue I ran into is that even though my machine has 8 GPUs, each with 11441MiB of memory, as soon as one of them fills up the training crashes, while the memory of the other GPUs is 0% used. :)

Thank you again for your support!
-
What is the problem?
I have a large corpus of data that I need to train an NER model with. The corpus is 300k PDF documents, each around 6 pages. The size of the training data in the v2 format [(text, {'entities': []}), ...], before converting it to the v3 format, is about 16 GiB.
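For context, a single record in this v2-style format looks like the following (a made-up example; my real texts and labels are different):

```python
TRAIN_DATA = [
    (
        "Acme Corp signed the lease on 12 March 2018.",
        {"entities": [(0, 9, "ORG"), (30, 43, "DATE")]},
    ),
    # ... roughly 300k more records like this
]
```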
1- The first issue that I ran into was in trying to convert the training data to the v3 format. I tried using the "Migrating from v2" guidelines and used: python -m spacy convert ./training.json ./output
However, I got the following message:
When I looked at the output, it was 118 bytes and clearly the conversion had crashed.
2- So, I changed my approach and used the following code to convert my data:
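The conversion boils down to building Doc objects with entity spans and collecting them in a DocBin, roughly like this (a sketch; `TRAIN_DATA` is the v2-style list shown above, and the `alignment_mode` choice is my assumption):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:                  # skip offsets that don't align to token boundaries
            ents.append(span)
    doc.ents = ents
    doc_bin.add(doc)
```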
This approach worked fine and successfully converted the data, though I should mention that I had to execute it on a p2.8xlarge AWS EC2 instance with 488 GiB of RAM. So far so good.
3- Then, I tried to save the v3 data because I need a saved file to use spaCy v3 CLI training approach. Therefore, I used the following code to save the training data:
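Roughly, the failing step is (a sketch):

```python
# DocBin.to_disk() serializes the entire collection into one bytes payload,
# which is what runs into the msgpack size limit behind the error below
doc_bin.to_disk("./train.spacy")
```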
4- However, I got the error "bytes object is too large", similar to issue #5219 in the closed issues here on GitHub.
I looked around further and it seems like the only solution is to break the training data down into smaller sections before saving. After a few attempts I found that I could break it into 30 pieces and save them successfully.
5- Then, to use these files, I created 2 config files: one for a "cold" start, used for the very first batch of training data, and one for a "hot" start, used for the following 29 batches. The primary difference between these two config files is the following:
For the cold start, the config creates the NER and Word2Vec (tok2vec) components from scratch; for the hot start, both components are sourced from model-last (see the sketch below).
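As a sketch, the relevant sections differ roughly like this (paths are placeholders, not the full configs, and I'm writing the Word2Vec/embedding component as the standard tok2vec):

```ini
# cold.cfg -- first batch: build the components from scratch
[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

[training]
max_epochs = 1
```

```ini
# hot.cfg -- later batches: continue from the previous run's output
[components.tok2vec]
source = "./output/model-last"

[components.ner]
source = "./output/model-last"

[training]
max_epochs = 1
```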
This approach seems to be working. For the first batch of saved v3 training data (aka cold start), the training report shows the pipeline being created from scratch. As of the second batch of saved v3 training data (aka hot start), the report shows that training is being resumed from the sourced components.
The issues that I am having are:
Is this even a reasonable approach? I feel like it's too "hacky" and does not inspire confidence. It is super slow, even though I am using GPU.
What is the best approach for NER training on very large datasets like mine? I understand that spaCy's philosophy is industrial applications and that scalability is taken into account, so I thought there should be a better way to do this. I considered using the API and, instead of saving the v3 training data to disk, just feeding it to nlp.update() to train my model, but I am not sure if that is recommended in v3. I used to do it that way in v2, and I'm afraid I might lose computational efficiency. Please advise on the best approach for large-scale training. Thank you!
I tried using ray for multiprocessing but never had success with it, as it always crashed halfway through; this is reported in the last comment of issue Training NER models on multiple GPUs (not just one) #8093.
Your Environment
All of the code above is in one conda_python3 notebook on AWS SageMaker using an ml.p2.8xlarge EC2 instance.
Python Version Used: 3
spaCy Version Used: 3.0.6