Training NER on Large Dataset #8456
-
Hi! It looks like in the end your approach for dealing with the size limits works. That said, it would probably be easier/faster/less mem-heavy to create and save the DocBin files in smaller batches as you go: in step 2, in the conversion loop, you could keep a counter and save up to 20 documents (or whatever) into one DocBin, then start a new one.
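For example, that could look roughly like this (just a sketch: it assumes `docs` is the sequence of `Doc` objects coming out of your conversion loop, and the output paths and batch size of 20 are placeholders):

```python
from spacy.tokens import DocBin

doc_bin = DocBin()
batch_num = 0

for i, doc in enumerate(docs, start=1):      # `docs`: Doc objects from the conversion loop
    doc_bin.add(doc)
    if i % 20 == 0:                          # every 20 docs (or whatever), write to disk ...
        doc_bin.to_disk(f"./corpus/train_{batch_num}.spacy")
        doc_bin = DocBin()                   # ... and start a fresh DocBin
        batch_num += 1

if len(doc_bin) > 0:                         # don't forget the last partial batch
    doc_bin.to_disk(f"./corpus/train_{batch_num}.spacy")
```

That way you only ever hold one small DocBin in memory rather than one giant one.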
Your approach with the two config files seems correct. What is your motivation for this cold/hot start?

There are a few additional things you can try to speed things up. First, create a labels file ahead of time (see spacy init labels), so the labels don't have to be collected from the training data during initialization. You could also try increasing the learning rate to train faster (but that takes some experimentation).

How large is your dev set? Note that every training iteration, the accuracy scores are calculated on the dev set. So each line that you print out on the console triggers that evaluation. You could consider making your dev set smaller and running a full evaluation when training is done. Or you could evaluate less often by setting eval_frequency to a higher value.

Finally, have a look at the option to stream data with a custom corpus reader instead of loading everything up front.

Hope these tips are useful, let us know if you run into (more) issues!
-
Hi @svlandeg , thank you for your reply!

Your recommendation to break the dataset into smaller DocBin "batches" is what I already implemented in the hot/cold approach. The motivation for the cold/hot start is this: given that I cannot save the entire training dataset in one big file due to the DocBin limits, I break the training dataset into small bits/"batches" that can be saved to file without DocBin issues. Then, at the very beginning of the training process, when the first "batch" goes into training, the config file sets up a new NER component and a new Word2Vec component. However, as of the second "batch", since I do not want to start training from scratch and I want continuity with the training on the previous batch, I set up the hot-start config file to "source" model-last for the NER and Word2Vec components, as I mentioned in my first post. Note that I also set the max epochs of both the cold and hot config files to 1. Once the last "batch" of training data is processed, one true epoch is completed, at which point I return to the first batch using the hot start, so training continues rather than starting from scratch (cold start). I hope that makes sense :)

Thank you for the tips about:

1- Creating a labels file. I will look into it. It seems like a nice approach that saves time during initialization, though that step is not super time consuming. The size (depth and width) of the Word2Vec model is hugely influential on speed: going from depth 4 and width 96 to 6 and 256 slows training down by about 10 times.

2- Playing with the learning rate. I will look into manipulating it to decrease the total number of epochs needed.

3- Dev set size. Under the "batching" approach, each training batch is 10000 records of (text, {'entities' : []}) before it is converted to a DocBin, and each dev set is 1248 records. I will consider making a smaller dev set to see if it has a noticeable impact on speed. I prefer to keep the evaluation frequency as it is (200), since I want to see progress fairly often instead of waiting a long time to find out whether things are progressing.

Here are the follow-up questions:

1- Can you think of any other approach to train on a large dataset besides the cold/hot approach? The only alternative I can think of is using nlp.update via the API because, unlike the CLI mode, it does not require saving the Examples of training data and I can train in one single execution. But for some reason, nlp.update seems slower than the CLI approach. I may be wrong, and maybe I could set up the parameters better when using nlp.update.

2- Does the data-streaming option require saving a DocBin file? If yes, I will have the same issues there and will have to come up with some sort of cold/hot starts again.

3- I never had success using ray, as my training always crashed. Nor did I ever have any luck using GPU with "accuracy" training as opposed to "efficiency" (I am using these terms in accordance with the quickstart widget on the website). In all these cases my training crashes, even when I use a very powerful machine on AWS with lots of RAM and GPU memory. The question I have is whether the length of the text matters in how memory-hungry spaCy is. In each record of my training data, I am training on a 6-7 page text extracted from a PDF. Is there any value, for training efficiency, in cutting the text into smaller bits (the bits that contain the entity labels)? I have researched this topic a lot and understand that for inference the text length does not matter, since spaCy uses a greedy approach, unless I got that wrong. But I wonder whether it is a relevant technique for training.

4- Given that I can only execute my training in "efficiency" mode with GPU OR in "accuracy" mode without GPU (with any other combination, with or without ray, my training crashes), what parameters can I play with to improve accuracy? For example, is there any way to change the architecture of the convolutional NN that spaCy uses? I read a lot of documentation but could not figure out whether that part is configurable. I understand that accuracy is supposed to be higher in "accuracy" mode given the higher dimensions of the Word2Vec model in the config file, but execution is SUPER slow in that mode, partly because I cannot use the GPU (it crashes).

Thank you for your support,
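P.S. For reference, what I mean in question 1 by using nlp.update via the API is roughly the standard v3 training loop below (the labels, batch size, dropout and epoch count are just placeholders, and `TRAIN_DATA` is the v2-style list of `(text, annotations)` tuples):

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for label in ["LABEL_A", "LABEL_B"]:          # placeholder entity labels
    ner.add_label(label)

optimizer = nlp.initialize()
for epoch in range(10):                       # placeholder epoch count
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        examples = [Example.from_dict(nlp.make_doc(text), annots)
                    for text, annots in batch]
        nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)
    print(epoch, losses)
```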
-
I think most of your problems will be solved by providing the labels in advance and streaming your train corpus. I would get the initial training working with those changes, and then if necessary, as a second step, try splitting your documents into smaller units like paragraphs or sections, especially for training on GPU. The CLI handles a lot of defaults better than a custom nlp.update loop.

Streaming does not require saving a DocBin: you can write a custom corpus reader that yields Example objects directly from your data.

For NER, paragraph-sized documents are often a good size for training. The local context used by the model to disambiguate entities is typically within the same paragraph. If you have them, you want to include sections without entities, too, since that's how the model learns what to label as O (not part of any entity).

If you still run into memory issues (most likely on GPU), be sure to look at the batch sizes used for training ([training.batcher]) and for evaluation ([nlp.batch_size]).
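A streaming corpus reader can be sketched roughly like this (the registry name, file layout and path are assumptions for illustration, not spaCy built-ins):

```python
# corpus_reader.py -- point the config at it with:
#   [corpora.train]
#   @readers = "stream_ner_corpus.v1"
#   path = "./train.jsonl"
# and pass --code corpus_reader.py to `spacy train` so the registered function is found.
import json
from typing import Callable, Iterator

import spacy
from spacy.language import Language
from spacy.training import Example


@spacy.registry.readers("stream_ner_corpus.v1")
def stream_ner_corpus(path: str) -> Callable[[Language], Iterator[Example]]:
    def read_examples(nlp: Language) -> Iterator[Example]:
        # yield one Example at a time instead of loading a whole DocBin into memory
        with open(path, encoding="utf8") as infile:
            for line in infile:
                text, annotations = json.loads(line)   # assumed JSONL layout: [text, {"entities": [...]}]
                yield Example.from_dict(nlp.make_doc(text), annotations)
    return read_examples
```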
-
Let me convert this to a discussion. This issue will be locked but the new discussion thread linked here will be open.
-
@adrianeboyd Thank you for the reply, and for your input on streaming, the custom corpus reader, text size, etc. All very helpful! I will play with the training batch size ([training.batcher]), the evaluation batch size ([nlp.batch_size]), and the custom corpus reader for streaming.

I have one last note for this conversation, no reply needed :) As feedback for consideration in future versions of spaCy: currently there is no way to allocate multiple GPUs to training. The reason I mention this is not efficiency/speed. Regardless of speed, the issue I ran into is that even though my machine has 8 GPUs, each with 11441MiB of memory, as soon as one of them fills up the training crashes, while the memory of the other GPUs is 0% used. :)

Thank you again for your support!
-
What is the problem?
I have a large corpus of data that I need to train an NER model with. The corpus is 300k PDF documents, each around 6 pages. The size of the training data in the v2 format [(text, {'entities': []}), ...], before converting it to the v3 format, is about 16 GiB.
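For context, a single record in this v2-style format looks like the following (a made-up example; my real texts and labels are different):

```python
TRAIN_DATA = [
    (
        "Acme Corp signed the lease on 12 March 2018.",
        {"entities": [(0, 9, "ORG"), (30, 43, "DATE")]},
    ),
    # ... roughly 300k more records like this
]
```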
1- The first issue that I ran into was in trying to convert the training data to the v3 format. I tried using the "Migrating from v2" guidelines and used: python -m spacy convert ./training.json ./output
However, I got the following message:
When I looked at the output, it was 118 bytes and clearly the conversion had crashed.
2- So, I changed my approach and used the following code to convert my data:
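The conversion boils down to building Doc objects with entity spans and collecting them in a DocBin, roughly like this (a sketch; `TRAIN_DATA` is the v2-style list shown above, and the `alignment_mode` choice is my assumption):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:                  # skip offsets that don't align to token boundaries
            ents.append(span)
    doc.ents = ents
    doc_bin.add(doc)
```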
This approach worked fine and successfully converted the data, though I should mention that I had to execute it on a p2.8xlarge AWS EC2 instance with 488 GiB of RAM. So far so good.
3- Then, I tried to save the v3 data because I need a saved file to use spaCy v3 CLI training approach. Therefore, I used the following code to save the training data:
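Roughly, the failing step is (a sketch):

```python
# DocBin.to_disk() serializes the entire collection into one bytes payload,
# which is what runs into the msgpack size limit behind the error below
doc_bin.to_disk("./train.spacy")
```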
4- However, I got the error "bytes object is too large", similar to issue #5219 in the closed issues here on GitHub.
I looked around further and it seems like the only solution is to break the training data down into smaller sections before saving. After a few attempts I found that I could break it into 30 pieces and save them successfully.
5- Then, to use these files, I created 2 config files: one for a "cold" start, used for the very first batch of training data, and one for a "hot" start, used for the following 29 batches. The primary difference between these two config files is the following:
For the cold start, the config creates the NER and Word2Vec (tok2vec) components from scratch; for the hot start, both components are sourced from model-last (see the sketch below).
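As a sketch, the relevant sections differ roughly like this (paths are placeholders, not the full configs, and I'm writing the Word2Vec/embedding component as the standard tok2vec):

```ini
# cold.cfg -- first batch: build the components from scratch
[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

[training]
max_epochs = 1
```

```ini
# hot.cfg -- later batches: continue from the previous run's output
[components.tok2vec]
source = "./output/model-last"

[components.ner]
source = "./output/model-last"

[training]
max_epochs = 1
```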
This approach seems to be working. For the first batch of saved v3 training data (aka cold start), the training report shows the pipeline being created from scratch. As of the second batch of saved v3 training data (aka hot start), the report shows that training is being resumed from the sourced components.
The issues that I am having are:
Is this even a reasonable approach? I feel like it's too "hacky" and does not inspire confidence. It is super slow, even though I am using GPU.
What is the best approach for NER training on very large datasets like mine? I understand that spaCy's philosophy is industrial applications and that scalability is taken into account, so I thought there should be a better way to do this. I considered using the API and, instead of saving the v3 training data to disk, just feeding it to nlp.update() to train my model, but I am not sure if that is recommended in v3. I used to do it that way in v2, and I'm afraid I might lose computational efficiency. Please advise on the best approach for large-scale training. Thank you!
I tried using ray for multiprocessing but never had success with it, as it always crashed halfway through; this is reported in the last comment of issue Training NER models on multiple GPUs (not just one) #8093.
Your Environment
All of the code above is in one conda_python3 notebook on AWS SageMaker using an ml.p2.8xlarge EC2 instance.
Python Version Used: 3
spaCy Version Used: 3.0.6