Fine-tuning a transformer SpanCat pipeline on a small corpus - a best practice inquiry #13398
Sifortin started this conversation in Help: Best practices
Hi there,
I am currently trying to fine-tune a RoBERTa model on a SpanCat task with ~20 labels over a corpus of roughly one thousand documents (80% training / 20% validation).
While we are getting relevant results (an average F1 score of 0.65 across all labels), I am finding it hard to pinpoint what could be refined to improve them. To give more context: we currently have 1,000 documents extracted with Tesseract OCR, manually labelled for the fields we want to recognise, and then converted into a spaCy dataset. We expect precision and recall to grow with the size of the dataset, but as a paper I read pointed out, they may not be directly correlated.
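For reference, our conversion step looks roughly like the sketch below. The annotation format, example text, and file name are illustrative rather than our exact pipeline; the key parts are writing the gold spans into a named span group and serialising with `DocBin`:

```python
import spacy
from spacy.tokens import DocBin

# Illustrative annotation format: (text, [(start_char, end_char, label), ...])
annotations = [
    ("Invoice total: 1,250.00 EUR", [(15, 23, "AMOUNT"), (24, 27, "CURRENCY")]),
    # ... one entry per OCR'd document
]

nlp = spacy.blank("en")  # tokenizer only; the transformer is added at training time
doc_bin = DocBin()

for text, span_offsets in annotations:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in span_offsets:
        # alignment_mode="expand" rescues spans whose character offsets fall
        # inside a token, which happens easily with noisy OCR text
        span = doc.char_span(start, end, label=label, alignment_mode="expand")
        if span is not None:
            spans.append(span)
    # SpanCat reads gold spans from a named span group; "sc" is the default key
    doc.spans["sc"] = spans
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")
```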
There is also the question of the config file we used: we essentially did not deviate from the standard SpanCat config (apart from using spancat_singlelabel), which may or may not be affecting our results.
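For example, we kept the default n-gram suggester, which (as far as we understand) only proposes spans up to a fixed token length, so gold spans longer than the largest n-gram size could be silently capping our recall. A minimal sketch of the component setup, using spaCy's documented defaults rather than our exact values:

```python
import spacy

nlp = spacy.blank("en")

# spancat_singlelabel only scores spans the suggester proposes, so gold spans
# longer than max(sizes) tokens are unreachable regardless of training time
nlp.add_pipe(
    "spancat_singlelabel",
    config={
        "spans_key": "sc",
        "suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
    },
)
```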
Some fine-tuning-focused papers I read talked about training for more epochs with more steps per epoch, but how does that translate into spaCy?
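From what we can tell, spaCy has no explicit epochs-times-steps loop; training length is governed by the [training] block of the config. A sketch of the settings we believe are relevant (the file name is illustrative and the values are just spaCy's defaults nudged upward):

```python
from spacy.util import load_config

config = load_config("config.cfg")

# spaCy streams batches until one of these limits is hit, so "training longer"
# means raising max_steps (optimizer updates) and/or max_epochs
config["training"]["max_epochs"] = 0        # 0 = no epoch limit, rely on max_steps
config["training"]["max_steps"] = 40000     # default is 20000
config["training"]["patience"] = 3200       # steps without eval improvement before stopping
config["training"]["eval_frequency"] = 200  # how often the dev set is scored

config.to_disk("config_long.cfg")

# Equivalently, values can be overridden on the CLI without editing the file:
#   python -m spacy train config.cfg --training.max_steps 40000
```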
Thanks and kind regards.
Here is our config file: