v0.7.1
News
- GluonNLP will be featured at KDD 2019 in Anchorage, Alaska! Check out our tutorial: From Shallow to Deep Language Representations: Pre-training, Fine-tuning, and Beyond.
- GluonNLP was featured at JSALT 2019 in Montreal on 2019-06-14! Check out https://jsalt19.mxnet.io.
- This is the last GluonNLP release to officially support Python 2 (#721).
Models and Scripts
BERT
- A BERT BASE model pre-trained on a large corpus including the OpenWebText Corpus, BooksCorpus, and English Wikipedia, with performance comparable to Google's BERT LARGE model. Test scores on the GLUE benchmark are reported below, and a loading sketch for the new checkpoints appears after this list. We also improved the usability of the BERT pre-training script: on-the-fly training data generation, SentencePiece support, Horovod support, etc. (#799, #687, #806, #669, #665). Thank you @davisliang @vanyacohen @Skylion007
Source | GluonNLP | google-research/bert | google-research/bert |
---|---|---|---|
Model | bert_12_768_12 | bert_12_768_12 | bert_24_1024_16 |
Dataset | openwebtext_book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased |
SST-2 | 95.3 | 93.5 | 94.9 |
RTE | 73.6 | 66.4 | 70.1 |
QQP | 72.3 | 71.2 | 72.1 |
SQuAD 1.1 | 91.0/84.4 | 88.5/80.8 | 90.9/84.1 |
STS-B | 87.5 | 85.8 | 86.5 |
MNLI-m/mm | 85.3/84.9 | 84.6/83.4 | 86.7/85.9 |
- The SciBERT model introduced by Iz Beltagy, Arman Cohan, and Kyle Lo in "SciBERT: Pretrained Contextualized Embeddings for Scientific Text". The model checkpoints are converted from the original repository from AllenAI and are available with the following datasets (#735): `scibert_scivocab_uncased`, `scibert_scivocab_cased`, `scibert_basevocab_uncased`, `scibert_basevocab_cased`.
- The BioBERT model introduced by Lee, Jinhyuk, et al. in "BioBERT: a pre-trained biomedical language representation model for biomedical text mining". The model checkpoints are converted from the original repository and are available with the following datasets (#735): `biobert_v1.0_pmc_cased`, `biobert_v1.0_pubmed_cased`, `biobert_v1.0_pubmed_pmc_cased`, `biobert_v1.1_pubmed_cased`.
- The ClinicalBERT model introduced by Kexin Huang, Jaan Altosaar, and Rajesh Ranganath in "ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission". The model checkpoints are converted from the original repository and are available with the `clinicalbert_uncased` dataset (#735).
- The ERNIE model introduced by Sun, Yu, et al. in "ERNIE: Enhanced Representation through Knowledge Integration". The converted model checkpoints can be obtained with `model.get_model("ernie_12_768_12", "baidu_ernie_uncased")` (#759). Thanks @paperplanet
- BERT fine-tuning script for named entity recognition on CoNLL2003 with test F1 92.2 (#612).
- BERT fine-tuning script for the Chinese XNLI dataset with 78.3% validation accuracy (#759). Thanks @paperplanet
- BERT fine-tuning script for intent classification and slot labeling on ATIS (95.9 F1) and SNIPS (95.9 F1) (#817).
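To make the checkpoint names above concrete, here is a minimal loading sketch using the public `nlp.model.get_model` API. The model and dataset names come from the table and list above; the `use_pooler`/`use_decoder`/`use_classifier` flags are the standard BERT options of `get_model`:

```python
import gluonnlp as nlp

# New BERT BASE checkpoint pre-trained on OpenWebText + BooksCorpus + Wikipedia.
bert, vocab = nlp.model.get_model(
    'bert_12_768_12',
    dataset_name='openwebtext_book_corpus_wiki_en_uncased',
    pretrained=True,
    use_pooler=True,        # keep the pooled [CLS] output for classification
    use_decoder=False,      # drop the masked-LM decoder
    use_classifier=False)   # drop the next-sentence-prediction head

# The domain-specific checkpoints use the same interface, e.g. SciBERT:
scibert, sci_vocab = nlp.model.get_model(
    'bert_12_768_12',
    dataset_name='scibert_scivocab_uncased',
    pretrained=True,
    use_decoder=False,
    use_classifier=False)
```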
GPT-2
- The GPT-2 language model introduced by Radford, Alec, et al. in "Language Models are Unsupervised Multitask Learners". The model checkpoints (`gpt2_117m`, `gpt2_345m`), trained on the `openai_webtext` dataset, are converted from the original repository, and a script to generate text from a GPT-2 model is included (#761). See the loading sketch below.
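A loading sketch for the converted GPT-2 checkpoints. Treat the exact entry point as an assumption: depending on the version, the names may resolve through the top-level `nlp.model.get_model` or through the equivalent `get_model` helper bundled with the text-generation script; the model and dataset names (`gpt2_117m`, `gpt2_345m`, `openai_webtext`) are the ones listed above.

```python
import gluonnlp as nlp

# Assumption: the converted GPT-2 weights resolve via the standard
# get_model interface, keyed by the names from this release note.
model, vocab = nlp.model.get_model('gpt2_117m',
                                   dataset_name='openai_webtext',
                                   pretrained=True)
```

For actual sampling and text generation, use the script shipped with this release rather than calling the model directly.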
ESIM
- The ESIM model for text matching introduced by Chen, Qian, et al. in "Enhanced LSTM for Natural Language Inference". (#689)
Data
- Natural language understanding with datasets from the GLUE benchmark: CoLA, SST-2, MRPC, STS-B, MNLI, QQP, QNLI, WNLI, RTE (#682). A usage sketch follows this list.
- Sentiment analysis datasets: CR, MPQA (#663)
- Intent classification and slot labeling datasets: ATIS and SNIPS (#816)
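The GLUE datasets are exposed as classes in `gluonnlp.data`; a minimal usage sketch (the `segment` argument follows the usual train/dev/test convention):

```python
import gluonnlp as nlp

# Load the SST-2 development split; each sample is a [sentence, label] pair.
dev = nlp.data.GlueSST2(segment='dev')
print(len(dev), dev[0])
```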
New Features
- [Feature] support save model / trainer states to S3 (#700)
- [Feature] support load model/trainer states from s3 (#702)
- [Feature] Add SentencePieceTokenizer for BERT (#669)
- [FEATURE] Flexible vocabulary (#732)
- [API] Moving MaskedSoftmaxCELoss and LabelSmoothing to model API (#754) thanks @ThomasDelteil
- [Feature] add the List batchify function (#812) thanks @ThomasDelteil (see the sketch after this list)
- [FEATURE] Add LAMB optimizer (#733)
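As an illustration of the new `List` batchify function (#812): it composes with the existing `Tuple`/`Pad`/`Stack` helpers to pass an un-batchable field, such as raw strings, through as a plain Python list:

```python
import gluonnlp as nlp

batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(pad_val=0),  # pad variable-length token id sequences
    nlp.data.batchify.Stack(),         # stack scalar labels into an array
    nlp.data.batchify.List())          # pass raw text through as a Python list

samples = [([1, 2, 3], 0, 'first sentence'),
           ([4, 5], 1, 'second sentence')]
token_ids, labels, texts = batchify_fn(samples)
```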
Bug Fixes
- [BUGFIX] Fixes for BERT embedding, pretraining scripts (#640) thanks @Deseaus
- [BUGFIX] Update hash of wiki_cn_cased and wiki_multilingual_cased vocab (#655)
- fix bert forward call parameter mismatch (#695) thanks @paperplanet
- [BUGFIX] Fix mlm_loss reporting for eval dataset (#696)
- Fix _get_rnn_cell (#648) thanks @MarisaKirisame
- [BUGFIX] fix mrpc dataset idx (#708)
- [bugfix] fix hybrid beam search sampler (#710)
- [BUGFIX] [DOC] Update nlp.model.get_model documentation and get_model API (#734)
- [BUGFIX] Fix handling of duplicate special tokens in Vocabulary (#749)
- [BUGFIX] Fix TokenEmbedding serialization with `emb[emb.unknown_token] != 0` (#763)
- [BUGFIX] Fix glue test result serialization (#773)
- [BUGFIX] Fix init bug for multilevel BiLMEncoder (#783) thanks @Ishitori
API Changes
- [API] Dropping support for wiki_multilingual and wiki_cn (#764)
- [API] Remove get_bert_model from the public API list (#767)
Enhancements
- [FEATURE] offer load_w2v_binary method to load w2v binary file (#620)
- [Script] Add inference function for BERT classification (#639) thanks @TaoLv
- [SCRIPT] - Add static BERT base export script (for use with MXNet Module API) (#672)
- [Enhancement] One script to export bert for classification/regression/QA (#705)
- [enhancement] refactor bert finetuning script (#692)
- [Enhancement] only use the best model for inference for bert classification (#716)
- [Dataset] redistribute conll2004 (#719)
- [Enhancement] add periodic evaluation for BERT pre-training (#720)
- [FEATURE] add XNLI task (#717)
- [refactor] Refactor BERT script folder (#744)
- [Enhancement] BERT pre-training data generation from sentencepiece vocab (#743)
- [REFACTOR] Refactor TokenEmbedding to reduce number of places that initialize internals (#750)
- [Refactor] Refactor BERT SQuAD inference code (#758)
- [Enhancement] Fix dtype conversion, add sentencepiece support for SQuAD (#766)
- [Dataset] Move MRPC dataset to API (#780)
- [BiDAF-QANet] Common data processing logic for BiDAF and QANet (#739) thanks @Ishitori
- [DATASET] add LCQMC, ChnSentiCorp dataset (#774) thanks @paperplanet
- [Improvement] Implement parser evaluation in Python (#772)
- [Enhancement] Add whole word masking for BERT (#770) thanks @basicv8vc
- [Enhancement] Mix precision support for BERT finetuning (#793)
- Generate BERT training samples in compressed format (#651)
Minor Fixes
- Various documentation fixes: #635, #637, #647, #656, #664, #667, #670, #676, #678, #681, #698, #704, #731, #745, #762, #771, #746, #778, #800, #810, #807, #814. Thanks @rongruosong @crcrpar @mrchypark @xwind-h
- Fix BERT multiprocessing data creation bug which causes unnecessary dispatching to single worker (#649)
- [BUGFIX] Update BERT test and pre-train script (#661)
- update url for ws353 (#701)
- bump up version (#742)
- [DOC] Update textCNN results (#737)
- padding value warning (#747)
- [TUTORIAL][DOC] Tutorial Updates (#802) thanks @faramarzmunshi
Continuous Integration
- skip failing tests in mxnet master (#685)
- [CI] update nodes for CI (#686)
- [CI] CI refactoring to speed up tests (#566)
- [CI] fix codecov (#693)
- use fixture for squad dataset tests (#699)
- [CI] create zipped notebooks for link check (#712)
- Fix test infrastructure for pytest > 4 and bump CI pytest version (#728)
- [CI] set root in BERT tests (#738)
- Fix conftest.py function_scope_seed (#748)
- [CI] Fix links in contribute.rst (#752)
- [CI] Update CI dependencies (#756)
- Revert "[CI] Update CI dependencies (#756)" (#769)
- [CI] AWS Batch serverless CI Pipeline for parallel notebook execution during website build step (#791)
- [CI] Don't exit pipeline before displaying AWS Batch logfiles (#801)
- [CI] Fix for "Don't exit pipeline before displaying AWS Batch logfiles" (#803)
- add license checker (#804)
- enable timeout (#813)
- Fix website build on master branch (#819)