v0.7.1
News
- GluonNLP will be featured at KDD 2019 in Anchorage, Alaska! Check out our tutorial: From Shallow to Deep Language Representations: Pre-training, Fine-tuning, and Beyond.
- GluonNLP was featured at JSALT 2019 in Montreal on 2019-06-14! Check out https://jsalt19.mxnet.io.
- This is the last GluonNLP release to officially support Python 2 (#721).
Models and Scripts
BERT
- A BERT BASE model pre-trained on a large corpus including the OpenWebText Corpus, BooksCorpus, and English Wikipedia, with performance comparable to Google's BERT LARGE model. Test scores on the GLUE benchmark are reported below, and a loading sketch for the new checkpoints appears after this list. We also improved the usability of the BERT pre-training script: on-the-fly training data generation, SentencePiece support, Horovod support, etc. (#799, #687, #806, #669, #665). Thank you @davisliang @vanyacohen @Skylion007
Source | GluonNLP | google-research/bert | google-research/bert |
---|---|---|---|
Model | bert_12_768_12 | bert_12_768_12 | bert_24_1024_16 |
Dataset | openwebtext_book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased |
SST-2 | 95.3 | 93.5 | 94.9 |
RTE | 73.6 | 66.4 | 70.1 |
QQP | 72.3 | 71.2 | 72.1 |
SQuAD 1.1 | 91.0/84.4 | 88.5/80.8 | 90.9/84.1 |
STS-B | 87.5 | 85.8 | 86.5 |
MNLI-m/mm | 85.3/84.9 | 84.6/83.4 | 86.7/85.9 |
- The SciBERT model introduced by Iz Beltagy, Arman Cohan, and Kyle Lo in "SciBERT: Pretrained Contextualized Embeddings for Scientific Text". The model checkpoints are converted from the original repository from AllenAI and are available with the following datasets (#735): `scibert_scivocab_uncased`, `scibert_scivocab_cased`, `scibert_basevocab_uncased`, `scibert_basevocab_cased`.
- The BioBERT model introduced by Lee, Jinhyuk, et al. in "BioBERT: a pre-trained biomedical language representation model for biomedical text mining". The model checkpoints are converted from the original repository and are available with the following datasets (#735): `biobert_v1.0_pmc_cased`, `biobert_v1.0_pubmed_cased`, `biobert_v1.0_pubmed_pmc_cased`, `biobert_v1.1_pubmed_cased`.
- The ClinicalBERT model introduced by Kexin Huang, Jaan Altosaar, and Rajesh Ranganath in "ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission". The model checkpoints are converted from the original repository and are available with the `clinicalbert_uncased` dataset (#735).
- The ERNIE model introduced by Sun, Yu, et al. in "ERNIE: Enhanced Representation through Knowledge Integration". The converted model checkpoints can be obtained with `model.get_model("ernie_12_768_12", "baidu_ernie_uncased")` (#759). Thanks @paperplanet
- BERT fine-tuning script for named entity recognition on CoNLL2003 with test F1 92.2 (#612).
- BERT fine-tuning script for the Chinese XNLI dataset with 78.3% validation accuracy (#759). Thanks @paperplanet
- BERT fine-tuning script for intent classification and slot labeling on ATIS (95.9 F1) and SNIPS (95.9 F1) (#817).
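To make the checkpoint names above concrete, here is a minimal loading sketch using the public `nlp.model.get_model` API. The model and dataset names come from the table and list above; the `use_pooler`/`use_decoder`/`use_classifier` flags are the standard BERT options of `get_model`:

```python
import gluonnlp as nlp

# New BERT BASE checkpoint pre-trained on OpenWebText + BooksCorpus + Wikipedia.
bert, vocab = nlp.model.get_model(
    'bert_12_768_12',
    dataset_name='openwebtext_book_corpus_wiki_en_uncased',
    pretrained=True,
    use_pooler=True,        # keep the pooled [CLS] output for classification
    use_decoder=False,      # drop the masked-LM decoder
    use_classifier=False)   # drop the next-sentence-prediction head

# The domain-specific checkpoints use the same interface, e.g. SciBERT:
scibert, sci_vocab = nlp.model.get_model(
    'bert_12_768_12',
    dataset_name='scibert_scivocab_uncased',
    pretrained=True,
    use_decoder=False,
    use_classifier=False)
```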
GPT-2
- The GPT-2 language model introduced by Radford, Alec, et al. in "Language Models are Unsupervised Multitask Learners". The model checkpoints (`gpt2_117m`, `gpt2_345m`), trained on the `openai_webtext` dataset, are converted from the original repository, and a script to generate text from a GPT-2 model is included (#761). See the loading sketch below.
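A loading sketch for the converted GPT-2 checkpoints. Treat the exact entry point as an assumption: depending on the version, the names may resolve through the top-level `nlp.model.get_model` or through the equivalent `get_model` helper bundled with the text-generation script; the model and dataset names (`gpt2_117m`, `gpt2_345m`, `openai_webtext`) are the ones listed above.

```python
import gluonnlp as nlp

# Assumption: the converted GPT-2 weights resolve via the standard
# get_model interface, keyed by the names from this release note.
model, vocab = nlp.model.get_model('gpt2_117m',
                                   dataset_name='openai_webtext',
                                   pretrained=True)
```

For actual sampling and text generation, use the script shipped with this release rather than calling the model directly.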
ESIM
- The ESIM model for text matching introduced by Chen, Qian, et al. in "Enhanced LSTM for Natural Language Inference". (#689)
Data
- Natural language understanding with datasets from the GLUE benchmark: CoLA, SST-2, MRPC, STS-B, MNLI, QQP, QNLI, WNLI, RTE (#682). A usage sketch follows this list.
- Sentiment analysis datasets: CR, MPQA (#663)
- Intent classification and slot labeling datasets: ATIS and SNIPS (#816)
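The GLUE datasets are exposed as classes in `gluonnlp.data`; a minimal usage sketch (the `segment` argument follows the usual train/dev/test convention):

```python
import gluonnlp as nlp

# Load the SST-2 development split; each sample is a [sentence, label] pair.
dev = nlp.data.GlueSST2(segment='dev')
print(len(dev), dev[0])
```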
New Features
- [Feature] support save model / trainer states to S3 (#700)
- [Feature] support load model/trainer states from s3 (#702)
- [Feature] Add SentencePieceTokenizer for BERT (#669)
- [FEATURE] Flexible vocabulary (#732)
- [API] Moving MaskedSoftmaxCELoss and LabelSmoothing to model API (#754) thanks @ThomasDelteil
- [Feature] add the List batchify function (#812) thanks @ThomasDelteil (see the sketch after this list)
- [FEATURE] Add LAMB optimizer (#733)
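As an illustration of the new `List` batchify function (#812): it composes with the existing `Tuple`/`Pad`/`Stack` helpers to pass an un-batchable field, such as raw strings, through as a plain Python list:

```python
import gluonnlp as nlp

batchify_fn = nlp.data.batchify.Tuple(
    nlp.data.batchify.Pad(pad_val=0),  # pad variable-length token id sequences
    nlp.data.batchify.Stack(),         # stack scalar labels into an array
    nlp.data.batchify.List())          # pass raw text through as a Python list

samples = [([1, 2, 3], 0, 'first sentence'),
           ([4, 5], 1, 'second sentence')]
token_ids, labels, texts = batchify_fn(samples)
```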
Bug Fixes
- [BUGFIX] Fixes for BERT embedding, pretraining scripts (#640) thanks @Deseaus
- [BUGFIX] Update hash of wiki_cn_cased and wiki_multilingual_cased vocab (#655)
- fix bert forward call parameter mismatch (#695) thanks @paperplanet
- [BUGFIX] Fix mlm_loss reporting for eval dataset (#696)
- Fix _get_rnn_cell (#648) thanks @MarisaKirisame
- [BUGFIX] fix mrpc dataset idx (#708)
- [bugfix] fix hybrid beam search sampler (#710)
- [BUGFIX] [DOC] Update nlp.model.get_model documentation and get_model API (#734)
- [BUGFIX] Fix handling of duplicate special tokens in Vocabulary (#749)
- [BUGFIX] Fix TokenEmbedding serialization with `emb[emb.unknown_token] != 0` (#763)
- [BUGFIX] Fix glue test result serialization (#773)
- [BUGFIX] Fix init bug for multilevel BiLMEncoder (#783) thanks @Ishitori
API Changes
- [API] Dropping support for wiki_multilingual and wiki_cn (#764)
- [API] Remove get_bert_model from the public API list (#767)
Enhancements
- [FEATURE] offer load_w2v_binary method to load w2v binary file (#620)
- [Script] Add inference function for BERT classification (#639) thanks @TaoLv
- [SCRIPT] - Add static BERT base export script (for use with MXNet Module API) (#672)
- [Enhancement] One script to export bert for classification/regression/QA (#705)
- [enhancement] refactor bert finetuning script (#692)
- [Enhancement] only use the best model for inference for bert classification (#716)
- [Dataset] redistribute conll2004 (#719)
- [Enhancement] add periodic evaluation for BERT pre-training (#720)
- [FEATURE] add XNLI task (#717)
- [refactor] Refactor BERT script folder (#744)
- [Enhancement] BERT pre-training data generation from sentencepiece vocab (#743)
- [REFACTOR] Refactor TokenEmbedding to reduce number of places that initialize internals (#750)
- [Refactor] Refactor BERT SQuAD inference code (#758)
- [Enhancement] Fix dtype conversion, add sentencepiece support for SQuAD (#766)
- [Dataset] Move MRPC dataset to API (#780)
- [BiDAF-QANet] Common data processing logic for BiDAF and QANet (#739) thanks @Ishitori
- [DATASET] add LCQMC, ChnSentiCorp dataset (#774) thanks @paperplanet
- [Improvement] Implement parser evaluation in Python (#772)
- [Enhancement] Add whole word masking for BERT (#770) thanks @basicv8vc
- [Enhancement] Mix precision support for BERT finetuning (#793)
- Generate BERT training samples in compressed format (#651)
Minor Fixes
- Various documentation fixes: #635, #637, #647, #656, #664, #667, #670, #676, #678, #681, #698, #704, #731, #745, #762, #771, #746, #778, #800, #810, #807, #814. Thanks @rongruosong @crcrpar @mrchypark @xwind-h
- Fix BERT multiprocessing data creation bug which causes unnecessary dispatching to single worker (#649)
- [BUGFIX] Update BERT test and pre-train script (#661)
- update url for ws353 (#701)
- bump up version (#742)
- [DOC] Update textCNN results (#737)
- padding value warning (#747)
- [TUTORIAL][DOC] Tutorial Updates (#802) thanks @faramarzmunshi
Continuous Integration
- skip failing tests in mxnet master (#685)
- [CI] update nodes for CI (#686)
- [CI] CI refactoring to speed up tests (#566)
- [CI] fix codecov (#693)
- use fixture for squad dataset tests (#699)
- [CI] create zipped notebooks for link check (#712)
- Fix test infrastructure for pytest > 4 and bump CI pytest version (#728)
- [CI] set root in BERT tests (#738)
- Fix conftest.py function_scope_seed (#748)
- [CI] Fix links in contribute.rst (#752)
- [CI] Update CI dependencies (#756)
- Revert "[CI] Update CI dependencies (#756)" (#769)
- [CI] AWS Batch serverless CI Pipeline for parallel notebook execution during website build step (#791)
- [CI] Don't exit pipeline before displaying AWS Batch logfiles (#801)
- [CI] Fix for "Don't exit pipeline before displaying AWS Batch logfiles" (#803)
- add license checker (#804)
- enable timeout (#813)
- Fix website build on master branch (#819)