Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter

Code and checkpoints for the ACL 2021 paper "Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter"

arXiv link to the paper: https://arxiv.org/abs/2105.07148

If you have any questions, please contact: willie1206@163.com

Requirements

Python 3.7.0
transformers 3.4.0
Numpy 1.18.5
Packaging 17.1
scikit-learn 0.23.2
torch 1.6.0+cu92
tqdm 4.50.2
multiprocess 0.70.10
tensorflow 2.3.1
tensorboardX 2.1
seqeval 1.2.1

Input Format

CoNLL format (BIOES tag scheme preferred), with one character and its label per line. Sentences are separated by a blank line.

美   B-LOC  
国   E-LOC  
的   O  
华   B-PER  
莱   I-PER  
士   E-PER  

我   O  
跟   O  
他   O  
谈   O  
笑   O  
风   O  
生   O
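
As an illustration of the format above, here is a minimal sketch of a reader for such files. It is not the repository's own loader: the function names `read_conll` and `bioes_spans` are hypothetical, and the sketch assumes whitespace-separated columns with the character first and the label last.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group (character, label) pairs into sentences split on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError("expected 'character label', got: %r" % raw)
        # Assumption: first column is the character, last column is the label.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def bioes_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Return (type, start, end_inclusive) for each B..E or S span.

    The type is taken from the closing E- or S- tag; B/E type mismatches
    are not checked in this sketch.
    """
    spans = []
    start = None
    for i, label in enumerate(labels):
        prefix, _, kind = label.partition("-")
        if prefix == "S":
            spans.append((kind, i, i))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((kind, start, i))
            start = None
        elif prefix == "O":
            start = None
    return spans
```

For a file on disk, something like `read_conll(open(path, encoding="utf-8"))` would apply, where `path` points to one of your own data files.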

Chinese BERT, Chinese Word Embedding, and Checkpoints

Chinese BERT

Chinese BERT: https://huggingface.co/bert-base-chinese/tree/main

Chinese word embedding:

~~Word Embedding: https://ai.tencent.com/ailab/nlp/en/data/Tencent_AILab_ChineseEmbedding.tar.gz~~

The original download link no longer works. Use the updated link instead:

Word Embedding: https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d200-v0.2.0.tar.gz

For more information, see: Tencent AI Lab Word Embedding
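
To inspect the downloaded embedding file, a small streaming parser can help. The sketch below assumes the extracted file uses the word2vec text layout (a header line `<num_words> <dim>`, then one word followed by `dim` floats per line). The helper names `iter_embeddings` and `collect_vocab` are hypothetical and are not part of this repository.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def iter_embeddings(
    handle: TextIO, limit: Optional[int] = None
) -> Iterator[Tuple[str, List[float]]]:
    """Yield (word, vector) pairs from a word2vec-style text file."""
    header = handle.readline().split()
    # Assumption: the header is "<num_words> <dim>".
    dim = int(header[1])
    for count, line in enumerate(handle):
        if limit is not None and count >= limit:
            break
        # Split from the right so the last `dim` fields are the vector.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        yield parts[0], [float(x) for x in parts[1:]]


def collect_vocab(handle: TextIO, limit: Optional[int] = None) -> List[str]:
    """Return only the words, in file order."""
    return [word for word, _ in iter_embeddings(handle, limit)]
```

Reading only the first few thousand entries with `limit` is a quick sanity check before loading the full table, which is large.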

Checkpoints and Shells

Weibo NER
Ontonote4 NER
MSRA NER
Resume NER
CTB5 POS
CTB6 POS
UD1 POS
UD2 POS
CTB6 CWS
MSR CWS
PKU CWS

Directory Structure of data

berts
- bert
  - config.json
  - vocab.txt
  - pytorch_model.bin
dataset, you can download from here
- NER
  - weibo
  - note4
  - msra
  - resume
- POS
  - ctb5
  - ctb6
  - ud1
  - ud2
- CWS
  - ctb6
  - msr
  - pku
vocab
- tencent_vocab.txt, the vocab of the pre-trained word embedding table, download from here.
embedding
- word_embedding.txt
result
- NER
  - weibo
  - note4
  - msra
  - resume
- POS
  - ctb5
  - ctb6
  - ud1
  - ud2
- CWS
  - ctb6
  - msr
  - pku
log
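
The empty directory skeleton above can be created with a short script. This is only a convenience sketch: the function name `make_layout` is hypothetical, and it assumes all the directories listed sit under one root folder of your choice. The BERT files, datasets, vocab, and embeddings still have to be downloaded and placed by hand.

```python
import tempfile
from pathlib import Path

# Task and corpus names copied from the directory listing above.
TASKS = {
    "NER": ["weibo", "note4", "msra", "resume"],
    "POS": ["ctb5", "ctb6", "ud1", "ud2"],
    "CWS": ["ctb6", "msr", "pku"],
}


def make_layout(root: Path) -> None:
    """Create the empty directory tree under `root`."""
    for top in ("berts/bert", "vocab", "embedding", "log"):
        (root / top).mkdir(parents=True, exist_ok=True)
    # dataset/ and result/ share the same task/corpus structure.
    for parent in ("dataset", "result"):
        for task, corpora in TASKS.items():
            for corpus in corpora:
                (root / parent / task / corpus).mkdir(parents=True, exist_ok=True)


if __name__ == "__main__":
    # Demo in a throwaway directory; replace with your own data root.
    with tempfile.TemporaryDirectory() as tmp:
        make_layout(Path(tmp))
        print(sorted(p.name for p in Path(tmp).iterdir()))
```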

Run

1. Convert the .char.bmes file to a .json file: python3 to_json.py
2. Run the shell script: sh run_demo.sh

If you want to load my checkpoints, you need to make some revisions to your transformers.

My model was trained in distributed mode, so the checkpoints cannot be loaded directly in single-GPU mode. Follow the steps below to revise transformers before loading my checkpoints.

Enter the transformers source directory: cd source/transformers-master
Open modeling_utils.py and go to around line 995.
Change the code as follows:
Compile and install the revised source code: python3 setup.py install
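
As a possible alternative to patching the library, checkpoints saved from a model wrapped for distributed training often carry a `module.` prefix on every parameter name, and that prefix can be stripped before loading. Whether these checkpoints have this prefix is an assumption, and the sketch below is not the author's patch. The helper name `strip_module_prefix` is hypothetical.

```python
from collections import OrderedDict
from typing import Any, Mapping


def strip_module_prefix(
    state_dict: Mapping[str, Any], prefix: str = "module."
) -> "OrderedDict[str, Any]":
    """Return a copy of `state_dict` with a leading `prefix` removed from keys."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        # Keys without the prefix are kept unchanged.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Assumed usage with PyTorch (paths and model are placeholders):
#   state = torch.load("pytorch_model.bin", map_location="cpu")
#   model.load_state_dict(strip_module_prefix(state))
```

If loading still reports missing or unexpected keys after this, the mismatch lies elsewhere and the source revision described above is the documented route.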

Cite

@inproceedings{liu-etal-2021-lexicon,
    title = "Lexicon Enhanced {C}hinese Sequence Labeling Using {BERT} Adapter",
    author = "Liu, Wei  and
      Fu, Xiyan  and
      Zhang, Yue  and
      Xiao, Wenming",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.454",
    doi = "10.18653/v1/2021.acl-long.454",
    pages = "5847--5858"
}
