Chinese is written using characters (hanzi), where each character represents a syllable. A word is usually taken to consist of one or more characters, and running text has no spaces between words. Fewer than 3,500 distinct characters are normally encountered. Word segmentation (or tokenization) is the process of dividing a sequence of characters into a sequence of words.
Input:
亲 请问有什么可以帮您的吗?
Output:
亲 请问 有 什么 可以 帮 您 的 吗 ?
Word F1 score:
Gold: 共同 创造 美好 的 新 世纪 —— 二○○一年 新年 贺词
Hypothesis: 共同 创造 美 好 的 新 世纪 —— 二○○一年 新年 贺词
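Precision = 9 / 11 = 0.818 (9 of the 11 hypothesis words match the gold segmentation; 美好 was wrongly split into 美 and 好)
Recall = 9 / 10 = 0.9 (9 of the 10 gold words are recovered)
F1 = 2 × Precision × Recall / (Precision + Recall) = 0.857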
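The word F1 computation can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring script of any benchmark listed here: it assumes both segmentations are space-separated strings over the same underlying characters, and the helper names `word_spans` and `word_f1` are hypothetical. A hypothesis word counts as correct only if a gold word covers exactly the same character span.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character offsets they cover."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def word_f1(gold, hypothesis):
    """Return (precision, recall, F1) for two space-separated segmentations.

    Assumption: both strings contain the same characters once spaces are removed.
    """
    gold_spans = word_spans(gold.split())
    hyp_spans = word_spans(hypothesis.split())
    # A word is correct when its exact character span appears in both segmentations.
    correct = len(gold_spans & hyp_spans)
    precision = correct / len(hyp_spans) if hyp_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = "共同 创造 美好 的 新 世纪 —— 二○○一年 新年 贺词"
hypothesis = "共同 创造 美 好 的 新 世纪 —— 二○○一年 新年 贺词"

precision, recall, f1 = word_f1(gold, hypothesis)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
# prints: P=0.818 R=0.900 F1=0.857
```

Comparing character spans rather than word strings keeps repeated words at different positions from being matched against each other.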
- Website, Detailed Instruction, Overview Paper
- Includes 4 datasets: AS and CityU in traditional Chinese, and PK (written PKU in the results tables) and MSRA in simplified Chinese.
Corpus | Abbrev. | Encoding | Test Size (Tokens/Types) |
---|---|---|---|
Traditional Chinese | |||
Academia Sinica (Taipei) | AS | Unicode/Big Five Plus | 122K / 19K |
City University of Hong Kong | CityU | HKSCS Unicode/Big Five | 104K / 13K |
Simplified Chinese | |||
Peking University | PK | CP936/Unicode | 41K / 9K |
Microsoft Research | MSRA | CP936/Unicode | 107K / 13K |
Model | AS | CityU | MSRA | PKU |
---|---|---|---|---|
Ke et al. (2021) | 97.0 | 98.2 | 98.5 | 96.9 |
Qiu, Pei, Yan, Huang (2020) | 96.4 | 96.9 | 98.1 | 96.4 |
Tian, Song, Xia, Zhang, Wang (2020) | 96.6 | 97.9 | 98.4 | 96.5 |
Meng et al. (2019) | 96.7* | 97.9* | 98.3 | 96.7 |
Huang et al. (2019) | 96.6 | 97.6 | 97.9 | 96.6 |
Ma et al. (2018) | 96.2 | 97.2 | 97.4 | 96.1 |
Yang et al. (2017) | 95.7 | 96.9 | 97.5 | 96.3 |
Zhou et al. (2017) | | | 97.8 | 96.0 |
* Unlike others, Meng et al. (2019) do not report converting traditional Chinese to simplified Chinese.
Train set | Training Size (Words) |
---|---|
AS | 5.45M |
CityU | 1.46M |
MSRA | 2.37M |
PKU | 1.1M |
- Website
- Includes 3 datasets: CTB6, CTB7, and CTB9.
Data set | Test set (Tokens) |
---|---|
CTB6 | 82K |
CTB7 | 245K |
CTB9 | 242K |
Model | CTB6 | CTB7 | CTB9 |
---|---|---|---|
Ke et al. (2021) | 97.9 | ||
Tian, Song, Ao, Xia, Quan, Zhang, Wang (2020) | 97.5 | 97.3 | 97.8 |
Tian, Song, Xia, Zhang, Wang (2020) | 97.3 | ||
Yan et al. (2020) | 97.1 | 97.6 | |
Huang et al. (2019) | 97.6 | ||
Ma et al. (2018) | 96.7 | 96.6** | |
Yang et al. (2017) | 96.2 | ||
Zhou et al. (2017) | 96.2 |
** Ma et al. (2018) report different statistics for their CTB7 split (950K/60K/82K), so the results might not be comparable.
Train set | Training Size (Words) |
---|---|
CTB6 | 641K |
CTB7 | 718K |
CTB9 | 1,696K |
Data set | Test set (Tokens) |
---|---|
UD | 12,012 |
Model | UD |
---|---|
Ke et al. (2021) | 98.6 |
Tian, Song, Ao, Xia, Quan, Zhang, Wang (2020) | 98.3 |
Huang et al. (2019) | 97.3 |
Ma et al. (2018) | 96.9 |
Train set | Training Size (Words) |
---|---|
UD | 98,608 |
# Sentences | # Words | # Characters |
---|---|---|
8,592 | - | 315,857 |
Model | F1 |
---|---|
Yang et al. (2017) | 95.5 |
Split | # Sentences | # Words | # Characters |
---|---|---|---|
Train | 20,135 | 421,166 | 688,734 |
Dev | 2,052 | 43,697 | 73,244 |