Commit

Merge pull request #30 from XiaoMi/develop
Add multi-process tokenization and in-model intervention
nepshi authored Feb 22, 2021
2 parents f8e4aec + 7fc6100 commit 240b13d
Showing 16 changed files with 261 additions and 205 deletions.
File renamed without changes.
26 changes: 26 additions & 0 deletions .pep8speaks.yml
@@ -0,0 +1,26 @@
scanner:
diff_only: False # If False, the entire file touched by the Pull Request is scanned for errors. If True, only the diff is scanned.
linter: pycodestyle # Other option is flake8

pycodestyle: # Same as scanner.linter value. Other option is flake8
max-line-length: 120 # Default is 79 in PEP 8
ignore: # Errors and warnings to ignore
- W504 # line break after binary operator
- E402 # module level import not at top of file
- E731 # do not assign a lambda expression, use a def
- C406 # Unnecessary list literal - rewrite as a dict literal.
- E741 # ambiguous variable name

no_blank_comment: True # If True, no comment is made on PR without any errors.
descending_issues_order: False # If True, PEP 8 issues in message will be displayed in descending order of line numbers in the file

message: # Customize the comment made by the bot
opened: # Messages when a new PR is submitted
header: "Hello @{name}! Thanks for opening this PR. "
# The keyword {name} is converted into the author's username
footer: "Do see the [Hitchhiker's guide to code style](https://goo.gl/hqbW4r)"
# The messages can be written as they would over GitHub
updated: # Messages when new commits are added to the PR
header: "Hello @{name}! Thanks for updating this PR. "
footer: "" # Why to comment the link to the style guide everytime? :)
no_errors: "There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers: "
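
The same checks can be reproduced locally before pushing. Below is a rough equivalent using pycodestyle's Python API; this is a sketch that assumes pycodestyle is installed and is run from the repository root (C406 is a flake8-comprehensions code, so it is not passed to pycodestyle):

```python
import pycodestyle

# Mirror .pep8speaks.yml: 120-character lines, same pycodestyle codes ignored.
style = pycodestyle.StyleGuide(max_line_length=120,
                               ignore=['W504', 'E402', 'E731', 'E741'])
report = style.check_files(['minlp-tokenizer'])
print('%d issue(s) found' % report.total_errors)
```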
57 changes: 44 additions & 13 deletions minlp-tokenizer/README.md
@@ -3,8 +3,8 @@
## 1. Introduction

MiNLP-Tokenizer is a Chinese word segmentation tool developed by the NLP team of Xiaomi AI Lab. It is built on a deep-learning sequence-labeling model and achieves SOTA results on public test sets. It has the following features:
- **Accurate segmentation**: trained on a large-scale corpus with a deep learning model; coarse- and fine-grained F1 on the SIGHAN 2005 PKU test set reach 95.7% and 96.3% respectively [Note 1]
- **Lightweight model**: streamlined parameters and structure; the model is only 20MB
- **Accurate segmentation**: trained on a large-scale corpus with a deep learning model; coarse- and fine-grained F1 on the SIGHAN 2005 PKU test set reach 95.7% and 96.3% respectively<sup>[1]</sup>
- **Lightweight model**: streamlined parameters and structure; the model is only 20MB, and segmentation speed reaches 150KB/s on a CPU (i7-6700 3.4GHz)
- **Customizable lexicon**: a flexible and convenient intervention mechanism that adjusts the model output according to a user lexicon
- **Multi-granularity segmentation**: provides both coarse- and fine-grained segmentation standards to cover a variety of scenarios
- **Easy to use**: one-step installation and a simple, easy-to-use API
@@ -21,41 +21,72 @@ pip install minlp-tokenizer

## 3. API Usage

- Tokenization (single sentence or list of sentences):
```python
from minlptokenizer.tokenizer import MiNLPTokenizer

tokenizer = MiNLPTokenizer(granularity='fine')  # fine: fine-grained, coarse: coarse-grained; default is fine-grained
print(tokenizer.cut('今天天气怎么样?'))
print(tokenizer.cut('今天天气怎么样?'))  # segment a single sentence
# ['今天','天气','怎么样']
print(tokenizer.cut(['今天天气怎么样', '小米的价值观是真诚与热爱']))  # segment a list of sentences
# [['今天','天气','怎么样'],['小米','的','价值观','是','真诚','与','热爱']]
```

## 4. Custom User Lexicon
- Multi-process tokenization:
Start multiple processes to segment in parallel and speed up tokenization:

- Add words via a user lexicon list:
```python
(1) Since spawning processes adds extra time overhead, this suits large workloads; enabling multi-process tokenization is recommended for 100,000+ sentences.

(2) Choose a suitable number of processes for your hardware and load; the default is 1, i.e. multi-processing is disabled.

```python
from minlptokenizer.tokenizer import MiNLPTokenizer

tokenizer = MiNLPTokenizer(['word1', 'word2'], granularity='fine')  # pass in a user-defined intervention lexicon
```
texts = ['小米的价值观是真诚与热爱'] * 2048
tokenizer = MiNLPTokenizer(granularity='fine')  # fine: fine-grained, coarse: coarse-grained; default is fine-grained
result = tokenizer.cut(texts, n_jobs=4)  # n_jobs: number of processes; default is 1 (multi-processing disabled)
```
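
As a rough illustration of notes (1) and (2) above, the process count can be derived from the workload size and the available cores; the threshold and cap below are this example's own heuristics, not part of the library:

```python
import os

from minlptokenizer.tokenizer import MiNLPTokenizer

texts = ['小米的价值观是真诚与热爱'] * 200000

# Heuristic: only enable multi-processing for large jobs, and cap the worker count.
n_jobs = min(os.cpu_count() or 1, 8) if len(texts) > 100000 else 1

tokenizer = MiNLPTokenizer(granularity='fine')
result = tokenizer.cut(texts, n_jobs=n_jobs)
```

On Windows, wrap the call in the `if __name__ == '__main__':` guard described in section 5.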

## 4. Custom User Lexicon

- Add via a file path
- Add via a list or a file path:
```python
from minlptokenizer.tokenizer import MiNLPTokenizer

tokenizer = MiNLPTokenizer('/path/to/your/lexicon/file', granularity='coarse')  # the constructor argument is the path to a user lexicon file
tokenizer = MiNLPTokenizer(file_or_list=['word1', 'word2'], granularity='fine')  # pass in a user-defined intervention lexicon
tokenizer = MiNLPTokenizer(file_or_list='/path/to/your/lexicon/file', granularity='coarse')  # the constructor argument is the path to a user lexicon file
```

## 5. Future Plans
## 5. Notes
Windows and Linux implement multi-processing differently: Linux creates processes via fork, while Windows starts new processes. When using multi-process tokenization (n_jobs>1) on Windows, make sure the call is placed after `if __name__ == '__main__':` (see the [official documentation](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing) for details), for example:
```python
from minlptokenizer.tokenizer import MiNLPTokenizer

# Multi-process tokenization on Windows
if __name__ == '__main__':
    texts = ['小米的价值观是真诚与热爱'] * 2048
    tokenizer = MiNLPTokenizer(granularity='fine')
    result = tokenizer.cut(texts, n_jobs=4)  # n_jobs: number of processes; default is 1 (multi-processing disabled)
```

## 6. Future Plans

MiNLP is Xiaomi's natural language processing platform, developed by the NLP team of Xiaomi AI Lab. It already provides dozens of modules covering lexical, syntactic, and semantic analysis, and is widely used across the company's business.
In the first phase we have open-sourced MiNLP's Chinese word segmentation; part-of-speech tagging, named entity recognition, syntactic parsing, and other modules will follow, as we work with developers to build a powerful, state-of-the-art NLP toolkit.

## 6. Contributing
## 7. Contributing

We welcome developers to contribute code to MiNLP-Tokenizer, and to submit issues and feedback of any kind.
See CONTRIBUTING.md for the development workflow.

## 7. Use in Academic Work
## 8. Acknowledgements

Thanks to the many developers in the community for their support, feedback, encouragement, and suggestions. Special thanks to the following developers for contributing PRs to MiNLP-Tokenizer:
- 2020.12.4 aseaday contributed code for multi-process tokenization, improving segmentation speed.

## 9. Use in Academic Work

If you use the MiNLP Chinese word segmentation tool in academic work, please cite it as follows:
- Chinese: 郭元凯, 史亮, 陈宇鹏, 孟二利, 王斌. MiNLP-Tokenizer:小米中文分词工具. 2020.
- English: Yuankai Guo, Liang Shi, Yupeng Chen, Erli Meng, Bin Wang. MiNLP-Tokenizer: XiaoMi Chinese Word Segmenter. 2020.

6 changes: 2 additions & 4 deletions minlp-tokenizer/minlptokenizer/config.py
@@ -18,19 +18,17 @@
'tokenizer_granularity': {
'fine': {
'model': 'model/zh/b-fine-cnn-crf-an2cn.pb',
'trans': 'trans/b-fine.300d.trans'
},
'coarse': {
'model': 'model/zh/b-coarse-cnn-crf-an2cn.pb',
'trans': 'trans/b-coarse.300d.trans'
}
},
'tokenizer_limit': {
'max_batch_size': 512,
'max_batch_size': 128,
'max_string_length': 1024
},
'lexicon_files': [
'lexicon/default.txt',
'lexicon/chengyu.txt',
],
]
}
63 changes: 0 additions & 63 deletions minlp-tokenizer/minlptokenizer/crf_viterbi.py

This file was deleted.

8 changes: 8 additions & 0 deletions minlp-tokenizer/minlptokenizer/exception.py
@@ -47,3 +47,11 @@ def __init__(self):

    def __str__(self):
        return '输入参数异常.'  # "Invalid input argument."


class ThreadNumberException(Exception):
    def __init__(self):
        super(Exception, self).__init__()

    def __str__(self):
        return '多进程数必须大于等于1'  # "The number of processes must be >= 1."
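
A minimal sketch of how this new exception is presumably meant to be used: rejecting an invalid n_jobs value before any worker processes are spawned. The validate_n_jobs helper below is illustrative and not part of the library:

```python
from minlptokenizer.exception import ThreadNumberException


def validate_n_jobs(n_jobs):
    """Illustrative check: the number of worker processes must be an int >= 1."""
    if not isinstance(n_jobs, int) or n_jobs < 1:
        raise ThreadNumberException()
    return n_jobs


try:
    validate_n_jobs(0)
except ThreadNumberException as err:
    print(err)  # 多进程数必须大于等于1 ("the number of processes must be >= 1")
```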
40 changes: 19 additions & 21 deletions minlp-tokenizer/minlptokenizer/lexicon.py
@@ -14,12 +14,13 @@

import ahocorasick
from collections import Iterable
import numpy as np
from minlptokenizer.tag import Tag

DEFAULT_INTERFERE_FACTOR = 2


class Lexicon:

def __init__(self, file_or_list=None):
self.ac = ahocorasick.Automaton(ahocorasick.STORE_LENGTH)
if file_or_list:
@@ -48,30 +49,27 @@ def add_words(self, file_or_list):
for word in filter(lambda t: t and not t.startswith('#'), file_or_list):
self.ac.add_word(word)

def parse_unary_score(self, text, unary_score):
def get_factor(self, texts):
"""
        Intervene on the emission weights
        :param text: original text
        :param unary_score: emission score matrix
        :return:
        Build the intervention weight matrix for the given sentences from the user lexicon
        :param texts: target sentences
        :return: intervention weight matrix
"""
if self.ac.get_stats()["nodes_count"] == 0:
return
if self.ac.kind is not ahocorasick.AHOCORASICK:
self.ac.make_automaton()
for (end_pos, length) in self.ac.iter(text):
start_pos = end_pos - length + 1
if length == 1:
unary_score[start_pos][1] = self.max_socre(unary_score[start_pos]) # S
else:
unary_score[start_pos][2] = self.max_socre(unary_score[start_pos]) # B
unary_score[end_pos][4] = self.max_socre(unary_score[end_pos]) # E
for i in range(start_pos + 1, end_pos):
unary_score[i][3] = self.max_socre(unary_score[i]) # M
return unary_score

def max_socre(self, scores):
return self.interfere_factor * abs(max(scores))
max_len = max(map(len, texts))
        factor_matrix = np.zeros(shape=[len(texts), max_len, Tag.__len__()])  # 0 in the matrix means no intervention; non-zero entries hold the intervention factor for that position
for index, text in enumerate(texts):
for (end_pos, length) in self.ac.iter(text):
start_pos = end_pos - length + 1
if length == 1:
factor_matrix[index][start_pos][1] = self.interfere_factor
else:
factor_matrix[index][start_pos][2] = self.interfere_factor
factor_matrix[index][end_pos][4] = self.interfere_factor
for i in range(start_pos + 1, end_pos):
factor_matrix[index][i][3] = self.interfere_factor
return factor_matrix

def set_interfere_factor(self, interfere_factor):
"""
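
A small usage sketch of the new intervention-matrix API above, assuming this commit's package layout. It only exercises Lexicon.add_words and Lexicon.get_factor and inspects the returned matrix; how the tokenizer feeds this matrix into the model graph is not shown:

```python
from minlptokenizer.lexicon import Lexicon

lexicon = Lexicon()
lexicon.add_words(['真诚', '热爱'])  # user-defined intervention words

texts = ['小米的价值观是真诚与热爱', '今天天气怎么样']
factors = lexicon.get_factor(texts)

# Shape: [number of sentences, length of the longest sentence, number of tags].
# Zeros mean "no intervention"; positions covered by a lexicon word carry the
# interference factor (DEFAULT_INTERFERE_FACTOR, 2 by default).
print(factors.shape)
print(factors.max())
```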
Binary file modified minlp-tokenizer/minlptokenizer/model/zh/b-fine-cnn-crf-an2cn.pb
