Merge pull request #1 from XiaoMi/develop
add minlp-tokenizer module
Showing 25 changed files with 23,836 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,6 @@ | ||
# MiNLP | ||
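The Xiaomi Natural Language Processing platform (MiNLP) provides dozens of functional modules covering lexical, syntactic, and semantic analysis, and is widely used across the company's products. | ||
|
||
MiNLP-Tokenizer, a Chinese word segmentation tool refined through continuous optimization and production use, was officially open-sourced in November 2020. | ||
|
||
We plan to open-source all lexical tools (part-of-speech tagging and named entity recognition) by Q2 2021. Starting in Q3 2021, we will gradually open-source syntactic analysis and some semantic analysis tools, working with developers to build a powerful NLP platform with leading accuracy. |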
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,140 @@ | ||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
*.py[cod] | ||
*$py.class | ||
|
||
# C extensions | ||
*.so | ||
|
||
# Distribution / packaging | ||
.DS_Store | ||
.idea/ | ||
.Python | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
downloads/ | ||
eggs/ | ||
.eggs/ | ||
lib/ | ||
lib64/ | ||
parts/ | ||
sdist/ | ||
var/ | ||
wheels/ | ||
share/python-wheels/ | ||
*.egg-info/ | ||
.installed.cfg | ||
*.egg | ||
MANIFEST | ||
|
||
# PyInstaller | ||
# Usually these files are written by a python script from a template | ||
# before PyInstaller builds the exe, so as to inject date/other infos into it. | ||
*.manifest | ||
*.spec | ||
|
||
# Installer logs | ||
pip-log.txt | ||
pip-delete-this-directory.txt | ||
|
||
# Unit test / coverage reports | ||
htmlcov/ | ||
.tox/ | ||
.nox/ | ||
.coverage | ||
.coverage.* | ||
.cache | ||
nosetests.xml | ||
coverage.xml | ||
*.cover | ||
*.py,cover | ||
.hypothesis/ | ||
.pytest_cache/ | ||
cover/ | ||
|
||
# Translations | ||
*.mo | ||
*.pot | ||
|
||
# Django stuff: | ||
*.log | ||
local_settings.py | ||
db.sqlite3 | ||
db.sqlite3-journal | ||
|
||
# Flask stuff: | ||
instance/ | ||
.webassets-cache | ||
|
||
# Scrapy stuff: | ||
.scrapy | ||
|
||
# Sphinx documentation | ||
docs/_build/ | ||
|
||
# PyBuilder | ||
.pybuilder/ | ||
target/ | ||
|
||
# Jupyter Notebook | ||
.ipynb_checkpoints | ||
|
||
# IPython | ||
profile_default/ | ||
ipython_config.py | ||
|
||
# pyenv | ||
# For a library or package, you might want to ignore these files since the code is | ||
# intended to run in multiple environments; otherwise, check them in: | ||
# .python-version | ||
|
||
# pipenv | ||
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. | ||
# However, in case of collaboration, if having platform-specific dependencies or dependencies | ||
# having no cross-platform support, pipenv may install dependencies that don't work, or not | ||
# install all needed dependencies. | ||
#Pipfile.lock | ||
|
||
# PEP 582; used by e.g. github.com/David-OConnor/pyflow | ||
__pypackages__/ | ||
|
||
# Celery stuff | ||
celerybeat-schedule | ||
celerybeat.pid | ||
|
||
# SageMath parsed files | ||
*.sage.py | ||
|
||
# Environments | ||
.env | ||
.venv | ||
env/ | ||
venv/ | ||
ENV/ | ||
env.bak/ | ||
venv.bak/ | ||
|
||
# Spyder project settings | ||
.spyderproject | ||
.spyproject | ||
|
||
# Rope project settings | ||
.ropeproject | ||
|
||
# mkdocs documentation | ||
/site | ||
|
||
# mypy | ||
.mypy_cache/ | ||
.dmypy.json | ||
dmypy.json | ||
|
||
# Pyre type checker | ||
.pyre/ | ||
|
||
# pytype static type analyzer | ||
.pytype/ | ||
|
||
# Cython debug symbols | ||
cython_debug/ |
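The `*.py[cod]` and `*$py.class` entries in the ignore list above are glob patterns: the bracket expression `[cod]` matches exactly one of the characters `c`, `o`, or `d`, and `$` is a literal character. The sketch below illustrates this with Python's `fnmatch` module, which treats these simple basename globs the same way. It is only an illustration under that assumption: `fnmatch` is not a gitignore implementation and does not model directory-only patterns such as `build/` or anchored patterns such as `/site`. The file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen only to exercise the two patterns.
names = ["module.py", "module.pyc", "module.pyo", "module.pyd", "Foo$py.class"]


def ignored_by(pattern, candidates):
    """Return the candidates whose name matches the glob pattern."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


# "[cod]" requires exactly one extra character, so plain "module.py" is kept.
print(ignored_by("*.py[cod]", names))
# "$" has no special meaning in a glob and is matched literally.
print(ignored_by("*$py.class", names))
```

To check how git itself treats a given path, `git check-ignore -v <path>` reports the matching rule, if any.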
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
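# How to Contribute | ||
|
||
Contributions to MiNLP-Tokenizer are welcome; sharing your expertise helps more developers. | ||
When contributing, please make sure that: | ||
|
||
- You comment your changes where possible so that reviewers can understand them quickly. | ||
- If you are introducing a new feature, you describe the idea with examples in the pull request, and all test cases pass once development is complete. | ||
- Your changes are tested on the SIGHAN 2005 PKU test set and the F1 score is recorded. | ||
- If you are fixing an existing bug, the unit tests include a case that reproduces the problem. | ||
- No personal information appears in the code. | ||
- Multiple commits are squashed before you submit a pull request, and unrelated features are submitted as separate requests. Small changes may be submitted together. | ||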
|
||
|
||
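## 1. SIGHAN 2005 PKU Test Set | ||
|
||
Download [icwb2-data](http://sighan.cs.uchicago.edu/bakeoff2005/); the SIGHAN-PKU test set it contains can be used to evaluate word segmentation. | ||
Note: Based on our company's application scenarios, we defined coarse-grained and fine-grained segmentation standards and re-annotated the PKU test set accordingly (the re-annotated set is not included in this project because of copyright restrictions on the test set). | ||
Because the segmentation standards differ, evaluation results on the official SIGHAN-PKU test set may be somewhat lower. | ||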
|
||
## 2. GitHub Workflow | ||
The basic workflow is: | ||
|
||
(1) Fork [MiNLP-Tokenizer](https://github.com/XiaoMi/MiNLP) into your own git repository | ||
|
||
``` | ||
https://github.com/XiaoMi/MiNLP | ||
``` | ||
|
||
(2) Clone from your own git repository | ||
|
||
``` | ||
git clone git@github.com:<username>/MiNLP.git | ||
``` | ||
|
||
Replace <username> with your own git account name. | ||
|
||
|
||
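(3) Create your own feature branch for development. This assumes a remote named `upstream` that points to XiaoMi/MiNLP has already been added and fetched, for example with `git remote add upstream https://github.com/XiaoMi/MiNLP.git` followed by `git fetch upstream`. | ||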
|
||
``` | ||
git checkout -b feature-xxx remotes/upstream/develop | ||
``` | ||
|
||
(4) Keep your branch in sync with the current develop branch, then push it | ||
``` | ||
git rebase -i upstream/develop | ||
git push origin feature-xxx | ||
``` | ||
|
||
(5) Open a pull request from your fork and write a clear, informative description of the changes | ||
|
||
## 3. Merging | ||
|
||
The project maintainers, [email protected] and [email protected], will review each pull request and merge suitable code. Thanks again to all developers for their contributions to the project. |
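The contribution guidelines above ask for the F1 score on the SIGHAN 2005 PKU test set to be recorded with each change. As a rough illustration of what that number measures, here is a minimal sketch of word-level precision, recall, and F1 for segmentation, where a predicted word counts as correct only if both of its boundaries match a gold word. The function names and the sample sentence are assumptions made for this example and are not part of MiNLP-Tokenizer; for reported results, the scoring tools distributed with the bakeoff data are the better choice.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_sentences, pred_sentences):
    """Return word-level (precision, recall, F1) over parallel sentences.

    Each sentence is a list of words. Both segmentations must cover the
    same underlying text, otherwise the spans are not comparable.
    """
    n_gold = n_pred = n_correct = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        if "".join(gold) != "".join(pred):
            raise ValueError("gold and predicted sentences differ in text")
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
        n_correct += len(gold_spans & pred_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction splits one gold word in two.
gold = [["小米", "自然语言处理", "平台"]]
pred = [["小米", "自然", "语言处理", "平台"]]
p, r, f1 = segmentation_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.500 R=0.667 F1=0.571
```

Comparing character spans rather than word strings keeps repeated words in a sentence from being matched to the wrong position.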