add minlp-tokenizer module
郭元凯 committed Nov 17, 2020
1 parent 9b1988c commit 3f9da41
Showing 25 changed files with 23,836 additions and 0 deletions.
4 changes: 4 additions & 0 deletions LICENSE
@@ -198,4 +198,8 @@
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
<<<<<<< HEAD
limitations under the License.
=======
limitations under the License.
>>>>>>> 750fa67... add README.md
5 changes: 5 additions & 0 deletions README.md
@@ -1 +1,6 @@
# MiNLP
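The Xiaomi Natural Language Processing platform (MiNLP) provides dozens of functional modules covering lexical, syntactic, and semantic analysis, and is already widely used in the company's products.

MiNLP-Tokenizer, a Chinese word segmentation tool, has been continuously optimized and refined in production use, and was officially open-sourced in November 2020.

We plan to finish open-sourcing all lexical tools (part-of-speech tagging and named entity recognition) in Q2 2021. Starting in Q3 2021, we will gradually open-source the syntactic analysis tools and some of the semantic analysis tools, working with developers to build a powerful NLP platform with leading performance.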
140 changes: 140 additions & 0 deletions minlp-tokenizer/.gitignore
@@ -0,0 +1,140 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.DS_Store
.idea/
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/
54 changes: 54 additions & 0 deletions minlp-tokenizer/CONTRIBUTING.md
@@ -0,0 +1,54 @@
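# How to Contribute

Contributions to MiNLP-tokenizer are welcome; sharing your knowledge helps other developers.
When contributing, please make sure that:

- You comment your changes where possible, so that reviewers can understand them quickly.
- If you are introducing a new feature, you describe the idea and give examples in the pull request, and all test cases pass once development is complete.
- Your changes are tested on the SIGHAN 2005 PKU test set and the F1 score is recorded.
- If you are fixing an existing bug, the unit tests include a case that reproduces the problem.
- No personal information appears in the code.
- You squash multiple commits before submitting a pull request, and submit separate requests for unrelated features. Small changes may be submitted together.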


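## 1. SIGHAN 2005 PKU test set

Download [icwb2-data](http://sighan.cs.uchicago.edu/bakeoff2005/); its SIGHAN-PKU test set can be used to evaluate word segmentation.
Note: based on our company's application scenarios, we defined coarse-grained and fine-grained segmentation standards and re-annotated the PKU test set accordingly (because of copyright restrictions on the test set, it is not included in this project).
Because the segmentation standards differ, scores measured on the official SIGHAN-PKU test set may be lower.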
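
The F1 requirement above can be checked with a small script. The following is a minimal sketch and not part of MiNLP: the function names `to_spans` and `segmentation_prf` are hypothetical, and it assumes the gold and predicted segmentations are already loaded as one list of words per sentence. It scores word-level precision, recall, and F1 by comparing character spans, a common convention for segmentation evaluation.

```python
from typing import Iterable, List, Set, Tuple


def to_spans(words: Iterable[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        if not word:  # skip empty tokens so they do not create zero-width spans
            continue
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(
    gold_sents: List[List[str]], pred_sents: List[List[str]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over all sentences (hypothetical helper)."""
    if len(gold_sents) != len(pred_sents):
        raise ValueError("gold and predicted data have different sentence counts")

    n_gold = n_pred = n_correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        # Both segmentations must cover exactly the same characters.
        if "".join(gold) != "".join(pred):
            raise ValueError("gold and predicted sentences differ in text")
        gold_spans = to_spans(gold)
        pred_spans = to_spans(pred)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
        # A predicted word is correct only if both of its boundaries match.
        n_correct += len(gold_spans & pred_spans)

    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["小米", "自然", "语言", "处理"]]
    pred = [["小米", "自然语言", "处理"]]
    p, r, f = segmentation_prf(gold, pred)
    print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")
```

Scores from this sketch may differ slightly from those of the scoring script shipped with icwb2-data, so state which scorer was used when recording F1 in a pull request.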

## 2. GitHub workflow
The basic workflow is as follows:

(1) Fork [MiNLP-Tokenizer](https://github.com/XiaoMi/MiNLP) into your own GitHub account

```
https://github.com/XiaoMi/MiNLP
```

(2) Clone from your own fork

```
git clone git@github.com:<username>/MiNLP.git
```

Replace `<username>` with your own GitHub username.


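(3) Create your own feature branch for development (the command assumes a remote named `upstream`, pointing at the XiaoMi/MiNLP repository, has already been added and fetched)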

```
git checkout -b feature-xxx remotes/upstream/develop
```

(4) Keep your branch in sync with the current develop branch, then push it
```
git rebase -i upstream/develop
git push origin feature-xxx
```

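(5) Open a pull request from your fork and write a clear, useful description of the changes

## 3. Merging

The project maintainers ([email protected][email protected]) will review each pull request and merge suitable code. Thank you again to all developers who contribute to the project.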