GitHub - callmeYe/SLM: Segmental Language Models for CWS

Segmental Language Models

A PyTorch implementation.

Dependencies

Python 3.6.3 :: Anaconda custom (64-bit)

PyTorch: 0.3.0.post4

NumPy: 1.13.3

Pandas: 0.20.3

Setup

For example, to train the model on the pku dataset, prepare the following files in the "data" directory:

pku.txt

unsegmented original training data

pku_test.txt

unsegmented original test data

pku_test_gold.txt

segmented data, gold standard for the test data

_pku.txt

This file contains the preprocessed training sentences, for example:

附图片张

_pku_test.txt

The same as _pku.txt, but contains the test sentences.

supervised_pku.txt

additional supervised data for pku (1024 sentences), for example:

迈向 | 充满 | 希望 | 的 | 新 | 世纪 | 附 | 图片 | | 张 |
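A minimal sketch of reading one line of the supervised data, assuming segments are separated by "|" with optional surrounding spaces, as in the example above. The function names are hypothetical and are not taken from the repository. Empty fields are simply dropped here, which may differ from what the repository's own loader does.

```python
def parse_supervised_line(line, sep="|"):
    """Split one supervised line into its segments.

    Assumption: segments are separated by `sep`; whitespace around each
    segment is ignored and empty fields are discarded.
    """
    return [seg.strip() for seg in line.split(sep) if seg.strip()]


def to_unsegmented(segments):
    """Join segments back into the raw, unsegmented sentence."""
    return "".join(segments)


segments = parse_supervised_line("迈向 | 充满 | 希望 | 的 | 新 | 世纪 |")
print(segments)
print(to_unsegmented(segments))
```

Joining the segments again gives the unsegmented form, which is a quick way to check that a supervised sentence matches its raw counterpart.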

Pretrained Word Embedding

Put "unigram256.txt" in the "models" directory. You can change the number in the name to match the dimension of the word embedding you actually use.
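To confirm that the number in the file name matches the vectors inside the file, a check along the following lines may help. It is only a sketch: it assumes a plain-text layout with one token per line followed by whitespace-separated values, optionally preceded by a word2vec-style "count dim" header. The README does not specify the file format, and the helper name embedding_dim is hypothetical.

```python
def embedding_dim(lines):
    """Return the vector dimension shared by all rows.

    Assumption: each non-empty row is `token v1 v2 ... vd`.
    A first line made of exactly two integers is treated as a
    word2vec-style header and skipped.
    Raises ValueError if rows disagree or no rows are found.
    """
    dims = set()
    for i, line in enumerate(lines):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # looks like a "<count> <dim>" header
        dims.add(len(parts) - 1)  # everything after the token is the vector
    if len(dims) != 1:
        raise ValueError("inconsistent or missing vector rows: %r" % sorted(dims))
    return dims.pop()


# Usage on a real file (path from this README, encoding is an assumption):
# with open("models/unigram256.txt", encoding="utf-8") as f:
#     assert embedding_dim(f) == 256
sample = ["我 0.1 0.2 0.3", "你 0.4 0.5 0.6"]
print(embedding_dim(sample))
```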

Training

After preparing the data in the "data" directory, just run:

python train.py

Testing is also performed during training.

The training and test results are best viewed in TensorBoard. The TensorBoard logs can be found in the "logs" directory.
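The README does not say which metric the test step reports. A common choice for Chinese word segmentation is word-level precision, recall, and F1 against the gold segmentation, and the sketch below shows that computation under this assumption. It is not taken from evaluate.py, and the function names are hypothetical.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Word-level precision, recall, and F1 for one sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# The prediction splits "希望" into two characters, so only two of its
# four words match the three gold words.
print(prf(["充满", "希望", "的"], ["充满", "希", "望", "的"]))
```

For a whole test set, the counts of correct, predicted, and gold words would normally be summed over all sentences before computing the ratios.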

Model Config

Remember to set

DATA = 'pku'

in config.py. Other settings can also be changed in this file.

Set BATCH2 to 0 for unsupervised training.
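Before training, it can be useful to check that the files listed under Setup exist for the chosen DATA value. The sketch below assumes the file names are derived from DATA exactly as in the pku example; this is inferred from the README, not confirmed from the code. Whether the supervised file is still required when BATCH2 is 0 is not stated, so it is left as a parameter. Both function names are hypothetical.

```python
import os


def expected_data_files(data, data_dir="data", supervised=True):
    """List the files the Setup section describes for a dataset name."""
    patterns = [
        "{}.txt",            # unsegmented training data
        "{}_test.txt",       # unsegmented test data
        "{}_test_gold.txt",  # gold-standard segmentation of the test data
        "_{}.txt",           # preprocessed training sentences
        "_{}_test.txt",      # preprocessed test sentences
    ]
    if supervised:
        patterns.append("supervised_{}.txt")  # additional supervised data
    return [os.path.join(data_dir, p.format(data)) for p in patterns]


def missing_data_files(data, data_dir="data", supervised=True):
    """Return the expected files that are not present on disk."""
    return [p for p in expected_data_files(data, data_dir, supervised)
            if not os.path.isfile(p)]


DATA = 'pku'  # same value as in config.py
for path in missing_data_files(DATA):
    print("missing:", path)
```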

Results

Results can be found in the "results" directory. The "result*" files contain the original results; post-processing is applied to each "result*" file to produce the corresponding "improved_result*" file.

