A PyTorch implementation.
Python 3.6.3 :: Anaconda custom (64-bit)
PyTorch: 0.3.0.post4
Numpy: 1.13.3
Pandas: 0.20.3
For example, to train the model on the PKU dataset, prepare the following files in the "data" directory:
unsegmented original training data
unsegmented original test data
segmented data, gold standard for the test data
This file contains the preprocessed training sentences, for example:
附 图 片 张
Same format as _pku.txt, but contains the test sentences.
additional supervised data for PKU (1024 sentences), for example:
迈 向 | 充 满 | 希 望 | 的 | 新 | 世 纪 | 附 | 图 片 | | 张 |
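The segmented format shown above can be read back into words with a few lines of Python. This is only a sketch based on the example line: the helper names `parse_segmented` and `boundary_labels` are hypothetical and not part of this repository, and dropping the empty chunk produced by "| |" is an assumption about how such gaps should be handled.

```python
def parse_segmented(line):
    """Split a '|'-delimited line into words.

    Characters inside a word are separated by spaces in the file,
    so each chunk is re-joined into a single string.
    """
    words = []
    for chunk in line.split("|"):
        chars = chunk.split()
        if chars:  # assumption: skip empty chunks such as the one from "| |"
            words.append("".join(chars))
    return words


def boundary_labels(words):
    """Return one label per character: 1 if the character ends a word, else 0."""
    labels = []
    for word in words:
        labels.extend([0] * (len(word) - 1) + [1])
    return labels


line = "迈 向 | 充 满 | 希 望 | 的 | 新 | 世 纪 | 附 | 图 片 | | 张 |"
words = parse_segmented(line)
labels = boundary_labels(words)
print(words)
print(labels)
```

The per-character labels are one common way to represent segmentation boundaries; the training code here may use a different encoding.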
Put "unigram256.txt" in the "models" directory. You can change the number (256) to match the embedding dimension you actually use.
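A quick sanity check of the embedding dimension could look like the sketch below. It assumes a plain-text file in which every line is a token followed by its whitespace-separated vector components; this README does not specify the format of "unigram256.txt", so that layout and the helper name `embedding_dim` are assumptions. A word2vec-style header line is not handled.

```python
import io


def embedding_dim(lines):
    """Infer the vector dimension from an iterable of 'token v1 v2 ...' lines."""
    dims = set()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or token-only lines
        dims.add(len(parts) - 1)  # everything after the token is a component
    if len(dims) != 1:
        raise ValueError("inconsistent or missing dimensions: %s" % sorted(dims))
    return dims.pop()


# Tiny in-memory stand-in for an embedding file (hypothetical 3-dimensional vectors).
sample = io.StringIO("的 0.1 0.2 0.3\n新 0.4 0.5 0.6\n")
dim = embedding_dim(sample)
print(dim)
```

For a real file, an open file handle can be passed in place of `sample`, and the returned value compared against the number in the file name.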
After preparing the data in the "data" directory, just run:
python train.py
Testing is also performed during training.
The training and test results are best viewed in TensorBoard. The TensorBoard logs can be found in the "logs" directory.
Remember to set
DATA = 'pku'
in config.py. Other configuration options can also be modified in this file.
Set BATCH2 to 0 for unsupervised training.
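Put together, the two settings named above would look roughly like this in config.py. Only `DATA` and `BATCH2` come from this README; the remaining options in the file are omitted here, and the comments reflect our reading of the notes above rather than the file's own documentation.

```python
# config.py (fragment): only the two options mentioned in this README
DATA = 'pku'   # dataset to train on
BATCH2 = 0     # 0 selects unsupervised training, per the note above
```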
Results can be found in the "results" directory. The "result*" files are the original results; post-processing is applied to each "result*" file to produce the corresponding "improved_result*" file.