A wrapper around the Jieba Chinese word segmentation library.
pip install segjb
(dependency: jieba)
- Lazy initialization.
- Initialization with a user-defined dictionary.
- Built-in stopword and punctuation dictionaries.
- Output control: stopwords, punctuation, minimum word length, output delimiter, etc.
- N-gram support.
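Lazy initialization, as listed above, generally means that dictionaries are loaded on first use rather than at construction time. The following is a minimal sketch of that pattern only. The class `LazySegmenter` and its members are hypothetical and are not SegJb's actual implementation.

```python
class LazySegmenter:
    """Hypothetical illustration of lazy initialization (not SegJb's code)."""

    def __init__(self):
        self._ready = False
        self.stopwords = set()
        self.load_count = 0  # how many times resources were loaded

    def init(self, stopwords=("的", "了")):
        # Stand-in for reading dictionary files from disk.
        self.stopwords = set(stopwords)
        self.load_count += 1
        self._ready = True

    def _ensure_ready(self):
        # Fall back to defaults only if init() was never called explicitly.
        if not self._ready:
            self.init()

    def is_stopword(self, word):
        self._ensure_ready()
        return word in self.stopwords


seg = LazySegmenter()           # cheap: nothing is loaded yet
first = seg.is_stopword("的")    # triggers the one-time load
second = seg.is_stopword("猫")   # reuses the already loaded data
```

With this pattern, calling init() explicitly is only needed when the defaults should be overridden.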
init(stopwords_file, puncs_file, user_dict, silent, main_dict, thread)
-- Initialize the segmentation utility instance.
- return: void.
- stopwords_file: stopword dictionary. Use "" to disable. [SegJb.DEFAULT_STPW]
- puncs_file: punctuation dictionary. Use "" to disable. [SegJb.DEFAULT_PUNC]
- user_dict: user-defined dictionary to load. Use "" to disable. [SegJb.DEFAULT_DICT]
- silent: whether to suppress the initialization log. [True]
- thread: number of parts to split the corpus into for parallel processing. [1]
set_param(delim, min_word_len, ngram, keep_stopwords, keep_puncs)
-- Set one or more parameters of the segmentation utility instance. Refer to parameter description.
- return: void
-- Cut a sentence into a list of words according to the current configuration.
- return: list.
- corp: Unicode or UTF-8 encoded sentence.
-- Cut a sentence into a string joined by a delimiter (configurable via set_param).
- return: unicode string.
- corp: Unicode or UTF-8 encoded sentence.
- delim: delimiter used to join the segmentation result into a string. [' ']
- min_word_len: words shorter than min_word_len are dropped from the segmentation result. [1]
- ngram: n-gram order of the result; 1 yields plain words. [1]
- keep_stopwords: whether to keep stopwords in the result. [True]
- keep_puncs: whether to keep punctuation in the result. [True]
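To make the interaction of these parameters concrete, here is a rough sketch of how such output control could be applied to an already segmented token list. The function `postprocess` is hypothetical. The order of the filters and the way n-grams are built (concatenating adjacent kept tokens) are assumptions, not SegJb's documented behavior.

```python
def postprocess(tokens, stopwords=frozenset(), puncs=frozenset(),
                min_word_len=1, ngram=1,
                keep_stopwords=True, keep_puncs=True):
    """Hypothetical filter pipeline mirroring the parameters above."""
    kept = []
    for tok in tokens:
        if not keep_puncs and tok in puncs:
            continue  # drop punctuation
        if not keep_stopwords and tok in stopwords:
            continue  # drop stopwords
        if len(tok) < min_word_len:
            continue  # drop tokens that are too short
        kept.append(tok)
    if ngram <= 1:
        return kept
    # Assumption: an n-gram is the concatenation of n adjacent kept tokens.
    return ["".join(kept[i:i + ngram]) for i in range(len(kept) - ngram + 1)]


tokens = ["我", "爱", "北京", "天安门", "。"]  # pretend segmenter output
stopwords = {"我"}
puncs = {"。"}

words = postprocess(tokens, stopwords, puncs,
                    keep_stopwords=False, keep_puncs=False)
bigrams = postprocess(tokens, stopwords, puncs, ngram=2,
                      keep_stopwords=False, keep_puncs=False)
joined = " ".join(words)  # delim=' ' applied to the list result
```

In this sketch, the string form is simply the list form joined by delim, which matches the relationship between the two cut functions described earlier.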
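from segjb import SegJb
# Create the segmenter instance.
hdl_seg = SegJb()
# Configure output: space-delimited bigrams, keep stopwords, drop punctuation.
hdl_seg.set_param(delim=' ', ngram=2, keep_stopwords=True, keep_puncs=False)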
- Bigdict from iLife([email protected])