GitHub - gabor-recski/HunTag: a sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models

gabor-recski / HunTag Public

forked from recski/HunTag

Notifications You must be signed in to change notification settings
Fork 0
Star 0

a sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models

LGPL-3.0 license

0 stars 10 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
configs		configs
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README		README
TODO		TODO
bigram.py		bigram.py
eval.py		eval.py
feature.py		feature.py
feature_select.py		feature_select.py
features.py		features.py
huntag.py		huntag.py
lexicon.py		lexicon.py
liblinear.patch		liblinear.patch
optfunc.py		optfunc.py
tagger.py		tagger.py
tools.py		tools.py
trainer.py		trainer.py
viterbi.py		viterbi.py

Repository files navigation

Huntag - a sequential tagger for NLP combining the linear classificator Liblinear and Hidden Markov Models
Based on training data, Huntag can perform any kind of sequential sentence
tagging and has been used for NP chunking and Named Entity Recognition.

=== REQUIREMENTS ===
HunTag uses the Liblinear package, which can be downloaded from:

http://www.csie.ntu.edu.tw/~cjlin/liblinear/

In order for HunTag to work, Liblinear should be compiled with python bindings and the directory containing the python files `liblinearutil.py' and `liblinear.py' should be added to the environment variable PYTHONPATH.

IMPORTANT: after installing Liblinear, the python bindings must be patched by cd-ing to the python subdirectory of your liblinear installation and running
patch < (*path-to-HunTag*)/liblinear.patch
This allows liblinear to handle the more memory-efficient cType input used by HunTag

=== DATA FORMAT ===

Input data must be a tab-separated file with one word per line and an empty
line to mark sentence boundaries. Each line must contain the same number of
fields and the last field must contain the correct tag for the word, which
may be in the BI format used at CoNLL shared tasks (e. g. B-NP to mark the
first word of a noun phrase, I-NP to mark the rest and O to mark words
outside an NP) or in the so-called BIE1 format which has a seperate symbol
for words constituting a chunk themselves (1-NP) and one for the last words
of multi-word phrases (E-NP). The first two characters of answer tags
should always conform to one of these two conventions, the rest may be any
string describing the category. 

=== FEATURES ===

The flexibility of Huntag comes from the fact that it will generate any kind
of features from the input data given the appropriate python functions.
Several dozens of features used regularly in NLP tasks are already
implemented in the file features.py, however the user is encouraged to add
any number of her own.

Once the desired features are implemented, a data set and a configuration
file containing the list of feature functions to be used are all Huntag
needs to perform training and tagging.

=== CONFIG FILE ===
The configuration file lists the features that are to be used for a given task. The feature file may start with a command specifying the default radius for features. This is non-mandatory. Example:
!defaultRadius 5

Next, it can give values to variables that shall be used by the featurizing methods.
For example, the following three lines set the parameters of the feature called krpatt

let krpatt minLength 2
let krpatt maxLength 99
let krpatt lang hu

The second field specifies the name of the feature, the third a key, the fourth a numeric value. The dictionary of key-value pairs will be passed to the feature.

After this come the actual assignments of feture names to features. Examples:

token ngr ngrams 0
sentence bwsamecases isBetweenSameCases 1
lex street hunner/lex/streetname.lex 0
token lemmalowered lemmaLowered 0,2

The first keyword can have three values, token, lex and sentence. For example, in the first example line above, the feature name ngr will be assigned to the python function ngrams() that returns a feature string for the given token. The third argument is a column or comma-separated list of columns. It specifies which fields of the input should be passed to the feature function. Counting starts from zero.

For sentence features, the input is aggregated sentence-wise into a list, and this list is then passed to the feature function. This function should return a list consisting of one feature string for each of the tokens of the sentence.

For lex features, the second argument specifies a lexicon file rather than a python function name. The specified token field is matched against this lexicon file.


=== USAGE ===
HunTag may be run in any of the following three modes:

TRAIN
used to train a Liblinear model given a training corpus and a set of feature functions. When run in TRAIN mode, HunTag creates three files, one containing the liblinear mode and two listing features and labels and the integers they are mapped to when passed to liblinear. With the --model option set to NAME, the three files will be stored under NAME.model, NAME.featureNumbers and NAME.labelNumbers respectively.

cat TRAINING_DATA | python huntag.py train OPTIONS

Mandatory options:
    -c FILE, --config-file=FILE
        read feature configuration from FILE
    -m NAME, --model=NAME
        name of liblinear model and lists
    -p PARAMS --parameters=PARAMS
        pass PARAMS to liblinear trainer

Non-mandatory options:    
    -f FILE, --feature-file=FILE
        write training events to FILE


BIGRAM-TRAIN
Used to train a bigram language model using a given field of the training data

cat TRAINING_DATA | python huntag.py bigram-train OPTIONS

Mandatory options:
    -b FILE, --bigram-model=FILE
        name of bigram model file to be written
    -t FIELD, --tag-field=FIELD
        specify FIELD containing the tags to build bigram

TAG
Used to tag input. Given a maxent model providing the value P(t|w) for all tags t and words (set of feature values) w, and a bigram language model supplying P(t|t0) for all pairs of tags, HunTag will assign to each sentence the most likely tag sequence.

cat INPUT | python huntag.py tag OPTIONS

Mandatory options:
    -m NAME, --model=NAME
        name of liblinear model file and lists
    -b FILE, --bigram-model=FILE
        name of bigram model file
    -c FILE, --config-file=FILE
        read feature configuration from FILE

Non-mandatory options:
    -l L, --language-model-weight=L
        set weight of the language model to L (default is 1)