Named Entity Recognition

Input

The NERPipe module requires text annotated with morphological information, provided in a format based on the CoNLL-X Shared Task data specification (http://ilk.uvt.nl/conll/). Each row of the input file corresponds to a single token in the text and consists of a number of attributes separated by tab characters.

CONLL format description

Field number	Field name	Description
1	ID	Token counter, starting at 1 for each new sentence.
2	FORM	Word form or punctuation symbol.
3	LEMMA	Lemma of word form, or an underscore if not available.
4	CPOSTAG	Coarse-grained part-of-speech tag (see http://www.semti-kamols.lv/doc_upl/TagSet.html for more detailed tagset information). Produced from POSTAG by replacing the positions that carry lexical information with underscores.
5	POSTAG	Fine-grained part-of-speech tag.
6	FEATS	Unordered set of morphological features, separated by a vertical bar, or an underscore if not available.
7	HEAD	Head of the current token, which is either a value of ID or zero ('0') or underscore if not available.
8	NER	Named entity category. Consecutive tokens with the same category denote a single named entity.

For NER Pipe the mandatory fields are ID, FORM, LEMMA, CPOSTAG, POSTAG and FEATS; any columns after these are omitted.
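The column layout above can be read with a few lines of Python. This is a minimal sketch and not part of LVTagger: the names read_sentences and parse_feats are hypothetical, and it assumes that sentences are separated by blank lines and that FEATS entries have the form key=value.

```python
from typing import Dict, Iterable, Iterator, List

# Column names from the table above; the first six are mandatory for NER Pipe.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG", "FEATS", "HEAD", "NER"]
MANDATORY = 6


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts; a blank line ends a sentence."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        if len(columns) < MANDATORY:
            raise ValueError(
                f"expected at least {MANDATORY} columns, got {len(columns)}"
            )
        # zip stops at the shorter sequence, so missing trailing columns are skipped.
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        yield sentence


def parse_feats(feats: str) -> Dict[str, str]:
    """Split 'a=1|b=2' into a dict; an underscore means no features (assumed key=value form)."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|") if "=" in item)
```

Rows with fewer than six columns are rejected here because the six leading fields are mandatory; a real pipeline might prefer to log and skip such rows instead.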

Sample input

1	Mārtiņš	mārtiņš	n_msn_	n_msn1	Galotnes_nr=14|Vārdgrupas_nr=2|Vārdšķira=Lietvārds|Vārds=Mārtiņš|Skaitlis=Vienskaitlis|Minēšana=Minēšana_pēc_galotnes|Deklinācija=1|Mija=0|Dzimte=Vīriešu|Locījums=Nominatīvs|Pamatforma=mārtiņš|Avots=minējums_pēc_galotnes	
2	Bondars	bondars	n_msn_	n_msn1	Galotnes_nr=1|Vārdgrupas_nr=1|Vārdšķira=Lietvārds|Vārds=Bondars|Skaitlis=Vienskaitlis|Minēšana=Minēšana_pēc_galotnes|Deklinācija=1|Mija=0|Dzimte=Vīriešu|Locījums=Nominatīvs|Pamatforma=bondars|Avots=minējums_pēc_galotnes	
3	ir	būt	v__i___30__	vcnipii30an	Galotnes_nr=1158|Vārdšķira=Darbības_vārds|Konjugācija=Nekārtns|Avota_pamatforma=būt|Vārds=ir|Minēšana=Nav|Mija=0|Laiks=Tagadne|Pamatforma=būt|Vārdgrupas_nr=29|Leksēmas_nr=28668|Skaitlis=Nepiemīt|Atgriezeniskums=Nē|Izteiksme=Īstenības|Transitivitāte=Nepārejošs|Darbības_vārda_tips=Palīgverbs_'būt'|Noliegums=Nē|Persona=3|Kārta=Darāmā
4	dzimis	dzimt	v__pdmsn__n	vmnpdmsnasn	Galotnes_nr=897|Konjugācija=1|Vārdšķira=Darbības_vārds|Avota_pamatforma=dzimt|Vārds=dzimis|Minēšana=Nav|Mija=14|Locījums=Nominatīvs|Pamatforma=dzimt|Laiks=Pagātne|Vārdgrupas_nr=15|Leksēmas_nr=26489|Skaitlis=Vienskaitlis|Atgriezeniskums=Nē|Noteiktība=Nenoteiktā|Izteiksme=Divdabis|Transitivitāte=Nepārejošs|Darbības_vārda_tips=Patstāvīgs_darbības_vārds|Dzimte=Vīriešu|Noliegums=Nē|Lokāmība=Lokāms|Kārta=Darāmā
5	1971.	1971.	xo	xo	Vārdšķira=Reziduālis|Vārds=1971.|Reziduāļa_tips=Kārtas_skaitlis_cipariem|Pamatforma=1971.	
6	g.	g.	y	y	Vārdšķira=Saīsinājums|Galotnes_nr=1158|Vārdgrupas_nr=29|Avota_pamatforma=g.|Vārds=g.|Leksēmas_nr=50535|Minēšana=Nav|Mija=0|Pamatforma=g.	
7	31.	31.	xo	xo	Vārdšķira=Reziduālis|Vārds=31.|Reziduāļa_tips=Kārtas_skaitlis_cipariem|Pamatforma=31.	
8	decembrī	decembris	n_msl_	ncmsl2	Vārdšķira=Lietvārds|Galotnes_nr=31|Avota_pamatforma=decembris|Vārds=decembrī|Minēšana=Nav|Deklinācija=2|Mija=0|Locījums=Lokatīvs|Pamatforma=decembris|Vārdgrupas_nr=3|Leksēmas_nr=8141|Skaitlis=Vienskaitlis|Lietvārda_tips=Sugas_vārds|Dzimte=Vīriešu|Avots=Valērija_leksikons	
9	,	,	zc	zc	Vārdšķira=Pieturzīme|Galotnes_nr=1158|Vārdgrupas_nr=29|Avota_pamatforma=,|Vārds=,|Leksēmas_nr=27198|Minēšana=Nav|Mija=0|Pieturzīmes_tips=Komats|Pamatforma=,	
10	Rīgā	Rīga	n_fsl_	npfsl4	Vārdšķira=Lietvārds|Galotnes_nr=79|Avota_pamatforma=rīga|Vārds=Rīgā|Minēšana=Nav|Deklinācija=4|Mija=0|Locījums=Lokatīvs|Pamatforma=Rīga|Vārdgrupas_nr=7|Leksēmas_nr=8855|Skaitlis=Vienskaitlis|Lietvārda_tips=Īpašvārds|Dzimte=Sieviešu|Avots=Valērija_leksikons	
11	.	.	zs	zs	Vārdšķira=Pieturzīme|Galotnes_nr=1158|Vārdgrupas_nr=29|Avota_pamatforma=.|Vārds=.|Leksēmas_nr=27194|Minēšana=Nav|Mija=0|Pieturzīmes_tips=Punkts|Pamatforma=.

To produce the necessary input format from plain text, run morphotagger.bat -conll-x -paragraphs < sample.txt. Note that all plain text files must use UTF-8 encoding and must not contain blank lines, because a blank line denotes the end of the file.
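Both requirements on the plain text (UTF-8 encoding, no blank lines) can be checked before the text is passed to the tagger. The helper below is a hypothetical pre-processing sketch, not something shipped with LVTagger; it fails loudly on input that is not valid UTF-8 and drops empty or whitespace-only lines.

```python
def prepare_plain_text(data: bytes) -> str:
    """Decode UTF-8 input and drop blank lines so the tagger does not stop early."""
    # decode() raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = data.decode("utf-8")
    kept = [line for line in text.splitlines() if line.strip()]
    return "".join(line + "\n" for line in kept)
```

The cleaned text can then be written back to a file and redirected into the morphotagger script as described above.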

Running

For named entity recognition, run the included ner_pipe.sh (.bat). Three blank lines denote the end of the file.

Arguments

Argument	Description
-h	help
-conll-in (default)	CONLL shared task data format - one line per token, with tab delimited columns, sentences separated by blank lines.
-conll-x (default)	CONLL-X shared task data format - one line per token, with tab-delimited columns, sentences separated by blank lines.
-loadClassifier	serialized classifier (default "lv-ner-model.ser.gz")
-simple	simplified output for NER output comparison (FORM
-toFeatures	add NER column to feature string ("ner=...")
-saveExtraColumns	save extra columns after FEATS in output
-whiteList	files containing white list named entities (see Gazetteers), separated by comma
-regexList	files containing simple regular expressions for NER (see Gazetteer/REGEX.txt), separated by comma

Run the nerpipe.sh (.bat) script for named entity recognition, using the Morphotagger -conll-x format as input data: nerpipe.sh < sample.conll.

Output

The output format is based on the same CONLL format with one added NER column. Output fields consist of ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, NER.

Sample output:

1	Mārtiņš	mārtiņš	n	n_msn1	Galotnes_nr=14|Vārdšķira=Lietvārds|Vārds=Mārtiņš|Skaitlis=Vienskaitlis|Minēšana=Minēšana_pēc_galotnes|Deklinācija=1|Mija=0|Lielo_burtu_lietojums=Sākas_ar_lielo_burtu|Dzimte=Vīriešu|Locījums=Nominatīvs|Pamatforma=mārtiņš|Avots=minējums_pēc_galotnes	person
2	Bondars	bondars	n	n_msn1	Galotnes_nr=1|Vārdšķira=Lietvārds|Vārds=Bondars|Skaitlis=Vienskaitlis|Minēšana=Minēšana_pēc_galotnes|Deklinācija=1|Mija=0|Lielo_burtu_lietojums=Sākas_ar_lielo_burtu|Dzimte=Vīriešu|Locījums=Nominatīvs|Pamatforma=bondars|Avots=minējums_pēc_galotnes	person
3	ir	būt	v	vcnipii30an	Galotnes_nr=1158|Vārdšķira=Darbības_vārds|Konjugācija=Nekārtns|Avota_pamatforma=būt|Vārds=ir|Minēšana=Nav|Mija=0|Laiks=Tagadne|Pamatforma=būt|Leksēmas_nr=28668|Skaitlis=Nepiemīt|Atgriezeniskums=Nē|Izteiksme=Īstenības|Transitivitāte=Nepārejošs|Darbības_vārda_tips=Palīgverbs_'būt'|Persona=3|Noliegums=Nē|Kārta=Darāmā	O
4	dzimis	dzimt	v	vmnpdmsnasn	Galotnes_nr=897|Konjugācija=1|Vārdšķira=Darbības_vārds|Avota_pamatforma=dzimt|Vārds=dzimis|Minēšana=Nav|Mija=14|Locījums=Nominatīvs|Pamatforma=dzimt|Laiks=Pagātne|Leksēmas_nr=26489|Skaitlis=Vienskaitlis|Atgriezeniskums=Nē|Noteiktība=Nenoteiktā|Izteiksme=Divdabis|Transitivitāte=Nepārejošs|Darbības_vārda_tips=Patstāvīgs_darbības_vārds|Dzimte=Vīriešu|Noliegums=Nē|Lokāmība=Lokāms|Kārta=Darāmā	O
5	1971.	1971.	x	xo	Vārdšķira=Reziduālis|Vārds=1971.|Reziduāļa_tips=Kārtas_skaitlis_cipariem|Pamatforma=1971.	time
6	gada	gads	n	ncmsg1	Vārdšķira=Lietvārds|Galotnes_nr=2|Vārds=gada|Avota_pamatforma=gads|Minēšana=Nav|Deklinācija=1|Mija=0|Locījums=Ģenitīvs|Pamatforma=gads|Leksēmas_nr=1391|Skaitlis=Vienskaitlis|Lietvārda_tips=Sugas_vārds|Dzimte=Vīriešu|Avots=Valērija_leksikons	time
7	31.	31.	x	xo	Vārdšķira=Reziduālis|Vārds=31.|Reziduāļa_tips=Kārtas_skaitlis_cipariem|Pamatforma=31.	time
8	decembrī	decembris	n	ncmsl2	Vārdšķira=Lietvārds|Galotnes_nr=31|Vārds=decembrī|Avota_pamatforma=decembris|Minēšana=Nav|Deklinācija=2|Mija=0|Locījums=Lokatīvs|Pamatforma=decembris|Leksēmas_nr=8141|Skaitlis=Vienskaitlis|Lietvārda_tips=Sugas_vārds|Dzimte=Vīriešu|Avots=Valērija_leksikons	time
9	,	,	z	zc	Vārdšķira=Pieturzīme|Galotnes_nr=1158|Avota_pamatforma=,|Vārds=,|Leksēmas_nr=27198|Minēšana=Nav|Mija=0|Pieturzīmes_tips=Komats|Pamatforma=,	O
10	Rīgā	Rīga	n	npfsl4	Vārdšķira=Lietvārds|Galotnes_nr=79|Vārds=Rīgā|Avota_pamatforma=rīga|Minēšana=Nav|Deklinācija=4|Mija=0|Locījums=Lokatīvs|Pamatforma=Rīga|Leksēmas_nr=8855|Skaitlis=Vienskaitlis|Lietvārda_tips=Īpašvārds|Lielo_burtu_lietojums=Sākas_ar_lielo_burtu|Dzimte=Sieviešu|Avots=Valērija_leksikons	location
11	.	.	z	zs	Vārdšķira=Pieturzīme|Galotnes_nr=1158|Avota_pamatforma=.|Vārds=.|Leksēmas_nr=27194|Minēšana=Nav|Mija=0|Pieturzīmes_tips=Punkts|Pamatforma=.	O
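Since consecutive tokens with the same category form one named entity, the tagged output can be collapsed into entity spans. The sketch below is illustrative only: group_entities is a hypothetical name, the input is a list of (FORM, NER) pairs already extracted from the output rows, and "O" is assumed to mark tokens outside any entity, as in the sample output.

```python
from typing import List, Tuple


def group_entities(
    tokens: List[Tuple[str, str]], outside: str = "O"
) -> List[Tuple[str, str]]:
    """Merge runs of consecutive tokens sharing a category into (text, category) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_cat = outside
    for form, category in tokens:
        if category != current_cat:
            # The category changed: close the open entity, if any.
            if current and current_cat != outside:
                entities.append((" ".join(current), current_cat))
            current, current_cat = [], category
        if category != outside:
            current.append(form)
    # Flush an entity that runs to the end of the sentence.
    if current and current_cat != outside:
        entities.append((" ".join(current), current_cat))
    return entities


tagged = [
    ("Mārtiņš", "person"), ("Bondars", "person"), ("ir", "O"), ("dzimis", "O"),
    ("1971.", "time"), ("gada", "time"), ("31.", "time"), ("decembrī", "time"),
    (",", "O"), ("Rīgā", "location"), (".", "O"),
]
entities = group_entities(tagged)
```

Because the output carries no begin/end markers, two adjacent entities of the same category are merged into one span by this approach.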

Training

Run the included ner_train.sh (.bat). Training data should be in tab-separated columns, following the CONLL format with an extra column NER_FEATS (additional features separated by a vertical bar). NER_FEATS is currently optional and provides extra information about the annotation of the original text file. New training data can be added by simply appending it to ner_train.tab. All annotated training data is available in the folder NerTrainingData.

NER_FEATS keys	Description
ner_annotation	original ner annotation
ner_start	1 if token is the beginning of NE, underscore otherwise
ner_end	1 if token is the end of NE, underscore otherwise
ner_pos	Single POS tag, equals first character of POSTAG

Features

The features used can be specified in lv-ner-train.prop.

Feature	Description
trainFileList = ner_train.tab	Location of the training file, one or more file names separated by commas
serializeTo = lv-ner-model.ser.gz	Location to save the classifier
map = word=1,tag=9,lemma=2,answer=6,morphologyFeatureString=5,idx=0	Structure of the training file; this tells the classifier that the word is in column 1 and the correct answer is in column 6
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true	Word shape
useTypeSeqs2=true	Word shape
useTypeySequences=true	Word shape
wordShape=dan2useLC	Word shape
saveFeatureIndexToDisk=true
useTags=true	Use POS tags
useLemmas=true	Use lemma instead of word
gazette = ./Gazetteer/LV_PERS_GAZETTEER.txt, ./Gazetteer/PP_Onomastica_surnames.txt	Gazetteers
sloppyGazette=true	Word matches any word gazette entry
cleanGazette=true	Word sequence fully matches gazette entry
useMorphologyFeatures=true	Use morphology features
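The map property assigns a zero-based column index to each field name (idx=0 is the ID column, word=1 is FORM). The following sketch shows how such a value can be interpreted; parse_map is a hypothetical helper for illustration and not the classifier's own parser, and the sample row is made up to match the column order used in this document.

```python
from typing import Dict


def parse_map(value: str) -> Dict[str, int]:
    """Turn 'word=1,tag=9' into {'word': 1, 'tag': 9} (zero-based column indices)."""
    columns: Dict[str, int] = {}
    for pair in value.split(","):
        name, _, index = pair.strip().partition("=")
        columns[name] = int(index)
    return columns


mapping = parse_map("word=1,tag=9,lemma=2,answer=6,morphologyFeatureString=5,idx=0")
row = "1\tRīgā\tRīga\tn_fsl_\tnpfsl4\tLocījums=Lokatīvs\tlocation".split("\t")
# Keep only the fields whose column actually exists in this row.
picked = {name: row[i] for name, i in mapping.items() if i < len(row)}
```

With a seven-column row the answer field resolves to the last column, while tag=9 points beyond the row and is skipped in this sketch.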

Gazetteers

A gazetteer (entity dictionary) consists of a list of labels and categories (Latvija location). Since the correspondence between the labels and NE categories can be learned by tagging models, a gazetteer will be useful as long as it returns consistent labels, even if those returned are not the NE categories. Using a gazette does not guarantee that words in the gazette are always tagged as members of the intended class, and it does not guarantee that words outside the gazette will not be chosen. It simply provides another feature for the CRF to train against. If the CRF has higher weights for other features, the gazette features may be overwhelmed. The gazette files should be of the format

CLASS1 example
CLASS2 another example 
...

Currently LVTagger supports only normalised gazetteers: each token in a label should be replaced with its lemma, e.g., "Latvijas Banka" -> "Latvija Banka". To add a new gazetteer, create a new text file and add it to the gazette property in lv-ner-train.prop.
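A gazette file in the format shown above can be loaded into a lookup table as follows. This is a sketch under assumptions: load_gazette is a hypothetical name, the class is taken to be the first whitespace-delimited field of each line, and the class names LOC and ORG in the usage lines are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_gazette(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each (lemmatised) entry to the set of classes it is listed under."""
    entries: Dict[str, Set[str]] = defaultdict(set)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        parts = line.split(None, 1)  # class first, the rest of the line is the entry
        if len(parts) != 2:
            continue  # a class without an entry carries no information
        cls, entry = parts
        entries[entry].add(cls)
    return dict(entries)


# Hypothetical class names; entries are lemmatised as required above.
gazette = load_gazette(["LOC Latvija", "ORG Latvija Banka", "", "LOC Rīga"])
```

Keeping a set of classes per entry allows the same lemma sequence to appear in several gazetteers without one overwriting the other.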

Gazetteer	Description
LV_PER_GAZETTEER.txt	First names
PP_Onomastica_surnames.txt	Surnames
LV_LOC_GAZETTEER.txt	Locations
PP_valstis.txt	Countries
PP_org_elements.txt
PP_orgnames.txt
PP_orgnames_LETA.txt
AZ_profesijas_full_lem.txt	Lemmatized professions from http://www.vid.lv/lv/gramatveziem/profesijuklasifikators
AZ_profesijas.txt	Profession keywords from http://www.vid.lv/lv/gramatveziem/profesijuklasifikators
AZ_roles.txt	Role names retrieved from news archive corpus using simple heuristics
AZ_valsts_parvaldes_struktura_lem.txt	Lemmatized names of government structures from http://www.logincee.org/file/3409/library
Laura_partijas_lem.txt	Lemmatized party names obtained from various texts. Alternative forms and abbreviations are included in addition to full names.
DB_persons.txt	Full person names from entity database
DB_organizations.txt	Organizations from entity database
DB_locations.txt	Locations from entity database
DB_professions.txt	Professions from entity database

Regular expression NER

Implements a simple, rule-based NER over token sequences using Java regular expressions specified in a file ./Gazetteer/regex.txt. The user provides a file formatted as follows:

regex1    TYPE    overwritableType1,Type2...    priority
regex2    TYPE    overwritableType1,Type2...    priority
...

where the fields are tab-separated and the last two are optional. Spaces can only be used to separate regular expression tokens; within tokens, \s or a similar non-space representation must be used instead.

Notes: following Java regex conventions, some characters in the file need to be escaped. The implementation is not very efficient, since every regex is evaluated at every token position, so it can get quite slow when the NER rules contain many patterns.

(janvāris|februāris|marts|aprīlis|maijs|jūnijs|jūlijs|augusts|septembris|oktobris|novembris|decembris) (sākums|beigas)?	time

matches janvāra sākums, janvārī, etc.
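The rule file layout described above (tab-separated fields, the last two optional, spaces separating per-token patterns) can be parsed as sketched below. The Rule class and parse_rule function are hypothetical names, the default priority of 0.0 is an assumption, and the sketch only parses a rule line; it does not reproduce the Java matching logic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    token_patterns: List[str]  # one regular expression per token position
    entity_type: str
    overwritable: List[str] = field(default_factory=list)
    priority: float = 0.0  # assumed default when the field is absent


def parse_rule(line: str) -> Rule:
    """Parse 'regex<TAB>TYPE[<TAB>overwritable types[<TAB>priority]]'."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("a rule needs at least a regex and a type")
    overwritable = fields[2].split(",") if len(fields) > 2 and fields[2] else []
    priority = float(fields[3]) if len(fields) > 3 and fields[3] else 0.0
    # Spaces separate the per-token patterns inside the first field.
    return Rule(fields[0].split(" "), fields[1], overwritable, priority)


rule = parse_rule("(janvāris|februāris) (sākums|beigas)?\ttime")
```

Here rule.token_patterns holds two patterns, one per token position, and rule.entity_type is "time".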
