<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>TurboParser</title>
<link type="text/css" rel="stylesheet" href="http://www.cs.cmu.edu/~nasmith/nasstyle.css">
</head>
<body>
<h1>TurboParser (Dependency Parser with Linear Programming)</h1>
<div class="mybox">
<table>
<tr><td rowspan=3 valign=top><img src=turbo-parser.png width=180></td>
<td>This page provides a link to <b>TurboParser</b>, a free multilingual dependency parser developed by <a href="http://www.cs.cmu.edu/~afm">André Martins</a>.<br></td></tr>
<tr><td>It is based on joint work with
<a href="http://www.cs.cmu.edu/~nasmith">Noah Smith</a>,
<a href="http://www.lx.it.pt/~mtf">Mário Figueiredo</a>,
<a href="http://www.cs.cmu.edu/~epxing">Eric Xing</a>,
<a href="http://www.isr.ist.utl.pt/~aguiar">Pedro Aguiar</a>.
</td></tr>
<tr><td> </td>
</tr>
</table>
</div>
<h3>Background</h3>
<p>
Dependency parsing is based on a lightweight syntactic formalism that relies on lexical relationships between words.
<i>Nonprojective</i> dependency grammars may generate languages that are not context-free, offering a formalism
that is arguably more adequate for some natural languages.
Statistical parsers, learned from treebanks, have achieved the best performance on this task. While exact inference
is tractable only for local (arc-factored) models, it has been shown that including non-local features and performing
approximate inference can greatly increase performance.
</p>
<p>
This package contains a C++ implementation of a
dependency parser based on the papers [1,2,3,4,5] below.
The latest version of this package also contains C++ implementations of
a POS tagger, a semantic role labeler, an entity tagger,
a coreference resolver, and a constituent (phrase-structure) parser.
The relevant references are the papers [6,7,8,9] below.
</p>
<p>
This package allows:
<ul>
<li>learning a parser/tagger/semantic parser/entity tagger/coreference resolver from a treebank,</li>
<li>running a parser/tagger/semantic parser/entity tagger/coreference resolver on new data,</li>
<li>evaluating the results against a gold standard.</li>
</ul>
</p>
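<p>
For example, training a dependency parser on a treebank in CoNLL format and then running it
on new data might look roughly as follows. This is a minimal sketch: the training flags
(<i>--train</i>, <i>--file_train</i>) are the ones described in the README, the file names are
placeholders, and the README remains the authoritative reference for all options.
</p>
<div class="mybox" style="font-family:Courier">
./TurboParser --train \<br/>
--file_train=data/train.conll \<br/>
--file_model=models/my_parser.model \<br/>
--logtostderr<br/>
<br/>
./TurboParser --test \<br/>
--file_model=models/my_parser.model \<br/>
--file_test=data/test.conll \<br/>
--file_prediction=data/test.predicted.conll \<br/>
--logtostderr<br/>
</div>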
<br/>
<h3>Demo</h3>
<ul>
<li>
<a href="http://demo.ark.cs.cmu.edu/parse">ARK Syntactic & Semantic Parsing Demo</a>
</li>
</ul>
<h3>News</h3>
<p>
<b>We released TurboParser v2.3 on November 6th, 2015!</b>
This version introduces some new features:
<ul>
<li>
A named entity recognizer (TurboEntityRecognizer) based on the
Illinois Entity Tagger (ref. [7] below).
</li>
<li>
A coreference resolver (TurboCoreferenceResolver) based on the
Berkeley Coreference Resolution System (ref. [8] below).
</li>
<li>
A constituent parser based on dependency-to-constituent reduction,
implementing ref. [9] below.
</li>
<li>
A dependency labeler, TurboDependencyLabeler, that can optionally be applied
after the dependency parser.
</li>
<li>
Compatibility with MS Windows (using MSVC) and with C++0x.
</li>
</ul>
</p>
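<p>
The new components are run in the same way as the parser and the tagger. Below is a hedged
sketch for the entity recognizer; it assumes that TurboEntityRecognizer accepts the same
command-line flags as TurboParser and TurboTagger, and the model file name is a placeholder
(see the README for the actual options and input format).
</p>
<div class="mybox" style="font-family:Courier">
./TurboEntityRecognizer --test \<br/>
--file_model=models/english_entity_recognizer.model \<br/>
--file_test=<input-file> \<br/>
--file_prediction=<output-file> \<br/>
--logtostderr<br/>
</div>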
<p>
<b>We released TurboParser v2.2 on June 26th, 2014!</b>
This version introduces some new features:
<ul>
<li>
A Python wrapper for the tagger and parser (requires Cython 0.19).
</li>
<li>
A semantic role labeler (TurboSemanticParser) implementing ref. [6] below.
</li>
</ul>
</p>
<p>
<b>We released TurboParser v2.1 on May 23rd, 2013!</b>
This version introduces some new features:
<ul>
<li>
The full model now has third-order parts for grand-siblings and tri-siblings (see ref. [5] below).
</li>
<li>
Compatibility with MS Windows (using MSVC).
</li>
</ul>
</p>
<p>
<b>We released TurboParser v2.0 on September 20th, 2012!</b>
This version introduces a number of new features:
<ul>
<li>
The parser no longer depends on CPLEX (or any other non-free LP solver).
Instead, the decoder is now based on <a href="http://www.ark.cs.cmu.edu/AD3">AD3</a>, our free library for
approximate MAP inference.
</li>
<li>
The parser now outputs <i>dependency labels</i> along with the backbone structure.
</li>
<li>
As a bonus, we now provide a trainable part-of-speech tagger, called <i>TurboTagger</i>, which can be used in standalone mode, or to provide part-of-speech
tags as input for the parser. TurboTagger has state-of-the-art accuracy for English (97.3% on section 23 of the Penn Treebank) and is fast (~40,000 tokens per second).
</li>
<li>
The parser is much faster than in previous versions. You may choose among a basic arc-factored parser (~4,300 tokens per second), a
standard second-order model with consecutive sibling and grandparent features (the default; ~1,200 tokens per second), and
a full model with head bigram and arbitrary sibling features (~900 tokens per second).
</li>
</ul>
<b>Note:</b> The runtimes above are approximate, and based on experiments on a desktop machine with an Intel Core i7 CPU at 3.4 GHz and 8GB RAM.
</p>
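<p>
The choice among the basic, standard, and full models is made when the parser is trained.
Below is a hedged sketch, assuming the <i>--model_type</i> training flag described in the README
(the file names are placeholders, and the exact accepted values may differ):
</p>
<div class="mybox" style="font-family:Courier">
./TurboParser --train \<br/>
--file_train=data/train.conll \<br/>
--file_model=models/my_full_parser.model \<br/>
--model_type=full \<br/>
--logtostderr<br/>
</div>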
<p>
To run this software, you need a standard C++ compiler.
This software has the following external dependencies: <a href="http://www.ark.cs.cmu.edu/AD3">AD3</a>, a library for
approximate MAP inference; <a href="http://eigen.tuxfamily.org/">Eigen</a>, a template
library for linear algebra; <a href="http://code.google.com/p/google-glog/">google-glog</a>, a library for logging;
<a href="http://code.google.com/p/gflags/">gflags</a>, a library
for command-line flag processing. All these libraries are free software and are
provided as tarballs in this package.
</p>
<p>
This software has been tested on several Linux platforms. It has also been
successfully compiled on Mac OS X and MS Windows (using MSVC).
</p>
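<p>
On Linux, building amounts to compiling the bundled dependencies and then the parser itself.
The sketch below assumes the <i>install_deps.sh</i> helper script and the autotools setup
shipped with recent releases; the README gives the authoritative instructions.
</p>
<div class="mybox" style="font-family:Courier">
./install_deps.sh<br/>
./configure && make<br/>
</div>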
<br/>
<h3>Further Reading</h3>
<p>
The main technical ideas behind this software appear in the papers:
<br /><br />
<table>
<tr valign="top"><td>[1] </td>
<td>
André F. T. Martins, Noah A. Smith, and Eric P. Xing.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/acl2009.pdf" title="http://www.cs.cmu.edu/~afm/Home_files/acl2009.pdf">Concise Integer Linear Programming Formulations for Dependency Parsing</a>.<br>
Annual Meeting of the Association for Computational Linguistics (ACL'09), Singapore, August 2009.<br />
</td></tr>
<tr valign="top"><td>[2] </td>
<td>
André F. T. Martins, Noah A. Smith, and Eric P. Xing.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/icml2009.pdf">Polyhedral Outer Approximations with Application to Natural Language Parsing</a>.<br />
International Conference on Machine Learning (ICML'09), Montreal, Canada, June 2009.<br />
</td></tr>
<tr valign="top"><td>[3] </td>
<td>
André F. T. Martins, Noah A. Smith, Eric P. Xing, Mário A. T. Figueiredo, and Pedro M. Q. Aguiar.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/emnlp2010.pdf">TurboParsers: Dependency Parsing by Approximate Variational Inference</a>.<br />
Empirical Methods in Natural Language Processing (EMNLP'10), Boston, USA, October 2010.<br>
</td></tr>
<tr valign="top"><td>[4] </td>
<td>
André F. T. Martins, Noah A. Smith, Mário A. T. Figueiredo, and Pedro M. Q. Aguiar.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/emnlp2011b.pdf">Dual Decomposition With Many Overlapping Components</a>.<br />
Empirical Methods in Natural Language Processing (EMNLP'11), Edinburgh, UK, July 2011.<br>
</td></tr>
<tr valign="top"><td>[5] </td>
<td>
André F. T. Martins, Miguel B. Almeida, and Noah A. Smith.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/acl2013short.pdf">Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers</a>.<br />
Annual Meeting of the Association for Computational Linguistics (ACL'13), Sofia, Bulgaria, August 2013.<br>
</td></tr>
<tr valign="top"><td>[6] </td>
<td>
André F. T. Martins and Mariana S. C. Almeida.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/semeval2014_task8.pdf">Priberam: A Turbo Semantic Parser with Second Order Features</a>.<br />
International Workshop on Semantic Evaluation (SemEval), task 8: Broad-Coverage Semantic Dependency Parsing, Dublin, August 2014.<br>
</td></tr>
<tr valign="top"><td>[7] </td>
<td>
Lev Ratinov and Dan Roth.<br />
<a href="http://cogcomp.cs.illinois.edu/papers/RatinovRo09.pdf">Design Challenges and Misconceptions in Named Entity Recognition</a>.<br />
Conference on Computational Natural Language Learning (CoNLL'09), 2009.<br>
</td></tr>
<tr valign="top"><td>[8] </td>
<td>
Greg Durrett and Dan Klein.<br />
<a href="http://www.eecs.berkeley.edu/~gdurrett/papers/durrett-klein-emnlp2013.pdf">Easy Victories and Uphill Battles in Coreference Resolution</a>.<br />
Empirical Methods in Natural Language Processing (EMNLP'13), 2013.<br>
</td></tr>
<tr valign="top"><td>[9] </td>
<td>
Daniel Fernández-González and André F. T. Martins.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/acl2015_reduction.pdf">Parsing As Reduction</a>.<br />
Annual Meeting of the Association for Computational Linguistics (ACL'15), Beijing, China, August 2015.<br>
</td></tr>
</table>
</p>
<br/>
<h3>Download</h3>
<p>
The latest version of TurboParser is <a href="http://www.cs.cmu.edu/~afm/TurboParser/TurboParser-2.3.0.tar.gz">TurboParser v2.3.0 [~5.4MB,.tar.gz format]</a>.
See the <a href="http://www.cs.cmu.edu/~afm/TurboParser/README">README</a> file for instructions for compilation, running, and file formatting.
It does <i>not</i> include the data sets used in the papers;
for information about how to get these data sets, please go to <a href="http://nextens.uvt.nl/~conll">http://nextens.uvt.nl/~conll</a>.
Bear in mind that some data sets must be separately licensed through the <a href="http://www.ldc.upenn.edu/">LDC</a>.
</p>
<p>
In addition, we separately provide the following pre-trained models (note that these are very large files):
<ul>
<li>An English tagger trained on sections 02-21 of the Penn Treebank.
Click <a href="sample_models/english_proj_tagger.tar.gz">here</a> to download this model [~3.3MB, .tar.gz format].
Then, uncompress this model and save it in a local folder (e.g. as models/english_proj_tagger.model).
To tag a new file <input-file>, type:<br/>
<br/>
<div class="mybox" style="font-family:Courier">
./TurboTagger --test \<br/>
--file_model=models/english_proj_tagger.model \<br/>
--file_test=<input-file> \<br/>
--file_prediction=<output-file> \<br/>
--logtostderr<br/>
</div>
<br/>
Check the <a href="http://www.cs.cmu.edu/~afm/TurboParser/README">README</a> for file formatting instructions and additional options.
<li>First, second, and third-order English parsers trained on sections 02-21 of the Penn Treebank,
with dependencies extracted using the head rules of Yamada and Matsumoto, through <a href="http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html">Penn2Malt</a>.
Click <a href="sample_models/english_proj_parser.tar.gz">here</a> to download these models [~1.8GB, .tar.gz format].
Uncompress this file and save the models in a local folder (e.g. as models/english_proj_parser_model-{basic,standard,full}.model).
To parse a new file <input-file> in CoNLL format, type:<br/>
<br/>
<div class="mybox" style="font-family:Courier">
./TurboParser --test \<br/>
--file_model=models/english_proj_parser_model-standard.model \<br/>
--file_test=<input-file> \<br/>
--file_prediction=<output-file> \<br/>
--logtostderr<br/>
</div>
<br/>
Check the <a href="http://www.cs.cmu.edu/~afm/TurboParser/README">README</a> for file formatting instructions and additional options.
<li>First, second, and third-order Arabic parsers trained on the Arabic dataset provided in the CoNLL-X shared task.
Click <a href="sample_models/arabic_parser.tar.gz">here</a> to download these models [~520 MB, .tar.gz format].
Uncompress this file and save the models in a local folder (e.g. as models/arabic_parser_model-{basic,standard,full}.model).
To parse a new file <input-file> in CoNLL format, type:<br/>
<br/>
<div class="mybox" style="font-family:Courier">
./TurboParser --test \<br/>
--file_model=models/arabic_parser_model-standard.model \<br/>
--file_test=<input-file> \<br/>
--file_prediction=<output-file> \<br/>
--logtostderr<br/>
</div>
<br/>
Check the <a href="http://www.cs.cmu.edu/~afm/TurboParser/README">README</a> for file formatting instructions and additional options.
<li>Taggers and parsers for <a href="nasmith_models/kin-turbo-v1.0.tgz">Kinyarwanda</a> and
<a href="nasmith_models/mlg-turbo-v1.0.tgz">Malagasy</a>.
There is
a <a href="nasmith_models/README">README</a>
specifically for these models. They require TurboParser v2.0.2.
<li>Farsi parser trained on the <a href="http://dadegan.ir/en">Dadegan Persian treebank</a>. Click <a href="sample_models/farsi_parser.tar.gz">here</a> to download the model [~530 MB, .tar.gz format]. This model requires TurboParser v2.0.2.
Associated Farsi NLP tools can be found <a href="https://github.com/wfeely/farsiNLPTools">here</a>.
<li>Parsers that generate Stanford-style dependencies can be found <a href="http://www.ark.cs.cmu.edu/TBSD/">here</a>.</li>
<li>A parser trained on the English Web Treebank for Stanford basic dependencies can be found <a href="../LexSem/#syntax">here</a>.</li>
</ul>
<p>
Finally, this package provides a script, "parse.sh", that tags and parses
free English text (one sentence per line) with the models above. Just type:
<br/>
<div class="mybox" style="font-family:Courier">
./scripts/parse.sh <filename>
</div>
<br/>
where <i><filename></i> is a text file with one sentence per line. If no filename is
specified, it parses <i>stdin</i>, so e.g.
<br/>
<div class="mybox" style="font-family:Courier">
echo "I solved the problem with statistics." | ./scripts/parse.sh
</div>
<br/>
yields
<br/>
<div class="mybox" style="font-family:Courier">
1 I _ PRP PRP _ 2 SUB<br/>
2 solved _ VBD VBD _ 0 ROOT<br/>
3 the _ DT DT _ 4 NMOD<br/>
4 problem _ NN NN _ 2 OBJ<br/>
5 with _ IN IN _ 2 VMOD<br/>
6 statistics _ NNS NNS _ 5 PMOD<br/>
7 . _ . . _ 2 P<br/>
</div>
</p>
</p>
<p> Older versions:
<ul>
<li>
<a href="http://www.cs.cmu.edu/~afm/TurboParser/TurboParser-2.2.0.tar.gz">TurboParser v2.2.0 [~2.8MB,.tar.gz format]</a>.
<li>
<a href="http://www.cs.cmu.edu/~afm/TurboParser/TurboParser-2.1.0.tar.gz">TurboParser v2.1.0 [~2.5MB,.tar.gz format]</a>.
<li>
<a href="http://www.cs.cmu.edu/~afm/TurboParser/TurboParser-2.0.2.tar.gz">TurboParser v2.0.2 [~2.5MB,.tar.gz format]</a>.
<li>
<a href="http://www.cs.cmu.edu/~afm/TurboParser/TurboParser-2.0.1.tar.gz">TurboParser v2.0.1 [~2.5MB,.tar.gz format]</a>.
<li>
<a href="http://www.cs.cmu.edu/~afm/TurboParser/TurboParser-2.0.tar.gz">TurboParser v2.0 [~3.2MB,.tar.gz format]</a>.
<li>
<a href="http://www.cs.cmu.edu/~afm/TurboParser/turboparser-0.1.tar.gz">TurboParser v0.1 [~2.5Mb,.tar.gz format]</a>.
Along with this distribution, we released
an <a href="TurboParser-0.1/sample_models/english_proj.tar.gz">English parser</a> trained on the sections 02-21 of the Penn Treebank,
with dependencies extracted using the head-rules of Yamada and Matsumoto [~1.2 GB, .tar.gz format];
<a href="TurboParser-0.1/sample_models/english.tar.gz">another English parser</a> trained in the dataset provided in the CoNLL 2008 shared task [~1.4 GB, .tar.gz format];
an <a href="TurboParser-0.1/sample_models/arabic.tar.gz">Arabic parser</a> trained in the CoNLL-X dataset [~225 MB, .tar.gz format];
a <a href="TurboParser-0.1/sample_models/run_pretrained.sh">script</a> to apply these models to parse new data.
</ul>
<br/>
<h3>Contributing to TurboParser</h3>
<p>For questions, bug fixes and comments, please e-mail <i>afm [at] cs.cmu.edu</i>.</p>
<p>To contribute to TurboParser, you can fork the GitHub repository: <a href="http://github.com/andre-martins/TurboParser">http://github.com/andre-martins/TurboParser</a>.</p>
<p>To receive announcements about updates to TurboParser, <a href="https://mailman.srv.cs.cmu.edu/mailman/listinfo/ark-tools">join the ARK-tools mailing list</a>.</p>
<br/>
<h3>Acknowledgments</h3>
<p>A. M. was supported by an FCT/ICTI grant through
the CMU-Portugal Program, and by Priberam. This
work was partially supported by the FET programme
(EU FP7), under the SIMBAD project (contract 213250),
by National Science Foundation grant IIS-1054319,
and by the QNRF grant NPRP 08-485-1-083.</p>
</body>
</html>