An open-source tool to generate syntactic n-grams from syntactically parsed data. The syntactic n-grams follow the same format as the Google Ngram Collection (http://googleresearch.blogspot.fi/2013/05/syntactic-ngrams-over-time.html).
At the moment the default input format is CoNLL-U, but CoNLL-09 is also supported. The extended n-grams are defined for the Stanford Dependencies (SD) and Universal Dependencies (UD) schemes.
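For reference, CoNLL-U input is plain text with one token per line, ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and a blank line between sentences. A minimal two-token sentence might look like this (annotation values purely illustrative):

```
1	Dogs	dog	NOUN	NNS	_	2	nsubj	_	_
2	bark	bark	VERB	VBP	_	0	root	_	_
```

In the Google collection's encoding, each n-gram token is written as word/pos-tag/dep-label/head-index, where the head index points at the position of the token's head within the n-gram (0 for the root). The arc above would thus come out along the lines of `Dogs/NNS/nsubj/2 bark/VBP/root/0`; the exact fields depend on the input annotations.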
```
python build_ngrams.py input.conllu --ngrams --args --out_dir output_directory
```
The input can be either a single file or a directory containing multiple files. Files must be in CoNLL-U format and end with .gz, .conllu or .conll. Use `--ngrams` to generate syntactic n-grams from nodes to quadarcs (and their extended variants), and `--args` to generate noun-arguments and verb-arguments; both options can be given at the same time. `--out_dir` is the directory where the resulting n-gram files are created. Alternatively, `--stdout` can be used to print the n-grams to standard output, but note that n-grams of all lengths are then mixed together, and each line starts with the dataset name followed by the actual n-gram (see the sketch below for splitting the stream back apart). The n-gram builder uses multiprocessing, and the number of builder processes can be set with `-p` (default: 4). Note, however, that the file reader and writers also run in their own processes.
```
./sort.sh output_directory
```
The n-grams produced by the software are not unique, so a separate step to sort and count the n-grams is needed. If you have a lot of data to sort and a powerful machine (more than 50 GB of memory and 20 cores), you can also use the optimized version of the sort script (`./fast_sort.sh out_dir`). For more information about the optimized sort, see https://gist.github.com/fginter/2d4662faeef79acdb772.
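For intuition, the following in-memory Python sketch shows what the sort-and-count step accomplishes: collapsing duplicate n-gram lines into (n-gram, count) pairs. The actual scripts use external (disk-based) sorting, which is what you want for data that does not fit in memory; the file names here are hypothetical:

```python
# In-memory equivalent of sort-and-count, assuming one n-gram per line.
from collections import Counter

counts = Counter()
with open("arcs.ngrams") as f:          # hypothetical per-dataset input file
    for line in f:
        counts[line.rstrip("\n")] += 1

with open("arcs.counted", "w") as out:  # hypothetical output file
    for ngram, count in sorted(counts.items()):
        out.write(f"{ngram}\t{count}\n")
```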