Merge pull request #48 from kthyng/patch-2
Update paper.md
oxinabox authored Feb 7, 2020
2 parents 3c1013c + c72345a commit 8c76dc5
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions paper/paper.md
@@ -36,7 +36,7 @@ bibliography: paper.bib

# Summary

- WordTokenizers.jl is a tool to help users of the Julia programming language [@Julia], work with natural language.
+ WordTokenizers.jl is a tool to help users of the Julia programming language [@Julia] work with natural language.
In natural language processing (NLP), tokenization refers to breaking a text up into parts -- the tokens.
Generally, tokenization refers to breaking a sentence up into words and other tokens such as punctuation.
Complementary to word tokenization is _sentence segmentation_ or _sentence splitting_ (occasionally also called _sentence tokenization_),
@@ -49,7 +49,7 @@ Using this API several standard tokenizers and sentence segmenters have been imp
WordTokenizers.jl does not implement significant novel tokenizers or sentence segmenters.
Rather, it contains ports/implementations of the well-established and commonly used algorithms.
At present, it contains rule-based methods primarily designed for English.
- Several of the implementations are sourced from the Python NLTK project [@NLTK1], [@NLTK2];
+ Several of the implementations are sourced from the Python NLTK project [@NLTK1; @NLTK2],
although these were in turn sourced from older pre-existing methods.

WordTokenizers.jl uses a `TokenBuffer` API and its various lexers for fast word tokenization.
@@ -58,11 +58,11 @@ A desired set of TokenBuffer lexers are used to read characters from the stream
The package provides the following tokenizers made using this API.

- A Tweet Tokenizer [@tweettok] for casual text.
- - A general purpose NLTK Tokenizer [@NLTK1], [@NLTK2].
- - An improved version of the multilingual Tok-tok tokenizer [@toktok], [@toktokpub].
+ - A general purpose NLTK Tokenizer [@NLTK1; @NLTK2].
+ - An improved version of the multilingual Tok-tok tokenizer [@toktok; @toktokpub].

With the various lexers written for the `TokenBuffer` API, users can also easily create their own high-speed custom tokenizers.
- The package also provides a simple reversible tokenizer [@reversibletok1], [@reversibletok2],
+ The package also provides a simple reversible tokenizer [@reversibletok1; @reversibletok2]
that works by leaving certain merge symbols in the output as a means to reconstruct the tokens into the original string.
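As a minimal sketch of the `TokenBuffer` approach described in the diff above -- assuming the lexers `spaces`, `number`, and `character` exported by WordTokenizers.jl, and following the pattern shown in the package's documentation -- a custom tokenizer might look like:

```julia
using WordTokenizers

# A custom word tokenizer built from TokenBuffer lexers:
# skip whitespace, try to match a number, otherwise consume
# the next character into the current token.
function my_tokenizer(input)
    ts = TokenBuffer(input)
    while !isdone(ts)
        spaces(ts) && continue  # whitespace separates tokens
        number(ts) ||           # match numeric tokens as a unit
            character(ts)       # fall back to consuming one character
    end
    return ts.tokens
end

my_tokenizer("count to 10")
```

Because each lexer is a plain function over the character stream, adding a new token rule (URLs, phone numbers, emoticons) is just another `||` branch in the loop, which is how the package's own tokenizers are composed.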

WordTokenizers.jl exposes a configurable default interface,
