Zemberek-NLP

!!!! Important Notice: I have stopped working on the project indefinitely. Feel free to fork the project. !!!!

This is the new home of the Zemberek project.
Zemberek-NLP is a Natural Language Processing library. Some modules are developed specifically for the Turkish language.

Usage

Maven

Add this repository to your pom.xml file:

<repositories>
    <repository>
        <id>ahmetaa-repo</id>
        <name>ahmetaa Maven Repo on Github</name>
        <url>https://raw.github.com/ahmetaa/maven-repo/master</url>
    </repository>
</repositories>

Then add the dependencies (for example, morphology):

<dependencies>
    <dependency>
        <groupId>zemberek-nlp</groupId>
        <artifactId>morphology</artifactId>
        <version>0.9.0</version>
    </dependency>
</dependencies>

Jar distributions

The [Zemberek-NLP-Distributions](https://github.com/ahmetaa/zemberek-nlp-distributions) page provides each version, along with separate module jars and dependency jars.

Alternatively, a public [Google docs page](https://drive.google.com/#folders/0B9TrB39LQKZWSjNKdVcwWUxxUm8) hosts the distribution downloads.

Examples

[Turkish-nlp-examples](https://github.com/ahmetaa/turkish-nlp-examples) contains a Maven Java project with small usage examples.

Known Issues and Limitations

The project requires Java 7.
Currently, the word and sentence parsing modules generate the parse graph on every initialization, so each run takes a few seconds. We will fix this in the next version with fast serialization of the parse graph.
Morphological parsing does not work for some obvious and frequent words.
Morphological disambiguation is less accurate than expected (not very usable).
Morphological generation may not work for some obvious stem-suffix combinations.
Please see the issues section for further issues, and feel free to create new ones.
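
The initialization cost noted above comes from rebuilding the parse graph on every start. As a rough illustration of the general idea behind a serialization-based fix, here is a minimal sketch that builds a structure once, writes it to disk with standard Java serialization, and reloads it on later runs. It is not Zemberek code: `GraphCacheSketch`, `buildGraph` and `loadOrBuild` are hypothetical names, and the toy map only stands in for a real parse graph.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

// Hypothetical illustration only; none of these names are Zemberek classes.
class GraphCacheSketch {

    // Stand-in for the slow step: builds a toy "graph" from scratch.
    static HashMap<String, List<String>> buildGraph() {
        HashMap<String, List<String>> graph = new HashMap<>();
        graph.put("kitap", new ArrayList<>(Arrays.asList("Noun")));
        graph.put("gel", new ArrayList<>(Arrays.asList("Verb")));
        return graph;
    }

    // Loads the cached graph if the file exists; otherwise builds it and writes the cache.
    @SuppressWarnings("unchecked")
    static HashMap<String, List<String>> loadOrBuild(Path cacheFile) {
        try {
            if (Files.exists(cacheFile)) {
                try (ObjectInputStream in = new ObjectInputStream(
                        new BufferedInputStream(Files.newInputStream(cacheFile)))) {
                    return (HashMap<String, List<String>>) in.readObject();
                }
            }
            HashMap<String, List<String>> graph = buildGraph();
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(cacheFile)))) {
                out.writeObject(graph);
            }
            return graph;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not read or write the cache file", e);
        }
    }
}
```

The first call to `loadOrBuild` pays the build cost and writes the file; later calls only deserialize. The sketch uses only Java 7 features, in line with the requirement above.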

Modules

Core

Core classes such as special collection classes, hash functions, and helpers.

Morphology

Turkish morphological parsing, disambiguation and generation. [Morphology Documentation](https://github.com/ahmetaa/zemberek-nlp/tree/master/morphology)

Tokenization

Turkish tokenization and sentence boundary detection. So far, only rule-based algorithms are available.
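
To give a feel for what rule-based sentence boundary detection looks like, the sketch below uses the JDK's own `java.text.BreakIterator` rather than this module, whose API is not shown here. The class name `SentenceSketch` is hypothetical. Generic rules like these may split incorrectly around abbreviations, which is one reason a Turkish-specific rule set is useful.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration using the JDK, not the Zemberek tokenization module.
class SentenceSketch {

    // Splits text into trimmed sentences using the JDK's rule-based iterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.forLanguageTag("tr"));
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        // Each call to next() returns the end offset of the next sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("Bugun hava guzel. Yarin yagmur yagacak mi? Evet!")) {
            System.out.println(s);
        }
    }
}
```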

Hyphenation

Turkish syllabification and hyphenation.
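
As a simplified illustration of syllabification, the sketch below applies one common rule for Turkish: every syllable contains exactly one vowel, and a single consonant directly before a vowel starts that vowel's syllable. This is not the module's implementation. `SyllableSketch` is a hypothetical name, and the code assumes lowercase input and ignores circumflexed vowels and other edge cases.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a basic rule, not the Zemberek hyphenation module.
class SyllableSketch {

    // Turkish vowels a e ı i o ö u ü; non-ASCII letters are written as Unicode escapes.
    private static final String VOWELS = "aei\u0131o\u00f6u\u00fc";

    static boolean isVowel(char c) {
        return VOWELS.indexOf(c) >= 0;
    }

    // Splits a lowercase word into syllables.
    static List<String> split(String word) {
        List<String> syllables = new ArrayList<>();
        int start = 0;
        boolean seenVowel = false;
        for (int i = 0; i < word.length(); i++) {
            if (!isVowel(word.charAt(i))) {
                continue;
            }
            if (seenVowel) {
                // A new syllable starts at the consonant before this vowel,
                // or at the vowel itself when two vowels are adjacent.
                int boundary = isVowel(word.charAt(i - 1)) ? i : i - 1;
                syllables.add(word.substring(start, boundary));
                start = boundary;
            }
            seenVowel = true;
        }
        if (start < word.length()) {
            syllables.add(word.substring(start));
        }
        return syllables;
    }

    public static void main(String[] args) {
        System.out.println(split("merhaba"));
    }
}
```

The syllable boundaries found this way are the candidate hyphenation points for line breaking.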

Language modelling

[Language model compression](https://github.com/ahmetaa/zemberek-nlp/tree/master/lm)

Acknowledgements

Please refer to the contributors.txt file.

Portions of this code have been developed at Tübitak BİLGEM's Speech and Language Technologies Laboratory.
