Turkish-NLP-Preprocessing-module

Preprocessing tool for Turkish NLP that contains a tokenizer, a sentence splitter, a normalizer, a stemmer, and a stop-word remover.

Developed by Melikşah Türker and Büşra Oğuzoğlu for the CMPE561 NLP class project.

The tokenizer and sentence splitter modules each have two versions:

Rule-based: uses RegEx rules.
Machine learning based: uses handcrafted features.
- The machine learning part contains a Naive Bayes classifier and a Logistic Regression classifier. We implemented the Naive Bayes algorithm from scratch and used the scikit-learn implementation for Logistic Regression.
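
As an illustration of the rule-based approach, here is a minimal sketch of a regex sentence splitter that skips boundaries after non-breaking prefixes. It is not the code in RuleBasedSentenceSplitter.py: the names `NON_BREAKING_PREFIXES`, `BOUNDARY`, and `split_sentences` are hypothetical, and the prefix set is a tiny hand-picked sample rather than the repository's lexicon.

```python
import re

# Hypothetical sample of non-breaking prefixes (abbreviations that end with a
# period but do not end a sentence). A real lexicon would be much larger.
NON_BREAKING_PREFIXES = {"dr", "prof", "doç", "av", "vb", "bkz"}

# Candidate boundary: sentence-final punctuation, whitespace, then an
# uppercase letter (Turkish uppercase letters included).
BOUNDARY = re.compile(r"([.!?]+)\s+(?=[A-ZÇĞİÖŞÜ])")


def split_sentences(text):
    """Split text into sentences using the regex rule above."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Look at the word immediately before the punctuation.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        # A single period after a known abbreviation is not a boundary.
        if match.group(1) == "." and last_word in NON_BREAKING_PREFIXES:
            continue
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Ahmet geldi. Sonra gitti!"))
```

The abbreviation check is what keeps "Dr. Ahmet" in one sentence; without it, every period followed by a capital letter would be treated as a boundary.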

The stop-word remover has two versions:

Static: requires a predefined list of stop-words.
Dynamic: detects stop-words automatically by choosing a threshold on the word frequency distribution using its second derivative (elbow rule). Works for any language.
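
One way to read the elbow rule is sketched below. This is an assumption about the idea, not the code in DynamicStopWordEliminator.py: word counts are sorted in descending order, a discrete second derivative is taken, and the point of sharpest bend is used as the threshold. The function name `detect_stopwords` is hypothetical.

```python
from collections import Counter


def detect_stopwords(tokens):
    """Return words whose frequency lies above the elbow of the
    descending frequency curve."""
    ranked = Counter(tokens).most_common()  # (word, count), highest first
    freqs = [count for _, count in ranked]
    if len(freqs) < 3:
        return set()  # too few points to estimate curvature

    # Discrete second derivative at each interior point of the curve.
    second = [
        freqs[i - 1] - 2 * freqs[i] + freqs[i + 1]
        for i in range(1, len(freqs) - 1)
    ]
    # The elbow is where the curve bends upward most sharply.
    elbow = max(range(len(second)), key=second.__getitem__) + 1
    threshold = freqs[elbow]

    # Everything strictly more frequent than the elbow is a stop-word.
    return {word for word, count in ranked if count > threshold}


tokens = ["ve"] * 10 + ["bir"] * 9 + ["kedi"] * 2 + ["köpek", "ev"]
print(detect_stopwords(tokens))
```

Because the sketch only looks at counts, nothing in it is specific to Turkish, which matches the language-independence claim above.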

The normalizer works using:

Normalization lexicon
Levenshtein distance, calculated for:
- the whole word
- consonant letters only, with both distances used together.
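
The two distances can be sketched as follows. How the repository combines them is not stated here, so summing them is an assumption made for illustration, and the names `levenshtein`, `consonants`, and `normalize` are hypothetical rather than taken from Normalizer.py.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Turkish vowels, lowercase and uppercase.
VOWELS = set("aeıioöuüAEIİOÖUÜ")


def consonants(word):
    """Drop vowels so that vowel-less spellings can be compared."""
    return "".join(ch for ch in word if ch not in VOWELS)


def normalize(word, vocabulary):
    """Pick the vocabulary entry with the smallest combined distance.

    Assumption: the whole-word and consonant-only distances are summed.
    """
    return min(
        vocabulary,
        key=lambda cand: levenshtein(word, cand)
        + levenshtein(consonants(word), consonants(cand)),
    )


print(normalize("glyrm", ["geliyorum", "gidiyorum", "güzel"]))
```

The consonant-only distance helps with informal spellings that drop vowels: "glyrm" and "geliyorum" have identical consonant skeletons, so the intended word wins even though four characters are missing.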

The DATA folder contains lexicons for multi-word expressions, normalization, prefixes, abbreviations (non-breaking prefixes), stop-words, and more.
