An implementation for SemEval-2016 Task 1.
Given two sentences, participating systems are asked to return a continuous valued similarity score on a scale from 0 to 5, with 0 indicating that the semantics of the sentences are completely independent and 5 signifying semantic equivalence.
```
cd {project_folder/}
python ensemble.py
```
Task participants are allowed to use all of the datasets released in prior years (2012-2015) as training data.
There are five sources of testing data: Headline, Plagiarism, Postediting, Question-Question, and Answer-Answer.
We use two NLP features to capture useful information.
We compute a similarity score from the character n-grams extracted from the two sentences.
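A minimal sketch of this feature, assuming set-based (Jaccard) overlap of character trigrams; the exact n and overlap formula are not specified in this README, so both are assumptions:

```python
def char_ngrams(text, n=3):
    """Extract the set of character n-grams from a sentence (n=3 is an assumption)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_overlap(s1, s2, n=3):
    """Jaccard overlap between the character n-gram sets of two sentences."""
    a, b = char_ngrams(s1, n), char_ngrams(s2, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Identical sentences score 1.0, and sentences sharing no n-grams score 0.0.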
Each sentence is represented as a bag of words (BOW), with each word weighted by its IDF value. The cosine similarity between the two sentence vectors is then used as a feature, giving one BOW feature.
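The BOW feature can be sketched as follows; the corpus used to estimate IDF and the exact IDF variant (plain `log(N/df)` here) are assumptions:

```python
import math
from collections import Counter

def idf_weights(corpus):
    """Estimate IDF from a list of tokenized sentences (the corpus is an assumption)."""
    n = len(corpus)
    df = Counter()
    for sent in corpus:
        df.update(set(sent))
    return {w: math.log(n / df[w]) for w in df}

def bow_cosine(s1, s2, idf):
    """Cosine similarity between IDF-weighted bag-of-words vectors."""
    v1, v2 = Counter(s1), Counter(s2)
    dot = sum(v1[w] * v2[w] * idf.get(w, 0.0) ** 2 for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum((c * idf.get(w, 0.0)) ** 2 for w, c in v1.items()))
    norm2 = math.sqrt(sum((c * idf.get(w, 0.0)) ** 2 for w, c in v2.items()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```

Words that appear in every document receive an IDF of zero and therefore contribute nothing to the similarity.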
There are two identical LSTM networks (a Siamese architecture). Each LSTM is fed the word-vector representations of one sentence and outputs a hidden state encoding its semantic meaning; the similarity between the two hidden states is measured with the Manhattan distance.
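The similarity head of this Siamese model can be sketched as the MaLSTM scoring function `exp(-||h1 - h2||_1)` from Mueller and Thyagarajan (2016), which maps the Manhattan distance between the two final hidden states into (0, 1]; the hidden states themselves would come from the shared LSTM encoder, which is omitted here:

```python
import math

def manhattan_similarity(h1, h2):
    """MaLSTM similarity: exp of the negative L1 distance between hidden states."""
    l1 = sum(abs(a - b) for a, b in zip(h1, h2))
    return math.exp(-l1)
```

Identical hidden states give a similarity of exactly 1.0, and the score decays toward 0 as the states diverge.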
We add a CNN model to strengthen the ensemble.
We train Random Forests (RF), Gradient Boosting (GB), and XGBoost (XGB) on the traditional features, alongside the LSTM model. We average the scores from the four models to achieve better performance.
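The combination step can be sketched as an unweighted average of the per-model predictions; clipping to the task's [0, 5] scale is an assumption, since the README does not state how out-of-range averages are handled:

```python
def ensemble_score(scores):
    """Average the predictions of the four models (RF, GB, XGB, LSTM),
    clipped to the task's 0-5 similarity scale (clipping is an assumption)."""
    avg = sum(scores) / len(scores)
    return min(5.0, max(0.0, avg))
```

For example, model predictions of 4.0, 4.0, 3.0, and 5.0 yield an ensemble score of 4.0.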
Features / Models | Headline | Plagiarism | Postediting | Ans - Ans | Ques - Ques | All |
---|---|---|---|---|---|---|
Ngram Overlap | 0.7519 | 0.3726 | 0.4819 | 0.7942 | 0.5949 | 0.6327 |
BOW similarity | 0.7228 | 0.3408 | 0.3666 | 0.7335 | 0.5669 | 0.5635 |
Overlap + BOW | 0.7409 | 0.3628 | 0.4339 | 0.7928 | 0.5855 | 0.6112 |
LSTM | 0.6112 | 0.7058 | 0.6172 | 0.4786 | 0.4308 | 0.5805 |
CNN | 0.6281 | 0.4503 | 0.6094 | 0.4429 | 0.5099 | 0.5092 |
Ensemble | 0.7244 | 0.7823 | 0.8119 | 0.5560 | 0.4626 | 0.6755 |
More features and further model tuning will be added later.
- J. Tian, Z. Zhou, M. Lan, and Y. Wu. ECNU at SemEval-2017 Task 1: Leverage kernel-based traditional NLP features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 191-197, 2017.
- J. Mueller and A. Thyagarajan. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.