
Semantic-Textual-Similarity

An implementation for SemEval-2016 Task 1.

Task Definition

Given two sentences, participating systems are asked to return a continuous valued similarity score on a scale from 0 to 5, with 0 indicating that the semantics of the sentences are completely independent and 5 signifying semantic equivalence.

Usage

```shell
cd {project_folder/}
python ensemble.py
```

Data

Training Data

Task participants are allowed to use all of the data sets released during prior years (2012-2015) as training data.

Testing Data

There are five sources of testing data: Headline, Plagiarism, Postediting, Question-to-Question, and Answer-to-Answer.

NLP Features

We used two NLP features to capture useful information.

N-gram overlap

We calculated a similarity score from the character n-grams extracted from the two sentences.
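A minimal sketch of this feature, assuming Jaccard overlap between the two n-gram sets (the README does not specify the exact overlap formula, so the measure and the default `n=3` here are illustrative assumptions):

```python
def char_ngrams(text, n):
    """Extract the set of character n-grams from a sentence."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_overlap(s1, s2, n=3):
    """Jaccard overlap between the character n-gram sets of two sentences,
    i.e. |intersection| / |union|, in [0, 1]."""
    a, b = char_ngrams(s1, n), char_ngrams(s2, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Identical sentences score 1.0; sentences sharing no character trigrams score 0.0.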

BOW cosine similarity

Each sentence is represented as a Bag-of-Words (BOW) in which each word is weighted by its IDF value. The cosine similarity between the two sentence vectors is then used as a feature. This yields one BOW feature.
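A sketch of the IDF-weighted BOW cosine, assuming a smoothed IDF of `log(N / df) + 1` computed from a tokenized training corpus (the exact IDF variant is not stated in the README):

```python
import math
from collections import Counter

def idf_weights(corpus):
    """IDF for each word over a list of tokenized sentences:
    log(N / document_frequency) + 1 (assumed smoothing)."""
    n = len(corpus)
    df = Counter(w for sent in corpus for w in set(sent))
    return {w: math.log(n / df[w]) + 1.0 for w in df}

def bow_cosine(tokens1, tokens2, idf):
    """Cosine similarity between the IDF-weighted term-count vectors
    of two tokenized sentences."""
    v1, v2 = Counter(tokens1), Counter(tokens2)
    dot = sum(v1[w] * v2[w] * idf.get(w, 1.0) ** 2 for w in set(v1) & set(v2))
    n1 = math.sqrt(sum((v1[w] * idf.get(w, 1.0)) ** 2 for w in v1))
    n2 = math.sqrt(sum((v2[w] * idf.get(w, 1.0)) ** 2 for w in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Rare words (high IDF) dominate the similarity, so two sentences agreeing on a rare content word score higher than two agreeing only on frequent function words.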

Manhattan LSTM

Two identical LSTM networks (a Siamese architecture) are each passed the word-vector representation of one sentence and output a hidden state encoding its semantic meaning. The similarity score is then derived from the Manhattan distance between the two hidden states.
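The scoring step can be sketched as follows, using the MaLSTM similarity from the Mueller & Thyagarajan reference, `exp(-||h1 - h2||_1)`; the hidden-state vectors here are stand-ins for the LSTM outputs, not the actual trained network:

```python
import numpy as np

def malstm_similarity(h1, h2):
    """MaLSTM similarity between two sentence encodings:
    exp(-L1 distance), which maps into (0, 1]."""
    return float(np.exp(-np.sum(np.abs(h1 - h2))))
```

Identical encodings give a similarity of exactly 1.0, and the score decays exponentially as the hidden states drift apart.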

CNN

We add another CNN model to enhance the ensemble model.

Ensemble model

We use Random Forests (RF), Gradient Boosting (GB), and XGBoost (XGB) on the traditional features, together with the LSTM model. We average the scores from the four models to achieve better performance.
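The combination step is an unweighted mean of the per-model predictions. A minimal sketch (the model names in the comment are from the text above; the input values are illustrative):

```python
def ensemble_score(model_scores):
    """Final similarity prediction: the unweighted mean of the
    per-model scores (e.g. RF, GB, XGB, and the LSTM)."""
    return sum(model_scores) / len(model_scores)
```

For example, per-model scores of 4.0, 4.4, 4.2, and 3.8 average to a final prediction of 4.1.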

Result

| NLP Features   | Headline | Plagiarism | Postediting | Ans - Ans | Ques - Ques | All    |
|----------------|----------|------------|-------------|-----------|-------------|--------|
| N-gram overlap | 0.7519   | 0.3726     | 0.4819      | 0.7942    | 0.5949      | 0.6327 |
| BOW similarity | 0.7228   | 0.3408     | 0.3666      | 0.7335    | 0.5669      | 0.5635 |
| Overlap + BOW  | 0.7409   | 0.3628     | 0.4339      | 0.7928    | 0.5855      | 0.6112 |
| LSTM           | 0.6112   | 0.7058     | 0.6172      | 0.4786    | 0.4308      | 0.5805 |
| CNN            | 0.6281   | 0.4503     | 0.6094      | 0.4429    | 0.5099      | 0.5092 |
| Ensemble       | 0.7244   | 0.7823     | 0.8119      | 0.5560    | 0.4626      | 0.6755 |

Possible Improvements

More features and further model tuning will be added later.

References

  • J. Tian, Z. Zhou, M. Lan, and Y. Wu. ECNU at SemEval-2017 Task 1: Leverage kernel-based traditional NLP features and neural networks to build a universal model for multilingual and cross-lingual semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 191–197, 2017.
  • J. Mueller and A. Thyagarajan. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30, No. 1, 2016.
