NlpMovieDb

Synopsis

The project is a semantic search engine over a movie database, built with Natural Language Processing and information retrieval techniques. The user submits a search query and the engine returns a list of movies with scores. It differs from conventional search engines in two ways:

It is based on semantic search rather than the keyword-based approach used by the search engines of popular sites such as IMDB and Rotten Tomatoes.
The complete plot is parsed into a graph model that represents the series of events, the characters, and their properties.

Existing Problem & Motivation

Based on web mining and analysis of websites such as Movies.stackexchange, IMDB, Quora, and others, we observed that more than 90% of the questions posted online go unanswered. The scale of this problem motivated the project, whose goal is to answer such questions successfully.

Web mined Data

Data analytics from the mined data:

![alt text](https://calitripmagblog.files.wordpress.com/2016/02/picture1.png?w=320&h=280&crop=1)

Tools and Technologies used

Java: a general-purpose, object-oriented programming language.
Python: a widely used, general-purpose, high-level programming language.
Knowledge Parser (K-Parser): a semantic parser that translates any English sentence into a directed acyclic semantic graph. Used for: event extraction, event-event relation extraction.
Stanford CoreNLP: a set of natural language analysis tools. Used for: named entity recognition, co-reference resolution, part-of-speech tagging, and object dependency extraction for semantic representation.
NLTK: a leading platform for building Python programs that work with NLP. Used for: NER detection, and name and coreference unification for text.
WS4J (WordNet Similarity for Java): a pure Java API for semantic relatedness. Used for: the WordNet-based word-word similarity algorithms PATH, LIN, and LESK.
TF-IDF: term frequency–inverse document frequency, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Used for: creating an event importance vector for each plot, which gives weights to the event similarity.
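
To make the event importance vector concrete, here is a minimal sketch of TF-IDF computed over the lemmatized event verbs of each plot. It is an illustration under stated assumptions, not the project's actual code: the function name `tf_idf_vectors`, the plain `tf * log(N / df)` weighting, and the sample plots are all hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(plots):
    """Map each plot to a {verb: weight} dict.

    plots: dict of movie title -> list of lemmatized event verbs.
    Weighting (an assumption): tf = count / plot length,
    idf = log(number of plots / number of plots containing the verb).
    """
    n_docs = len(plots)
    # Document frequency: in how many plots each verb appears at least once.
    doc_freq = Counter()
    for verbs in plots.values():
        doc_freq.update(set(verbs))

    vectors = {}
    for title, verbs in plots.items():
        counts = Counter(verbs)
        total = len(verbs)
        vectors[title] = {
            verb: (count / total) * math.log(n_docs / doc_freq[verb])
            for verb, count in counts.items()
        }
    return vectors


# Made-up sample data, for illustration only.
plots = {
    "movie_a": ["escape", "chase", "escape", "fight"],
    "movie_b": ["marry", "fight", "travel"],
    "movie_c": ["travel", "escape", "discover"],
}
weights = tf_idf_vectors(plots)
```

With this weighting, a verb that occurs in every plot gets weight 0, while a verb that is frequent in one plot and rare elsewhere gets a high weight, which is the property that makes it useful for weighting event similarity.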

Process

After data pre-processing, our approach relies on four similarity mechanisms: Event Similarity, Named Entity Similarity, Term Similarity, and Character Similarity. The architecture of the system is described in detail in the sections that follow.
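
As a rough picture of how four per-movie similarity scores could become a scored movie list, the sketch below merges them with a weighted sum and sorts the result. This section does not state how the scores are actually combined, so the weighted sum, the weight values, and the names `SimilarityScores`, `combined_score`, and `rank` are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SimilarityScores:
    """The four similarity values for one movie against one query."""
    event: float
    named_entity: float
    term: float
    character: float


# Hypothetical weights, chosen only for this example; they sum to 1.
WEIGHTS = {"event": 0.4, "named_entity": 0.2, "term": 0.2, "character": 0.2}


def combined_score(scores, weights=WEIGHTS):
    """Weighted sum of the four similarity values (an assumed scheme)."""
    return (
        weights["event"] * scores.event
        + weights["named_entity"] * scores.named_entity
        + weights["term"] * scores.term
        + weights["character"] * scores.character
    )


def rank(candidates, weights=WEIGHTS):
    """Return (title, score) pairs sorted from best to worst match."""
    scored = [(title, combined_score(s, weights)) for title, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up scores for two candidate movies.
candidates = {
    "movie_a": SimilarityScores(event=0.9, named_entity=0.5, term=0.4, character=0.1),
    "movie_b": SimilarityScores(event=0.2, named_entity=0.8, term=0.6, character=0.7),
}
ranked = rank(candidates)
```

Keeping the weights in a single dictionary makes it easy to experiment with how much each mechanism contributes to the final ordering.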

Extraction Engine

Plot and query sentences are fed to K-Parser to generate their semantic parse trees.
Verb nodes are extracted.
Verbs are lemmatized and stored.
Named entities are extracted using Stanford CoreNLP.
Entity types: Name, Location, Organization.
Entities are extracted using NLTK.
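
The extraction steps above can be sketched in a few lines. This is a toy illustration, not the project's pipeline: the node format (dicts with `word` and `pos` keys), the small `LEMMAS` lookup table standing in for a real lemmatizer, and the function names are all hypothetical, and the real K-Parser and Stanford CoreNLP outputs are richer than what is shown here.

```python
# Stand-in lemma table (an assumption); a real pipeline would call a
# proper lemmatizer instead of a hand-written dictionary.
LEMMAS = {"robbed": "rob", "escapes": "escape", "chasing": "chase"}


def lemmatize(verb):
    """Return the lemma if known, otherwise the lowercased word."""
    word = verb.lower()
    return LEMMAS.get(word, word)


def extract_event_verbs(graph_nodes):
    """Collect lemmatized verbs from nodes whose POS tag starts with VB."""
    return [
        lemmatize(node["word"])
        for node in graph_nodes
        if node["pos"].startswith("VB")
    ]


def group_entities(tagged_entities):
    """Group (text, label) pairs by entity type, dropping duplicates."""
    groups = {"Name": [], "Location": [], "Organization": []}
    for text, label in tagged_entities:
        if label in groups and text not in groups[label]:
            groups[label].append(text)
    return groups


# Made-up parser and NER output for one sentence.
nodes = [
    {"word": "thief", "pos": "NN"},
    {"word": "robbed", "pos": "VBD"},
    {"word": "bank", "pos": "NN"},
    {"word": "escapes", "pos": "VBZ"},
]
entities = [("Danny", "Name"), ("Las Vegas", "Location"), ("Danny", "Name")]

event_verbs = extract_event_verbs(nodes)
entity_groups = group_entities(entities)
```

Storing verbs as lemmas means that "robbed" in a plot and "rob" in a query map to the same event, which is what later event comparison needs.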
