NlpMovieDb

Synopsis

This project is a semantic search engine over a movie database, built with natural language processing and information retrieval techniques. The user submits a search query, and the engine returns a list of movies with scores. It differs from conventional search engines in two ways:

  1. It is based on semantic search rather than the keyword-based approach used by the search engines of popular sites such as IMDb and Rotten Tomatoes.
  2. The complete plot is parsed into a graph model that represents the series of events, the characters, and their properties.

Existing Problem & Motivation

Based on web mining and analysis of websites such as Movies.stackexchange, IMDb, Quora and others, we observed that more than 90% of the movie questions posted online go unanswered. The size of this problem motivated the project, which aims to answer these questions successfully. (Web mined Data)

Data analytics from the mined data:

![alt text](https://calitripmagblog.files.wordpress.com/2016/02/picture1.png?w=320&h=280&crop=1)

Tools and Technologies used

  1. Java: a general-purpose, object-oriented programming language.
  2. Python: a widely used, general-purpose, high-level programming language.
  3. Knowledge Parser (K-Parser): a semantic parser that translates any English sentence into a directed acyclic semantic graph. Used for: event extraction and event-event relation extraction.
  4. Stanford CoreNLP: provides a set of natural language analysis tools. Used for: named entity recognition, co-reference resolution, part-of-speech tagging, and extraction of object dependencies for semantic representation.
  5. NLTK: a leading platform for building Python programs that work with natural language. Used for: NER detection, and name and coreference unification for text.
  6. WS4J (WordNet Similarity for Java): provides a pure Java API for semantic relatedness. Used for: the WordNet-based word-word similarity algorithms PATH, LIN and LESK.
  7. TF-IDF: term frequency–inverse document frequency is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Used for: creating an event importance vector for each plot, which gives weights to the event similarity.
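
To make the TF-IDF item more concrete, here is a minimal sketch of how an event importance vector could be computed per plot. It assumes each plot has already been reduced to a list of lemmatized event verbs; the function name `tfidf_vectors`, the sample plots, and the plain `tf * log(N / df)` weighting are illustrative assumptions, not the project's actual code.

```python
import math
from collections import Counter


def tfidf_vectors(plots):
    """Return one {term: weight} dict per plot.

    plots: list of token lists (assumed here to be lemmatized event verbs).
    Uses tf = count / plot length and idf = log(N / df), one common variant.
    """
    n_docs = len(plots)
    # Document frequency: number of plots that contain each term at least once.
    df = Counter(term for plot in plots for term in set(plot))
    vectors = []
    for plot in plots:
        counts = Counter(plot)
        total = len(plot)
        vec = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / df[term])
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


# Hypothetical toy data: three plots, each a list of event lemmas.
plots = [
    ["rob", "escape", "chase", "escape"],
    ["marry", "escape", "travel"],
    ["rob", "betray", "escape"],
]
vectors = tfidf_vectors(plots)
```

In this toy data, an event that occurs in every plot ("escape") gets weight zero, while an event unique to one plot ("chase") gets the highest weight there, which is the behaviour wanted from an importance vector.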

Process

After data pre-processing, our approach relies on four similarity mechanisms: Event Similarity, Named Entity Similarity, Term Similarity and Character Similarity. The architecture of the system is shown in detail below:

alt text
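
As a rough illustration of how four similarity scores might be merged into one ranked movie list, the sketch below uses a weighted sum. The README does not state how the scores are combined, so the weighted sum, the weight values, the function name `rank_movies`, and the sample scores are all assumptions made for this example.

```python
def rank_movies(scores_by_movie, weights):
    """Combine per-mechanism similarity scores into one score per movie.

    scores_by_movie: {title: {mechanism: score}}
    weights: {mechanism: weight}; a missing score counts as 0.0.
    Returns (title, combined score) pairs, best match first.
    """
    ranked = []
    for title, scores in scores_by_movie.items():
        total = sum(weights[name] * scores.get(name, 0.0) for name in weights)
        ranked.append((title, total))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Assumed weights; the project may weight the mechanisms differently.
WEIGHTS = {"event": 0.4, "named_entity": 0.2, "term": 0.2, "character": 0.2}

# Hypothetical similarity scores for one query.
scores = {
    "Movie A": {"event": 0.9, "named_entity": 0.1, "term": 0.5, "character": 0.3},
    "Movie B": {"event": 0.2, "named_entity": 0.8, "term": 0.4, "character": 0.6},
}
ranked = rank_movies(scores, WEIGHTS)
```

With these numbers, "Movie A" ranks first because its strong event match outweighs the better named-entity and character matches of "Movie B".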

Extraction Engine

  • Plot and query sentences are fed to K-Parser to generate their semantic parse trees
      • Verb nodes are extracted
      • Verbs are lemmatized and stored
  • Named entities are extracted using Stanford CoreNLP
      • Name, Location, Organization
  • Entities are extracted using NLTK
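
The verb-extraction steps above can be sketched as follows. The graph structure, the relation labels, and the tiny lemma table are hypothetical stand-ins: the real K-Parser output format and the lemmatizer used by the project are not shown in this README. The only convention relied on is that Penn Treebank verb tags start with `VB`.

```python
# Hypothetical, simplified semantic graph for "The thief robs the bank
# and escaped"; NOT the real K-Parser output format.
graph = {
    "nodes": {
        "n1": {"word": "robs", "pos": "VBZ"},
        "n2": {"word": "thief", "pos": "NN"},
        "n3": {"word": "bank", "pos": "NN"},
        "n4": {"word": "escaped", "pos": "VBD"},
    },
    "edges": [
        ("n1", "agent", "n2"),
        ("n1", "recipient", "n3"),
        ("n1", "next_event", "n4"),
    ],
}

# Toy lemma table standing in for a real lemmatizer.
LEMMAS = {"robs": "rob", "escaped": "escape"}


def extract_event_lemmas(graph, lemmas):
    """Collect lemmas of all verb nodes, in node-id order."""
    events = []
    for node_id in sorted(graph["nodes"]):
        node = graph["nodes"][node_id]
        if node["pos"].startswith("VB"):  # VB, VBD, VBG, VBN, VBP, VBZ
            word = node["word"].lower()
            # Fall back to the surface form when no lemma is known.
            events.append(lemmas.get(word, word))
    return events


events = extract_event_lemmas(graph, LEMMAS)
```

The resulting list of event lemmas is the kind of input that an event-similarity comparison between a query and a plot could work on.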

alt text

Character Extraction

alt text

Search Engine

alt text

Character Similarity

alt text

NER Similarity

alt text

Term Similarity

alt text

Ranking Engine

alt text

Experiments and Results

alt text alt text alt text alt text