Wikipedia Search Engine

This is a search engine built on the full English Wikipedia corpus (~75 GB).

Performance

For queries of

fewer than 3 words, time to fetch results is < 1 s
between 3 and 7 words, time to fetch results is around 5 s

Code Files

myWikiIndexer.py - Indexing code: functions for XML parsing, text preprocessing, and building the initial index.
k-way-merge.py - Functions for the k-way merge sort of the initial index files; also creates the secondary index.
search.py - Main file containing all the query-processing code.
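
As an illustration of the k-way merge step, here is a minimal sketch in Python. It is not the code from k-way-merge.py: the in-memory (term, postings) pairs and the function name merge_sorted_runs are assumptions made for the example, and the real script works on index files on disk.

```python
import heapq
from itertools import groupby


def merge_sorted_runs(runs):
    """Merge sorted runs of (term, postings) pairs into one sorted stream.

    Postings for the same term coming from different runs are concatenated.
    The pair layout is a hypothetical stand-in for the real index format.
    """
    # heapq.merge does the k-way merge lazily, holding one item per run.
    merged = heapq.merge(*runs, key=lambda pair: pair[0])
    for term, group in groupby(merged, key=lambda pair: pair[0]):
        yield term, [doc for _, postings in group for doc in postings]


# Three small, already sorted runs standing in for partial index files.
runs = [
    [("apple", [1]), ("cricket", [2, 5])],
    [("bat", [3]), ("cricket", [7])],
    [("apple", [9])],
]
result = list(merge_sorted_runs(runs))
```

Because only one entry per run is held in memory at a time, the same idea scales to partial indexes that are far larger than RAM when the runs are read line by line from files.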

Execution of Code

Prerequisites

Required Directories

index - The initial index is created here.
finalIndex - The initial index files are merged into this directory, which also stores the secondary index.
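
One common way to use a secondary index is to record the first term of each merged index file and binary-search that list at query time. The sketch below assumes that layout; the README does not specify the actual format, and the names find_index_file and first_terms are hypothetical.

```python
from bisect import bisect_right


def find_index_file(first_terms, term):
    """Return the position of the index file that could contain `term`.

    `first_terms` is assumed to be the sorted list of the first term stored
    in each merged index file. Returns None if `term` sorts before all of them.
    """
    pos = bisect_right(first_terms, term) - 1
    return pos if pos >= 0 else None


# Hypothetical secondary index covering four merged index files.
first_terms = ["aardvark", "cricket", "moon", "sachin"]
file_for_india = find_index_file(first_terms, "india")
```

With this layout a lookup touches one small in-memory list and then a single index file, rather than scanning the whole index.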

Required Files

stopwords.txt - A text file containing all the stop words, located in the same directory as the code.
wiki_dump.xml - The XML file containing the full Wikipedia dump.
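
To show how a stop word file is typically used during preprocessing, here is a minimal sketch. It assumes stopwords.txt holds one stop word per line, which the README does not state, and the tokenization rule (lowercase runs of letters and digits) is an illustrative choice rather than the one used in myWikiIndexer.py.

```python
import re


def load_stopwords(path):
    """Read stop words from a file, assuming one word per line."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def tokenize(text, stopwords):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stopwords]


# An in-memory set stands in for load_stopwords("stopwords.txt") here.
sample_stopwords = {"the", "of", "in"}
tokens = tokenize("The History of Cricket in India", sample_stopwords)
```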

Execution

Run myWikiIndexer.py with the path to the dump and the index folder as command-line arguments.
Run k-way-merge.py - Sorts and merges the index and creates the secondary index.
Run search.py - Runs an infinite loop that waits for queries.

Types of Queries

Normal query - Any sequence of words that does not match the field query format below is treated as a normal query, e.g. “Sachin Tendulkar”.
Field query - Each field is written in lowercase letters (b, i, c, t, r, e) followed by a colon and a term, and field terms are separated by spaces, e.g. “body:sachin infobox:2003 category:sports”.
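
The two query types can be told apart with a small parser. The sketch below is an assumption about how such parsing could look, not the code in search.py: the function name parse_query is hypothetical, and field names are passed through without validation because the README lists single letters but its example uses full names.

```python
def parse_query(query):
    """Split a query into (field, term) pairs.

    A token of the form field:term yields (field, term); any other token is
    treated as a normal query term and yields (None, token).
    """
    parsed = []
    for token in query.lower().split():
        field, sep, term = token.partition(":")
        if sep and field and term:
            parsed.append((field, term))
        else:
            parsed.append((None, token))
    return parsed


field_query = parse_query("body:sachin infobox:2003 category:sports")
normal_query = parse_query("Sachin Tendulkar")
```

A query counts as a field query when at least one pair has a field that is not None; otherwise every term is searched across all fields.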
