Search-Engine-Politicians-Sinhala

This repository contains the source code for a Sinhala politicians search engine built with Python and Elasticsearch.

Directory Structure

The important files and directories of the repository are shown below.

├── Corpus : data scraped from the [website](https://en.wikipedia.org/wiki/List_of_Sri_Lankan_politicians)
    ├── politician_corpus_english.csv : original data scraped from the website in English
    ├── politician_corpus_sinhala.csv : data scraped from the website, translated into Sinhala
    ├── politician_meta_data_corpus.json : contains all metadata related to the politicians
    └── politician_corpus.json : contains the final politician set
├── Frontend : React frontend
├── Scrap : Source code for the data scraper
    ├── scrap.py : Source code for the web scraper and translator
    └── scrap_all.py : Source code to create the corpus; contains all the URLs
├── Search : Source code for the search backend
    ├── app.py : Backend of the web app, created using Flask
    ├── search.py : Search functions used to classify user search phrases and build Elasticsearch queries
    ├── facetedSearch.py : Search function used for faceted search using filters
    └── upload_data.py : Script to upload data to the Elasticsearch cluster
└── queries.txt : Example queries

Starting the web app

Quick Start

Prerequisites:

Python, Flask, the requests library, and Elasticsearch must be installed on your machine.

Steps:

Clone the repository.
Run an Elasticsearch instance on port 9200.
Go to the Search folder and run the Python script upload_data.py to load the corpus into Elasticsearch.
Then run the Python script app.py.

git clone https://github.com/Sachini-Dissanayaka/Search-Engine-Politicians-Sinhala.git
cd Search-Engine-Politicians-Sinhala
cd Search
python upload_data.py
python app.py

To run the web scraper

cd Search-Engine-Politicians-Sinhala
cd Scrap
python scrap_all.py

Data fields

Each politician record contains a subset of the following data fields:

Name - English
Name - Sinhala
Gender - English
Gender - Sinhala
Period - English
Period - Sinhala
Political Party - English
Political Party - Sinhala
Position - English
Position - Sinhala
Rate
Early Life
Education
Political Career
Family

Data Scraping Process

The dataset for the project was scraped from the website List of Sri Lankan politicians using the HTML/XML parsing library BeautifulSoup. The search engine contains a collection of 259 politicians, each with 6 metadata fields and 4 additional descriptive fields: Early Life, Education, Political Career, and Family. All the data on the website was available only in English. While scraping, I first cleaned the data using simple regex techniques, and then translated all of it into Sinhala using the googletrans library to provide full support for Sinhala-language queries.
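The cleaning step described above can be sketched roughly as follows. This is a minimal illustration and not the code in scrap.py: the function names `clean_text` and `clean_record`, the specific regex patterns, and the `translate` callable are all assumptions made for the example. The translation call itself is left out; any function that maps an English string to a Sinhala string could be passed in.

```python
import re


def clean_text(raw):
    """Clean one scraped string (hypothetical helper, patterns are assumptions)."""
    # Drop bracketed markers such as [1] or [citation needed].
    text = re.sub(r"\[[^\]]*\]", "", raw)
    # Collapse runs of whitespace, including newlines, into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def clean_record(record, translate=None):
    """Clean every field of a record, then optionally translate each value.

    `translate` is a hypothetical callable (str -> str); it stands in for
    whatever translator is used and is not part of this sketch.
    """
    cleaned = {key: clean_text(value) for key, value in record.items()}
    if translate is not None:
        cleaned = {key: translate(value) for key, value in cleaned.items()}
    return cleaned
```

Keeping cleaning and translation as separate steps means the cleaned English text can be stored as-is and translated later, which matches the separate English and Sinhala corpus files listed in the directory structure.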

Search Process

Indexing and querying

The standard analyzer was used for indexing the dataset. Since Sinhala has no distinction between lowercase and uppercase letters, I disabled the lowercase token filter, which is enabled by default in the standard analyzer. A primary index named “index-politicians” was created and the data was indexed under it. Elasticsearch queries are generated dynamically, depending on the intent and the data extracted from the query string entered by the user. If no data can be extracted from the query string, a basic multi_match query is sent. If the query string contains keywords related to a specific data field such as Name, Gender, Period, Political Party, or Position, that field is boosted in the request and a filter is added on top of the query. The results are sorted either by the score given by Elasticsearch or by the rate of the politician, depending on the input query. To serve misspelled queries, fuzzy matching is enabled by setting fuzziness to auto in all the queries.
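As a rough sketch of the indexing and querying described above, the example below builds the index settings and a query body as plain Python dictionaries. It is not the code in search.py or upload_data.py. The analyzer name `sinhala_standard`, the field names in `SEARCH_FIELDS`, the `rate` sort field, the boost factor of 3, and the `build_query` function are assumptions made for illustration; only the index name and the use of multi_match with fuzziness set to auto come from the description above.

```python
INDEX_NAME = "index-politicians"  # index name given in this README

# Standard tokenizer with an empty filter list, so no lowercase filter runs.
# The analyzer name is hypothetical.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "sinhala_standard": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [],
                }
            }
        }
    }
}

# Hypothetical field names; the real corpus may use different keys.
SEARCH_FIELDS = ["name_si", "gender_si", "period_si", "party_si", "position_si"]


def build_query(text, boost_field=None, filters=None, sort_by_rate=False, size=10):
    """Build an Elasticsearch request body for one user query."""
    # Boost the field the query seems to be about (the factor 3 is an assumption).
    fields = [f + "^3" if f == boost_field else f for f in SEARCH_FIELDS]
    match = {"multi_match": {"query": text, "fields": fields, "fuzziness": "AUTO"}}
    body = {"size": size, "query": {"bool": {"must": [match]}}}
    if filters:
        # Filter clauses restrict the results without affecting the score.
        body["query"]["bool"]["filter"] = [
            {"match": {field: value}} for field, value in filters.items()
        ]
    if sort_by_rate:
        # Sort by the politician's rate instead of the relevance score.
        body["sort"] = [{"rate": {"order": "desc"}}]
    return body
```

With no boost field, filters, or sort, `build_query` returns the basic multi_match query used when nothing can be extracted from the query string. The resulting dictionary can be sent as the request body of a search against the index.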

Advanced Features

Text mining and text preprocessing
- Search queries are preprocessed before intent classification: the query is cleaned using simple regex techniques and then translated into Sinhala using a translator.
Intent Classification
- Once a query is entered, the intent behind it is determined by intent classification. For example, a query like “හොඳම ගැහැණු අගමැතිවරු 6” will return the 6 female prime ministers with the highest rate.
Faceted Search
- The search engine supports faceted search on Political Party, Gender, and Position.
Synonyms support
- The search engine also supports synonyms in Sinhala. For example, “කාන්තා අගමැතිවරු” will return all the female prime ministers even though the politician data does not contain the word “කාන්තා” in the gender field or in any other field.
Resistant to simple spelling errors
- The search engine handles misspelled queries using fuzziness.
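The intent classification and synonym handling listed above could look something like the rule-based sketch below. It is only an illustration and not the logic in search.py: the keyword tables, the canonical values such as "female", and the `classify` function are assumptions, and the tables contain only the words that appear in the example queries above.

```python
import re

# Hypothetical keyword tables built from the example queries in this README.
BEST_WORDS = {"හොඳම"}  # "best": sort the results by rate
GENDER_WORDS = {"ගැහැණු": "female", "කාන්තා": "female"}  # synonyms share one value
POSITION_WORDS = {"අගමැතිවරු": "prime minister"}


def classify(query):
    """Split a query into sort intent, result count, filters, and leftover terms."""
    intent = {"sort_by_rate": False, "size": None, "filters": {}, "remaining": []}
    for token in query.split():
        if re.fullmatch(r"[0-9]+", token):
            # A bare number is read as the requested number of results.
            intent["size"] = int(token)
        elif token in BEST_WORDS:
            intent["sort_by_rate"] = True
        elif token in GENDER_WORDS:
            intent["filters"]["gender"] = GENDER_WORDS[token]
        elif token in POSITION_WORDS:
            intent["filters"]["position"] = POSITION_WORDS[token]
        else:
            # Anything unrecognized is kept for the free-text part of the search.
            intent["remaining"].append(token)
    return intent
```

Mapping each synonym to a single canonical value is one simple way to let “කාන්තා” and “ගැහැණු” select the same gender filter, even when only one of the words appears in the indexed data.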
