Skip to content

Latest commit

 

History

History
156 lines (111 loc) · 8.4 KB

README.md

File metadata and controls

156 lines (111 loc) · 8.4 KB

Zensearch

A distributed search engine where user's are able to control what they can search, they can manually crawl specific websites of their liking and based on what they want to work with everyday. User's can crawl the web with a click of a button, and while crawling they can continue using the search feature to query existing webpages in their database, crawling might take some time because of security reasons and network throttling mechanism used by different website authors.

Snapshots

home-zensearch crawl-section-zensearch crawling-sample search-example

Concepts for ranking webpages

TF-IDF: "In information retrieval, tf–idf (also TF*IDF, TFIDF, TF–IDF, or Tf–idf), short for term frequency–inverse document frequency, is a measure of importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general.Like the bag-of-words model, it models a document as a multiset of words, without word order. It is a refinement over the simple bag-of-words model, by allowing the weight of words to depend on the rest of the corpus." source.

  • Despite it being fast to determine the ranking of each document relative to the user's search query, it disregards the context of the query as long as a document matches the terms in the search query eg: "dog bites man" or "man bites dog" does not matter in the context of this model.

Document Length Normalization: Is a technique which mitigates the length bias of a document whose context might not be relevant to the query of a user, because a document's length is greater than all other documents that contains the user's query, the term frequency of a longer document might be much more higher compared to all other documents if we disregard the concentration of a lengthy document to the term.

  • The Document length normalization mitigates the length of a long document by dividing: currentDocLength / avgDocLength and controlled by b which controls the normalization of the document to determine the concentration of the term in that document. if the term frequency is proportionate to the document length then that means the current document is relevant to the query, else if the document is longer than average and is not proportionate to the term frequency then it is most likely no the main focus of the document.

Beyond TF-IDF using BM25: I am no expert but from my understanding of the BM25 Model is that it is an instance of the TF-IDF but with super powers, where the relevancy of a document is controlled by constants k1 & b where k1 controls the weight of a term frequency in a document or how much impact this term has throughout a document.

  • In k1 if the constant is set to a lower value, it saturates the term very quickly which diminishes the term frequency as the term grows and stops to a certain point but if it is set to a higher value eg: k1 = 2 it will grow a bit slower up to a point where it begins slow down the rate as the term grows.

  • The b controls the normalization of the length of the document relative to the term's relevancy or controls the concentration of the term in the document, if the term is sparse and is not mentioned enough in a long document and if b is high from 0-1, then long documents will be punished which means they will be scored lower, but if a document mentions the term more frequently and is more concetrated throughout the whole document, the document will be scored higher, using 0 normalization will render the document to only consider the term frequency and not consider if the document is relative to the term.

TODO

  • Save the most recently crawled webpage for continuation.
  • Create cancellation for crawling but still save the indexed pages up to that point.
  • One click to clear database.
  • Let users delete a website from the sqlite database from the client-side.
  • Documentation.
  • Create a list of known websites (websites that have been indexed.) on the front-end

IMPORTANT FOR USERS OF THIS PROJECT

Crawl Cancellations or Downed Connections

As of now while crawling webpages when for some reason user's would want to cancel or any system errors might occur while crawling, the webpages that are crawled up to that point will NOT BE SAVED unfortunately, I'll have to implement a safety net for saving and resuming crawled webpages, sorry about that.

So for now just crawl websites with fewer contents/pages for expermentation and testing.

  • if the Crawl list sidebar is opened with ongoing crawls and zensearch is down, to reset the UI navigate to the browser's devtools and clear the cookies and local storage.

Disclaimer

User will have take to full responsibility in the event that they will be blocked by a website author, so make sure you're crawling a website that would generally accept web crawlers and has a rate-limiting mechanism in their services, I have implemented a rudimentary rate-limiter for the crawler in crawler/page_navigator.go file called requestDelay().

/*
using elapsed time from start to end of request in milliseconds and compressing
it using log to smooth the values for increasing intervals for each requests
such that it doesnt grow too much when multiplying intervals.

multiplier values:
  - 0 ignores all intervals
  - 1 increases slowly but is still fast and might be blocked
  - 2 sweet middleground
*/
func (pn *PageNavigator) requestDelay(multiplier int) {
	max := 10000
	base := int(math.Log10(float64(pn.mselapsed)))

	fmt.Printf("CURRENT ELAPSED TIME: %d\n", pn.mselapsed)
	if pn.interval < max {
		pn.interval = (pn.interval + base) * multiplier
		fmt.Printf("INCREASE INTERVAL: %d\n", pn.interval)
	} else if pn.interval > max {
		fmt.Printf("RESET INTERVAL: %d\n", pn.interval)
		pn.interval = 0
	}
	time.Sleep(time.Duration(pn.interval * 1000000))
}

So be careful and read their robots.txt file from their website https://<website-hostname>/robots.txt.

How to Run

# make sure you have docker and docker compose installed
# run these commands

cd path/to/zensearch/


docker compose up

Database

The project uses Sqlite3 database which is stored within database/dist/website_collection.db, you can go into it if you have sqlite3 installed in your system and if not go ahead and install then after that:

  • cd to the database/dist/website_collection.db
  • run sqlite3 website_collection.db

TABLES:

  • indexed_sites
  • known_sites (i know both of these mean the same thing)
  • webpages

SCHEMAS:

known_sites (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    last_added INTEGER NOT NULL
);
indexed_sites (
    id INTEGER PRIMARY KEY,
    primary_url TEXT NOT NULL UNIQUE,
    last_indexed INTEGER NOT NULL
);
webpages (
    parent INTEGER REFERENCES indexed_sites(id),
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL ,
    title TEXT,
    contents TEXT
);

Modifying data within sqlite3 in Docker Compose

I have not yet implemented a way for users to delete individual crawled websites, so in order for the users to remove their indexed websites, users will have to navigate into the running container.

How to modify a running database container

  • if you are in the zensearch directory do docker exec -ti db sh else find the docker running docker container prefixed with zensearch- using docker ps copy the container id of zensearch-db then do the same docker exec -ti <id of container> sh remove the brackets.
  • run sqlite3 dist/webiste_collection.db
  • remove the indexed website's data

Tools and Dependencies

Message Broker

RabbitMQ

Frontend

No Framework just vanilla yavascript

Backend Server

ExpressJs NodeJS

Database Service

NodeJS Sqlite3 for Nodejs

Crawler Service

Go Selenium Container

Search Engine Service

Go