PDF Search Engine

This project is a PDF search engine that uses Natural Language Processing (NLP) techniques to extract pages from PDF documents and rank them against user queries. The application analyzes the content of each uploaded PDF and ranks its pages by relevance to the user's query in real time.

Features

PDF Upload: Users can upload PDF files for analysis.
Real-Time Query Processing: The search engine provides real-time ranking of PDF pages based on user queries.
NLP Techniques: Uses tokenization, stopword removal, stemming, lemmatization, and TF-IDF scoring to improve search accuracy.
Page Ranking: Returns ranked pages based on relevance to the query, allowing users to directly view content from specific pages.

Technologies Used

Python: The main programming language for development.
Streamlit: A framework to create the web application.
NLTK: Natural Language Toolkit for NLP tasks.
LangChain: For handling PDF document loading.

Requirements

Make sure you have the following libraries installed:

```
pip install streamlit langchain_community nltk
```

You also need to download the necessary NLTK resources by running:

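```
import nltk
nltk.download('stopwords')
nltk.download('punkt')
# NLTK 3.9 and later look for 'punkt_tab' instead; download it as well
# if tokenization raises a LookupError.
nltk.download('wordnet')
```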

Usage

Clone the repository:

```
git clone https://github.com/yourusername/pdf-search-engine.git
cd pdf-search-engine
```

Run the Streamlit application:
```
streamlit run app.py
```
Upload a PDF document and enter your query in the text input. The application then displays the pages of the document ranked by relevance to the query.

How It Works

Preprocessing: The application preprocesses the PDF content by tokenizing, stemming, lemmatizing, and removing stopwords to generate a word frequency dictionary for each page.
Page Ranking: The application calculates TF-IDF scores for the words in the query and ranks the pages by those scores. This is TF-IDF relevance ranking rather than the link-based PageRank algorithm.
Results: Users can view the ranked pages and click on the page number to see the content directly.
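
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the code in app.py: the function names `page_word_counts` and `rank_pages` are hypothetical, a regex tokenizer and a tiny stopword set stand in for NLTK's tokenizer, stemmer, lemmatizer, and stopword list, and the smoothed IDF formula is one common choice that may differ from the one the application uses.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword set; the project itself relies on NLTK's list.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "on", "to"}


def page_word_counts(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords, count words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def rank_pages(pages, query):
    """Return (page_number, score) pairs sorted by descending TF-IDF score.

    pages is a list of page texts; page numbers are 1-based.
    """
    counts = [page_word_counts(p) for p in pages]
    n_pages = len(counts)
    query_terms = set(page_word_counts(query))

    scores = []
    for number, page in enumerate(counts, start=1):
        total = sum(page.values())
        score = 0.0
        for term in query_terms:
            if total == 0 or page[term] == 0:
                continue
            tf = page[term] / total                        # term frequency on this page
            df = sum(1 for c in counts if c[term] > 0)     # pages containing the term
            idf = math.log((1 + n_pages) / (1 + df)) + 1   # smoothed IDF (an assumption)
            score += tf * idf
        scores.append((number, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


pages = [
    "The cat sat on the mat.",
    "Dogs chase cats in the garden.",
    "Stock markets fell sharply today.",
]
ranking = rank_pages(pages, "cat on the mat")
for number, score in ranking:
    print(f"page {number}: {score:.3f}")
```

Because this sketch skips stemming and lemmatization, "cats" on the second page does not match the query word "cat"; the stemming and lemmatization steps described above are what let such variants count as the same term.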

Thank you!!
