OCR Text Extraction Using Streamlit

Overview

A Streamlit application that uses Optical Character Recognition (OCR) to extract text from images and PDF files. The app preprocesses each image (deskewing, binarization, noise removal, contrast enhancement) before running OCR to improve accuracy, and lets you download the extracted text.

Key Features

- Multi-file upload support (PNG, JPG, JPEG, PDF)
- Image preprocessing before OCR (deskewing, binarization, noise removal, contrast enhancement)
- Configurable OCR options
- Real-time text extraction
- Downloadable extracted text files

Technologies

- Python 3.7+
- Streamlit
- OpenCV
- Pytesseract
- PyMuPDF
- NumPy
- Pillow

OCR Processing Techniques

Image Preprocessing

- Deskewing
- Binarization (Otsu's method)
- Noise removal
- Contrast enhancement
- Page segmentation mode selection
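
To make the binarization step concrete, here is a minimal pure-Python sketch of Otsu's method: it picks the gray-level threshold that maximizes the variance between the dark and light pixel classes. This is an illustration only, not the app's code; the function names `otsu_threshold` and `binarize` are hypothetical. With OpenCV, the same threshold is normally obtained by passing `cv2.THRESH_BINARY | cv2.THRESH_OTSU` to `cv2.threshold`.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 8-bit threshold that maximizes between-class variance."""
    # Histogram of gray levels 0..255.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is background; no further split is possible
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]


# Toy "image": half dark ink, half light paper, flattened to one list.
sample = [10] * 50 + [200] * 50
t = otsu_threshold(sample)
binary = binarize(sample, t)
```

On a real page the input would be the flattened grayscale image; the threshold separates ink from paper without a hand-tuned constant.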

Pytesseract Configuration

- LSTM neural network mode
- Uniform text block assumption
- Interword space preservation
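
The three options above plausibly correspond to the Tesseract settings `--oem 1` (LSTM engine only), `--psm 6` (assume a single uniform block of text), and `-c preserve_interword_spaces=1`. That mapping is an assumption, since the app's exact flags are not listed here. The hypothetical helper `build_tesseract_config` below sketches how such a configuration string could be assembled for Pytesseract.

```python
def build_tesseract_config(oem: int = 1, psm: int = 6,
                           preserve_spaces: bool = True) -> str:
    """Build a Tesseract CLI option string for pytesseract's `config` argument."""
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")

    parts = [f"--oem {oem}", f"--psm {psm}"]
    if preserve_spaces:
        # Keep runs of spaces between words instead of collapsing them.
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)


config = build_tesseract_config()

# With pytesseract and Pillow installed, the string would be used like this
# (the file name is a placeholder):
#   import pytesseract
#   from PIL import Image
#   text = pytesseract.image_to_string(Image.open("page.png"), config=config)
```

Keeping the options in one function makes it straightforward to wire sidebar controls to the OCR call, for example by passing the selected page segmentation mode as `psm`.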

Installation

Clone the repository:

```
git clone https://github.com/yourusername/ocr-text-extraction-app.git
cd ocr-text-extraction-app
```

Install dependencies:
```
pip install -r requirements.txt
```
Install Tesseract OCR:
- Ubuntu: `sudo apt-get install tesseract-ocr`
- macOS: `brew install tesseract`
- Windows: Download from Tesseract GitHub
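
Pytesseract is a wrapper around the `tesseract` executable, so OCR fails if the binary is not on the `PATH`. As a quick check after installation, the sketch below looks up the executable and reads the first line of its version output using only the standard library. The function name `tesseract_version` is hypothetical and not part of the app.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if not installed."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some Tesseract releases print the version to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output.strip() else None


version = tesseract_version()
if version is None:
    print("Tesseract was not found on PATH; install it before running the app.")
else:
    print(f"Found {version}")
```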

Usage

Run the Streamlit app:
```
streamlit run app.py
```
- Open http://localhost:8501 in your web browser.
- Upload images or PDF files.
- Configure OCR options in the sidebar.
- Download the extracted text files.

Deployment

Deploy using Streamlit Cloud:

- Push code to GitHub
- Connect Streamlit Cloud to your repository
- Configure build settings
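
Streamlit Cloud installs Python dependencies from `requirements.txt` and system (apt) packages from a `packages.txt` file at the repository root, one package per line. As an assumption about what the build needs, a minimal `packages.txt` that provides the Tesseract binary could look like this:

```
tesseract-ocr
```

Additional language data would be extra lines with the corresponding apt package names.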

Contributing

Contributions are welcome! Please submit pull requests to the main repository.

License

MIT License - see LICENSE file for details.

PhoenixAlpha23/Pytesseract-Webapp
