A Streamlit application that uses Optical Character Recognition (OCR) to extract text from images and PDF files. The app preprocesses each image (deskewing, binarization, noise removal, contrast enhancement) before running OCR to improve recognition accuracy, and lets you download the extracted text.
- Multi-file upload support (PNG, JPG, JPEG, PDF)
- Advanced image preprocessing techniques
- Configurable OCR options
- Real-time text extraction
- Downloadable extracted text files
Requirements:
- Python 3.7+
- Streamlit
- OpenCV
- Pytesseract
- PyMuPDF
- NumPy
- Pillow
Image preprocessing techniques:
- Deskewing
- Binarization (Otsu's method)
- Noise removal
- Contrast enhancement
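Of the techniques listed above, Otsu binarization is the easiest to show in isolation. The app presumably delegates it to OpenCV (`cv2.threshold` with the `cv2.THRESH_OTSU` flag); the dependency-free sketch below only illustrates what the method computes. The function names `otsu_threshold` and `binarize` are hypothetical and not taken from the app.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a sequence of 8-bit gray values (0-255).

    Otsu's method picks the threshold that maximizes the between-class
    variance of the two groups of pixels it separates.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # sum of gray values of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map values above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    # Toy "image": dark text pixels (10) on a light background (200).
    gray = [10] * 50 + [200] * 50
    t = otsu_threshold(gray)
    print(t, binarize(gray, t)[:3], binarize(gray, t)[-3:])
```

On a real page the input would be the flattened grayscale image, and the binarized result is what gets passed on to the OCR engine.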
Configurable OCR options:
- Page segmentation mode selection
- LSTM neural network mode
- Uniform text block assumption
- Interword space preservation
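The options above correspond to standard Tesseract command-line settings. As a minimal sketch of how sidebar choices might be turned into a Tesseract config string: the helper name `build_tesseract_config` and its parameters are hypothetical, and treating "uniform text block" as page segmentation mode 6 is an assumption about how the app wires it up. The flags themselves (`--psm`, `--oem 1`, `-c preserve_interword_spaces=1`) are Tesseract's own.

```python
def build_tesseract_config(psm=3, use_lstm=True, preserve_spaces=False):
    """Build a Tesseract config string from a few OCR options.

    psm: page segmentation mode (Tesseract accepts 0-13; 3 is the default
         fully automatic mode, 6 assumes a single uniform block of text).
    use_lstm: select the LSTM neural network engine only (--oem 1).
    preserve_spaces: keep runs of spaces between words in the output.
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")

    parts = [f"--psm {psm}"]
    if use_lstm:
        parts.append("--oem 1")
    if preserve_spaces:
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_tesseract_config(psm=6, use_lstm=True, preserve_spaces=True))
```

The resulting string can be passed to Pytesseract through the `config` argument, for example `pytesseract.image_to_string(image, config=config)`.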
- Clone the repository:

      git clone https://github.com/yourusername/ocr-text-extraction-app.git
      cd ocr-text-extraction-app
- Install dependencies:

      pip install -r requirements.txt
- Install Tesseract OCR:
  - Ubuntu:

        sudo apt-get install tesseract-ocr

  - macOS:

        brew install tesseract

  - Windows: Download from Tesseract GitHub
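After installing, it can help to confirm that the `tesseract` binary is actually reachable, since Pytesseract only wraps the command-line tool. The check below is a small sketch using only the standard library; the function name `tesseract_version` is hypothetical.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print the version banner to stderr.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("tesseract was not found on PATH")
    else:
        print(version)
```

If the binary is installed but not on `PATH` (common on Windows), Pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.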
- Run the Streamlit app:

      streamlit run app.py
- Open http://localhost:8501 in your web browser
- Upload images or PDF files
- Configure OCR options in the sidebar
- Download extracted text files
Deploy using Streamlit Cloud:
- Push code to GitHub
- Connect Streamlit Cloud to your repository
- Configure build settings
Contributions are welcome! Please submit pull requests to the main repository.
MIT License - see LICENSE file for details.