As of October 8, 2024, this repository is archived because it is old and no longer in use.

OCR image preparation

This application offers basic functionality for preparing images for further processing with OCR (Optical Character Recognition) software. It is intended specifically for preprocessing scanned book pages and runs on single image files (PNG or JPEG), whole directories, or a ZIP file containing images. The application requires ImageMagick, Java 1.5 or later, Apache Maven, and JMagick.

Functionality currently included

Contrast correction: all of the tools included, except SplitLines, have a command line option for setting the contrast level;
Page splitting: all of the tools included, except SplitLines, split an image in two if the width of the input image exceeds the height;
Skew correction: corrects the rotation of the input image;
Margin correction: crops the image to the section containing the text;
Line splitting: splits a block of text into separate images, one per line;
Run-length smoothing (RLSA): converts sequences of intermittently colored pixels into a fully colored sequence. This essentially converts the printed parts of a page into a completely colored block. RLSA is used by the application to detect text sections on a page.
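As a rough illustration of the run-length smoothing described above, the following sketch fills short gaps in a single row of a binarized image. The class and method names (RlsaSketch, smoothRow) and the boolean-array representation are assumptions made for this example; they are not taken from the application's source.

```java
import java.util.Arrays;

final class RlsaSketch {
    /**
     * Smooths one row of a binarized image, where true marks a colored pixel.
     * A run of uncolored pixels is filled only if it lies between two colored
     * pixels and is at most `threshold` pixels long.
     */
    static boolean[] smoothRow(boolean[] row, int threshold) {
        boolean[] out = Arrays.copyOf(row, row.length);
        int lastColored = -1; // index of the most recent colored pixel, -1 if none yet
        for (int x = 0; x < row.length; x++) {
            if (!row[x]) {
                continue;
            }
            int gap = x - lastColored - 1;
            if (lastColored >= 0 && gap > 0 && gap <= threshold) {
                Arrays.fill(out, lastColored + 1, x, true);
            }
            lastColored = x;
        }
        return out;
    }

    public static void main(String[] args) {
        // One gap of 2 pixels and one gap of 4 pixels.
        boolean[] row = {true, false, false, true, false, false, false, false, true};
        // With a threshold of 2, only the first gap is filled.
        System.out.println(Arrays.toString(smoothRow(row, 2)));
    }
}
```

Applying the same routine to every column instead of every row gives the vertical variant.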

Functionality to be added

Configurable page splitting: turn off the splitting if needed;
Column detection: this will improve line splitting for pages with multiple columns;
Image detection: detect images and other graphics and remove these.

Installation

Make sure you have ImageMagick, Java, and Apache Maven installed on your system. Next, download and install the latest version of JMagick.

Note for Mac users: be sure to add --with-shared-lib-ext=".dylib" to the ./configure command for JMagick.

Finally, download or clone this repository and descend into its main folder. Execute the following command to build the application (the $ denotes the terminal prompt and is not part of the command):

$ mvn package

This will create a new folder labeled target in which you will find a file ocr-prep-X.X.X-SNAPSHOT.jar. Feel free to rename this file or move it to a more convenient location.

Usage

The main application will apply all corrections (page splitting, skew correction, margin correction, line splitting) to each input image, with optional contrast correction, using the following command:

$ java -Djava.library.path=/path/to/ImageMagick/ -jar ocr-prep-X.X.X-SNAPSHOT.jar [-b] [-c X] [--saveAll] /path/to/input

The square brackets denote optional parameters; omit the brackets themselves when entering the command. By default, the output is written to a folder named output in the same directory as the processed image. With the '-b' option, the folder is named after the basename of the input file instead. The '-c' option adjusts the contrast of the input images: replace 'X' with a numerical value, where a higher value means more contrast. With '--saveAll', all intermediate images are saved to disk; without it, only the final result is saved (the processed image and the line images).

Some of the functionality can be called separately using the following command:

$ java -Djava.library.path=/path/to/ImageMagick/ -cp ocr-prep-X.X.X-SNAPSHOT.jar com.mediate18.ocr.tools.ClassName [-c X] [--saveAll] /path/to/input

As before, the square brackets denote optional parameters; omit the brackets themselves when entering the command. Replace 'ClassName' with one of the following classes:

CorrectSkew
CorrectMargins
SplitLines
RunLengthSmoothing

As with the main application, replace 'X' with a numerical value to set the contrast (SplitLines has no contrast option), and use '--saveAll' to keep intermediate images.

Configuration

Some settings used by the application can be configured by placing a file named ocr-prep.properties in the same directory as your ocr-prep-X.X.X-SNAPSHOT.jar file. The format for setting options is 'option=value'.
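To illustrate the 'option=value' format, here is a minimal sketch that reads such settings with java.util.Properties and falls back to the defaults documented in the option tables of this README. The class name OcrPrepConfigSketch and the way values are parsed are assumptions for illustration only; how the application itself reads the file is not documented here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class OcrPrepConfigSketch {
    /** Default values as listed in the option tables of this README. */
    static Properties defaults() {
        Properties d = new Properties();
        d.setProperty("horizontal_threshold1", "44");
        d.setProperty("vertical_threshold", "75");
        d.setProperty("horizontal_threshold2", "50");
        d.setProperty("lineheight_threshold", "0.01");
        return d;
    }

    /** Reads 'option=value' lines; options that are absent fall back to the defaults. */
    static Properties load(Reader source) {
        Properties config = new Properties(defaults());
        try {
            config.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return config;
    }

    public static void main(String[] args) {
        // Hypothetical file content that overrides a single option.
        Properties config = load(new StringReader("vertical_threshold=60\n"));
        int vertical = Integer.parseInt(config.getProperty("vertical_threshold"));
        double lineHeight = Double.parseDouble(config.getProperty("lineheight_threshold"));
        System.out.println(vertical + " " + lineHeight);
    }
}
```

In practice the Reader would wrap the ocr-prep.properties file next to the JAR rather than an in-memory string.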

Run-length smoothing options

RLSA is applied in two runs: the first run applies the algorithm both horizontally and vertically, and the second run applies only horizontal smoothing. The thresholds set the maximum gap between two colored pixels that will still be filled to form a continuously colored block, so suitable values depend on the resolution of the input image. It is therefore recommended to test the RLSA on a few sample images before processing an entire collection.

Option	Default	Description
horizontal_threshold1	44	Horizontal smoothing threshold in pixels for first RLSA run
vertical_threshold	75	Vertical smoothing threshold in pixels for first RLSA run
horizontal_threshold2	50	Horizontal smoothing threshold in pixels for second RLSA run
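The sketch below shows one way the three thresholds could fit together in a two-run scheme. It assumes a rectangular binarized image and assumes that the horizontal and vertical results of the first run are combined with a logical AND, as in the classic RLSA formulation; this README does not state how the application combines them. All class and method names are hypothetical.

```java
import java.util.Arrays;

final class TwoRunRlsaSketch {
    /** Fills gaps of at most `threshold` uncolored pixels between colored pixels in each row. */
    static boolean[][] smoothHorizontal(boolean[][] img, int threshold) {
        boolean[][] out = new boolean[img.length][];
        for (int y = 0; y < img.length; y++) {
            out[y] = img[y].clone();
            int last = -1; // index of the most recent colored pixel in this row
            for (int x = 0; x < img[y].length; x++) {
                if (!img[y][x]) {
                    continue;
                }
                if (last >= 0 && x - last - 1 <= threshold) {
                    Arrays.fill(out[y], last + 1, x, true);
                }
                last = x;
            }
        }
        return out;
    }

    /** Swaps rows and columns so the row-based routine can be reused for columns. */
    static boolean[][] transpose(boolean[][] img) {
        int height = img.length;
        int width = height == 0 ? 0 : img[0].length;
        boolean[][] t = new boolean[width][height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                t[x][y] = img[y][x];
            }
        }
        return t;
    }

    static boolean[][] smoothVertical(boolean[][] img, int threshold) {
        return transpose(smoothHorizontal(transpose(img), threshold));
    }

    /** First run: horizontal AND vertical smoothing (assumed). Second run: horizontal only. */
    static boolean[][] apply(boolean[][] img, int horizontal1, int vertical, int horizontal2) {
        boolean[][] hor = smoothHorizontal(img, horizontal1);
        boolean[][] ver = smoothVertical(img, vertical);
        boolean[][] both = new boolean[img.length][];
        for (int y = 0; y < img.length; y++) {
            both[y] = new boolean[img[y].length];
            for (int x = 0; x < img[y].length; x++) {
                both[y][x] = hor[y][x] && ver[y][x];
            }
        }
        return smoothHorizontal(both, horizontal2);
    }
}
```

With the documented defaults, the call would be TwoRunRlsaSketch.apply(image, 44, 75, 50).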

Line splitting options

The line splitting algorithm looks for horizontal rows of mostly uncolored pixels in between rows of mostly colored pixels. Ideally, each line of text is separated by a few rows of completely uncolored pixels, but in practice this is rarely the case. The line height threshold determines the margin between colored and uncolored rows, expressed as the percentage of colored pixels per row. It depends on both the resolution of the input image and the typesetting of the text.

Option	Default	Description
lineheight_threshold	0.01	The maximum percentage of colored pixels a row may have to be considered empty
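The row-based search described above can be sketched roughly as follows. The example assumes a binarized image stored as a boolean matrix and treats the threshold as a fraction between 0 and 1 (so 0.01 would mean 1% of the pixels in a row); whether the application interprets the value this way is an assumption, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.List;

final class LineSplitSketch {
    /**
     * Returns one {startRow, endRowExclusive} pair per detected text line.
     * A row counts as empty when its share of colored pixels is at most `threshold`.
     */
    static List<int[]> findLines(boolean[][] pixels, double threshold) {
        List<int[]> lines = new ArrayList<>();
        int start = -1; // first row of the line currently being tracked, -1 if none
        for (int y = 0; y < pixels.length; y++) {
            int colored = 0;
            for (boolean pixel : pixels[y]) {
                if (pixel) {
                    colored++;
                }
            }
            double fraction = pixels[y].length == 0 ? 0.0 : (double) colored / pixels[y].length;
            boolean empty = fraction <= threshold;
            if (!empty && start < 0) {
                start = y; // a new text line begins
            } else if (empty && start >= 0) {
                lines.add(new int[]{start, y}); // the current text line ends
                start = -1;
            }
        }
        if (start >= 0) {
            lines.add(new int[]{start, pixels.length}); // text runs to the bottom edge
        }
        return lines;
    }
}
```

Each returned pair could then be used to crop one line image out of the page.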
