Fall2018-Project4-sec1--sec1proj4_grp8 created by GitHub Classroom

TZstatsADS/Fall2018-Project4-sec1-grp8


Project: OCR (Optical Character Recognition)

image credit to wikipedia.org

Term: Fall 2018

  • Team #8

  • Team members

    • Shiqing Long: sl4225
    • Yang Yue: yy2826
    • Yiding Xie: yx2443
    • Yingqiao Zhang: yz3209

    (Names are listed in alphabetical order of last names.)

  • Paper: C2 + D1

Project summary: In this project, we created an OCR post-processing procedure to enhance Tesseract OCR output accuracy.

  1. We modified ground_truth into ground_truth_trimmed because 13 files in the data folder have mismatched rows between ground_truth and tesseract. For more details, please refer to README.md
  2. We first cleaned the data by filtering out all punctuation, then detected errors in the Tesseract output using the 8 rules from paper D-1: Shortening Documents and Weeding Out Garbage
  3. We located the ground-truth word corresponding to each detected error word:
    • If the corresponding rows in tesseract and ground_truth contain the same number of words, the ground-truth word is located directly by index.
    • If the word counts differ, we extract the 2 words before and the 2 words after the error word's position (5 positions in total) and apply a string-distance function (stringdist) to find the most likely ground-truth word.
  4. We selected possible candidates for each error, calculated feature scores for each candidate, and labeled a candidate 1 if it equals the ground truth and 0 otherwise.
  5. We ran AdaBoost.R2 to predict the top 3 best-matching candidates and used the top-1 prediction to replace every error word. C-2: Statistical Learning for OCR Text Correction prediction
  6. We evaluated OCR performance at both the word level and the character level using two formulas: formula
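
The ground-truth lookup in step 3 can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the repository's implementation: the function names `levenshtein` and `locate_truth_word` are hypothetical, a hand-written Levenshtein distance stands in for stringdist, and the 5-position window is assumed to be taken from the ground-truth row around the error word's index.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (a stand-in for stringdist)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def locate_truth_word(ocr_row, truth_row, err_idx, window=2):
    """Return the ground-truth word most likely matching ocr_row[err_idx]."""
    if len(ocr_row) == len(truth_row):
        # Equal word counts: assume the rows align position by position.
        return truth_row[err_idx]
    # Unequal counts: search the error position plus `window` words per side
    # (assumption: the window is taken from the ground-truth row).
    lo = max(0, err_idx - window)
    hi = min(len(truth_row), err_idx + window + 1)
    candidates = truth_row[lo:hi]
    if not candidates:
        return None
    return min(candidates, key=lambda w: levenshtein(ocr_row[err_idx], w))


ocr = ["the", "qu1ck", "brown", "fox"]
print(locate_truth_word(ocr, ["the", "quick", "brown", "fox"], 1))
print(locate_truth_word(ocr, ["the", "very", "quick", "brown", "fox"], 1))
```

Both calls return "quick": the first by direct indexing, the second because "quick" has the smallest edit distance to "qu1ck" within the window.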

Custom Functions

  • Five different functions were implemented for the purpose of this project. Detailed descriptions can be found here: README.md

OCR Performance Result

  • word wise: accuracy measured at the word level.
  • character wise: accuracy measured at the letter level.
  • Tesseract: the raw Tesseract output, before post-processing.
  • Tesseract_with_postprocessing: the post-processed data, that is, Tesseract with correction.

result
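
To illustrate the difference between word-wise and character-wise scoring, here is a small Python sketch. The two formulas used in this project are not spelled out in the text above, so the sketch assumes they are precision and recall computed over token overlap; the helper names `precision_recall`, `word_tokens`, and `char_tokens` are hypothetical and the numbers below are from a toy string, not from the project data.

```python
from collections import Counter


def precision_recall(ocr_tokens, truth_tokens):
    """Multiset-overlap precision and recall (assumed metric definitions)."""
    overlap = sum((Counter(ocr_tokens) & Counter(truth_tokens)).values())
    precision = overlap / len(ocr_tokens) if ocr_tokens else 0.0
    recall = overlap / len(truth_tokens) if truth_tokens else 0.0
    return precision, recall


def word_tokens(text):
    """Word-level tokens: lowercase, split on whitespace."""
    return text.lower().split()


def char_tokens(text):
    """Character-level tokens: every non-whitespace character."""
    return [c for c in text.lower() if not c.isspace()]


ocr = "the qu1ck brown fox"
truth = "the quick brown fox"
print(precision_recall(word_tokens(ocr), word_tokens(truth)))  # (0.75, 0.75)
print(precision_recall(char_tokens(ocr), char_tokens(truth)))  # (0.9375, 0.9375)
```

A single wrong character costs a whole word at the word level but only one of sixteen characters at the character level, which is why the two views of the same output can differ noticeably.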

Contribution statement: (default) All team members approve our work presented in this GitHub repository including this contributions statement.

  • Shiqing Long: (major contributor) Assisted with feature scoring and model building, OCR performance evaluation, and presentation.
  • Yang Yue: OCR performance evaluation, README, normalization attempt.
  • Yiding Xie: (major contributor) Error detection, generating all files as a corpus, matching detected error words to their ground-truth words, part of the OCR performance evaluation, README.
  • Yingqiao Zhang: (major contributor) Feature scoring, candidate selection, modeling, model evaluation.

Following suggestions by RICH FITZJOHN (@richfitz), this folder is organized as follows.

proj/
├── lib/
├── data/
├── doc/
├── figs/
└── output/

Please see each subfolder for a README file.
