Overview of tooling for and issues with NLP on historical texts, dealing with OCR/HTR errors and spelling variation and change

HuygensING/NLP-for-Historical-Text

NLP Resources for dealing with historical texts

NLP-for-Historical-Text repository

Issues with historical documents

  • OCR/HTR errors
  • Layout and segmentation
  • Spelling variation, especially with text from before 1800 (problem increases with older documents)
  • Spelling and vocabulary change (old words disappear, new words appear, existing words shift meaning)

These issues affect search, string matching, syntactic parsing, and the term frequency distributions and word contexts that models such as word embeddings rely on.
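As a minimal illustration of the term frequency problem, the sketch below counts a handful of tokens before and after spelling normalisation. The tokens and the `normal_form` table are hypothetical and only serve to show the effect; they are not taken from a real corpus.

```python
from collections import Counter

# Hypothetical tokens with spelling variation (illustrative only).
tokens = ["schip", "scip", "schip", "schep", "scip", "haven", "haeven"]

# Exact matching treats every spelling as a separate term,
# so the frequency mass of one word is split over its variants.
raw_counts = Counter(tokens)

# Hypothetical variant-to-normal-form table (an assumption for this sketch).
normal_form = {"scip": "schip", "schep": "schip", "haeven": "haven"}

# Map each token to its normal form where one is known, then count again.
normalised_counts = Counter(normal_form.get(t, t) for t in tokens)

print(raw_counts["schip"])         # hits for an exact search on "schip"
print(normalised_counts["schip"])  # hits once the variants are merged
```

The same fragmentation applies to context-based models: each variant gets its own, sparser set of contexts unless the variants are merged first.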

Existing Technologies and Tools

Pre-OCR cleaning, layout detection and image processing

  • OpenCV, NumPy, SciPy
  • All together in the Fusus project (with Cornelis van Lit, Univ. Utrecht)
  • Overview of tech used: tech

OCR post-correction

Fuzzy Search and String Matching
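One common basis for fuzzy matching is edit distance. The sketch below is a plain-Python Levenshtein distance with a small vocabulary lookup, assuming a maximum distance of 1 is acceptable; the vocabulary, the threshold and the function names are illustrative assumptions, not part of any tool listed here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def fuzzy_find(query: str, vocabulary: set[str], max_distance: int = 1) -> list[str]:
    """Return vocabulary entries within max_distance edits of the query."""
    return sorted(w for w in vocabulary if levenshtein(query, w) <= max_distance)


# Hypothetical vocabulary with spelling variants (illustrative only).
vocabulary = {"schip", "scip", "schep", "schaap", "haven"}
matches = fuzzy_find("schip", vocabulary)
print(matches)
```

A scan over the whole vocabulary like this is fine for small word lists; for large corpora an index-based approach is usually needed.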

PoS, Lemmatisation and NER on historical Dutch

  • Nederlab Pipeline
    • uses Frog for PoS-tagging and lemmatisation of Middle Dutch and Early New Dutch (vroegnieuwnederlands). Contains models trained on the Brieven als Buit corpus (Early New Dutch) and on Corpus Gysseling and Corpus Van Reenen-Mulder (Middle Dutch).
    • FoLiA-wordtranslate (part of FoLiA-utils) - (Re)implements Erik Tjong Kim Sang's word-by-word modernisation method.
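The general idea of word-by-word modernisation can be sketched as a lexicon lookup per token. This is a simplified illustration only, not the FoLiA-wordtranslate implementation: the `LEXICON` entries, the tokenisation pattern and the `modernise` function are all hypothetical.

```python
import re

# Hypothetical historical-to-modern lexicon (illustrative entries only).
LEXICON = {
    "ende": "en",
    "scip": "schip",
    "coninck": "koning",
    "vanden": "van de",
}

# Simple tokeniser: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def modernise(text: str, lexicon: dict[str, str] = LEXICON) -> list[str]:
    """Replace each token by its modern form if the lexicon has one."""
    result = []
    for token in TOKEN.findall(text):
        modern = lexicon.get(token.lower(), token)
        # Carry an initial capital over to the modern form.
        if token[:1].isupper():
            modern = modern[:1].upper() + modern[1:]
        result.append(modern)
    return result


print(modernise("Ende het scip vanden coninck."))
```

Because each token is translated in isolation, tokens missing from the lexicon pass through unchanged, and ambiguous forms cannot be resolved by context.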

Parse tree querying and extraction

  • GrETEL: partial tree extraction
  • BlackLab: open source corpus search engine that supports Corpus Query Language
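To give an impression of Corpus Query Language, the sketch below builds a token-level query string in which spelling variants are combined into one regular expression. The helper `cql_token` is hypothetical, and the annotation names `word` and `pos` and the tag value `ADJ` are assumptions: the names and tagset actually available depend on how a corpus was indexed. The code only builds the string and does not call BlackLab.

```python
def cql_token(**attributes: str) -> str:
    """Build one CQL token pattern such as [word="schip"].

    Values are inserted as-is (they are regular expressions in CQL),
    so this sketch does no escaping of quotes or special characters.
    """
    if not attributes:
        return "[]"  # matches any token
    parts = [f'{name}="{value}"' for name, value in attributes.items()]
    return "[" + " & ".join(parts) + "]"


# Hypothetical spelling variants of one word (illustrative only).
variants = ["schip", "scip", "schep"]

# An adjective followed by any of the variants.
query = cql_token(pos="ADJ") + " " + cql_token(word="|".join(variants))
print(query)
```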

Meetings

Literature

  • Ehrmann, M., Hamdi, A., Pontes, E. L., Romanello, M., & Doucet, A. (2021). Named Entity Recognition and Classification on Historical Documents: A Survey. arXiv preprint arXiv:2109.11406.
  • Reynaert, M., Hendrickx, I., & Marquilhas, R. (2012). Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2. Proceedings of ACRH-2, 87-98.
  • Reynaert, M., Gompel, M. V., Sloot, K., & van den Bosch, A. P. J. (2015). PICCL: Philosophical Integrator of Computational and Corpus Libraries.
  • Sommerauer, P., & Fokkens, A. (2019, August). Conceptual change and distributional semantic models: An exploratory study on pitfalls and possibilities. In Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change (pp. 223-233).
  • Wevers, M., & Koolen, M. (2020). Digital begriffsgeschichte: Tracing semantic change using word embeddings. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 53(4), 226-243.
