NLP Resources for dealing with historical texts

Issues with historical documents

OCR/HTR errors
Layout and segmentation
Spelling variation, especially in texts from before 1800 (the problem increases with the age of the document)
Spelling and vocabulary change (old words disappear, new words appear, existing words shift meaning)

These issues affect search, string matching, syntactic parsing, and the term frequency distributions and contexts that methods such as word embeddings rely on.
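A minimal sketch of how spelling variation fragments term frequencies. The tokens and the variant table below are invented for illustration and are not drawn from any corpus.

```python
from collections import Counter

# Hypothetical tokens: several spellings of Dutch "schip" (ship).
tokens = ["schip", "schep", "scip", "schip", "zee", "zee"]

# Exact string matching treats every spelling as a separate term.
raw_counts = Counter(tokens)

# Hypothetical normalisation table mapping variants to one modern form.
variants = {"schep": "schip", "scip": "schip"}
normalised_counts = Counter(variants.get(token, token) for token in tokens)

print(raw_counts["schip"])         # count for one spelling only
print(normalised_counts["schip"])  # count after merging the variants
```

Without normalisation the evidence for one word is spread over several low-frequency terms, which hurts both search recall and any model built on co-occurrence counts.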

Existing Technologies and Tools

Pre-OCR cleaning, layout detection and image processing

OpenCV, NumPy, SciPy
These are combined in the Fusus project (with Cornelis van Lit, Utrecht University)
Overview of tech used: tech

OCR post-correction

Analiticcl by Maarten van Gompel:
- Code: https://github.com/proycon/analiticcl
- Presentation video: https://diode.zone/w/kkrqA4MocGwxyC3s68Zsq7
- Presentation slides: https://github.com/proycon/analiticcl/blob/master/docs/pres_20220120/analiticcl_presentation.pdf
Edit diff rather than edit distance (an immature idea from Dirk)
- see wp6-daghregisters
- see also sesdiff
PICCL and underlying TICCLtools, Martin Reynaert, Ko van der Sloot
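The "edit diff" idea above can be sketched as follows: instead of reducing two strings to a single distance, keep the edit operations themselves, so that recurring substitutions become visible. This sketch uses Python's standard `difflib`, whose matching is heuristic and not guaranteed to yield a shortest edit script; it stands in for a dedicated tool and is not how sesdiff or wp6-daghregisters work internally. The word pair is an invented example.

```python
from difflib import SequenceMatcher

def edit_script(source: str, target: str) -> list[tuple[str, str, str]]:
    """Return the non-matching operations that turn source into target."""
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    script = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # Keep the operation name and the affected substrings.
            script.append((op, source[i1:i2], target[j1:j2]))
    return script

print(edit_script("coninck", "koning"))
```

Collecting such scripts over many variant pairs and counting them would show which substitutions are systematic and which are one-off noise.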

Fuzzy Search and String Matching

Python libraries for fuzzy search and matching
- fuzzywuzzy
- fuzzysearch
- Fuzzy-search
- python-levenshtein
- Python regex module (supports approximate matching)
- python-string-similarity
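For reference, the edit distance that most of these libraries build on can be sketched in plain Python. This is a minimal, unoptimised version for illustration; the libraries listed above provide much faster implementations. The lexicon and the `fuzzy_lookup` helper are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def fuzzy_lookup(query: str, lexicon: list[str], max_distance: int = 2) -> list[str]:
    """Return lexicon entries within max_distance edits, closest first."""
    scored = sorted((levenshtein(query, word), word) for word in lexicon)
    return [word for distance, word in scored if distance <= max_distance]

lexicon = ["koning", "koningin", "kamer", "schip"]  # hypothetical word list
print(fuzzy_lookup("coninc", lexicon))
```

Scanning a whole lexicon like this is linear in its size, which is why large collections need index structures for fuzzy search.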
Indexing and Search
- Elasticsearch
  - edit distance search
- PostgreSQL
  - edit distance search
- Recommended blog article: Index 1,600,000,000 Keys with Automata and Rust
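As an illustration of edit distance search in Elasticsearch, the sketch below builds the body of a `match` query with the `fuzziness` parameter. The field name `text` and the search term are assumptions for the example; sending the body to a cluster is left out.

```python
import json

def fuzzy_match_query(field: str, term: str, fuzziness: str = "AUTO") -> dict:
    """Build an Elasticsearch match query body that tolerates small edits."""
    return {"query": {"match": {field: {"query": term, "fuzziness": fuzziness}}}}

# "text" is a hypothetical field name in a hypothetical index.
body = fuzzy_match_query("text", "coninck")
print(json.dumps(body, indent=2))
```

Elasticsearch caps fuzziness at an edit distance of 2, so strongly deviating spellings still need normalisation before indexing. In PostgreSQL, a comparable function is `levenshtein()` from the `fuzzystrmatch` extension.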

PoS, Lemmatisation and NER on historical Dutch

Nederlab Pipeline
- uses Frog for PoS tagging and lemmatisation of Middle Dutch and Early New Dutch (Vroegnieuwnederlands). Contains models trained on the Brieven als Buit corpus (Early New Dutch) and on Corpus Gysseling and Corpus Van Reenen-Mulder (Middle Dutch).
- FoLiA-wordtranslate (part of FoLiA-utils) - (Re)implements Erik Tjong Kim Sang's word-by-word modernisation method.
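The general idea of word-by-word modernisation can be sketched as a lexicon lookup per token. This is an illustrative simplification and not the implementation used by FoLiA-wordtranslate; the lexicon entries and the `modernise` helper are hypothetical.

```python
import re

# Hypothetical historical-to-modern lexicon; a real one would be far larger.
LEXICON = {"ende": "en", "coninck": "koning", "sijn": "zijn"}

def modernise(text: str, lexicon: dict[str, str]) -> str:
    """Replace each word by its modern form when the lexicon has one."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        modern = lexicon.get(word.lower())
        if modern is None:
            return word  # unknown words are left untouched
        # Carry over an initial capital from the source token.
        return modern[0].upper() + modern[1:] if word[0].isupper() else modern
    return re.sub(r"\w+", replace, text)

print(modernise("De coninck ende sijn schepen", LEXICON))
```

Because each word is handled in isolation, this approach cannot resolve variants whose modern form depends on context.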

Parse tree querying and extraction

GrETEL: partial tree extraction
BlackLab: open source corpus search engine that supports Corpus Query Language
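A small sketch of composing Corpus Query Language patterns as strings. The annotation names `lemma` and `pos` and the tag value `ADJ` are assumptions; the names actually available depend on how a corpus was indexed. The helpers are hypothetical and do not escape quotes in values.

```python
def cql_token(**constraints: str) -> str:
    """Build one CQL token pattern, e.g. [lemma="koning" & pos="N.*"]."""
    if not constraints:
        return "[]"  # an empty pattern matches any token
    parts = [f'{name}="{value}"' for name, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"

def cql_sequence(*tokens: str) -> str:
    """Join token patterns into a sequence of adjacent tokens."""
    return " ".join(tokens)

# An adjective directly followed by a form of "koning" (assumed tag set).
query = cql_sequence(cql_token(pos="ADJ"), cql_token(lemma="koning"))
print(query)
```

Querying on lemma rather than word form is what makes such searches robust to spelling variation, provided the lemmatiser handled the historical forms.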

Meetings

22 March 2022

Literature

Ehrmann, M., Hamdi, A., Pontes, E. L., Romanello, M., & Doucet, A. (2021). Named Entity Recognition and Classification on Historical Documents: A Survey. arXiv preprint arXiv:2109.11406.
Reynaert, M., Hendrickx, I., & Marquilhas, R. (2012). Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2. Proceedings of ACRH-2, 87-98.
Reynaert, M., van Gompel, M., van der Sloot, K., & van den Bosch, A. P. J. (2015). PICCL: Philosophical Integrator of Computational and Corpus Libraries.
Sommerauer, P., & Fokkens, A. (2019, August). Conceptual change and distributional semantic models: An exploratory study on pitfalls and possibilities. In Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change (pp. 223-233).
Wevers, M., & Koolen, M. (2020). Digital begriffsgeschichte: Tracing semantic change using word embeddings. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 53(4), 226-243.
