NLP-for-Historical-Text repository
- OCR/HTR errors
- Layout and segmentation
- Spelling variation, especially in texts from before 1800 (the problem grows with the age of the document)
- Spelling and vocabulary change (old words disappear, new words appear, existing words shift meaning)
These issues affect search, string matching, syntactic parsing, and the term frequency distributions and contexts that are used in, e.g., embeddings.
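To make the effect on matching and term frequencies concrete, here is a minimal sketch in plain Python. The token list is invented for illustration (it is not taken from any corpus), and `levenshtein` is a textbook edit distance, not the implementation of any tool listed in this repository.

```python
from collections import Counter

# Invented tokens: spelling variants of one word, as they might appear in a text.
tokens = ["schip", "schep", "scip", "schip", "schepe", "schip"]


def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Exact matching only finds the tokens spelled exactly like the query.
exact_hits = [t for t in tokens if t == "schip"]

# Allowing one edit also retrieves the close variants.
fuzzy_hits = [t for t in tokens if levenshtein(t, "schip") <= 1]

# The counts for one word are split over several surface forms.
frequencies = Counter(tokens)
```

With these tokens, exact matching retrieves 3 of the 6 occurrences, a distance threshold of 1 retrieves 5, and the frequency table holds 4 separate entries for what a reader would treat as a single word.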
- OpenCV, NumPy, SciPy
- All together in the Fusus project (with Cornelis van Lit, Univ. Utrecht)
- Overview of tech used: tech
- Analiticcl by Maarten van Gompel:
- Code: https://github.com/proycon/analiticcl
- Presentation video: https://diode.zone/w/kkrqA4MocGwxyC3s68Zsq7
- Presentation slides: https://github.com/proycon/analiticcl/blob/master/docs/pres_20220120/analiticcl_presentation.pdf
- Edit diff rather than edit distance (immature idea by Dirk)
- see wp6-daghregisters
- see also sesdiff
- PICCL and underlying TICCLtools, Martin Reynaert, Ko van der Sloot
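The "edit diff rather than edit distance" idea above can be illustrated with a small sketch: instead of reducing two spellings to a single number, keep the actual character substitutions that separate them. This uses `difflib` from the Python standard library as a stand-in; it is an assumption for illustration and not how wp6-daghregisters or sesdiff work. The word pair and the function name `edit_diff` are hypothetical.

```python
from difflib import SequenceMatcher


def edit_diff(source: str, target: str) -> list[tuple[str, str, str]]:
    """Return the non-matching operations that turn source into target.

    Each item is (operation, source_fragment, target_fragment).
    """
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    return [
        (tag, source[i1:i2], target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented example pair: a historical-looking spelling and a modern one.
diff = edit_diff("coninck", "koning")
```

Here `diff` records that `c` was replaced by `k` and `ck` by `g`. Patterns like these can be counted across many word pairs, which a bare distance of 3 would not allow.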
- Python libraries for fuzzy search and matching
- fuzzywuzzy
- fuzzysearch
- Fuzzy-search
- python-levenshtein
- Python regex module
- python-string-similarity
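As a dependency-free illustration of what the fuzzy matching libraries above are used for, the sketch below looks up a variant spelling in a small word list with `difflib.get_close_matches` from the standard library. This is a stand-in, not the API of any of the listed packages, and the lexicon and query are invented for the example.

```python
from difflib import get_close_matches

# Invented modern word list; in practice this would be a full lexicon.
lexicon = ["vrede", "wreed", "vriend", "veder"]

# Invented variant spelling to look up.
query = "vreede"

# cutoff is a similarity ratio between 0 and 1; higher values are stricter.
matches = get_close_matches(query, lexicon, n=3, cutoff=0.8)
```

With this cutoff only `vrede` is returned. Lowering `cutoff` admits more candidates, trading precision for recall, which is the main tuning decision with any of these libraries.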
- Indexing and Search
- Elasticsearch
- edit distance search
- PostgreSQL
- edit distance search
- Recommended blog article: Index 1,600,000,000 Keys with Automata and Rust
- Elasticsearch
- Nederlab Pipeline
- Uses Frog for PoS-tagging and lemmatisation of Middle Dutch and Early New Dutch (vroegnieuwnederlands). Contains models trained on the Brieven als Buit corpus (Early New Dutch) and on Corpus Gysseling and Corpus Van Reenen-Mulder (Middle Dutch).
- FoLiA-wordtranslate (part of FoLiA-utils): (re)implements Erik Tjong Kim Sang's word-by-word modernisation method.
- GrETEL: partial tree extraction
- BlackLab: open source corpus search engine that supports Corpus Query Language
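The word-by-word modernisation mentioned for FoLiA-wordtranslate above can be pictured, in its simplest form, as a lexicon lookup per token. The sketch below shows only that core idea under stated assumptions: the lexicon entries, the function name `modernise`, and the handling of capitalisation are all hypothetical, and this is not the FoLiA-wordtranslate implementation, which operates on FoLiA documents.

```python
import re

# Hypothetical historical-to-modern lexicon, invented for this example.
LEXICON = {
    "ende": "en",
    "dese": "deze",
    "scepen": "schepen",
}


def modernise(text: str, lexicon: dict[str, str]) -> str:
    """Replace each known word by its modern form; leave unknown words as-is."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        modern = lexicon.get(word.lower(), word)
        # Carry over an initial capital from the original token.
        if word[0].isupper():
            modern = modern[0].upper() + modern[1:]
        return modern

    return re.sub(r"\w+", replace, text)


result = modernise("Ende dese scepen voeren uit.", LEXICON)
```

Because each token is handled in isolation, the approach is simple and keeps the token alignment with the original text, but it cannot resolve words whose modern form depends on context.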
- Ehrmann, M., Hamdi, A., Pontes, E. L., Romanello, M., & Doucet, A. (2021). Named Entity Recognition and Classification on Historical Documents: A Survey. arXiv preprint arXiv:2109.11406. PDF
- Reynaert, M., Hendrickx, I., & Marquilhas, R. (2012). Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2. Proceedings of ACRH-2, 87-98. PDF
- Reynaert, M., Gompel, M. V., Sloot, K., & van den Bosch, A. P. J. (2015). PICCL: Philosophical Integrator of Computational and Corpus Libraries. PDF
- Sommerauer, P., & Fokkens, A. (2019, August). Conceptual change and distributional semantic models: An exploratory study on pitfalls and possibilities. In Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change (pp. 223-233). PDF
- Wevers, M., & Koolen, M. (2020). Digital begriffsgeschichte: Tracing semantic change using word embeddings. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 53(4), 226-243. PDF