Fusus

This is a workflow that transforms scanned pages into readable text.

The pages come from several printed Arabic books from the past few centuries.

The workflow takes care of cleaning, OCR and proofing. It converts the results to tab-separated files and from there to Text-Fabric, where the text material can be processed further.

Features

cleaning is included: specks and symbols that should be removed can be specified by copying and pasting such fragments and storing them in a designated directory;
column layout and line boundaries are detected prior to OCRing;
individual lines are passed to the OCR engine, Kraken, which uses a model trained on many printed Arabic books (see model);
the results are stored in tab-separated files, retaining bounding boxes and confidences;
proofing pages can be generated for manually checking the OCR results;
the OCR results of each book are composed into Text-Fabric datasets.
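
As a rough illustration of the tab-separated output mentioned above, the sketch below parses a few word rows and picks out low-confidence words as candidates for proofing. The column names (`page`, `line`, `left`, `top`, `right`, `bottom`, `conf`, `text`), the sample rows and the threshold are assumptions made for this example; the files written by the workflow may be laid out differently.

```python
import csv
import io

# Hypothetical sample data: the real files may use other columns and order.
SAMPLE_TSV = (
    "page\tline\tleft\ttop\tright\tbottom\tconf\ttext\n"
    "1\t1\t410\t120\t470\t160\t97\tفصوص\n"
    "1\t1\t330\t120\t400\t160\t62\tالحكم\n"
    "1\t2\t420\t170\t470\t210\t88\tفص\n"
)

NUMERIC_COLUMNS = ("page", "line", "left", "top", "right", "bottom", "conf")


def read_words(tsv_text):
    """Parse word rows from TSV text, converting numeric columns to int."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    words = []
    for row in reader:
        for key in NUMERIC_COLUMNS:
            row[key] = int(row[key])
        words.append(row)
    return words


def low_confidence(words, threshold=80):
    """Return the words whose OCR confidence is below the threshold."""
    return [w for w in words if w["conf"] < threshold]


words = read_words(SAMPLE_TSV)
suspects = low_confidence(words)
for w in suspects:
    # Page, line and bounding box tell a proofreader where to look.
    print(w["page"], w["line"], (w["left"], w["top"], w["right"], w["bottom"]), w["text"])
```

Keeping the bounding box next to each word is what makes it possible to point a proofreader at the exact spot on the scanned page.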

This lays the foundations for:

correcting OCR mistakes;
enriching the text with morphological/linguistic annotations, named entities;
performing intertextuality research between the ground work (the "Fusus" by Ibn Arabi) and its commentary books.

Extensive cleaning has been carried out on two editions of the Fusus: Lakhnawi and Afifi. These editions have then been aligned and brought together in a single dataset, from which the individual editions can still be read back.
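
To give an idea of what a combined dataset that can be "read back" looks like, here is a minimal sketch that aligns two word lists and recovers each list from the result. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; this is not the alignment method used in the project, and the placeholder tokens and function names are hypothetical.

```python
from difflib import SequenceMatcher


def align(words_a, words_b):
    """Align two word lists into pairs; None marks a gap on one side."""
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left = words_a[i1:i2]
        right = words_b[j1:j2]
        # Pad the shorter side so that every row has two slots.
        size = max(len(left), len(right))
        left += [None] * (size - len(left))
        right += [None] * (size - len(right))
        rows.extend(zip(left, right))
    return rows


def read_back(rows, side):
    """Recover one edition (side 0 or 1) from the combined rows."""
    return [pair[side] for pair in rows if pair[side] is not None]


# Placeholder tokens standing in for the words of the two editions.
edition_lk = ["a", "b", "c", "d"]
edition_af = ["a", "x", "c", "d", "e"]

rows = align(edition_lk, edition_af)
for pair in rows:
    print(pair)
```

Because gaps are stored explicitly, dropping the `None` entries on either side gives back that edition unchanged.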

Text-Fabric interface

Get started with the tutorial.

We have also generated a static search interface.

Just click fusus-search and off you go.

You can do full-text search via regular expressions, not only in the full text, but also in attributes of the text, notably the bounding box information of each word.
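
The idea of searching by regular expression in both the text and its attributes can be sketched in a few lines of Python. This is only an illustration of the principle, not the code behind fusus-search; the word records, the attribute names (`text`, `top`, `left`) and the `search` function are assumptions made for the example.

```python
import re

# Hypothetical word records with a few bounding box attributes.
WORDS = [
    {"text": "فصوص", "top": 120, "left": 410},
    {"text": "الحكم", "top": 120, "left": 330},
    {"text": "الكلمة", "top": 310, "left": 400},
]


def search(words, patterns):
    """Return the words for which every attribute pattern matches.

    Attribute values are converted to strings before matching, so
    numeric attributes can be searched with regular expressions too.
    """
    compiled = {attr: re.compile(pattern) for attr, pattern in patterns.items()}
    return [
        word
        for word in words
        if all(rx.search(str(word[attr])) for attr, rx in compiled.items())
    ]


# Words starting with the article "ال" whose top coordinate is in 100..199.
hits = search(WORDS, {"text": r"^ال", "top": r"^1\d\d$"})
for word in hits:
    print(word["text"], word["top"], word["left"])
```

Treating the coordinates as strings keeps a single query mechanism for all attributes, at the cost of expressing numeric ranges as patterns.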

Authors

Project

Fusus has been funded by the IT Research Innovation Fund.

It was developed between 2020-03-01 and 2021-03-01.

Correction, enrichment and alignment of the two Fusus editions were carried out from the end of the project until the end of 2021.

Docs

There is more documentation about sources, the research project, and how to use this software in the docs.
