Commit
adding volume 14 completed
dirkroorda committed May 4, 2022
1 parent bdf8523 commit f6e4276
Showing 694 changed files with 37,009,574 additions and 61,841 deletions.
83 changes: 8 additions & 75 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,9 @@

# Status

* 2022-05-04 version 1.0: Additional volumes: Volume 14, bands (i) and (ii) have been added.
The earlier corrections by Sophie have not been reapplied, but the conversion has been improved
so that they are not needed anymore.
* 2022-04-11 Additional volumes: Volume 14, bands (i) and (ii) are in the process of being converted from
textual pdf to Text-Fabric. Most structure has been recognized, but no TF has been generated yet.
* 2021-07-22 Additional data corrections: letters that have no page break elements were
Expand Down Expand Up @@ -75,33 +78,9 @@

# Corpus

This repo contains a structurally clean version of the data of the *General Missives*.
This repo contains a structurally clean version of the data of the *General Missives*, volumes 1-14.

The *Generale Missiven* is a collection of letters from governors of the
VOC (Dutch East India Company) to
the *Heren 17*, the council of the governors of the 17 provinces of the Netherlands,
which was the effective
government of the Low Countries at the time of the 17th and 18th century.

The letters comprise 13 volumes and date from 1610 to 1761.

The Huygens-ING institute publishes this material:
[General Missives Website](http://resources.huygens.knaw.nl/retroboeken/generalemissiven/#page=0&accessor=toc&view=homePane),
see also
[General Missives Project](http://resources.huygens.knaw.nl/vocgeneralemissiven);
both websites are in Dutch.

The CLARIAH project uses a TEI version of this corpus in its Work Package 6 which deals with
new infrastructure for academic text processing:
[WP6-Text](https://www.clariah.nl/en/work-packages/focus-areas/text?layout=blog).
That work is carried out by

* [Lodewijk Petram](https://www.lodewijkpetram.nl) (HuygensING)
* [Jesse de Does](https://www.researchgate.net/profile/Jesse_De_Does) (INT)
* [Sophie Arnoult](http://www.illc.uva.nl/People/person/3601/Ir-Sophie-Arnoult)

This repo does not publish the source/intermediate data as developed in CLARIAH-WP6;
they will publish their materials in due course.
Read more in [about](docs/about.md).

# Rationale for this representation of the corpus

Expand Down Expand Up @@ -137,7 +116,7 @@ and is particularly suited to Jupyter notebooks and lab.
We have generated a search interface for the missieven from the Text-Fabric data.

Just click
[missieven-search](https://CLARIAH.github.io/wp6-gm-search/)
[missieven-search](https://CLARIAH.github.io/wp6-missieven-search/)
and off you go.

It is experimental.
Expand All @@ -163,52 +142,6 @@ You can export search results as tab-separated files.

More info in the [manual](https://annotation.github.io/text-fabric/tf/about/clientmanual.html).

# Conversion steps

The CLARIAH WP6 people kindly provided me with a TEI version of the corpus.

However, this TEI version contains many inaccuracies.
There are many instances of miscategorized material:
page headers and footers end up in body text and vice versa;
editorial notes and footnotes are not always properly detected; dozens of letters have not been separated;
metadata is often incorrect.

In order to produce a quality dataset, I needed to do something about it: checks and corrections.

1. all metadata has been freshly distilled from the letter headings, and in case of doubt the
online images of the missives have been inspected.
2. all footnote marks are linked to all footnote bodies.
It is still possible that there are missed footnotes and missed footnote marks,
but chances are slim because footnote marks and footnote bodies are detected
independently.

Yet OCR errors within words and numbers are mostly untouched.
The main concern was to get a correct separation between the kinds of text:

* original letter
* editorial text
* footnotes
* page headers and footers

[trimTei.py](https://github.com/CLARIAH/wp6-missieven/blob/master/programs/trimTei.py)
consists of a battery of 4 conversions to clean the incoming TEI,
leaving out all bits that do not end up in the final dataset,
and reorganizing some material to facilitate the conversion to TF.

The first result of the laundry is a set of XML files, which contain a clean, simplified TEI-like
encoding of the material, with all non-essential parts stripped, such as page headers and footers,
title pages, etc.
There is also an exact correspondence between files and letters.

Then I used the
[walker module from TF](https://annotation.github.io/text-fabric/tf/convert/walker.html)
to turn the simple XML into Text-Fabric.
See
[tfFromTrim.py](https://github.com/CLARIAH/wp6-missieven/blob/master/programs/tfFromTrim.py).

For details about the features of the end result, see
[transcription](docs/transcription.md)

# Using this corpus data

At the moment the data delivered is available
Expand All @@ -221,13 +154,13 @@ a command:

* have Python installed (at least 3.6)
* `pip3 install text-fabric`
* `text-fabric missieven:latest --checkout=latest`
* `text-fabric clariah/wp6-missieven`

This will download the corpus and fire up a local web server and your web browser pointing to an interface
for this corpus.

Another version of the data (less cleaned) is visible online in a
[Blacklab interface ](http://corpora.ato.ivdnt.org/corpus-frontend/Missiven/search/)
[Blacklab interface ](http://corpora.ato.ivdnt.org/corpus-frontend/Missiven/search)

The next step is to make the data of this repository available in a Blacklab interface.
In this repo we show how to set up a local Blacklab server and front-end and how to get the
Expand Down
4 changes: 3 additions & 1 deletion app/app.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,9 @@
from tf.advanced.app import App


MODIFIERS = "remark folio note ref emph und super special".strip().split()
MODIFIERS = """
remark folio note ref emph und super special q num den
""".strip().split()


def fmt_layoutFull(app, n, **kwargs):
Expand Down
5 changes: 3 additions & 2 deletions app/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -16,16 +16,17 @@ dataDisplay:
layout-orig:
method: layoutOrig
docs:
docPage: transcription
docPage: about
featureBase: 'https://github.com/{org}/{repo}/blob/master/docs/transcription{docExt}'
featurePage: ''
interfaceDefaults: {}
provenanceSpec:
corpus: 'General Missives Dutch East India Company 1600-1800'
doi: 10.5281/zenodo.4011801
version: 0.9.1
version: 1.0
webBase: http://resources.huygens.knaw.nl/retroboeken/generalemissiven
webHint: Show this document on Huygens
webFeature: weblink
webUrl: '{webBase}/#page=<2>&source=<1>'
webOffset:
2:
Expand Down
15 changes: 15 additions & 0 deletions app/static/display.css
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,21 @@
.und {
text-decoration: underline;
}
.q {
color: #777777;
font-weight: bold;
}
.num {
font-size: small;
vertical-align: super;
}
.den {
font-size: small;
vertical-align: sub;
}
.sub {
vertical-align: sub;
}
.super {
vertical-align: super;
}
Expand Down
2 changes: 1 addition & 1 deletion change/add/09/p0233-0235.xml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
<pb n="235" tpl="2" vol="09" facs="205_235"/>
<pb n="235" vol="09" />
eenige maanden voor zijn vaders dood verlaten, en dus ook tegelijk verijdelt was de<lb/>
hoope van dat geraffineerde en staatkundige hoff om op dese en diergelijke wij se eens<lb/>
een voet in het Bimase rijk te verkrijgen.<lb/>
Expand Down
2 changes: 1 addition & 1 deletion change/add/09/p0628-0651.xml
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
<pb n="651" tpl="2" vol="09" facs="205_651"/>
<pb n="651" vol="09" />
insert OCR of missing page 651 here
<p>
2 changes: 1 addition & 1 deletion change/add/13/p0014-0065.xml
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
<pb n="65" tpl="2" vol="13" facs="205_065"/>
<pb n="65" vol="13" />
insert OCR of missing page 65 here
2 changes: 1 addition & 1 deletion change/doc/01/p0003a.xml
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
</header>

<body>
<pb n="3" tpl="1" vol="01" facs="1_104_0003"/>
<pb n="3" vol="01" />
<head rend="font-size: 10; line-height: 11.1pt">I. PIETER BOTH, AAN BOORD VAN HET WAPEN VAN AMSTERDAM,<lb/>
VOOR ILE DE MAYO 1) 25 februari 1610.<lb/>
</head>
Expand Down
2 changes: 1 addition & 1 deletion change/doc/01/p0016a.xml
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
</header>

<body>
<pb n="16" tpl="1" vol="01" facs="1_104_0016"/>
<pb n="16" vol="01" />
<head rend="font-size: 10; line-height: 11.1pt">VII. PIETER BOTH, FORT MAURITIUS NABIJ NGOFAKIAHA OP HET<lb/>
EILAND MAKÉAN 26 juli 1612.<lb/>
</head>
Expand Down
2 changes: 1 addition & 1 deletion change/doc/01/p0020a.xml
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
</header>

<body>
<pb n="20" tpl="1" vol="01" facs="1_104_0020"/>
<pb n="20" vol="01" />
<head rend="font-size: 10; line-height: 11.1pt">XII. PIETER BOTH, FORT MAURITIUS NABIJ NGOFAKIAHA OP HET<lb/>
EILAND MAKÉAN 26 juli 1612.<lb/>
</head>
Expand Down
2 changes: 1 addition & 1 deletion change/doc/01/p0027a.xml
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
</header>

<body>
<pb n="27" tpl="1" vol="01" facs="1_104_0027"/>
<pb n="27" vol="01" />
<head rend="font-size: 10; line-height: 11.1pt">XVI. PIETER BOTH, BANTAM 1 januari 1614.<lb/>
</head>

Expand Down
2 changes: 1 addition & 1 deletion change/doc/01/p0097a.xml
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
</header>

<body>
<pb n="97" tpl="1" vol="01" facs="1_104_0097"/>
<pb n="97" vol="01" />
<head rend="font-size: 10; line-height: 11.1pt">I. JAN PIETERSZ. COEN, KASTEEL JAKATRA 5 augustus 1619.<lb/>
</head>

Expand Down
2 changes: 1 addition & 1 deletion change/doc/01/p0098a.xml
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
</header>

<body>
<pb n="98" tpl="1" vol="01" facs="1_104_0098"/>
<pb n="98" vol="01" />
<head rend="font-size: 10; line-height: 11.1pt">VIII. JAN PIETERSZ. COEN, PIETER DE CARPENTIER, JACOB DEDEL<lb/>
EN DR. MARTINUS SONCK, KASTEEL JAKATRA 31 juli 1620.<lb/>
</head>
Expand Down
2 changes: 1 addition & 1 deletion change/doc/01/p0118a.xml
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
</header>

<body>
<pb n="118" tpl="1" vol="01" facs="1_104_0118"/>
<pb n="118" vol="01" />
<head rend="font-size: 10; line-height: 11.1pt">XIV. JAN PIETERSZ. COEN, PIETER DE CARPENTIER, JAN DIRCKSZ.<lb/>
LAM EN WILLEM VAN ANTZEN, BATAVIA 1) 16 november 1621.<lb/>
</head>
Expand Down
2 changes: 1 addition & 1 deletion change/doc/01/p0244a.xml
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
</header>

<body>
<pb n="244" tpl="1" vol="01" facs="1_104_0244"/>
<pb n="244" vol="01" />
<head rend="font-size: 10; line-height: 11.1pt">I. JAN PIETERSZ. COEN, PIETER DE CARPENTIER, JACQUES SPECX,<lb/>
PIETER VAN DUYNEN, ANTONIO VAN DIEMEN EN JEREMIAS DE<lb/>
MEESTER, Batavia 9 november 1627.<lb/>
Expand Down
2 changes: 1 addition & 1 deletion change/replace/09/p0628-0648.xml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
<pb n="648" tpl="2" vol="09" facs="205_648"/>
<pb n="648" vol="09" />
<p>Op soldijen voor het affgegevene aan Compagnies loontreckende dienaaren op
reeckening van verdiende of nog te verdiene maandgelden:<lb/>
</p>
Expand Down
2 changes: 1 addition & 1 deletion change/replace/09/p0628-0649.xml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
<pb n="649" tpl="2" vol="09" facs="205_649"/>
<pb n="649" vol="09" />
<p>Waartegens de zuyvere inkomsten, en andere gevallen voordeden,
ƒ 1.704.911,6,15 bedragen, soo als hier onder werd<lb/>
aangewesen: als</p>
Expand Down
2 changes: 1 addition & 1 deletion change/replace/09/p0628-0650.xml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
<pb n="650" tpl="2" vol="09" facs="205_650"/>
<pb n="650" vol="09" />
<table>
<row>
<cell>aan veelerhande equipagie, arthillerye en wapenkamers, goederen, materialen,
Expand Down
2 changes: 1 addition & 1 deletion change/replace/09/p0628-0697.xml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
<pb n="697" tpl="2" vol="09" facs="205_697"/>
<pb n="697" vol="09" />
<p>Op soldijen, voor 't afgegevene aan 's Compagnies loontreckende dienaren op reeck.</p>
<table>
<row>
Expand Down
2 changes: 1 addition & 1 deletion change/replace/09/p0628-0698.xml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
<pb n="698" tpl="2" vol="09" facs="205_698"/>
<pb n="698" vol="09" />
<table>
<row>
<cell>Timmeragiën eenelijk voor een quart der ongelden van 's Compagnies slaven, sijnde ten
Expand Down
2 changes: 1 addition & 1 deletion change/replace/09/p0628-0699.xml
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
<pb n="699" tpl="2" vol="09" facs="205_699"/>
<pb n="699" vol="09" />
*** replace broken OCR of landscape table by manual encoding ***
85 changes: 85 additions & 0 deletions docs/about.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
This repo contains a structurally clean version of the data of the *General Missives*, volumes 1-14.

The *Generale Missiven* is a collection of letters from governors of the
VOC (Dutch East India Company) to
the *Heren 17*, the council of the governors of the 17 provinces of the Netherlands,
which was the effective
government of the Low Countries at the time of the 17th and 18th century.

The letters comprise 14 volumes and date from 1610 to 1767.

The Huygens-ING institute publishes this material:
[General Missives Website](http://resources.huygens.knaw.nl/retroboeken/generalemissiven/#page=0&accessor=toc&view=homePane),
see also
[General Missives Project](http://resources.huygens.knaw.nl/vocgeneralemissiven);
both websites are in Dutch.

The CLARIAH project uses a TEI version of the volumes 1-13 of this corpus in its Work Package 6 which deals with
new infrastructure for academic text processing:
[WP6-Text](https://www.clariah.nl/en/work-packages/focus-areas/text?layout=blog).
That work is carried out by

* [Lodewijk Petram](https://www.lodewijkpetram.nl) (HuygensING)
* [Jesse de Does](https://www.researchgate.net/profile/Jesse_De_Does) (INT)
* [Sophie Arnoult](http://www.illc.uva.nl/People/person/3601/Ir-Sophie-Arnoult)

This repo does not publish the source/intermediate data as developed in CLARIAH-WP6;
they will publish their materials in due course.

# Conversion steps

The CLARIAH WP6 people kindly provided me with a TEI version of the volumes 1-13 of the corpus
and a textual pdf of bands i and ii of volume 14.

However, the TEI version contains many inaccuracies.
There are many instances of miscategorized material:
page headers and footers end up in body text and vice versa;
editorial notes and footnotes are not always properly detected; dozens of letters have not been separated;
metadata is often incorrect.

In order to produce a quality dataset, I needed to do something about it: checks and corrections.

1. all metadata has been freshly distilled from the letter headings, and in case of doubt the
online images of the missives have been inspected.
2. all footnote marks are linked to all footnote bodies.
It is still possible that there are missed footnotes and missed footnote marks,
but chances are slim because footnote marks and footnote bodies are detected
independently.
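
The idea of detecting marks and bodies independently and then comparing them can be sketched as follows.
This is a minimal illustration, not the code in trimTei.py: the mark format `1)` is taken from the letter headings
in this corpus, but the assumption that a footnote body starts a line with the same `1)`, the function name
`check_footnotes`, and the sample strings are all hypothetical.

```python
import re

# A footnote mark inside running text, e.g. "ILE DE MAYO 1) 25 februari 1610".
MARK_RE = re.compile(r"(?<=\S) (\d+)\)")
# Assumption: a footnote body starts a line with its number, e.g. "1) ...".
BODY_RE = re.compile(r"^(\d+)\) ", re.M)


def check_footnotes(text, notes):
    """Return (marks without a body, bodies without a mark).

    Marks and bodies are collected by two unrelated patterns,
    so a miss in one of them shows up as a mismatch.
    """
    marks = set(MARK_RE.findall(text))
    bodies = set(BODY_RE.findall(notes))
    return (
        sorted(marks - bodies, key=int),
        sorted(bodies - marks, key=int),
    )


# Made-up sample: mark 2 has no body, body 3 has no mark.
text = "VOOR ILE DE MAYO 1) 25 februari 1610 en verder 2) nog wat tekst"
notes = "1) first note\n3) third note"
print(check_footnotes(text, notes))
```

Both lists being empty is the situation the conversion aims for; any entry points at a spot worth checking by hand.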

Yet OCR errors within words and numbers are mostly untouched.
The main concern was to get a correct separation between the kinds of text:

* original letter
* editorial text
* footnotes
* page headers and footers

[trimTei.py](https://github.com/CLARIAH/wp6-missieven/blob/master/programs/trimTei.py)
consists of a battery of 4 conversions to clean the incoming TEI,
leaving out all bits that do not end up in the final dataset,
and reorganizing some material to facilitate the conversion to TF.

The first result of the laundry is a set of XML files, which contain a clean, simplified TEI-like
encoding of the material, with all non-essential parts stripped, such as page headers and footers,
title pages, etc.
There is also an exact correspondence between files and letters.
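
As a small illustration of this kind of simplification: in the change files of this commit the page-break elements
lose their `tpl` and `facs` attributes and keep only `n` and `vol`.
The sketch below reproduces that one rewrite with regular expressions.
It is not taken from the conversion scripts; the names `KEEP` and `simplify_pb` are made up for the example.

```python
import re

# Attributes kept on <pb>; everything else (e.g. tpl, facs) is dropped.
KEEP = ("n", "vol")

PB_RE = re.compile(r"<pb\b([^>]*?)/>")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def simplify_pb(xml):
    """Rewrite every <pb .../> so that only the attributes in KEEP remain."""

    def repl(match):
        atts = dict(ATTR_RE.findall(match.group(1)))
        kept = " ".join(f'{k}="{atts[k]}"' for k in KEEP if k in atts)
        return f"<pb {kept} />"

    return PB_RE.sub(repl, xml)


print(simplify_pb('<pb n="235" tpl="2" vol="09" facs="205_235"/>'))
```

Running the function on an already simplified element leaves it unchanged, so it can be applied repeatedly without harm.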

Concerning volume 14: the pdf is the result of a much better OCR process.
We have used all typographical clues given in the pdf to infer structure.
There was quite a bit of nitty-gritty involved in this, especially for detecting table layout.
I do not claim that all table structure has been perfectly detected.
Another issue was to detect numbers and fractions properly.
Eventually, the script
[trimPdf.py](https://github.com/CLARIAH/wp6-missieven/blob/master/programs/trimPdf.py)
transforms the pdf in a sequence of stages to the same kind of XML as the trimTei.py script produced
for the volumes 1-13.
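
To give an impression of what detecting fractions means, here is a deliberately naive sketch.
It only recognizes the plain-text form `1/2` and wraps the parts in `num` and `den` elements;
those names are borrowed from the modifiers listed in app/app.py, but whether the intermediate XML encodes
fractions this way is an assumption, and trimPdf.py relies on typographical clues in the pdf rather than on a
pattern like this one.

```python
import re

# Naive pattern: digits, a slash, digits. It would also hit things like "25/2" in a date.
FRACTION_RE = re.compile(r"\b(\d+)/(\d+)\b")


def mark_fractions(text):
    """Wrap numerator and denominator of each simple fraction in num/den elements."""
    return FRACTION_RE.sub(r"<num>\1</num>/<den>\2</den>", text)


print(mark_fractions("3 1/2 pond"))
```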

Then I used the
[walker module from TF](https://annotation.github.io/text-fabric/tf/convert/walker.html)
to turn the simple XML of all 14 volumes into Text-Fabric.
See
[tfFromTrim.py](https://github.com/CLARIAH/wp6-missieven/blob/master/programs/tfFromTrim.py).

For details about the features of the end result, see
[transcription](docs/transcription.md)
