adding volume 14 completed

CLARIAH · May 4, 2022 · f6e4276
1 parent bdf8523
commit f6e4276
Showing 694 changed files with 37,009,574 additions and 61,841 deletions.
diff --git a/README.md b/README.md
@@ -14,6 +14,9 @@
 
 # Status
 
+*   2022-05-04 version 1.0: Additional volumes: Volume 14, bands (i) and (ii) have been added.
+    The earlier corrections by Sophie have not been reapplied, but the conversion has been improved 
+    so that they are not needed anymore.
 *   2022-04-11 Additional volumes: Volume 14, bands (i) and (ii) are in the process of being converted from
     textual pdf to Text-Fabric. Most structure has been recognized, but no TF has been generated yet.
 *   2021-07-22 Additional data corrections: letters that have no page break elements were
@@ -75,33 +78,9 @@
 
 # Corpus
 
-This repo contains a structurally clean version of the data of the *General Missives*.
+This repo contains a structurally clean version of the data of the *General Missives*, volumes 1-14.
 
-The *Generale Missiven* is a collection of letters from governors of the
-VOC (Dutch East Indian Company) to
-the *Heren 17*, the council of the governors of the 17 provinces of the Netherlands,
-which was the effective
-government of the Low Countries at the time of the 17th and 18th century.
-
-The letters comprise 13 volumes and date from 1610 to 1761.
-
-The Huygens-ING institute publishes this material:
-[General Missives Website](http://resources.huygens.knaw.nl/retroboeken/generalemissiven/#page=0&accessor=toc&view=homePane),
-see also
-[General Missives Project](http://resources.huygens.knaw.nl/vocgeneralemissiven);
-both websites are in Dutch.
-
-The CLARIAH project uses a TEI version of this corpus in its Work Package 6 which deals with
-new infrastructure for academic text processing:
-[WP6-Text](https://www.clariah.nl/en/work-packages/focus-areas/text?layout=blog).
-That work is conducted and carried out by
-
-* [Lodewijk Petram](https://www.lodewijkpetram.nl) (HuygensING)
-* [Jesse de Does](https://www.researchgate.net/profile/Jesse_De_Does) (INT)
-* [Sophie Arnoult](http://www.illc.uva.nl/People/person/3601/Ir-Sophie-Arnoult)
-
-This repo does not publish the source/intermediate data as developed in CLARIAH-WP6;
-they will publish their materials in due course.
+Read more in [about](docs/about.md).
 
 # Rationale for this representation of the corpus
 
@@ -137,7 +116,7 @@ and is particularly suited to Jupyter notebooks and lab.
 We have generated a search interface for the missieven from the Text-Fabric data.
 
 Just click
-[missieven-search](https://CLARIAH.github.io/wp6-gm-search/)
+[missieven-search](https://CLARIAH.github.io/wp6-missieven-search/)
 and off you go.
 
 It is experimental.
@@ -163,52 +142,6 @@ You can export search results as tab-separated files.
 
 More info in the [manual](https://annotation.github.io/text-fabric/tf/about/clientmanual.html).
 
-# Conversion steps
-
-The CLARIAH WP6 people kindly provided me with a TEI version of the corpus.
-
-However, this TEI version contains many inaccuracies.
-There are many instances of miscategorized material:
-page headers and footers end up in body text and vice versa;
-editorial notes and footnotes are not always properly detected; dozens of letters have not been separated;
-metadata is often incorrect.
-
-In order to produce a quality dataset, I needed to do something about it: checks and corrections.
-
-1. all metadata has been freshly distilled from the letter headings, an in case of doubt the
-   online images of the missives have been inspected.
-2. all footnote marks are linked to all footnote bodies.
-   It is still possible that there are missed footnotes and missed footnote marks,
-   but chances are slim because footnote marks and footnote bodies are detected
-   independently.
-
-Yet, most OCR errors within words and numbers are mostly untouched.
-The main concern was to get a correct separation between the kinds of text:
-
-* original letter
-* editorial text
-* footnotes
-* page headers and footers
-
-[trimTei.py](https://github.com/CLARIAH/wp6-missieven/blob/master/programs/trimTei.py)
-consists of a battery of 4 conversions to clean the incoming TEI ,
-leaving out all bits that do not end up in the final dataset,
-and reorganizing some material to facilitate the conversion to TF.
-
-The first result of the laundry is a set of XML files, which contain a clean, simplified TEI-like
-encoding of the material, with all non-essential parts stripped, such as page headers and footers,
-title pages, etc.
-There is also an exact correspondence between files and letters.
-
-Then I used the
-[walker module from TF](https://annotation.github.io/text-fabric/tf/convert/walker.html)
-to turn the simple XML into Text-Fabric.
-See
-[tfFromTrim.py](https://github.com/CLAARIAH/wp6-missieven/blob/master/programs/tfFromTrim.py).
-
-For details about the features of the end result, see 
-[transcription](docs/transcription.md)
-
 # Using this corpus data
 
 At the moment the data delivered is available
@@ -221,13 +154,13 @@ a command:
 
 * have Python installed (at least 3.6)
 * `pip3 install text-fabric`
-* `text-fabric missieven:latest --checkout=latest`
+* `text-fabric clariah/wp6-missieven`
 
 This will download the corpus and fire up a local webserver and your webbrowser pointing to an interface
 for this corpus.
 
 Another version of the data (less cleaned) is visible online in a
-[Blacklab interface ](http://corpora.ato.ivdnt.org/corpus-frontend/Missiven/search/)
+[Blacklab interface ](http://corpora.ato.ivdnt.org/corpus-frontend/Missiven/search)
 
 The next step is to make the data of this repository available in a Blacklab interface.
 In this repo we show how to set up a local Blacklab server and front-end and how to get the

diff --git a/app/app.py b/app/app.py
@@ -2,7 +2,9 @@
 from tf.advanced.app import App
 
 
-MODIFIERS = "remark folio note ref emph und super special".strip().split()
+MODIFIERS = """
+    remark folio note ref emph und super special q num den
+""".strip().split()
 
 
 def fmt_layoutFull(app, n, **kwargs):
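
The new multi-line `MODIFIERS` definition above still evaluates to a flat list of names, because `.strip().split()` discards the surrounding newlines and indentation. A minimal sketch that shows which names this commit adds; `OLD` and `ADDED` are illustrative names that do not occur in `app/app.py`:

```python
# Same expression as the new MODIFIERS definition in app/app.py.
MODIFIERS = """
    remark folio note ref emph und super special q num den
""".strip().split()

# The old one-line definition, kept here only for comparison (illustrative name).
OLD = "remark folio note ref emph und super special".strip().split()

# Names present in the new list but not in the old one (illustrative name).
ADDED = [m for m in MODIFIERS if m not in OLD]
print(ADDED)
```

The three added names match the `.q`, `.num` and `.den` classes that the same commit adds to `app/static/display.css`.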

diff --git a/app/config.yaml b/app/config.yaml
@@ -16,16 +16,17 @@ dataDisplay:
     layout-orig:
       method:  layoutOrig
 docs:
-  docPage: transcription
+  docPage: about
   featureBase: 'https://github.com/{org}/{repo}/blob/master/docs/transcription{docExt}'
   featurePage: ''
 interfaceDefaults: {}
 provenanceSpec:
   corpus: 'General Missives Dutch East India Company 1600-1800'
   doi: 10.5281/zenodo.4011801
-  version: 0.9.1
+  version: 1.0
   webBase: http://resources.huygens.knaw.nl/retroboeken/generalemissiven
   webHint: Show this document on Huygens
+  webFeature: weblink
   webUrl: '{webBase}/#page=<2>&source=<1>'
   webOffset:
     2:

diff --git a/app/static/display.css b/app/static/display.css
@@ -28,6 +28,21 @@
 .und {
   text-decoration: underline;
 }
+.q {
+  color: #777777;
+  font-weight: bold;
+}
+.num {
+  font-size: small;
+  vertical-align: super;
+}
+.den {
+  font-size: small;
+  vertical-align: sub;
+}
+.sub {
+  vertical-align: sub;
+}
 .super {
   vertical-align: super;
 }

diff --git a/change/add/09/p0233-0235.xml b/change/add/09/p0233-0235.xml
@@ -1,4 +1,4 @@
-<pb n="235" tpl="2" vol="09" facs="205_235"/>
+<pb n="235"  vol="09" />
 eenige maanden voor zijn vaders dood verlaten, en dus ook tegelijk verijdelt was de<lb/>
 hoope van dat geraffineerde en staatkundige hoff om op dese en diergelijke wij se eens<lb/>
 een voet in het Bimase rijk te verkrijgen.<lb/>

diff --git a/change/add/09/p0628-0651.xml b/change/add/09/p0628-0651.xml
@@ -1,3 +1,3 @@
-<pb n="651" tpl="2" vol="09" facs="205_651"/>
+<pb n="651"  vol="09" />
 insert OCR of missing page 651 here
 <p>
diff --git a/change/add/13/p0014-0065.xml b/change/add/13/p0014-0065.xml
@@ -1,2 +1,2 @@
-<pb n="65" tpl="2" vol="13" facs="205_065"/>
+<pb n="65"  vol="13" />
 insert OCR of missing page 65 here
diff --git a/change/doc/01/p0003a.xml b/change/doc/01/p0003a.xml
@@ -12,7 +12,7 @@
 </header>
 
 <body>
-<pb n="3" tpl="1" vol="01" facs="1_104_0003"/>
+<pb n="3"  vol="01" />
 <head rend="font-size: 10; line-height: 11.1pt">I. PIETER BOTH, AAN BOORD VAN HET WAPEN VAN AMSTERDAM,<lb/>
 VOOR ILE DE MAYO 1) 25 februari 1610.<lb/>
 </head>

diff --git a/change/doc/01/p0016a.xml b/change/doc/01/p0016a.xml
@@ -12,7 +12,7 @@
 </header>
 
 <body>
-<pb n="16" tpl="1" vol="01" facs="1_104_0016"/>
+<pb n="16"  vol="01" />
 <head rend="font-size: 10; line-height: 11.1pt">VII. PIETER BOTH, FORT MAURITIUS NABIJ NGOFAKIAHA OP HET<lb/>
 EILAND MAKÉAN 26 juli 1612.<lb/>
 </head>

diff --git a/change/doc/01/p0020a.xml b/change/doc/01/p0020a.xml
@@ -12,7 +12,7 @@
 </header>
 
 <body>
-<pb n="20" tpl="1" vol="01" facs="1_104_0020"/>
+<pb n="20"  vol="01" />
 <head rend="font-size: 10; line-height: 11.1pt">XII. PIETER BOTH, FORT MAURITIUS NABIJ NGOFAKIAHA OP HET<lb/>
 EILAND MAKÉAN 26 juli 1612.<lb/>
 </head>

diff --git a/change/doc/01/p0027a.xml b/change/doc/01/p0027a.xml
@@ -12,7 +12,7 @@
 </header>
 
 <body>
-<pb n="27" tpl="1" vol="01" facs="1_104_0027"/>
+<pb n="27"  vol="01" />
 <head rend="font-size: 10; line-height: 11.1pt">XVI. PIETER BOTH, BANTAM 1 januari 1614.<lb/>
 </head>
 

diff --git a/change/doc/01/p0097a.xml b/change/doc/01/p0097a.xml
@@ -12,7 +12,7 @@
 </header>
 
 <body>
-<pb n="97" tpl="1" vol="01" facs="1_104_0097"/>
+<pb n="97"  vol="01" />
 <head rend="font-size: 10; line-height: 11.1pt">I. JAN PIETERSZ. COEN, KASTEEL JAKATRA 5 augustus 1619.<lb/>
 </head>
 

diff --git a/change/doc/01/p0098a.xml b/change/doc/01/p0098a.xml
@@ -12,7 +12,7 @@
 </header>
 
 <body>
-<pb n="98" tpl="1" vol="01" facs="1_104_0098"/>
+<pb n="98"  vol="01" />
 <head rend="font-size: 10; line-height: 11.1pt">VIII. JAN PIETERSZ. COEN, PIETER DE CARPENTIER, JACOB DEDEL<lb/>
 EN DR. MARTINUS SONCK, KASTEEL JAKATRA 31 juli 1620.<lb/>
 </head>

diff --git a/change/doc/01/p0118a.xml b/change/doc/01/p0118a.xml
@@ -12,7 +12,7 @@
 </header>
 
 <body>
-<pb n="118" tpl="1" vol="01" facs="1_104_0118"/>
+<pb n="118"  vol="01" />
 <head rend="font-size: 10; line-height: 11.1pt">XIV. JAN PIETERSZ. COEN, PIETER DE CARPENTIER, JAN DIRCKSZ.<lb/>
 LAM EN WILLEM VAN ANTZEN, BATAVIA 1) 16 november 1621.<lb/>
 </head>

diff --git a/change/doc/01/p0244a.xml b/change/doc/01/p0244a.xml
@@ -12,7 +12,7 @@
 </header>
 
 <body>
-<pb n="244" tpl="1" vol="01" facs="1_104_0244"/>
+<pb n="244"  vol="01" />
 <head rend="font-size: 10; line-height: 11.1pt">I. JAN PIETERSZ. COEN, PIETER DE CARPENTIER, JACQUES SPECX,<lb/>
 PIETER VAN DUYNEN, ANTONIO VAN DIEMEN EN JEREMIAS DE<lb/>
 MEESTER, Batavia 9 november 1627.<lb/>

diff --git a/change/replace/09/p0628-0648.xml b/change/replace/09/p0628-0648.xml
@@ -1,4 +1,4 @@
-<pb n="648" tpl="2" vol="09" facs="205_648"/>
+<pb n="648"  vol="09" />
 <p>Op soldijen voor het affgegevene aan Compagnies loontreckende dienaaren op
 reeckening van verdiende of nog te verdiene maandgelden:<lb/>
 </p>

diff --git a/change/replace/09/p0628-0649.xml b/change/replace/09/p0628-0649.xml
@@ -1,4 +1,4 @@
-<pb n="649" tpl="2" vol="09" facs="205_649"/>
+<pb n="649"  vol="09" />
 <p>Waartegens de zuyvere inkomsten, en andere gevallen voordeden,
 ƒ 1.704.911,6,15 bedragen, soo als hier onder werd<lb/>
 aangewesen: als</p>

diff --git a/change/replace/09/p0628-0650.xml b/change/replace/09/p0628-0650.xml
@@ -1,4 +1,4 @@
-<pb n="650" tpl="2" vol="09" facs="205_650"/>
+<pb n="650"  vol="09" />
 <table>
 <row>
     <cell>aan veelerhande equipagie, arthillerye en wapenkamers, goederen, materialen,

diff --git a/change/replace/09/p0628-0697.xml b/change/replace/09/p0628-0697.xml
@@ -1,4 +1,4 @@
-<pb n="697" tpl="2" vol="09" facs="205_697"/>
+<pb n="697"  vol="09" />
 <p>Op soldijen, voor 't afgegevene aan 's Compagnies loontreckende dienaren op reeck.</p>
 <table>
 <row>

diff --git a/change/replace/09/p0628-0698.xml b/change/replace/09/p0628-0698.xml
@@ -1,4 +1,4 @@
-<pb n="698" tpl="2" vol="09" facs="205_698"/>
+<pb n="698"  vol="09" />
 <table>
 <row>
     <cell>Timmeragiën eenelijk voor een quart der ongelden van 's Compagnies slaven, sijnde ten

diff --git a/change/replace/09/p0628-0699.xml b/change/replace/09/p0628-0699.xml
@@ -1,2 +1,2 @@
-<pb n="699" tpl="2" vol="09" facs="205_699"/>
+<pb n="699"  vol="09" />
 *** replace broken OCR of landscape table by manual encoding ***
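
The `change/` diffs above all make the same edit: the `tpl` and `facs` attributes are dropped from `<pb>` elements, while `n` and `vol` are kept. A minimal sketch of such a normalization, assuming a regex-based approach; `strip_pb_attrs` is a hypothetical helper and not the script used in this commit, and its whitespace differs slightly from the committed lines:

```python
import re

# Hypothetical helper, not taken from the repository.
PB_TAG = re.compile(r"<pb\b[^>]*>")
DROPPED_ATTR = re.compile(r'\s+(?:tpl|facs)="[^"]*"')


def strip_pb_attrs(xml_text: str) -> str:
    """Remove tpl and facs attributes, but only inside <pb ...> tags."""
    return PB_TAG.sub(lambda m: DROPPED_ATTR.sub("", m.group(0)), xml_text)


before = '<pb n="235" tpl="2" vol="09" facs="205_235"/>'
print(strip_pb_attrs(before))
```

Restricting the substitution to `<pb>` tags leaves any `tpl` or `facs` attribute on other elements untouched.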
diff --git a/docs/about.md b/docs/about.md
@@ -0,0 +1,85 @@
+This repo contains a structurally clean version of the data of the *General Missives*, volumes 1-14.
+
+The *Generale Missiven* is a collection of letters from governors of the
+VOC (Dutch East India Company) to
+the *Heren 17*, the council of the governors of the 17 provinces of the Netherlands,
+which was the effective
+government of the Low Countries at the time of the 17th and 18th century.
+
+The letters comprise 14 volumes and date from 1610 to 1767.
+
+The Huygens-ING institute publishes this material:
+[General Missives Website](http://resources.huygens.knaw.nl/retroboeken/generalemissiven/#page=0&accessor=toc&view=homePane),
+see also
+[General Missives Project](http://resources.huygens.knaw.nl/vocgeneralemissiven);
+both websites are in Dutch.
+
+The CLARIAH project uses a TEI version of volumes 1-13 of this corpus in its Work Package 6, which deals with
+new infrastructure for academic text processing:
+[WP6-Text](https://www.clariah.nl/en/work-packages/focus-areas/text?layout=blog).
+That work is carried out by
+
+* [Lodewijk Petram](https://www.lodewijkpetram.nl) (HuygensING)
+* [Jesse de Does](https://www.researchgate.net/profile/Jesse_De_Does) (INT)
+* [Sophie Arnoult](http://www.illc.uva.nl/People/person/3601/Ir-Sophie-Arnoult)
+
+This repo does not publish the source/intermediate data as developed in CLARIAH-WP6;
+they will publish their materials in due course.
+
+# Conversion steps
+
+The CLARIAH WP6 people kindly provided me with a TEI version of volumes 1-13 of the corpus
+and a textual pdf of bands i and ii of volume 14.
+
+However, the TEI version contains many inaccuracies.
+There are many instances of miscategorized material:
+page headers and footers end up in body text and vice versa;
+editorial notes and footnotes are not always properly detected; dozens of letters have not been separated;
+metadata is often incorrect.
+
+In order to produce a quality dataset, I needed to do something about it: checks and corrections.
+
+1. all metadata has been freshly distilled from the letter headings, and in case of doubt the
+   online images of the missives have been inspected.
+2. all footnote marks are linked to all footnote bodies.
+   It is still possible that there are missed footnotes and missed footnote marks,
+   but chances are slim because footnote marks and footnote bodies are detected
+   independently.
+
+Yet OCR errors within words and numbers are mostly untouched.
+The main concern was to get a correct separation between the kinds of text:
+
+* original letter
+* editorial text
+* footnotes
+* page headers and footers
+
+[trimTei.py](https://github.com/CLARIAH/wp6-missieven/blob/master/programs/trimTei.py)
+consists of a battery of 4 conversions to clean the incoming TEI,
+leaving out all bits that do not end up in the final dataset,
+and reorganizing some material to facilitate the conversion to TF.
+
+The first result of the laundry is a set of XML files, which contain a clean, simplified TEI-like
+encoding of the material, with all non-essential parts stripped, such as page headers and footers,
+title pages, etc.
+There is also an exact correspondence between files and letters.
+
+Concerning volume 14: the pdf is the result of a much better OCR process.
+We have used all typographical clues given in the pdf to infer structure.
+There was quite a bit of nitty-gritty involved in this, especially for detecting table layout.
+I do not claim that all table structure has been perfectly detected.
+Another issue was detecting numbers and fractions properly.
+Eventually, the script
+[trimPdf.py](https://github.com/CLARIAH/wp6-missieven/blob/master/programs/trimPdf.py)
+transforms the pdf, in a sequence of stages, into the same kind of XML that the trimTei.py script produces
+for volumes 1-13.
+
+Then I used the
+[walker module from TF](https://annotation.github.io/text-fabric/tf/convert/walker.html)
+to turn the simple XML of all 14 volumes into Text-Fabric.
+See
+[tfFromTrim.py](https://github.com/CLARIAH/wp6-missieven/blob/master/programs/tfFromTrim.py).
+
+For details about the features of the end result, see 
+[transcription](docs/transcription.md)
+
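
The Conversion steps in `docs/about.md` note that footnote marks and footnote bodies are detected independently and then linked. A cross-check along those lines could look like the sketch below. It assumes each page yields two lists of footnote numbers; `unmatched_footnotes` and the sample data are hypothetical and not taken from `trimTei.py`:

```python
def unmatched_footnotes(marks, bodies):
    """Return (marks without a body, bodies without a mark), each sorted."""
    mark_set, body_set = set(marks), set(bodies)
    return sorted(mark_set - body_set), sorted(body_set - mark_set)


# Made-up numbers for one page: marks found in the text, bodies found at the foot.
marks_on_page = [1, 2, 3]
bodies_on_page = [1, 3, 4]

missing_bodies, missing_marks = unmatched_footnotes(marks_on_page, bodies_on_page)
print(missing_bodies, missing_marks)
```

Because the two lists come from separate detection passes, a mismatch on either side points to a page worth inspecting by hand.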