This script performs segmentation, normalization and lemmatization of XML-TEI encoded files. Except for the first, each step can be activated separately.
- This repository is no longer updated. Please use the Annotator repository instead.
For the Annotator repository, see here.
- clone or download this repository
git clone [email protected]:e-ditiones/SEG17.git
cd SEG17
- create a virtual environment and activate it
python3 -m venv env
source env/bin/activate
- install dependencies
pip install -r requirements.txt
- install lemmatisation models
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended pie-extended download fr
- if you want to split your text
python3 level2to3.py path/to/file
- if you want to split and normalize your text
python3 level2to3.py -n path/to/file
- if you want to split and lemmatize
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 level2to3.py -l path/to/file
- if you want to split, normalize and lemmatize your text
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 level2to3.py -l -n path/to/file
Using the Level-2_to_level-3.xsl XSL stylesheet, the script adds XML-TEI tags to split the text into segments (<seg>).
For each <p> (paragraph) and <l> (line), the script level2to3.py uses a set of punctuation marks (.;:!?) to split the text into segments, each wrapped in a <seg> element.
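As a rough illustration of the splitting rule described above, here is a minimal Python sketch. It is not the logic of the XSL stylesheet itself: the function name `split_into_segments` and the choice to split only when the punctuation mark is followed by whitespace are assumptions made for the example.

```python
import re

# Punctuation marks listed above as segment boundaries: . ; : ! ?
# Assumption: a boundary only counts when followed by whitespace.
SEGMENT_END = re.compile(r"(?<=[.;:!?])\s+")


def split_into_segments(text):
    """Split text after each of . ; : ! ? (hypothetical helper)."""
    # Collapse whitespace first so source line breaks do not matter.
    text = " ".join(text.split())
    return [seg for seg in SEGMENT_END.split(text) if seg]


sample = "Quand ie ne ſerois pas nay; il faudroit que ie fuſſe mauuais François. Fin!"
segments = split_into_segments(sample)
for seg in segments:
    # Each segment would end up in its own <seg> element.
    print(f"<seg>{seg}</seg>")
```

The punctuation mark stays attached to the segment it closes, which matches the idea of segments "captured" in <seg> elements; how the real stylesheet treats abbreviations or punctuation inside other elements is not covered here.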
For the normalisation, we use PARALLEL17.
For lemmatisation, we use Pie-extended and the "fr" model.
The original version, and not the normalised version, is lemmatised.
We use Morphalou. We offer an alternative normalisation that is token-based rather than seg-based: the script offers a normalised version for each token.
More information about the dictionaries used is available here.
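The token-based normalisation can be pictured as a dictionary lookup per token. The Python sketch below is an assumption-laden illustration, not the script's actual code: the `CANDIDATES` mapping and the `normalise_token` helper are hypothetical, and joining several candidates with `|` is inferred from the `<reg>j'|je</reg>` value in the sample output of this README.

```python
# Hypothetical mini-dictionary: original spelling -> possible modern forms.
# The real resource is derived from Morphalou; these entries are illustrative.
CANDIDATES = {
    "ie": ["j'", "je"],
    "ſerois": ["serais"],
}


def normalise_token(token, candidates=CANDIDATES):
    """Return candidate forms joined by '|', or the token unchanged if unknown."""
    forms = candidates.get(token.lower())
    return "|".join(forms) if forms else token


normalised = [normalise_token(t) for t in ["Quand", "ie", "ne", "ſerois"]]
print(normalised)
```

Unknown tokens are returned as they are, so the output always has one normalised form per input token.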
Excerpt from the file to be processed (available here):
<p n="1" xml:id="EXP_0001-1-1">
<persName>Monseignevr</persName>, Quand ie ne ſerois pas nay cõme ie ſuis, voſtre tres-humble
ſeruiteur, il faudroit que ie fuſſe mauuais François pour ne me reſioüir pas des contẽtemens de
voſtre <orgName>maiſon</orgName>, puis que ce ſont des felicités publiques.
</p>
Using PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 level2to3.py -l -n path/to/EXP_0001_level-2_text.xml, you get:
<p>
<seg>
<choice>
<orig>
<w lemma="monseigneur" pos="VERinf" msd="NOMB.=s">
<orig>Monseignevr</orig>
<reg>Monseignevr</reg>
</w>
<w lemma="," pos="PONfbl" msd="MORPH=empty">
<orig>,</orig>
<reg>,</reg>
</w>
<w lemma="quand" pos="CONsub" msd="MORPH=empty">
<orig>Quand</orig>
<reg>Quand</reg>
</w>
<w lemma="je" pos="PROper" msd="NOMB.=s">
<orig>ie</orig>
<reg>j'|je</reg>
</w>
<w lemma="ne" pos="ADVneg" msd="MORPH=empty">
<orig>ne</orig>
<reg>ne</reg>
</w>
<w lemma="être" pos="VERcjg" msd="MODE=con|PERS.=2|NOMB.=s">
<orig>ſerois</orig>
<reg>serais</reg>
</w>
...
</orig>
<reg>Monseigneur , Quand je ne serais pas né comme je suis , votre très-humble
serviteur , il faudrait que je fusse mauvais Français pour ne me réjouir pas des ressentiments de
votre maison , puisque ce sont des ressentiments publiques .
</reg>
</choice>
</seg>
</p>
The output file can be found here.
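To work with the annotated output, the <w> elements can be read back with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a small fragment copied from the sample above; the helper name `read_tokens` is hypothetical, and the fragment is assumed to carry no XML namespace (a real TEI file may declare one, in which case the tag names would need to be qualified).

```python
import xml.etree.ElementTree as ET

# Small fragment in the shape of the sample output (no namespace assumed).
SAMPLE = """
<seg>
  <w lemma="je" pos="PROper" msd="NOMB.=s"><orig>ie</orig><reg>j'|je</reg></w>
  <w lemma="être" pos="VERcjg" msd="MODE=con|PERS.=2|NOMB.=s"><orig>ſerois</orig><reg>serais</reg></w>
</seg>
"""


def read_tokens(seg):
    """Collect (orig, reg, lemma, pos) for every <w> below the given element."""
    rows = []
    for w in seg.iter("w"):
        rows.append((w.findtext("orig"), w.findtext("reg"), w.get("lemma"), w.get("pos")))
    return rows


tokens = read_tokens(ET.fromstring(SAMPLE))
for orig, reg, lemma, pos in tokens:
    print(orig, reg, lemma, pos, sep="\t")
```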
Based on the dictionary provided in TEI by Morphalou, we created a new dictionary, in JSON, using this script.
For the entry "abbaye", you get:
- in TEI:
<entry xml:id="e69">
<form type="lemma" corresp="morphalou2-tlf#ABBAYE{commonNoun} dela#abbaye{N+z1} dicollecte#abbaye{nom} lefff#abbaye{nc}">
<orth>abbaye</orth>
<pron>a b E i @</pron>
<gramGrp>
<pos>commonNoun</pos>
<gen>feminine</gen>
</gramGrp>
</form>
<form type="inflected" corresp="morphalou2-morphalou1#abbaye dela#abbaye dicollecte#abbaye lefff#abbaye">
<orth>abbaye</orth>
<pron>a b E i @</pron>
<gramGrp>
<number>singular</number>
</gramGrp>
</form>
<form type="inflected" corresp="morphalou2-morphalou1#abbayes dela#abbayes dicollecte#abbayes lefff#abbayes">
<orth>abbayes</orth>
<pron>a b E i</pron>
<gramGrp>
<number>plural</number>
</gramGrp>
</form>
</entry>
- in JSON:
{
"abbaye":[
{
"orth":"abbaye",
"pron":"a b E i @",
"gramGrp":{
"pos":"commonNoun",
"gen":"feminine"
}
},
{
"orth":"abbaye",
"pron":"a b E i @",
"gramGrp":{
"number":"singular"
}
},
{
"orth":"abbayes",
"pron":"a b E i",
"gramGrp":{
"number":"plural"
}
}
]
}
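The mapping from the TEI entry to the JSON structure shown above can be sketched in a few lines of Python. This is not the conversion script referred to in this README, only an illustration of the mapping under stated assumptions: the function name `entry_to_dict` is hypothetical, the entry is shortened, and the elements are assumed to be un-namespaced as in the excerpt.

```python
import json
import xml.etree.ElementTree as ET

# Shortened version of the "abbaye" entry (corresp attributes omitted).
TEI_ENTRY = """
<entry xml:id="e69">
  <form type="lemma">
    <orth>abbaye</orth>
    <pron>a b E i @</pron>
    <gramGrp><pos>commonNoun</pos><gen>feminine</gen></gramGrp>
  </form>
  <form type="inflected">
    <orth>abbayes</orth>
    <pron>a b E i</pron>
    <gramGrp><number>plural</number></gramGrp>
  </form>
</entry>
"""


def entry_to_dict(entry):
    """Map the lemma's spelling to a list holding one dict per <form>."""
    lemma = entry.find("form[@type='lemma']/orth").text
    forms = []
    for form in entry.findall("form"):
        gram = form.find("gramGrp")
        forms.append({
            "orth": form.findtext("orth"),
            "pron": form.findtext("pron"),
            # Each child of <gramGrp> becomes a key/value pair.
            "gramGrp": {child.tag: child.text for child in gram} if gram is not None else {},
        })
    return {lemma: forms}


result = entry_to_dict(ET.fromstring(TEI_ENTRY))
print(json.dumps(result, ensure_ascii=False, indent=2))
```

Keying the dictionary by lemma keeps all forms of an entry together; other layouts, such as indexing by inflected form, would be equally possible depending on how the lookup is done.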
This repository is developed by Alexandre Bartz with the help of Simon Gabay, as part of the e-ditiones project.
Our work is licenced under a Creative Commons Attribution 4.0 International Licence.
Pie-extended is under the Mozilla Public License 2.0.
Morphalou is under the LGPL-LR.
Alexandre Bartz, Simon Gabay. 2020. Lemmatization and normalization of French modern manuscripts and printed documents. Retrieved from https://github.com/e-ditiones/SEG17.