Script processing segmentation, normalization and lemmatization of XML-TEI encoded files.

e-ditiones/SEG17

SEG17

This script performs segmentation, normalization and lemmatization of XML-TEI encoded files. Segmentation always runs; normalization and lemmatization can each be activated separately.

- This repository is no longer maintained. Please use the Annotator repository instead.

For the Annotator repo, cf. here.

Getting started

To install SEG17 from the command line:

  1. clone or download this repository
git clone git@github.com:e-ditiones/SEG17.git
cd SEG17
  2. create a virtual environment and activate it
python3 -m venv env
source env/bin/activate
  3. install dependencies
pip install -r requirements.txt
  4. install lemmatisation models
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended pie-extended download fr

Now you can use SEG17

  • if you want to split your text
python3 level2to3.py path/to/file
  • if you want to split and normalize your text
python3 level2to3.py -n path/to/file
  • if you want to split and lemmatize
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 level2to3.py -l path/to/file
  • if you want to split, normalize and lemmatize your text
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 level2to3.py -l -n path/to/file
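
When several files have to be processed, the commands above can be assembled from Python. The following is a minimal sketch, not part of SEG17: the helper names `build_command`, `build_env` and `run_seg17` are hypothetical, and only the `-l` and `-n` flags and the `PIE_EXTENDED_DOWNLOADS` variable come from the commands listed above. It assumes it is run from inside the SEG17 directory with the virtual environment activated.

```python
import os
import subprocess
import sys


def build_command(xml_path, normalize=False, lemmatize=False):
    """Assemble the level2to3.py call; segmentation always runs."""
    command = [sys.executable, "level2to3.py"]
    if lemmatize:
        command.append("-l")
    if normalize:
        command.append("-n")
    command.append(str(xml_path))
    return command


def build_env(models_dir="~/MesModelsPieExtended"):
    """Copy the current environment and point Pie-extended at its models."""
    env = dict(os.environ)
    env["PIE_EXTENDED_DOWNLOADS"] = os.path.expanduser(models_dir)
    return env


def run_seg17(xml_path, normalize=False, lemmatize=False):
    """Run the script (hypothetical wrapper); raise if it exits with an error."""
    command = build_command(xml_path, normalize, lemmatize)
    return subprocess.run(command, env=build_env(), check=True)


# Show the arguments for "split, normalize and lemmatize" without running anything.
print(build_command("path/to/file", normalize=True, lemmatize=True)[1:])
```

Calling `run_seg17(path, lemmatize=True)` in a loop over a list of paths would then process each file in turn. The environment variable is set for every call here, although the commands above only need it when `-l` is used.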

How it works

The segmentation

Using the Level-2_to_level-3.xsl XSL stylesheet, the script level2to3.py adds XML-TEI tags that split the text into segments. Within each <p> (paragraph) and <l> (line), the text is split at the punctuation marks . ; : ! ? and each resulting segment is wrapped in a <seg> element.
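
The splitting rule can be illustrated in a few lines of Python. This is only a sketch of the idea: SEG17 itself does this work with the XSL stylesheet, and the helpers `segment_text` and `wrap_in_seg` are hypothetical names. It also assumes plain text content, so inline elements such as <persName> are not handled, and it assumes a split happens only where the punctuation mark is followed by whitespace.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a segment ends at one of . ; : ! ? when whitespace follows it.
SPLIT_AFTER_PUNCT = re.compile(r"(?<=[.;:!?])\s+")


def segment_text(text):
    """Return the non-empty segments of `text`, keeping the closing punctuation."""
    return [part.strip() for part in SPLIT_AFTER_PUNCT.split(text) if part.strip()]


def wrap_in_seg(tag, text):
    """Build a <p> or <l> element whose content is a sequence of <seg> children."""
    parent = ET.Element(tag)
    for part in segment_text(text):
        seg = ET.SubElement(parent, "seg")
        seg.text = part
    return parent


paragraph = wrap_in_seg("p", "Quand ie ne ſerois pas nay; il faudroit que ie fuſſe. Voſtre maiſon!")
print(ET.tostring(paragraph, encoding="unicode"))
```

The lookbehind keeps each punctuation mark attached to the segment it closes, so no character of the source text is lost by the split.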

The normalization via NMT

For the normalisation, we use PARALLEL17.

The lemmatization

For lemmatisation, we use Pie-extended and the "fr" model.

The original version, and not the normalised version, is lemmatised.

The normalisation via lemmas

We also offer an alternative normalisation that works token by token rather than segment by segment. It relies on Morphalou, and the script provides a normalised form for each token.

More information about the dictionaries used is available here.

Examples

Running level2to3.py on XML-TEI files

Excerpt of the file to be processed (available here):

<p n="1" xml:id="EXP_0001-1-1">
    <persName>Monseignevr</persName>, Quand ie ne ſerois pas nay cõme ie ſuis, voſtre tres-humble
    ſeruiteur, il faudroit que ie fuſſe mauuais François pour ne me reſioüir pas des contẽtemens de
    voſtre <orgName>maiſon</orgName>, puis que ce ſont des felicités publiques.
</p>

Using PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 level2to3.py -l -n path/to/EXP_0001_level-2_text.xml, you get:

<p>
        <seg>
          <choice>
            <orig>
              <w lemma="monseigneur" pos="VERinf" msd="NOMB.=s">
                <orig>Monseignevr</orig>
                <reg>Monseignevr</reg>
              </w>
              <w lemma="," pos="PONfbl" msd="MORPH=empty">
                <orig>,</orig>
                <reg>,</reg>
              </w>
              <w lemma="quand" pos="CONsub" msd="MORPH=empty">
                <orig>Quand</orig>
                <reg>Quand</reg>
              </w>
              <w lemma="je" pos="PROper" msd="NOMB.=s">
                <orig>ie</orig>
                <reg>j'|je</reg>
              </w>
              <w lemma="ne" pos="ADVneg" msd="MORPH=empty">
                <orig>ne</orig>
                <reg>ne</reg>
              </w>
              <w lemma="être" pos="VERcjg" msd="MODE=con|PERS.=2|NOMB.=s">
                <orig>ſerois</orig>
                <reg>serais</reg>
              </w>
              ...
           </orig>
           <reg>Monseigneur , Quand je ne serais pas né comme je suis , votre très-humble
serviteur , il faudrait que je fusse mauvais Français pour ne me réjouir pas des ressentiments de
votre maison , puisque ce sont des ressentiments publiques .
</reg>
          </choice>
        </seg>
</p>

The output file can be found here.
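
To reuse the annotations, the <w> elements of the output can be read back with a standard XML parser. The sketch below is an illustration under assumptions: the helper `read_tokens` is hypothetical, the sample is an abridged copy of two tokens from the output above, and the elements are assumed to carry no namespace, as in the excerpt (a complete TEI file may declare one, in which case the tag names would need the namespace prefix).

```python
import xml.etree.ElementTree as ET

# Abridged sample: two <w> elements copied from the output shown above.
SAMPLE = """<p>
  <seg>
    <choice>
      <orig>
        <w lemma="je" pos="PROper" msd="NOMB.=s">
          <orig>ie</orig>
          <reg>j'|je</reg>
        </w>
        <w lemma="être" pos="VERcjg" msd="MODE=con|PERS.=2|NOMB.=s">
          <orig>ſerois</orig>
          <reg>serais</reg>
        </w>
      </orig>
    </choice>
  </seg>
</p>"""


def read_tokens(xml_string):
    """Collect the original form, normalised form and annotations of each <w>."""
    root = ET.fromstring(xml_string)
    tokens = []
    for w in root.iter("w"):
        tokens.append({
            "orig": w.findtext("orig"),  # direct <orig> child of the token
            "reg": w.findtext("reg"),    # token-based normalisation, kept verbatim
            "lemma": w.get("lemma"),
            "pos": w.get("pos"),
            "msd": w.get("msd"),
        })
    return tokens


for token in read_tokens(SAMPLE):
    print(token["orig"], token["lemma"], token["pos"], token["reg"], sep="\t")
```

The `reg` value is returned exactly as stored, including any `|` it contains, so that the caller decides how to interpret it.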

The dictionary

Based on the dictionary provided in TEI by Morphalou, we created a new dictionary, in JSON, using this script.

For the entry "abbaye", you get:

  • in TEI:
<entry xml:id="e69">
	<form type="lemma" corresp="morphalou2-tlf#ABBAYE{commonNoun} dela#abbaye{N+z1} dicollecte#abbaye{nom} lefff#abbaye{nc}">
		<orth>abbaye</orth>
		<pron>a b E i @</pron>
		<gramGrp>
			<pos>commonNoun</pos>
			<gen>feminine</gen>
		</gramGrp>
	</form>
	<form type="inflected" corresp="morphalou2-morphalou1#abbaye dela#abbaye dicollecte#abbaye lefff#abbaye">
		<orth>abbaye</orth>
		<pron>a b E i @</pron>
		<gramGrp>
			<number>singular</number>
		</gramGrp>
	</form>
	<form type="inflected" corresp="morphalou2-morphalou1#abbayes dela#abbayes dicollecte#abbayes lefff#abbayes">
		<orth>abbayes</orth>
		<pron>a b E i</pron>
		<gramGrp>
			<number>plural</number>
		</gramGrp>
	</form>
</entry>
  • in JSON:
{
   "abbaye":[
      {
         "orth":"abbaye",
         "pron":"a b E i @",
         "gramGrp":{
            "pos":"commonNoun",
            "gen":"feminine"
         }
      },
      {
         "orth":"abbaye",
         "pron":"a b E i @",
         "gramGrp":{
            "number":"singular"
         }
      },
      {
         "orth":"abbayes",
         "pron":"a b E i",
         "gramGrp":{
            "number":"plural"
         }
      }
   ]
}
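
The mapping between the two representations can be sketched as follows. This is not the conversion script mentioned above, only an illustration of the correspondence visible in the "abbaye" example: the helpers `form_to_dict` and `entry_to_dict` are hypothetical, the TEI sample is abridged (the `corresp` attributes are left out), and it is an assumption that the JSON key is the <orth> of the form whose type is "lemma".

```python
import json
import xml.etree.ElementTree as ET

# Abridged copy of the TEI entry above, without the corresp attributes.
ENTRY = """<entry xml:id="e69">
  <form type="lemma">
    <orth>abbaye</orth><pron>a b E i @</pron>
    <gramGrp><pos>commonNoun</pos><gen>feminine</gen></gramGrp>
  </form>
  <form type="inflected">
    <orth>abbaye</orth><pron>a b E i @</pron>
    <gramGrp><number>singular</number></gramGrp>
  </form>
  <form type="inflected">
    <orth>abbayes</orth><pron>a b E i</pron>
    <gramGrp><number>plural</number></gramGrp>
  </form>
</entry>"""


def form_to_dict(form):
    """Turn one <form> into the orth / pron / gramGrp structure of the JSON."""
    gram = form.find("gramGrp")
    return {
        "orth": form.findtext("orth"),
        "pron": form.findtext("pron"),
        "gramGrp": {child.tag: child.text for child in gram} if gram is not None else {},
    }


def entry_to_dict(entry):
    """Key the list of forms by the spelling of the lemma form (assumption)."""
    forms = entry.findall("form")
    lemma = next(form for form in forms if form.get("type") == "lemma")
    return {lemma.findtext("orth"): [form_to_dict(form) for form in forms]}


converted = entry_to_dict(ET.fromstring(ENTRY))
print(json.dumps(converted, ensure_ascii=False, indent=3))
```

Keying the dictionary by lemma makes it possible to look up all inflected forms of a word directly from the lemma produced by the lemmatization step.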

Credits

This repository is developed by Alexandre Bartz with the help of Simon Gabay, as part of the e-ditiones project.

Licences

Our work is licensed under a Creative Commons Attribution 4.0 International Licence.

Pie-extended is under the Mozilla Public License 2.0.

Morphalou is under the LGPL-LR.

Cite this repository

Alexandre Bartz, Simon Gabay. 2020. Lemmatization and normalization of French modern manuscripts and printed documents. Retrieved from https://github.com/e-ditiones/SEG17.
