SEG17

This script performs segmentation, normalization and lemmatization of XML-TEI encoded files. Segmentation always runs; normalization and lemmatization can each be activated separately.

This repository is no longer maintained. Please use the Annotator repository instead.

The Annotator repository is available here.

Getting started

To install SEG17 from the command line, you have to:

clone or download this repository

git clone git@github.com:e-ditiones/SEG17.git
cd SEG17

create a virtual environment and activate it

python3 -m venv env
source env/bin/activate

install dependencies

pip install -r requirements.txt

install lemmatisation models

PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended pie-extended download fr

Now you can use SEG17

if you want to split your text

python3 level2to3.py path/to/file

if you want to split and normalize your text

python3 level2to3.py -n path/to/file

if you want to split and lemmatize your text

PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 level2to3.py -l path/to/file

if you want to split, normalize and lemmatize your text

PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 level2to3.py -l -n path/to/file
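
When several files have to be processed, the four invocations above can be assembled programmatically. The following is a minimal sketch, not part of SEG17: the helper name `build_command` and its parameters are hypothetical, and only the flags and the environment variable shown above are used.

```python
import os
import shlex


def build_command(path, normalize=False, lemmatize=False,
                  models_dir="~/MesModelsPieExtended"):
    """Return (argv, env) for one run of level2to3.py (hypothetical helper)."""
    argv = ["python3", "level2to3.py"]
    if lemmatize:
        argv.append("-l")
    if normalize:
        argv.append("-n")
    argv.append(path)

    env = dict(os.environ)
    if lemmatize:
        # Same variable as in the commands above: where the models were downloaded.
        env["PIE_EXTENDED_DOWNLOADS"] = os.path.expanduser(models_dir)
    return argv, env


argv, env = build_command("path/to/file", normalize=True, lemmatize=True)
print(shlex.join(argv))
# To actually run it: subprocess.run(argv, env=env, check=True)
```

Building the argument list separately from running it makes it easy to check the command before launching a long lemmatization job.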

How it works

The segmentation

Using the Level-2_to_level-3.xsl XSL stylesheet, the script level2to3.py adds XML-TEI tags that split the text into segments. Within each <p> (paragraph) and <l> (line), the text is cut at certain punctuation marks (.;:!?), and each resulting segment is wrapped in a <seg> element.
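
The splitting rule can be illustrated in a few lines of Python. This is only a sketch of the idea: SEG17 itself performs the segmentation through the XSL stylesheet, and the regular expression and function name below are assumptions made for the example.

```python
import re

# Split on the whitespace that follows one of the marks . ; : ! ?
# The lookbehind keeps the punctuation mark attached to its segment.
SEGMENT_END = re.compile(r"(?<=[.;:!?])\s+")


def split_into_segments(text):
    """Return the non-empty segments of `text`, each ending at a punctuation mark."""
    return [part.strip() for part in SEGMENT_END.split(text) if part.strip()]


sample = "Il faut partir; mais quand? Demain."
for segment in split_into_segments(sample):
    print(f"<seg>{segment}</seg>")
```

Commas do not end a segment, which is why a long period with many commas stays in a single <seg>. A naive rule like this one would also cut after abbreviations ending in a full stop.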

The normalization via NMT

For the normalisation, we use PARALLEL17.

The lemmatization

For lemmatisation, we use Pie-extended and the "fr" model.

The original version, and not the normalised version, is lemmatised.

The normalisation via lemmas

We use Morphalou to offer an alternative normalisation that is token-based rather than segment-based: the script proposes a normalised form for each token.

More information about the dictionaries used is available here.
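
As an illustration of what a token-based lookup could look like, the sketch below selects the forms of a lemma that match a grammatical number. It is an assumption about the general approach, not a description of the code in level2to3.py: the function name `candidate_forms` is hypothetical, and the entry is copied from the JSON dictionary example given in "The dictionary" section of this README.

```python
# Entry copied from the JSON dictionary example in this README.
DICTIONARY = {
    "abbaye": [
        {"orth": "abbaye", "pron": "a b E i @",
         "gramGrp": {"pos": "commonNoun", "gen": "feminine"}},
        {"orth": "abbaye", "pron": "a b E i @",
         "gramGrp": {"number": "singular"}},
        {"orth": "abbayes", "pron": "a b E i",
         "gramGrp": {"number": "plural"}},
    ]
}


def candidate_forms(lemma, number, dictionary=DICTIONARY):
    """Return the matching forms joined by '|' (empty string if none)."""
    forms = []
    for form in dictionary.get(lemma, []):
        # Keep forms whose number matches, without repeating a spelling.
        if form["gramGrp"].get("number") == number and form["orth"] not in forms:
            forms.append(form["orth"])
    return "|".join(forms)


print(candidate_forms("abbaye", "plural"))
```

Joining several candidates with "|" mirrors the way the example output of this README lists alternatives inside a token-level <reg>.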

Examples

Running level2to3.py on XML-TEI files

Extract from the file to be processed (available here):

<p n="1" xml:id="EXP_0001-1-1">
    <persName>Monseignevr</persName>, Quand ie ne ſerois pas nay cõme ie ſuis, voſtre tres-humble
    ſeruiteur, il faudroit que ie fuſſe mauuais François pour ne me reſioüir pas des contẽtemens de
    voſtre <orgName>maiſon</orgName>, puis que ce ſont des felicités publiques.
</p>

Using PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 level2to3.py -l -n path/to/EXP_0001_level-2_text.xml, you get:

<p>
        <seg>
          <choice>
            <orig>
              <w lemma="monseigneur" pos="VERinf" msd="NOMB.=s">
                <orig>Monseignevr</orig>
                <reg>Monseignevr</reg>
              </w>
              <w lemma="," pos="PONfbl" msd="MORPH=empty">
                <orig>,</orig>
                <reg>,</reg>
              </w>
              <w lemma="quand" pos="CONsub" msd="MORPH=empty">
                <orig>Quand</orig>
                <reg>Quand</reg>
              </w>
              <w lemma="je" pos="PROper" msd="NOMB.=s">
                <orig>ie</orig>
                <reg>j'|je</reg>
              </w>
              <w lemma="ne" pos="ADVneg" msd="MORPH=empty">
                <orig>ne</orig>
                <reg>ne</reg>
              </w>
              <w lemma="être" pos="VERcjg" msd="MODE=con|PERS.=2|NOMB.=s">
                <orig>ſerois</orig>
                <reg>serais</reg>
              </w>
              ...
           </orig>
           <reg>Monseigneur , Quand je ne serais pas né comme je suis , votre très-humble
serviteur , il faudrait que je fusse mauvais Français pour ne me réjouir pas des ressentiments de
votre maison , puisque ce sont des ressentiments publiques .
</reg>
          </choice>
        </seg>
</p>

The output file can be found here.
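
To reuse this output, the token-level and segment-level annotations can be read back with Python's standard library. The snippet below is a sketch under two assumptions: the sample is a shortened copy of the output above (the segment-level <reg> is reduced to the two tokens kept), and no XML namespace is declared, whereas a full TEI file would normally use the TEI namespace and require namespace-aware queries.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example output; the segment-level <reg> is cut
# down to the two tokens kept here.
SAMPLE = """<seg>
  <choice>
    <orig>
      <w lemma="je" pos="PROper" msd="NOMB.=s">
        <orig>ie</orig>
        <reg>j'|je</reg>
      </w>
      <w lemma="être" pos="VERcjg" msd="MODE=con|PERS.=2|NOMB.=s">
        <orig>ſerois</orig>
        <reg>serais</reg>
      </w>
    </orig>
    <reg>je serais</reg>
  </choice>
</seg>"""


def read_tokens(seg):
    """Yield one dict per <w> element found in a <seg>."""
    for w in seg.iter("w"):
        yield {
            "orig": w.findtext("orig"),
            "lemma": w.get("lemma"),
            "pos": w.get("pos"),
            # Token-level <reg> may list several candidates separated by "|".
            "candidates": w.findtext("reg").split("|"),
        }


seg = ET.fromstring(SAMPLE)
tokens = list(read_tokens(seg))
sentence = seg.findtext("choice/reg")  # segment-level normalisation

for token in tokens:
    print(token["orig"], token["lemma"], token["pos"], token["candidates"])
print(sentence)
```

The two <reg> levels carry different information: the one directly under <choice> holds the normalisation of the whole segment, while the one inside each <w> holds the token-based proposal.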

The dictionary

Based on the dictionary provided in TEI by Morphalou, we created a new dictionary, in JSON, using this script.

For the entry "abbaye", you get:

in TEI:

<entry xml:id="e69">
	<form type="lemma" corresp="morphalou2-tlf#ABBAYE{commonNoun} dela#abbaye{N+z1} dicollecte#abbaye{nom} lefff#abbaye{nc}">
		<orth>abbaye</orth>
		<pron>a b E i @</pron>
		<gramGrp>
			<pos>commonNoun</pos>
			<gen>feminine</gen>
		</gramGrp>
	</form>
	<form type="inflected" corresp="morphalou2-morphalou1#abbaye dela#abbaye dicollecte#abbaye lefff#abbaye">
		<orth>abbaye</orth>
		<pron>a b E i @</pron>
		<gramGrp>
			<number>singular</number>
		</gramGrp>
	</form>
	<form type="inflected" corresp="morphalou2-morphalou1#abbayes dela#abbayes dicollecte#abbayes lefff#abbayes">
		<orth>abbayes</orth>
		<pron>a b E i</pron>
		<gramGrp>
			<number>plural</number>
		</gramGrp>
	</form>
</entry>

in JSON:

{
   "abbaye":[
      {
         "orth":"abbaye",
         "pron":"a b E i @",
         "gramGrp":{
            "pos":"commonNoun",
            "gen":"feminine"
         }
      },
      {
         "orth":"abbaye",
         "pron":"a b E i @",
         "gramGrp":{
            "number":"singular"
         }
      },
      {
         "orth":"abbayes",
         "pron":"a b E i",
         "gramGrp":{
            "number":"plural"
         }
      }
   ]
}
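
The mapping between the two formats can be sketched as follows. This is not the conversion script referred to above, only a minimal illustration of the correspondence visible in the "abbaye" example: the function name `entry_to_dict` is hypothetical and the `corresp` attributes are left out of the sample for brevity.

```python
import json
import xml.etree.ElementTree as ET

# The "abbaye" entry shown above, without its corresp attributes.
ENTRY = """<entry xml:id="e69">
  <form type="lemma">
    <orth>abbaye</orth><pron>a b E i @</pron>
    <gramGrp><pos>commonNoun</pos><gen>feminine</gen></gramGrp>
  </form>
  <form type="inflected">
    <orth>abbaye</orth><pron>a b E i @</pron>
    <gramGrp><number>singular</number></gramGrp>
  </form>
  <form type="inflected">
    <orth>abbayes</orth><pron>a b E i</pron>
    <gramGrp><number>plural</number></gramGrp>
  </form>
</entry>"""


def entry_to_dict(entry):
    """Map one TEI <entry> to {lemma: [one dict per <form>]}."""
    forms = []
    for form in entry.findall("form"):
        forms.append({
            "orth": form.findtext("orth"),
            "pron": form.findtext("pron"),
            # Each child of <gramGrp> becomes a key/value pair.
            "gramGrp": {child.tag: child.text for child in form.find("gramGrp")},
        })
    lemma = entry.find("form[@type='lemma']").findtext("orth")
    return {lemma: forms}


result = entry_to_dict(ET.fromstring(ENTRY))
print(json.dumps(result, ensure_ascii=False, indent=3))
```

Keying the JSON by the spelling of the lemma form gives direct access to every inflected form of a lemma, which suits a lookup driven by the lemmatizer's output.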

Credits

This repository is developed by Alexandre Bartz with the help of Simon Gabay, as part of the project e-ditiones.

Licences

Our work is licenced under a Creative Commons Attribution 4.0 International Licence.

Pie-extended is under the Mozilla Public License 2.0.

Morphalou is under the LGPL-LR.

Cite this repository

Alexandre Bartz, Simon Gabay. 2020. Lemmatization and normalization of French modern manuscripts and printed documents. Retrieved from https://github.com/e-ditiones/SEG17.
