IL Feedback #883

matyaskopp · 2024-11-22T13:20:54Z

Thanks for the great work on the corpora!
Please do not be scared of a long task list (everyone received it). I hope it will help you improve your corpus. I am ready to help and discuss any ambiguities or doubts, so do not hesitate to ask.

Are component filenames really unique

unique component files

The filenames (file IDs /TEI/@id) must be unique. I am not sure if multiple plenary/committee meetings can be held on the same day.
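Whether two components share an ID can be checked mechanically. A minimal sketch in Python, assuming the ID is the `xml:id` attribute on each component's root element; `duplicate_root_ids` and the sample file names are hypothetical and not part of the ParlaMint tooling:

```python
from collections import Counter
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def duplicate_root_ids(documents):
    """Return the root xml:id values that occur in more than one document.

    `documents` maps a file name to its XML content as a string.
    """
    counts = Counter(ET.fromstring(xml).get(XML_ID) for xml in documents.values())
    return sorted(i for i, n in counts.items() if i is not None and n > 1)

# Hypothetical sample: two committee meetings held on the same day
TEI = '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="{}"/>'
sample = {
    "a.xml": TEI.format("ParlaMint-IL_2021-12-21"),
    "b.xml": TEI.format("ParlaMint-IL_2021-12-21"),
    "c.xml": TEI.format("ParlaMint-IL_2021-12-22"),
}
print(duplicate_root_ids(sample))
```

An empty list means every component has its own ID; anything else names the IDs that need an extra distinguishing part.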

maintitle unique and also in Hebrew

unique
Hebrew translation

The text value of the main title in component files has to be unique within the corpus, and there should also be a Hebrew translation:
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2021/ParlaMint-IL_2021-12-21.xml#L9
So instead of reference corpus, you can use the date plus some more information that makes the title unique (because you encode committees too, I believe the date alone is not enough):

<title type="main" xml:lang="he"><!-- ... --> ParlaMint-IL, <!--date + some more info--> [ParlaMint]</title>
<title type="main" xml:lang="en">Israeli parliamentary corpus ParlaMint-IL, <!--date + some more info--> [ParlaMint]</title>

`<meeting>` element in plenaries

parla.term
parla.session ?
parla.meeting
parla.sitting

Values from the <meeting> elements are used in concordancers for filtering transcriptions, so the correct encoding is really important. See documentation: https://clarin-eric.github.io/ParlaMint/#exa-titleStmtComp
and also the taxonomy:

I believe this plenary hearing file: https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-03-12.xml#L12

<meeting ana="#parla.uni"/>

should be encoded this way

<meeting ana="#parla.term #parla.uni #period_18" n="18" corresp="#ParlaMint-IL-KNESS">הכנסת ה-18</meeting>
<meeting ana="#parla.meeting #parla.uni" n="5" corresp="#ParlaMint-IL-KNESS">ישיבה מס' 5</meeting>
<meeting ana="#parla.sitting #parla.uni" n="2009-03-12" corresp="#ParlaMint-IL-KNESS">2009-03-12</meeting>
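If the term number, meeting number and date are available as metadata, the three elements above can be generated rather than written by hand. A rough sketch, assuming exactly the attribute values proposed above; `plenary_meetings` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

def plenary_meetings(term, meeting_no, date, org="#ParlaMint-IL-KNESS"):
    """Build the three <meeting> elements (term, meeting, sitting) for one plenary file."""
    specs = [
        (f"#parla.term #parla.uni #period_{term}", str(term), f"הכנסת ה-{term}"),
        ("#parla.meeting #parla.uni", str(meeting_no), f"ישיבה מס' {meeting_no}"),
        ("#parla.sitting #parla.uni", date, date),
    ]
    elements = []
    for ana, n, text in specs:
        el = ET.Element("meeting", {"ana": ana, "n": n, "corresp": org})
        el.text = text
        elements.append(el)
    return elements

for el in plenary_meetings(18, 5, "2009-03-12"):
    print(ET.tostring(el, encoding="unicode"))
```

The elements are created without a namespace here; in a real pipeline they would be added to the TEI tree that is being built.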

`<meeting>` element in committees

parla.term
parla.meeting
parla.sitting

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2021/ParlaMint-IL_2021-12-21.xml#L12
can be encoded this way:

<meeting ana="#parla.term #parla.committee #period_24" n="24" >הכנסת ה-24</meeting><!-- no corresp attribute, we don't have org -->
<meeting ana="#parla.meeting #parla.committee" n="????">ישיבה מס' ????</meeting><!-- no corresp attribute, we don't have org -->
<meeting ana="#parla.sitting #parla.committee" n="2021-12-21">2021-12-21</meeting><!-- no corresp attribute, we don't have org -->

It is a pity you do not have committee organizations and texts so they can be linked. ParlaMint-BE has committee meetings too (but no <org>anization). On the other hand CZ and HU have organizations but not corresponding texts. It would be great to have one corpus that has both :-)

`<meeting>` element in teiCorpus

parla.term

There should be a list of terms in the <meeting> elements in corpus root files, like this:

ParlaMint/Samples/ParlaMint-AT/ParlaMint-AT.xml

Lines 10 to 17 in f9a0b6a

    
           <meeting n="27" corresp="#NR" ana="#parla.lower #parla.term #NR.XXVII"/>
           <meeting n="26" corresp="#NR" ana="#parla.lower #parla.term #NR.XXVI"/>
           <meeting n="25" corresp="#NR" ana="#parla.lower #parla.term #NR.XXV"/>
           <meeting n="24" corresp="#NR" ana="#parla.lower #parla.term #NR.XXIV"/>
           <meeting n="23" corresp="#NR" ana="#parla.lower #parla.term #NR.XXIII"/>
           <meeting n="22" corresp="#NR" ana="#parla.lower #parla.term #NR.XXII"/>
           <meeting n="21" corresp="#NR" ana="#parla.lower #parla.term #NR.XXI"/>
           <meeting n="20" corresp="#NR" ana="#parla.lower #parla.term #NR.XX"/>

annotation of the file `TEI/@ana`

add #parla.sitting into TEI/@ana

Add #parla.sitting to TEI/@ana if one file corresponds to one sitting. The #parla.meeting value can be used instead if sittings correspond one-to-one to meetings.

bibliography

date
~~idno URL~~ - texts available online, but the source is a different corpus that does not preserve this information

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-03-12.xml#L52-L53

           <bibl>
               <title type="main" xml:lang="he">פרוטוקולים של הכנסת</title>
               <title type="main" xml:lang="en">Knesset Protocols</title>
               <idno type="URI" subtype="parliament">https://www.knesset.gov.il</idno>
               <date from="1993-07-12" to="2024-04-03">1993-07-12 - 2024-04-03</date>
            </bibl>

The <date> should contain the correct single day, when="2009-03-12": either the day the text was made public or the meeting date.
The URL should point to the proper source of the transcription (if available), so everyone can see the source that you have transformed into the corpus.

           <bibl>
               <title type="main" xml:lang="he">פרוטוקולים של הכנסת</title>
               <title type="main" xml:lang="en">Knesset Protocols</title>
               <idno type="URI" subtype="parliament">https://www.knesset.gov.il<!-- more concrete URL to the source --></idno>
               <date when="2009-03-12">2009-03-12</date>
            </bibl>

settingDesc

<setting>

Take a look at examples from other corpora:

ParlaMint/Samples/ParlaMint-AT/2022/ParlaMint-AT_2022-10-12-027-XXVII-NRSITZ-00178.xml

Lines 111 to 119 in 9946040

    
           <settingDesc>
              <setting>
                 <name type="city" xml:lang="de">Wien</name>
                 <name type="city" xml:lang="en">Vienna</name>
                 <name type="country" key="AT" xml:lang="en">Austria</name>
                 <name type="country" key="AT" xml:lang="de">Österreich</name>
                 <date ana="#parla.sitting" when="2022-10-12">2022-10-12</date>
              </setting>
           </settingDesc>

ParlaMint/Samples/ParlaMint-CZ/2023/ParlaMint-CZ_2023-07-26-ps2021-071-07-000-000.xml

Lines 106 to 114 in 9946040

    
           <settingDesc>
              <setting>
                 <name type="org">Parlament České republiky - Poslanecká sněmovna</name>
                 <name type="address">Sněmovní 176/4</name>
                 <name type="city">Praha</name>
                 <name key="CZ" type="country">Česká republika</name>
                 <date when="2023-07-26" ana="#parla.sitting">2023-07-26</date>
              </setting>
           </settingDesc>

ParlaMint/Samples/ParlaMint-SI/2022/ParlaMint-SI_2022-04-06-SDZ8-Izredna-99.xml

Lines 95 to 101 in 9946040

    
           <settingDesc>
              <setting>
                 <name type="city">Ljubljana</name>
                 <name type="country" key="SI">Slovenija</name>
                 <date when="2022-04-06" ana="#parla.sitting">6. 4. 2022</date>
              </setting>
           </settingDesc>

ID format

u/@id
seg/@id
s/@id
w/@id and pc/@id

I know that the ID values are just for technical purposes, but consider changing them to the form most corpora use, something like
{file_id}.u{utteranceN}.p{paragraphN}.s{sentenceN}.w{tokenN} (the Czech style of creating IDs).
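As an illustration of that pattern, a small sketch that composes such IDs level by level. `make_id` is a hypothetical helper, and the numbering (1-based here) is an assumption; the pattern above does not fix it:

```python
def make_id(file_id, u, p=None, s=None, w=None):
    """Compose a hierarchical ID; deeper levels are appended only when given."""
    parts = [file_id, f"u{u}"]
    for prefix, number in (("p", p), ("s", s), ("w", w)):
        if number is None:
            break
        parts.append(f"{prefix}{number}")
    return ".".join(parts)

print(make_id("ParlaMint-IL_2009-10-21", 1))
print(make_id("ParlaMint-IL_2009-10-21", 1, 1, 1, 3))
```

Because each ID is a prefix of the IDs below it, the utterance or sentence a token belongs to can be read directly from the token ID.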

changed ids in annotated version

no id changes

For technical reasons, we want utterance and segment IDs to be preserved in the annotated version (they should be identical to those in the plain version). When you annotate the corpus, you are only enriching it, not changing existing content.
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21.xml#L97-L100

            <u xml:id="u.session.18_ptv_139208_doc.0"
               who="#person.526"
               ana="#chair">
               <seg xml:id="seg.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9">שלום לכולם, אני פותח את הדיון. על סדר היום – העברות תקציביות. פניה מספר 238 מדובר על מנהלת סל"ע.</seg>

vs annotated version:
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21.ana.xml#L103-L106

            <u xml:id="u.session.18_ptv_139208_doc-1.0"
               who="#person.526"
               ana="#chair">
               <seg xml:id="seg.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9-2">
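A quick way to spot such differences is to compare the `<u>` and `<seg>` IDs of both versions in document order. A minimal sketch, assuming both files contain the same utterances and segments in the same order; the function names and the shortened sample IDs are hypothetical:

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def u_and_seg_ids(xml_text):
    """Collect xml:id values of all <u> and <seg> elements in document order."""
    root = ET.fromstring(xml_text)
    return [el.get(XML_ID) for el in root.iter() if el.tag in (TEI_NS + "u", TEI_NS + "seg")]

def changed_ids(plain_xml, ana_xml):
    """Return (plain, annotated) ID pairs that differ between the two versions."""
    return [(a, b) for a, b in zip(u_and_seg_ids(plain_xml), u_and_seg_ids(ana_xml)) if a != b]

# Shortened, hypothetical IDs that mimic the difference shown above
plain = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><u xml:id="u.doc.0"><seg xml:id="seg.id-7092"/></u></TEI>'
ana = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><u xml:id="u.doc-1.0"><seg xml:id="seg.id-7092-2"/></u></TEI>'
print(changed_ids(plain, ana))
```

An empty result means the annotated version kept all utterance and segment IDs.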

syntactic vs orthographic words

syntactic words

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21.ana.xml#L107-L130

                  <s xml:id="s.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9-3">
                     <w lemma="שלום"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.0"
                        join="right">שלום</w>
                     <w lemma="_"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.1"
                        join="right">לכולם</w>
                     <w lemma="ל"
                        pos="ADP"
                        msd="UPosTag=ADP"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.2"
                        join="right">ל</w>
                     <w lemma="כולם"
                        pos="NOUN"
                        msd="UPosTag=NOUN|Gender=Masc|Number=Plur"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.3"
                        join="right">כולם</w>
                     <pc xml:id="w.session.18_ptv_139208_doc-1.0.1.4"
                         msd="UPosTag=PUNCT"
                         join="right">,</pc>

annotation with udpipe for easier illustration:

it should be encoded this way:

                  <s xml:id="s.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9-3">
                     <w lemma="שלום"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.0">שלום</w> <!-- removed join="right" -->
<!-- REMOVED:
                     <w lemma="_"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.1"
                        join="right">לכולם</w>
-->
                     <w lemma="ל"
                        pos="ADP"
                        msd="UPosTag=ADP"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.2"
                        join="right">ל</w>
                     <w lemma="כולם"
                        pos="NOUN"
                        msd="UPosTag=NOUN|Gender=Masc|Number=Plur"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.3"
                        join="right">כולם</w>
                     <pc xml:id="w.session.18_ptv_139208_doc-1.0.1.4"
                         msd="UPosTag=PUNCT"
                         join="right">,</pc>

Or the ways documented here: https://clarin-eric.github.io/ParlaMint/#sec-ana-norm

Not sure... "לכולם" does not have a lemma....
@TomazErjavec please help me here. We need to be able to convert it into conllu and vert. It would also be great if it would be possible to search it as one word for users...

named entities

Are there any multi-word named entities?

I have found only single-word named entities that were adjacent, like this:

                     <name type="MISC">
                        <w lemma="כנסת"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.18"
                           join="right">כנסת</w>
                     </name>
                     <name type="PER">
                        <w lemma="שוש"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.19"
                           join="right">שוש</w>
                     </name>
                     <name type="PER">
                        <w lemma="כרם"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.20"
                           join="right">כרם</w>
                     </name>

taxonomies

common taxonomies
ParlaMint-IL specific taxonomies

There are two types of taxonomies:

common ones, with the ParlaMint-taxonomy- prefix, where no changes are allowed and only translation is required (except UD-SYN)
country-specific ones, in your case with this prefix: ParlaMint-IL-taxonomy-

You have changed the content of the common taxonomies and also the filenames, so the taxonomies do not match the ParlaMint ones.

You can initialize common taxonomies with this command. Run it in the repository root folder:

make initTaxonomies4translation-IL

It creates the taxonomies in Samples/ParlaMint-IL and puts placeholders where the translations should appear (it overwrites existing files if the filename is the same).

If you have the correct filename and IDs, you can use this sequence to prefill your translations:

# save your translations in the common translations
make translateTaxonomies-IL
# initialize taxonomies from common ones (if the translation exists, then it is used; otherwise, uses placeholder)
make initTaxonomies4translation-IL
# revert changes in common taxonomies (you are not allowed to change this folder - it is my job)
git checkout Build/Taxonomies/

languages

<langUsage>

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL.ana.xml#L143-L146

         <langUsage>
            <language ident="he">עברית</language>
            <language ident="en">English</language>
         </langUsage>

Each language (@ident) should be named in both languages (@xml:lang), like this:

ParlaMint/Samples/ParlaMint-CZ/ParlaMint-CZ.ana.xml

Lines 147 to 152 in f9a0b6a

    
           <langUsage>
              <language ident="cs" xml:lang="cs">čeština</language>
              <language ident="en" xml:lang="cs">angličtina</language>
              <language ident="cs" xml:lang="en">Czech</language>
              <language ident="en" xml:lang="en">English</language>
           </langUsage>

invalid label content

org/listEvent/event/label content

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL-listOrg.xml#L420-L427

    <listEvent>
      <event xml:id="org.13_period_1" from="1949-02-14" to="1969-01-28">
        <label xml:lang="en">&lt;Element {http://www.tei-c.org/ns/1.0}orgName at 0x2ccab29f6c0&gt;_period_1</label>
      </event>
      <event xml:id="org.13_period_2" from="1984-10-22" to="1992-03-09">
        <label xml:lang="en">&lt;Element {http://www.tei-c.org/ns/1.0}orgName at 0x2ccab29f6c0&gt;_period_2</label>
      </event>
    </listEvent>

This appears in multiple organizations; the above is just a sample.
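The label text looks like the printed form of an element object rather than its content, which typically happens when the element itself is put into a format string. A sketch of that suspected cause with the standard library (an assumption about the generating script, which is not shown here; the intended label wording is also a guess, and lxml prints elements slightly differently from `xml.etree`):

```python
import xml.etree.ElementTree as ET

org = ET.fromstring('<org><orgName full="yes">הליכוד</orgName></org>')
org_name = org.find("orgName")

broken = f"{org_name}_period_1"      # formats the element object itself
fixed = f"{org_name.text}_period_1"  # formats the element's text content

print(broken)
print(fixed)
```

Searching the generated listOrg for `&lt;Element` afterwards is a cheap regression check.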

abbreviated form is longer than full

org/orgName

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL-listOrg.xml#L647-L648

  <org xml:id="org.31" role="parliamentaryGroup">
    <orgName full="yes">הליכוד</orgName>
    <orgName full="abb">גוש חירות ליברלים (גח"ל)</orgName>

This appears in multiple organizations; the above is just a sample.
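To list all affected organizations, the lengths of the two name forms can be compared. A minimal sketch, assuming one `full="yes"` and one `full="abb"` name per organization (with several language variants, the last one of each kind wins); `suspicious_abbreviations` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def suspicious_abbreviations(list_org_xml):
    """Return IDs of organizations whose abbreviated name is longer than the full name."""
    root = ET.fromstring(list_org_xml)
    hits = []
    for org in root.iter(TEI_NS + "org"):
        names = {n.get("full"): (n.text or "") for n in org.findall(TEI_NS + "orgName")}
        if "yes" in names and "abb" in names and len(names["abb"]) > len(names["yes"]):
            hits.append(org.get(XML_ID))
    return hits

sample = """<listOrg xmlns="http://www.tei-c.org/ns/1.0">
  <org xml:id="org.31" role="parliamentaryGroup">
    <orgName full="yes">הליכוד</orgName>
    <orgName full="abb">גוש חירות ליברלים (גח"ל)</orgName>
  </org>
</listOrg>"""
print(suspicious_abbreviations(sample))
```

A longer abbreviation is only a hint that the two values may be swapped or come from different periods, so the hits still need a manual look.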

independent MP forms parliamentary group

independent MP

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL-listOrg.xml#L2045-L2051

  <org xml:id="org.153" role="parliamentaryGroup">
    <orgName full="yes">ח"כ עצמאי - הלל קוק</orgName>
    <orgName full="abb">ח"כ עצמאי - הלל קוק</orgName>
    <event from="1951-02-20">
      <label xml:lang="en">existence</label>
    </event>
  </org>

Approx. 30 occurrences.

This solution makes it possible to give an independent MP a political orientation, but it is really strange. We probably have to find a better solution. (@TomazErjavec ??)

only member affiliations

various affiliation roles

The corpus contains only member roles. Is it possible to add other roles? See https://clarin-eric.github.io/ParlaMint/#sec-affiliation

unknown person name

unknown person

  <person xml:id="person.3ae50273-a4de-47e6-99c8-71db4881d15d">
    <persName>
      <forename>Unknown</forename>
      <surname>קריאה</surname>
    </persName>
    <sex value="U"/>
  </person>
  <person xml:id="person.6311c681-5a63-4e29-b7f0-b76f0b3a0a6a">
    <persName>
      <forename>Unknown</forename>
      <surname>קריאות</surname>
    </persName>
    <sex value="U"/>
  </person>
  <person xml:id="person.31227de9-062c-4ff1-a01c-78f9dd30418c">
    <persName>
      <forename>Unknown</forename>
      <surname>Unknown</surname>
    </persName>
    <sex value="U"/>
  </person>

It is not necessary to fill in both forename and surname if they are unknown. If the person is completely unknown, they should not have a person record in listPerson (you can also omit the @who attribute on the utterance).


TomazErjavec · 2024-11-23T17:37:03Z

@TomazErjavec please help me here. We need to be able to convert it into conllu and vert. It would also be great if it would be possible to search it as one word for users...

If I understand correctly, "לכולם" is a surface word that corresponds to two syntactic words "ל" and "כולם".
If this is the case, then this corresponds to the "abyste" example from https://clarin-eric.github.io/ParlaMint/#sec-ana-norm, so

<w join="right">לכולם
  <w norm="ל" lemma="ל" pos="ADP" msd="UPosTag=ADP"/>
  <w norm="כולם" lemma="כולם"  pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur"/>
</w>

(the join=right is because comma follows).

A similar case from the IT sample would be

ParlaMint/Samples/ParlaMint-IT/2018/ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.ana.xml

Line 490 in f9a0b6a

    
           <w xml:id="ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.ana.seg4.2.21-22" join="right">nell'<w xml:id="ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.ana.seg4.2.21" norm="in" lemma="in" pos="E" msd="UPosTag=ADP"/><w xml:id="ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.ana.seg4.2.22" norm="l'" lemma="il" pos="RD" msd="UPosTag=DET|Definite=Def|Number=Sing|PronType=Art"/></w>

This gets converted to CoNLL-U like:

ParlaMint/Samples/ParlaMint-IT/2018/ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.conllu

Lines 197 to 199 in f9a0b6a

    
           21-22	nell'	_	_	_	_	_	_	_	NER=O|SpaceAfter=No
           21	in	in	ADP	E	_	23	case	_	_
           22	l'	il	DET	RD	Definite=Def|Number=Sing|PronType=Art	23	det	_	_

and to vert like

ParlaMint/Samples/ParlaMint-IT/2018/ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.vert

Line 202 in f9a0b6a

    
           nell'	in|l'	in|il	ADP|DET	-|Definite=Def Number=Sing PronType=Art	21|22	case|det	assetto	NOUN	Gender=Masc Number=Sing	23

Note that vert files (exactly for cases like this) have multivalued attributes on norm, lemma, etc. Not ideal, but it is the best we can do with vertical files.
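To illustrate how the nested encoding maps to a CoNLL-U multiword token, here is a simplified sketch. It is not the ParlaMint conversion script: `multiword_to_conllu` is a hypothetical helper, it fills only ID, FORM, LEMMA, UPOS, XPOS and FEATS, and it assumes the `msd` value always starts with `UPosTag=`:

```python
import xml.etree.ElementTree as ET

def multiword_to_conllu(w_xml, first_index):
    """Return CoNLL-U lines (range line plus one line per syntactic word) for a nested <w>.

    HEAD, DEPREL, DEPS and MISC are left as "_" in this sketch.
    """
    outer = ET.fromstring(w_xml)
    parts = list(outer)
    last_index = first_index + len(parts) - 1
    surface = (outer.text or "").strip()
    lines = ["\t".join([f"{first_index}-{last_index}", surface] + ["_"] * 8)]
    for offset, part in enumerate(parts):
        feats = part.get("msd", "").split("|")
        upos = feats[0].split("=", 1)[-1]   # "UPosTag=ADP" -> "ADP"
        rest = "|".join(feats[1:]) or "_"   # remaining features, if any
        lines.append("\t".join([
            str(first_index + offset), part.get("norm"), part.get("lemma"),
            upos, part.get("pos", "_"), rest, "_", "_", "_", "_",
        ]))
    return lines

sample = """<w join="right">לכולם
  <w norm="ל" lemma="ל" pos="ADP" msd="UPosTag=ADP"/>
  <w norm="כולם" lemma="כולם" pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur"/>
</w>"""
for line in multiword_to_conllu(sample, 2):
    print(line)
```

The surface form appears only on the range line, so users can still search for it as one word, while the syntactic words carry the lemma and morphology.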

matyaskopp · 2024-11-24T21:48:59Z

random person check דורון אביטל

government affiliation of דורון אביטל (https://he.wikipedia.org/wiki/%D7%93%D7%95%D7%A8%D7%95%D7%9F_%D7%90%D7%91%D7%99%D7%98%D7%9C)

I have checked random person: https://github.com/GiliGoldin/ParlaMint/blob/8040ae5cd6579b7e4f414517a766ce5ce8b93f74/Samples/ParlaMint-IL/ParlaMint-IL-listPerson.xml#L241-L254

  <person xml:id="person.18990">
    <persName>
      <forename>דורון</forename>
      <surname>אביטל</surname>
    </persName>
    <sex value="M"/>
    <birth when="1959-01-22">
      <placeName>ישראל</placeName>
    </birth>
    <affiliation ref="#org.122" role="member" from="2011-03-18" to="2013-02-05"/>
    <affiliation ref="#ParlaMint-IL-KNESS" role="member" from="2011-03-18" to="2013-02-05"/>
    <affiliation ref="#ParlaMint-IL-GOV" role="member" from="2011-03-18" to="2013-02-05"/>
    <affiliation ref="#ParlaMint-IL-GOV" role="minister" from="2011-03-18" to="2013-02-05"/>
  </person>

His parliamentary group status at the time of membership:

    <relation name="opposition" active="#org.122" passive="#ParlaMint-IL-GOV" from="2009-03-31" to="2012-05-08"/>
    <relation name="coalition" mutual="#org.122" from="2012-05-08" to="2012-07-17"/>
    <relation name="opposition" active="#org.122" passive="#ParlaMint-IL-GOV" from="2012-07-17" to="2013-02-05"/>

There are some oddities:

he is a member of the government and in the opposition at the same time
the government membership has the same timespan as the parliament membership (in Czechia, it takes some time, weeks to months, to become a minister after becoming a member of parliament)
Wikipedia does not say he was a minister

Not sure if you understand the concept of members of the government in ParlaMint. It seems that all parliament members who are affiliated with the parliamentary group in the coalition are members of the government.
https://clarin-eric.github.io/ParlaMint/#sec-affiliation
A member of government is someone who has some position in government (not everyone from the coalition)
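The first oddity can be detected automatically by intersecting a person's government affiliation with the periods in which their group was in opposition. A minimal sketch using the dates quoted above; `overlap` is a hypothetical helper, and reading the dates out of listPerson and listOrg is left out:

```python
from datetime import date

def overlap(a_from, a_to, b_from, b_to):
    """Return the overlapping (start, end) of two date ranges, or None if they do not overlap."""
    start, end = max(a_from, b_from), min(a_to, b_to)
    return (start, end) if start < end else None

d = date.fromisoformat
government = (d("2011-03-18"), d("2013-02-05"))   # affiliation to #ParlaMint-IL-GOV
opposition = [                                     # opposition relations of #org.122
    (d("2009-03-31"), d("2012-05-08")),
    (d("2012-07-17"), d("2013-02-05")),
]
conflicts = [overlap(*government, *span) for span in opposition]
conflicts = [c for c in conflicts if c]
for start, end in conflicts:
    print(f"government member while group in opposition: {start} to {end}")
```

Running the same comparison over all persons would show how many affiliations are affected.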

matyaskopp · 2024-11-25T06:35:42Z

INVALID @matyaskopp fault:

<meeting> element in teiCorpus

non-unique meeting element in teiCorpus

<meeting> element should be unique within the file, there are repetitions in a corpus root file: https://github.com/GiliGoldin/ParlaMint/blob/8040ae5cd6579b7e4f414517a766ce5ce8b93f74/Samples/ParlaMint-IL/ParlaMint-IL.xml#L13-L36

            <meeting n="14"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_14"
                     xml:lang="he">הכנסת ה-14</meeting>
            <meeting n="14"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_14"
                     xml:lang="en">14th Knesset</meeting>
            <meeting n="18"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_18"
                     xml:lang="he">הכנסת ה-18</meeting>
            <meeting n="18"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_18"
                     xml:lang="en">18th Knesset</meeting>
            <meeting n="24"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_24"
                     xml:lang="he">הכנסת ה-24</meeting>
            <meeting n="24"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_24"
                     xml:lang="en">24th Knesset</meeting>

GiliGoldin · 2024-11-25T07:09:54Z

Sorry I don't understand. What is the repetition? Shouldn't there be a meeting for each term? It's written in Hebrew and in English. How should it be then?


matyaskopp · 2024-11-25T07:13:22Z

Sorry I don't understand. What is the repetition? Shouldn't there be a meeting for each term? It's written in Hebrew and in English. How should it be then?

Of course, you are right.
Your encoding is correct!

matyaskopp · 2024-11-25T07:15:57Z

extra files

remove from repository:

ParlaMint-IL.teiCorpus.xml
ParlaMint-IL.ana.teiCorpus.xml
Scripts/bin/saxon-ee-11.6.jar
Scripts/bin/saxon-ee-11.6.jar:Zone.Identifier
Scripts/bin/trang-20220510/copying.txt
Scripts/bin/trang-20220510/trang-manual.html
Scripts/bin/trang-20220510/trang.jar
Scripts/bin/trang.jar (revert change)

GiliGoldin · 2024-11-25T08:22:38Z

You are right, it seems that I inserted the time of his faction membership as the time in the government instead of the time in the coalition. I will fix this. I did assign each coalition member as a government member and as a minister. I see now that this is a mistake, I will remove the minister role since I don't have the information regarding the roles of the ministers and government positions. In our corpus we do consider all the people in the coalition to be government members.


matyaskopp · 2024-11-26T07:23:30Z

languages

https://github.com/GiliGoldin/ParlaMint/blob/199b869dd0c2734124a98663193da1a6ad972007/Samples/ParlaMint-IL/ParlaMint-IL.xml#L144-L149

         <langUsage>
            <language ident="he">עברית</language>
            <language ident="en">אנגלית</language>
            <language ident="he">Hebrew</language>
            <language ident="en">English</language>
         </langUsage>

should be:

         <langUsage>
            <language ident="he" xml:lang="he">עברית</language>
            <language ident="en" xml:lang="he">אנגלית</language>
            <language ident="he" xml:lang="en">Hebrew</language>
            <language ident="en" xml:lang="en">English</language>
         </langUsage>
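A small check for this requirement, assuming the corpus languages are `he` and `en` and that every combination of @ident and @xml:lang must be present; `missing_language_pairs` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET
from itertools import product

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def missing_language_pairs(lang_usage_xml, languages=("he", "en")):
    """Return the (ident, xml:lang) combinations that have no <language> element."""
    root = ET.fromstring(lang_usage_xml)
    present = {
        (el.get("ident"), el.get(XML_LANG))
        for el in root.iter()
        if el.tag.endswith("language")
    }
    return sorted(set(product(languages, repeat=2)) - present)

original = """<langUsage>
  <language ident="he">עברית</language>
  <language ident="en">אנגלית</language>
  <language ident="he">Hebrew</language>
  <language ident="en">English</language>
</langUsage>"""
corrected = """<langUsage>
  <language ident="he" xml:lang="he">עברית</language>
  <language ident="en" xml:lang="he">אנגלית</language>
  <language ident="he" xml:lang="en">Hebrew</language>
  <language ident="en" xml:lang="en">English</language>
</langUsage>"""
print(missing_language_pairs(original))
print(missing_language_pairs(corrected))
```

Without @xml:lang, none of the four combinations is recognised, which is why the first version is reported as incomplete.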

matyaskopp · 2024-11-26T10:02:59Z

taxonomies

There are still some taxonomies which are IL-specific or not linked:

Samples/ParlaMint-IL/ParlaMint-taxonomy-roles.xml
Samples/ParlaMint-IL/ParlaMint-taxonomy-sessionTypes.xml

I guess they can be removed

matyaskopp · 2024-11-26T10:37:57Z

Thanks for the great progress; I have ticked what has been resolved so far. If anything is unclear, please ask.

You are right, it seems that I inserted the time of his faction membership as the time in the government instead of the time in the coalition. I will fix this.
I did assign each coalition member as a government member and as a minister. I see now that this is a mistake, I will remove the minister role since I don't have the information regarding the roles of the ministers and government positions. In our corpus we do consider all the people in the coalition to be government members.

Well, you made more changes than just removing ministers and fixing the beginnings of timespans in 199b869; see Netanyahu:

Some removals seem to be correct (e.g. Netanyahu was not in government with Bennett). I hope you are aware of these changes and that they were a bugfix, not an accidental removal.

The government beginnings seem to be okay (if the start of the coalition is the start of the government), but you now most probably have time spans without any government, because you have shifted only the beginnings (the old government still works after the new MPs take the parliamentary oath).
I believe you want to make a ParlaMint-comparable corpus (not just one that uses the ParlaMint encoding), so I suggest not sticking to your source corpus but rather extending it with more metadata. All Israeli governments are easy to find on Wikipedia. Would it be a solution to use this data?
https://en.wikipedia.org/wiki/Cabinet_of_Israel#List_of_cabinets

We have a script for enriching tei with tsv data:

script https://github.com/clarin-eric/ParlaMint/blob/main/Build/Scripts/ministers-tsv2tei.xsl
sample input: https://github.com/clarin-eric/ParlaMint/blob/main/Build/Sources-TSV/ParlaMint-ES-PV/Ministers-ES-PV.edited.tsv
so you can use it.

GiliGoldin · 2024-11-28T11:15:30Z

bibliography

date

idno URL

Url should contain the proper source of the transcription (if available), so everyone can see the source that you have transformed to corpus.

           <bibl>
               <title type="main" xml:lang="he">פרוטוקולים של הכנסת</title>
               <title type="main" xml:lang="en">Knesset Protocols</title>
               <idno type="URI" subtype="parliament">https://www.knesset.gov.il<!-- more concrete URL to the source --></idno>
               <date when="2009-03-12">2009-03-12</date>
            </bibl>

The sources can be found online but I don't have this specific URL information since we didn't process the files directly from the website, we received them in email directly from the Knesset archivists.

matyaskopp · 2024-11-28T12:04:15Z

The sources can be found online but I don't have this specific URL information since we didn't process the files directly from the website, we received them in email directly from the Knesset archivists.

Okay, it's a shame. You can add it to your checklist for improving your source corpus.
I highly recommend that everyone include the source link: it is beneficial not only for development purposes but also for validating the correctness and completeness of your data (that takes non-trivial effort, but it is doable).

GiliGoldin · 2024-12-01T08:27:12Z

Well, you made more changes than just removing ministers and fixing the beginnings of timespans in 199b869; see Netanyahu:

Some removals seem to be correct (e.g. Netanyahu was not in government with Bennett). I hope you are aware of these changes and that they were a bugfix, not an accidental removal.

The government beginnings seem to be okay (if the start of the coalition is the start of the government), but now you most probably have time spans without a government, because you have shifted only the beginnings (the old government keeps working after the new MPs take the parliamentary oath). I believe you want to build a corpus that is comparable with the other ParlaMint corpora (not one that just uses the ParlaMint encoding), so I suggest not sticking to your source corpus but rather extending it with more metadata. All Israeli governments are easy to find on Wikipedia. Would using this data be a solution? https://en.wikipedia.org/wiki/Cabinet_of_Israel#List_of_cabinets

We have a script for enriching TEI with TSV data:

script: https://github.com/clarin-eric/ParlaMint/blob/main/Build/Scripts/ministers-tsv2tei.xsl

sample input: https://github.com/clarin-eric/ParlaMint/blob/main/Build/Sources-TSV/ParlaMint-ES-PV/Ministers-ES-PV.edited.tsv
so you can use it.

I made sure to use the coalition dates rather than the faction membership dates. This caused all the mentioned changes, which are correct now. The start of the coalition membership is the start of the government membership, not the parliamentary oath, but yes, the end will be the end of the coalition membership.
I did see and fix a mistake in the coalition start_date 22.11.88 instead of 22.12.88, but the content of the corpus and its accuracy should be our responsibility. I am of course making every effort for the corpus to follow the ParlaMint formatting and instructions.
I manually added the prime ministers as roles in the government, with the dates in the role. I hope this will satisfy the request for roles other than members, and also ensure that there are no gaps without a government at all.

GiliGoldin · 2024-12-01T08:34:56Z

annotation with udpipe for easier illustration:

it should be encoded this way:

                  <s xml:id="s.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9-3">
                     <w lemma="שלום"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.0">שלום</w> <!-- removed join="right" -->
<!-- REMOVED:
                     <w lemma="_"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.1"
                        join="right">לכולם</w>
-->
                     <w lemma="ל"
                        pos="ADP"
                        msd="UPosTag=ADP"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.2"
                        join="right">ל</w>
                     <w lemma="כולם"
                        pos="NOUN"
                        msd="UPosTag=NOUN|Gender=Masc|Number=Plur"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.3"
                        join="right">כולם</w>
                     <pc xml:id="w.session.18_ptv_139208_doc-1.0.1.4"
                         msd="UPosTag=PUNCT"
                         join="right">,</pc>

Or the ways documented here: https://clarin-eric.github.io/ParlaMint/#sec-ana-norm

This was fixed according to what TomazErjavec suggested.

named entities

Are there any multi-word named entities?

I have found only single-word named entities that were adjacent, like this:

                     <name type="MISC">
                        <w lemma="כנסת"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.18"
                           join="right">כנסת</w>
                     </name>
                     <name type="PER">
                        <w lemma="שוש"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.19"
                           join="right">שוש</w>
                     </name>
                     <name type="PER">
                        <w lemma="כרם"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.20"
                           join="right">כרם</w>
                     </name>

This was fixed
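
For anyone who wants to re-check this on the full data, here is a minimal Python sketch that lists directly adjacent `<name>` siblings of the same type as candidates for multi-word entities. It is an illustration only: `adjacent_same_type_names` and `surface` are hypothetical helpers, the sample is a stripped-down version of the snippet above, and adjacency with the same type does not prove that two names form one entity, so the output is a list for manual review.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a namespace such as {http://www.tei-c.org/ns/1.0} from a tag."""
    return tag.rsplit("}", 1)[-1]


def adjacent_same_type_names(parent):
    """Yield (left, right) pairs of sibling <name> elements that directly
    follow each other and share the same @type."""
    children = list(parent)
    for left, right in zip(children, children[1:]):
        if (local(left.tag) == "name" and local(right.tag) == "name"
                and left.get("type") == right.get("type")):
            yield left, right


def surface(name):
    """Text content of a <name>, with whitespace normalised."""
    return " ".join("".join(name.itertext()).split())


# Simplified sample: attributes other than @type are omitted.
SAMPLE = """<s>
  <name type="MISC"><w>כנסת</w></name>
  <name type="PER"><w>שוש</w></name>
  <name type="PER"><w>כרם</w></name>
</s>"""

sentence = ET.fromstring(SAMPLE)
candidates = [(left.get("type"), surface(left), surface(right))
              for left, right in adjacent_same_type_names(sentence)]
print(candidates)
```

Here only the two `PER` names are reported; the `MISC` name next to a `PER` name is not, because the types differ.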

taxonomies

common taxonomies

ParlaMint-IL specific taxonomies

There are two types of taxonomies:

common ones, with the ParlaMint-taxonomy- prefix, where no changes are allowed and only translation is required (except UD-SYN)

country-specific ones, in your case with the prefix ParlaMint-IL-taxonomy-

You have changed the content of the common taxonomies and also the filenames, so the taxonomies do not match the ParlaMint ones.

You can initialize common taxonomies with this command. Run it in the repository root folder:
make initTaxonomies4translation-IL
It creates taxonomies in Samples/ParlaMint-IL and puts placeholders where the translations should appear (it overwrites existing files that have the same filename).

If you have the correct filename and IDs, you can use this sequence to prefill your translations:

# save your translations in the common translations
make translateTaxonomies-IL
# initialize taxonomies from common ones (if the translation exists, then it is used; otherwise, uses placeholder)
make initTaxonomies4translation-IL
# revert changes in common taxonomies (you are not allowed to change this folder - it is my job)
git checkout Build/Taxonomies/

This was fixed. There are no IL-specific taxonomies anymore.

languages

<langUsage>

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL.ana.xml#L143-L146
         <langUsage>
            <language ident="he">עברית</language>
            <language ident="en">English</language>
         </langUsage>
The names of both languages (@ident) should be given in both languages (@xml:lang), like this (ParlaMint/Samples/ParlaMint-CZ/ParlaMint-CZ.ana.xml, lines 147 to 152 in f9a0b6a):

         <langUsage>
            <language ident="cs" xml:lang="cs">čeština</language>
            <language ident="en" xml:lang="cs">angličtina</language>
            <language ident="cs" xml:lang="en">Czech</language>
            <language ident="en" xml:lang="en">English</language>
         </langUsage>

This was fixed

independent MP forms parliamentary group

independent MP

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL-listOrg.xml#L2045-L2051
  <org xml:id="org.153" role="parliamentaryGroup">
    <orgName full="yes">ח"כ עצמאי - הלל קוק</orgName>
    <orgName full="abb">ח"כ עצמאי - הלל קוק</orgName>
    <event from="1951-02-20">
      <label xml:lang="en">existence</label>
    </event>
  </org>
approx 30 occurrences.

This solution makes it possible to assign a political orientation to an independent MP, but it is really strange. We probably have to find a better solution. (@TomazErjavec ??)

Those are factions that consist of only one independent MP, but they are considered a regular faction/political party in the parliament. I don't see why they should be stored differently.

matyaskopp · 2024-12-02T08:52:35Z

languages

<langUsage>

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL.ana.xml#L143-L146
         <langUsage>
            <language ident="he">עברית</language>
            <language ident="en">English</language>
         </langUsage>
The names of both languages (@ident) should be given in both languages (@xml:lang), like this (ParlaMint/Samples/ParlaMint-CZ/ParlaMint-CZ.ana.xml, lines 147 to 152 in f9a0b6a):

         <langUsage>
            <language ident="cs" xml:lang="cs">čeština</language>
            <language ident="en" xml:lang="cs">angličtina</language>
            <language ident="cs" xml:lang="en">Czech</language>
            <language ident="en" xml:lang="en">English</language>
         </langUsage>
This was fixed

@GiliGoldin, it was fixed only partially, see my previous comment: #883 (comment)

should be:

         <langUsage>
            <language ident="he" xml:lang="he">עברית</language>
            <language ident="en" xml:lang="he">אנגלית</language>
            <language ident="he" xml:lang="en">Hebrew</language>
            <language ident="en" xml:lang="en">English</language>
         </langUsage>
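
The requirement can also be checked automatically: every language listed in @ident needs a name in every one of those languages (@xml:lang). A minimal Python sketch of such a check follows. `missing_language_names` is a hypothetical helper, not part of the ParlaMint validation, and it assumes that the set of @ident values is also the set of languages in which names are required.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local(tag):
    """Strip a namespace such as {http://www.tei-c.org/ns/1.0} from a tag."""
    return tag.rsplit("}", 1)[-1]


def missing_language_names(lang_usage):
    """Return the sorted (ident, xml:lang) pairs that have no <language> entry."""
    languages = [el for el in lang_usage if local(el.tag) == "language"]
    idents = {el.get("ident") for el in languages}
    present = {(el.get("ident"), el.get(XML_LANG)) for el in languages}
    return sorted((ident, lang)
                  for ident in idents
                  for lang in idents
                  if (ident, lang) not in present)


BEFORE = """<langUsage>
  <language ident="he">עברית</language>
  <language ident="en">English</language>
</langUsage>"""

AFTER = """<langUsage>
  <language ident="he" xml:lang="he">עברית</language>
  <language ident="en" xml:lang="he">אנגלית</language>
  <language ident="he" xml:lang="en">Hebrew</language>
  <language ident="en" xml:lang="en">English</language>
</langUsage>"""

print(missing_language_names(ET.fromstring(BEFORE)))
print(missing_language_names(ET.fromstring(AFTER)))
```

For the first snippet all four combinations are reported as missing, because no entry carries @xml:lang; for the second the list is empty.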

matyaskopp · 2024-12-02T09:02:29Z

Those are factions that consist of only one independent MP, but they are considered a regular faction/political party in the parliament. I don't see why they should be stored differently.

If this reflects the reality in Knesset, then do it this way - I am ok with it.
In most European parliaments, there are some restrictions on forming parliamentary groups: a minimum number of members is required (in CZ, for example, the minimum is 3).

matyaskopp · 2024-12-02T09:23:14Z

join attribute

*/@join="right"

There are too many joins, so the raw TEI and the annotated (TEI.ana) versions differ.
I have polished the script to make it easier to debug. The idea is to convert all segments in TEI and TEI.ana into text; the results should be the same:

$ make text.seg.ana-IL text.seg-IL
INFO: converting ParlaMint-IL_2009-10-21-18ptv139208.ana to text file
INFO: converting ParlaMint-IL_2021-12-21-24ptv616837.ana to text file
INFO: converting ParlaMint-IL_2009-03-12-18ptm186016.ana to text file
INFO: converting ParlaMint-IL_1998-07-08-14ptm532674.ana to text file
INFO: annotated segments converted to text are stored in Samples/ParlaMint-IL/text.seg.ana
INFO: converting ParlaMint-IL_2009-03-12-18ptm186016 to text file
INFO: converting ParlaMint-IL_2009-10-21-18ptv139208 to text file
INFO: converting ParlaMint-IL_2021-12-21-24ptv616837 to text file
INFO: converting ParlaMint-IL_1998-07-08-14ptm532674 to text file
INFO: segments converted to text are stored in Samples/ParlaMint-IL/text.seg

and then you can compare folders (I use meld):

$ meld Samples/ParlaMint-IL/text.seg Samples/ParlaMint-IL/text.seg.ana
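
To illustrate why the join attribute matters for this comparison, here is a minimal Python sketch of how an annotated sentence can be turned back into text. It is not the implementation behind the make targets above: `surface_tokens` and `sentence_text` are hypothetical helpers, and the sketch assumes that a space is written after every token unless it carries join="right", and that syntactic words nested inside an orthographic `<w>` do not contribute text.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a namespace such as {http://www.tei-c.org/ns/1.0} from a tag."""
    return tag.rsplit("}", 1)[-1]


def surface_tokens(element):
    """Yield orthographic tokens (<w>, <pc>) in document order.

    The function does not descend into a <w>, so syntactic words nested
    inside an orthographic word are skipped; <linkGrp> is ignored.
    """
    for child in element:
        name = local(child.tag)
        if name in ("w", "pc"):
            yield child
        elif name != "linkGrp":
            yield from surface_tokens(child)


def sentence_text(sentence):
    """Rebuild the sentence text, omitting the space after join="right"."""
    parts = []
    for token in surface_tokens(sentence):
        parts.append((token.text or "").strip())
        if token.get("join") != "right":
            parts.append(" ")
    return "".join(parts).strip()


# Simplified sentences: ids, lemmas and tags are omitted.
TOO_MANY_JOINS = """<s>
  <w>איפה</w>
  <w join="right">המכינה<w norm="ה"/><w norm="מכינה"/></w>
  <w>תוקם</w>
  <pc join="right">?</pc>
</s>"""

SOURCE_SPACING = """<s>
  <w>איפה</w>
  <w>המכינה<w norm="ה"/><w norm="מכינה"/></w>
  <w join="right">תוקם</w>
  <pc>?</pc>
</s>"""

wrong = sentence_text(ET.fromstring(TOO_MANY_JOINS))
right = sentence_text(ET.fromstring(SOURCE_SPACING))
print(wrong == right)
```

The two results differ: the first glues the second and third word together and puts a space before the question mark, which is the kind of difference the folder comparison makes visible.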

matyaskopp · 2024-12-02T14:21:03Z

@GiliGoldin, you removed your comment before I could react, so there are probably still some doubts.
In the meantime, I "fixed" one commit(9cb4c22) with orthographical words that hides one of the bugs, so the result is slightly different.

I can give you an example of how this sentence should be encoded: https://github.com/GiliGoldin/ParlaMint/blob/4571733fe48a9d200c92fd1ba7b02807bfc7ccfb/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21-18ptv139208.ana.xml#L484-L519
The current state is:

                  <s xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0">
                     <w lemma="איפה"
                        pos="ADV"
                        msd="UPosTag=ADV|PronType=Int"
                        xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t1">איפה</w>
                     <w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2-3" join="right">המכינה<w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2"
                           norm="ה"
                           lemma="ה"
                           pos="DET"
                           msd="UPosTag=DET|Definite=Def|PronType=Art"/>
                        <w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t3"
                           norm="מכינה"
                           lemma="מכינה"
                           pos="NOUN"
                           msd="UPosTag=NOUN|Gender=Fem|Number=Sing"/>
                     </w>
                     <w lemma="הוקם"
                        pos="VERB"
                        msd="UPosTag=VERB|Gender=Fem|HebBinyan=HUFAL|Number=Sing|Person=3|Tense=Fut|Voice=Pass"
                        xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4">תוקם</w>
                     <pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5"
                         msd="UPosTag=PUNCT"
                         join="right">?</pc>
                     <linkGrp targFunc="head argument" type="UD-SYN"><!-- SKIPPING --></linkGrp>
                  </s>

It should be:

                  <s xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0">
                     <w lemma="איפה"
                        pos="ADV"
                        msd="UPosTag=ADV|PronType=Int"
                        xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t1">איפה</w>
<!-- 
(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2-3) removing join="right" 
because the token(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4) 
on the right(=following) in this file is not joined
-->
                     <w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2-3">המכינה<w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2"
                           norm="ה"
                           lemma="ה"
                           pos="DET"
                           msd="UPosTag=DET|Definite=Def|PronType=Art"/>
                        <w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t3"
                           norm="מכינה"
                           lemma="מכינה"
                           pos="NOUN"
                           msd="UPosTag=NOUN|Gender=Fem|Number=Sing"/>
                     </w>
<!-- 
(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4) added join="right" 
because the punctuation (ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5) is joined with this token
-->
                     <w lemma="הוקם"
                        join="right"
                        pos="VERB"
                        msd="UPosTag=VERB|Gender=Fem|HebBinyan=HUFAL|Number=Sing|Person=3|Tense=Fut|Voice=Pass"
                        xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4">תוקם</w>
<!-- 
(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5) removing join="right" 
because the token is at the end of the sentence
-->
                     <pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5"
                         msd="UPosTag=PUNCT">?</pc>
                     <linkGrp targFunc="head argument" type="UD-SYN"><!-- SKIPPING --></linkGrp>
                  </s>

I am sorry if I wrote it ambiguously in #882. I hope this example helps.

GiliGoldin · 2024-12-02T14:30:50Z

Yes, I removed the comment since I noticed more problems that needed fixing.
I think I solved this problem in the meantime. Now the problems are mostly with punctuation marks like ", where I haven't figured out how to distinguish a closing one from an opening one.

matyaskopp · 2024-12-02T16:41:40Z

Yes, I removed the comment since I noticed more problems that needed fixing.
I think I solved this problem in the meantime. Now the problems are mostly with punctuation marks like ", where I haven't figured out how to distinguish a closing one from an opening one.

The idea is to store source spacing, not to create a typographically correct one. But the current state is much better - it does not break the text, so we can leave it as it is.
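
Storing source spacing means that opening and closing quotes do not need to be told apart: a token gets join="right" exactly when the source text has no whitespace between it and the next token. A minimal Python sketch of that idea follows. `join_flags` is a hypothetical helper, and it assumes that the tokens occur verbatim and in order in the source text (so for Hebrew it would have to work on orthographic tokens, not on the split syntactic words).

```python
def join_flags(text, tokens):
    """Return one boolean per token: True if the token should get join="right".

    A token is joined when the character right after it in the source text
    is not whitespace. The last token never gets a join.
    """
    flags = []
    pos = 0
    for i, token in enumerate(tokens):
        start = text.index(token, pos)  # raises ValueError if the token is missing
        pos = start + len(token)
        is_last = i == len(tokens) - 1
        flags.append(not is_last and pos < len(text) and not text[pos].isspace())
    return flags


text = 'He said "yes", then left.'
tokens = ["He", "said", '"', "yes", '"', ",", "then", "left", "."]

for token, joined in zip(tokens, join_flags(text, tokens)):
    print(repr(token), 'join="right"' if joined else "")
```

Both quotation marks are handled by the same rule: the opening one is joined because "yes" follows it directly, the closing one because the comma does.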

matyaskopp · 2024-12-02T17:55:11Z

I have spotted one easy-fix join issue:
https://github.com/clarin-eric/ParlaMint/actions/runs/12123458750/job/33799059974#step:4:11009

Make sure that the last token in a sentence does not contain the join attribute:
https://github.com/GiliGoldin/ParlaMint/blob/abc39b0c344fffa970992b82e2260ace3c5377ac/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21-18ptv139208.ana.xml#L1975-L1978

                  <s xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1">
                     <!-- SKIPPING -->
                     <pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1.t12"
                         msd="UPosTag=PUNCT"
                         join="right">-</pc>
                     <pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1.t13"
                         msd="UPosTag=PUNCT"
                         join="right">-</pc> <!-- REMOVE THIS JOIN -->
                     <linkGrp targFunc="head argument" type="UD-SYN"><!-- SKIPPING --></linkGrp>
                  </s>

I believe your pipeline will be ready to run on all data when you fix this.
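
A quick way to find every remaining case before running the pipeline on all data is to list the sentences whose last token still carries the join attribute. The following Python sketch is only an illustration, not the check used in the GitHub action: `sentences_ending_with_join` is a hypothetical helper, and the sample uses shortened ids.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local(tag):
    """Strip a namespace such as {http://www.tei-c.org/ns/1.0} from a tag."""
    return tag.rsplit("}", 1)[-1]


def surface_tokens(element):
    """Yield orthographic <w>/<pc> tokens in document order, without
    descending into a <w> (nested syntactic words are skipped)."""
    for child in element:
        name = local(child.tag)
        if name in ("w", "pc"):
            yield child
        elif name != "linkGrp":
            yield from surface_tokens(child)


def sentences_ending_with_join(root):
    """Return the xml:id of every <s> whose last token has join="right"."""
    bad = []
    for element in root.iter():
        if local(element.tag) != "s":
            continue
        tokens = list(surface_tokens(element))
        if tokens and tokens[-1].get("join") == "right":
            bad.append(element.get(XML_ID))
    return bad


SAMPLE = """<seg>
  <s xml:id="s1"><w>a</w><pc join="right">-</pc><pc join="right">-</pc><linkGrp/></s>
  <s xml:id="s2"><w join="right">b</w><pc>.</pc><linkGrp/></s>
</seg>"""

print(sentences_ending_with_join(ET.fromstring(SAMPLE)))
```

For a real component file, the root would come from `ET.parse(path).getroot()`; the helper ignores the TEI namespace, so the same code applies.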

@GiliGoldin, thanks for the exceptional work!

@TomazErjavec, just for your update, ParlaMint-IL sample is close to being ready.

GiliGoldin · 2024-12-03T07:33:36Z

I have spotted one easy-fix join issue: https://github.com/clarin-eric/ParlaMint/actions/runs/12123458750/job/33799059974#step:4:11009

Make sure that the last token in a sentence does not contain the join attribute: https://github.com/GiliGoldin/ParlaMint/blob/abc39b0c344fffa970992b82e2260ace3c5377ac/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21-18ptv139208.ana.xml#L1975-L1978
                  <s xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1">
                     
                     <pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1.t12"
                         msd="UPosTag=PUNCT"
                         join="right">-</pc>
                     <pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1.t13"
                         msd="UPosTag=PUNCT"
                         join="right">-</pc> 
                     <linkGrp targFunc="head argument" type="UD-SYN"></linkGrp>
                  </s>
I believe your pipeline will be ready to run on all data when you fix this.

@GiliGoldin, thanks for the exceptional work!

@TomazErjavec, just for your update, ParlaMint-IL sample is close to being ready.

That's great, thank you so much!
I fixed this issue.

GiliGoldin · 2024-12-29T12:44:10Z

Hi, the full data is located here:
https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus/tree/main/ParlaMint-IL

Happy holidays!
Gili

TomazErjavec · 2024-12-30T17:50:18Z

@GiliGoldin, thanks for letting us know. Will have a look and try to process it soon.
Happy holidays back!

matyaskopp assigned GiliGoldin Nov 22, 2024
