Help me segment #13616

MonoMarkor · 2024-09-03T15:51:23Z


I am using spaCy to split a page I downloaded from Confluence into an array of sentences. I want each whole table to be treated as a single sentence after running the text through nlp.

Each table is preceded by the marker 'SRT' and followed by the marker 'END', e.g.:
SRT| Name | Wert | Bemerkung |
| --- | --- | --- |
| d_1_SiO2 | 70 [nm] | HCl-Oxid |
| d_2_Nitr | 500[nm] | TS: 13522 |
| d_3_SiO2 | 70 [nm] | HTO |
| d_4_MoSi2 | 100 [nm] | MoSi2, nominelle Dicke PVD |
| d_5_SiO2 | 50 [nm] | HTO |
| d_6_Nitr | 100[nm] | TS: 13522 |END

However, after sentence segmentation it looks like this:
SRT|

,Name	Wert	Bemerkung

,| d_1_SiO2 | 70 [nm] |
,HCl-Oxid |
| d_2_Nitr | 500[nm] | TS: 13522 |
| d_3_SiO2 | 70 [nm] |
,HTO |
| d_4_MoSi2
,| 100 [nm] | MoSi2, nominelle Dicke PVD |
| d_5_SiO2 | 50 [nm] | HTO |
| d_6_Nitr
,| 100[nm] | TS:
,13522 |END

As you can see, it has also split the elements inside the table into separate sentences.
This is the code that I have used:

import spacy
from spacy import language
from spacy.language import Language

num_sentence_chunk_size = 5
nlp2 = spacy.load('de_core_news_lg')

@Language.component('table_segmentor')
def table_segmentor(doc:str):
    print((doc[23]))
    for i, token in enumerate(doc[:-1]):
        if token.text == 'SRT|':
            doc[i+1].is_sent_start=True
            print('found srt')
        elif token.text == '|END':
            print('found end')
            doc[i].is_sent_start=False
    return doc

nlp2.add_pipe('table_segmentor',before='parser')

s_t_s=[]
s_text_spacy=list(nlp2(f_text).sents)
s_t_s= [str(sentence) for sentence in s_text_spacy]
print(len(s_text_spacy))
for item in s_t_s:
    print(',' + item)

It has correctly identified all the start and end points of the tables, but I can't get it to work properly. Any help would be appreciated.
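One likely cause, offered as a guess and not a confirmed diagnosis: the component above only sets a flag on the token after 'SRT|' and on the '|END' token itself, so every token in between is still unconstrained and the parser is free to start new sentences there. A minimal sketch of the missing step follows, written in plain Python so it can be checked without a model. The helper name `table_sentence_flags` and its parameters are hypothetical, and it assumes the markers really do appear as the single tokens 'SRT|' and '|END', as the post reports.

```python
from typing import List, Optional


def table_sentence_flags(
    token_texts: List[str],
    start_marker: str = "SRT|",  # assumed token text, taken from the post
    end_marker: str = "|END",    # assumed token text, taken from the post
) -> List[Optional[bool]]:
    """Return one sentence-start flag per token.

    True  -> the token must start a sentence
    False -> the token must not start a sentence
    None  -> no constraint; leave the decision to the parser
    """
    flags: List[Optional[bool]] = [None] * len(token_texts)
    in_table = False
    table_just_closed = False
    for i, text in enumerate(token_texts):
        if in_table:
            # Every token inside the table, including the end marker,
            # is forbidden from opening a new sentence.
            flags[i] = False
            if text == end_marker:
                in_table = False
                table_just_closed = True
        elif text == start_marker:
            # The start marker opens the single table "sentence".
            flags[i] = True
            in_table = True
            table_just_closed = False
        elif table_just_closed:
            # The first token after a table starts a fresh sentence.
            flags[i] = True
            table_just_closed = False
    return flags


# Small illustrative token list (not real tokenizer output).
tokens = ["Intro", ".", "SRT|", "Name", "|", "Wert", "|END", "Next", "sentence", "."]
flags = table_sentence_flags(tokens)
for text, flag in zip(tokens, flags):
    print(f"{text!r}: {flag}")
```

Inside the spaCy component, these flags could be applied by computing `table_sentence_flags([t.text for t in doc])` and then setting `token.is_sent_start = flag` for every token whose flag is not `None`. Because the component runs before the parser, the parser should keep those preset boundaries. The component receives a `Doc`, so the `doc:str` annotation in the code above would more accurately be `doc: Doc` (imported from `spacy.tokens`).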

Replies: 0 comments
