Help me segment #13616
I am using spaCy's sentence segmentation to build a list of sentences from a page I downloaded from Confluence. I want each table to be treated as a single sentence after running the text through nlp.
Each table is preceded by 'SRT' and followed by 'END', e.g.:
SRT| Name | Wert | Bemerkung |
| --- | --- | --- |
| d_1_SiO2 | 70 [nm] | HCl-Oxid |
| d_2_Nitr | 500[nm] | TS: 13522 |
| d_3_SiO2 | 70 [nm] | HTO |
| d_4_MoSi2 | 100 [nm] | MoSi2, nominelle Dicke PVD |
| d_5_SiO2 | 50 [nm] | HTO |
| d_6_Nitr | 100[nm] | TS: 13522 |END
However, after sentence segmentation the output looks like this:
SRT|
,| d_1_SiO2 | 70 [nm] |
,HCl-Oxid |
| d_2_Nitr | 500[nm] | TS: 13522 |
| d_3_SiO2 | 70 [nm] |
,HTO |
| d_4_MoSi2
,| 100 [nm] | MoSi2, nominelle Dicke PVD |
| d_5_SiO2 | 50 [nm] | HTO |
| d_6_Nitr
,| 100[nm] | TS:
,13522 |END
As you can see, the cells inside the table have been split into separate sentences as well.
This is the code I used:
import spacy
from spacy.language import Language
from spacy.tokens import Doc

num_sentence_chunk_size = 5
nlp2 = spacy.load('de_core_news_lg')

@Language.component('table_segmentor')
def table_segmentor(doc: Doc) -> Doc:
    print(doc[23])  # debug output
    for i, token in enumerate(doc[:-1]):
        if token.text == 'SRT|':
            # mark the token after the start marker as a sentence start
            doc[i + 1].is_sent_start = True
            print('found srt')
        elif token.text == '|END':
            print('found end')
            # only the end marker itself is flagged as "not a sentence start"
            doc[i].is_sent_start = False
    return doc

nlp2.add_pipe('table_segmentor', before='parser')

# f_text holds the page text downloaded from Confluence
s_text_spacy = list(nlp2(f_text).sents)
s_t_s = [str(sentence) for sentence in s_text_spacy]
print(len(s_text_spacy))
for item in s_t_s:
    print(',' + item)
It correctly identifies the start and end of every table, but I can't get the segmentation to work properly. Any help would be appreciated.
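One possible workaround, as a rough sketch rather than a fix to the component above: instead of adjusting sentence boundaries inside the pipeline, the table blocks could be cut out of the text first and only the remaining text passed to the sentence splitter. The names `split_keeping_tables`, `naive_segment` and `TABLE_RE` are made up for illustration, and the sketch assumes every table starts with `SRT|`, ends with `|END`, and that `|END` does not occur inside a table cell.

```python
import re
from typing import Callable, List

# Assumption: a table block runs from 'SRT|' to the next '|END', markers included.
TABLE_RE = re.compile(r"SRT\|.*?\|END", re.DOTALL)


def split_keeping_tables(text: str, segment: Callable[[str], List[str]]) -> List[str]:
    """Split text into sentences, keeping each SRT|...|END block as one item."""
    sentences: List[str] = []
    pos = 0
    for match in TABLE_RE.finditer(text):
        # Text before the table goes through the normal sentence splitter.
        before = text[pos:match.start()]
        if before.strip():
            sentences.extend(segment(before))
        # The whole table is appended untouched as a single "sentence".
        sentences.append(match.group(0))
        pos = match.end()
    # Whatever follows the last table is segmented normally as well.
    rest = text[pos:]
    if rest.strip():
        sentences.extend(segment(rest))
    return sentences


def naive_segment(text: str) -> List[str]:
    """Hypothetical stand-in splitter: breaks after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

With spaCy, the stand-in splitter could presumably be swapped for something like `lambda t: [s.text for s in nlp2(t).sents]`, so that the model only ever sees the non-table text.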