A small program to extract key passage texts based on their (start/end) position in the text. Inputs are a literary text and the corresponding citation_sources file created with Lotte. Returns a `.csv` or a `.pkl` file containing the texts of all cited passages plus additional information. For now, this script only works when called from the lotte-develop repo; it might need to be rewritten for other purposes.
```
extract_passages.py [-h] [-o {.csv,.pkl}] -c {.json} -t {.txt}
```

required named arguments:

- `-c`, `--citations`: citation_sources path/file name (file type: `.json`)
- `-t`, `--text`: literary text path/file name (file type: `.txt`)

optional arguments:

- `-h`, `--help`: show this help message and exit
- `-o`, `--output`: output path/file name (file type: `.csv` or `.pkl`)
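A typical call (all file names here are placeholders) could look like this:

```
python extract_passages.py -c citation_sources.json -t novel.txt -o passages.pkl
```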
This Python script allows us to divide our literary text into two groups: one consisting of potential key passages ("cited") and one containing the rest ("not cited"). A `.pkl` file created with `extract_passages.py` is needed as input, and another `.pkl` file containing all text passages is returned.
```
group_passages.py [-h] [-w WORK] -i {.pkl} -t {.txt}
```

required named arguments:

- `-i`, `--input`: input path/file name (file type: `.pkl`)
- `-t`, `--text`: literary text path/file name (file type: `.txt`)

optional arguments:

- `-h`, `--help`: show this help message and exit
- `-w`, `--work`: title of the work, used for output file names (type: `str`)
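For example (again with placeholder file names):

```
python group_passages.py -i passages.pkl -t novel.txt -w Werther
```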
- Prerequisites: `j_all.pkl` and `k_all.pkl`, as created in Step 0 with `group_passages.py` (here: data/0_extraction_data)
- Output: data/1_text-stats_data
- Python scripts used: basic.py, processing.py, stats.py, vis.py
`text/df = read_file(filepath)`

Reads a `.txt` or a `.pkl` file.

| name | description |
| --- | --- |
| `filepath` | name of the file/filepath; type: `str` |
| `text`/`df` | file content; based on `filepath` it is either a `str` (for `.txt`) or a `pandas.DataFrame` (for `.pkl`) |
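A minimal sketch of how such a reader can be implemented (dispatching on the file extension; not necessarily the exact code in basic.py):

```python
import pandas as pd

def read_file(filepath):
    """Read a .txt file as a string or a .pkl file as a pandas.DataFrame."""
    if filepath.endswith(".txt"):
        with open(filepath, encoding="utf-8") as f:
            return f.read()
    if filepath.endswith(".pkl"):
        return pd.read_pickle(filepath)
    raise ValueError(f"unsupported file type: {filepath}")
```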
`sents_listed/sentences = split_sentences(text)`

Splits an input `text` into sentences.

| name | description |
| --- | --- |
| `text` | type: `str` or `list` |
| `sents_listed`/`sentences` | depending on whether the input is a single text or a list of texts, this function returns either a single `list` of sentences or a `list` of sentence `list`s (one per text) |
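A sketch of the two input cases, assuming NLTK's `sent_tokenize` with its German model (the repo may use a different sentence splitter):

```python
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

def split_sentences(text):
    """Single text -> list of sentences; list of texts -> list of sentence lists."""
    if isinstance(text, str):
        return sent_tokenize(text, language="german")
    return [sent_tokenize(t, language="german") for t in text]
```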
`stats_df, sumstats = get_stats(text, sents, cit_num)`

Returns a `pandas.DataFrame` of text statistics (character length, token count, token length and sentence length per passage) as well as a `pandas.DataFrame` with the corresponding summary statistics. For the functions `prepare_stats()` and `summary_stats()` as well as the calculation of the different text statistics (`get_char_len()`, `get_token_count()`, `get_token_len()`, `get_sent_len()`), please take a look at stats.py.

| name | description |
| --- | --- |
| `text` | list of texts (per passage); type: `list` |
| `sents` | list of sentences (per passage); type: `list` |
| `cit_num` | list of citation frequencies (per passage); type: `list` |
| `stats_df` | output from `prepare_stats()`; type: `pandas.DataFrame` |
| `sumstats` | output from `summary_stats()`; type: `pandas.DataFrame` |
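For illustration, the per-passage statistics could be computed roughly like this (a simplified sketch using whitespace tokenization, not the actual code in stats.py):

```python
import pandas as pd

def get_stats_sketch(text, sents, cit_num):
    """Character length, token count, mean token length and mean
    sentence length per passage, plus summary statistics."""
    rows = []
    for passage, passage_sents, citations in zip(text, sents, cit_num):
        tokens = passage.split()  # assumes non-empty passages
        rows.append({
            "char_len": len(passage),
            "token_count": len(tokens),
            "token_len": sum(len(t) for t in tokens) / len(tokens),
            "sent_len": len(tokens) / len(passage_sents),
            "cit_num": citations,
        })
    stats_df = pd.DataFrame(rows)
    return stats_df, stats_df.describe()  # describe() as a stand-in for summary_stats()
```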
`fig = box_plot(df_cols, df_names, attribute)`

Returns a box plot for the given columns and attribute.

| name | description |
| --- | --- |
| `df_cols` | `pandas.DataFrame` column names to compare; type: `list` |
| `df_names` | `str` names for the columns in `df_cols`, used for labelling purposes; type: `list` |
| `attribute` | attribute (e.g. textual statistic) to inspect; type: `str` |
| `fig` | figure; type: `plotly.graph_objects.Figure` |
`fig = scatter_plot(df, df_name, colx, coly)`

Returns a scatter plot for the given columns in `df`.

| name | description |
| --- | --- |
| `df` | type: `pandas.DataFrame` |
| `df_name` | name of `df`, used for the title text; type: `str` |
| `colx` | column 1 of `df` to inspect; type: `str` |
| `coly` | column 2 of `df` to inspect; type: `str` |
| `fig` | figure; type: `plotly.express.scatter` |
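Given the `plotly.express.scatter` return type named in the table, `scatter_plot` is plausibly a thin wrapper along these lines (a sketch):

```python
import plotly.express as px

def scatter_plot(df, df_name, colx, coly):
    """Scatter plot of two columns of df, titled with df_name."""
    return px.scatter(df, x=colx, y=coly, title=df_name)
```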
- Prerequisites: all files in data/1_text-stats_data
- Output: data/2_pos_data
- path: 2_pos.ipynb
- libraries that need to be installed:
- Python scripts used: basic.py, processing.py, pos.py, vis.py
`pos_tagged_dict, pos_tagged_list = get_pos_tags(sents)`

Returns a `dict` and a `list` of POS tags for all of the passages.

| name | description |
| --- | --- |
| `sents` | list of lists of sentences, as created using `split_sentences()`; type: `list` |
| `pos_tagged_dict` | all individual terms (= each term once per passage) and their associated POS tags; type: `dict` |
| `pos_tagged_list` | all POS tags per passage in their original order; type: `list` |
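A sketch of POS tagging with spaCy's German model (an assumption; pos.py may use a different tagger or tagset):

```python
import spacy

nlp = spacy.load("de_core_news_sm")  # assumed German model

def get_pos_tags_sketch(sents):
    """Per-passage POS tags: a term -> tag dict and a tag list in text order."""
    pos_tagged_dict, pos_tagged_list = {}, []
    for passage_sents in sents:
        tags = []
        for sent in passage_sents:
            for token in nlp(sent):
                tags.append(token.pos_)
                pos_tagged_dict[token.text] = token.pos_  # simplified: one global dict
        pos_tagged_list.append(tags)
    return pos_tagged_dict, pos_tagged_list
```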
`sorted_tag_freqs, tags_used = count_tag_freqs(pos_tagged)`

Counts the individual POS tag frequencies per passage in `pos_tagged`.

| name | description |
| --- | --- |
| `pos_tagged` | output `pos_tagged_list` from `get_pos_tags`; type: `list` |
| `sorted_tag_freqs` | DataFrame where row = tag name, column = index of passage, values = relative frequencies; type: `pandas.DataFrame` |
| `tags_used` | all individual POS tags used within an input text/document; type: `list` |
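Relative frequencies in that shape can be derived with one `Counter` per passage and a column-wise normalization (a sketch, not the exact implementation):

```python
from collections import Counter
import pandas as pd

def count_tag_freqs_sketch(pos_tagged):
    """Rows = tag names, columns = passage indices, values = relative frequencies."""
    df = pd.DataFrame([Counter(tags) for tags in pos_tagged]).T.fillna(0)
    sorted_tag_freqs = df / df.sum(axis=0)  # absolute counts -> relative frequencies
    return sorted_tag_freqs, list(sorted_tag_freqs.index)
```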
`fig = pos_heatmap(df)`

Returns a heatmap visualization for the input `df`.

| name | description |
| --- | --- |
| `df` | output `sorted_tag_freqs` from `count_tag_freqs`; type: `pandas.DataFrame` |
| `fig` | figure; type: `plotly.graph_objects.Figure` |
`df = calculate_weights(df, cit_num)`

Calculates weighted values for `sorted_tag_freqs`.

| name | description |
| --- | --- |
| `df` (input) | output `sorted_tag_freqs` from `count_tag_freqs`; type: `pandas.DataFrame` |
| `cit_num` | citation frequencies per passage; type: `list` |
| `df` (output) | equals `df` (input) but with newly calculated values; type: `pandas.DataFrame` |
`ngrams_list = find_ngrams(pos_tagged, n)`

Returns a list of n-grams for `pos_tagged`.

| name | description |
| --- | --- |
| `pos_tagged` | output `pos_tagged` from `get_pos_tags`; type: `list` |
| `n` | n as in n-gram; type: `int` |
| `ngrams_list` | all the corresponding n-grams found; type: `list` |
`df = ngram_count(ngrams)`

Counts the frequencies of each n-gram in `ngrams` and returns them as a `pandas.DataFrame`.

| name | description |
| --- | --- |
| `ngrams` | output `ngrams_list` from `find_ngrams`; type: `list` |
| `df` | DataFrame containing the columns `"ngram"` and `"count"`; type: `pandas.DataFrame` |
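Both steps are standard n-gram bookkeeping; a compact sketch (the comma-joined string format is an assumption, chosen to match what `find_ngram_index` below expects):

```python
from collections import Counter
import pandas as pd

def find_ngrams_sketch(pos_tagged, n):
    """All POS n-grams over all passages, each as a comma-joined string."""
    return [",".join(tags[i:i + n])
            for tags in pos_tagged
            for i in range(len(tags) - n + 1)]

def ngram_count_sketch(ngrams):
    """Frequency of each n-gram as a DataFrame with columns 'ngram' and 'count'."""
    counts = Counter(ngrams)
    return pd.DataFrame({"ngram": list(counts), "count": list(counts.values())})
```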
`ngrams, names = get_n_ngrams(n, topn, pos_tagged)`

Allows calling `find_ngrams` and `ngram_count` for more than one `n` and limiting the results to the `topn` highest counts.

| name | description |
| --- | --- |
| `n` | list of `int`s, n as in n-gram; type: `list([int_1, int_2, int_n])` |
| `topn` | describes how many highest values to return; type: `int` |
| `pos_tagged` | output `pos_tagged` from `get_pos_tags`; type: `list` |
| `ngrams` | a `pandas.DataFrame` similar to the output `df` from `ngram_count` for each `n`; type: `list` |
| `names` | names for each of the nested `pandas.DataFrame`s in `ngrams`, to use for visualization purposes; type: `list` |
`fig = vis_subplots(subtitles, dataframes, rowcount, colcount, showlabels, rel_yaxis)`

Creates a `plotly.graph_objects.Figure` consisting of several bar subplots, one for each DataFrame in `dataframes`.

| name | description |
| --- | --- |
| `subtitles` | list of `str`s, one per subtitle; type: `list` |
| `dataframes` | list of `pandas.DataFrame`s, output `ngrams` from `get_n_ngrams`; type: `list` |
| `rowcount` | number of rows for subplots; type: `int` |
| `colcount` | number of columns for subplots; type: `int` |
| `showlabels` | whether to show labels for subplots or not; type: `bool` |
| `rel_yaxis` | if `True`, all following subplots have the same y-axis scale as the first one; type: `bool` |
| `fig` | type: `plotly.graph_objects.Figure` |
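A sketch of how such a grid of bar subplots can be built with `plotly.subplots.make_subplots` (the `showlabels`/`rel_yaxis` options are omitted here):

```python
from plotly.subplots import make_subplots
import plotly.graph_objects as go

def vis_subplots_sketch(subtitles, dataframes, rowcount, colcount):
    """One bar subplot per DataFrame (columns 'ngram' and 'count')."""
    fig = make_subplots(rows=rowcount, cols=colcount, subplot_titles=subtitles)
    for i, df in enumerate(dataframes):
        row, col = divmod(i, colcount)
        fig.add_trace(go.Bar(x=df["ngram"], y=df["count"]),
                      row=row + 1, col=col + 1)
    return fig
```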
`all_grams_list = list_individual_grams(df_lists)`

Returns all individual n-grams over all input `df_lists`.

| name | description |
| --- | --- |
| `df_lists` | list of `pandas.DataFrame`s that equal the output `ngrams` from `get_n_ngrams`; type: `list` |
| `all_grams_list` | all individual n-grams over the different input DataFrames in `df_lists`; type: `list` |
`check_grams = grams_matrix_prep(grams, all_grams, type)`

Checks for each n-gram in `grams` whether it occurs (`"binary"`) or how often it occurs (`"count"`), depending on `type`. Returns a `list` of values.

| name | description |
| --- | --- |
| `grams` | equals the output `ngrams` from `get_n_ngrams`; type: `pandas.DataFrame` |
| `all_grams` | output `all_grams_list` from `list_individual_grams`; type: `list` |
| `type` | options are `"binary"` and `"count"`; the value is currently not validated; type: `str` |
| `check_grams` | a list of binary/count values for each n-gram in `all_grams`; type: `list` |
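The behaviour described above amounts to a lookup against `all_grams` (a sketch; `type` is left unvalidated, as in the description):

```python
def grams_matrix_prep_sketch(grams, all_grams, type):
    """Binary occurrence or count per n-gram in all_grams."""
    counts = dict(zip(grams["ngram"], grams["count"]))
    if type == "binary":
        return [1 if g in counts else 0 for g in all_grams]
    return [counts.get(g, 0) for g in all_grams]  # "count"
```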
`index_list = find_ngram_index(pos_tagged, ngram)`

Finds all indices of passages in `pos_tagged` that contain a certain `ngram` at least once.

| name | description |
| --- | --- |
| `pos_tagged` | output `pos_tagged_list` from `get_pos_tags`; type: `list` |
| `ngram` | must be in the following format (use `,` as delimiter): `"[pos_tag],[pos_tag_2],[pos_tag_n]"`; type: `str` |
| `index_list` | type: `list` |
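Given the comma-delimited format, a sketch of the lookup (note that plain substring matching like this can over-match on similarly named tags):

```python
def find_ngram_index_sketch(pos_tagged, ngram):
    """Indices of passages whose comma-joined tag sequence contains ngram."""
    return [i for i, tags in enumerate(pos_tagged) if ngram in ",".join(tags)]
```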
`div_df = get_pos_diversity(df)`

Calculates the Shannon entropy of the POS tag frequencies per passage in `df` and returns the results in a `pandas.DataFrame`.

| name | description |
| --- | --- |
| `df` | output `sorted_tag_freqs` from `count_tag_freqs`; type: `pandas.DataFrame` |
| `div_df` | DataFrame containing the column `"pos_diversity"`; type: `pandas.DataFrame` |
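The Shannon entropy of a passage's tag distribution is H = -Σ p·log2(p); a sketch over the frequency columns:

```python
import numpy as np
import pandas as pd

def get_pos_diversity_sketch(df):
    """Shannon entropy of the POS tag distribution of each passage (column)."""
    def entropy(p):
        p = p[p > 0]  # ignore tags that do not occur in the passage
        return float(-(p * np.log2(p)).sum())
    return pd.DataFrame({"pos_diversity": [entropy(df[c]) for c in df.columns]})
```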
- Prerequisites: all files in data/2_pos_data
- Output: data/3_sentiment_data
- path: 3_sentiment.ipynb
- libraries that need to be installed:
- Python scripts used: basic.py, processing.py, sentiment.py, summary.py
`sentiment_glossary = sentiws_glossary(positive_lines, negative_lines)`

Returns all SentiWS data in the form of a `pandas.DataFrame`.

| name | description |
| --- | --- |
| `positive_lines` | file `SentiWS_v2.0_Positive.txt`, read with `.readlines()`; type: `list` |
| `negative_lines` | file `SentiWS_v2.0_Negative.txt`, read with `.readlines()`; type: `list` |
| `sentiment_glossary` | processable glossary of words and their sentiment values to work with; type: `pandas.DataFrame` |
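A sketch of the parsing, assuming the SentiWS v2.0 line format `word|POS<TAB>value<TAB>inflections` (check the files for the exact layout):

```python
import pandas as pd

def sentiws_glossary_sketch(positive_lines, negative_lines):
    """Flatten SentiWS lines into a word/POS/value DataFrame."""
    rows = []
    for line in positive_lines + negative_lines:
        parts = line.strip().split("\t")
        word, pos = parts[0].split("|")
        value = float(parts[1])
        rows.append({"word": word, "pos": pos, "value": value})
        if len(parts) > 2:  # inflected forms share the base form's value
            for inflection in parts[2].split(","):
                rows.append({"word": inflection, "pos": pos, "value": value})
    return pd.DataFrame(rows)
```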
`sentiment_vals = get_polarity_values(text, sentiment_df)`

Returns a sentiment value for each passage in `text`.

| name | description |
| --- | --- |
| `text` | type: `list` |
| `sentiment_df` | output `sentiment_glossary` from `sentiws_glossary`; type: `pandas.DataFrame` |
| `sentiment_vals` | all polarity values for `text` based on `sentiment_df`; type: `list` |
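One plausible scoring scheme (an assumption: summing the glossary values of all matched tokens per passage):

```python
def get_polarity_values_sketch(text, sentiment_df):
    """Sum of SentiWS values of all glossary tokens found in each passage."""
    lookup = dict(zip(sentiment_df["word"], sentiment_df["value"]))
    return [sum(lookup.get(token, 0.0) for token in passage.split())
            for passage in text]
```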
`dataframe = apply_scaling(dataframe, col, scale_range)`

Applies a new scale to all data in one `col` of `dataframe` (input).

| name | description |
| --- | --- |
| `dataframe` (input) | type: `pandas.DataFrame` |
| `col` | column in `dataframe` (input) that `apply_scaling` should be applied to; type: `str` |
| `scale_range` | (currently) one of two options: `"zero_pos"` equals the range [0, 1] (using `sklearn.preprocessing.MinMaxScaler`), `"neg_pos"` equals the range [-1, 1] (using `sklearn.preprocessing.MaxAbsScaler`); type: `str` |
| `dataframe` (output) | type: `pandas.DataFrame` |
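Since the table names the two sklearn scalers, the function plausibly reduces to something like this sketch:

```python
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

def apply_scaling_sketch(dataframe, col, scale_range):
    """Rescale one column to [0, 1] ("zero_pos") or [-1, 1] ("neg_pos")."""
    scaler = MinMaxScaler() if scale_range == "zero_pos" else MaxAbsScaler()
    dataframe[col] = scaler.fit_transform(dataframe[[col]])
    return dataframe
```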
`sentiment = get_germansentiment(text_col)`

Calculates sentiment scores for each passage in `text_col` and returns a DataFrame.

| name | description |
| --- | --- |
| `text_col` | type: `pandas.DataFrame` column |
| `sentiment` | `germansentiment.SentimentModel().predict_sentiment()` for each text in `text_col`; type: `pandas.DataFrame` |
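The underlying germansentiment call looks like this (the wrapping into a DataFrame is specific to the repo):

```python
from germansentiment import SentimentModel

model = SentimentModel()
# predict_sentiment takes a list of texts and returns one label per text:
# "positive", "negative" or "neutral"
labels = model.predict_sentiment(["Das Buch ist wunderbar.", "Es war furchtbar."])
```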
`compare_sentiment(passage_loc, df)`

Prints out the germansentiment and SentiWS scores for a given `passage_loc` in `df`.

| name | description |
| --- | --- |
| `passage_loc` | location (index) of the passage to inspect; type: `int` |
| `df` | must contain the columns `"text"`, `"germansentiment"` and `"rel_sentiws"` for them to be compared; type: `pandas.DataFrame` |
`df = map_sentiment(df)`

Transforms the germansentiment values in `df` to a [-1, 0, 1] scale.

| name | description |
| --- | --- |
| `df` (input/output) | must contain the column `"germansentiment"`, on which the function is applied; type: `pandas.DataFrame` |
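The mapping itself is presumably a simple label-to-number replacement (a sketch):

```python
def map_sentiment_sketch(df):
    """Map germansentiment labels onto a [-1, 0, 1] scale."""
    df["germansentiment"] = df["germansentiment"].map(
        {"negative": -1, "neutral": 0, "positive": 1}
    )
    return df
```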
- Prerequisites: all files in data/3_sentiment_data
- Output: data/4_summary_data
- path: 4_summary.ipynb
- libraries that need to be installed: