# Batch processing improvements
A small release with some improvements to the `Summarizer` class for batch-processing use.
Let's say you've loaded your `Summarizer`:

```python
from textsum.summarize import Summarizer

model_name = "pszemraj/pegasus-x-large-book_synthsumm-bf16"  # recent model
summarizer = Summarizer(model_name)
```
## New features/improvements

### Smart `__call__` method for the `Summarizer` class

- Added a smart `__call__` method that automatically distinguishes between text input and file paths for summarization, allowing easier integration into batch processing and `.map()` tasks.
```python
# Directly passing text to be summarized
summary_text = summarizer("This is a sample text to summarize.")
print(summary_text)

# Passing a file path to be summarized
output_filepath = summarizer(
    "/path/to/textfile.extension",
    output_dir="./my-summary-stash",
)
print(output_filepath)
```
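Under the hood, this kind of dispatch boils down to checking whether the input string points at an existing file. A minimal sketch of the idea (purely illustrative; `smart_summarize` is an invented name, not the library's actual implementation):

```python
import os


def smart_summarize(source: str) -> str:
    """Toy dispatcher: treat a string that names an existing file as a
    file input; treat everything else as raw text to summarize."""
    if os.path.isfile(source):
        text = open(source, encoding="utf-8").read()
        return f"[summary of file {source}: {len(text)} chars]"
    return f"[summary of text: {len(source)} chars]"


print(smart_summarize("This is a sample text to summarize."))
```

Because the check happens per call, the same callable works for both plain strings and paths inside a `.map()` pipeline.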
### Enhanced batch processing controls

- Introduced `disable_progress_bar` and `batch_delimiter` options to improve control over batch processing and output formatting.
```python
from datasets import load_dataset

dataset = load_dataset("Trelis/tiny-shakespeare")
dataset = dataset.map(
    lambda x: {"summary": summarizer(x["text"], disable_progress_bar=True)},
    batched=False,
)  # doesn't spam you with multiple progress bars!
print(dataset)
```
Note: you can also pass `disable_progress_bar=True` when instantiating `Summarizer()` for cleaner inference.
You can now set the summary batch delimiter string via the `batch_delimiter` argument when running inference:
```python
summary_output = summarizer(text, batch_delimiter="<I AM A DELIMITER>")
print(summary_output)
# "Summary of first chunk.<I AM A DELIMITER>Summary of second chunk.<I AM A DELIMITER>Summary of third chunk."
```
By default, the delimiter is `"\n\n"`.
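Conceptually, the per-chunk summaries are joined with the delimiter string, along these lines (a simplified sketch, not the library's exact code):

```python
# Each long input is split into chunks; each chunk gets its own summary.
chunk_summaries = [
    "Summary of first chunk.",
    "Summary of second chunk.",
    "Summary of third chunk.",
]

delimiter = "\n\n"  # the default batch_delimiter
print(delimiter.join(chunk_summaries))
```

Setting `batch_delimiter` to a distinctive marker string makes it easy to split the output back into per-chunk summaries later.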
## Misc

- Default parameter update: the `length_penalty` for inference is now 1.0 (was 0.8).
- Code cleanup across modules, mostly for readability and maintainability.
## What's Changed

**Full Changelog**: v0.2.0...v0.2.1