BigBird-Pegasus on the booksum dataset

This is the "latest" version of the model, i.e. the one that has been trained the longest, currently at 70k steps.

GOAL: A summarization model that 1) summarizes the source content accurately and 2) more importantly, IMO, produces summaries that are easy to read and understand (*cough* unlike arXiv *cough*).
- This model attempts to help with that by using the booksum dataset to provide explanatory summarization.
- Explanatory summary: a summary that both consolidates information and explains why that consolidated information is important.
This model was trained for seven epochs total (approx. 70,000 steps) and is closer to finished.
- It will continue to improve (slowly, now that it has been trained for a long time) based on any findings/feedback.
The starting checkpoint was google/bigbird-pegasus-large-bigpatent.

example usage

An extended example, including a demo of batch summarization, is here.

create the summarizer object:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline

model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
)


summarizer = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
)

Define the text to be summarized and pass it through the pipeline. Boom, done.

wall_of_text = "your text to be summarized goes here."

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    clean_up_tokenization_spaces=True,
)

print(result[0]["summary_text"])
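
The same kind of pipeline object can also be called with a list of strings, which is one way to summarize several documents in a single call. The sketch below is a minimal illustration and not the extended demo mentioned earlier; the helper name `summarize_batch` and the `batch_size` default are assumptions made for this example, and the generation settings simply mirror the ones used above.

```python
from typing import Callable, List


def summarize_batch(summarizer: Callable, texts: List[str], batch_size: int = 2) -> List[str]:
    """Summarize several texts with one pipeline call (hypothetical helper).

    `summarizer` is assumed to behave like a transformers summarization
    pipeline: given a list of strings it returns one result per input.
    """
    if not texts:
        return []
    results = summarizer(
        texts,
        batch_size=batch_size,  # how many inputs are run through the model at once
        min_length=16,
        max_length=256,
        no_repeat_ngram_size=3,
        clean_up_tokenization_spaces=True,
    )
    summaries = []
    for item in results:
        # Depending on the call, each item is either a dict or a one-element list of dicts.
        if isinstance(item, list):
            item = item[0]
        summaries.append(item["summary_text"])
    return summaries
```

With the `summarizer` created earlier, a call would look like `summarize_batch(summarizer, [first_text, second_text])`, where `first_text` and `second_text` are placeholders for your own strings. Larger `batch_size` values trade memory for throughput, so a small value is a safer starting point.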

Alternate Checkpoint

If you run into runtime/memory issues, try this earlier checkpoint at 40,000 steps, which is almost as good at the explanatory summarization task but runs faster.
See similar summarization models fine-tuned on booksum but using different architectures: long-t5 base and LED-Large.
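
For the runtime/memory issues mentioned above, another generic workaround is to split a long input into smaller pieces and summarize each piece separately. The sketch below is an assumption for illustration, not something this model card prescribes or evaluates; `chunk_words`, `summarize_in_chunks`, and the 512-word chunk size are hypothetical choices, and summarizing pieces independently can lose context that spans chunk boundaries.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 512) -> List[str]:
    """Split text into consecutive pieces of at most `max_words` whitespace-separated words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_in_chunks(summarizer: Callable, text: str, max_words: int = 512) -> str:
    """Summarize each chunk on its own and join the partial summaries with newlines.

    `summarizer` is assumed to behave like a transformers summarization pipeline:
    called with one string, it returns a list whose first item has a "summary_text" key.
    """
    partial_summaries = []
    for chunk in chunk_words(text, max_words):
        result = summarizer(
            chunk,
            min_length=16,
            max_length=256,
            no_repeat_ngram_size=3,
        )
        partial_summaries.append(result[0]["summary_text"])
    return "\n".join(partial_summaries)
```

Word counts are only a rough proxy for token counts, so the chunk size may need tuning for a given input.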

Model size: 0.6B params (Safetensors, tensor type F32)

Paper for pszemraj/bigbird-pegasus-large-K-booksum

BookSum: A Collection of Datasets for Long-form Narrative Summarization (arXiv 2105.08209, published May 18, 2021)

Evaluation results

kmfoda/booksum (test set, verified)
- ROUGE-1: 34.076
- ROUGE-2: 5.918
- ROUGE-L: 16.387
- ROUGE-LSUM: 31.612
- loss: 3.522
- gen_len: 254.368

launch/gov_report (test set, verified)
- ROUGE-1: 40.015
- ROUGE-2: 10.741
- ROUGE-L: 20.134
- ROUGE-LSUM: 36.774
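
For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference summary. The sketch below is a deliberately simplified version (lowercased whitespace tokens, no stemming, F1 only, on a 0 to 1 scale rather than 0 to 100), so it is not expected to reproduce the verified scores above; the function name `rouge1_f1` is an assumption for this example.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased, whitespace-split tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A full ROUGE implementation adds details such as stemming and the longest-common-subsequence variants (ROUGE-L, ROUGE-LSUM), which this sketch leaves out.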