Primera-Summarization-Council-PT: Abstractive Summarization of Discussion Subjects in Portuguese Municipal Meeting Minutes
Model Description
Primera-Summarization-Council-PT is an abstractive text summarization model based on PRIMERA (allenai/PRIMERA), fine-tuned to produce concise, informative summaries of discussion subjects from Portuguese municipal meeting minutes.
The model was trained on a curated and annotated corpus of official municipal meeting minutes covering a variety of administrative and political topics at the municipal level.
Try out the model: Hugging Face Space Demo
Key Features
- Abstractive Summarization: Generates natural, human-like summaries rather than extracting sentences from the source.
- European Portuguese: Optimized for official and administrative Portuguese.
- Domain-Specific: Trained on municipal meeting minutes and administrative discussions.
- Fine-tuned PRIMERA: Built upon `allenai/primera` using supervised fine-tuning.
- Fact-Aware Generation: Produces short summaries that preserve factual content.
Model Details
- Architecture: Longformer Encoder-Decoder (LED), an extension of BART
- Base Model: allenai/PRIMERA
- Task: Abstractive summarization (text → summary)
- Framework: Hugging Face Transformers (PyTorch)
- Tokenizer: Longformer/BART tokenizer (English vocabulary reused for Portuguese text)
- Max Input Length: 4096 tokens
- Max Summary Length: 128 tokens
- Training Objective: Conditional generation (cross-entropy loss)
- Dataset: Portuguese municipal meeting minutes annotated with summaries
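For reference, the cross-entropy objective for conditional generation is usually written as the negative log-likelihood of the reference summary given the input. The formula below is the standard textbook form, not something taken from the training script; the symbols $x$ (input tokens), $y_1, \dots, y_T$ (reference summary tokens, with $T \le 128$ here), and $\theta$ (model parameters) are introduced for illustration, and details such as per-token averaging or label smoothing are not specified in this card.

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
$$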
How It Works
The model receives a discussion subject of a municipal meeting and outputs a short, coherent summary highlighting:
- The main subject or topic of discussion
- Any decisions, motions, or proposals made
- The entities or departments involved
Example Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
model_name = "anonymous12321/Primera-Summarization-Council-PT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# One discussion subject from a meeting's minutes (process numbers masked with asterisks)
text = """
17. PROCESSO DE OBRAS N.º ***** -- EDIFIC\nPelo Senhor Presidente foi presente a esta reunião a informação n.º ****** da Secção de Urbanismo e Fiscalização -- Serviço de Obras Particulares que se anexa à presente ata. \nPonderado e analisado o assunto o Executivo Municipal deliberou por unanimidade aprovar as especialidades relativas ao processo de obras n.º ***** -- EDIFIC.
"""

# Tokenize, truncating inputs longer than 1024 tokens
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
# Generate a summary of at most 128 tokens using beam search with 4 beams
summary_ids = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
Model Output
Output:
"O Executivo Municipal aprovou, por unanimidade, as especialidades relativas a um processo de obras particulares."
Evaluation Results
Quantitative Metrics (on held-out test set)
| Metric | Score | Description |
|---|---|---|
| ROUGE-1 | 0.632 | Unigram overlap between generated and reference summaries |
| ROUGE-2 | 0.500 | Bigram overlap |
| ROUGE-L | 0.577 | Longest common subsequence overlap |
| BERTScore (F1) | 0.846 | Semantic similarity between summary and reference |
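To make the overlap metrics in the table concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L. It is not the evaluation script used for the scores above. It assumes lowercased whitespace tokenization and F1 aggregation, whereas standard ROUGE packages add their own tokenization and optional stemming, so its numbers will not match theirs exactly. All function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, candidate_total, reference_total):
    """F1 from an overlap count and the candidate/reference totals."""
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # '&' keeps the minimum count per n-gram
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


generated = "o executivo aprovou o processo"
reference = "o executivo municipal aprovou o processo de obras"
print(rouge_n(generated, reference, n=1))
print(rouge_n(generated, reference, n=2))
print(rouge_l(generated, reference))
```

BERTScore is not sketched here because it depends on contextual embeddings from a pretrained model rather than on token overlap.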
Training Details
- Pretrained Model: `allenai/primera`
- Optimizer: AdamW (default in Hugging Face Trainer)
- Learning Rate: 2e-5
- Batch Size: 4
- Epochs: 3
- Scheduler: Linear warmup
- Loss Function: Cross-entropy
- Evaluation Metrics: ROUGE (computed on validation set every 100 steps)
- Evaluation Strategy: Step-based evaluation (`eval_steps=100`)
- Weight Decay: 0.01
- Mixed Precision (fp16): Enabled when CUDA is available
- Chunking: Implemented with `max_length=512` and `stride=256` for hierarchical input segmentation
- Target (summary) Max Length: 128 tokens
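The chunking setting above can be pictured as a sliding window over the token sequence. The sketch below is an illustration, not the training code: the function name `sliding_windows` is hypothetical, and it assumes `stride` means the number of tokens shared by consecutive windows, as Hugging Face tokenizers define it with `return_overflowing_tokens=True`. With `max_length=512` and `stride=256`, reading `stride` as the step size instead would give the same windows. Special tokens, and how the chunks are combined afterwards, are left out.

```python
def sliding_windows(token_ids, max_length=512, stride=256):
    """Split a list of token ids into overlapping windows.

    Assumption: `stride` is the overlap between consecutive windows, so each
    new window starts `max_length - stride` tokens after the previous one.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if not token_ids:
        return []

    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        # Stop once the current window reaches the end of the sequence.
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows


# Stand-in token ids; a real tokenizer would produce these.
ids = list(range(1000))
for window in sliding_windows(ids):
    print(window[0], window[-1], len(window))
```

Every token except those at the very start and end of the document falls into two windows, so a sentence cut by one chunk boundary appears whole in a neighbouring chunk.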
Dataset Description
The model was trained on a specialized dataset of Portuguese municipal meeting minutes, consisting of:
- Discussion Subjects from official municipal meeting minutes.
- Decisions and deliberations across departments (urban planning, finance, education, etc.)
- Expert-annotated summaries per discussion segment
Dataset sources include:
- Meeting minutes from six Portuguese municipalities
Limitations
- Language Restriction: The model is optimized for Portuguese; performance may degrade in other languages.
- Domain Dependence: Best suited for administrative and institutional texts; less effective on informal or creative writing.
- Length Sensitivity: Inputs longer than the tokenizer's `max_length` (1024 tokens in the usage example) are truncated; full documents may need to be split into chunks and summarized piece by piece.
- Generalization: While robust within-domain, it may underperform on unseen domains or vocabulary.
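One way to work around the length limit is to group paragraphs into chunks that fit a token budget and summarize each chunk separately. The sketch below is one possible approach, not a feature of this model. The names `pack_paragraphs` and `summarize_long` are hypothetical, the default token counter is a crude whitespace proxy, and `summarize` is any callable that maps a text to its summary.

```python
def pack_paragraphs(text, max_tokens=1024, count_tokens=None):
    """Greedily group newline-separated paragraphs into chunks within a token budget."""
    if count_tokens is None:
        # Rough proxy: whitespace-separated words. Subword tokenizers usually
        # yield more tokens than this, so pass a real counter when available.
        count_tokens = lambda s: len(s.split())

    chunks, current, used = [], [], 0
    for paragraph in (p.strip() for p in text.split("\n")):
        if not paragraph:
            continue
        size = count_tokens(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and used + size > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        # A single paragraph larger than the budget still becomes its own
        # chunk and would be truncated by the tokenizer downstream.
        current.append(paragraph)
        used += size
    if current:
        chunks.append("\n".join(current))
    return chunks


def summarize_long(text, summarize, max_tokens=1024, count_tokens=None):
    """Summarize each chunk independently and return the list of summaries."""
    return [summarize(chunk) for chunk in pack_paragraphs(text, max_tokens, count_tokens)]


# Toy run with a stand-in summarizer that keeps the first five words of each chunk.
minutes = "Ponto um da ordem de trabalhos.\nDeliberado por unanimidade.\n\nPonto dois."
print(summarize_long(minutes, lambda chunk: " ".join(chunk.split()[:5]), max_tokens=8))
```

With the model from the usage example, `count_tokens` could be `lambda s: len(tokenizer(s)["input_ids"])` and `summarize` a small wrapper around `model.generate`. Because the model was trained on individual discussion subjects, splitting along subject boundaries is likely to work better than arbitrary cuts.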
License
This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).