NSP-WikiSection-en_city: Linear Text Segmentation for Wikipedia City Articles
Model Description
NSP-WikiSection-en_city is a BERT model fine-tuned for text segmentation of Wikipedia city articles (WikiSection, English city subset). The model uses Next Sentence Prediction (NSP) to identify topic boundaries in long-form articles, making it effective for segmenting Wikipedia city pages into topical sections.
Try out the model: Hugging Face Space Demo
Key Features
- Specialized for Wikipedia city articles: Fine-tuned on the WikiSection English city subset (WikiSection_en_city)
- Fast Inference: Efficient BERT-base architecture suitable for relatively quick segmentation
- Evaluated on WikiSection_en_city: Trained and evaluated using a 10% sample of the WikiSection city subset
- Sentence-Level Segmentation: Identifies topic boundaries at sentence granularity
Model Details
- Base Model: google-bert/bert-base-uncased
- Architecture: BERT with Next Sentence Prediction head
- Parameters: 110M
- Max Sequence Length: 512 tokens
- Fine-tuning Dataset: 10% sample of the WikiSection English city subset (WikiSection_en_city)
- Fine-tuning Method: Focal Loss with boundary-aware weighting (boundary-aware sampling during training); see the sketch after this list
- Training Framework: PyTorch + Transformers
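The exact loss configuration is not published in this card. Below is a minimal sketch of a class-weighted focal loss over the two NSP labels, assuming illustrative values for the focusing parameter `gamma` and the per-class `alpha` weights (the boundary class, label 1, gets the larger weight); the actual training setup may differ.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=2.0, alpha=(0.25, 0.75)):
    """Class-weighted focal loss over the two NSP labels.

    gamma and the per-class alpha weights are illustrative, not the
    actual training configuration; label 1 ("not_next", i.e. a topic
    boundary) receives the larger weight.
    """
    log_probs = F.log_softmax(logits, dim=-1)              # (batch, 2)
    true_log_p = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    true_p = true_log_p.exp()                              # p_t per example
    weight = torch.tensor(alpha, device=logits.device)[labels]
    # (1 - p_t)^gamma down-weights easy, confidently classified pairs.
    return (-weight * (1.0 - true_p).pow(gamma) * true_log_p).mean()
```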
How It Works
The model predicts whether two consecutive sentences belong to the same topic (label 0: "is_next") or represent a topic transition (label 1: "not_next"). By applying this classifier sequentially to every consecutive sentence pair in an article, it identifies section boundaries and topical transitions; a sketch of this procedure follows the examples below.
Example (Wikipedia city article style):
Sentence A: "The city had a population of 120,000 in the 2010 census."
Sentence B: "Its economy is driven by tourism and light manufacturing."
→ Prediction: Same Topic (confidence: 73%)
Sentence A: "The museum attracts thousands of visitors annually and is a cultural landmark."
Sentence B: "In 2018 the city hosted an international film festival, which boosted local tourism."
→ Prediction: Possible Topic Boundary (confidence: 64%)
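As a sketch of this sequential procedure (the `segment` helper, the 0.5 threshold, and upstream sentence splitting are illustrative assumptions, not part of the released model):

```python
import torch

def segment(sentences, model, tokenizer, threshold=0.5):
    """Return the indices at which a new section starts.

    A boundary is placed before sentence i+1 whenever the predicted
    "not_next" probability for the pair (i, i+1) exceeds the
    threshold; the 0.5 default is illustrative, not a tuned value.
    """
    boundaries = []
    model.eval()
    for i in range(len(sentences) - 1):
        inputs = tokenizer(sentences[i], sentences[i + 1],
                           return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=1)
        if probs[0][1].item() > threshold:  # label 1 = "not_next"
            boundaries.append(i + 1)
    return boundaries
```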
Usage
Quick Start with Transformers
```python
from transformers import AutoTokenizer, AutoModelForNextSentencePrediction
import torch

# Load model and tokenizer (replace with your model path/ID)
tokenizer = AutoTokenizer.from_pretrained("anonymous15135/nsp-wikisection_en_city")
model = AutoModelForNextSentencePrediction.from_pretrained("anonymous15135/nsp-wikisection_en_city")

# Prepare input (Wikipedia city article style)
sentence_a = "The historic downtown district contains many 19th century buildings."
sentence_b = "A new pedestrian promenade was opened along the riverfront in 2015."

# Tokenize the pair into a single [CLS] A [SEP] B [SEP] input
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
probs = torch.softmax(logits, dim=1)

# Interpret results: label 0 = "is_next", label 1 = "not_next"
is_next_prob = probs[0][0].item()
not_next_prob = probs[0][1].item()
print(f"Is Next (same topic): {is_next_prob:.3f}")
print(f"Not Next (topic boundary): {not_next_prob:.3f}")

if not_next_prob > 0.5:
    print("Topic boundary detected!")
else:
    print("Same topic continues")
```
Evaluation Results
Limitations
- Domain Specificity: Best performance on Wikipedia city articles (WikiSection_en_city); performance on other domains may vary.
- Language: Optimized for English (WikiSection_en_city)
- Document Length: Designed for articles with multiple sections; very short pages may yield noisy boundary predictions
- Context Window: Limited to 512 tokens per sentence pair
- Ambiguous Boundaries: May struggle with subtle or highly subjective transitions between paragraphs
Model Card Contact
For questions or feedback, please open an issue in the model repository.
License
This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0).