NSP-WikiSection-en_city: Linear Text Segmentation for Wikipedia City Articles
Model Description
NSP-WikiSection-en_city is a BERT model fine-tuned for text segmentation of Wikipedia city articles (WikiSection, English city subset). The model uses Next Sentence Prediction (NSP) to identify topic boundaries in long-form articles, making it effective for segmenting Wikipedia city pages into topical sections.
Try out the model: Hugging Face Space Demo
Key Features
- Specialized for Wikipedia city articles: Fine-tuned on the WikiSection English city subset (WikiSection_en_city)
- Fast Inference: Efficient BERT-base architecture suitable for relatively quick segmentation
- Evaluated on WikiSection_en_city: Trained and evaluated using a 10% sample of the WikiSection city subset
- Sentence-Level Segmentation: Identifies topic boundaries at sentence granularity
Model Details
- Base Model: google-bert/bert-base-uncased
- Architecture: BERT with Next Sentence Prediction head
- Parameters: 110M
- Max Sequence Length: 512 tokens
- Fine-tuning Dataset: 10% sample of the WikiSection English city subset (WikiSection_en_city)
- Fine-tuning Method: Focal Loss with boundary-aware weighting (boundary-aware sampling during training); see the sketch after this list
- Training Framework: PyTorch + Transformers
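The exact loss configuration is not published in this card. Below is a minimal sketch of a class-weighted focal loss over the two NSP labels, assuming illustrative values for the focusing parameter `gamma` and the per-class `alpha` weights (the boundary class, label 1, gets the larger weight); the actual training setup may differ.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=2.0, alpha=(0.25, 0.75)):
    """Class-weighted focal loss over the two NSP labels.

    gamma and the per-class alpha weights are illustrative, not the
    actual training configuration; label 1 ("not_next", i.e. a topic
    boundary) receives the larger weight.
    """
    log_probs = F.log_softmax(logits, dim=-1)              # (batch, 2)
    true_log_p = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    true_p = true_log_p.exp()                              # p_t per example
    weight = torch.tensor(alpha, device=logits.device)[labels]
    # (1 - p_t)^gamma down-weights easy, confidently classified pairs.
    return (-weight * (1.0 - true_p).pow(gamma) * true_log_p).mean()
```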
How It Works
The model predicts whether two consecutive sentences belong to the same topic (label 0: "is_next") or represent a topic transition (label 1: "not_next"). By applying this classifier sequentially to every consecutive sentence pair in an article, it identifies section boundaries and topical transitions; a sketch of this procedure follows the examples below.
Example (Wikipedia city article style):
Sentence A: "The city had a population of 120,000 in the 2010 census."
Sentence B: "Its economy is driven by tourism and light manufacturing."
→ Prediction: Same Topic (confidence: 73%)
Sentence A: "The museum attracts thousands of visitors annually and is a cultural landmark."
Sentence B: "In 2018 the city hosted an international film festival, which boosted local tourism."
→ Prediction: Possible Topic Boundary (confidence: 64%)
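As a sketch of this sequential procedure (the `segment` helper, the 0.5 threshold, and upstream sentence splitting are illustrative assumptions, not part of the released model):

```python
import torch

def segment(sentences, model, tokenizer, threshold=0.5):
    """Return the indices at which a new section starts.

    A boundary is placed before sentence i+1 whenever the predicted
    "not_next" probability for the pair (i, i+1) exceeds the
    threshold; the 0.5 default is illustrative, not a tuned value.
    """
    boundaries = []
    model.eval()
    for i in range(len(sentences) - 1):
        inputs = tokenizer(sentences[i], sentences[i + 1],
                           return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=1)
        if probs[0][1].item() > threshold:  # label 1 = "not_next"
            boundaries.append(i + 1)
    return boundaries
```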
Usage
Quick Start with Transformers
```python
from transformers import AutoTokenizer, AutoModelForNextSentencePrediction
import torch

# Load model and tokenizer (replace with your model path/ID)
tokenizer = AutoTokenizer.from_pretrained("anonymous15135/nsp-wikisection_en_city")
model = AutoModelForNextSentencePrediction.from_pretrained("anonymous15135/nsp-wikisection_en_city")

# Prepare input (Wikipedia city article style)
sentence_a = "The historic downtown district contains many 19th century buildings."
sentence_b = "A new pedestrian promenade was opened along the riverfront in 2015."

# Tokenize the pair into a single [CLS] A [SEP] B [SEP] input
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
probs = torch.softmax(logits, dim=1)

# Interpret results: label 0 = "is_next", label 1 = "not_next"
is_next_prob = probs[0][0].item()
not_next_prob = probs[0][1].item()
print(f"Is Next (same topic): {is_next_prob:.3f}")
print(f"Not Next (topic boundary): {not_next_prob:.3f}")

if not_next_prob > 0.5:
    print("Topic boundary detected!")
else:
    print("Same topic continues")
```
Evaluation Results
Limitations
- Domain Specificity: Best performance on Wikipedia city articles (WikiSection_en_city); performance on other domains may vary.
- Language: Optimized for English (WikiSection_en_city)
- Document Length: Designed for articles with multiple sections; very short pages may yield noisy boundary predictions
- Context Window: Limited to 512 tokens per sentence pair
- Ambiguous Boundaries: May struggle with subtle or highly subjective transitions between paragraphs
Model Card Contact
For questions or feedback, please open an issue in the model repository.
License
This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0).