---
language:
- vi
- en
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- mathematics
- vietnamese
- binary-classification
- hard-negatives
- loss-based-early-stopping
- e5-base
- exact-retrieval
base_model: intfloat/multilingual-e5-base
metrics:
- mean_reciprocal_rank
- hit_rate
- accuracy
datasets:
- custom-vietnamese-math
---

# E5-Math-Vietnamese-Binary: Hard Negatives + Loss-based Early Stopping

## Model Overview

Fine-tuned E5-base model optimized for **exact chunk retrieval** in Vietnamese mathematics, using:

- **🎯 Binary Classification**: Correct vs. incorrect (instead of a 3-level hierarchy)
- **💪 Hard Negatives**: Related chunks serve as hard negatives for better discrimination
- **⏰ Loss-based Early Stopping**: Stops when validation loss stops improving
- **📊 Comprehensive Evaluation**: Hit@K, Accuracy@1, and MRR metrics

## Performance Summary

### Training Results

- **Best Validation Loss**: N/A
- **Training Epochs**: 10
- **Early Stopping**: ❌ Not triggered
- **Training Time**: ~4,661 seconds (~1.3 hours)

### Test Performance

🌟 **Excellent**: correct chunks are consistently ranked at the top positions.

| Metric | Base E5 | Fine-tuned | Improvement |
|--------|---------|------------|-------------|
| **MRR** | 0.7740 | 0.8505 | +0.0765 |
| **Accuracy@1** | 0.6129 | 0.7634 | +0.1505 |
| **Hit@1** | 0.6129 | 0.7634 | +0.1505 |
| **Hit@3** | 0.9462 | 0.9247 | -0.0215 |
| **Hit@5** | 1.0000 | 0.9785 | -0.0215 |

**Total Test Queries**: 93

Note that Hit@3 and Hit@5 regress slightly relative to the base model; the gains are concentrated at rank 1.

## Key Innovations

### 🎯 Binary Classification Approach

Instead of the traditional 3-level hierarchy (correct/related/irrelevant), this model uses:

- **Correct chunks**: Score 1.0 (positive examples)
- **Incorrect chunks**: Score 0.0 (includes both related and irrelevant)
- **Hard negatives**: Related chunks serve as challenging negative examples

### 💪 Hard Negatives Strategy

```python
# Training strategy
positive = correct_chunk          # Score: 1.0
hard_negative = related_chunk     # Score: 0.0 (but
                                  # semantically close)
easy_negative = irrelevant_chunk  # Score: 0.0 (semantically distant)

# This forces the model to learn fine-grained distinctions
```

### ⏰ Loss-based Early Stopping

- Monitors **validation loss** instead of MRR
- Stops when the loss stops decreasing (patience = 3)
- Prevents overfitting and saves training time

## Usage

### Basic Usage

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load model
model = SentenceTransformer('ThanhLe0125/ebd-math')

# ⚠️ CRITICAL: E5 prefixes are required
query = "query: Định nghĩa hàm số đồng biến là gì?"  # "What is the definition of an increasing function?"
chunks = [
    "passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà...",  # Should rank #1
    "passage: Ví dụ về hàm số đồng biến: f(x) = 2x + 1...",         # Related (trained as hard negative)
    "passage: Phương trình bậc hai có dạng ax² + bx + c = 0"        # Irrelevant
]

# Encode and rank
query_emb = model.encode([query])
chunk_embs = model.encode(chunks)
similarities = cosine_similarity(query_emb, chunk_embs)[0]

# Get rankings (highest similarity first)
ranked_indices = similarities.argsort()[::-1]
for rank, idx in enumerate(ranked_indices, 1):
    print(f"Rank {rank}: Score {similarities[idx]:.4f} - {chunks[idx][:50]}...")
```

### Advanced Usage with Multiple Queries

```python
def find_best_chunks(queries, chunk_pool, top_k=3):
    """Find the best chunks for multiple queries."""
    results = []
    for query in queries:
        # Ensure E5 format
        formatted_query = f"query: {query}" if not query.startswith("query:") else query
        formatted_chunks = [
            f"passage: {chunk}" if not chunk.startswith("passage:") else chunk
            for chunk in chunk_pool
        ]

        # Encode
        query_emb = model.encode([formatted_query])
        chunk_embs = model.encode(formatted_chunks)
        similarities = cosine_similarity(query_emb, chunk_embs)[0]

        # Get top K
        top_indices = similarities.argsort()[::-1][:top_k]
        top_chunks = [
            {
                'chunk': chunk_pool[i],
                'similarity': similarities[i],
                'rank': rank + 1
            }
            for rank, i in enumerate(top_indices)
        ]

        results.append({
            'query': query,
            'top_chunks':
                top_chunks
        })

    return results

# Example
queries = [
    "Công thức tính đạo hàm của hàm hợp",
    "Cách giải phương trình bậc hai",
    "Định nghĩa giới hạn của hàm số"
]

chunk_pool = [
    "Đạo hàm của hàm hợp: (f(g(x)))' = f'(g(x)) × g'(x)",
    "Giải phương trình bậc hai bằng công thức nghiệm",
    "Giới hạn của hàm số tại một điểm",
    # ... more chunks
]

results = find_best_chunks(queries, chunk_pool, top_k=3)
```

## Training Details

### Dataset

- **Domain**: Vietnamese mathematics education
- **Split**: Train/validation/test with proper separation
- **Hard Negatives**: Related mathematical concepts as challenging negatives
- **Easy Negatives**: Unrelated mathematical concepts

### Training Configuration

```python
# Training configuration
base_model = "intfloat/multilingual-e5-base"
train_batch_size = 4
learning_rate = 2e-5
max_epochs = 10
early_stopping_patience = 3
loss_function = "MultipleNegativesRankingLoss"
evaluation_metric = "validation_loss"
```

### Evaluation Methodology

1. **Training**: Binary classification with hard negatives
2. **Validation**: Loss-based monitoring for early stopping
3. **Testing**: Comprehensive evaluation with the restored 3-level hierarchy
4.
   **Metrics**: Hit@K, Accuracy@1, and MRR compared against the base model

## Model Architecture

- **Base**: intfloat/multilingual-e5-base
- **Max Sequence Length**: 256 tokens
- **Output Dimension**: 768
- **Similarity**: Cosine similarity
- **Training Loss**: MultipleNegativesRankingLoss

## Use Cases

- ✅ **Educational Q&A**: Find exact mathematical definitions and explanations
- ✅ **Content Retrieval**: Precise chunk retrieval for Vietnamese math content
- ✅ **Tutoring Systems**: Quick and accurate answer finding
- ✅ **Knowledge Base Search**: Efficient mathematical concept lookup

## Performance Interpretation

- **Hit@1 ≥ 0.7**: 🌟 Excellent (correct answer usually ranked #1)
- **Hit@3 ≥ 0.8**: 🎯 Very good (correct answer in the top 3)
- **MRR ≥ 0.7**: 👍 Good (low average rank for correct answers)
- **Accuracy@1 ≥ 0.6**: ✅ Solid (good precision for the top result)

## Limitations

- **Vietnamese-specific**: Optimized for Vietnamese mathematical terminology
- **Domain-specific**: Best performance on educational math content
- **Sequence length**: Limited to 256 tokens
- **E5 format required**: Must use the "query:" and "passage:" prefixes

## Citation

```bibtex
@misc{e5-math-vietnamese-binary,
  title={E5-Math-Vietnamese-Binary: Hard Negatives Fine-tuning for Mathematical Retrieval},
  author={ThanhLe0125},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ThanhLe0125/ebd-math}
}
```

---

*Trained on July 01, 2025, using hard negatives and loss-based early stopping for optimal retrieval performance.*
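For reference, the Hit@K, Accuracy@1, and MRR figures reported in this card follow the standard definitions. The sketch below (a hypothetical helper, not part of the model repository) shows how they are computed from the 1-based rank of the correct chunk for each query; the example ranks are illustrative, not the actual 93-query test set:

```python
def retrieval_metrics(ranks, ks=(1, 3, 5)):
    """Compute MRR and Hit@K from the 1-based rank of the correct chunk per query."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n                 # mean reciprocal rank
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}  # fraction found in top K
    return mrr, hits

# Illustrative ranks for five queries (Accuracy@1 is identical to Hit@1)
mrr, hits = retrieval_metrics([1, 2, 1, 3, 1])
print(f"MRR={mrr:.4f}  Hit@1={hits[1]:.2f}  Hit@3={hits[3]:.2f}  Hit@5={hits[5]:.2f}")
# → MRR=0.7667  Hit@1=0.60  Hit@3=1.00  Hit@5=1.00
```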