Pulse Core 1 - Vietnamese Sentiment Analysis System

A machine learning-based sentiment analysis system for Vietnamese text. It is built on a TF-IDF feature extraction pipeline combined with classical machine learning classifiers. With Support Vector Classification (SVC), it achieves 71.14% accuracy on the VLSP2016 general sentiment dataset and 71.72% accuracy on the UTS2017_Bank banking aspect sentiment dataset.

📋 View Detailed System Card for comprehensive model documentation, performance analysis, and limitations.

Model Description

Pulse Core 1 is a versatile Vietnamese sentiment analysis system that supports both general sentiment classification and specialized banking aspect sentiment analysis. The system can analyze general Vietnamese text sentiment (positive/negative/neutral) and banking-specific aspect sentiment (combining banking aspects with sentiment polarities). It's designed for Vietnamese text analysis across multiple domains, with specialized capabilities for banking customer feedback analysis and financial service categorization.

Model Architecture

Algorithm: TF-IDF + SVC/Logistic Regression Pipeline
Feature Extraction: CountVectorizer with 20,000 max features
N-gram Support: Unigram and bigram (1-2)
TF-IDF: Transformation with IDF weighting
Classifier: Support Vector Classification (SVC) / Logistic Regression with optimized parameters
Framework: scikit-learn ≥1.6
Caching System: Hash-based caching for efficient processing
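
The components listed above map naturally onto a scikit-learn Pipeline. The following is a minimal sketch under that assumption, not the project's actual train.py: the step names, the toy sentences, and the use of SVC(kernel="linear") for the svc_linear option are all illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumed layout: raw n-gram counts -> TF-IDF weighting -> linear SVC.
pipeline = Pipeline([
    ("vect", CountVectorizer(max_features=20000, ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("clf", SVC(kernel="linear")),
])

# Toy training data for illustration only (not drawn from VLSP2016).
texts = [
    "Sản phẩm này rất tốt",
    "Tôi rất hài lòng với dịch vụ",
    "Chất lượng quá tệ",
    "Tôi rất thất vọng về sản phẩm",
]
labels = ["positive", "positive", "negative", "negative"]

pipeline.fit(texts, labels)
prediction = pipeline.predict(["Dịch vụ rất tốt"])[0]
print(prediction)
```

Swapping the last step for LogisticRegression would give the other classifier named above without touching the feature extraction steps.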

Supported Datasets & Categories

VLSP2016 Dataset - General Sentiment Analysis (3 classes)

Sentiment Categories:

positive - Positive sentiment towards products/services
negative - Negative sentiment towards products/services
neutral - Neutral or mixed sentiment

Dataset Statistics:

Training samples: 5,100 (1,700 per class)
Test samples: 1,050 (350 per class)
Balanced distribution across all sentiment classes
Domain: General product and service reviews

UTS2017_Bank Dataset - Banking Aspect Sentiment (35 combined classes)

Banking Aspects:

ACCOUNT - Account services
CARD - Card services
CUSTOMER_SUPPORT - Customer support
DISCOUNT - Discount offers
INTEREST_RATE - Interest rate information
INTERNET_BANKING - Internet banking services
LOAN - Loan services
MONEY_TRANSFER - Money transfer services
OTHER - Other services
PAYMENT - Payment services
PROMOTION - Promotional offers
SAVING - Savings accounts
SECURITY - Security features
TRADEMARK - Trademark/branding

Sentiments:

positive - Positive sentiment
negative - Negative sentiment
neutral - Neutral sentiment

Combined Labels: The model predicts combined aspect-sentiment labels in the format <aspect>#<sentiment>, such as:

CUSTOMER_SUPPORT#negative - Negative feedback about customer support
LOAN#positive - Positive opinion about loan services
TRADEMARK#positive - Positive brand perception
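
Downstream code often needs the aspect and the sentiment separately. The helpers below are a small sketch of how the <aspect>#<sentiment> format could be built and parsed; join_label and split_label are hypothetical names, not part of the project. Note that 14 aspects and 3 sentiments allow 42 combinations, so the 35 classes are presumably the combinations that actually occur in the data.

```python
from typing import Tuple

SENTIMENTS = ("positive", "negative", "neutral")


def join_label(aspect: str, sentiment: str) -> str:
    """Build a combined label such as 'LOAN#positive'."""
    return f"{aspect}#{sentiment}"


def split_label(label: str) -> Tuple[str, str]:
    """Split a combined label into (aspect, sentiment)."""
    aspect, sentiment = label.rsplit("#", 1)
    if sentiment not in SENTIMENTS:
        raise ValueError(f"unknown sentiment: {sentiment!r}")
    return aspect, sentiment


aspect, sentiment = split_label("CUSTOMER_SUPPORT#negative")
print(aspect, sentiment)
```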

Installation

pip install "scikit-learn>=1.6" joblib

Usage

Training the Model

Dataset Selection and Training

VLSP2016 Dataset (General Sentiment Analysis):

# Train on VLSP2016 with Logistic Regression
python train.py --dataset vlsp2016 --model logistic

# Train with SVC for better performance
python train.py --dataset vlsp2016 --model svc_linear

# Compare n-gram ranges
python train.py --dataset vlsp2016 --model svc_linear --ngram-min 1 --ngram-max 2
python train.py --dataset vlsp2016 --model svc_linear --ngram-min 1 --ngram-max 3

# Export model for deployment
python train.py --dataset vlsp2016 --model svc_linear --export-model

UTS2017_Bank Dataset (Banking Aspect Sentiment Analysis):

# Train on UTS2017_Bank (default dataset)
python train.py --dataset uts2017 --model logistic

# Train with SVC for better performance
python train.py --dataset uts2017 --model svc_linear

# With specific parameters
python train.py --dataset uts2017 --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2

# Export model for deployment
python train.py --dataset uts2017 --model logistic --export-model

# Compare multiple models on specific dataset
python train.py --dataset vlsp2016 --compare-models logistic svc_linear

Training from Scratch

from train import train_notebook

# Train VLSP2016 general sentiment model
results = train_notebook(
    dataset="vlsp2016",
    model_name="svc_linear",
    max_features=20000,
    ngram_min=1,
    ngram_max=2,
    export_model=True
)

# Train UTS2017_Bank aspect sentiment model
results = train_notebook(
    dataset="uts2017",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2,
    export_model=True
)

# Compare multiple models on VLSP2016
comparison_results = train_notebook(
    dataset="vlsp2016",
    compare=True
)

Performance Metrics

VLSP2016 General Sentiment Analysis Performance

Training Accuracy: 94.57% (SVC Linear)
Test Accuracy: 71.14% (SVC Linear, 1-2 ngram) / 70.67% (SVC Linear, 1-3 ngram) / 70.19% (Logistic Regression)
Training Samples: 5,100 (balanced: 1,700 per class)
Test Samples: 1,050 (balanced: 350 per class)
Number of Classes: 3 sentiment polarities
Training Time: ~24.95 seconds (SVC) / 0.75 seconds (LR)
Per-Class Performance (SVC Linear):
- Positive: 80% precision, 72% recall, 76% F1-score
- Negative: 70% precision, 72% recall, 71% F1-score
- Neutral: 65% precision, 69% recall, 67% F1-score
Key Insights: Per-class F1-scores stay within a 9-point range (67-76%), which is consistent with training on a balanced dataset
Optimal N-gram: Bigrams (1-2) outperform trigrams (1-3) by 0.47 percentage points
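
The per-class F1-scores above follow from the reported precision and recall, since F1 is their harmonic mean. The short check below recomputes them from the rounded percentages, so small rounding differences against the project's own reports are possible.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class (precision, recall) as reported for SVC Linear on VLSP2016.
reported = {
    "positive": (0.80, 0.72),
    "negative": (0.70, 0.72),
    "neutral": (0.65, 0.69),
}

f1_by_class = {name: f1_score(p, r) for name, (p, r) in reported.items()}
for name, f1 in f1_by_class.items():
    print(f"{name}: {f1:.2f}")

# With 350 test samples per class, macro and weighted averages coincide.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)
print(f"macro F1: {macro_f1:.2f}")
```

The recomputed values round to 0.76, 0.71, and 0.67, matching the figures listed above, with an average of about 0.71.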

UTS2017_Bank Aspect Sentiment Analysis Performance

Training Accuracy: 94.57% (SVC)
Test Accuracy: 71.72% (SVC) / 68.18% (Logistic Regression)
Training Samples: 1,581
Test Samples: 396
Number of Classes: 35 aspect-sentiment combinations
Training Time: ~5.3 seconds (SVC) / 2.13 seconds (LR)
Best Performing Classes:
- TRADEMARK#positive: 90% F1-score
- CUSTOMER_SUPPORT#positive: 88% F1-score
- LOAN#negative: 67% F1-score (SVC improvement over LR)
- CUSTOMER_SUPPORT#negative: 65% F1-score
Challenges: Class imbalance affects minority aspect-sentiment combinations
Key Finding: SVC predicts a wider range of aspect-sentiment classes than Logistic Regression
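
The sample counts above are consistent with the documented default split_ratio of 0.2. The sketch below checks that arithmetic, assuming the split rounds the way scikit-learn's train_test_split does; whether train.py actually uses that function is an assumption.

```python
import math

train_samples = 1581
test_samples = 396
total = train_samples + test_samples  # 1977 labelled samples overall

split_ratio = 0.2  # documented default, only used for uts2017

# Convention assumed here: the test size is rounded up and the
# remaining samples go to the training set.
expected_test = math.ceil(split_ratio * total)
expected_train = total - expected_test

print(expected_train, expected_test)
```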

Cross-Dataset Performance Analysis

Consistent SVC Performance: ~71% accuracy on both 3-class (VLSP2016) and 35-class (UTS2017_Bank) tasks
Balance Impact: Balanced datasets (VLSP2016) yield consistent per-class results while imbalanced datasets create performance variations
Training Efficiency: Larger balanced datasets require more training time but provide stable results
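
To put the accuracy figures in perspective, it helps to convert them back into counts of correctly classified test samples. The sketch below does this with the reported accuracies and test-set sizes; the counts are derived by rounding and are not figures published by the project.

```python
# Reported test accuracies and test-set sizes from the sections above.
results = {
    ("VLSP2016", "SVC Linear, 1-2 ngram"): (0.7114, 1050),
    ("VLSP2016", "SVC Linear, 1-3 ngram"): (0.7067, 1050),
    ("VLSP2016", "Logistic Regression"): (0.7019, 1050),
    ("UTS2017_Bank", "SVC"): (0.7172, 396),
    ("UTS2017_Bank", "Logistic Regression"): (0.6818, 396),
}

correct_counts = {}
for (dataset, model), (accuracy, n_test) in results.items():
    # Round to the nearest whole sample, since accuracy = correct / n_test.
    correct = round(accuracy * n_test)
    correct_counts[(dataset, model)] = correct
    print(f"{dataset} / {model}: {correct} of {n_test} correct")
```

Read this way, the 0.47-point gap between the 1-2 and 1-3 n-gram ranges on VLSP2016 corresponds to about 5 test samples, so it should be treated as a small difference.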

Using the Pre-trained Models

Local Model (Vietnamese Banking Aspect Sentiment Analysis)

import joblib

# Load VLSP2016 general sentiment model
general_model = joblib.load("vlsp2016_sentiment_20250929_075529.joblib")

# Load UTS2017_Bank aspect sentiment model
banking_model = joblib.load("uts2017_sentiment_20250928_131716.joblib")

# Use the predict_text helper from the inference script
from inference import predict_text

# General sentiment analysis
general_text = "Sản phẩm này rất tốt, tôi rất hài lòng"
prediction, confidence, top_predictions = predict_text(general_model, general_text)
print(f"General Sentiment: {prediction}")  # Expected: positive

# Banking aspect sentiment analysis
bank_text = "Lãi suất vay mua nhà hiện tại quá cao"
prediction, confidence, top_predictions = predict_text(banking_model, bank_text)
print(f"Banking Aspect-Sentiment: {prediction}")  # Expected: INTEREST_RATE#negative

print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f"  {i}. {category}: {prob:.3f}")

# Example output for banking text:
# Banking Aspect-Sentiment: INTEREST_RATE#negative
# Confidence: 0.509
# Top 3 predictions:
#   1. INTEREST_RATE#negative: 0.509
#   2. LOAN#negative: 0.218
#   3. CUSTOMER_SUPPORT#negative: 0.095
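
The (prediction, confidence, top_predictions) triple shown above can be understood as a ranking over per-class probabilities. The function below is a sketch of that idea only; top_k is a hypothetical helper and does not reproduce the internals of predict_text.

```python
from typing import Dict, List, Tuple


def top_k(
    probabilities: Dict[str, float], k: int = 3
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """Return (best label, its probability, the k most probable labels)."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    best_label, best_prob = ranked[0]
    return best_label, best_prob, ranked[:k]


# Probabilities taken from the example output above. The remaining
# classes are omitted, so these values do not sum to 1.
example = {
    "LOAN#negative": 0.218,
    "INTEREST_RATE#negative": 0.509,
    "CUSTOMER_SUPPORT#negative": 0.095,
}

prediction, confidence, top_predictions = top_k(example)
print(prediction, confidence)
```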

Using the Inference Script

# Interactive mode
python inference.py

# Single prediction
python inference.py --text "Lãi suất vay mua nhà hiện tại quá cao"

# Test with examples
python inference.py --test-examples

# List available models
python inference.py --list-models

Model Parameters

dataset: Dataset selection ("vlsp2016" for general sentiment, "uts2017" for banking aspect sentiment)
model: Model type ("logistic", "svc_linear", "svc_rbf", "naive_bayes", "decision_tree", "random_forest", etc.)
max_features: Maximum number of TF-IDF features (default: 20000)
ngram_min/max: N-gram range (default: 1-2, which performed best on VLSP2016)
split_ratio: Fraction of the data held out for testing (default: 0.2, only used for uts2017)
n_samples: Optional sample limit for quick testing
export_model: Export model for deployment (creates <dataset>_sentiment_<timestamp>.joblib)
compare: Compare multiple model configurations
compare_models: Specify models to compare
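
Exported files follow the <dataset>_sentiment_<timestamp>.joblib pattern, which makes it possible to recover the dataset and export time from a file name. The sketch below assumes the timestamp layout is YYYYMMDD_HHMMSS, as inferred from the example file names in this document; export_filename and parse_export_filename are hypothetical helpers.

```python
from datetime import datetime
from typing import Tuple

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"  # inferred from the example file names
SUFFIX = ".joblib"


def export_filename(dataset: str, when: datetime) -> str:
    """Build a file name such as 'vlsp2016_sentiment_20250929_075529.joblib'."""
    return f"{dataset}_sentiment_{when.strftime(TIMESTAMP_FORMAT)}{SUFFIX}"


def parse_export_filename(filename: str) -> Tuple[str, datetime]:
    """Recover (dataset, export time) from an exported model file name."""
    if not filename.endswith(SUFFIX):
        raise ValueError(f"not a joblib export: {filename!r}")
    stem = filename[: -len(SUFFIX)]
    dataset, timestamp = stem.split("_sentiment_", 1)
    return dataset, datetime.strptime(timestamp, TIMESTAMP_FORMAT)


dataset, exported_at = parse_export_filename(
    "vlsp2016_sentiment_20250929_075529.joblib"
)
print(dataset, exported_at.isoformat())
```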

Project Management

Cleanup Utility

The project includes a cleanup script to manage training runs:

# Preview runs that will be deleted (without exported models)
uv run python clean.py --dry-run --verbose

# Clean up runs without exported models
uv run python clean.py --yes

# Interactive cleanup with confirmation
uv run python clean.py

Features:

Automatically identifies runs without exported model files
Shows space that will be freed
Dry-run mode for safe previewing
Detailed information about each run
Preserves runs with exported models

Limitations

Language Specificity: Only works with Vietnamese text
Domain Coverage: Two specialized domains (general sentiment + banking aspect sentiment)
Feature Limitations: Vocabulary is limited to the 20,000 most frequent n-gram features by default
Class Imbalance Sensitivity: Performance degrades significantly with imbalanced datasets (evident in UTS2017_Bank)
Specific Weaknesses:
- VLSP2016: Minor performance variation between sentiment classes
- UTS2017_Bank: Poor performance on minority aspect-sentiment classes due to insufficient training data
- N-gram Limitation: Extending the n-gram range to trigrams (1-3) did not improve on bigrams (1-2) on VLSP2016 (70.67% vs. 71.14% test accuracy) while increasing computational cost
- Banking domain aspects limited to predefined categories (account, loan, card, etc.)

Ethical Considerations

Dataset Bias: Models reflect biases present in training datasets (VLSP2016 general reviews, UTS2017_Bank banking feedback)
Performance Variation: Significant performance differences between balanced (VLSP2016) and imbalanced (UTS2017_Bank) datasets
Domain Validation: Should be validated on target domain before deployment
Class Imbalance: Consider dataset balance when interpreting results, especially for banking aspect sentiment
Representation: VLSP2016 provides more equitable performance across sentiment classes due to balanced training data

Citation

If you use this model, please cite:

@misc{undertheseanlp_2025,
    author       = { Vu Anh },
    organization = { UnderTheSea NLP },
    title        = { Pulse Core 1 - Vietnamese Sentiment Analysis System },
    year         = 2025,
    url          = { https://huggingface.co/undertheseanlp/pulse_core_1 },
    doi          = { 10.57967/hf/6605 },
    publisher    = { Hugging Face }
}

Evaluation Results (self-reported)

Test Accuracy (SVC Linear) on VLSP2016: 0.711
Test Accuracy (Logistic Regression) on VLSP2016: 0.702
Weighted F1-Score (SVC) on VLSP2016: 0.713
Weighted F1-Score (Logistic Regression) on VLSP2016: 0.703
Test Accuracy (SVC) on UTS2017_Bank: 0.717
Test Accuracy (Logistic Regression) on UTS2017_Bank: 0.682
Weighted Precision (SVC) on UTS2017_Bank: 0.650
Weighted Recall (SVC) on UTS2017_Bank: 0.720
Weighted F1-Score (SVC) on UTS2017_Bank: 0.660
Weighted F1-Score (Logistic Regression) on UTS2017_Bank: 0.660
