Qwen 2.5 14B - CIM Spatial SQL (Q2SQL)
Fine-tuned Qwen 2.5 14B Instruct model for generating PostGIS spatial SQL queries for City Information Modeling (CIM) databases.
Model Details
Model Description
- Developed by: Ali Taherdoust (Politecnico di Torino)
- Model type: Causal Language Model (Decoder-only)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen2.5-14B-Instruct
Model Sources
- Repository: GitHub Repository
- Training Dataset: taherdoust/ai4cimdb (176K samples, 88K training)
- Base Model: Qwen/Qwen2.5-14B-Instruct
Uses
Direct Use
This model is designed for generating PostGIS spatial SQL queries from natural language questions for City Information Modeling (CIM) databases. It can be used to:
- Convert natural language questions to PostGIS SQL queries (see the illustrative example after this list)
- Query spatial urban databases (buildings, census, raster, network data)
- Support non-technical users in accessing CIM databases
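For illustration, here is the kind of input question and PostGIS SQL output involved. The table, column, and value names below are assumptions based on the schema summary in this card, not verbatim model output:

question = "How many buildings taller than 20 meters fall inside census section 12345?"

# Illustrative target query; actual table/column names follow the CIM schema
expected_sql = """
SELECT COUNT(*)
FROM cim_vector.building AS b
JOIN cim_census.census_section AS c
  ON ST_Within(b.geometry, c.geometry)
WHERE c.sez2011 = 12345
  AND b.height > 20;
"""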
Downstream Use
- Integration into CIM Wizard framework for natural language database interfaces
- Educational tools for teaching spatial SQL
- Research on text-to-SQL for specialized domains
Out-of-Scope Use
- General-purpose SQL generation (not trained on standard SQL databases)
- Non-spatial SQL queries (optimized for PostGIS)
- Other natural language tasks (specialized for SQL generation)
Bias, Risks, and Limitations
- Domain-Specific: Trained specifically on CIM/PostGIS queries. May not perform well on standard SQL databases.
- Spatial Functions: Requires PostGIS extension. Queries may fail on non-PostGIS databases.
- Schema Dependency: Trained on specific CIM schema (cim_vector, cim_census, cim_raster, cim_network). Performance may degrade on different schemas.
- Language: Trained on English questions only.
- Accuracy: 85-92% execution accuracy. Generated SQL should be validated before production use.
Recommendations
Users should:
- Validate generated SQL queries before execution (a minimal validation sketch follows this list)
- Test on their specific database schema
- Use in controlled environments with proper error handling
- Review complex queries manually
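As a concrete way to follow the first recommendation, the sketch below dry-runs a generated query with PostgreSQL's EXPLAIN, which plans the statement without executing it. psycopg2 is assumed to be installed (pip install psycopg2-binary); the connection string and the table name in the example are placeholders for your own PostGIS database:

import psycopg2

def validate_sql(sql: str, dsn: str) -> bool:
    """Return True if the query parses and plans cleanly on the target database."""
    try:
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute("EXPLAIN " + sql)  # plans but does not execute the query
        return True
    except psycopg2.Error as exc:
        print(f"Validation failed: {exc}")
        return False

# Placeholder DSN; adjust to your environment
print(validate_sql("SELECT COUNT(*) FROM cim_vector.building;", "dbname=cim user=cim"))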
How to Get Started with the Model
Installation
pip install transformers peft accelerate bitsandbytes
Basic Usage
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Model identifiers
base_model = "Qwen/Qwen2.5-14B-Instruct"
model_name = "taherdoust/qwen25-14b-cim-q2sql"

# 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the quantized base model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load the fine-tuned LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Generate SQL from a natural language question
question = "Find all buildings within 500 meters of the city center"
prompt = f"""<|im_start|>system
You are an expert in PostGIS spatial SQL for City Information Modeling (CIM).
Your task is to generate precise PostGIS spatial SQL queries for the CIM Wizard database.
Database Schema:
- cim_vector: Building geometries, project scenarios, grid infrastructure
- cim_census: Italian census demographic data (ISTAT 2011)
- cim_raster: DTM/DSM raster data
- cim_network: Electrical grid network data
Generate only the SQL query without explanations.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # required for temperature to take effect
    temperature=0.1
)
# Decode only the newly generated tokens, dropping the prompt and special tokens
sql = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(sql)
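The prompt above writes Qwen's ChatML markers by hand. If the fine-tuned tokenizer retains Qwen's chat template (a reasonable assumption for a LoRA fine-tune of Qwen2.5-Instruct), tokenizer.apply_chat_template builds the same structure more robustly:

messages = [
    {"role": "system", "content": "You are an expert in PostGIS spatial SQL for City Information Modeling (CIM). Generate only the SQL query without explanations."},
    {"role": "user", "content": question},
]
# Produces the <|im_start|>...<|im_end|> layout shown above, ending with the assistant header
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)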
Training Details
Training Data
- Dataset: taherdoust/ai4cimdb
- Training Samples: 88,480
- Validation Samples: 18,960
- Test Samples: 18,960
- Total Dataset Size: 176,837 samples (126,400 after curation)
- Data Quality: 99.7% quality acceptance rate, 99.57% NoErr rate
Training Procedure
- Method: QLoRA (Quantized Low-Rank Adaptation); a configuration sketch follows this list
- Quantization: 4-bit NF4
- LoRA Rank: 16
- LoRA Alpha: 32
- LoRA Dropout: 0.1
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
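These settings translate directly into a PEFT LoraConfig. A minimal sketch (bias and task_type are shown with the usual defaults for causal LM fine-tuning; they are not stated in this card):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                     # LoRA rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",              # assumption: standard choice, not stated above
    task_type="CAUSAL_LM",
)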
Training Hyperparameters
- Training regime: bf16 mixed precision
- Epochs: 3
- Batch Size: 2 per device
- Gradient Accumulation Steps: 8 (effective batch size: 16)
- Learning Rate: 1.5e-4
- Learning Rate Scheduler: Cosine with warmup
- Warmup Ratio: 0.1
- Weight Decay: 0.01
- Max Sequence Length: 2048 tokens
- Optimizer: paged_adamw_8bit
- Mixed Precision: bfloat16
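For reference, the hyperparameters above map onto a transformers TrainingArguments along these lines; the output directory is a placeholder and dataset/trainer wiring is omitted:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen25-14b-cim-q2sql-lora",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,           # effective batch size: 2 x 8 = 16
    learning_rate=1.5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,                               # bfloat16 mixed precision
    optim="paged_adamw_8bit",
)
# The max sequence length (2048) is applied at tokenization / SFT-trainer level,
# not through TrainingArguments.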
Speeds, Sizes, Times
- Training Time: ~30-40 hours (on NVIDIA RTX 3090)
- Model Size: ~200 MB (LoRA adapter) + ~27 GB (base model weights in bf16; substantially less VRAM when loaded with 4-bit quantization)
- Trainable Parameters: 0.46% of total parameters
- Final Evaluation Loss: ~0.088
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Test Set: 18,960 samples from taherdoust/ai4cimdb
- Evaluation Benchmark: 100 samples (weighted stratified by difficulty)
- Database: CIM Wizard Integrated (PostgreSQL 15 + PostGIS 3.4)
Metrics
- Execution Accuracy (EX): Percentage of queries that execute without errors
- Exact Match (EM): Percentage of queries identical to ground truth
- Valid Efficiency Score (VES): Percentage of queries returning correct results
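A minimal sketch of how EX and EM can be scored against a live database (psycopg2 and a reachable PostGIS instance are assumed; VES requires comparing results and execution behavior and is omitted here):

import psycopg2

def score(pairs, dsn):
    """pairs: list of (generated_sql, gold_sql). Returns (EX, EM) as fractions."""
    executed, exact = 0, 0
    with psycopg2.connect(dsn) as conn:
        for gen_sql, gold_sql in pairs:
            exact += int(gen_sql.strip() == gold_sql.strip())  # EM: literal match
            try:
                with conn.cursor() as cur:
                    cur.execute(gen_sql)       # EX: executes without an error
                    cur.fetchall()
                executed += 1
            except psycopg2.Error:
                conn.rollback()                # clear the aborted transaction
    n = len(pairs)
    return executed / n, exact / n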
Results
Summary
The model achieves strong performance on PostGIS spatial SQL generation:
- Execution Accuracy (EX): 85-92% (first-shot)
- Exact Match (EM): 40-50%
- Spatial Functions: 88-93% accuracy (ST_Within, ST_Intersects, ST_Distance, etc.)
- Multi-Schema Queries: 85-92% accuracy (cim_vector, cim_census, cim_raster joins)
- Domain Terminology: 88-92% accuracy (SEZ2011, TABULA, E8-E16 census codes)
Technical Specifications
Model Architecture and Objective
- Base Architecture: Qwen 2.5 (Transformer Decoder)
- Parameters: 14 billion (base) + ~68M trainable (LoRA adapter, the 0.46% noted above)
- Context Window: 128K tokens
- Attention: Grouped-Query Attention (GQA) with enhanced KV cache
- Position Encoding: RoPE (Rotary Position Embeddings)
Compute Infrastructure
Hardware
- GPU: NVIDIA RTX 3090 (24 GB VRAM)
- Training Framework: HuggingFace Transformers + PEFT
- Quantization: BitsAndBytes 4-bit NF4
Software
- Framework Versions:
  - PEFT: 0.12.0
  - Transformers: 4.37 or later (minimum version supporting the Qwen2 architecture)
  - PyTorch: 2.x with CUDA 12.1
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA RTX 3090
- Hours used: ~30-40 hours
- Cloud Provider: Local GPU server (ipazia)
- Compute Region: Italy
- Carbon Emitted: Estimated ~1.5-2 kg CO2 (based on RTX 3090 power consumption)
Citation
BibTeX:
@misc{qwen25-14b-cim-q2sql,
  author       = {Taherdoust, Ali},
  title        = {Qwen 2.5 14B Fine-tuned for CIM Spatial SQL Generation},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/taherdoust/qwen25-14b-cim-q2sql}},
  note         = {Fine-tuned for PostGIS spatial SQL queries in City Information Modeling}
}
APA:
Taherdoust, A. (2025). Qwen 2.5 14B Fine-tuned for CIM Spatial SQL Generation. HuggingFace. https://huggingface.co/taherdoust/qwen25-14b-cim-q2sql
Model Card Authors
Ali Taherdoust (Politecnico di Torino)
Model Card Contact
For questions or issues, please open an issue on the GitHub repository or contact: [email protected]
Acknowledgments
- Base model: Alibaba Qwen 2.5
- Training dataset: Generated through three-stage pipeline (rule-based templates, CTGAN synthesis, GPT-4o-mini augmentation)
- Infrastructure: Politecnico di Torino GPU computing resources