Qwen 2.5 14B - CIM Spatial SQL (Q2SQL)
Fine-tuned Qwen 2.5 14B Instruct model for generating PostGIS spatial SQL queries for City Information Modeling (CIM) databases.
Model Details
Model Description
- Developed by: Ali Taherdoust (Politecnico di Torino)
- Model type: Causal Language Model (Decoder-only)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen2.5-14B-Instruct
Model Sources
- Repository: GitHub Repository
- Training Dataset: taherdoust/ai4cimdb (176K samples, 88K training)
- Base Model: Qwen/Qwen2.5-14B-Instruct
Uses
Direct Use
This model is designed for generating PostGIS spatial SQL queries from natural language questions for City Information Modeling (CIM) databases. It can be used to:
- Convert natural language questions to PostGIS SQL queries (see the illustrative example after this list)
- Query spatial urban databases (buildings, census, raster, network data)
- Support non-technical users in accessing CIM databases
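For illustration, here is the kind of input question and PostGIS SQL output involved. The table, column, and value names below are assumptions based on the schema summary in this card, not verbatim model output:

question = "How many buildings taller than 20 meters fall inside census section 12345?"

# Illustrative target query; actual table/column names follow the CIM schema
expected_sql = """
SELECT COUNT(*)
FROM cim_vector.building AS b
JOIN cim_census.census_section AS c
  ON ST_Within(b.geometry, c.geometry)
WHERE c.sez2011 = 12345
  AND b.height > 20;
"""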
Downstream Use
- Integration into CIM Wizard framework for natural language database interfaces
- Educational tools for teaching spatial SQL
- Research on text-to-SQL for specialized domains
Out-of-Scope Use
- General-purpose SQL generation (not trained on standard SQL databases)
- Non-spatial SQL queries (optimized for PostGIS)
- Other natural language tasks (specialized for SQL generation)
Bias, Risks, and Limitations
- Domain-Specific: Trained specifically on CIM/PostGIS queries. May not perform well on standard SQL databases.
- Spatial Functions: Requires PostGIS extension. Queries may fail on non-PostGIS databases.
- Schema Dependency: Trained on specific CIM schema (cim_vector, cim_census, cim_raster, cim_network). Performance may degrade on different schemas.
- Language: Trained on English questions only.
- Accuracy: 85-92% execution accuracy. Generated SQL should be validated before production use.
Recommendations
Users should:
- Validate generated SQL queries before execution (a minimal validation sketch follows this list)
- Test on their specific database schema
- Use in controlled environments with proper error handling
- Review complex queries manually
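As a concrete way to follow the first recommendation, the sketch below dry-runs a generated query with PostgreSQL's EXPLAIN, which plans the statement without executing it. psycopg2 is assumed to be installed (pip install psycopg2-binary); the connection string and the table name in the example are placeholders for your own PostGIS database:

import psycopg2

def validate_sql(sql: str, dsn: str) -> bool:
    """Return True if the query parses and plans cleanly on the target database."""
    try:
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute("EXPLAIN " + sql)  # plans but does not execute the query
        return True
    except psycopg2.Error as exc:
        print(f"Validation failed: {exc}")
        return False

# Placeholder DSN; adjust to your environment
print(validate_sql("SELECT COUNT(*) FROM cim_vector.building;", "dbname=cim user=cim"))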
How to Get Started with the Model
Installation
pip install transformers peft accelerate bitsandbytes
Basic Usage
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Model identifiers
base_model = "Qwen/Qwen2.5-14B-Instruct"
model_name = "taherdoust/qwen25-14b-cim-q2sql"

# 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the quantized base model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load the fine-tuned LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Generate SQL from a natural language question
question = "Find all buildings within 500 meters of the city center"
prompt = f"""<|im_start|>system
You are an expert in PostGIS spatial SQL for City Information Modeling (CIM).
Your task is to generate precise PostGIS spatial SQL queries for the CIM Wizard database.
Database Schema:
- cim_vector: Building geometries, project scenarios, grid infrastructure
- cim_census: Italian census demographic data (ISTAT 2011)
- cim_raster: DTM/DSM raster data
- cim_network: Electrical grid network data
Generate only the SQL query without explanations.<|im_end|>
<|im_start|>user
{question}<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # required for temperature to take effect
    temperature=0.1
)
# Decode only the newly generated tokens, dropping the prompt and special tokens
sql = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(sql)
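The prompt above writes Qwen's ChatML markers by hand. If the fine-tuned tokenizer retains Qwen's chat template (a reasonable assumption for a LoRA fine-tune of Qwen2.5-Instruct), tokenizer.apply_chat_template builds the same structure more robustly:

messages = [
    {"role": "system", "content": "You are an expert in PostGIS spatial SQL for City Information Modeling (CIM). Generate only the SQL query without explanations."},
    {"role": "user", "content": question},
]
# Produces the <|im_start|>...<|im_end|> layout shown above, ending with the assistant header
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)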
Training Details
Training Data
- Dataset: taherdoust/ai4cimdb
- Training Samples: 88,480
- Validation Samples: 18,960
- Test Samples: 18,960
- Total Dataset Size: 176,837 samples (126,400 after curation)
- Data Quality: 99.7% quality acceptance rate, 99.57% NoErr rate
Training Procedure
- Method: QLoRA (Quantized Low-Rank Adaptation); a configuration sketch follows this list
- Quantization: 4-bit NF4
- LoRA Rank: 16
- LoRA Alpha: 32
- LoRA Dropout: 0.1
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
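These settings translate directly into a PEFT LoraConfig. A minimal sketch (bias and task_type are shown with the usual defaults for causal LM fine-tuning; they are not stated in this card):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                     # LoRA rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",              # assumption: standard choice, not stated above
    task_type="CAUSAL_LM",
)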
Training Hyperparameters
- Training regime: bf16 mixed precision
- Epochs: 3
- Batch Size: 2 per device
- Gradient Accumulation Steps: 8 (effective batch size: 16)
- Learning Rate: 1.5e-4
- Learning Rate Scheduler: Cosine with warmup
- Warmup Ratio: 0.1
- Weight Decay: 0.01
- Max Sequence Length: 2048 tokens
- Optimizer: paged_adamw_8bit
- Mixed Precision: bfloat16
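For reference, the hyperparameters above map onto a transformers TrainingArguments along these lines; the output directory is a placeholder and dataset/trainer wiring is omitted:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen25-14b-cim-q2sql-lora",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,           # effective batch size: 2 x 8 = 16
    learning_rate=1.5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,                               # bfloat16 mixed precision
    optim="paged_adamw_8bit",
)
# The max sequence length (2048) is applied at tokenization / SFT-trainer level,
# not through TrainingArguments.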
Speeds, Sizes, Times
- Training Time: ~30-40 hours (on NVIDIA RTX 3090)
- Model Size: ~200 MB (LoRA adapter) + ~27 GB (base model weights in bf16; substantially less VRAM when loaded with 4-bit quantization)
- Trainable Parameters: 0.46% of total parameters
- Final Evaluation Loss: ~0.088
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Test Set: 18,960 samples from taherdoust/ai4cimdb
- Evaluation Benchmark: 100 samples (weighted stratified by difficulty)
- Database: CIM Wizard Integrated (PostgreSQL 15 + PostGIS 3.4)
Metrics
- Execution Accuracy (EX): Percentage of queries that execute without errors
- Exact Match (EM): Percentage of queries identical to ground truth
- Valid Efficiency Score (VES): Percentage of queries returning correct results
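A minimal sketch of how EX and EM can be scored against a live database (psycopg2 and a reachable PostGIS instance are assumed; VES requires comparing results and execution behavior and is omitted here):

import psycopg2

def score(pairs, dsn):
    """pairs: list of (generated_sql, gold_sql). Returns (EX, EM) as fractions."""
    executed, exact = 0, 0
    with psycopg2.connect(dsn) as conn:
        for gen_sql, gold_sql in pairs:
            exact += int(gen_sql.strip() == gold_sql.strip())  # EM: literal match
            try:
                with conn.cursor() as cur:
                    cur.execute(gen_sql)       # EX: executes without an error
                    cur.fetchall()
                executed += 1
            except psycopg2.Error:
                conn.rollback()                # clear the aborted transaction
    n = len(pairs)
    return executed / n, exact / n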
Results
Summary
The model achieves strong performance on PostGIS spatial SQL generation:
- Execution Accuracy (EX): 85-92% (first-shot)
- Exact Match (EM): 40-50%
- Spatial Functions: 88-93% accuracy (ST_Within, ST_Intersects, ST_Distance, etc.)
- Multi-Schema Queries: 85-92% accuracy (cim_vector, cim_census, cim_raster joins)
- Domain Terminology: 88-92% accuracy (SEZ2011, TABULA, E8-E16 census codes)
Technical Specifications
Model Architecture and Objective
- Base Architecture: Qwen 2.5 (Transformer Decoder)
- Parameters: 14 billion (base) + ~68M trainable (LoRA adapter, the 0.46% noted above)
- Context Window: 128K tokens
- Attention: Grouped-Query Attention (GQA) with enhanced KV cache
- Position Encoding: RoPE (Rotary Position Embeddings)
Compute Infrastructure
Hardware
- GPU: NVIDIA RTX 3090 (24 GB VRAM)
- Training Framework: HuggingFace Transformers + PEFT
- Quantization: BitsAndBytes 4-bit NF4
Software
- Framework Versions:
  - PEFT: 0.12.0
  - Transformers: 4.37 or later (minimum version supporting the Qwen2 architecture)
  - PyTorch: 2.x with CUDA 12.1
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA RTX 3090
- Hours used: ~30-40 hours
- Cloud Provider: Local GPU server (ipazia)
- Compute Region: Italy
- Carbon Emitted: Estimated ~1.5-2 kg CO2 (based on RTX 3090 power consumption)
Citation
BibTeX:
@misc{qwen25-14b-cim-q2sql,
  author       = {Taherdoust, Ali},
  title        = {Qwen 2.5 14B Fine-tuned for CIM Spatial SQL Generation},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/taherdoust/qwen25-14b-cim-q2sql}},
  note         = {Fine-tuned for PostGIS spatial SQL queries in City Information Modeling}
}
APA:
Taherdoust, A. (2025). Qwen 2.5 14B Fine-tuned for CIM Spatial SQL Generation. HuggingFace. https://huggingface.co/taherdoust/qwen25-14b-cim-q2sql
Model Card Authors
Ali Taherdoust (Politecnico di Torino)
Model Card Contact
For questions or issues, please open an issue on the GitHub repository or contact: [email protected]
Acknowledgments
- Base model: Alibaba Qwen 2.5
- Training dataset: Generated through three-stage pipeline (rule-based templates, CTGAN synthesis, GPT-4o-mini augmentation)
- Infrastructure: Politecnico di Torino GPU computing resources