---
license: apache-2.0
tags:
- qwen
- qwen2-vl
- vision-language-model
- object-detection
- nutrition-table-detection
- qlora
- sft
- openfoodfacts
datasets:
- openfoodfacts/nutrition-table-detection
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: image-text-to-text
---

# Model Card: Qwen2-VL-7B for Nutrition Table Detection

This model card describes a fine-tuned version of the Qwen/Qwen2-VL-7B-Instruct model, adapted specifically for detecting nutrition tables in product images.

## Model Details

* **Base Model:** [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
* **Model Type:** Vision Language Model (VLM)
* **Fine-tuning Task:** Object Detection (Nutrition Tables)
* **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation) with SFT (Supervised Fine-Tuning)
* **Language(s):** English (primarily for prompts and responses)
* **License:** Apache-2.0 (inherited from the base model)

## Colab Notebooks

The entire fine-tuning walkthrough can be accessed at: https://colab.research.google.com/drive/1EkF4arAYcxfi2fugO1B3bfohr9gZa8Ly?usp=sharing

For serving on vLLM/NVIDIA Triton, the code can be found at: https://colab.research.google.com/drive/1furnMbQmD7beK5Z35KnJb2lQCIzQ-dpB?usp=sharing

## Intended Use

This model is intended for identifying and localizing nutrition tables within images of food products. The primary output is the bounding box coordinates of the detected nutrition table.

**Primary Intended Uses:**

* Automated extraction of nutrition information from product packaging.
* Assisting in food logging and dietary tracking applications.
* Retail and e-commerce applications for product information management.

**Out-of-Scope Uses:**

* Detection of objects other than nutrition tables (unless further fine-tuned).
* Optical Character Recognition (OCR) of the text within the nutrition table (the model only provides bounding boxes).
* Making dietary recommendations or health assessments.

## Training Data

* **Dataset:** [openfoodfacts/nutrition-table-detection](https://huggingface.co/datasets/openfoodfacts/nutrition-table-detection)
* **Dataset Description:** This dataset contains product images along with corresponding bounding boxes for nutrition tables.
* **Preprocessing:**
  * The dataset was converted to the OpenAI ChatML format.
  * Each sample consists of:
    * A system message defining the VLM's role.
    * A user message containing the product image and the prompt: "Detect the bounding box of the nutrition table."
    * An assistant message containing the ground-truth bounding box formatted with Qwen2-VL-specific tokens (`<|object_ref_start|>nutrition table<|object_ref_end|><|box_start|>(x0,y0),(x1,y1)<|box_end|>`), where coordinates are scaled to the [0,1000) integer space (see the formatting sketch after this list).
  * The object name used was "nutrition table".
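To make the target format concrete, the following is a minimal sketch (not taken from the notebook) of how a ground-truth box could be converted into the assistant string described above. The helper names, the system prompt wording, and the assumption that boxes arrive as pixel coordinates alongside the image size are illustrative.

```python
def format_assistant_box(x0, y0, x1, y1, image_width, image_height,
                         object_name="nutrition table"):
    """Scale a pixel-space box into the [0,1000) integer space used by Qwen2-VL
    and wrap it in the grounding tokens expected in the assistant message."""
    sx0 = int(x0 / image_width * 1000)
    sy0 = int(y0 / image_height * 1000)
    sx1 = int(x1 / image_width * 1000)
    sy1 = int(y1 / image_height * 1000)
    return (
        f"<|object_ref_start|>{object_name}<|object_ref_end|>"
        f"<|box_start|>({sx0},{sy0}),({sx1},{sy1})<|box_end|>"
    )


def build_messages(image, box, image_width, image_height):
    """Assemble one ChatML training sample (system prompt wording is illustrative)."""
    return [
        {"role": "system",
         "content": "You are a vision-language assistant that detects objects in images."},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Detect the bounding box of the nutrition table."},
        ]},
        {"role": "assistant",
         "content": format_assistant_box(*box, image_width, image_height)},
    ]
```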
## Training Procedure

* **Fine-tuning Framework:** Hugging Face TRL (Transformer Reinforcement Learning) library, specifically using `SFTTrainer` (or a custom `QwenVLSFTTrainer` to handle model-specific inputs).
* **Quantization:** 4-bit NormalFloat (NF4) quantization using `bitsandbytes`.
  * `bnb_4bit_quant_type`: "nf4"
  * `bnb_4bit_compute_dtype`: `torch.bfloat16`
  * `bnb_4bit_use_double_quant`: True
* **LoRA Configuration (QLoRA):**
  * `r`: 64
  * `lora_alpha`: 16
  * `lora_dropout`: 0.05
  * `bias`: "none"
  * `task_type`: "CAUSAL_LM"
  * `target_modules`: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "qkv", "proj"]` (covering both the transformer decoder and the vision encoder ViT blocks)
* **SFT Configuration (`SFTConfig`):**
  * `dataset_text_field`: "messages"
  * `learning_rate`: 2e-4
  * `per_device_train_batch_size`: 4 (PD_BATCH)
  * `gradient_accumulation_steps`: 16 (GA_STEPS), for an effective batch size of 64
  * `num_train_epochs`: 3
  * `lr_scheduler_type`: "cosine"
  * `warmup_ratio`: 0.05
  * `bf16`: True
  * `tf32`: True
  * `gradient_checkpointing`: True
  * `optim`: "paged_adamw_32bit"
  * `max_grad_norm`: 1.0
  * `eval_strategy`: "steps"
  * `eval_steps`: 500
  * `save_strategy`: "steps"
  * `save_steps`: 500
  * `save_total_limit`: 2
  * `logging_steps`: 25
  * `report_to`: "wandb"
  * `load_best_model_at_end`: True
  * `metric_for_best_model`: "eval_loss"
  * `remove_unused_columns`: False
  * `packing`: False
  * `dataloader_pin_memory`: True
  * `output_dir`: "qwen2vl_qlora_sft"
  * `seed`: 42
* **Hardware:** Training was performed on NVIDIA A100 GPUs. Flash Attention 2 / SDPA was enabled for memory efficiency.
* **Software:**
  * `transformers`: 4.52.0.dev0 (or a similar dev version; 4.47.0.dev0 was installed initially)
  * `trl`: 0.12.0.dev0
  * `datasets`: 3.0.2
  * `bitsandbytes`: 0.44.1
  * `peft`: 0.13.2
  * `qwen-vl-utils`: 0.0.8
  * `accelerate`: 1.0.1
  * `torch`: 2.4.1+cu121 (an older version pinned due to compatibility issues with the latest PyTorch at the time the notebook was created)
  * `torchvision`: 0.19.1+cu121
  * `torchaudio`: 2.4.1+cu121
  * `wandb` for logging

## Evaluation

* **Metric:** Mean Intersection over Union (IoU) between predicted and ground-truth bounding boxes.
* **Fine-tuned Model Performance (on validation set):**
  * Mean IoU: 0.1111
* **Base Model Performance (Qwen/Qwen2-VL-7B-Instruct, on validation set, without fine-tuning):**
  * Mean IoU: 0.0632
* **Comparison:** Fine-tuning improves mean IoU from 0.0632 to 0.1111, indicating better localization of nutrition tables than the base model.

## Hardware and Software Requirements

* **GPU:** 1x A100 or 2x A6000 GPUs are recommended for fine-tuning due to the model size. Flash Attention support (NVIDIA Ampere series or newer) is beneficial for memory efficiency.
* **Software:** See "Training Procedure" for key library versions.

## Limitations and Bias

* **Computational Resources:** Fine-tuning Qwen2-VL-7B is computationally intensive.
* **Flash Attention:** Optimal memory efficiency with Flash Attention is limited to NVIDIA Ampere GPUs or newer. Disabling it on older GPUs may require more resources.
* **Dataset Specificity:** The model is fine-tuned specifically for nutrition table detection on the `openfoodfacts/nutrition-table-detection` dataset. Performance on other types of objects or significantly different image styles may vary.
* **Bounding Box Format:** The model outputs bounding boxes in a specific format (`<|object_ref_start|>object_name<|object_ref_end|><|box_start|>(x0,y0),(x1,y1)<|box_end|>`) with coordinates scaled to [0,1000). Parsing this output is necessary for downstream tasks (a parsing sketch follows this list).
* **IoU Score:** While improved over the base model, a mean IoU of 0.1111 still leaves substantial room for improvement in localization accuracy. Further fine-tuning, data augmentation, or architectural adjustments may be needed for higher precision.
* **Reproducibility:** The notebook pins a specific PyTorch version (2.4.1+cu121) due to an issue with the latest release at the time, which may be a consideration when reproducing the results.
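To illustrate the parsing step mentioned above, here is a minimal sketch (not from the notebook) that extracts the `(x0,y0),(x1,y1)` pair from a generated response and rescales it from the [0,1000) space to pixel coordinates. The regex and helper name are assumptions; it matches the coordinate pattern itself, so it works whether or not the special box tokens were stripped during decoding.

```python
import re


def parse_qwen_box(response: str, image_width: int, image_height: int):
    """Extract the first (x0,y0),(x1,y1) pair from a Qwen2-VL response and
    rescale it from the [0,1000) integer space to pixel coordinates."""
    match = re.search(r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)", response)
    if match is None:
        return None
    x0, y0, x1, y1 = map(int, match.groups())
    return (
        x0 / 1000 * image_width,
        y0 / 1000 * image_height,
        x1 / 1000 * image_width,
        y1 / 1000 * image_height,
    )


# Example with a hypothetical response string and an 800x1067 image:
# parse_qwen_box("<|object_ref_start|>nutrition table<|object_ref_end|>"
#                "<|box_start|>(104,216),(872,903)<|box_end|>", 800, 1067)
```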
## How to Use

The fine-tuned LoRA adapters are available in the `qwen2vl_qlora_sft/checkpoint-51` directory (or your specified output directory).

The notebook demonstrates merging these adapters with the base Qwen/Qwen2-VL-7B-Instruct model and saving the merged model to `/content/qwen2vl_merged-bf16`. This merged model can then be pushed to a Hugging Face Hub repository (e.g., `lordChipotle/nutrition-label-detector`). A minimal sketch of this merging step follows the inference example below.

**Inference with the merged model:**

Use `AutoModelForVision2Seq` and `AutoProcessor` from the Hugging Face `transformers` library, loading the model from your Hub repository or the saved `OUTPUT_DIR`.

```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Example: load from the local merged directory
MODEL_PATH = "/content/qwen2vl_merged-bf16"  # or your Hugging Face Hub repo name
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16

model = AutoModelForVision2Seq.from_pretrained(
    MODEL_PATH,
    torch_dtype=DTYPE,
    device_map="auto",
    trust_remote_code=True,  # Qwen models may require this
)
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Prepare image and prompt
img = Image.open("path/to/your/image.jpg").convert("RGB")  # replace with your image path
prompt_text = "Detect the bounding box of the nutrition table."

vision_inputs = [{"type": "image", "image": img}]
messages = [{"role": "user", "content": vision_inputs + [{"type": "text", "text": prompt_text}]}]
prompt_chatml = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=prompt_chatml,
    images=[img],  # assuming a single image for simplicity
    return_tensors="pt",
).to(DEVICE)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens (drop the prompt portion)
generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
# Parse the bounding box from the response (see the parsing sketch in "Limitations and Bias")
```
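For completeness, here is a minimal sketch of the adapter-merging step described above, using `peft`. The paths and repository name mirror those mentioned in this card; the exact code in the notebook may differ.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel

BASE_MODEL = "Qwen/Qwen2-VL-7B-Instruct"
ADAPTER_DIR = "qwen2vl_qlora_sft/checkpoint-51"  # fine-tuned LoRA adapters
OUTPUT_DIR = "/content/qwen2vl_merged-bf16"      # merged bf16 model

# Load the base model in bf16 (not 4-bit) so the merged weights are full precision
base = AutoModelForVision2Seq.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the LoRA adapters and fold them into the base weights
merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()

# Save the merged model and processor; optionally push to the Hub
merged.save_pretrained(OUTPUT_DIR, safe_serialization=True)
AutoProcessor.from_pretrained(BASE_MODEL).save_pretrained(OUTPUT_DIR)
# merged.push_to_hub("lordChipotle/nutrition-label-detector")
```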