---
license: apache-2.0
tags:
- qwen
- qwen2-vl
- vision-language-model
- object-detection
- nutrition-table-detection
- qlora
- sft
- openfoodfacts
datasets:
- openfoodfacts/nutrition-table-detection
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: image-text-to-text
---

# Model Card: Qwen2-VL-7B for Nutrition Table Detection

This model card describes a fine-tuned version of the Qwen/Qwen2-VL-7B-Instruct model, adapted specifically for detecting nutrition tables in product images.

## Model Details

* **Base Model:** [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
* **Model Type:** Vision Language Model (VLM)
* **Fine-tuning Task:** Object Detection (Nutrition Tables)
* **Fine-tuning Method:** QLoRA (Quantized Low-Rank Adaptation) with SFT (Supervised Fine-Tuning)
* **Language(s):** English (primarily for prompts and responses)
* **License:** Apache-2.0 (inherited from the base model)

## Colab Notebooks

The entire fine-tuning walkthrough can be accessed at: https://colab.research.google.com/drive/1EkF4arAYcxfi2fugO1B3bfohr9gZa8Ly?usp=sharing

For serving on vLLM/NVIDIA Triton, the code can be found at: https://colab.research.google.com/drive/1furnMbQmD7beK5Z35KnJb2lQCIzQ-dpB?usp=sharing

## Intended Use

This model is intended for identifying and localizing nutrition tables within images of food products. The primary output is the bounding box coordinates of the detected nutrition table.

**Primary Intended Uses:**

* Automated extraction of nutrition information from product packaging.
* Assisting in food logging and dietary tracking applications.
* Retail and e-commerce applications for product information management.

**Out-of-Scope Uses:**

* Detection of objects other than nutrition tables (unless further fine-tuned).
* Optical Character Recognition (OCR) of the text within the nutrition table (the model only provides bounding boxes).
* Making dietary recommendations or health assessments.

## Training Data

* **Dataset:** [openfoodfacts/nutrition-table-detection](https://huggingface.co/datasets/openfoodfacts/nutrition-table-detection)
* **Dataset Description:** This dataset contains product images along with corresponding bounding boxes for nutrition tables.
* **Preprocessing:**
  * The dataset was converted to the OpenAI ChatML format.
  * Each sample consists of:
    * A system message defining the VLM's role.
    * A user message containing the product image and the prompt: "Detect the bounding box of the nutrition table."
    * An assistant message containing the ground-truth bounding box formatted with Qwen2-VL-specific tokens (`<|object_ref_start|>nutrition table<|object_ref_end|><|box_start|>(x0,y0),(x1,y1)<|box_end|>`), where coordinates are scaled to the [0,1000) integer space (see the formatting sketch after this list).
  * The object name used was "nutrition table".
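To make the target format concrete, the following is a minimal sketch (not taken from the notebook) of how a ground-truth box could be converted into the assistant string described above. The helper names, the system prompt wording, and the assumption that boxes arrive as pixel coordinates alongside the image size are illustrative.

```python
def format_assistant_box(x0, y0, x1, y1, image_width, image_height,
                         object_name="nutrition table"):
    """Scale a pixel-space box into the [0,1000) integer space used by Qwen2-VL
    and wrap it in the grounding tokens expected in the assistant message."""
    sx0 = int(x0 / image_width * 1000)
    sy0 = int(y0 / image_height * 1000)
    sx1 = int(x1 / image_width * 1000)
    sy1 = int(y1 / image_height * 1000)
    return (
        f"<|object_ref_start|>{object_name}<|object_ref_end|>"
        f"<|box_start|>({sx0},{sy0}),({sx1},{sy1})<|box_end|>"
    )


def build_messages(image, box, image_width, image_height):
    """Assemble one ChatML training sample (system prompt wording is illustrative)."""
    return [
        {"role": "system",
         "content": "You are a vision-language assistant that detects objects in images."},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Detect the bounding box of the nutrition table."},
        ]},
        {"role": "assistant",
         "content": format_assistant_box(*box, image_width, image_height)},
    ]
```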
## Training Procedure

* **Fine-tuning Framework:** Hugging Face TRL (Transformer Reinforcement Learning) library, specifically using `SFTTrainer` (or a custom `QwenVLSFTTrainer` to handle model-specific inputs).
* **Quantization:** 4-bit NormalFloat (NF4) quantization using `bitsandbytes`.
  * `bnb_4bit_quant_type`: "nf4"
  * `bnb_4bit_compute_dtype`: `torch.bfloat16`
  * `bnb_4bit_use_double_quant`: True
* **LoRA Configuration (QLoRA):**
  * `r`: 64
  * `lora_alpha`: 16
  * `lora_dropout`: 0.05
  * `bias`: "none"
  * `task_type`: "CAUSAL_LM"
  * `target_modules`: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "qkv", "proj"]` (covering both the transformer decoder and the vision encoder ViT blocks)
* **SFT Configuration (`SFTConfig`):**
  * `dataset_text_field`: "messages"
  * `learning_rate`: 2e-4
  * `per_device_train_batch_size`: 4 (PD_BATCH)
  * `gradient_accumulation_steps`: 16 (GA_STEPS), for an effective batch size of 64
  * `num_train_epochs`: 3
  * `lr_scheduler_type`: "cosine"
  * `warmup_ratio`: 0.05
  * `bf16`: True
  * `tf32`: True
  * `gradient_checkpointing`: True
  * `optim`: "paged_adamw_32bit"
  * `max_grad_norm`: 1.0
  * `eval_strategy`: "steps"
  * `eval_steps`: 500
  * `save_strategy`: "steps"
  * `save_steps`: 500
  * `save_total_limit`: 2
  * `logging_steps`: 25
  * `report_to`: "wandb"
  * `load_best_model_at_end`: True
  * `metric_for_best_model`: "eval_loss"
  * `remove_unused_columns`: False
  * `packing`: False
  * `dataloader_pin_memory`: True
  * `output_dir`: "qwen2vl_qlora_sft"
  * `seed`: 42
* **Hardware:** Training was performed on NVIDIA A100 GPUs. Flash Attention 2 / SDPA was enabled for memory efficiency.
* **Software:**
  * `transformers`: 4.52.0.dev0 (or a similar dev version; 4.47.0.dev0 was installed initially)
  * `trl`: 0.12.0.dev0
  * `datasets`: 3.0.2
  * `bitsandbytes`: 0.44.1
  * `peft`: 0.13.2
  * `qwen-vl-utils`: 0.0.8
  * `accelerate`: 1.0.1
  * `torch`: 2.4.1+cu121 (an older version pinned due to compatibility issues with the latest PyTorch at the time the notebook was created)
  * `torchvision`: 0.19.1+cu121
  * `torchaudio`: 2.4.1+cu121
  * `wandb` for logging

## Evaluation

* **Metric:** Mean Intersection over Union (IoU) between predicted and ground-truth bounding boxes.
* **Fine-tuned Model Performance (on validation set):**
  * Mean IoU: 0.1111
* **Base Model Performance (Qwen/Qwen2-VL-7B-Instruct, on validation set, without fine-tuning):**
  * Mean IoU: 0.0632
* **Comparison:** Fine-tuning improves mean IoU from 0.0632 to 0.1111, indicating better localization of nutrition tables than the base model.

## Hardware and Software Requirements

* **GPU:** 1x A100 or 2x A6000 GPUs are recommended for fine-tuning due to the model size. Flash Attention support (NVIDIA Ampere series or newer) is beneficial for memory efficiency.
* **Software:** See "Training Procedure" for key library versions.

## Limitations and Bias

* **Computational Resources:** Fine-tuning Qwen2-VL-7B is computationally intensive.
* **Flash Attention:** Optimal memory efficiency with Flash Attention is limited to NVIDIA Ampere GPUs or newer. Disabling it on older GPUs may require more resources.
* **Dataset Specificity:** The model is fine-tuned specifically for nutrition table detection on the `openfoodfacts/nutrition-table-detection` dataset. Performance on other types of objects or significantly different image styles may vary.
* **Bounding Box Format:** The model outputs bounding boxes in a specific format (`<|object_ref_start|>object_name<|object_ref_end|><|box_start|>(x0,y0),(x1,y1)<|box_end|>`) with coordinates scaled to [0,1000). Parsing this output is necessary for downstream tasks (a parsing sketch follows this list).
* **IoU Score:** While improved over the base model, a mean IoU of 0.1111 still leaves substantial room for improvement in localization accuracy. Further fine-tuning, data augmentation, or architectural adjustments may be needed for higher precision.
* **Reproducibility:** The notebook pins a specific PyTorch version (2.4.1+cu121) due to an issue with the latest release at the time, which may be a consideration when reproducing the results.
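To illustrate the parsing step mentioned above, here is a minimal sketch (not from the notebook) that extracts the `(x0,y0),(x1,y1)` pair from a generated response and rescales it from the [0,1000) space to pixel coordinates. The regex and helper name are assumptions; it matches the coordinate pattern itself, so it works whether or not the special box tokens were stripped during decoding.

```python
import re


def parse_qwen_box(response: str, image_width: int, image_height: int):
    """Extract the first (x0,y0),(x1,y1) pair from a Qwen2-VL response and
    rescale it from the [0,1000) integer space to pixel coordinates."""
    match = re.search(r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)", response)
    if match is None:
        return None
    x0, y0, x1, y1 = map(int, match.groups())
    return (
        x0 / 1000 * image_width,
        y0 / 1000 * image_height,
        x1 / 1000 * image_width,
        y1 / 1000 * image_height,
    )


# Example with a hypothetical response string and an 800x1067 image:
# parse_qwen_box("<|object_ref_start|>nutrition table<|object_ref_end|>"
#                "<|box_start|>(104,216),(872,903)<|box_end|>", 800, 1067)
```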
## How to Use

The fine-tuned LoRA adapters are available in the `qwen2vl_qlora_sft/checkpoint-51` directory (or your specified output directory).

The notebook demonstrates merging these adapters with the base Qwen/Qwen2-VL-7B-Instruct model and saving the merged model to `/content/qwen2vl_merged-bf16`. This merged model can then be pushed to a Hugging Face Hub repository (e.g., `lordChipotle/nutrition-label-detector`). A minimal sketch of this merging step follows the inference example below.

**Inference with the merged model:**

Use `AutoModelForVision2Seq` and `AutoProcessor` from the Hugging Face `transformers` library, loading the model from your Hub repository or the saved `OUTPUT_DIR`.

```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Example: load from the local merged directory
MODEL_PATH = "/content/qwen2vl_merged-bf16"  # or your Hugging Face Hub repo name
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16

model = AutoModelForVision2Seq.from_pretrained(
    MODEL_PATH,
    torch_dtype=DTYPE,
    device_map="auto",
    trust_remote_code=True,  # Qwen models may require this
)
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Prepare image and prompt
img = Image.open("path/to/your/image.jpg").convert("RGB")  # replace with your image path
prompt_text = "Detect the bounding box of the nutrition table."

vision_inputs = [{"type": "image", "image": img}]
messages = [{"role": "user", "content": vision_inputs + [{"type": "text", "text": prompt_text}]}]
prompt_chatml = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=prompt_chatml,
    images=[img],  # assuming a single image for simplicity
    return_tensors="pt",
).to(DEVICE)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens (drop the prompt portion)
generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
# Parse the bounding box from the response (see the parsing sketch in "Limitations and Bias")
```
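For completeness, here is a minimal sketch of the adapter-merging step described above, using `peft`. The paths and repository name mirror those mentioned in this card; the exact code in the notebook may differ.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel

BASE_MODEL = "Qwen/Qwen2-VL-7B-Instruct"
ADAPTER_DIR = "qwen2vl_qlora_sft/checkpoint-51"  # fine-tuned LoRA adapters
OUTPUT_DIR = "/content/qwen2vl_merged-bf16"      # merged bf16 model

# Load the base model in bf16 (not 4-bit) so the merged weights are full precision
base = AutoModelForVision2Seq.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the LoRA adapters and fold them into the base weights
merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()

# Save the merged model and processor; optionally push to the Hub
merged.save_pretrained(OUTPUT_DIR, safe_serialization=True)
AutoProcessor.from_pretrained(BASE_MODEL).save_pretrained(OUTPUT_DIR)
# merged.push_to_hub("lordChipotle/nutrition-label-detector")
```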