QwenStoryteller2
QwenStoryteller2 is an improved version of QwenStoryteller, fine-tuned using contrastive reinforcement learning with Direct Preference Optimization (DPO) to achieve superior entity re-identification and visual grounding in cross-frame storytelling scenarios.
Model Description
Base Model: QwenStoryteller (Qwen2.5-VL 7B)
Training Method: Contrastive Reinforcement Learning with Direct Preference Optimization (LoRA rank 2048, alpha 4096)
Training Dataset: StoryReasoningAdversarialDPO
QwenStoryteller2 builds upon the original QwenStoryteller by addressing critical limitations in cross-frame entity consistency through:
- Contrastive Learning: Training on both real and synthetic negative story examples
 - Enhanced Entity Re-identification: Improved tracking of characters and objects across frames
 - Better Grounding: Superior alignment between narrative elements and visual entities
 - Reduced Hallucinations: More reliable entity connections and fewer spurious references
 
The model employs a dual-component reward function that promotes appropriate entity connections in coherent sequences while discouraging incorrect connections in synthetic arrangements.
Key Improvements Over QwenStoryteller
- Grounding Performance: mAP improved from 0.27 to 0.31 (+14.8%), F1 score from 0.35 to 0.41 (+17.1%)
 - Cross-frame Consistency: Character persistence on ≥5 frames increased from 37.7% to 49.3% (+30.8%)
 - Pronoun Grounding: Significant improvements across all pronoun types (he: 90.1%→99.1%, she: 91.1%→98.6%, they: 47.6%→68.8%)
 - Structural Quality: Well-structured stories increased from 79.1% to 97.5% (+23.3%)
 - Entity Tracking: Object persistence on ≥5 frames improved marginally from 20.9% to 21.3% (+1.9%)
 
System Prompt
The model was trained with the following system prompt, and we recommend using it for optimal performance:
You are an AI storyteller that can analyze sequences of images and create creative narratives. 
First think step-by-step to analyze characters, objects, settings, and narrative structure. 
Then create a grounded story that maintains consistent character identity and object references across frames. 
Use <think></think> tags to show your reasoning process before writing the final story.
Usage
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image
# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller2", torch_dtype="auto", device_map="auto"
)
# Load processor
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller2")
# Load images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
    Image.open("image4.jpg"),
    Image.open("image5.jpg")
]
# Create image content list
image_content = []
for img in images:
    image_content.append({
        "type": "image",
        "image": img,
    })
# Add text prompt at the end
image_content.append({"type": "text", "text": "Generate a story based on these images."})
# Create messages with system prompt
messages = [
    {
        "role": "system", 
        "content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story."
    },
    {
        "role": "user",
        "content": image_content,
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(
    **inputs, 
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
story = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(story)
Using vLLM for faster inference
For significantly faster inference, you can use vLLM to serve the model:
# Install vLLM
pip install vllm
# Serve the model with vLLM
vllm serve daniel3303/QwenStoryteller2
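Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below is a minimal, standard-library-only client. It assumes the server listens on `http://localhost:8000/v1` and that frames are sent as base64 data URLs. The helper names `build_story_request` and `request_story` are hypothetical and not part of this repository. Depending on the vLLM version, multi-image prompts may require raising the server's per-prompt image limit; check the vLLM documentation for the exact option.

```python
import base64
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references across frames. "
    "Use <think></think> tags to show your reasoning process before writing the final story."
)


def build_story_request(image_bytes_list, model="daniel3303/QwenStoryteller2"):
    """Build an OpenAI-style chat payload with one image_url entry per frame."""
    content = [
        {
            "type": "image_url",
            "image_url": {
                # Assumption: frames are JPEG; adjust the MIME type otherwise.
                "url": "data:image/jpeg;base64,"
                + base64.b64encode(data).decode("ascii")
            },
        }
        for data in image_bytes_list
    ]
    content.append({"type": "text", "text": "Generate a story based on these images."})
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": content},
        ],
        # Same sampling settings as the transformers example.
        "max_tokens": 4096,
        "temperature": 0.7,
        "top_p": 0.9,
    }


def request_story(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to the (assumed) local vLLM server and return the story text."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

A typical call reads each frame as bytes, for example `open("image1.jpg", "rb").read()`, passes the list to `build_story_request`, and hands the result to `request_story`.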
Training Methodology
Contrastive Learning Framework
QwenStoryteller2 was trained using a novel contrastive reinforcement learning approach:
- Synthetic Story Generation: Extended the StoryReasoning dataset with 4,178 synthetic stories created by sampling images from different movies to create incoherent sequences
 - Dual-Component Reward Function: Combined entity re-identification (R_reid) and grounding (R_ground) rewards with structural validation
 - Direct Preference Optimization: Used offline preference pairs generated from the reward function to train the model
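The two data-side steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the data layout (a dict from movie id to frame ids), the choice of drawing every frame from a different movie, the five-frame default, and the function names are all hypothetical.

```python
import random


def sample_synthetic_story(frames_by_movie, num_frames=5, rng=None):
    """Pick `num_frames` frames, each from a different movie, to form an incoherent sequence.

    Assumption: `frames_by_movie` maps a movie id to a non-empty list of frame ids.
    """
    rng = rng or random.Random()
    if len(frames_by_movie) < num_frames:
        raise ValueError("need at least as many movies as frames")
    # Sort the keys first so sampling is reproducible for a seeded rng.
    movies = rng.sample(sorted(frames_by_movie), num_frames)
    return [(movie, rng.choice(frames_by_movie[movie])) for movie in movies]


def build_preference_pair(prompt, candidates, reward_fn):
    """Turn scored candidate outputs into one offline preference pair.

    The highest-reward candidate becomes "chosen" and the lowest "rejected".
    Returns None when the rewards cannot separate the candidates.
    """
    scored = sorted(candidates, key=reward_fn)
    if len(scored) < 2 or reward_fn(scored[0]) == reward_fn(scored[-1]):
        return None
    return {"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]}
```

The `prompt`/`chosen`/`rejected` keys follow the layout commonly used for preference datasets; the actual schema of StoryReasoningAdversarialDPO may differ.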
 
Reward Function Components
- Entity Re-identification Reward: Tracks character and object persistence across frames, promoting connections in real stories while penalizing them in synthetic ones
 - Grounding Reward: Evaluates pronoun and proper noun grounding to visual entities
 - Structure Validation: Ensures generated outputs maintain required format and consistency
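To make the interaction of the three components concrete, here is a toy sketch. The card does not give the exact formulas, so everything below is an assumption: the fraction-based scores, the sign flip for synthetic stories, the equal weights, and the fixed penalty for malformed output are placeholders chosen only to show the shape of such a reward.

```python
def reid_reward(entity_frames, is_real_story):
    """Toy re-identification score.

    entity_frames: dict mapping an entity id to the set of frame indices it appears in.
    Returns the fraction of entities linked across more than one frame,
    positive for real stories and negative for synthetic ones (assumed form).
    """
    if not entity_frames:
        return 0.0
    linked = sum(1 for frames in entity_frames.values() if len(frames) > 1)
    fraction = linked / len(entity_frames)
    return fraction if is_real_story else -fraction


def grounding_reward(grounded_mentions, total_mentions):
    """Toy grounding score: share of pronoun/proper-noun mentions tied to a visual entity."""
    return grounded_mentions / total_mentions if total_mentions else 0.0


def total_reward(entity_frames, is_real_story, grounded_mentions, total_mentions,
                 is_well_formed, w_reid=0.5, w_ground=0.5):
    """Combine both components; malformed outputs get a fixed penalty (assumed -1.0)."""
    if not is_well_formed:
        return -1.0
    return (w_reid * reid_reward(entity_frames, is_real_story)
            + w_ground * grounding_reward(grounded_mentions, total_mentions))
```

With this shape, an output that links many entities across frames scores well on a real story and poorly on a synthetic one, which is the discrimination behaviour the reward is described as encouraging.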
 
Training Configuration
- Method: Direct Preference Optimization (DPO) with LoRA fine-tuning
 - LoRA Parameters: Rank 2048, alpha 4096
 - Optimizer: AdamW with learning rate 5×10⁻⁶
 - Batch Size: 8
 - Epochs: 3
 - Temperature Parameter (β): 0.1
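For reference, the standard DPO objective for a single preference pair can be written in a few lines. The sketch below uses plain Python on scalar sequence log-probabilities rather than the training code itself; the function and argument names are illustrative, and only β = 0.1 is taken from the configuration above.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model.
    """
    # How much more the policy prefers "chosen" over "rejected" than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) computed in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls below that value as the policy shifts probability toward the chosen response relative to the reference.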
 
Performance Metrics
Metrics are self-reported on the StoryReasoningAdversarialDPO test set.
| Metric | QwenStoryteller | QwenStoryteller2 | Improvement | 
|---|---|---|---|
| Character Precision | 0.83 | 0.78 | -6.0% | 
| Object Precision | 0.46 | 0.29 | -37.0% | 
| Total Precision | 0.57 | 0.45 | -21.1% | 
| mAP | 0.27 | 0.31 | +14.8% | 
| Character Recall | 0.62 | 0.77 | +24.2% | 
| Object Recall | 0.25 | 0.28 | +12.0% | 
| Total Recall | 0.40 | 0.48 | +20.0% | 
| F1 Score | 0.35 | 0.41 | +17.1% | 
| METEOR | 0.14 | 0.17 | +21.4% | 
| ROUGE-L | 0.16 | 0.18 | +12.5% | 
| BLEU-4 | 0.054 | 0.057 | +5.6% | 
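The Improvement column is the relative change with respect to the original model. A small helper (the name `relative_change_pct` is illustrative) reproduces the figures from the rounded scores in the table:

```python
def relative_change_pct(old, new):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


# Values taken from the table above.
print(round(relative_change_pct(0.27, 0.31), 1))  # mAP: 14.8
print(round(relative_change_pct(0.83, 0.78), 1))  # Character Precision: -6.0
```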
Output Format
QwenStoryteller2 produces enhanced outputs with improved consistency:
Chain-of-Thought Analysis (<think></think>): More accurate structured analysis with:
- Improved character tables with consistent identity references
- Better object tracking with enhanced spatial coordination
- More reliable setting categorization
- Stronger narrative structure modeling

Grounded Story: Enhanced narrative with specialized XML tags:
- <gdi>: Image tags for specific frames
- <gdo>: Entity reference tags with improved accuracy
- <gda>: Action tags with better character-action alignment
- <gdl>: Location/landmark tags with enhanced spatial grounding
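A generated output can be post-processed with a few regular expressions. The sketch below separates the reasoning block from the story and lists or strips the grounding tags. The card does not specify the attribute syntax, so anything after the tag name is treated as an opaque string; the sample text in the usage note is hypothetical and only illustrates the calls.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
# Opening grounding tags; whatever follows the tag name is kept as a raw string.
OPEN_TAG_RE = re.compile(r"<(gdi|gdo|gda|gdl)\b([^>]*)>", re.IGNORECASE)
ANY_TAG_RE = re.compile(r"</?(?:gdi|gdo|gda|gdl)\b[^>]*>", re.IGNORECASE)


def split_output(text):
    """Return (reasoning, story); reasoning is empty if no <think> block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def list_tags(story):
    """Return (tag_name, raw_attributes) for every opening grounding tag, in order."""
    return [(m.group(1).lower(), m.group(2).strip())
            for m in OPEN_TAG_RE.finditer(story)]


def strip_tags(story):
    """Remove all grounding tags, leaving the plain narrative text."""
    return ANY_TAG_RE.sub("", story)
```

For example, given a hypothetical output such as `<think>...</think><gdi image1><gdo char1>Anna</gdo> waves.</gdi>`, `split_output` returns the reasoning and the tagged story, `list_tags` reports the `gdi` and `gdo` tags with their raw attributes, and `strip_tags` yields `Anna waves.`.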
Key Features
- Enhanced Cross-Frame Consistency: Superior character and object identity maintenance through contrastive learning
 - Improved Pronoun Grounding: Better alignment of pronouns with visual entities (up to 99.1% for "he", 98.6% for "she")
 - Reduced Hallucinations: Fewer incorrect entity connections and spurious references
 - Robust Entity Discrimination: Learned ability to distinguish when cross-frame connections are appropriate
 - Better Structural Quality: Near-perfect adherence to expected output format (97.5%)
 
Limitations
- Precision is lower than in the original model (total precision 0.57 → 0.45, object precision 0.46 → 0.29), a trade-off against the higher recall
 - Training data derived from movies may introduce cinematic biases
 - Entity re-identification still relies primarily on visual similarity within bounding boxes
 - Performance validated only on 7B parameter scale
 - Optimal real-to-synthetic story ratio (2:1) may not generalize to all scenarios
 
Citation
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
      title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation}, 
      author={Daniel A. P. Oliveira and David Martins de Matos},
      year={2025},
      eprint={2505.10292},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.10292}
}
Contact
For questions or feedback regarding this model, please contact:
- Daniel A. P. Oliveira ([email protected])
 