Moondream 3 Preview HF

Moondream 3 Preview HF is a reimplementation of the Moondream 3 (Preview) model using standard Hugging Face Transformers architecture conventions.

Overview

  • Multimodal vision-language model with a mixture-of-experts (MoE) text backbone
  • Architecture and weights correspond to Moondream 3 (Preview) (approximately 9B parameters, 2B active)
  • Implemented using standard Transformers components

The purpose of this repository is to make Moondream 3 interoperable with the Hugging Face ecosystem so it can be used directly with the Transformers API, including generate(), Trainer, and PEFT integrations.

Example usage

Example of running multimodal inference with the moondream3-preview-hf implementation:

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

DEVICE = "cuda:0"

# trust_remote_code is required because the model and processor implementations
# live in the repository itself.
model = AutoModelForCausalLM.from_pretrained("NyxKrage/moondream3-preview-hf", dtype="bfloat16", device_map=DEVICE, trust_remote_code=True)
processor = AutoProcessor.from_pretrained("NyxKrage/moondream3-preview-hf", use_fast=False, trust_remote_code=True)

image1 = Image.open("image1.jpg")
image2 = Image.open("image2.jpg")

# An empty prompt falls back to the default caption:normal mode; one prompt per image.
text = [processor.apply_chat_template("", tokenize=False)] * 2
inputs = processor(text=text, images=[
    image1,
    image2,
])
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

model.eval()

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        use_cache=True,
    )
    # Strip the prompt tokens so only the newly generated tokens are decoded.
    outputs = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs["input_ids"], outputs)
    ]
    for output in outputs:
        print(processor.decode(output))

The chat_template uses Hugging Face’s Jinja format and accepts either a single string or a sequence of messages (user [, assistant]).
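For reference, here is a minimal sketch of both input forms. The dictionary message schema ({"role": ..., "content": ...}) is an assumption based on the usual Hugging Face chat-template convention; check the template shipped in the repository if it differs.

# Single string, as used in the example above:
prompt = processor.apply_chat_template("query: What is happening in this image?", tokenize=False)

# Sequence of messages (a user turn, optionally followed by an assistant turn).
# The {"role": ..., "content": ...} schema is an assumption; adjust to the repo's template.
messages = [
    {"role": "user", "content": "query: What is happening in this image?"},
]
prompt = processor.apply_chat_template(messages, tokenize=False)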

Prompting modes

The chat template supports multiple task types via text prefixes:

Mode      Template prefix                 Example input
Query     query:                          query: What is happening in this image?
Reason    reason:                         reason: What is happening in this image?
Caption   caption: [short/normal/long]    caption: long
Detect    detect:                         detect: dog
Point     point:                          point: red car

If no prefix is provided, the default mode is caption: normal. The reason: mode is the same as query:, except that the model reasons through the problem before giving its final answer.
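A small sketch of how the prefixes slot into the prompt text; the prefixes are taken from the table above, while the exact prompt wording is up to you.

# Each mode is selected by its prefix in the prompt text.
caption_prompt = processor.apply_chat_template("caption: long", tokenize=False)
detect_prompt  = processor.apply_chat_template("detect: dog", tokenize=False)
point_prompt   = processor.apply_chat_template("point: red car", tokenize=False)

# No prefix falls back to the default caption: normal mode.
default_prompt = processor.apply_chat_template("", tokenize=False)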

Output Format

For query: and caption: prompts, the model behaves like a standard Hugging Face causal language model and returns token IDs to be decoded normally.

For detect: prompts, the model does not return text. Instead, it produces a floating-point tensor of shape [batch_size, max_detections, 4]. Each non-zero row represents a bounding box in normalized coordinates [x_min, y_min, x_max, y_max]; all-zero rows indicate padding.

For point: prompts, the model likewise returns structured coordinates rather than text. The output is a floating-point tensor of shape [batch_size, max_points, 2], where each non-zero row is a point [x, y] in normalized image coordinates. Zero rows again represent padding.
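A minimal sketch of post-processing these outputs, assuming the structured tensor described above has already been obtained from the model; the helper name below is illustrative and not part of the repository's API.

import torch

def boxes_to_pixels(boxes: torch.Tensor, width: int, height: int):
    """Convert [batch, max_detections, 4] normalized boxes to per-image pixel boxes."""
    results = []
    for per_image in boxes:
        keep = per_image.abs().sum(dim=-1) > 0            # drop all-zero padding rows
        scale = torch.tensor([width, height, width, height], dtype=per_image.dtype)
        results.append(per_image[keep] * scale)           # [n_detections, 4] in pixels
    return results

# point: outputs of shape [batch, max_points, 2] can be scaled the same way
# with torch.tensor([width, height]).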

Training

The model can be trained using trl and supports peft and bitsandbytes out of the box.
The repository also includes an implementation that replaces the MoE layers with a grouped_gemm implementation adapted from github:woct0rdho/transformers-qwen3-moe-fused; it can be used by importing Moondream3ForConditionalGeneration from modeling_moondream3_fusedmoe.py instead.
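A rough training sketch under stated assumptions: the LoRA target_modules names below are illustrative guesses, not the actual module names in this model, and should be adjusted after inspecting the model; loading otherwise follows the inference example above.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "NyxKrage/moondream3-preview-hf",
    dtype="bfloat16",
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: check the real module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# To use the fused-MoE variant instead, import Moondream3ForConditionalGeneration
# from modeling_moondream3_fusedmoe.py (e.g. from a local clone of the repository)
# and call its from_pretrained in place of AutoModelForCausalLM.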

Limitations

All images within a batch must yield the same number of vision crops. To ensure this, it is recommended to resize every image to the same resolution before preprocessing; otherwise, differences in aspect ratio or size can cause mismatched crop counts and prevent batching.

Only a single prompting mode is supported per batch: every prompt must use the same mode prefix (such as query:, caption:, point:, or detect:), and mixing modes within a single batch is not allowed.

Because both the visual crops and text inputs must align across examples, batches must be homogeneous in both image structure and task type. Inputs that differ in crop count or prompt mode should be run separately or grouped into compatible batches.
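A minimal sketch of preparing a compatible batch, assuming the processor was loaded as in the example above; the 512x512 target resolution is an arbitrary illustration, not a requirement of the model.

from PIL import Image

target_size = (512, 512)  # arbitrary; any resolution works as long as it is shared by the whole batch
images = [
    Image.open(path).convert("RGB").resize(target_size)
    for path in ["image1.jpg", "image2.jpg"]
]

# Every prompt in the batch uses the same mode prefix (here: query).
prompts = [
    processor.apply_chat_template("query: What is happening in this image?", tokenize=False)
    for _ in images
]
inputs = processor(text=prompts, images=images)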

License

The model weights remain under the Business Source License 1.1 with an Additional Use Grant (No Third-Party Service), identical to the original Moondream 3 Preview license.

All new implementation code in this repository is released under the Apache 2.0 License.

Credits

  • Original model and research: M87 Labs / Moondream AI
  • Hugging Face–compatible reimplementation: NyxKrage
  • Based on the public Moondream 3 Preview release