Nanonets-OCR2-3B is based on Qwen2.5-VL-3B, which usually performs better when the quantization approach keeps the vision encoder in full precision. My use case involves dense table extraction; for now, I'll hypothesize that the Nanonets finetuning takes "pressure" off the language model component, so keeping the vision encoder in higher precision retains more understanding of vision tokens, which represent information differently than text tokens. To test this, I used instructions that do not deviate much from the training data, as Nanonets recommends in their examples.

Based on the Qwen-VL papers, this behavior may echo the Qwen team's training recipe for the ViT and LLM components, where each part was trained separately, then the ViT was frozen and the language model was merged in.

In the first phase, only the Vision Transformer (ViT) is trained to improve its alignment with the language model, laying a solid foundation for multimodal understanding. The primary data sources during this phase include image captions, visual knowledge, and OCR data. These datasets are carefully selected to foster ViT’s ability to extract meaningful visual representations that can be effectively integrated with textual information...

See section 2.2.2 of the Qwen2.5-VL Technical Report to learn more.
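To make the precision argument a little more concrete, here is a toy sketch of asymmetric uniform quantization in plain Python. It is not how OpenVINO or NNCF compress weights internally (real weight compression typically works per channel or per group), and the `weights` list is made up for illustration. It only shows that a 4-bit round trip loses more information than an 8-bit one on the same values.

```python
def quantize_asym(values, bits):
    """Asymmetric uniform quantization: map [min, max] onto integers 0..2**bits - 1."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_asym(codes, scale, lo):
    """Map integer codes back to approximate floating point values."""
    return [c * scale + lo for c in codes]


def max_abs_error(values, bits):
    """Largest absolute difference between the values and their quantized round trip."""
    codes, scale, lo = quantize_asym(values, bits)
    restored = dequantize_asym(codes, scale, lo)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weight row, made up for illustration only.
weights = [-0.82, -0.31, -0.05, 0.0, 0.07, 0.26, 0.44, 1.13]

err_int4 = max_abs_error(weights, bits=4)
err_int8 = max_abs_error(weights, bits=8)
print(f"INT4 max abs error: {err_int4:.4f}")
print(f"INT8 max abs error: {err_int8:.4f}")
```

The worst-case error is bounded by half the step size, so going from 8 bits to 4 bits widens that bound by roughly a factor of 17 (255 levels versus 15).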

Anyway, here's how to obtain the model, adapted from the pipeline-quantization example:

from optimum.intel import OVModelForVisualCausalLM
from optimum.intel import OVPipelineQuantizationConfig, OVQuantizationConfig, OVWeightQuantizationConfig

model_id = "nanonets/Nanonets-OCR2-3B"
model = OVModelForVisualCausalLM.from_pretrained(
    model_id,
    export=True,
    trust_remote_code=True,
    quantization_config=OVPipelineQuantizationConfig(
        quantization_configs={
            "lm_model": OVQuantizationConfig(bits=8),
            "text_embeddings_model": OVWeightQuantizationConfig(bits=4),
        },
        dataset="contextual",
        trust_remote_code=True,
    )
)
model.save_pretrained("Nanonets-OCR2-3B-LM-INT4_ASYM-VE-FP16-ov")
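One quick sanity check after the export is to look at how large each weight file is, since the submodels kept in higher precision should take noticeably more space per parameter. The sketch below assumes the export directory contains OpenVINO IR `.bin` weight files; `summarize_ir_dir` is a hypothetical helper name, not part of optimum-intel, and it makes no assumption about the individual file names.

```python
from pathlib import Path


def summarize_ir_dir(model_dir):
    """Return {file name: size in MiB} for every .bin weight file in model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        return {}
    sizes = {}
    for path in sorted(root.glob("*.bin")):
        sizes[path.name] = path.stat().st_size / (1024 * 1024)
    return sizes


# Directory name taken from the save_pretrained call above.
export_dir = "Nanonets-OCR2-3B-LM-INT4_ASYM-VE-FP16-ov"
for name, size_mib in summarize_ir_dir(export_dir).items():
    print(f"{name:<50} {size_mib:>10.1f} MiB")
```

If the directory does not exist yet, the loop simply prints nothing.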

Here is some test code that performs the core inference work. To make something useful, I recommend poaching from the Nanonets-OCR2 repo.

import time
from PIL import Image
from transformers import AutoProcessor, TextStreamer
from optimum.intel.openvino import OVModelForVisualCausalLM


model_id = "/home/ecomm/Desktop/lochinvar_nanonets/Nanonets-OCR2-3B-LM-INT4_ASYM-VE-FP16-ov"

print("Loading model...")
start_load_time = time.time()
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=False, device="CPU")
processor = AutoProcessor.from_pretrained(model_id)
# Stop the load timer here so it does not include preprocessing or generation
load_time = time.time() - start_load_time

image_path = r"/home/ecomm/Desktop/lochinvar_nanonets/OpenArc-1.0.6/src/tests/dedication.png"
image = Image.open(image_path).convert("RGB")

# model.preprocess_inputs takes the text prompt directly, so no conversation list is needed
text_prompt = "Convert this image to markdown code block"

# Preprocess the inputs using model.preprocess_inputs
inputs = model.preprocess_inputs(text=text_prompt, image=image, processor=processor)

# Print number of tokens
input_token_count = len(inputs["input_ids"][0])
print(f"Input token length: {input_token_count}")

# Inference: generate the output and time only the generate call
start_time = time.time()
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, streamer=streamer)
generation_time = time.time() - start_time

# Strip the prompt tokens so only newly generated tokens are decoded
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs["input_ids"], output_ids)]
output_text = processor.batch_decode(generated_ids, clean_up_tokenization_spaces=True, skip_special_tokens=True)

num_tokens_generated = len(generated_ids[0])
tokens_per_second = num_tokens_generated / generation_time
average_token_latency = generation_time / num_tokens_generated

print("\nPerformance Report:")
print("-"*50)
print(f"Input Tokens        : {input_token_count:>9}")
print(f"Generated Tokens    : {num_tokens_generated:>9}")
print(f"Model Load Time     : {load_time:>9.2f} sec")
print(f"Generation Time     : {generation_time:>9.2f} sec")
print(f"Throughput          : {tokens_per_second:>9.2f} t/s")
print(f"Avg Latency/Token   : {average_token_latency:>9.3f} sec")

print(output_text)
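The throughput and latency arithmetic in the script can be pulled into a small helper, which also makes it easy to guard against the case where generation returns no new tokens. This is a minimal sketch; `generation_stats` is a hypothetical name, and the numbers in the usage lines are made up, not measurements from this model.

```python
def generation_stats(num_tokens, seconds):
    """Return (tokens per second, seconds per token), guarding against empty output."""
    if num_tokens <= 0 or seconds <= 0:
        return 0.0, 0.0
    return num_tokens / seconds, seconds / num_tokens


# Made-up example values: 256 generated tokens in 12.8 seconds.
tps, latency = generation_stats(num_tokens=256, seconds=12.8)
print(f"Throughput        : {tps:>9.2f} t/s")
print(f"Avg Latency/Token : {latency:>9.3f} sec")
```

In the script, the two arguments would be `num_tokens_generated` and `generation_time`.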

At the time of writing, this model did not work with OpenVINO GenAI VLMPipeline or ContinuousBatchingPipeline on CPU. For now, it might not work in OpenArc.

This model was quite good, so I'll work on that.
