Transformers documentation

Add vision processing components


Adding a vision model requires two image processor classes on top of the standard modular approach.

For the modeling and config steps, follow the modular guide first.

  • torchvision backend is the default and supports GPU acceleration.
  • PIL backend is a fallback for environments where torchvision isn't installed.

Both classes share the same preprocessing logic but have different backends. Their constructor signatures and default values must be identical. AutoImageProcessor.from_pretrained() selects the backend at load time and falls back to PIL when torchvision isn’t available. Mismatched signatures cause the same saved config to behave differently across environments.
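Because a signature mismatch is easy to introduce silently, it helps to compare the two constructors directly in a test. A minimal sketch with stand-in classes (in a real test you would import your model's actual torchvision- and PIL-backed processors):

```python
import inspect

# Stand-in classes for illustration only; substitute the real
# torchvision- and PIL-backed processors for your model.
class MyModelImageProcessor:
    def __init__(self, size=224, do_resize=True, do_normalize=True):
        ...

class MyModelImageProcessorPil:
    def __init__(self, size=224, do_resize=True, do_normalize=True):
        ...

# Identical parameter names, order, and defaults -> identical signatures.
assert (
    inspect.signature(MyModelImageProcessor.__init__)
    == inspect.signature(MyModelImageProcessorPil.__init__)
)
```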

torchvision

Create image_processing_<model_name>.py with a class that inherits from TorchvisionBackend. Define a kwargs class first if your processor needs custom parameters beyond the standard ImagesKwargs.

from ...image_processing_backends import TorchvisionBackend
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
from ...processing_utils import ImagesKwargs, Unpack
from ...utils import auto_docstring

class MyModelImageProcessorKwargs(ImagesKwargs, total=False):
    tile_size: int  # any model-specific kwargs

@auto_docstring
class MyModelImageProcessor(TorchvisionBackend):
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"shortest_edge": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True

    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
        super().__init__(**kwargs)
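The class-level attributes above act as the processor's defaults, and individual kwargs override them at construction time. A rough sketch of that pattern with a hypothetical stand-in base (not the real TorchvisionBackend internals):

```python
# Hypothetical stand-in showing how a backend base class can turn
# class-level attributes into instance defaults; illustration only.
class BackendStub:
    do_resize = True
    size = {"shortest_edge": 224}

    def __init__(self, **kwargs):
        # Explicit kwargs override the class-level defaults.
        for name in ("do_resize", "size"):
            setattr(self, name, kwargs.get(name, getattr(type(self), name)))

class MyProcessorStub(BackendStub):
    # A subclass override becomes the new default.
    size = {"shortest_edge": 336}

p = MyProcessorStub()
print(p.size)  # subclass default wins

q = MyProcessorStub(do_resize=False)
print(q.do_resize)  # explicit kwarg wins
```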

PIL

Create image_processing_pil_<model_name>.py with a class that inherits from PilBackend. Duplicate the kwargs class here instead of importing it from the torchvision file, because that import can fail when torchvision isn't installed. Add an # Adapted from comment so the two copies stay in sync. For processors with no custom parameters, use ImagesKwargs directly.

from ...image_processing_backends import PilBackend
from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
from ...processing_utils import ImagesKwargs, Unpack
from ...utils import auto_docstring

# Adapted from transformers.models.my_model.image_processing_my_model.MyModelImageProcessorKwargs
class MyModelImageProcessorKwargs(ImagesKwargs, total=False):
    tile_size: int  # any model-specific kwargs

@auto_docstring
class MyModelImageProcessorPil(PilBackend):
    resample = PILImageResampling.BICUBIC
    image_mean = OPENAI_CLIP_MEAN
    image_std = OPENAI_CLIP_STD
    size = {"shortest_edge": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = True

    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
        super().__init__(**kwargs)

See CLIPImageProcessor/CLIPImageProcessorPil and LlavaOnevisionImageProcessor/LlavaOnevisionImageProcessorPil for reference.
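The load-time backend choice described earlier amounts to a simple availability check. A sketch under stated assumptions (select_backend_class is a hypothetical helper, not the actual AutoImageProcessor implementation):

```python
import importlib.util

def select_backend_class(torchvision_cls, pil_cls):
    # Hypothetical helper: prefer the torchvision backend and fall
    # back to the PIL one when torchvision isn't importable.
    if importlib.util.find_spec("torchvision") is not None:
        return torchvision_cls
    return pil_cls

# With stand-in classes, the result is always one of the two backends.
class Tv: ...
class Pil: ...

chosen = select_backend_class(Tv, Pil)
print(chosen in (Tv, Pil))  # True either way
```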

Next steps

With both processors defined, verify loading through AutoImageProcessor.from_pretrained() in environments with and without torchvision to confirm the fallback behaves identically.