EdgeTAMVideo

This model was released on 2025-01-13 and added to Hugging Face Transformers on 2025-09-29.

PyTorch SDPA FlashAttention

Overview

The EdgeTAM model was proposed in EdgeTAM: On-Device Track Anything Model by Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra, and Bilge Soran.

EdgeTAM is an efficient adaptation of SAM 2 that introduces a 2D Spatial Perceiver architecture to optimize memory attention mechanisms for real-time video segmentation on mobile devices.

The abstract from the paper is the following:

On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains a remarkable performance compared with previous methods, making it a foundation model for video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also the latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries. Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential so that the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. As a result, EdgeTAM achieves 87.7, 70.0, 72.3, and 71.7 J&F on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max.

This model was contributed by yonigozlan. The original code can be found here.

Usage example

Video Segmentation and Tracking

EdgeTAM Video’s key strength is its ability to track objects across video frames efficiently on mobile devices. Here’s how to use it for video segmentation:

Basic Video Tracking

>>> from transformers import EdgeTamVideoModel, Sam2VideoProcessor, infer_device
>>> import torch

>>> device = infer_device()
>>> model = EdgeTamVideoModel.from_pretrained("yonigozlan/edgetam-video-1").to(device, dtype=torch.bfloat16)
>>> processor = Sam2VideoProcessor.from_pretrained("yonigozlan/edgetam-video-1")

>>> # Load video frames (example assumes you have a list of PIL Images)
>>> # video_frames = [Image.open(f"frame_{i:05d}.jpg") for i in range(num_frames)]

>>> # For this example, we'll use the video loading utility
>>> from transformers.video_utils import load_video
>>> video_url = "https://huggingface.co/datasets/hf-internal-testing/sam2-fixtures/resolve/main/bedroom.mp4"
>>> video_frames, _ = load_video(video_url)

>>> # Initialize video inference session
>>> inference_session = processor.init_video_session(
...     video=video_frames,
...     inference_device=device,
...     dtype=torch.bfloat16,
... )

>>> # Add click on first frame to select object
>>> ann_frame_idx = 0
>>> ann_obj_id = 1
>>> points = [[[[210, 350]]]]
>>> labels = [[[1]]]

>>> processor.add_inputs_to_inference_session(
...     inference_session=inference_session,
...     frame_idx=ann_frame_idx,
...     obj_ids=ann_obj_id,
...     input_points=points,
...     input_labels=labels,
... )

>>> # Segment the object on the first frame
>>> outputs = model(
...     inference_session=inference_session,
...     frame_idx=ann_frame_idx,
... )
>>> video_res_masks = processor.post_process_masks(
...     [outputs.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
... )[0]
>>> print(f"Segmentation shape: {video_res_masks.shape}")
Segmentation shape: torch.Size([1, 1, 540, 960])

>>> # Propagate through the entire video
>>> video_segments = {}
>>> for sam2_video_output in model.propagate_in_video_iterator(inference_session):
...     video_res_masks = processor.post_process_masks(
...         [sam2_video_output.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
...     )[0]
...     video_segments[sam2_video_output.frame_idx] = video_res_masks

>>> print(f"Tracked object through {len(video_segments)} frames")
Tracked object through 200 frames
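
With `binarize=False`, the post-processed masks are floating-point scores rather than booleans. The following is a minimal sketch of turning one such mask into a binary mask, a pixel area, and a bounding box. It assumes the scores are logits where values above 0.0 mean foreground; the threshold and the helper names `binarize_mask` and `mask_bounding_box` are illustrative assumptions, not part of the library. It works on plain nested lists so it runs without a model; a tensor such as `video_res_masks[0, 0]` could be converted with `.tolist()` first.

```python
def binarize_mask(mask_logits, threshold=0.0):
    """Return a 2D list of booleans: True where the score exceeds the threshold."""
    return [[value > threshold for value in row] for row in mask_logits]


def mask_bounding_box(binary_mask):
    """Return (x_min, y_min, x_max, y_max) of the True pixels, or None if empty."""
    ys = [y for y, row in enumerate(binary_mask) if any(row)]
    xs = [x for row in binary_mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return min(xs), min(ys), max(xs), max(ys)


# Toy 4x5 grid of scores standing in for video_res_masks[0, 0].tolist()
toy_logits = [
    [-3.0, -2.0, -1.0, -4.0, -5.0],
    [-1.0, 2.5, 3.0, -0.5, -2.0],
    [-2.0, 1.0, 4.0, 0.2, -3.0],
    [-6.0, -1.0, -0.1, -2.0, -4.0],
]
binary = binarize_mask(toy_logits)
box = mask_bounding_box(binary)
area = sum(sum(row) for row in binary)  # True counts as 1
print(box, area)
```

Keeping the raw scores and thresholding later leaves room to choose a different cut-off per application, which is one reason to pass `binarize=False`.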

Multi-Object Video Tracking

Track multiple objects simultaneously across video frames:

>>> # Reset for new tracking session
>>> inference_session.reset_inference_session()

>>> # Add multiple objects on the first frame
>>> ann_frame_idx = 0
>>> obj_ids = [2, 3]
>>> input_points = [[[[200, 300]], [[400, 150]]]]  # Points for two objects (batched)
>>> input_labels = [[[1], [1]]]

>>> processor.add_inputs_to_inference_session(
...     inference_session=inference_session,
...     frame_idx=ann_frame_idx,
...     obj_ids=obj_ids,
...     input_points=input_points,
...     input_labels=input_labels,
... )

>>> # Get masks for both objects on first frame
>>> outputs = model(
...     inference_session=inference_session,
...     frame_idx=ann_frame_idx,
... )

>>> # Propagate both objects through video
>>> video_segments = {}
>>> for sam2_video_output in model.propagate_in_video_iterator(inference_session):
...     video_res_masks = processor.post_process_masks(
...         [sam2_video_output.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
...     )[0]
...     video_segments[sam2_video_output.frame_idx] = {
...         obj_id: video_res_masks[i]
...         for i, obj_id in enumerate(inference_session.obj_ids)
...     }

>>> print(f"Tracked {len(inference_session.obj_ids)} objects through {len(video_segments)} frames")
Tracked 2 objects through 200 frames

Refining Video Segmentation

You can add additional clicks on any frame to refine the tracking:

>>> # Add refinement click on a later frame
>>> refine_frame_idx = 50
>>> ann_obj_id = 2  # Refining first object
>>> points = [[[[220, 280]]]]  # Additional point
>>> labels = [[[1]]]  # Positive click

>>> processor.add_inputs_to_inference_session(
...     inference_session=inference_session,
...     frame_idx=refine_frame_idx,
...     obj_ids=ann_obj_id,
...     input_points=points,
...     input_labels=labels,
... )

>>> # Re-propagate with the additional information
>>> video_segments = {}
>>> for sam2_video_output in model.propagate_in_video_iterator(inference_session):
...     video_res_masks = processor.post_process_masks(
...         [sam2_video_output.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
...     )[0]
...     video_segments[sam2_video_output.frame_idx] = video_res_masks

Streaming Video Inference

For real-time applications, EdgeTAM Video supports processing video frames as they arrive:

>>> # Initialize session for streaming
>>> inference_session = processor.init_video_session(
...     inference_device=device,
...     dtype=torch.bfloat16,
... )

>>> # Process frames one by one
>>> for frame_idx, frame in enumerate(video_frames[:10]):  # Process first 10 frames
...     inputs = processor(images=frame, device=device, return_tensors="pt")
...
...     if frame_idx == 0:
...         # Add point input on first frame
...         processor.add_inputs_to_inference_session(
...             inference_session=inference_session,
...             frame_idx=0,
...             obj_ids=1,
...             input_points=[[[[210, 350], [250, 220]]]],
...             input_labels=[[[1, 1]]],
...             original_size=inputs.original_sizes[0],  # must be provided for streaming video inference
...         )
...
...     # Process current frame
...     sam2_video_output = model(inference_session=inference_session, frame=inputs.pixel_values[0])
...
...     video_res_masks = processor.post_process_masks(
...         [sam2_video_output.pred_masks], original_sizes=inputs.original_sizes, binarize=False
...     )[0]
...     print(f"Frame {frame_idx}: mask shape {video_res_masks.shape}")

Frame 0: mask shape torch.Size([1, 1, 540, 960])
...

Video Batch Processing for Multiple Objects

Track multiple objects simultaneously in video by adding them all at once:

>>> # Initialize video session
>>> inference_session = processor.init_video_session(
...     video=video_frames,
...     inference_device=device,
...     dtype=torch.bfloat16,
... )

>>> # Add multiple objects on the first frame using batch processing
>>> ann_frame_idx = 0
>>> obj_ids = [2, 3]  # Track two different objects
>>> input_points = [
...     [[[200, 300], [230, 250], [275, 175]], [[400, 150]]]
... ]  # Object 2: 3 points (2 positive, 1 negative); Object 3: 1 point
>>> input_labels = [
...     [[1, 1, 0], [1]]
... ]  # Object 2: positive, positive, negative; Object 3: positive

>>> processor.add_inputs_to_inference_session(
...     inference_session=inference_session,
...     frame_idx=ann_frame_idx,
...     obj_ids=obj_ids,
...     input_points=input_points,
...     input_labels=input_labels,
... )

>>> # Get masks for all objects on the first frame
>>> outputs = model(
...     inference_session=inference_session,
...     frame_idx=ann_frame_idx,
... )
>>> video_res_masks = processor.post_process_masks(
...     [outputs.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
... )[0]
>>> print(f"Generated masks for {video_res_masks.shape[0]} objects")
Generated masks for 2 objects

>>> # Propagate all objects through the video
>>> video_segments = {}
>>> for sam2_video_output in model.propagate_in_video_iterator(inference_session):
...     video_res_masks = processor.post_process_masks(
...         [sam2_video_output.pred_masks], original_sizes=[[inference_session.video_height, inference_session.video_width]], binarize=False
...     )[0]
...     video_segments[sam2_video_output.frame_idx] = {
...         obj_id: video_res_masks[i]
...         for i, obj_id in enumerate(inference_session.obj_ids)
...     }

>>> print(f"Tracked {len(inference_session.obj_ids)} objects through {len(video_segments)} frames")
Tracked 2 objects through 200 frames
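
The nested prompt lists above are easy to get wrong by hand. As a sketch, the helper below builds them from a dictionary of clicks per object. The name `build_point_prompts` and the `(x, y, label)` click format are hypothetical conveniences, and the nesting order (batch, then objects, then points) is inferred from the examples in this section rather than taken from the processor's specification.

```python
def build_point_prompts(clicks_by_object):
    """Convert {obj_id: [(x, y, label), ...]} into nested prompt lists.

    Returns (obj_ids, input_points, input_labels). The outermost list level is
    assumed to be the batch dimension (a single video here), followed by
    objects, then the points of each object.
    """
    obj_ids = list(clicks_by_object)
    input_points = [[[[x, y] for x, y, _ in clicks] for clicks in clicks_by_object.values()]]
    input_labels = [[[label for _, _, label in clicks] for clicks in clicks_by_object.values()]]
    return obj_ids, input_points, input_labels


clicks = {
    2: [(200, 300, 1), (230, 250, 1), (275, 175, 0)],  # two positive clicks, one negative
    3: [(400, 150, 1)],  # one positive click
}
obj_ids, input_points, input_labels = build_point_prompts(clicks)
print(obj_ids)
print(input_points)
print(input_labels)
```

The three returned values have the same structure as the `obj_ids`, `input_points`, and `input_labels` arguments used in the example above.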

EdgeTamVideoMaskDecoderConfig

class transformers.EdgeTamVideoMaskDecoderConfig

( hidden_size = 256 hidden_act = 'gelu' mlp_dim = 2048 num_hidden_layers = 2 num_attention_heads = 8 attention_downsample_rate = 2 num_multimask_outputs = 3 iou_head_depth = 3 iou_head_hidden_dim = 256 dynamic_multimask_via_stability = True dynamic_multimask_stability_delta = 0.05 dynamic_multimask_stability_thresh = 0.98 **kwargs )

Parameters

  • hidden_size (int, optional, defaults to 256) — Dimensionality of the hidden states.
  • hidden_act (str, optional, defaults to "gelu") — The non-linear activation function in the EDGETAM_VIDEO mask decoder.
  • mlp_dim (int, optional, defaults to 2048) — The dimension of the MLP in the two-way transformer.
  • num_hidden_layers (int, optional, defaults to 2) — The number of hidden layers in the two-way transformer.
  • num_attention_heads (int, optional, defaults to 8) — The number of attention heads in the two-way transformer.
  • attention_downsample_rate (int, optional, defaults to 2) — The downsample rate for the attention layers.
  • num_multimask_outputs (int, optional, defaults to 3) — The number of multimask outputs.
  • iou_head_depth (int, optional, defaults to 3) — The depth of the IoU head.
  • iou_head_hidden_dim (int, optional, defaults to 256) — The hidden dimension of the IoU head.
  • dynamic_multimask_via_stability (bool, optional, defaults to True) — Whether to use dynamic multimask via stability.
  • dynamic_multimask_stability_delta (float, optional, defaults to 0.05) — The stability delta for the dynamic multimask.
  • dynamic_multimask_stability_thresh (float, optional, defaults to 0.98) — The stability threshold for the dynamic multimask.

This is the configuration class to store the configuration of an EdgeTamVideoMaskDecoder. It is used to instantiate an EDGETAM_VIDEO mask decoder according to the specified arguments, defining the model architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
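
The `dynamic_multimask_*` parameters are easier to interpret with a stability score in mind. The sketch below assumes the definition used in the SAM 2 reference code: the ratio between the mask area at threshold `+delta` and the area at `-delta`. Whether this implementation matches that definition exactly is an assumption, and `stability_score` is a hypothetical helper written on plain lists.

```python
def stability_score(mask_logits, delta=0.05):
    """Ratio of pixels above +delta to pixels above -delta (1.0 if the latter is empty)."""
    flat = [value for row in mask_logits for value in row]
    area_inner = sum(value > delta for value in flat)
    area_outer = sum(value > -delta for value in flat)
    return area_inner / area_outer if area_outer > 0 else 1.0


confident = [[5.0, 4.0], [-6.0, -7.0]]    # all logits far from zero
uncertain = [[0.01, 4.0], [-0.02, -7.0]]  # two logits inside the +/- delta band

thresh = 0.98  # plays the role of dynamic_multimask_stability_thresh
for name, logits in [("confident", confident), ("uncertain", uncertain)]:
    score = stability_score(logits)
    decision = "keep single mask" if score >= thresh else "fall back to multimask"
    print(name, round(score, 3), decision)
```

Under this reading, a mask whose boundary barely moves when the threshold shifts by `delta` is considered stable, and only unstable single-mask predictions trigger the choice among the multimask outputs.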

EdgeTamVideoPromptEncoderConfig

class transformers.EdgeTamVideoPromptEncoderConfig

( hidden_size = 256 image_size = 1024 patch_size = 16 mask_input_channels = 16 num_point_embeddings = 4 hidden_act = 'gelu' layer_norm_eps = 1e-06 scale = 1 **kwargs )

Parameters

  • hidden_size (int, optional, defaults to 256) — Dimensionality of the hidden states.
  • image_size (int, optional, defaults to 1024) — The expected output resolution of the image.
  • patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
  • mask_input_channels (int, optional, defaults to 16) — The number of channels to be fed to the MaskDecoder module.
  • num_point_embeddings (int, optional, defaults to 4) — The number of point embeddings to be used.
  • hidden_act (str, optional, defaults to "gelu") — The non-linear activation function in the encoder and pooler.
  • layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
  • scale (float, optional, defaults to 1) — The scale factor for the prompt encoder.

This is the configuration class to store the configuration of an EdgeTamVideoPromptEncoder. The EdgeTamVideoPromptEncoder module is used to encode the input 2D points and bounding boxes.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

EdgeTamVideoConfig

class transformers.EdgeTamVideoConfig

( vision_config = None prompt_encoder_config = None mask_decoder_config = None initializer_range = 0.02 num_maskmem = 7 image_size = 1024 sigmoid_scale_for_mem_enc = 20.0 sigmoid_bias_for_mem_enc = -10.0 enable_occlusion_spatial_embedding = True multimask_output_in_sam = True multimask_min_pt_num = 0 multimask_max_pt_num = 1 multimask_output_for_tracking = True max_object_pointers_in_encoder = 16 enable_temporal_pos_encoding_for_object_pointers = True memory_attention_hidden_size = 256 memory_attention_num_layers = 2 memory_attention_num_attention_heads = 1 memory_attention_downsample_rate = 1 memory_attention_mlp_hidden_size = 2048 memory_attention_mlp_hidden_act = 'relu' memory_attention_dropout = 0.1 memory_attention_rope_theta = 10000 memory_attention_rope_feat_sizes = None memory_attention_rope_k_sizes = None memory_attention_rope_dropout = 0.1 perceiver_resampler_num_latents = 256 perceiver_resampler_num_latents_2d = 256 perceiver_resampler_hidden_size = 64 perceiver_resampler_mlp_intermediate_size = 256 perceiver_resampler_num_attention_heads = 1 perceiver_resampler_attention_head_dim = 64 perceiver_resampler_num_layers = 2 perceiver_resampler_hidden_dropout = 0.0 perceiver_resampler_attention_dropout = 0.0 memory_encoder_hidden_size = 256 memory_encoder_output_channels = 64 mask_downsampler_embed_dim = 256 memory_fuser_intermediate_dim = 1024 mask_downsampler_kernel_size = 3 mask_downsampler_stride = 2 mask_downsampler_padding = 1 mask_downsampler_total_stride = 16 mask_downsampler_hidden_act = 'gelu' memory_fuser_num_layers = 2 memory_fuser_embed_dim = 256 memory_fuser_kernel_size = 7 memory_fuser_padding = 3 memory_fuser_layer_scale_init_value = 1e-06 memory_fuser_hidden_act = 'gelu' **kwargs )

Parameters

  • vision_config (Union[dict, EdgeTamVideoVisionConfig], optional) — Dictionary of configuration options used to initialize EdgeTamVideoVisionConfig.
  • prompt_encoder_config (Union[dict, EdgeTamVideoPromptEncoderConfig], optional) — Dictionary of configuration options used to initialize EdgeTamVideoPromptEncoderConfig.
  • mask_decoder_config (Union[dict, EdgeTamVideoMaskDecoderConfig], optional) — Dictionary of configuration options used to initialize EdgeTamVideoMaskDecoderConfig.
  • initializer_range (float, optional, defaults to 0.02) — Standard deviation for parameter initialization.
  • num_maskmem (int, optional, defaults to 7) — The number of memory slots for the mask memory.
  • image_size (int, optional, defaults to 1024) — The size of the input images.
  • sigmoid_scale_for_mem_enc (float, optional, defaults to 20.0) — Scale factor for the sigmoid function in the memory encoder.
  • sigmoid_bias_for_mem_enc (float, optional, defaults to -10.0) — Bias for the sigmoid function in the memory encoder.
  • enable_occlusion_spatial_embedding (bool, optional, defaults to True) — Whether to enable spatial embedding for occlusions.
  • multimask_output_in_sam (bool, optional, defaults to True) — Whether to output multiple masks from the SAM head.
  • multimask_min_pt_num (int, optional, defaults to 0) — The minimum number of points to trigger multimask output.
  • multimask_max_pt_num (int, optional, defaults to 1) — The maximum number of points to trigger multimask output.
  • multimask_output_for_tracking (bool, optional, defaults to True) — Whether to use multimask output for tracking.
  • max_object_pointers_in_encoder (int, optional, defaults to 16) — The maximum number of object pointers in the encoder.
  • enable_temporal_pos_encoding_for_object_pointers (bool, optional, defaults to True) — Whether to enable temporal positional encoding for object pointers.
  • memory_attention_hidden_size (int, optional, defaults to 256) — Dimensionality of the memory attention hidden states.
  • memory_attention_num_layers (int, optional, defaults to 2) — The number of layers in the memory attention module.
  • memory_attention_num_attention_heads (int, optional, defaults to 1) — Number of attention heads for each attention layer in the memory attention.
  • memory_attention_downsample_rate (int, optional, defaults to 1) — The downsample rate for the attention layers.
  • memory_attention_mlp_hidden_size (int, optional, defaults to 2048) — The dimension of the feedforward network in the memory attention module.
  • memory_attention_mlp_hidden_act (str, optional, defaults to "relu") — The non-linear activation function in the feedforward network in the memory attention module.
  • memory_attention_dropout (float, optional, defaults to 0.1) — The dropout rate for the memory attention module.
  • memory_attention_rope_theta (float, optional, defaults to 10000) — The RoPE theta parameter.
  • memory_attention_rope_feat_sizes (Tuple[int, int], optional, defaults to [64, 64]) — The feature sizes for the RoPE positional encoding.
  • memory_attention_rope_k_sizes (List[int], optional, defaults to [16, 16]) — The key feature sizes for the RoPE positional encoding in memory attention.
  • memory_attention_rope_dropout (float, optional, defaults to 0.1) — The dropout rate for the RoPE positional encoding.
  • perceiver_resampler_num_latents (int, optional, defaults to 256) — The number of 1D latent tokens in the perceiver resampler.
  • perceiver_resampler_num_latents_2d (int, optional, defaults to 256) — The number of 2D latent tokens in the perceiver resampler.
  • perceiver_resampler_hidden_size (int, optional, defaults to 64) — The hidden size of the perceiver resampler.
  • perceiver_resampler_mlp_intermediate_size (int, optional, defaults to 256) — The intermediate size of the feedforward network in the perceiver resampler.
  • perceiver_resampler_num_attention_heads (int, optional, defaults to 1) — The number of attention heads in the perceiver resampler.
  • perceiver_resampler_attention_head_dim (int, optional, defaults to 64) — The dimension of each attention head in the perceiver resampler.
  • perceiver_resampler_num_layers (int, optional, defaults to 2) — The number of layers in the perceiver resampler.
  • perceiver_resampler_hidden_dropout (float, optional, defaults to 0.0) — The dropout rate for the hidden layers in the perceiver resampler.
  • perceiver_resampler_attention_dropout (float, optional, defaults to 0.0) — The dropout rate for the attention layers in the perceiver resampler.
  • memory_encoder_hidden_size (int, optional, defaults to 256) — Dimensionality of the memory encoder hidden states.
  • memory_encoder_output_channels (int, optional, defaults to 64) — The number of output channels for the memory encoder.
  • mask_downsampler_embed_dim (int, optional, defaults to 256) — The dimension of the mask downsampler embedding.
  • memory_fuser_intermediate_dim (int, optional, defaults to 1024) — The intermediate dimension of the memory fuser feedforward network.
  • mask_downsampler_kernel_size (int, optional, defaults to 3) — The kernel size for the mask downsampler.
  • mask_downsampler_stride (int, optional, defaults to 2) — The stride for the mask downsampler.
  • mask_downsampler_padding (int, optional, defaults to 1) — The padding for the mask downsampler.
  • mask_downsampler_total_stride (int, optional, defaults to 16) — The total stride for the mask downsampler.
  • mask_downsampler_hidden_act (str, optional, defaults to "gelu") — The non-linear activation function in the mask downsampler.
  • memory_fuser_num_layers (int, optional, defaults to 2) — The number of layers in the memory fuser.
  • memory_fuser_embed_dim (int, optional, defaults to 256) — The dimension of the memory fuser embedding.
  • memory_fuser_kernel_size (int, optional, defaults to 7) — The kernel size for the memory fuser.
  • memory_fuser_padding (int, optional, defaults to 3) — The padding for the memory fuser.
  • memory_fuser_layer_scale_init_value (float, optional, defaults to 1e-06) — The initial value for the layer scale in the memory fuser.
  • memory_fuser_hidden_act (str, optional, defaults to "gelu") — The non-linear activation function in the memory fuser.

EdgeTamVideoConfig is the configuration class to store the configuration of an EdgeTamVideoModel. It is used to instantiate an EDGETAM model according to the specified arguments, defining the memory attention, memory encoder, and image encoder configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the SAM 2.1 Hiera-tiny facebook/EdgeTAM architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import (
...     EdgeTamVisionConfig,
...     EdgeTamVideoPromptEncoderConfig,
...     EdgeTamVideoMaskDecoderConfig,
...     EdgeTamVideoModel,
...     EdgeTamVideoConfig,
... )

>>> # Initializing a EdgeTamVideoConfig with `"facebook/edgetam.1_hiera_tiny"` style configuration
>>> configuration = EdgeTamVideoConfig()

>>> # Initializing a EdgeTamVideoModel (with random weights) from the `"facebook/edgetam.1_hiera_tiny"` style configuration
>>> model = EdgeTamVideoModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

>>> # We can also initialize an EdgeTamVideoConfig from an EdgeTamVisionConfig, EdgeTamVideoPromptEncoderConfig, and EdgeTamVideoMaskDecoderConfig

>>> # Initializing EDGETAM vision, prompt encoder, and mask decoder configurations
>>> vision_config = EdgeTamVisionConfig()
>>> prompt_encoder_config = EdgeTamVideoPromptEncoderConfig()
>>> mask_decoder_config = EdgeTamVideoMaskDecoderConfig()

>>> config = EdgeTamVideoConfig(vision_config, prompt_encoder_config, mask_decoder_config)
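
As a rough illustration of `sigmoid_scale_for_mem_enc` and `sigmoid_bias_for_mem_enc`: assuming, as in the SAM 2 reference code, that mask logits are passed through a sigmoid and then scaled and shifted before entering the memory encoder, the defaults map probabilities from the range 0 to 1 onto the range -10 to 10. The function name `scale_mask_for_memory` is hypothetical, and the exact place where this transform is applied in this implementation is an assumption.

```python
import math


def scale_mask_for_memory(logit, scale=20.0, bias=-10.0):
    """Apply a sigmoid, then the affine map sigmoid(logit) * scale + bias."""
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability * scale + bias


for logit in (-8.0, 0.0, 8.0):
    print(logit, round(scale_mask_for_memory(logit), 3))
```

With the default values, an undecided pixel (logit 0) lands at 0, while confident background and foreground pixels approach -10 and 10 respectively.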

EdgeTamVideoInferenceSession

class transformers.EdgeTamVideoInferenceSession

( video: typing.Optional[torch.FloatTensor] = None video_height: typing.Optional[int] = None video_width: typing.Optional[int] = None inference_device: typing.Union[torch.device, str] = 'cpu' inference_state_device: typing.Union[torch.device, str] = 'cpu' video_storage_device: typing.Union[torch.device, str] = 'cpu' dtype: typing.Union[torch.dtype, str] = 'float32' max_vision_features_cache_size: int = 1 )

Parameters

  • video (torch.FloatTensor, optional) — The video to process. No need to provide when streaming.
  • video_height (int, optional) — The height of the video.
  • video_width (int, optional) — The width of the video.
  • inference_device (torch.device, optional, defaults to "cpu") — The device to use for inference.
  • inference_state_device (torch.device, optional, defaults to "cpu") — The device to store the inference state on.
  • video_storage_device (torch.device, optional, defaults to "cpu") — The device to store the video on.
  • dtype (torch.dtype, optional, defaults to "float32") — The dtype to use for the video.
  • max_vision_features_cache_size (int, optional, defaults to 1) — The maximum number of vision features to cache.

Manages video inference session parameters, state and cache.

add_mask_inputs

( obj_idx: int frame_idx: int inputs: Tensor )

Add mask inputs with automatic device placement.

add_new_frame

( pixel_values: Tensor frame_idx: typing.Optional[int] = None )

Add new frame with automatic device placement.

add_point_inputs

( obj_idx: int frame_idx: int inputs: dict )

Add point inputs with automatic device placement.

get_frame

( frame_idx: int )

Get frame from video.

get_obj_num

( )

Get the total number of unique object ids received so far in this session.

get_output

( obj_idx: int frame_idx: int output_key: str is_conditioning_frame: bool = True )

Parameters

  • obj_idx (int) — The index of the object.
  • frame_idx (int) — The index of the frame.
  • output_key (str) — The key of the output.
  • is_conditioning_frame (bool) — Whether the output is for a conditioning frame.

Get output with smart device management.

obj_id_to_idx

( obj_id: int )

Map object ID to index, creating new entry if needed.

obj_idx_to_id

( obj_idx: int )

Map model-side object index to client-side object id.
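
The relationship between `obj_id_to_idx` and `obj_idx_to_id` can be pictured as a two-way lookup. The class below is a hypothetical stand-in for illustration only and is not the session's actual implementation; in particular, assigning indices in order of first appearance is an assumption.

```python
class ObjectIdMap:
    """Hypothetical two-way map between client-side object ids and model-side indices."""

    def __init__(self):
        self._id_to_idx = {}
        self._idx_to_id = {}

    def obj_id_to_idx(self, obj_id):
        # Create a new entry the first time an id is seen.
        if obj_id not in self._id_to_idx:
            idx = len(self._id_to_idx)
            self._id_to_idx[obj_id] = idx
            self._idx_to_id[idx] = obj_id
        return self._id_to_idx[obj_id]

    def obj_idx_to_id(self, obj_idx):
        # Reverse lookup; raises KeyError for an index that was never assigned.
        return self._idx_to_id[obj_idx]

    def get_obj_num(self):
        # Number of unique object ids registered so far.
        return len(self._id_to_idx)


id_map = ObjectIdMap()
first = id_map.obj_id_to_idx(2)   # new id, gets the next free index
second = id_map.obj_id_to_idx(3)  # another new id
again = id_map.obj_id_to_idx(2)   # known id, same index as before
print(first, second, again, id_map.obj_idx_to_id(1), id_map.get_obj_num())
```

This separation lets callers use arbitrary ids such as 2 and 3, as in the usage examples, while the model works with contiguous indices.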

remove_mask_inputs

( obj_idx: int frame_idx: int )

Remove mask inputs.

remove_point_inputs

( obj_idx: int frame_idx: int )

Remove point inputs.

reset_inference_session

( )

Reset tracking data and cache.

reset_tracking_data

( )

Reset tracking data but keep cache.

store_output

( obj_idx: int frame_idx: int output_key: typing.Optional[str] = None output_value: typing.Union[torch.Tensor, dict, NoneType] = None is_conditioning_frame: bool = True )

Parameters

  • obj_idx (int) — The index of the object.
  • frame_idx (int) — The index of the frame.
  • output_key (Optional[str]) — The key of the output. If None, the output is stored as a dictionary.
  • output_value (Optional[Union[torch.Tensor, dict]]) — The value of the output.
  • is_conditioning_frame (bool) — Whether the output is for a conditioning frame.

Store output with smart device management. If output_key is None, the output is stored as a dictionary.

EdgeTamVideoModel

class transformers.EdgeTamVideoModel

( config: EdgeTamVideoConfig )

Parameters

  • config (EdgeTamVideoConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare EdgeTAM Video model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( inference_session: EdgeTamVideoInferenceSession frame_idx: typing.Optional[int] = None frame: typing.Optional[torch.Tensor] = None reverse: bool = False ) transformers.models.edgetam_video.modeling_edgetam_video.EdgeTamVideoSegmentationOutput or tuple(torch.FloatTensor)

Parameters

  • inference_session (~models.edgetam_video.modeling_edgetam_video.EdgeTamVideoInferenceSession) — The video inference session object.
  • frame_idx (int, optional) — The index of the frame on which to run inference. No need to provide when inferring on a new streamed frame.
  • frame (torch.Tensor, optional) — The frame to process. Provide when streaming.
  • reverse (bool, optional, defaults to False) — Whether to propagate in reverse.

Returns

transformers.models.edgetam_video.modeling_edgetam_video.EdgeTamVideoSegmentationOutput or tuple(torch.FloatTensor)

A transformers.models.edgetam_video.modeling_edgetam_video.EdgeTamVideoSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (EdgeTamVideoConfig) and inputs.

  • pred_masks (torch.FloatTensor of shape (batch_size, num_masks, height, width)) — The predicted masks stored at the model’s resolution.
  • frame_idx (int, optional, defaults to None) — The frame index of the video.

Propagate the objects through a streamed video frame.
