arxiv:2510.24657

Group Relative Attention Guidance for Image Editing

Published on Oct 28 · Submitted by Wu Kai on Oct 29
Authors: Kai Wu, et al.

Abstract

Group Relative Attention Guidance enhances image editing quality by modulating token deltas in Diffusion-in-Transformer models, providing fine-grained control over editing intensity.

AI-generated summary

Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model's inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance (GRAG), a simple yet effective method that reweights the delta values of different tokens to modulate the model's focus on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
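The abstract states that GRAG amounts to only a few lines of code inside the attention block. The sketch below is a rough illustration of the idea, not the authors' released implementation: it treats the per-group mean of a set of Query or Key tokens as a proxy for the shared bias vector, and rescales the token deltas around it before attention. The function name grag_reweight, the parameter grag_scale, and the image/text token split are illustrative assumptions.

import torch

def grag_reweight(tokens: torch.Tensor, grag_scale: float = 1.5) -> torch.Tensor:
    """Reweight token deltas around a group-shared bias (illustrative sketch).

    tokens: (batch, num_tokens, dim) Query or Key tokens of one group
            (e.g. the input-image tokens), before attention is computed.
    grag_scale: values > 1 emphasize the group's content-specific signal,
                values < 1 suppress it; 1.0 recovers the original tokens.
    """
    # Assumption: use the per-group mean as a stand-in for the shared bias vector.
    bias = tokens.mean(dim=1, keepdim=True)   # (batch, 1, dim)
    delta = tokens - bias                     # content-specific editing signal
    return bias + grag_scale * delta          # reweighted tokens

if __name__ == "__main__":
    # Example: rescale only the image-token group of Q and K, leave text tokens untouched.
    batch, n_img, n_txt, dim = 1, 64, 16, 128
    q = torch.randn(batch, n_img + n_txt, dim)
    k = torch.randn(batch, n_img + n_txt, dim)
    q = torch.cat([grag_reweight(q[:, :n_img], 1.5), q[:, n_img:]], dim=1)
    k = torch.cat([grag_reweight(k[:, :n_img], 1.5), k[:, n_img:]], dim=1)
    print(q.shape, k.shape)  # torch.Size([1, 80, 128]) torch.Size([1, 80, 128])

In this reading, sweeping grag_scale gives a continuous knob over how strongly the model attends to the input image relative to the editing instruction; the exact grouping and bias definition used by the paper may differ.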

Community

Project page: https://little-misfit.github.io/GRAG-Image-Editing/
Github code: https://github.com/little-misfit/GRAG-Image-Editing
arXiv link: https://arxiv.org/abs/2510.24657
