arxiv:2407.11306

PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer

Published on Jul 16, 2024

AI-generated summary

PADRe, a polynomial-based attention framework, replaces self-attention in transformers, offering faster computation and comparable accuracy across various computer vision tasks.

Abstract

We present Polynomial Attention Drop-in Replacement (PADRe), a novel and unifying framework designed to replace the conventional self-attention mechanism in transformer models. Notably, several recent alternative attention mechanisms, including Hyena, Mamba, SimA, Conv2Former, and Castling-ViT, can be viewed as specific instances of our PADRe framework. PADRe leverages polynomial functions and draws upon established results from approximation theory, enhancing computational efficiency without compromising accuracy. PADRe's key components include multiplicative nonlinearities, which we implement using straightforward, hardware-friendly operations such as Hadamard products, incurring only linear computational and memory costs. PADRe further avoids the need for complex functions such as Softmax, yet it maintains accuracy comparable or superior to traditional self-attention. We assess the effectiveness of PADRe as a drop-in replacement for self-attention across diverse computer vision tasks, including image classification, image-based 2D object detection, and 3D point cloud object detection. Empirical results demonstrate that PADRe runs significantly faster than conventional self-attention (11x-43x faster on server GPUs and mobile NPUs) while maintaining similar accuracy when substituting for self-attention in transformer models.
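
To make the Hadamard-product idea concrete, below is a minimal PyTorch sketch of a degree-2 multiplicative block. The module name Degree2PolyMixer, the per-channel nn.Linear branches, and the tensor shapes are illustrative assumptions, not the paper's actual PADRe layer (which also covers token mixing and higher polynomial degrees); the sketch only shows how a Hadamard product of linear maps of the input avoids Softmax and the tokens-by-tokens attention matrix, keeping cost linear in the number of tokens.

```python
import torch
import torch.nn as nn


class Degree2PolyMixer(nn.Module):
    """Hypothetical degree-2 multiplicative block in the spirit of PADRe.

    A sketch under simplifying assumptions, not the paper's exact layer:
    the two per-channel linear branches and the absence of a token-mixing
    operator are choices made here for brevity.
    """

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.branch_a = nn.Linear(dim, dim)  # first linear branch
        self.branch_b = nn.Linear(dim, dim)  # second linear branch
        self.proj_out = nn.Linear(dim, dim)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim). The elementwise (Hadamard) product of the
        # two branches is a degree-2 polynomial in x; no Softmax and no
        # tokens-by-tokens attention matrix is ever formed, so time and
        # memory stay linear in the number of tokens.
        return self.proj_out(self.branch_a(x) * self.branch_b(x))


x = torch.randn(2, 196, 384)   # (batch, tokens, channels), ViT-S-like shapes
y = Degree2PolyMixer(384)(x)
print(y.shape)                 # torch.Size([2, 196, 384])
```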

Community

Fascinating stuff: an extremely efficient replacement for attention that delivers strong results on different vision tasks right out of the box. The paper provides a rigorous mathematical formulation of a highly generalizable framework, along with valuable theoretical insights and empirical results.
The main idea seems simple and elegant - it surprises me that nobody has done serious experiments on this before.

The most basic variant, PADRe-2, reminds me of a 2D version of the GLU variant FFNBilinear. Without needing an activation function, it can learn arbitrary polynomials of degree 2^L (given a sufficient channel dimension) and achieve universal approximation; a toy check of the degree claim follows this comment.
I am looking forward to seeing applications to LLMs, further insights into higher-degree variants, and more ablations on how different hyperparameters interact with each other (e.g., does polynomial degree scale better in models with a high width/depth ratio?).
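
A quick way to sanity-check the degree-2^L claim is a scalar toy: stack L layers that each multiply two affine functions of the previous layer's output (the 1-D analogue of a Hadamard product of two linear branches) and track the resulting polynomial degree. This numpy check is purely illustrative; the random coefficients stand in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.polynomial.Polynomial([0.0, 1.0])   # p(x) = x, degree 1

for layer in range(1, 5):                  # L = 4 multiplicative layers
    a, b, c, d = rng.normal(size=4)
    # Each layer multiplies two affine functions of the previous output,
    # so the maximum polynomial degree doubles.
    p = (a * p + b) * (c * p + d)
    print(f"after layer {layer}: degree = {p.degree()}")

# Prints degrees 2, 4, 8, 16 -- i.e. 2^L after L layers.
```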
