RWKV-Qwen3-32B-hxa079-Low

Parameters: 33.8B

Architecture: RWKV hxa079+ / NoPE-GQA Hybrid Linear Attention

Composition: 56 RWKV layers + 8 NoPE-GQA layers


Model Description

RWKV-Qwen3-32B-hxa079-Low is a hybrid RWKV/Attention model designed for efficient large-scale inference on commodity GPUs.

The base architecture is RWKV hxa079+, an extended variant of BlinkDL’s RWKV-7 “Goose” design (expressive dynamic state evolution), incorporating modifications such as k_first and expanded decay mechanisms.
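
For orientation, below is a minimal sketch of an RWKV-7-style per-head state recurrence (the dynamic state evolution that stands in for a growing KV cache). It is a simplified reference written for readability, not the hxa079+ kernel, and the k_first and expanded-decay modifications mentioned above are not reproduced here.

```python
import torch

def rwkv7_style_step(S, r, w, k, v, a, kk):
    """One step of a simplified RWKV-7-style state recurrence for a single head.

    Illustrative only: the actual hxa079+ kernel (with k_first and its
    expanded decay parameterization) is more involved and fused for speed.

    S  : (D, D) running state matrix (this replaces a growing KV cache)
    r  : (D,)   receptance (query-like read vector)
    w  : (D,)   per-channel decay factors in (0, 1)
    k  : (D,)   key        v : (D,) value
    a  : (D,)   in-context learning rate in [0, 1]
    kk : (D,)   normalized "removal" key
    """
    # Decay the state, erase the old association along kk, then write the new k/v pair.
    transition = torch.diag(w) - torch.outer(kk, a * kk)
    S = S @ transition + torch.outer(v, k)
    # Read the state with the receptance vector; memory cost stays O(D^2),
    # independent of sequence length.
    y = S @ r
    return S, y
```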

This model was trained using a linear-conversion difficulty measurement during the Stage-1 hidden-state alignment distillation process. In this stage, layer-wise difficulty and convergence characteristics were evaluated, and layers judged unsuitable for RWKV conversion were selectively replaced with NoPE-GQA attention layers.
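
The exact difficulty metric is not documented in this card; the snippet below is only a hypothetical illustration of the selection idea, with invented function and variable names: track a per-layer hidden-state alignment loss against the teacher during Stage-1, then keep the layers that align worst as NoPE-GQA attention.

```python
import torch

def alignment_loss(student_hidden, teacher_hidden):
    """Per-layer hidden-state alignment objective (a simple MSE stand-in)."""
    return torch.nn.functional.mse_loss(student_hidden, teacher_hidden)

def rank_layers_for_attention_fallback(per_layer_losses, num_attention_layers=8):
    """Hypothetical sketch of the layer-selection idea.

    per_layer_losses: dict {layer_index: final Stage-1 alignment loss}
    Layers that converge worst under RWKV conversion are kept as NoPE-GQA
    attention; the rest are converted to RWKV.
    """
    ranked = sorted(per_layer_losses, key=per_layer_losses.get, reverse=True)
    attention_layers = sorted(ranked[:num_attention_layers])
    rwkv_layers = sorted(ranked[num_attention_layers:])
    return attention_layers, rwkv_layers
```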

This selective substitution drastically reduces the number of full-attention layers (compared with 1-in-4 hybrid baselines and similar approaches) while maintaining nearly the same performance on most real-world tasks.


Benchmarks

Benchmark results are included in bench.txt within the release package.

  • On tasks heavily dependent on the raw number of attention layers (e.g., ruler_vt benchmarks), performance is notably weaker.
  • On other standard benchmarks, results remain within a practical and competitive range for a 32B-class model with reduced Attention overhead.

Motivation and Deployment Notes

Deploying customized LLMs at this scale remains highly resource-intensive. Even for business use cases that only require a minimum viable deployment, full-Transformer architectures are severely limited by KV-cache constraints, which restrict multi-batch inference.
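
As a rough back-of-the-envelope illustration (the head counts and dimensions below are assumptions for a typical Qwen3-32B-class GQA configuration, not measured values), the KV-cache footprint scales directly with the number of attention layers:

```python
def kv_cache_gib(num_attn_layers, batch, seq_len,
                 kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache size in GiB for a GQA model (BF16 by default).

    Assumed shapes (typical for a Qwen3-32B-class config; verify against the
    actual model config): 8 KV heads, head_dim 128, K and V stored per layer.
    """
    per_token = num_attn_layers * 2 * kv_heads * head_dim * bytes_per_elem
    return per_token * batch * seq_len / (1024 ** 3)

# Full attention (64 layers) vs. this hybrid (8 NoPE-GQA layers),
# at batch 8 with 32K context:
print(kv_cache_gib(64, batch=8, seq_len=32768))  # ~64 GiB -> H100-class memory
print(kv_cache_gib(8,  batch=8, seq_len=32768))  # ~8 GiB  -> fits far smaller GPUs
```

The 56 RWKV layers instead carry a fixed-size recurrent state per sequence, so their memory cost does not grow with context length.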

In practice, 32B-class Transformer models require H100 GPUs as a baseline. However, the cost and scarcity of H100 units make them inaccessible for many organizations.

By contrast, cloud providers increasingly offer lower-cost GPUs such as the A6000, or even GPUs with less than 32 GB of memory, at highly competitive rates.

The goal of this project is to deliver a lighter, more deployable 32B model that enables practical multi-batch inference even on these modest, affordable GPUs.


Acknowledgments

This project was made possible through computational resources and technical support provided by Recursal.AI, to whom we extend our deepest gratitude.

We are especially thankful to SmerkyG for his invaluable technical assistance and guidance throughout this research.


Author’s Note

Building deployable, customized LLMs is an exhausting but deeply rewarding process. My personal motivation is to bridge the gap between cutting-edge model design and realistic deployment economics.

This release represents one step toward a future where high-parameter models can be deployed cost-effectively on widely available GPUs, without compromising too much on capability.


Base model: Qwen/Qwen3-32B
Format: BF16 safetensors