SAME: A Semantically-Aligned Music Autoencoder

Please note: For commercial use, please refer to https://stability.ai/license

Model Description

Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically Aligned Music autoEncoder), a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard) while maintaining excellent reconstruction quality and strong downstream generative performance. We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses. The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
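To make the 4096x temporal compression figure concrete, the short sketch below works out how many latent frames a clip would map to and the resulting latent frame rate. The 44.1 kHz sample rate and the round-up (padding) behaviour are assumptions made for illustration, and the helper names are hypothetical; the model's actual values come from its config.

```python
import math

# Temporal compression ratio stated in the description above.
COMPRESSION_RATIO = 4096


def latent_frames(num_samples: int, ratio: int = COMPRESSION_RATIO) -> int:
    """Number of latent frames for a clip, assuming the input is padded
    up to a multiple of the compression ratio (an assumption)."""
    return math.ceil(num_samples / ratio)


def latent_rate_hz(sample_rate: int, ratio: int = COMPRESSION_RATIO) -> float:
    """Latent frames per second of audio at the given sample rate."""
    return sample_rate / ratio


if __name__ == "__main__":
    assumed_sample_rate = 44_100  # assumption, not taken from the model card
    print(latent_rate_hz(assumed_sample_rate))      # roughly 10.77 frames per second
    print(latent_frames(30 * assumed_sample_rate))  # frames for a 30-second clip
```

Under these assumptions, each latent frame summarises about 93 ms of audio, which is why sequence lengths for downstream generative models stay short.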

Usage

This model can be used with:

  1. the stable-audio-3 inference and fine-tuning library
  2. the stable-audio-tools research library

Using with stable-audio-3

import torchaudio
from stable_audio_3 import AutoencoderModel

ae = AutoencoderModel.from_pretrained("same-l")
waveform, sr = torchaudio.load("audio.wav")
latents = ae.encode(waveform, sr)
audio_out = ae.decode(latents)

Using with stable-audio-tools

import torch
import torchaudio
from stable_audio_tools import get_pretrained_model

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only on GPU; always defined so later checks work on CPU too
model_half = device == "cuda"

# Download model
model, model_config = get_pretrained_model("stabilityai/SAME-L")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

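model = model.to(device)
if model_half:
    model = model.to(torch.float16)

# Load audio as [channels, samples]
audio, sr = torchaudio.load("/path/to/audiofile")
# Resample if the file's rate differs from the rate in the model config
if sr != sample_rate:
    audio = torchaudio.functional.resample(audio, sr, sample_rate)
# Duplicate a mono channel to get stereo input
if audio.shape[0] == 1:
    audio = audio.repeat(2, 1)

# Add a batch dimension: [1, channels, samples]
audio = audio.unsqueeze(0).to(device)
if model_half:
    audio = audio.half()

with torch.no_grad():
    latents = model.encode_audio(audio)
    reconstructed = model.decode_audio(latents)

# Drop the batch dimension and convert to 16-bit PCM on the CPU
reconstructed = reconstructed.squeeze(0)
reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch.int16).cpu()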
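The final line of the snippet above turns floating-point audio into 16-bit PCM by clamping to [-1, 1], scaling by 32767, and casting to an integer type. As a rough, dependency-free illustration of that step and of how the result might be written to disk, here is a sketch using only the Python standard library. The function names are hypothetical, and the input is assumed to be a plain list of per-channel float lists (for example obtained with `.tolist()` on the float tensor) rather than a tensor.

```python
import struct
import wave


def float_to_pcm16(sample: float) -> int:
    """Clamp to [-1, 1], scale by 32767, and truncate toward zero."""
    clamped = max(-1.0, min(1.0, sample))
    return int(clamped * 32767)


def write_wav(dest, channels, sample_rate):
    """Write per-channel float samples as an interleaved 16-bit PCM WAV.

    dest: a file path or a seekable binary file-like object.
    channels: list of equal-length lists of floats, one list per channel.
    """
    # zip(*channels) yields one tuple per time step, i.e. interleaved frames.
    pcm = [float_to_pcm16(s) for frame in zip(*channels) for s in frame]
    with wave.open(dest, "wb") as wav_file:
        wav_file.setnchannels(len(channels))
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

With torchaudio already imported, calling `torchaudio.save` on the int16 tensor together with `sample_rate` would be the more direct route; the version above only spells out what the conversion does.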

Model Details

Training dataset

Datasets Used

Our dataset consists of ~19,500 hours of licensed production audio from AudioSparx, split 66/25/9% across music, sound effects, and instrument stems.
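For a sense of scale, the stated mix can be converted into approximate hours per category. This is simple arithmetic on the figures above; since the total is itself approximate (~19,500 hours), the results are estimates rather than exact dataset statistics, and the function name is hypothetical.

```python
# Figures taken from the dataset description; the total is approximate.
TOTAL_HOURS = 19_500
MIX_PERCENT = {"music": 66, "sound effects": 25, "instrument stems": 9}


def hours_per_category(total_hours, mix_percent):
    """Split a total duration according to integer percentage shares."""
    return {name: total_hours * pct / 100 for name, pct in mix_percent.items()}


if __name__ == "__main__":
    for name, hours in hours_per_category(TOTAL_HOURS, MIX_PERCENT).items():
        print(f"{name}: ~{hours:,.0f} hours")
```

This gives roughly 12,870 hours of music, 4,875 hours of sound effects, and 1,755 hours of instrument stems.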

Model size: 0.9B params (Safetensors, tensor type F32)