diffusion-gemma-asr-small

📝 Links: Blog · Demo Space · Code

Audio-native, multilingual speech recognition that transcribes through DiffusionGemma's own discrete-diffusion decoder, rather than through an autoregressive decoder or an external ASR decoder. Audio is projected directly into the Gemma embedding space, and the transcript is produced by parallel diffusion denoising (~8–16 steps). Decoding cost is therefore set by the number of denoising steps, not the length of the transcript, which gives faster-than-real-time throughput.
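To make the cost claim concrete, here is a toy sketch of fixed-step parallel refinement. It is not the model's decoder: the "denoiser" below cheats by peeking at the target, and the function name, vocabulary, and fix-probability schedule are all made up for illustration. The only point it shows is that the number of passes equals the number of steps, whatever the canvas length.

```python
import random

def toy_parallel_denoise(target, vocab, steps, seed=0):
    """Toy stand-in for parallel denoising (hypothetical, illustration only).

    Starts from uniformly random tokens and refines every position in the
    same pass. Returns the final canvas and the number of passes used.
    """
    rng = random.Random(seed)
    canvas = [rng.choice(vocab) for _ in target]  # start from pure noise
    passes = 0
    for step in range(1, steps + 1):
        passes += 1                # one pass updates all positions at once
        p_fix = step / steps       # reaches 1.0 on the final step
        canvas = [
            t if (c == t or rng.random() < p_fix) else rng.choice(vocab)
            for c, t in zip(canvas, target)
        ]
    return canvas, passes

vocab = list("abcdefghijklmnopqrstuvwxyz ")
short_out, short_passes = toy_parallel_denoise(list("hi"), vocab, steps=8)
long_out, long_passes = toy_parallel_denoise(list("hello world " * 10), vocab, steps=8)
print(short_passes, long_passes)   # same pass count for 2 and 120 tokens
```

An autoregressive decoder would instead need one pass per generated token, so its cost grows with transcript length.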

This repo ships the trained adapter only (projector + LoRA, ~42M params — 0.16% of the model). The frozen 26B DiffusionGemma backbone and the frozen whisper-small encoder load from their own repos.
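The adapter-size figures can be checked with a few lines of arithmetic. The parameter counts are the rounded values quoted above; storing the weights in 32-bit floats is an assumption used only to estimate the checkpoint size.

```python
adapter_params = 42e6      # projector + LoRA, rounded figure from this card
backbone_params = 26e9     # DiffusionGemma backbone, rounded figure

fraction = adapter_params / backbone_params
print(f"{fraction:.2%}")   # 0.16%

bytes_per_param = 4        # assumption: fp32 storage
size_mb = adapter_params * bytes_per_param / 1e6
print(f"{size_mb:.0f} MB") # 168 MB
```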

How it works

raw audio ─► whisper-small encoder (frozen) ─► projector (trained, ~19M)
          ─► scatter into <audio> token slots of DiffusionGemma's encoder
          ─► DiffusionGemma decoder denoises a 192-token canvas (bidirectional, cross-attends audio)
          ─► transcript
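The "scatter" step in the diagram can be sketched in plain Python as follows. This is an illustration under assumptions, not the code in model.py: the function name, the placeholder token id, and the list-of-lists embedding format are hypothetical. In a tensor framework the same thing would typically be a masked assignment.

```python
from typing import Callable, List, Sequence

Vector = List[float]

def scatter_audio(
    token_ids: Sequence[int],
    embed: Callable[[int], Vector],
    audio_frames: Sequence[Vector],
    audio_token_id: int,
) -> List[Vector]:
    """Replace each placeholder position with the next projected audio frame.

    Text tokens keep their ordinary embeddings; audio frames are consumed
    in order, one per placeholder slot.
    """
    n_slots = sum(1 for t in token_ids if t == audio_token_id)
    if n_slots != len(audio_frames):
        raise ValueError(f"{n_slots} audio slots but {len(audio_frames)} frames")
    frames = iter(audio_frames)
    return [
        list(next(frames)) if t == audio_token_id else embed(t)
        for t in token_ids
    ]

# Tiny example: token id 99 is the (hypothetical) <audio> placeholder.
toy_embed = lambda t: [float(t), 0.0]
mixed = scatter_audio([5, 99, 99, 7], toy_embed, [[0.1, 0.2], [0.3, 0.4]], 99)
print(mixed)
```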

Backbone: google/diffusiongemma-26B-A4B-it — frozen, small LoRA adapters on encoder/decoder attention.
Audio frontend: openai/whisper-small encoder — frozen feature extractor (NOT a decoder).
Grounding: trained with three losses: a uniform-diffusion loss (the generator objective), an autoregressive (AR) auxiliary loss, and a CTC loss applied to the projector output through the frozen lm_head. The CTC term is what makes the audio embeddings predictive of the transcript.
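For readers unfamiliar with CTC: it supervises per-frame token predictions, and a frame sequence maps to a transcript by merging repeated tokens and then dropping blanks. The sketch below shows that standard collapse rule only. It is not the training code from this repo, and the blank id of 0 is an assumption.

```python
def ctc_collapse(frame_ids, blank_id=0):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# A blank between two identical tokens keeps them as separate outputs.
collapsed = ctc_collapse([0, 3, 3, 0, 3, 5, 5, 0])
print(collapsed)   # [3, 3, 5]
```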

Usage

Install

pip install torch peft soundfile librosa huggingface_hub \
  "transformers @ git+https://github.com/huggingface/transformers.git"   # DiffusionGemma support

Transcribe in Python

import sys, soundfile as sf
from huggingface_hub import snapshot_download

repo = snapshot_download("interfaze-ai/diffusion-gemma-asr-small")   # this adapter (~170 MB)
sys.path.insert(0, repo)
from inference import load, transcribe                       # bundled in this repo

# Loads frozen DiffusionGemma-26B + whisper-small + this adapter (downloads bases on first run).
model, tok, fe = load(f"{repo}/diffusion_asr_small.pt", device="cuda")

wav, sr = sf.read("audio.wav", dtype="float32")   # sf.read defaults to float64; 16 kHz mono expected (inference.py resamples if needed)
print(transcribe(wav, model, tok, fe, max_steps=16))

Or from the command line

python inference.py audio.wav        # run inside the downloaded repo dir

Long audio is split at silence (the encoder has a 30 s window, like Whisper). max_steps trades speed for accuracy: 8 steps is the fastest setting and gives near-best accuracy, while 16 is the default.
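Silence-based splitting could look roughly like the sketch below. It is not the segmentation used by inference.py: the function names, the 5 s search window, and the 20 ms energy frames are assumptions. Each cut is placed at the quietest frame near the end of the 30 s window, so no segment exceeds the encoder limit.

```python
def frame_energy(samples, lo, hi):
    """Mean squared amplitude of samples[lo:hi]."""
    return sum(x * x for x in samples[lo:hi]) / max(1, hi - lo)

def split_at_silence(samples, sr, max_s=30.0, search_s=5.0, frame_s=0.02):
    """Return (start, end) sample ranges, each at most max_s seconds long.

    While more than max_s of audio remains, cut at the lowest-energy frame
    within the last search_s seconds of the current window.
    """
    if not 0 < search_s < max_s:
        raise ValueError("search_s must be positive and smaller than max_s")
    max_len = int(max_s * sr)
    search_len = int(search_s * sr)
    frame = max(1, int(frame_s * sr))
    segments, start, n = [], 0, len(samples)
    while n - start > max_len:
        hard_end = start + max_len
        candidates = range(hard_end - search_len, hard_end - frame + 1, frame)
        cut = min(candidates, key=lambda i: frame_energy(samples, i, i + frame))
        segments.append((start, cut))
        start = cut
    segments.append((start, n))
    return segments

# 70 s of constant "loud" signal at a toy 100 Hz rate, with two short silences.
sr = 100
signal = [1.0] * (70 * sr)
for lo in (2800, 5500):
    signal[lo:lo + 20] = [0.0] * 20
segments = split_at_silence(signal, sr)
print(segments)   # [(0, 2800), (2800, 5500), (5500, 7000)]
```

Each range could then be transcribed on its own and the texts joined in order.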

Languages & accuracy

Trained on FLEURS (6 languages) + LibriSpeech (en) + VoxPopuli (en/de/fr/es). WER/CER are computed on Whisper-normalized text (the Open-ASR / Artificial-Analysis convention) with 16 diffusion steps:

benchmark	metric	score
LibriSpeech test-clean (en)	WER	6.6%
FLEURS English	WER	15.7%
VoxPopuli English	WER	18.5%
FLEURS Hindi	CER	15.8%
FLEURS Mandarin	CER	29.6%
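For reference, WER and CER are both edit distance divided by reference length, counted over words and characters respectively. The sketch below shows only that definition. It does not reproduce the Whisper text normalization applied before scoring, and dropping whitespace for CER is an assumption made here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("empty reference")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate, ignoring whitespace (assumption)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("empty reference")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words
print(cer("你好世界", "你好视界"))                            # 1 edit / 4 chars
```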

Among diffusion / non-autoregressive ASR systems it leads on LibriSpeech (6.6% WER vs. Whisfusion's 8.3%, with a smaller encoder). It trails autoregressive Whisper, which we attribute to a training-data gap (~219 h seen) rather than to the architecture.

Files

diffusion_asr_small.pt — trained adapter ({"projector": ..., "lora": ...})
model.py, audio.py — model definition (self-contained)
inference.py — runnable example (load + segment + transcribe)
requirements.txt

Requirements / licensing

Needs transformers from main (DiffusionGemma support) + torch, peft.
Base models load from their own repos under their licenses: google/diffusiongemma-26B-A4B-it (Gemma terms) and openai/whisper-small (MIT).
This adapter: Apache-2.0.

Limitations

Per-segment window is ≤30 s (encoder limit) — long audio is chunked at silence, same as Whisper.
Mandarin is the weakest language; more data is the lever.

Prior work

Diffusion ASR is not new — TransFusion (2022) and Whisfusion (2025) came before. diffusion-gemma-asr-small is, as far as we know, the first multilingual one, the first built on DiffusionGemma, and the first done by training only a ~42M-param adapter on a frozen, off-the-shelf diffusion LLM.

Papers

Whisfusion: Parallel ASR Decoding via a Diffusion Transformer (arXiv:2508.07048, published Aug 9, 2025)
TransFusion: Transcribing Speech with Multinomial Diffusion (arXiv:2210.07677, published Oct 14, 2022)