diffusion-gemma-asr-small

Links: Blog · Demo Space · Code

Audio-native, multilingual speech recognition that transcribes through DiffusionGemma's own discrete-diffusion decoder, not an autoregressive decoder and not an external ASR decoder. Audio is projected directly into the Gemma embedding space, and the transcript is produced by parallel diffusion denoising (~8–16 steps). This gives faster-than-real-time throughput, with cost set by the number of denoising steps rather than the length of the transcript.
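
As a rough illustration of why decoding cost follows the step count, the toy sketch below runs a fixed number of parallel passes over a 192-token canvas. `stub_denoiser`, `PAD`, the vocabulary size, the random starting canvas, and the reveal schedule are all hypothetical stand-ins, not the real DiffusionGemma decoder. The only point is that the number of forward passes equals `max_steps`, whatever the transcript length.

```python
import random

CANVAS_LEN = 192   # canvas size quoted in this model card
PAD = 0            # hypothetical padding token id


def stub_denoiser(canvas, target, step, total_steps):
    """Stand-in for ONE decoder forward pass (not the real model).

    Every canvas position is rewritten in the same pass. Position i
    becomes correct once step > i % total_steps, so all positions are
    correct after the final step.
    """
    return [
        target[i] if i % total_steps < step else canvas[i]
        for i in range(len(canvas))
    ]


def toy_diffusion_decode(target_ids, max_steps=16, vocab_size=1000, seed=0):
    """Denoise a fixed-size canvas and count the forward passes used."""
    rng = random.Random(seed)
    # Pad (or truncate) the reference transcript to the canvas size.
    target = (list(target_ids) + [PAD] * CANVAS_LEN)[:CANVAS_LEN]
    # Assumption: the canvas starts as uniformly random tokens.
    canvas = [rng.randrange(1, vocab_size) for _ in range(CANVAS_LEN)]
    passes = 0
    for step in range(1, max_steps + 1):
        canvas = stub_denoiser(canvas, target, step, max_steps)
        passes += 1
    return canvas, passes


short, passes_short = toy_diffusion_decode(range(1, 11), max_steps=8)
long_, passes_long = toy_diffusion_decode(range(1, 151), max_steps=8)
print(passes_short, passes_long)  # both equal max_steps
```

An autoregressive decoder would instead need one forward pass per generated token, so its cost grows with transcript length.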

This repo ships the trained adapter only (projector + LoRA, ~42M params, about 0.16% of the model). The frozen 26B DiffusionGemma backbone and the frozen whisper-small encoder load from their own repos.
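
The quoted sizes can be sanity-checked with a little arithmetic. The parameter counts are the rounded figures from this card; the LoRA share and the float32 storage assumption are inferences, not numbers stated by the repo.

```python
adapter_params = 42e6     # projector + LoRA, as quoted
backbone_params = 26e9    # frozen DiffusionGemma backbone
projector_params = 19e6   # projector size quoted in the pipeline diagram

# Share of the backbone that is actually trained.
fraction = adapter_params / backbone_params
print(f"trained fraction: {fraction:.2%}")            # about 0.16%

# Inferred, not stated: what remains after the projector is LoRA.
lora_params = adapter_params - projector_params
print(f"LoRA params: ~{lora_params / 1e6:.0f}M")

# Assumption: weights stored as float32 (4 bytes each). This lands close
# to the ~170 MB download size mentioned in the usage snippet.
fp32_megabytes = adapter_params * 4 / 1e6
print(f"float32 size: ~{fp32_megabytes:.0f} MB")
```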

How it works

raw audio ─► whisper-small encoder (frozen) ─► projector (trained, ~19M)
          ─► scatter into <audio> token slots of DiffusionGemma's encoder
          ─► DiffusionGemma decoder denoises a 192-token canvas (bidirectional, cross-attends audio)
          ─► transcript
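
The scatter step in the diagram above can be pictured as overwriting placeholder embeddings. Below is a minimal pure-Python sketch, assuming the prompt holds one `<audio>` placeholder token per projected audio frame. The function name, the token id 99, and the tiny 2-dimensional vectors are illustrative and are not taken from model.py.

```python
def scatter_audio(token_ids, token_embeds, audio_embeds, audio_token_id):
    """Replace the embedding at each <audio> slot with a projected audio frame.

    token_ids:    list of prompt token ids
    token_embeds: one embedding (list of floats) per prompt token
    audio_embeds: one projected embedding per audio frame, in time order
    """
    slots = [i for i, tok in enumerate(token_ids) if tok == audio_token_id]
    if len(slots) != len(audio_embeds):
        raise ValueError(
            f"{len(slots)} <audio> slots but {len(audio_embeds)} audio frames"
        )
    merged = [list(vec) for vec in token_embeds]   # copy, leave input intact
    for slot, frame in zip(slots, audio_embeds):
        merged[slot] = list(frame)
    return merged


AUDIO = 99                                  # hypothetical <audio> token id
ids = [5, AUDIO, AUDIO, 7]
text = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.7, 0.7]]
audio = [[1.0, 2.0], [3.0, 4.0]]
print(scatter_audio(ids, text, audio, AUDIO))
```

Text positions keep their original embeddings; only the placeholder slots carry audio information.
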
  • Backbone: google/diffusiongemma-26B-A4B-it, frozen, with small LoRA adapters on encoder/decoder attention.
  • Audio frontend: openai/whisper-small encoder, used as a frozen feature extractor (NOT a decoder).
  • Grounding: three training losses, namely a uniform-diffusion loss (the generator), an autoregressive (AR) auxiliary loss, and a CTC loss on the projector via the frozen lm_head (the key unlock that makes the audio embeddings transcript-predictive).
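
For readers unfamiliar with CTC, the sketch below computes the CTC likelihood of a token sequence from per-frame probabilities using the standard forward recursion. In the setup described above, those per-frame probabilities would come from passing projected audio frames through the frozen lm_head. The blank index 0 and the plain-probability arithmetic (real implementations work in log space) are simplifying assumptions, and this is not the training code.

```python
import math

BLANK = 0  # assumed blank index; the actual blank token is not specified here


def ctc_probability(frame_probs, target):
    """Probability that frame-level predictions collapse to `target`.

    frame_probs[t][k] is P(token k at frame t); target has no blanks.
    """
    if not frame_probs:
        return 1.0 if not target else 0.0
    # Interleave blanks around the labels: [_, y1, _, y2, ..., _]
    ext = [BLANK]
    for tok in target:
        ext += [tok, BLANK]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = frame_probs[0][BLANK]
    if size > 1:
        alpha[1] = frame_probs[0][ext[1]]
    for probs in frame_probs[1:]:
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[ext[s]]
        alpha = new
    # Valid paths end on the last label or the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(frame_probs, target):
    """Negative log-likelihood; infinite when no alignment exists."""
    p = ctc_probability(frame_probs, target)
    return math.inf if p == 0.0 else -math.log(p)


uniform = [[0.5, 0.5], [0.5, 0.5]]      # 2 frames, vocab = {blank, token 1}
print(ctc_probability(uniform, [1]))    # paths "11", "1_", "_1"
```

Minimizing this loss pushes each frame's distribution toward tokens that appear in the transcript, which is one way to read "transcript-predictive" above.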

Usage

Install

pip install torch peft soundfile librosa huggingface_hub \
  "transformers @ git+https://github.com/huggingface/transformers.git"   # DiffusionGemma support

Transcribe in Python

import sys, soundfile as sf
from huggingface_hub import snapshot_download

repo = snapshot_download("interfaze-ai/diffusion-gemma-asr-small")   # this adapter (~170 MB)
sys.path.insert(0, repo)
from inference import load, transcribe                       # bundled in this repo

# Loads frozen DiffusionGemma-26B + whisper-small + this adapter (downloads bases on first run).
model, tok, fe = load(f"{repo}/diffusion_asr_small.pt", device="cuda")

wav, sr = sf.read("audio.wav")        # 16 kHz mono float32 (inference.py resamples if needed)
print(transcribe(wav, model, tok, fe, max_steps=16))

Or from the command line

python inference.py audio.wav        # run inside the downloaded repo dir

Long audio is split at silence (the encoder has a 30 s window, like Whisper). max_steps trades speed for accuracy: 8 steps is the fastest setting and gives near-best accuracy, while 16 is the default.
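
The segmentation logic in inference.py is not reproduced here. As a hedged sketch of one simple way to split at silence while respecting the 30 s window, the code below cuts at the quietest frame within each window. The function names and the `min_sec` and `frame_ms` parameters are hypothetical.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of consecutive, non-overlapping frames."""
    energies = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return energies


def split_at_silence(samples, sr=16000, max_sec=30, min_sec=5, frame_ms=20):
    """Return (start, end) sample ranges, each at most max_sec long.

    While more than max_sec of audio remains, cut at the quietest frame
    found between min_sec and max_sec after the current segment start.
    """
    if len(samples) == 0:
        return []
    frame_len = max(1, sr * frame_ms // 1000)
    max_frames = int(max_sec * sr) // frame_len
    min_frames = max(1, int(min_sec * sr) // frame_len)
    if min_frames > max_frames:
        raise ValueError("min_sec must not exceed max_sec")
    energies = frame_rms(samples, frame_len)
    segments, start = [], 0                   # `start` counts frames
    while len(energies) - start > max_frames:
        lo, hi = start + min_frames, start + max_frames
        cut = min(range(lo, hi + 1), key=energies.__getitem__)
        segments.append((start * frame_len, cut * frame_len))
        start = cut
    segments.append((start * frame_len, len(samples)))
    return segments


# Toy signal at a 100 Hz "sample rate": 50 s of tone with a pause at 20 s.
toy = [1.0] * 2000 + [0.0] * 50 + [1.0] * 2950
print(split_at_silence(toy, sr=100, frame_ms=100))
```

Each returned range could then be sliced out of the waveform and passed to transcribe one segment at a time.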

Languages & accuracy

Trained on FLEURS (6 languages) + LibriSpeech (en) + VoxPopuli (en/de/fr/es). WER/CER are Whisper-normalized (Open-ASR / Artificial-Analysis convention), 16 diffusion steps:

Benchmark                     Metric   Score
LibriSpeech test-clean (en)   WER      6.6%
FLEURS English                WER      15.7%
VoxPopuli English             WER      18.5%
FLEURS Hindi                  CER      15.8%
FLEURS Mandarin               CER      29.6%
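
For reference, WER and CER are both edit distances divided by reference length, computed over words and characters respectively. The sketch below shows the computation. `simple_normalize` is a deliberately crude stand-in (lowercasing and ASCII punctuation removal) and is not the Whisper normalizer used for the scores above; dropping spaces for CER is also an assumption.

```python
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def simple_normalize(text):
    """Crude stand-in for a real normalizer: lowercase, strip ASCII punctuation."""
    return " ".join(text.lower().translate(_PUNCT).split())


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling single row)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref = simple_normalize(reference).split()
    hyp = simple_normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref = simple_normalize(reference).replace(" ", "")
    hyp = simple_normalize(hypothesis).replace(" ", "")
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words
```

CER is the usual choice for languages such as Mandarin, where word boundaries are not marked by spaces.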

Among diffusion / non-autoregressive ASR systems it leads (6.6% WER on LibriSpeech vs Whisfusion's 8.3%, with a smaller encoder). It trails autoregressive Whisper, which we attribute to a training-data gap (~219 h seen) rather than to the architecture.

Files

  • diffusion_asr_small.pt: trained adapter ({"projector": ..., "lora": ...})
  • model.py, audio.py: model definition (self-contained)
  • inference.py: runnable example (load + segment + transcribe)
  • requirements.txt

Requirements / licensing

  • Needs transformers from main (DiffusionGemma support) + torch, peft.
  • Base models load from their own repos under their licenses: google/diffusiongemma-26B-A4B-it (Gemma terms) and openai/whisper-small (MIT).
  • This adapter: Apache-2.0.

Limitations

  • Per-segment window is ≤30 s (encoder limit); long audio is chunked at silence, same as Whisper.
  • Mandarin is the weakest language; more data is the lever.

Prior work

Diffusion ASR is not new: TransFusion (2022) and Whisfusion (2025) came before. diffusion-gemma-asr-small is, as far as we know, the first multilingual one, the first built on DiffusionGemma, and the first done by training only a ~42M-param adapter on a frozen, off-the-shelf diffusion LLM.
