---
license: other
license_name: qwen
license_link: https://huggingface.co/Qwen/SAE-Res-Qwen3.5-35B-A3B-Base-W32K-L0_50/blob/main/LICENSE
language:
- en
tags:
- sparse-autoencoder
- sae
- mechanistic-interpretability
- interpretability
- qwen-scope
- arxiv:2605.11887
base_model: Qwen/Qwen3.5-35B-A3B-Base
---

## Qwen-Scope: Decoding Intelligence, Unleashing Potential

![Overview](https://qianwen-res.oss-cn-beijing.aliyuncs.com/qwen-scope/Figures/overview.png)

We are excited to introduce Qwen-Scope, an interpretability module trained on the Qwen3 and Qwen3.5 model series. Specifically, we train Sparse Autoencoders (SAEs) on the hidden states of Qwen’s layers. By imposing sparsity constraints, the SAEs automatically extract features that are highly decoupled, low in redundancy, and significantly more interpretable. Qwen-Scope can be used to analyze the internal mechanisms behind Qwen’s behavior, and it also holds considerable potential for model optimization. Application scenarios include steerable inference control, analysis and comparison of evaluation sample distributions, data classification and synthesis, and model training and optimization. See our [technical report](https://qianwen-res.oss-accelerate.aliyuncs.com/qwen-scope/Qwen_Scope.pdf) for more details.

## Model Details

| Property | Value |
|---|---|
| Base model | [Qwen3.5-35B-A3B-Base](https://huggingface.co/Qwen/Qwen3.5-35B-A3B-Base) |
| SAE width (`d_sae`) | 32768 |
| Hidden size (`d_model`) | 2048 |
| Expansion factor | 16× |
| Top-K | 50 |
| Hook point | Residual stream |
| Layers covered | 0 – 39 (40 layers total) |
| File format | PyTorch `.pt` dict |

## Architecture

This is a **TopK SAE**: on each forward pass, exactly **50** features are kept non-zero.

Each checkpoint file `layer{n}.sae.pt` is a Python `dict` with four tensors:

| Key | Shape | Description |
|---|---|---|
| `W_enc` | `(32768, 2048)` | Encoder weight matrix |
| `W_dec` | `(2048, 32768)` | Decoder weight matrix |
| `b_enc` | `(32768,)` | Encoder bias |
| `b_dec` | `(2048,)` | Decoder bias |
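
The four tensors are enough to sketch a full encode/decode round trip. The example below uses small random tensors with the same layout as the checkpoint (scaled-down sizes so it runs quickly). The encoder step follows the TopK rule described above; the decoder formula `acts @ W_dec.T + b_dec` is an assumption inferred from the tensor shapes, not something this card confirms, and `topk_encode` and `decode` are hypothetical helper names.

```python
import torch

def topk_encode(x, W_enc, b_enc, k):
    """x: (..., d_model) -> sparse activations (..., d_sae) with k kept entries."""
    pre_acts = x @ W_enc.T + b_enc
    vals, idx = pre_acts.topk(k, dim=-1)
    acts = torch.zeros_like(pre_acts)
    acts.scatter_(-1, idx, vals)  # keep only the k largest pre-activations
    return acts

def decode(acts, W_dec, b_dec):
    """Assumed decoder: acts (..., d_sae) -> reconstruction (..., d_model)."""
    return acts @ W_dec.T + b_dec

# Scaled-down stand-ins for d_model=2048, d_sae=32768, k=50 (same 16x expansion).
d_model, d_sae, k = 8, 128, 5
torch.manual_seed(0)
W_enc = torch.randn(d_sae, d_model)   # same layout as `W_enc`: (d_sae, d_model)
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_model, d_sae)   # same layout as `W_dec`: (d_model, d_sae)
b_dec = torch.zeros(d_model)

x = torch.randn(3, d_model)           # three fake residual-stream vectors
acts = topk_encode(x, W_enc, b_enc, k)
recon = decode(acts, W_dec, b_dec)
print(acts.shape, recon.shape)
```

With random weights the reconstruction is meaningless; the point is only the shapes and the sparsity pattern. With a real checkpoint, comparing `recon` against the original residual would be one way to check whether the assumed decoder convention is correct.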

## Files

This repository contains one SAE checkpoint per transformer layer (layers 0–39):

```
layer0.sae.pt
layer1.sae.pt
...
layer39.sae.pt
```
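
To fetch a single layer's checkpoint rather than the whole repository, something like the following should work. It is a sketch that assumes the `huggingface_hub` package is installed and that the files sit at the repository root under the names listed above. `checkpoint_name` and `download_checkpoint` are hypothetical helpers, and the repository id is the one that appears in this card's license link.

```python
NUM_LAYERS = 40
REPO_ID = "Qwen/SAE-Res-Qwen3.5-35B-A3B-Base-W32K-L0_50"

def checkpoint_name(layer: int) -> str:
    """Return the checkpoint filename for a layer index in 0..39."""
    if not 0 <= layer < NUM_LAYERS:
        raise ValueError(f"layer must be in 0..{NUM_LAYERS - 1}, got {layer}")
    return f"layer{layer}.sae.pt"

def download_checkpoint(layer: int) -> str:
    """Download one checkpoint and return its local cache path."""
    # Imported lazily so the filename helper works without huggingface_hub.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_name(layer))

if __name__ == "__main__":
    print(checkpoint_name(39))
```

The returned path can then be passed to `torch.load(path, map_location="cpu")`.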

## Feature Activation Extraction

End-to-end demo: run the base LLM, hook the residual stream at a chosen layer, and extract sparse SAE feature activations.
In most cases, it is also reasonable to use SAEs trained on the base model to explore the internals of post-trained checkpoints.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ── 1. Load base model ────────────────────────────────────────────────────────
model_name = "Qwen/Qwen3.5-35B-A3B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

# ── 2. Load SAE for a target layer ───────────────────────────────────────────
LAYER = 0  # choose any layer in 0–39
sae = torch.load(f"layer{LAYER}.sae.pt", map_location="cpu")
W_enc = sae["W_enc"]  # (32768, 2048)
b_enc = sae["b_enc"]  # (32768,)

def get_feature_acts(residual: torch.Tensor) -> torch.Tensor:
    """residual: (..., 2048) -> sparse feature activations (..., 32768)"""
    pre_acts = residual @ W_enc.T + b_enc
    topk_vals, topk_idx = pre_acts.topk(50, dim=-1)
    acts = torch.zeros_like(pre_acts)
    acts.scatter_(-1, topk_idx, topk_vals)
    return acts

# ── 3. Hook residual stream after the target transformer layer ────────────────
captured = {}

def _hook(module, input, output):
    hidden = output[0] if isinstance(output, tuple) else output
    captured["residual"] = hidden.detach().cpu()

hook = model.model.layers[LAYER].register_forward_hook(_hook)

# ── 4. Forward pass ───────────────────────────────────────────────────────────
text = "The capital of France is"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    model(**inputs)
hook.remove()

# ── 5. Extract feature activations ───────────────────────────────────────────
residual = captured["residual"]               # (1, seq_len, 2048)
feature_acts = get_feature_acts(residual)     # (1, seq_len, 32768)

# Inspect active features for the last token
last_token_acts = feature_acts[0, -1]         # (32768,)
active_idx = last_token_acts.nonzero(as_tuple=True)[0]
print(f"Active features : {active_idx.tolist()}")
print(f"Feature values  : {last_token_acts[active_idx].tolist()}")
```
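
The dense `(seq_len, 32768)` tensor is mostly zeros, so a short ranked list per token is often easier to read. A minimal sketch, using a small synthetic vector in place of a row of `feature_acts[0]`; `top_features` is a hypothetical helper, not part of the released code.

```python
import torch

def top_features(acts: torch.Tensor, n: int = 5):
    """acts: (d_sae,) activations for one token -> [(feature_idx, value), ...]
    sorted by descending activation, skipping entries that are exactly zero."""
    n = min(n, acts.numel())
    vals, idx = acts.topk(n)
    return [(int(i), float(v)) for i, v in zip(idx, vals) if v != 0]

# Synthetic stand-in for one token's activations (three active features).
acts = torch.zeros(16)
acts[3] = 2.5
acts[7] = 0.5
acts[11] = 4.0
print(top_features(acts, n=5))  # [(11, 4.0), (3, 2.5), (7, 0.5)]
```

Applied to the demo above, `top_features(feature_acts[0, -1], n=10)` would list the ten strongest features for the last token.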

## Gradio Demo

We also provide a Gradio demo, `app.py`, which you can run locally:
```
python app.py \
    --model Qwen/Qwen3.5-35B-A3B-Base \
    --model-name-sae-trained-from qwen3.5-35b-a3b-base \
    --model-name-analyzing-now qwen3.5-35b-a3b \
    --sae-path Qwen/SAE-Res-Qwen3.5-35B-A3B-Base-W32K-L0_50 \
    --top-k 50 \
    --num-layers 40 \
    --sae-width 32768 \
    --d-model 2048 \
    --server-port 7860
```

## Caution
It is strictly prohibited to use interpretability tools for non-scientific research purposes to interfere with model capabilities, or to fabricate, generate, and disseminate harmful information that violates public order, good morals, and socialist core values, including pornographic, violent, discriminatory, or incendiary content. Violators will have their authorization automatically terminated and shall bear all legal liabilities arising therefrom. The right of final interpretation of this statement belongs to the project owner.

## Citation

If you use these SAEs in your research, please cite:


```bibtex
@misc{qwen_scope,
      title={{Qwen-Scope}: Turning Sparse Features into Development Tools for Large Language Models},
      author={Boyi Deng and Xu Wang and Yaoning Wang and Yu Wan and Yubo Ma and Baosong Yang and Haoran Wei and Jialong Tang and Huan Lin and Ruize Gao and Tianhao Li and Qian Cao and Xuancheng Ren and Xiaodong Deng and An Yang and Fei Huang and Dayiheng Liu and Jingren Zhou},
      year={2026},
      eprint={2605.11887},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.11887},
}
```