`docs/models/lorsa.md`
# Low-Rank Sparse Attention (Lorsa)
Low-Rank Sparse Attention (Lorsa) is a specialized sparse dictionary architecture designed to decompose attention layers into interpretable sparse components. Unlike standard SAEs that treat attention as a black box, Lorsa explicitly models the query-key-value structure while maintaining sparsity and interpretability.

Given an input sequence \(X \in \mathbb{R}^{n \times d}\), Lorsa has:

- \(n_{\text{qk\_heads}}\) QK heads, each with projections \(W_q^h, W_k^h \in \mathbb{R}^{d \times d_{\text{qk\_head}}}\)
- \(n_{\text{ov\_heads}}\) rank-1 OV heads, each with projections \(\mathbf{w}_v^i \in \mathbb{R}^{d \times 1}\), \(\mathbf{w}_o^i \in \mathbb{R}^{1 \times d}\)

Every group of \(n_{\text{ov\_heads}} / n_{\text{qk\_heads}}\) consecutive OV heads shares the same QK head. Denote the QK head assigned to OV head \(i\) as \(h(i)\). The forward pass for each OV head \(i\) is:
\[
\begin{aligned}
Q^{h(i)} &= X W_q^{h(i)}, \quad K^{h(i)} = X W_k^{h(i)} \\
A^{h(i)} &= \operatorname{softmax}\!\left(\operatorname{mask}\!\left(\frac{Q^{h(i)} \left(K^{h(i)}\right)^\top}{\sqrt{d_{\text{qk\_head}}}}\right)\right) \\
\mathbf{v}^i &= X \mathbf{w}_v^i, \qquad \mathbf{z}^i = A^{h(i)} \mathbf{v}^i \\
\text{out}^i &= \mathbf{z}^i \, \mathbf{w}_o^i
\end{aligned}
\]

where \(A^{h(i)}\) is the causally masked attention pattern shared by all OV heads in the group, \(\mathbf{z}^i \in \mathbb{R}^{n \times 1}\) is the activation of OV head \(i\) at each position, and \(\text{out}^i \in \mathbb{R}^{n \times d}\) is its rank-1 contribution to the output. A sparsity-inducing activation (e.g., TopK) is applied across the OV head activations, and the Lorsa output is the sum of the active heads' contributions.
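To make the shared-QK / rank-1-OV structure concrete, here is a minimal PyTorch sketch of the per-head forward pass described above. It is illustrative only, not the `lm_saes` implementation; all function and variable names are assumptions chosen to mirror the notation above.

```python
import torch
import torch.nn.functional as F

def lorsa_forward(X, W_q, W_k, w_v, w_o):
    """Illustrative Lorsa forward pass (not the lm_saes implementation).

    X:   (n, d)                     input sequence of activations
    W_q: (n_qk_heads, d, d_qk_head) query projections, one per QK head
    W_k: (n_qk_heads, d, d_qk_head) key projections, one per QK head
    w_v: (n_ov_heads, d)            rank-1 value directions
    w_o: (n_ov_heads, d)            rank-1 output directions
    """
    n, d = X.shape
    n_qk_heads, _, d_qk_head = W_q.shape
    n_ov_heads = w_v.shape[0]
    ov_group_size = n_ov_heads // n_qk_heads  # consecutive OV heads per QK head

    # One causally masked attention pattern per QK head, shared by its OV group.
    Q = torch.einsum("nd,hdq->hnq", X, W_q)
    K = torch.einsum("nd,hdq->hnq", X, W_k)
    scores = Q @ K.transpose(-1, -2) / d_qk_head ** 0.5      # (n_qk_heads, n, n)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    A = F.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)

    acts = torch.zeros(n, n_ov_heads)   # per-position OV head activations z^i
    out = torch.zeros(n, d)             # summed rank-1 contributions
    for i in range(n_ov_heads):
        h = i // ov_group_size          # QK head h(i) assigned to OV head i
        v_i = X @ w_v[i]                # (n,)  scalar value per position
        z_i = A[h] @ v_i                # (n,)  attended activation of OV head i
        acts[:, i] = z_i
        out += z_i[:, None] * w_o[i]    # rank-1 contribution z^i * w_o^i
    # In the actual architecture, a sparsity step (e.g., TopK over acts) would
    # zero most OV heads before their contributions are summed.
    return acts, out
```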
The architecture was introduced in [*Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition*](https://openreview.net/forum?id=9A2etpDFIB) (ICLR 2026), which proposes using sparse dictionary learning to address *attention superposition*—the challenge of disentangling attention-mediated interactions between features at different token positions. For detailed architectural specifications and mathematical formulations, please refer to this paper.
#### Attention Dimensions
We recommend setting `d_qk_head` to match the target model's head dimension. `n_qk_heads` can be chosen freely: a natural starting point is `n_qk_heads = n_heads * expansion_factor` (where `n_heads` is the number of attention heads in the target attention layer), though a smaller value (no less than `n_heads`) is also reasonable if you want to reduce Lorsa's parameter count.

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
|`n_qk_heads`|`int`| Number of QK heads.| Required |
|`d_qk_head`|`int`| Dimension per QK head.| Required |
|`n_ctx`|`int`| Maximum context length.| Required |

!!! note "Number of OV Heads"
    The number of OV heads is automatically computed as: `n_ov_heads = expansion_factor * d_model` (same as `d_sae`).
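For orientation, here is a hedged sketch of how these dimension choices might look for a target layer with `d_model=768`, 12 attention heads, and head dimension 64. The exact `LorsaConfig` signature may vary across `lm_saes` versions; the import path and all numeric values are assumptions, not recommendations.

```python
from lm_saes import LorsaConfig  # assumed import path

# Illustrative dimension choices (assumed target: d_model=768, n_heads=12, head_dim=64).
lorsa_config = LorsaConfig(
    d_model=768,
    expansion_factor=32,   # n_ov_heads = 32 * 768 = 24576 (same as d_sae)
    n_qk_heads=12 * 32,    # n_heads * expansion_factor = 384
    d_qk_head=64,          # match the target model's head dimension
    n_ctx=256,             # maximum sequence length seen during training
)
# ov_group_size = n_ov_heads // n_qk_heads = 24576 // 384 = 64
```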
#### Positional Embeddings
It is strongly recommended to copy the positional embedding parameters directly from the target model's implementation. Incorrect settings will make it harder for Lorsa to learn the target attention patterns.

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
|`positional_embedding_type`|`str`| Type of positional embedding: `"rotary"` or `"none"`|`"rotary"`|
|`rotary_dim`|`int`| Dimension of rotary embeddings (typically `d_qk_head`) | Required |
|`rotary_base`|`int`| Base frequency for rotary embeddings |`10000`|
|`rotary_adjacent_pairs`|`bool`| Whether to apply RoPE on adjacent pairs |`True`|
|`rotary_scale`|`int`| Scaling factor of the head dimension for rotary embeddings |`1`|
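As a hedged illustration, copying the RoPE settings from a rotary-position target model usually amounts to something like the following (continuing the assumed example above; the values here are assumptions, so read them from your target model's configuration rather than reusing them):

```python
# Illustrative rotary settings (assumed values; copy the real ones from the target model).
lorsa_config = LorsaConfig(
    d_model=768,
    expansion_factor=32,
    n_qk_heads=384,
    d_qk_head=64,
    n_ctx=256,
    positional_embedding_type="rotary",
    rotary_dim=64,               # typically equal to d_qk_head
    rotary_base=10000,
    rotary_adjacent_pairs=True,
    rotary_scale=1,
)
```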
#### NTK-Aware RoPE (only for Llama 3.1 and 3.2 herd models)
For Lorsa, initialization from the original model's attention weights is highly recommended:
```python
InitializerConfig(
    grid_search_init_norm=True,
    initialize_lorsa_with_mhsa=True,           # Initialize Q, K from attention weights
    initialize_W_D_with_active_subspace=True,  # Initialize V, O from attention weights
    model_layer=13,                            # Layer to extract attention weights from
)
```

This initialization helps Lorsa start from a good approximation of the attention computation.
### Important Training Considerations
1. **Sequence batching**: Since Lorsa operates on sequences, `batch_size` in `ActivationFactoryConfig` represents the number of sequences (not tokens). The effective token batch size is `batch_size * n_ctx`.
`docs/models/sae.md`
# Sparse Autoencoder (SAE)
Sparse Autoencoders (SAEs) are the foundational architecture for learning interpretable features from language model activations. They decompose neural network activations into sparse, interpretable features that help address the superposition problem.
Given a model activation vector \(\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}\), an SAE first **encodes** it into a high-dimensional sparse latent representation, then **decodes** it back to reconstruct the original activation:

\[
\begin{aligned}
\mathbf{z} &= \sigma(W_E \mathbf{x} + \mathbf{b}_E) \\
\hat{\mathbf{x}} &= W_D \mathbf{z} + \mathbf{b}_D
\end{aligned}
\]

where \(W_E \in \mathbb{R}^{d_{\text{SAE}} \times d_{\text{model}}}\) and \(W_D \in \mathbb{R}^{d_{\text{model}} \times d_{\text{SAE}}}\) are the encoder and decoder weight matrices, \(\mathbf{b}_E, \mathbf{b}_D\) are bias terms, and \(\sigma(\cdot)\) is a sparsity-inducing activation function (e.g., ReLU, TopK). The model is trained to minimize the reconstruction loss \(\|\mathbf{x} - \hat{\mathbf{x}}\|^2\) while keeping \(\mathbf{z}\) sparse, encouraging each dimension of \(\mathbf{z}\) to correspond to a monosemantic feature.
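To make the encode/decode step concrete, here is a minimal PyTorch sketch of a TopK SAE forward pass. It is illustrative only (not the `lm_saes` implementation); the module name, dimensions, and training snippet are assumptions chosen to mirror the equations above.

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Minimal illustrative SAE: encode -> sparsify -> decode."""

    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.W_E = nn.Linear(d_model, d_sae)   # encoder: W_E x + b_E
        self.W_D = nn.Linear(d_sae, d_model)   # decoder: W_D z + b_D
        self.k = k                             # number of active features per token

    def forward(self, x: torch.Tensor):
        pre = self.W_E(x)                      # (..., d_sae) pre-activations
        # TopK sparsity: keep the k largest pre-activations, zero the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        z = torch.zeros_like(pre).scatter(-1, topk.indices, topk.values.relu())
        x_hat = self.W_D(z)                    # reconstruction
        return z, x_hat

# Usage: minimize the reconstruction loss on cached model activations.
sae = TinySAE(d_model=768, d_sae=768 * 32, k=64)
x = torch.randn(4, 768)                        # stand-in for model activations
z, x_hat = sae(x)
loss = (x - x_hat).pow(2).mean()
```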
The architecture was introduced in foundational works including [*Sparse Autoencoders Find Highly Interpretable Features in Language Models*](https://arxiv.org/abs/2309.08600) and [*Towards Monosemanticity: Decomposing Language Models With Dictionary Learning*](https://transformer-circuits.pub/2023/monosemantic-features). For detailed architectural specifications and mathematical formulations, please refer to these papers.
`docs/models/transcoder.md`
## Configuration
8
8
9
-
Transcoders use the same `SAEConfig`class as standard SAEs. All sparse dictionary models inherit common parameters from `BaseSAEConfig`. See the [Common Configuration Parameters](overview.md#common-configuration-parameters) section for the full list of inherited parameters.
9
+
Transcoders use the same `SAEConfig`and `InitializerConfig`as standard SAEs. See the [SAE configuration guide](sae.md#configuration)for the full parameter reference.
10
10
11
-
### Transcoder-Specific Parameters
11
+
The only essential difference is that `hook_point_in` and `hook_point_out` must point to **different** locations—typically the input and output of the MLP sublayer you want to decompose:
12
12
13
13
```python
transcoder_config = SAEConfig(
    hook_point_in="blocks.6.ln2.hook_normalized",  # before MLP
    hook_point_out="blocks.6.hook_mlp_out",        # after MLP
    ...
)
```
### Initialization Strategy
Proper initialization is crucial for training high-quality transcoders. We recommend the following configuration: