Commit 275f841

Merge pull request #185 from OpenMOSS/dev
Dev
2 parents: f1b0642 + 6e0a064

25 files changed

Lines changed: 1444 additions & 1205 deletions

.github/workflows/docs.yml

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
 name: docs
 on:
   push:
-    branches:
-      - main
+    tags:
+      - "v*" # Triggers on version tags

 permissions:
   contents: write

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -168,6 +168,9 @@ cython_debug/
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/

+# VS Code
+.vscode/
+
 ### Python Patch ###
 # Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
 poetry.toml

docs/index.md

Lines changed: 18 additions & 4 deletions
@@ -51,10 +51,24 @@ This library provides:

 Load any Sparse Autoencoder or other sparse dictionaries in `Language-Model-SAEs` or SAELens format.

-```python
-# Load Gemma Scope 2 SAE
-sae = AbstractSparseAutoEncoder.from_pretrained("gemma-scope-2-1b-pt-res-all:layer_12_width_16k_l0_small")
-```
+=== "Language-Model-SAEs"
+
+    ```python
+    # Load Llama Scope 2 Transcoder
+    sae = AbstractSparseAutoEncoder.from_pretrained(
+        "OpenMOSS-Team/Llama-Scope-2-Qwen3-1.7B:transcoder/8x/k128/layer12_transcoder_8x_k128",
+        fold_activation_scale=False
+    )
+    ```
+
+=== "SAELens"
+
+    ```python
+    # Load Gemma Scope 2 SAE
+    sae = AbstractSparseAutoEncoder.from_pretrained(
+        "gemma-scope-2-1b-pt-res-all:layer_12_width_16k_l0_small",
+    )
+    ```

 ### Training a Sparse Autoencoder

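For a quick sanity check after loading, the returned dictionary can be run through an encode/decode round trip. A minimal sketch, assuming the loaded object exposes `cfg.d_model` and `cfg.dtype` (the `encode`/`decode` calls match the `examples/load_hf_model.py` file added later in this commit):

```python
import torch

from lm_saes.abstract_sae import AbstractSparseAutoEncoder

sae = AbstractSparseAutoEncoder.from_pretrained(
    "gemma-scope-2-1b-pt-res-all:layer_12_width_16k_l0_small",
)

# Random activations stand in for real model activations; this only checks shapes.
x = torch.randn(4, sae.cfg.d_model, dtype=sae.cfg.dtype)  # cfg.d_model assumed
feature_acts = sae.encode(x)
reconstruction = sae.decode(feature_acts)
print(feature_acts.shape, reconstruction.shape)
```
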
docs/models/lorsa.md

Lines changed: 59 additions & 52 deletions
@@ -1,6 +1,33 @@
 # Low-Rank Sparse Attention (Lorsa)

-Low-Rank Sparse Attention (Lorsa) is a specialized sparse dictionary architecture designed to decompose attention layers into interpretable sparse components. Unlike standard SAEs that treat attention as a black box, Lorsa explicitly models the query-key-value structure while maintaining sparsity and interpretability. Lorsa decomposes attention computations into interpretable sparse features that preserve positional information through explicit query-key attention mechanisms. This allows for fine-grained analysis of attention patterns and understanding how models route information based on both content and position.
+Low-Rank Sparse Attention (Lorsa) is a specialized sparse dictionary architecture designed to decompose attention layers into interpretable sparse components. Unlike standard SAEs that treat attention as a black box, Lorsa explicitly models the query-key-value structure while maintaining sparsity and interpretability.
+
+Given an input sequence \(X \in \mathbb{R}^{n \times d}\), Lorsa has:
+
+- \(n_{\text{qk\_heads}}\) QK heads, each with projections \(W_q^h, W_k^h \in \mathbb{R}^{d \times d_{\text{qk\_head}}}\)
+- \(n_{\text{ov\_heads}}\) rank-1 OV heads, each with projections \(\mathbf{w}_v^i \in \mathbb{R}^{d \times 1}\), \(\mathbf{w}_o^i \in \mathbb{R}^{1 \times d}\)
+
+Every group of \(n_{\text{ov\_heads}} / n_{\text{qk\_heads}}\) consecutive OV heads shares the same QK head. Denote the QK head assigned to OV head \(i\) as \(h(i)\). The forward pass for each OV head \(i\) is:
+
+\[
+\begin{aligned}
+Q^{h(i)} &= X W_q^{h(i)}, \quad K^{h(i)} = X W_k^{h(i)} \\
+A^{h(i)} &= \operatorname{softmax}\!\left(\frac{Q^{h(i)} {(K^{h(i)})}^\top}{\sqrt{d_{\text{qk\_head}}}}\right) \in \mathbb{R}^{n \times n} \\
+\tilde{\mathbf{z}}^i &= A^{h(i)}\, (X \mathbf{w}_v^i) \in \mathbb{R}^{n \times 1}
+\end{aligned}
+\]
+
+The pre-activations across all OV heads are then passed through a sparsity-inducing activation function \(\sigma(\cdot)\):
+
+\[
+[\mathbf{z}^0, \ldots, \mathbf{z}^{n_{\text{ov\_heads}}-1}] = \sigma([\tilde{\mathbf{z}}^0, \ldots, \tilde{\mathbf{z}}^{n_{\text{ov\_heads}}-1}])
+\]
+
+The final output sums the contributions of all OV heads weighted by their activations:
+
+\[
+\hat{Y} = \sum_{i=0}^{n_{\text{ov\_heads}}-1} \mathbf{z}^i\, (\mathbf{w}_o^i)^\top \in \mathbb{R}^{n \times d}
+\]

 The architecture was introduced in [*Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition*](https://openreview.net/forum?id=9A2etpDFIB) (ICLR 2026), which proposes using sparse dictionary learning to address *attention superposition*—the challenge of disentangling attention-mediated interactions between features at different token positions. For detailed architectural specifications and mathematical formulations, please refer to this paper.

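To make the shapes concrete, here is a minimal PyTorch sketch of the forward pass added above. The tensor names, dimensions, and the TopK activation are illustrative assumptions for this note, not the library's implementation; causal masking is omitted for brevity.

```python
import torch

n, d = 8, 64                      # sequence length, model dimension
n_qk_heads, d_qk_head = 4, 16
n_ov_heads = 256                  # a multiple of n_qk_heads
group = n_ov_heads // n_qk_heads  # consecutive OV heads sharing one QK head
top_k = 32

X = torch.randn(n, d)
W_q = torch.randn(n_qk_heads, d, d_qk_head)
W_k = torch.randn(n_qk_heads, d, d_qk_head)
w_v = torch.randn(n_ov_heads, d)  # rank-1 value directions
w_o = torch.randn(n_ov_heads, d)  # rank-1 output directions

# One attention pattern per QK head
Q = torch.einsum("nd,hdk->hnk", X, W_q)
K = torch.einsum("nd,hdk->hnk", X, W_k)
A = torch.softmax(Q @ K.transpose(-1, -2) / d_qk_head**0.5, dim=-1)  # (h, n, n)

# Each OV head i reuses the pattern of its assigned QK head h(i)
A_per_ov = A.repeat_interleave(group, dim=0)     # (n_ov_heads, n, n)
v = X @ w_v.T                                    # (n, n_ov_heads)
z_pre = torch.einsum("inm,mi->ni", A_per_ov, v)  # pre-activations (n, n_ov_heads)

# Sparsity: keep the top-k pre-activations per token
vals, idx = z_pre.topk(top_k, dim=-1)
z = torch.zeros_like(z_pre).scatter_(-1, idx, vals)

Y_hat = z @ w_o                                  # (n, d), sum of rank-1 OV contributions
print(Y_hat.shape)
```
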
@@ -63,26 +90,30 @@ lorsa_config = LorsaConfig(

 #### Attention Dimensions

+We recommend setting `d_qk_head` to match the target model's head dimension. `n_qk_heads` can be freely chosen: a natural starting point is `n_qk_heads = n_heads * expansion_factor` (where `n_heads` is the number of attention heads in the target attention layer), though a smaller value (no smaller than `n_heads`) is also reasonable if you want to reduce Lorsa's parameter count.
+
 | Parameter | Type | Description | Default |
 |-----------|------|-------------|---------|
-| `n_qk_heads` | `int` | Number of query-key attention heads | Required |
-| `d_qk_head` | `int` | Dimension per query-key head | Required |
-| `n_ctx` | `int` | Maximum context length / sequence length | Required |
+| `n_qk_heads` | `int` | Number of QK heads. | Required |
+| `d_qk_head` | `int` | Dimension per QK head. | Required |
+| `n_ctx` | `int` | Maximum context length. | Required |

-!!! note "Number of Value Heads"
-    The number of value heads (output features) is automatically computed as: `n_ov_heads = expansion_factor * d_model` (same as `d_sae`). The `ov_group_size` is `n_ov_heads // n_qk_heads`.
+!!! note "Number of OV Heads"
+    The number of OV heads is automatically computed as: `n_ov_heads = expansion_factor * d_model` (same as `d_sae`).

 #### Positional Embeddings

+It is strongly recommended to copy the positional embedding parameters directly from the target model's implementation. Incorrect settings will make it harder for Lorsa to learn the target attention patterns.
+
 | Parameter | Type | Description | Default |
 |-----------|------|-------------|---------|
 | `positional_embedding_type` | `str` | Type of positional embedding: `"rotary"` or `"none"` | `"rotary"` |
 | `rotary_dim` | `int` | Dimension of rotary embeddings (typically `d_qk_head`) | Required |
 | `rotary_base` | `int` | Base for rotary embeddings frequency | `10000` |
-| `rotary_adjacent_pairs` | `bool` | Whether to apply RoPE on adjacent pairs vs. all dimensions | `True` |
-| `rotary_scale` | `int` | Scaling factor for rotary embeddings | `1` |
+| `rotary_adjacent_pairs` | `bool` | Whether to apply RoPE on adjacent pairs | `True` |
+| `rotary_scale` | `int` | Scaling factor of the head dimension for rotary embeddings | `1` |

-#### NTK-Aware RoPE (for Llama 3.1 and 3.2 herd models)
+#### NTK-Aware RoPE (only for Llama 3.1 and 3.2 herd models)

 | Parameter | Type | Description | Default |
 |-----------|------|-------------|---------|
@@ -92,15 +123,30 @@ lorsa_config = LorsaConfig(
 | `NTK_by_parts_high_freq_factor` | `float` | High-frequency component scaling factor | `1.0` |
 | `old_context_len` | `int` | Original context length before scaling | `2048` |

-#### Attention Settings
+#### Attention Computation Details

 | Parameter | Type | Description | Default |
 |-----------|------|-------------|---------|
-| `attn_scale` | `float \| None` | Attention scaling factor. If `None`, uses $\frac{1}{\sqrt{d_{\text{qk\_head}}}}$ | `None` |
+| `attn_scale` | `float | None` | Attention scaling factor. If `None`, uses $\frac{1}{\sqrt{d_{\text{qk\_head}}}}$ | `None` |
 | `use_post_qk_ln` | `bool` | Apply LayerNorm/RMSNorm after computing Q and K projections | `False` |
-| `normalization_type` | `str \| None` | Normalization type: `"LN"` (LayerNorm) or `"RMS"` (RMSNorm). Only used when `use_post_qk_ln=True` | `None` |
+| `normalization_type` | `str | None` | Normalization type: `"LN"` (LayerNorm) or `"RMS"` (RMSNorm). Only used when `use_post_qk_ln=True` | `None` |
 | `eps` | `float` | Epsilon for numerical stability in normalization | `1e-6` |

+### Initialization Strategy
+
+For Lorsa, initialization from the original model's attention weights is highly recommended:
+
+```python
+InitializerConfig(
+    grid_search_init_norm=True,
+    initialize_lorsa_with_mhsa=True,  # Initialize Q, K from attention weights
+    initialize_W_D_with_active_subspace=True,  # Initialize V, O from attention weights
+    model_layer=13,  # Specify layer to extract attention weights from
+)
+```
+
+This initialization helps Lorsa start from a good approximation of the attention computation.
+
 ## Training

 ### Basic Training Setup
@@ -125,31 +171,7 @@ settings = TrainLorsaSettings(
     sae=LorsaConfig(
         hook_point_in="blocks.13.ln1.hook_normalized",
         hook_point_out="blocks.13.hook_attn_out",
-        d_model=2048,
-        expansion_factor=32,
-
-        # Attention configuration
-        n_qk_heads=16,
-        d_qk_head=128,
-        n_ctx=2048,
-
-        # RoPE configuration
-        positional_embedding_type="rotary",
-        rotary_dim=128,
-        rotary_base=1000000,
-        rotary_adjacent_pairs=False,
-
-        # Sparsity
-        act_fn="topk",
-        top_k=256,
-
-        # Normalization
-        use_post_qk_ln=True,
-        normalization_type="RMS",
-        eps=1e-6,
-
-        dtype=torch.float32,
-        device="cuda",
+        # ... other settings ...
     ),
     initializer=InitializerConfig(
         grid_search_init_norm=True,
@@ -196,21 +218,6 @@ settings = TrainLorsaSettings(
 train_lorsa(settings)
 ```

-### Initialization Strategy
-
-For Lorsa, initialization from the original model's attention weights is highly recommended:
-
-```python
-InitializerConfig(
-    grid_search_init_norm=True,
-    initialize_lorsa_with_mhsa=True,  # Initialize Q, K from attention weights
-    initialize_W_D_with_active_subspace=True,  # Initialize V, O from attention weights
-    model_layer=13,  # Specify layer to extract attention weights from
-)
-```
-
-This initialization helps Lorsa start from a good approximation of the attention computation.
-
 ### Important Training Considerations

 1. **Sequence batching**: Since Lorsa operates on sequences, `batch_size` in `ActivationFactoryConfig` represents the number of sequences (not tokens). The effective token batch size is `batch_size * n_ctx`.

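As a worked example of the dimension recommendations in the lorsa.md changes above (the target-layer sizes are illustrative assumptions, taken from the example config that this commit removes):

```python
# Assumed target attention layer: 16 heads of size 128, d_model = 2048.
n_heads, d_head, d_model = 16, 128, 2048
expansion_factor = 32

d_qk_head = d_head                        # match the model's head dimension
n_qk_heads = n_heads * expansion_factor   # natural starting point: 512
n_ov_heads = expansion_factor * d_model   # computed by the library: 65536
ov_group_size = n_ov_heads // n_qk_heads  # 128 OV heads per QK head
print(d_qk_head, n_qk_heads, n_ov_heads, ov_group_size)
```
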
docs/models/sae.md

Lines changed: 12 additions & 1 deletion
@@ -1,6 +1,17 @@
 # Sparse Autoencoder (SAE)

-Sparse Autoencoders (SAEs) are the foundational architecture for learning interpretable features from language model activations. They decompose neural network activations into sparse, interpretable features that help address the superposition problem. An SAE consists of an encoder that maps model activations to a higher-dimensional latent space and a decoder that reconstructs the original activations. The key innovation is enforcing sparsity through activation functions or regularization, which encourages the model to learn monosemantic features—where each feature represents a single concept.
+Sparse Autoencoders (SAEs) are the foundational architecture for learning interpretable features from language model activations. They decompose neural network activations into sparse, interpretable features that help address the superposition problem.
+
+Given a model activation vector \(\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}\), an SAE first **encodes** it into a high-dimensional sparse latent representation, then **decodes** it back to reconstruct the original activation:
+
+\[
+\begin{aligned}
+\mathbf{z} &= \sigma(W_E \mathbf{x} + \mathbf{b}_E) \in \mathbb{R}^{d_{\text{SAE}}} \\
+\hat{\mathbf{x}} &= W_D \mathbf{z} + \mathbf{b}_D \in \mathbb{R}^{d_{\text{model}}}
+\end{aligned}
+\]
+
+where \(W_E \in \mathbb{R}^{d_{\text{SAE}} \times d_{\text{model}}}\) and \(W_D \in \mathbb{R}^{d_{\text{model}} \times d_{\text{SAE}}}\) are the encoder and decoder weight matrices, \(\mathbf{b}_E, \mathbf{b}_D\) are bias terms, and \(\sigma(\cdot)\) is a sparsity-inducing activation function (e.g., ReLU, TopK). The model is trained to minimize the reconstruction loss \(\|\mathbf{x} - \hat{\mathbf{x}}\|^2\) while keeping \(\mathbf{z}\) sparse, encouraging each dimension of \(\mathbf{z}\) to correspond to a monosemantic feature.

 The architecture was introduced in foundational works including [*Sparse Autoencoders Find Highly Interpretable Features in Language Models*](https://arxiv.org/abs/2309.08600) and [*Towards Monosemanticity: Decomposing Language Models With Dictionary Learning*](https://transformer-circuits.pub/2023/monosemantic-features). For detailed architectural specifications and mathematical formulations, please refer to these papers.

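A minimal sketch of the encode/decode equations added above, using ReLU as the sparsity-inducing activation; the weight initialization and dimensions are illustrative assumptions, not the library's implementation:

```python
import torch

d_model, d_sae = 768, 768 * 32

W_E = torch.randn(d_sae, d_model) * 0.01  # encoder weights
b_E = torch.zeros(d_sae)
W_D = torch.randn(d_model, d_sae) * 0.01  # decoder weights
b_D = torch.zeros(d_model)

x = torch.randn(d_model)           # a model activation vector
z = torch.relu(W_E @ x + b_E)      # sparse latent features, shape (d_sae,)
x_hat = W_D @ z + b_D              # reconstruction, shape (d_model,)
recon_loss = (x - x_hat).pow(2).sum()
print(z.shape, x_hat.shape, recon_loss.item())
```
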
docs/models/transcoder.md

Lines changed: 5 additions & 26 deletions
@@ -6,39 +6,18 @@ Transcoders were introduced in the following papers: [*Automatically Identifying

 ## Configuration

-Transcoders use the same `SAEConfig` class as standard SAEs. All sparse dictionary models inherit common parameters from `BaseSAEConfig`. See the [Common Configuration Parameters](overview.md#common-configuration-parameters) section for the full list of inherited parameters.
+Transcoders use the same `SAEConfig` and `InitializerConfig` as standard SAEs. See the [SAE configuration guide](sae.md#configuration) for the full parameter reference.

-### Transcoder-Specific Parameters
+The only essential difference is that `hook_point_in` and `hook_point_out` must point to **different** locations—typically the input and output of the MLP sublayer you want to decompose:

 ```python
-from lm_saes import SAEConfig
-import torch
-
 transcoder_config = SAEConfig(
-    # Transcoder-specific: different hook points
-    hook_point_in="blocks.6.ln2.hook_normalized",  # Input to MLP
-    hook_point_out="blocks.6.hook_mlp_out",  # Output from MLP
-    use_glu_encoder=False,
-
-    # Common parameters (documented in Sparse Dictionaries overview)
-    d_model=768,
-    expansion_factor=32,
-    act_fn="topk",
-    top_k=64,
-    dtype=torch.float32,
-    device="cuda",
+    hook_point_in="blocks.6.ln2.hook_normalized",  # before MLP
+    hook_point_out="blocks.6.hook_mlp_out",  # after MLP
+    ...
 )
 ```

-| Parameter | Type | Description | Default |
-|-----------|------|-------------|---------|
-| `hook_point_in` | `str` | Hook point before the computational unit (e.g., `blocks.L.ln2.hook_normalized` for MLP input). Must differ from `hook_point_out` for transcoders | Required |
-| `hook_point_out` | `str` | Hook point after the computational unit (e.g., `blocks.L.hook_mlp_out` for MLP output). Must differ from `hook_point_in` for transcoders | Required |
-| `use_glu_encoder` | `bool` | Whether to use a Gated Linear Unit (GLU) in the encoder. GLU can improve expressiveness but increases parameter count | `False` |
-
-!!! important "Transcoder vs SAE"
-    When `hook_point_in != hook_point_out`, the configuration defines a transcoder rather than a standard SAE. This allows the model to learn the transformation between two different points in the network.
-
 ### Initialization Strategy

 Proper initialization is crucial for training high-quality transcoders. We recommend the following configuration:

examples/load_hf_model.py

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+"""
+Load a Transcoder from HuggingFace.
+"""
+
+import torch
+from transformer_lens import HookedTransformer
+
+from lm_saes.abstract_sae import AbstractSparseAutoEncoder
+
+# Load a Llama Scope 2 transcoder for Qwen3-1.7B from HuggingFace
+sae = AbstractSparseAutoEncoder.from_pretrained(
+    "OpenMOSS-Team/Llama-Scope-2-Qwen3-1.7B:transcoder/8x/k128/layer12_transcoder_8x_k128",
+    fold_activation_scale=False,
+).to("cpu")
+
+print(f"Loaded SAE: {sae.cfg}")
+
+# Load Qwen3-1.7B with TransformerLens
+model = HookedTransformer.from_pretrained("Qwen/Qwen3-1.7B")
+model.to("cpu")
+model.eval()
+
+prompt = "The capital of France is"
+tokens = model.to_tokens(prompt)
+_, cache = model.run_with_cache(tokens, names_filter=[sae.cfg.hook_point_in, sae.cfg.hook_point_out])
+x = cache[sae.cfg.hook_point_in]  # transcoder input activations
+label = cache[sae.cfg.hook_point_out]  # transcoder reconstruction target
+
+with torch.no_grad():
+    feature_acts = sae.encode(x)
+    reconstructed = sae.decode(feature_acts)
+
+l0 = (feature_acts > 0).sum(dim=-1).float().mean()
+mse = (label.to(sae.cfg.dtype) - reconstructed).pow(2).mean()
+print(f"Prompt: {prompt}")
+print(f"Average L0: {l0.item():.1f}")
+print(f"Reconstruction MSE: {mse.item():.6f}")

mkdocs.yml

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ nav:
       - Overview: models/overview.md
       - Sparse Autoencoder: models/sae.md
      - Transcoder: models/transcoder.md
-      - Cross Layer Transcoder: models/clt.md
+      # - Cross Layer Transcoder: models/clt.md
       - Low-Rank Sparse Attention: models/lorsa.md
   - Analyze SAEs: analyze-saes.md
   - Distributed Guidelines: distributed-guidelines.md
