Skip to content

[OpenAI Privacy Filter] banded SWA eager attention (O(N*W) instead of O(N^2))#46168

Closed
kiankyars wants to merge 1 commit into
huggingface:mainfrom
kiankyars:openai-privacy-filter-banded-eager
Closed

[OpenAI Privacy Filter] banded SWA eager attention (O(N*W) instead of O(N^2))#46168
kiankyars wants to merge 1 commit into
huggingface:mainfrom
kiankyars:openai-privacy-filter-banded-eager

Conversation

@kiankyars

@kiankyars kiankyars commented May 23, 2026

Copy link
Copy Markdown

Summary

The current eager_attention_forward for OpenAIPrivacyFilter materialises the full [B, num_heads, N, N] score tensor before applying the per-query band mask, so long inputs OOM well below the model's documented 128k position budget (a 100k-token forward needs ~92 TB of mask scores at fp32 vs ~4 GB for the banded path). This PR replaces the kernel with a windowed F.pad + Tensor.unfold implementation that matches the production attention path in OpenAI's Apache-2.0 reference repo — see opf/_model/model.py::sdpa, the bidirectional branch at lines ~431-473. Memory is O(N * window) instead of O(N * N).

Flash-Attention and SDPA paths are unaffected. This is a fix to the attn_implementation="eager" code path (and the default when FA2 isn't available).

What's in the PR

  • src/transformers/models/openai_privacy_filter/modular_openai_privacy_filter.py — rewrite eager_attention_forward:
    • Pad + Tensor.unfold extract K/V windows of width 2*(sliding_window-1)+1 per query (matches HF's existing sliding_window_bidirectional_overlay mask and the FA2 window_size = (sw-1, sw-1) convention).
    • Reshape to [B, N, KV, Q, D] so the einsum stays banded; the intermediate scores tensor is [B, N, KV, Q, W], never [B, H, N, N].
    • Per-key padding is recovered from the additive mask's diagonal (O(B*N)); no full mask materialisation.
    • Sinks (already in natural-log space after the HF checkpoint converter applies * log(2) on load) are concatenated as a single column to scores before softmax and dropped after — bit-identical to the previous behaviour.
    • attn_weights is returned as None because the banded path does not build the full N x N matrix; documented in the model_doc page.
  • src/transformers/models/openai_privacy_filter/modeling_openai_privacy_filter.py — regenerated by utils/modular_model_converter.py (the now-unused repeat_kv import is dropped automatically).
  • tests/models/openai_privacy_filter/test_modeling_openai_privacy_filter.py — adds test_eager_banded_swa_locality: a one-layer config with sliding_window=2, flipping a token at distance 3 (outside band) must leave the target's hidden state unchanged (<1e-5), and a sanity flip at distance 1 (inside band) is verified to move it.
  • docs/source/en/model_doc/openai_privacy_filter.md — adds an "Attention backends" subsection explaining the FA2 vs eager trade-off and the attn_weights=None behaviour of banded eager.

Test plan

All commands run from the branch in a fresh venv.

uv venv .venv && uv pip install -e ".[testing,torch]"
.venv/bin/pytest tests/models/openai_privacy_filter/test_modeling_openai_privacy_filter.py::OpenAIPrivacyFilterModelTest -v

Result (CPU, no real weights):

test_config PASSED
test_eager_banded_swa_locality PASSED
test_model PASSED
test_token_classification_model PASSED

End-to-end check on real openai/privacy-filter weights (CPU, the existing OpenAIPrivacyFilterModelIntegrationTest::test_inference_predicted_token_classification only has a CUDA expectation so it raises ValueError: No matching expectation found for ('cpu', None, None); reproducing its forward manually):

Input:  "My name is Harry Potter and my email is harry.potter@hogwarts.edu."
Output: ['O', 'O', 'O',
         'B-private_person', 'E-private_person',
         'O', 'O', 'O', 'O',
         'B-private_email', 'I-private_email', 'I-private_email', 'I-private_email',
         'I-private_email', 'I-private_email', 'I-private_email', 'I-private_email',
         'E-private_email',
         'O']
MATCH: True

i.e. bit-identical predictions to the previous quadratic implementation. Adding a ("cpu", None, None) entry to the existing Expectations table could be a small follow-up to make the slow integration test runnable off CUDA; not done here to keep this PR focused.

Memory check: a 100k-token field that OOMs under the current eager kernel completes under the banded path with roughly B * N * window * (KV * D) * dtype_bytes ≈ 4 GB of K/V window cache (vs B * num_heads * N * N * 4 B ≈ ~92 TB of fp32 mask scores under the previous implementation).

Notes

  • This fixes the immediate user-visible bug (eager OOM on long inputs at fp32, CPU, MPS, or non-Ampere GPUs). The FA2 path was already correct and unaffected.
  • Attribution to OpenAI's reference is in the docstring; both repos are Apache-2.0.
  • SDPA support is not added in this PR. The eager path is now efficient enough that SDPA is a separate, lower-priority piece of work (and SDPA's fused softmax makes attention-sink handling non-trivial).

AI assistance disclosure (per CLAUDE.md)

This change was authored with Claude Code assistance and reviewed end-to-end by the submitting human (@kiankyars), who is responsible for defending every line of the diff. No separate coordination issue was filed before this PR because the bug (eager OOM well below the documented 128k context budget) is concrete, the fix is reference-faithful to OpenAI's open-source implementation.

🤖 Generated with Claude Code

… O(N^2))

`eager_attention_forward` materialised the full N x N attention matrix
before applying the per-query band mask, so long inputs OOM well below
the model's documented 128k context budget (a 100k-token forward needs
~92 TB of mask scores at fp32 vs ~4 GB for the banded path).

Port the unfold-based banded attention from OpenAI's Apache-2.0
reference implementation (`opf/_model/model.py::sdpa`, the bidirectional
branch) so the kernel is `O(N * window)` instead of `O(N * N)`:

* `F.pad` + `Tensor.unfold` extract K/V windows of width
  `2*(sliding_window-1)+1` per query (matches HF's existing
  `sliding_window_bidirectional_overlay` mask and the flash-attention
  `window_size = (sw-1, sw-1)` convention).
* Reshape to `[B, N, KV, Q, D]` so the einsum stays banded; the
  intermediate scores tensor is `[B, N, KV, Q, W]`, never `[B, H, N, N]`.
* Per-key padding is recovered from the additive mask's diagonal
  (`O(B*N)`); no full mask materialisation.
* Sinks (already in natural-log space after the HF checkpoint converter
  applies `* log(2)` on load) are concatenated as a single column to
  scores before softmax and dropped after, matching the previous
  behaviour exactly.
* `attn_weights` is now `None` because the banded path does not build
  the full `N x N` matrix; documented in the model_doc page.

Tests:
* New `test_eager_banded_swa_locality`: one-layer config with
  `sliding_window=2`, flipping a token at distance 3 (outside band) is
  verified to leave the target's hidden state unchanged (<1e-5), and a
  sanity flip at distance 1 (inside band) is verified to move it.
* `test_model`, `test_token_classification_model`, `test_config` pass
  unchanged.
* End-to-end forward on real `openai/privacy-filter` weights with
  `"My name is Harry Potter and my email is harry.potter@hogwarts.edu."`
  produces the exact CUDA-expected token-class sequence
  (`B/E-private_person` at 3-4, `B/I*7/E-private_email` at 9-17), i.e.
  bit-identical predictions to the previous quadratic implementation.

AI assistance disclosure (per `CLAUDE.md`): this change was authored
with Claude Code assistance and reviewed end-to-end by the human
submitter, who is responsible for defending every line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: openai_privacy_filter

@github-actions

Copy link
Copy Markdown
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46168&sha=b90184

@vasqu vasqu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do this way on purpose to utilize the fully materialized masks as we prefer the simplcity. For performance, we never suggest to use eager but SDPA or FA. Eager is only useful to get the attention weights, but nothing else

@kiankyars

Copy link
Copy Markdown
Author

We do this way on purpose to utilize the fully materialized masks as we prefer the simplcity. For performance, we never suggest to use eager but SDPA or FA. Eager is only useful to get the attention weights, but nothing else

OK, thanks for your clarification Anton and for reviewing this! Will close now.

@kiankyars kiankyars closed this May 27, 2026
@kiankyars

Copy link
Copy Markdown
Author

@vasqu what's the recommended way of using this model then? With FA2 or SDPA, do you just monkey patch it?

@vasqu

vasqu commented May 28, 2026

Copy link
Copy Markdown
Collaborator

No monkey patching needed, I would suggest FA as it won't materialize the masks. For training you could use flex attention (with proper preparation of the masks to be done in the data collator or similar).

SDPA sadly doesnt support attention sinks so it cant be used

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants