[OpenAI Privacy Filter] banded SWA eager attention (O(N*W) instead of O(N^2)) by kiankyars · Pull Request #46168 · huggingface/transformers

kiankyars · 2026-05-23T05:32:32Z

Summary

The current eager_attention_forward for OpenAIPrivacyFilter materialises the full [B, num_heads, N, N] score tensor before applying the per-query band mask, so long inputs OOM well below the model's documented 128k position budget (a 100k-token forward needs ~92 TB of mask scores at fp32 vs ~4 GB for the banded path). This PR replaces the kernel with a windowed F.pad + Tensor.unfold implementation that matches the production attention path in OpenAI's Apache-2.0 reference repo — see opf/_model/model.py::sdpa, the bidirectional branch at lines ~431-473. Memory is O(N * window) instead of O(N * N).

Flash-Attention and SDPA paths are unaffected. This is a fix to the attn_implementation="eager" code path (and the default when FA2 isn't available).

What's in the PR

src/transformers/models/openai_privacy_filter/modular_openai_privacy_filter.py — rewrite eager_attention_forward:
- Pad + Tensor.unfold extract K/V windows of width 2*(sliding_window-1)+1 per query (matches HF's existing sliding_window_bidirectional_overlay mask and the FA2 window_size = (sw-1, sw-1) convention).
- Reshape to [B, N, KV, Q, D] so the einsum stays banded; the intermediate scores tensor is [B, N, KV, Q, W], never [B, H, N, N].
- Per-key padding is recovered from the additive mask's diagonal (O(B*N)); no full mask materialisation.
- Sinks (already in natural-log space after the HF checkpoint converter applies * log(2) on load) are concatenated as a single column to scores before softmax and dropped after — bit-identical to the previous behaviour.
- attn_weights is returned as None because the banded path does not build the full N x N matrix; documented in the model_doc page.
src/transformers/models/openai_privacy_filter/modeling_openai_privacy_filter.py — regenerated by utils/modular_model_converter.py (the now-unused repeat_kv import is dropped automatically).
tests/models/openai_privacy_filter/test_modeling_openai_privacy_filter.py — adds test_eager_banded_swa_locality: a one-layer config with sliding_window=2, flipping a token at distance 3 (outside band) must leave the target's hidden state unchanged (<1e-5), and a sanity flip at distance 1 (inside band) is verified to move it.
docs/source/en/model_doc/openai_privacy_filter.md — adds an "Attention backends" subsection explaining the FA2 vs eager trade-off and the attn_weights=None behaviour of banded eager.

Test plan

All commands run from the branch in a fresh venv.

uv venv .venv && uv pip install -e ".[testing,torch]"
.venv/bin/pytest tests/models/openai_privacy_filter/test_modeling_openai_privacy_filter.py::OpenAIPrivacyFilterModelTest -v

Result (CPU, no real weights):

test_config PASSED
test_eager_banded_swa_locality PASSED
test_model PASSED
test_token_classification_model PASSED

End-to-end check on real openai/privacy-filter weights (CPU, the existing OpenAIPrivacyFilterModelIntegrationTest::test_inference_predicted_token_classification only has a CUDA expectation so it raises ValueError: No matching expectation found for ('cpu', None, None); reproducing its forward manually):

Input:  "My name is Harry Potter and my email is harry.potter@hogwarts.edu."
Output: ['O', 'O', 'O',
         'B-private_person', 'E-private_person',
         'O', 'O', 'O', 'O',
         'B-private_email', 'I-private_email', 'I-private_email', 'I-private_email',
         'I-private_email', 'I-private_email', 'I-private_email', 'I-private_email',
         'E-private_email',
         'O']
MATCH: True

i.e. bit-identical predictions to the previous quadratic implementation. Adding a ("cpu", None, None) entry to the existing Expectations table could be a small follow-up to make the slow integration test runnable off CUDA; not done here to keep this PR focused.

Memory check: a 100k-token field that OOMs under the current eager kernel completes under the banded path with roughly B * N * window * (KV * D) * dtype_bytes ≈ 4 GB of K/V window cache (vs B * num_heads * N * N * 4 B ≈ ~92 TB of fp32 mask scores under the previous implementation).

Notes

This fixes the immediate user-visible bug (eager OOM on long inputs at fp32, CPU, MPS, or non-Ampere GPUs). The FA2 path was already correct and unaffected.
Attribution to OpenAI's reference is in the docstring; both repos are Apache-2.0.
SDPA support is not added in this PR. The eager path is now efficient enough that SDPA is a separate, lower-priority piece of work (and SDPA's fused softmax makes attention-sink handling non-trivial).

AI assistance disclosure (per CLAUDE.md)

This change was authored with Claude Code assistance and reviewed end-to-end by the submitting human (@kiankyars), who is responsible for defending every line of the diff. No separate coordination issue was filed before this PR because the bug (eager OOM well below the documented 128k context budget) is concrete, the fix is reference-faithful to OpenAI's open-source implementation.

🤖 Generated with Claude Code

… O(N^2)) `eager_attention_forward` materialised the full N x N attention matrix before applying the per-query band mask, so long inputs OOM well below the model's documented 128k context budget (a 100k-token forward needs ~92 TB of mask scores at fp32 vs ~4 GB for the banded path). Port the unfold-based banded attention from OpenAI's Apache-2.0 reference implementation (`opf/_model/model.py::sdpa`, the bidirectional branch) so the kernel is `O(N * window)` instead of `O(N * N)`: * `F.pad` + `Tensor.unfold` extract K/V windows of width `2*(sliding_window-1)+1` per query (matches HF's existing `sliding_window_bidirectional_overlay` mask and the flash-attention `window_size = (sw-1, sw-1)` convention). * Reshape to `[B, N, KV, Q, D]` so the einsum stays banded; the intermediate scores tensor is `[B, N, KV, Q, W]`, never `[B, H, N, N]`. * Per-key padding is recovered from the additive mask's diagonal (`O(B*N)`); no full mask materialisation. * Sinks (already in natural-log space after the HF checkpoint converter applies `* log(2)` on load) are concatenated as a single column to scores before softmax and dropped after, matching the previous behaviour exactly. * `attn_weights` is now `None` because the banded path does not build the full `N x N` matrix; documented in the model_doc page. Tests: * New `test_eager_banded_swa_locality`: one-layer config with `sliding_window=2`, flipping a token at distance 3 (outside band) is verified to leave the target's hidden state unchanged (<1e-5), and a sanity flip at distance 1 (inside band) is verified to move it. * `test_model`, `test_token_classification_model`, `test_config` pass unchanged. * End-to-end forward on real `openai/privacy-filter` weights with `"My name is Harry Potter and my email is harry.potter@hogwarts.edu."` produces the exact CUDA-expected token-class sequence (`B/E-private_person` at 3-4, `B/I*7/E-private_email` at 9-17), i.e. bit-identical predictions to the previous quadratic implementation. AI assistance disclosure (per `CLAUDE.md`): this change was authored with Claude Code assistance and reviewed end-to-end by the human submitter, who is responsible for defending every line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-23T05:33:45Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: openai_privacy_filter

github-actions · 2026-05-23T05:38:05Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46168&sha=b90184

vasqu

We do this way on purpose to utilize the fully materialized masks as we prefer the simplcity. For performance, we never suggest to use eager but SDPA or FA. Eager is only useful to get the attention weights, but nothing else

kiankyars · 2026-05-27T18:22:53Z

We do this way on purpose to utilize the fully materialized masks as we prefer the simplcity. For performance, we never suggest to use eager but SDPA or FA. Eager is only useful to get the attention weights, but nothing else

OK, thanks for your clarification Anton and for reviewing this! Will close now.

kiankyars · 2026-05-27T20:47:58Z

@vasqu what's the recommended way of using this model then? With FA2 or SDPA, do you just monkey patch it?

vasqu · 2026-05-28T12:33:44Z

No monkey patching needed, I would suggest FA as it won't materialize the masks. For training you could use flex attention (with proper preparation of the masks to be done in the data collator or similar).

SDPA sadly doesnt support attention sinks so it cant be used

vasqu reviewed May 26, 2026

View reviewed changes

kiankyars closed this May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[OpenAI Privacy Filter] banded SWA eager attention (O(N*W) instead of O(N^2))#46168

[OpenAI Privacy Filter] banded SWA eager attention (O(N*W) instead of O(N^2))#46168
kiankyars wants to merge 1 commit into
huggingface:mainfrom
kiankyars:openai-privacy-filter-banded-eager

kiankyars commented May 23, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

vasqu left a comment

Uh oh!

kiankyars commented May 27, 2026

Uh oh!

kiankyars commented May 27, 2026

Uh oh!

vasqu commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

kiankyars commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in the PR

Test plan

Notes

AI assistance disclosure (per CLAUDE.md)

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

vasqu left a comment

Choose a reason for hiding this comment

Uh oh!

kiankyars commented May 27, 2026

Uh oh!

kiankyars commented May 27, 2026

Uh oh!

vasqu commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kiankyars commented May 23, 2026 •

edited

Loading