[OpenAI Privacy Filter] banded SWA eager attention (O(N*W) instead of O(N^2))#46168
[OpenAI Privacy Filter] banded SWA eager attention (O(N*W) instead of O(N^2))#46168kiankyars wants to merge 1 commit into
Conversation
… O(N^2)) `eager_attention_forward` materialised the full N x N attention matrix before applying the per-query band mask, so long inputs OOM well below the model's documented 128k context budget (a 100k-token forward needs ~92 TB of mask scores at fp32 vs ~4 GB for the banded path). Port the unfold-based banded attention from OpenAI's Apache-2.0 reference implementation (`opf/_model/model.py::sdpa`, the bidirectional branch) so the kernel is `O(N * window)` instead of `O(N * N)`: * `F.pad` + `Tensor.unfold` extract K/V windows of width `2*(sliding_window-1)+1` per query (matches HF's existing `sliding_window_bidirectional_overlay` mask and the flash-attention `window_size = (sw-1, sw-1)` convention). * Reshape to `[B, N, KV, Q, D]` so the einsum stays banded; the intermediate scores tensor is `[B, N, KV, Q, W]`, never `[B, H, N, N]`. * Per-key padding is recovered from the additive mask's diagonal (`O(B*N)`); no full mask materialisation. * Sinks (already in natural-log space after the HF checkpoint converter applies `* log(2)` on load) are concatenated as a single column to scores before softmax and dropped after, matching the previous behaviour exactly. * `attn_weights` is now `None` because the banded path does not build the full `N x N` matrix; documented in the model_doc page. Tests: * New `test_eager_banded_swa_locality`: one-layer config with `sliding_window=2`, flipping a token at distance 3 (outside band) is verified to leave the target's hidden state unchanged (<1e-5), and a sanity flip at distance 1 (inside band) is verified to move it. * `test_model`, `test_token_classification_model`, `test_config` pass unchanged. * End-to-end forward on real `openai/privacy-filter` weights with `"My name is Harry Potter and my email is harry.potter@hogwarts.edu."` produces the exact CUDA-expected token-class sequence (`B/E-private_person` at 3-4, `B/I*7/E-private_email` at 9-17), i.e. bit-identical predictions to the previous quadratic implementation. AI assistance disclosure (per `CLAUDE.md`): this change was authored with Claude Code assistance and reviewed end-to-end by the human submitter, who is responsible for defending every line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
[For maintainers] Suggested jobs to run (before merge) run-slow: openai_privacy_filter |
|
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46168&sha=b90184 |
vasqu
left a comment
There was a problem hiding this comment.
We do this way on purpose to utilize the fully materialized masks as we prefer the simplcity. For performance, we never suggest to use eager but SDPA or FA. Eager is only useful to get the attention weights, but nothing else
OK, thanks for your clarification Anton and for reviewing this! Will close now. |
|
@vasqu what's the recommended way of using this model then? With FA2 or SDPA, do you just monkey patch it? |
|
No monkey patching needed, I would suggest FA as it won't materialize the masks. For training you could use flex attention (with proper preparation of the masks to be done in the data collator or similar). SDPA sadly doesnt support attention sinks so it cant be used |
Summary
The current
eager_attention_forwardforOpenAIPrivacyFiltermaterialises the full[B, num_heads, N, N]score tensor before applying the per-query band mask, so long inputs OOM well below the model's documented 128k position budget (a 100k-token forward needs ~92 TB of mask scores at fp32 vs ~4 GB for the banded path). This PR replaces the kernel with a windowedF.pad+Tensor.unfoldimplementation that matches the production attention path in OpenAI's Apache-2.0 reference repo — seeopf/_model/model.py::sdpa, the bidirectional branch at lines ~431-473. Memory isO(N * window)instead ofO(N * N).Flash-Attention and SDPA paths are unaffected. This is a fix to the
attn_implementation="eager"code path (and the default when FA2 isn't available).What's in the PR
src/transformers/models/openai_privacy_filter/modular_openai_privacy_filter.py— rewriteeager_attention_forward:Tensor.unfoldextract K/V windows of width2*(sliding_window-1)+1per query (matches HF's existingsliding_window_bidirectional_overlaymask and the FA2window_size = (sw-1, sw-1)convention).[B, N, KV, Q, D]so the einsum stays banded; the intermediate scores tensor is[B, N, KV, Q, W], never[B, H, N, N].O(B*N)); no full mask materialisation.* log(2)on load) are concatenated as a single column to scores before softmax and dropped after — bit-identical to the previous behaviour.attn_weightsis returned asNonebecause the banded path does not build the fullN x Nmatrix; documented in the model_doc page.src/transformers/models/openai_privacy_filter/modeling_openai_privacy_filter.py— regenerated byutils/modular_model_converter.py(the now-unusedrepeat_kvimport is dropped automatically).tests/models/openai_privacy_filter/test_modeling_openai_privacy_filter.py— addstest_eager_banded_swa_locality: a one-layer config withsliding_window=2, flipping a token at distance 3 (outside band) must leave the target's hidden state unchanged (<1e-5), and a sanity flip at distance 1 (inside band) is verified to move it.docs/source/en/model_doc/openai_privacy_filter.md— adds an "Attention backends" subsection explaining the FA2 vs eager trade-off and theattn_weights=Nonebehaviour of banded eager.Test plan
All commands run from the branch in a fresh venv.
Result (CPU, no real weights):
End-to-end check on real
openai/privacy-filterweights (CPU, the existingOpenAIPrivacyFilterModelIntegrationTest::test_inference_predicted_token_classificationonly has a CUDA expectation so it raisesValueError: No matching expectation found for ('cpu', None, None); reproducing its forward manually):i.e. bit-identical predictions to the previous quadratic implementation. Adding a
("cpu", None, None)entry to the existingExpectationstable could be a small follow-up to make the slow integration test runnable off CUDA; not done here to keep this PR focused.Memory check: a 100k-token field that OOMs under the current eager kernel completes under the banded path with roughly
B * N * window * (KV * D) * dtype_bytes ≈ 4 GBof K/V window cache (vsB * num_heads * N * N * 4 B ≈ ~92 TBof fp32 mask scores under the previous implementation).Notes
AI assistance disclosure (per CLAUDE.md)
This change was authored with Claude Code assistance and reviewed end-to-end by the submitting human (@kiankyars), who is responsible for defending every line of the diff. No separate coordination issue was filed before this PR because the bug (eager OOM well below the documented 128k context budget) is concrete, the fix is reference-faithful to OpenAI's open-source implementation.
🤖 Generated with Claude Code