hydra: add bounded-residency decode attention #873

Closed
newjordan wants to merge 1 commit into huggingface:main from newjordan:hydra/add-bounded-residency-decode-attention

@newjordan

Summary

This PR adds hydra, an experimental bounded-residency decode attention kernel
for long-context inference.

Hydra keeps a fixed resident attention set during decode: sink tokens, recent
tokens, and selected older KV pages. The goal is narrow: improve fit/usability
for specific long-context decode workloads while keeping clear evidence
boundaries and avoiding universal speedup or production-readiness claims.
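
To illustrate the resident-set idea described above, here is a minimal Python sketch of which token positions stay resident for one decode step. It is not Hydra's implementation or API: the function name, every parameter, and the assumption that older pages are chosen by some external selector and passed in as page indices are all hypothetical.

```python
def resident_token_indices(seq_len, num_sink, num_recent, page_size, selected_pages):
    """Return the sorted token positions kept resident for one decode step.

    Hypothetical helper for illustration only; names are not taken from Hydra.
    """
    # Sink tokens: the first few positions of the sequence.
    resident = set(range(min(num_sink, seq_len)))

    # Recent tokens: a sliding window ending at the current position.
    recent_start = max(0, seq_len - num_recent)
    resident.update(range(recent_start, seq_len))

    # Selected older KV pages: each page covers page_size consecutive tokens.
    for page in selected_pages:
        start = page * page_size
        end = min(start + page_size, seq_len)
        resident.update(range(start, end))  # empty if the page lies past seq_len

    return sorted(resident)


# Example: 32 cached tokens, 2 sinks, 4 recent tokens, one selected page of 4.
indices = resident_token_indices(
    seq_len=32, num_sink=2, num_recent=4, page_size=4, selected_pages=[3]
)
print(indices)
```

In this sketch the resident set never exceeds `num_sink + num_recent + len(selected_pages) * page_size` positions regardless of `seq_len`, which is the sense in which residency is bounded.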

Included

  • hydra/build.toml, flake.nix, README.md, and CARD.md
  • Triton/Python source under hydra/torch-ext/hydra/
  • import, CSR, and CUDA decode/parity tests under hydra/tests/
  • isolated decode benchmark under hydra/benchmarks/benchmark_hydra_decode.py
  • readme_example.py for source-packet validation before publication and Hub
    loading after publication

Validation

Final source package tarball used for validation:

sha256: bff743b66ad67bd4c7bdd8ae190dc7672335b3a6c422af307a77a47c0942a57e
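
For anyone who wants to check a downloaded tarball against the digest above, a minimal sketch using Python's standard `hashlib` follows. The tarball file name is not given in this PR, so the path is left as a parameter, and the helper names are hypothetical.

```python
import hashlib

# Digest quoted in the PR description above.
EXPECTED_SHA256 = "bff743b66ad67bd4c7bdd8ae190dc7672335b3a6c422af307a77a47c0942a57e"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large tarballs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA-256 equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected` with the local path of the tarball returns `True` only if the file is byte-identical to the one that was validated.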

Builder gate on Vast RTX A6000, driver 570.133.20, CUDA 12.8 path:

RUN_BUILDER=1 BUILDER_VARIANT=torch210-cxx11-cu128-x86_64-linux BENCH_ITERS=20 BENCH_WARMUP=5 BENCH_TOKENS=8192 scripts/validate_hf_setup_package.sh

Result:

  • local package pytest: 6 passed
  • isolated decode smoke: 0.2166 ms/iter
  • kernel-builder pytest: 4 passed, 2 skipped
  • exit status: 0
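
The `BENCH_ITERS` and `BENCH_WARMUP` variables in the command above suggest a warmup-then-timed loop that reports a mean time per iteration. The benchmark script itself is not shown in this PR, so the following is only a sketch of that general pattern, with a hypothetical function name, and not the code that produced the 0.2166 ms/iter figure.

```python
import time


def mean_ms_per_iter(fn, iters=20, warmup=5):
    """Run fn `warmup` times untimed, then time `iters` calls and return mean ms."""
    # Warmup iterations absorb one-time costs such as compilation and caching.
    for _ in range(warmup):
        fn()

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start

    # Convert seconds to milliseconds and average over the timed iterations.
    return elapsed * 1e3 / iters
```

For GPU kernels, which launch asynchronously, a harness like this also needs a device synchronization before each clock read (or device-side timing events); otherwise it measures launch overhead instead of kernel time.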

An additional package-smoke/HF benchmark matrix was run across RTX 3060, RTX 3070,
RTX 3080, RTX 3090, RTX 4070 Ti, RTX 4090, A100 SXM4, RTX A6000, and RTX PRO
6000 Blackwell variants. These rows are used as hardware/runtime evidence only,
not as universal performance claims.

Non-claims

This PR does not claim universal speedups, production readiness, broad
model-quality preservation, or generic support across every model/GPU.
Exact-model Qwen FP8 rows are treated as proof-of-concept evidence only.

newjordan requested review from danieldk and drbh as code owners on May 18, 2026 20:27
@github-actions

Hi @newjordan, thanks for your interest in contributing!

This project requires that pull request authors are vouched, and you are not in the list of vouched users.

This PR will be closed automatically. See https://github.com/huggingface/kernels-community/blob/main/CONTRIBUTING.md for more details.

github-actions bot closed this on May 18, 2026
@newjordan
Author

Closing this upstream PR per maintainer guidance to publish Hydra as a community kernel under our own namespace. Public source: https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra ; Hub repo: https://huggingface.co/Frosty40/hydra

@danieldk
Member

> Closing this upstream PR per maintainer guidance to publish Hydra as a community kernel under our own namespace. Public source: https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra ; Hub repo: https://huggingface.co/Frosty40/hydra

Nice! You may want to apply for permission to create kernel-type repositories (this is required for kernels 0.14 and later). You can do so through Settings (of the user or org) -> Account, where there is a section for this:

[Image: Screenshot 2026-05-19 at 10 34 23]
