hydra: add bounded-residency decode attention#873
Conversation
|
Hi @newjordan, thanks for your interest in contributing! This project requires that pull request authors are vouched, and you are not in the list of vouched users. This PR will be closed automatically. See https://github.com/huggingface/kernels-community/blob/main/CONTRIBUTING.md for more details. |
|
Closing this upstream PR per maintainer guidance to publish Hydra as a community kernel under our own namespace. Public source: https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra ; Hub repo: https://huggingface.co/Frosty40/hydra |
Nice! You may want to apply for being able to generate kernel-type repositories (since that's required for kernels 0.14 and later). You can do so through Settings (of the user or org) -> Account, there will be a section to do this:
|

Summary
This PR adds
hydra, an experimental bounded-residency decode attention kernelfor long-context inference.
Hydra keeps a fixed resident attention set during decode: sink tokens, recent
tokens, and selected older KV pages. The goal is narrow: improve fit/usability
for specific long-context decode workloads while keeping clear evidence
boundaries and avoiding universal speedup or production-readiness claims.
Included
hydra/build.toml,flake.nix,README.md, andCARD.mdhydra/torch-ext/hydra/hydra/tests/hydra/benchmarks/benchmark_hydra_decode.pyreadme_example.pyfor source-packet validation before publication and Hubloading after publication
Validation
Final source package tarball used for validation:
Builder gate on Vast RTX A6000, driver 570.133.20, CUDA 12.8 path:
Result:
6 passed0.2166 ms/iterkernel-builderpytest:4 passed, 2 skipped0Additional package-smoke/HF benchmark matrix was run across RTX 3060, RTX 3070,
RTX 3080, RTX 3090, RTX 4070 Ti, RTX 4090, A100 SXM4, RTX A6000, and RTX PRO
6000 Blackwell variants. These rows are used as hardware/runtime evidence only,
not as universal performance claims.
Non-claims
This PR does not claim universal speedups, production readiness, broad
model-quality preservation, or generic support across every model/GPU.
Exact-model Qwen FP8 rows are treated as proof-of-concept evidence only.