turbo-tasks: task-storage memory wins #93720
Open
lukesandberg wants to merge 4 commits into
lukesandberg
commented
May 9, 2026
Stats from current PR: ✅ No significant changes detected (commit 50785a8)
Migrate Arc<CachedTaskType> to triomphe::Arc via CachedTaskTypeArc newtype. Saves one usize per allocation (no weak count) and avoids the weak-count CAS in drop_slow compared to std::sync::Arc. We never need Weak<CachedTaskType>, so the trade-off is favorable.
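The newtype approach can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `std::sync::Arc` stands in for `triomphe::Arc`, `std::fmt::Display` stands in for a foreign trait such as bincode's `Encode`/`Decode`, and the `CachedTaskType` body is a hypothetical placeholder.

```rust
use std::fmt;
use std::ops::Deref;
use std::sync::Arc; // stand-in: the PR wraps triomphe::Arc instead

// Hypothetical, simplified placeholder for the real CachedTaskType.
#[derive(Debug, PartialEq)]
pub struct CachedTaskType {
    pub name: String,
}

// A newtype defined in this crate. Because the wrapper is a local type,
// foreign traits may be implemented for it.
#[derive(Clone, Debug, PartialEq)]
pub struct CachedTaskTypeArc(Arc<CachedTaskType>);

impl CachedTaskTypeArc {
    pub fn new(inner: CachedTaskType) -> Self {
        Self(Arc::new(inner))
    }
}

// Deref keeps call sites unchanged: fields and methods of the inner
// value stay reachable through the wrapper.
impl Deref for CachedTaskTypeArc {
    type Target = CachedTaskType;

    fn deref(&self) -> &CachedTaskType {
        &self.0
    }
}

// Display plays the role of the foreign trait here. Writing
// `impl fmt::Display for Arc<CachedTaskType>` directly would be rejected
// by the orphan rule, since both the trait and `Arc` are foreign.
impl fmt::Display for CachedTaskTypeArc {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "task:{}", self.name)
    }
}
```

Cloning the wrapper only bumps the reference count, so `CachedTaskTypeArc` can be passed around exactly like the `Arc` it replaces.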
Replace the `(CellRef, Option<u64>)` and `(CellId, Option<u64>, TaskId)`
tuples used for cell-edge tracking with a `CellDependency` enum:
    enum CellDependency {
        All(CellRef),
        Hash(CellRef, u64),
    }
`Option<u64>` previously cost a full 16 B (8 B discriminant + 8 B value,
aligned). With an explicit enum the layout algorithm reuses the niche on
`ValueTypeId` (`NonZero<u16>`) inside `CellRef.cell.type_id` for the
variant tag, dropping the element from 32 B to 24 B. That in turn shrinks
`LazyField` from 56 B to 48 B.
Also adds `CellDependency::into_parts()` and uses it in
`iter_cell_dependents` / `iter_cell_dependencies` hot loops to avoid
checking the enum discriminant twice via back-to-back
`cell_ref()` + `key()` calls.
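The size difference described above can be reproduced with stand-in types. This is a sketch under assumptions: the field layouts of `TaskId`, `ValueTypeId`, `CellId` and `CellRef` are guesses (the message only confirms `NonZero<u16>` inside `CellRef.cell.type_id`), and the `into_parts` body is one plausible shape for the method the message names.

```rust
use std::mem::size_of;
use std::num::NonZeroU16;

// Hypothetical stand-ins; the real turbo-tasks definitions may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TaskId(pub u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ValueTypeId(pub NonZeroU16);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CellId {
    pub type_id: ValueTypeId,
    pub index: u32,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CellRef {
    pub task: TaskId,
    pub cell: CellId,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CellDependency {
    All(CellRef),
    Hash(CellRef, u64),
}

impl CellDependency {
    /// Returns the cell reference and the optional key hash from a single
    /// `match`, so a caller that needs both inspects the variant only once.
    pub fn into_parts(self) -> (CellRef, Option<u64>) {
        match self {
            CellDependency::All(cell) => (cell, None),
            CellDependency::Hash(cell, key) => (cell, Some(key)),
        }
    }
}

/// Sizes in bytes of (old tuple element, new enum element).
pub fn element_sizes() -> (usize, usize) {
    (
        size_of::<(CellRef, Option<u64>)>(),
        size_of::<CellDependency>(),
    )
}
```

With these stand-ins, `element_sizes()` is expected to report 32 and 24 on common 64-bit targets, matching the figures in the message. The sketch only observes the resulting sizes; it does not show where the compiler stores the variant tag.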
Replace `TaskStorage::lazy: Vec<LazyField>` with a custom 16 B `TinyVec` (u8 len + u8 cap; the schema's max field count is well under 255). Drops `size_of::<TaskStorage>()` from 136 B to 128 B. Included micro-benchmarks show `TinyVec` push is 11-32% faster than `Vec` across realistic sizes and iter is neutral.

The `TinyVec` type:

* Carries a `const MAX: u8` generic parameter that strictly caps push count and tightens the growth schedule (doubles until it would exceed MAX, then caps exactly at MAX). The schema macro emits `TinyVec<LazyField, N>` where `N` is the exact lazy-field count, so the cap matches the actual schema size (e.g. with 24 variants we end at cap=24 instead of cap=32, saving a few slots per fully-populated task). A compile-time static assert rejects `MAX = 0` at monomorphization.
* Tightens visibility: `new`, `capacity`, `as_slice`, `as_mut_slice`, `reserve` are private; `len` / `is_empty` stay pub as a clippy-preferred pair.
* Delegates `retain_mut` to `Vec::retain_mut` via a `Vec::from_raw_parts` round-trip. `retain_mut` is cold relative to push so the round-trip cost is irrelevant, and this drops ~7 unsafe blocks with a panic partial-shift guard.
* Delegates owned `IntoIter` to `std::vec::IntoIter` via the same Vec round-trip, dropping ~50 lines of unsafe.
* Drops `Extend` and `FromIterator` trait impls; the only caller path uses an inherent `extend_exact` which requires an `ExactSizeIterator` and reserves exactly once.
* Drops `iter`, `iter_mut`, `last_mut`, `Index`, `IndexMut`, all of which are reachable through `Deref<Target = [T]>`. `for x in &tv` / `for x in &mut tv` still need `IntoIterator` impls for refs because the `for`-loop desugar doesn't apply `Deref` coercion across the reference boundary.

Net unsafe count in `TinyVec` is 5 in the hot path plus 1 in `retain_mut` and 1 in `IntoIter`, each upholding a single local invariant or just round-tripping through Vec.
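The header size and the capped growth schedule can be illustrated in safe Rust. This is a rough sketch, not the PR's `TinyVec`: `TinyVecHeader`, `Growth`, and the initial capacity of 4 are hypothetical names and values chosen for illustration, and no allocation logic is shown.

```rust
use std::mem::size_of;
use std::ptr::NonNull;

/// Hypothetical header: a pointer plus one-byte length and capacity.
/// On a 64-bit target this is 8 + 1 + 1 bytes, padded to 16.
#[allow(dead_code)]
pub struct TinyVecHeader<T> {
    ptr: NonNull<T>,
    len: u8,
    cap: u8,
}

/// Hypothetical growth policy for a vector capped at `MAX` elements.
pub struct Growth<const MAX: u8>;

impl<const MAX: u8> Growth<MAX> {
    // Evaluated when `next_capacity` is instantiated for a concrete MAX,
    // so using `Growth::<0>` fails to compile.
    const MAX_IS_NONZERO: () = assert!(MAX != 0, "MAX must be at least 1");

    // Assumed first allocation size; the commit message does not state one.
    const INITIAL: u8 = 4;

    /// Doubles the capacity until doubling would exceed MAX, then returns MAX.
    pub fn next_capacity(cap: u8) -> u8 {
        let () = Self::MAX_IS_NONZERO;
        if cap == 0 {
            return Self::INITIAL.min(MAX);
        }
        match cap.checked_mul(2) {
            Some(doubled) if doubled <= MAX => doubled,
            _ => MAX,
        }
    }
}

/// Sizes in bytes of (`Vec` header, sketch header).
pub fn header_sizes() -> (usize, usize) {
    (size_of::<Vec<u64>>(), size_of::<TinyVecHeader<u64>>())
}
```

Under these assumptions, `Growth::<24>` steps through 4, 8, 16, 24 and then stays at 24, which is the "cap=24 instead of cap=32" behavior described above.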
Each schema field's `I` (inline capacity) was previously fixed at 1. With `SmallVec`'s `union` feature on, the heap variant occupies 16 bytes (`NonNull<T> + usize`), so the SmallVec body is always `max(16, N * sizeof(T))` aligned to `max(align(T), 8)`. Net: growing `N` is free until `N * sizeof(T)` exceeds 16, after which each step adds `align_up(sizeof(T), 8)` bytes.

Two opportunities follow:

* For lazy fields, the `LazyField` enum already pays a 40-byte payload budget (the largest variants, `cell_data`, `cell_data_hash`, `AutoSet<CellDependency>`, `CounterMap<CollectibleRef, i32>`, saturate it at I=1). Smaller-element fields sat at 32 B with 8 B of unused padding. Bumping them to fill 40 B is zero-cost: TaskStorage and LazyField sizes don't change.
* For inline fields on TaskStorage, raising `I` is free up to the SmallVec 16-byte body limit (e.g. `AutoSet<TaskId>` to I=4, `CounterMap<TaskId, u32>` to I=2).

To allow per-field tuning, parameterize:

* `CounterMap` over `const I: usize`, propagating to its `AutoMap` inner.
* The `AutoSet` / `AutoMap` schema aliases drop their hardcoded `1`; each field declares its own `I`.

Per-field choices:

* `output_dependent` (inline) `AutoSet<TaskId>` → 4 (32 B)
* `upper` (inline) `CounterMap<TaskId, u32>` → 2 (32 B)
* `children`, `output_dependencies`, `outdated_output_dependencies` `AutoSet<TaskId>` → 6 (40 B)
* `collectibles_dependencies`, `outdated_collectibles_dependencies` `AutoSet<CollectiblesRef>` → 3 (40 B)
* `collectibles_dependents` `AutoSet<(TraitTypeId, TaskId)>` → 3 (40 B)
* `followers`, `aggregated_dirty_containers`, `aggregated_current_session_clean_containers` `CounterMap<TaskId, _>` → 3 (40 B)
* `cell_type_max_index` `AutoMap<ValueTypeId, u32>` → 3 (40 B)

Variants already at 40 B (cell_data, cell_data_hash, AutoSet<CellDependency>, CounterMap<CollectibleRef, i32>) stay at I=1. `in_progress_cells` stays at I=1 to avoid overflowing LazyField under the `hanging_detection` feature (which inflates `Event` from 8 B to 16 B).

`TaskStorage` stays at 128 B, `LazyField` at 48 B; only inline capacities change.
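The body-size formula above can be turned into a small calculator for picking `I`. This is a sketch of the arithmetic only: the function names are hypothetical, the element sizes used in the usage note (4 B for `TaskId`, 8 B for a `(TaskId, u32)` entry) are assumptions, and the 24 B budget for lazy fields is inferred from the "8 B of unused padding" remark rather than stated directly.

```rust
/// Rounds `n` up to the next multiple of `align`.
pub fn align_up(n: usize, align: usize) -> usize {
    (n + align - 1) / align * align
}

/// Size in bytes of the SmallVec data body for inline capacity `n`,
/// following the formula in the message: `max(16, n * size)` aligned
/// to `max(align, 8)`.
pub fn smallvec_body_size(n: usize, elem_size: usize, elem_align: usize) -> usize {
    let body_align = elem_align.max(8);
    align_up((n * elem_size).max(16), body_align)
}

/// Largest inline capacity whose body still fits in `budget` bytes.
/// Returns 0 if even a single element does not fit.
pub fn max_inline_capacity(elem_size: usize, elem_align: usize, budget: usize) -> usize {
    assert!(elem_size > 0, "zero-sized elements are not modeled");
    let mut n = 0;
    while smallvec_body_size(n + 1, elem_size, elem_align) <= budget {
        n += 1;
    }
    n
}
```

With a 16 B budget, 4-byte elements give 4 and 8-byte elements give 2; with a 24 B budget they give 6 and 3. Those values line up with the per-field choices listed above, though the listed totals (32 B, 40 B) also include overhead outside the body that this sketch does not model.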

Summary
Four small, independent changes that shrink `TaskStorage` and the data it owns. Recommend reviewing commit-by-commit.

* `Arc<CachedTaskType>` → `triomphe::Arc<CachedTaskType>`. `triomphe::Arc` is already a workspace dep used in `ReadRef`/`SharedReference`. `CachedTaskType` never appears in a `Weak<...>`, so we can drop the weak count and the CAS in `drop_slow`. Saves one `usize` per allocation. Migrated via a `CachedTaskTypeArc` newtype so the bincode `Encode`/`Decode` impls don't need to cross the orphan rule.
* Niche-encode `CellDependency`. The `cell_dependencies`/`cell_dependents` sets used to hold `(CellRef, Option<u64>)` tuples; `Option<u64>` cost a full 16 B (8 B discriminant + 8 B value, aligned), making each element 32 B. A `CellDependency` enum with two variants (`All(CellRef)` / `Hash(CellRef, u64)`) lets the layout algorithm reuse the niche on `ValueTypeId` (`NonZero<u16>`) inside `CellRef.cell.type_id` for the variant tag. Element size drops 32 → 24 B; `LazyField` from 56 → 48 B. The same enum backs both forward and reverse edges: for `cell_dependents` we re-point `CellRef.task` at the dependent task.
  Added `CellDependency::into_parts()` and use it in `iter_cell_dependents` / `iter_cell_dependencies` hot loops so the discriminant is checked once instead of twice via back-to-back `cell_ref()` + `key()` calls.
* `TaskStorage::lazy: Vec<LazyField>` → `TinyVec<LazyField>`. The lazy vec only ever holds ~25 elements (one per declared lazy field in the schema). Swapping `Vec`'s 24 B `(ptr, len, cap)` header for `(ptr, len: u8, cap: u8)` + 6 B padding gives 16 B. Drops `size_of::<TaskStorage>()` from 136 → 128 B. `TinyVec` is hand-rolled so I added a push/iter micro-benchmark to confirm it doesn't lose performance vs std `Vec`. Results below.
* Rightsize collections → Explore the `AutoSet`/`AutoMap` types in storage_schema and ensure each one is maximally sized for its natural alignment.

Benchmark results

`next build` on a representative app (15 runs each, M4 Pro, `caffeinate -dimsu nice -n -20`)

Fresh same-day baseline against branch:

MaxRSS is the headline. −0.43 GB on a 12.5 GB working set, with t=−17.86 (every branch run lower than every canary run, CV ≤ 0.6% on both sides). Wall / user / sys are all within noise: this PR is a memory win with no measurable timing impact.

`TinyVec` vs `Vec` micro-bench (`turbo-tasks/benches/tiny_vec.rs`, 200 samples each)

TinyVec push is 11–32% faster than Vec push across all realistic sizes; iter is identical. Run with `cargo bench -p turbo-tasks --bench tiny_vec`.

`task_overhead/turbo` Criterion bench (M4 Pro, `--sample-size 200`)