Commit dacf606
Han Wang
test(pt_expt): shrink change-bias water dataset to 5 frames
``TestChangeBias`` was the dominant memory hog in the ``Test Python``
shard ``(10, 3.13)`` of the CI matrix — by itself it peaked at ~5 GB
RSS, leaving so little headroom under the 7 GB GitHub-hosted runner
ceiling that the shard sporadically lost communication with the
GitHub Actions server (intermittent ``runner lost communication``
flake observed across many recent PRs).
Profile finding: peak RSS scales **linearly at ~50 MB per frame**
during ``dp change-bias``'s in-process ``main(cmds)`` call. The
forward over ``compute_output_stats`` enumerates ``nbatches = min(
data.get_nbatches()) = 80`` frames of the water example, and each
frame leaks ~50 MB into torch's caching allocator (not autograd —
the wrapper is already in ``torch.no_grad()``; the leak is in
``forward_common_atomic`` somewhere and is a separate bug).
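
As a rough check on the figures above, the sketch below multiplies the reported ~50 MB per frame by the frame count. The function name is hypothetical, and the model covers only the frame-attributable part of RSS; the process baseline is not itemised in this message.

```python
def frame_attributable_mb(n_frames: int, per_frame_mb: float = 50.0) -> float:
    """Memory retained across the stat loop under the linear model (~50 MB/frame)."""
    return n_frames * per_frame_mb


before = frame_attributable_mb(80)  # full water example: 80 frames
after = frame_attributable_mb(5)    # 5-frame subset
print(f"80 frames: ~{before:.0f} MB, 5 frames: ~{after:.0f} MB")
```

Under this model roughly 4 GB of the ~5 GB peak is attributable to the 80 enumerated frames, which is why cutting the frame count is the cheapest lever.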
Constraint: we **must** keep ``nbatches == total dataset frames``
to preserve determinism for ``test_change_bias_pt2_pte_consistency``
which compares a ``.pte`` and a ``.pt2`` invocation with ``atol=1e-10``.
``_load_batch_set`` shuffles the loaded set, so a value of
``nbatches < total_frames`` would sample a random subset and the
two calls (running in the same Python process with an advancing
``dp_random`` state) would see different frames and produce
different biases. Full enumeration sees every frame and so the
aggregate bias is invariant under shuffle.
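
The shuffle argument can be illustrated with a toy stand-in. The sketch below is not the real ``_load_batch_set`` or bias fit: it uses a plain mean over made-up per-frame values and a local ``random.Random`` in place of ``dp_random``, purely to show why full enumeration is order-independent while a truncated sample is not.

```python
import random


def sampled_mean(values, nbatches, rng):
    """Shuffle a copy of the frames, keep the first nbatches, and average them."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    picked = shuffled[:nbatches]
    return sum(picked) / len(picked)


rng = random.Random(0)  # one advancing RNG shared by both calls
frames = [float(i) for i in range(80)]  # made-up per-frame values

# Full enumeration: both calls see all 80 frames, only the order differs.
full_a = sampled_mean(frames, 80, rng)
full_b = sampled_mean(frames, 80, rng)

# Truncated: each call keeps a different random 5-frame subset.
sub_a = sampled_mean(frames, 5, rng)
sub_b = sampled_mean(frames, 5, rng)

print(full_a, full_b)  # identical
print(sub_a, sub_b)    # will typically differ
```

The toy values are exactly representable, so the two full-enumeration results match bit for bit; with real data the summation order could still perturb the last few bits, which is presumably what the ``atol=1e-10`` tolerance absorbs.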
Solution: build a 5-frame subset of ``examples/water/data/data_0``
in ``TestChangeBias.setUpClass`` and point both the trainer config
and the change-bias ``-s`` argument at it. ``nbatches`` then
resolves to 5 (= the new dataset size, = full enumeration), peak
RSS drops to ~1.7 GB for the whole class, and all 9 tests in the
class (including the strict atol=1e-10 consistency check) still
pass. Class wall time also improves (~3:55 → less data-loop work
in each change-bias invocation).
1 parent 4e64f8b commit dacf606
1 file changed
Lines changed: 56 additions & 5 deletions
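
The commit message describes building a 5-frame subset of ``examples/water/data/data_0`` in ``TestChangeBias.setUpClass``. A minimal sketch of such a helper follows. It is not the code from this commit. It assumes the usual DeePMD layout of top-level ``*.raw`` files plus ``set.*`` directories of per-frame ``.npy`` arrays, keeps the first frames (the message does not say which frames are kept), and the name ``make_frame_subset`` is hypothetical.

```python
import shutil
from pathlib import Path

import numpy as np


def make_frame_subset(src: Path, dst: Path, n_frames: int = 5) -> Path:
    """Copy a DeePMD-style system, keeping the first n_frames of each set directory."""
    dst.mkdir(parents=True, exist_ok=True)
    # Top-level metadata (e.g. type.raw, type_map.raw) is frame-independent: copy as-is.
    for item in src.iterdir():
        if item.is_file():
            shutil.copy2(item, dst / item.name)
    # Per-frame arrays have frames on axis 0, so slicing the first axis trims frames.
    for set_dir in sorted(src.glob("set.*")):
        out_dir = dst / set_dir.name
        out_dir.mkdir(exist_ok=True)
        for npy in sorted(set_dir.glob("*.npy")):
            np.save(out_dir / npy.name, np.load(npy)[:n_frames])
    return dst
```

In a test class this would be called once from ``setUpClass`` with a temporary directory as ``dst``, and the resulting path passed both to the trainer config and to the change-bias ``-s`` argument.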
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | 121-156 | 36 lines added |
| 126-128 | 162-181 | 3 lines removed, 20 lines added |
| 132-133 | | 2 lines removed |