Hi, thanks for open-sourcing the In-Place TTT codebase.
We are trying to reproduce a small-scale long-context training run using the public code and public Qwen3 checkpoints. We are not using short-text data, and our current setup uses 32K-token long-document sequences.
Across several diagnostic runs, we consistently found that the base model remains healthy with ttt_mode=False, but the TTT update path collapses during long-context inference with ttt_mode=True.
Summary of our setup:
- Base model: Qwen3-4B-Base
- Devices: 2 × H100 PCIe 80 GB
- Sequence length: 32768
- Data: long-document 32K-token sequences
- Training: full-parameter training
- Optimizer: AdamW with FP32 parameters and FP32 Adam moments
- Forward: BF16 autocast
- Loss: fused CE, verified numerically against standard torch CE
- ttt_mode=True
- ttt_layers: [0, 6, 12, 18, 24, 30, 36]
- On Qwen3-4B / Qwen3-8B, this effectively activates only 6 layers, because index 36 is out of range for a 36-layer model.
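The off-by-one can be checked with a short sketch. It assumes 36 decoder layers with zero-based indices 0 to 35, which is our reading of the Qwen3-4B / Qwen3-8B configs. `split_ttt_layers` is a hypothetical helper written for this issue, not a function from the codebase.

```python
def split_ttt_layers(ttt_layers, num_hidden_layers):
    """Split requested layer indices into in-range and out-of-range lists."""
    valid = [i for i in ttt_layers if 0 <= i < num_hidden_layers]
    dropped = [i for i in ttt_layers if not 0 <= i < num_hidden_layers]
    return valid, dropped


# Assumption: the model has 36 decoder layers (valid indices 0..35).
valid, dropped = split_ttt_layers([0, 6, 12, 18, 24, 30, 36], 36)
print(valid)    # [0, 6, 12, 18, 24, 30]
print(dropped)  # [36]
```

If the codebase silently skips out-of-range indices, this is what we would expect to be active. We have not confirmed how the public code handles index 36.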
At around step 500, we observe:
- ttt_conv grows from zero, but remains small.
- ttt_proj norm appears almost unchanged from initialization.
- The base model remains healthy if the checkpoint is loaded with ttt_mode=False.
- With ttt_mode=True, long-context inference collapses: NIAH-style probes produce empty / repeated / newline outputs.
- The collapse appears to come from accumulated present_w drift over multiple TTT chunks, not from base-weight corruption.
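To make "almost unchanged" concrete: a norm can stay flat while the weights still move, so relative drift from initialization is a more sensitive check than the norm alone. A minimal sketch in plain Python follows. The helper names are hypothetical, and the flat lists stand in for flattened weight tensors.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat sequence of floats."""
    return math.sqrt(sum(v * v for v in values))


def relative_drift(init, current):
    """Return ||current - init|| / ||init||.

    For a zero-initialized tensor the ratio is undefined, so this returns
    inf when the tensor has moved and 0.0 when it has not.
    """
    diff = l2_norm([c - i for i, c in zip(init, current)])
    base = l2_norm(init)
    if base == 0.0:
        return 0.0 if diff == 0.0 else math.inf
    return diff / base


# Toy values only: same norm before and after, but the weights did move.
init = [3.0, 4.0]
current = [4.0, 3.0]
print(l2_norm(init), l2_norm(current))          # 5.0 5.0
print(round(relative_drift(init, current), 4))  # 0.2828
```

For a zero-initialized tensor such as ttt_conv, the relative measure is not informative, and the absolute norm is the number to compare.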
We would appreciate guidance on whether this matches a known failure mode.
Specific questions:
- Should ttt_proj receive a separate learning rate or warmup schedule?
- In your successful runs, how quickly do ttt_conv and ttt_proj norms move during the first 500–1000 steps?
- Do you have an expected training-loss curve for a small public reproduction run?
- Is PG19-style 32K long-document training sufficient, or is ProLong 64K strongly required?
- Is there a recommended sequence-length curriculum, e.g. 32K → 64K → 128K?
- Is ttt_layers=[0,6,12,18,24,30,36] intended for 36-layer Qwen models, even though layer index 36 is out of range?
- Are there any required parameter groups, optimizer settings, or initialization details not reflected in the public config?
- Is there a minimal known-good public setup, such as Qwen3-4B + ProLong subset + exact command, that should show non-collapsing TTT behavior within 500–1000 steps?
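To clarify what we mean by a separate learning rate or parameter group, here is a minimal sketch of the split we have in mind. It is not taken from the public config: `build_param_groups` is a hypothetical helper, the substring match on "ttt_conv" / "ttt_proj" assumes those strings appear in the parameter names, and the learning rates are placeholders rather than recommended values.

```python
def build_param_groups(named_params, base_lr, ttt_lr,
                       ttt_keys=("ttt_conv", "ttt_proj")):
    """Partition (name, param) pairs into base and TTT optimizer groups."""
    base, ttt = [], []
    for name, param in named_params:
        # Assumption: TTT parameters are identifiable by name substring.
        if any(key in name for key in ttt_keys):
            ttt.append(param)
        else:
            base.append(param)
    groups = [{"params": base, "lr": base_lr}]
    if ttt:
        groups.append({"params": ttt, "lr": ttt_lr})
    return groups


# Stand-in names and values; real use would pass model.named_parameters().
demo = [
    ("model.layers.0.mlp.down_proj.weight", "w0"),
    ("model.layers.0.mlp.ttt_conv.weight", "w1"),
    ("model.layers.0.mlp.ttt_proj.weight", "w2"),
]
groups = build_param_groups(demo, base_lr=1e-5, ttt_lr=1e-4)
print([len(g["params"]) for g in groups])  # [1, 2]
```

The returned list has the per-group dict form that torch.optim.AdamW accepts, so it could be passed directly as the optimizer's first argument. Whether the intended recipe uses such a split is exactly what we are asking.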
We are happy to provide more exact logs if useful. Our main goal is to determine whether the public recipe is sufficient to reproduce a functional TTT update path on public models and long-context data.
Thanks!