Reproduction question: TTT update path collapses during long-context inference under public recipe

Hi, thanks for open-sourcing the In-Place TTT codebase.

We are trying to reproduce a small-scale long-context training run using the public code and public Qwen3 checkpoints. Our training data contains no short-text samples; every sequence is a 32K-token long-document sequence.

Across several diagnostic runs, we consistently found that the base model remains healthy when `ttt_mode=False`, but the TTT update path collapses during long-context inference when `ttt_mode=True`.

Summary of our setup:

- Base model: Qwen3-4B-Base
- Devices: 2× H100 PCIe 80GB
- Sequence length: 32768
- Data: long-document 32K-token sequences
- Training: full-parameter training
- Optimizer: AdamW with FP32 parameters and FP32 Adam moments
- Forward: BF16 autocast
- Loss: fused CE, verified numerically against standard torch CE
- `ttt_mode=True`
- `ttt_layers`: `[0, 6, 12, 18, 24, 30, 36]`
  - On Qwen3-4B / Qwen3-8B, this effectively activates 6 layers (0, 6, 12, 18, 24, 30), because valid layer indices run from 0 to 35 and layer 36 does not exist.
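
The effective layer count can be checked with a short sketch. It assumes 36 decoder layers (indices 0 to 35); `active_ttt_layers` is a hypothetical helper, not a function from the codebase, and we have not confirmed whether the codebase skips or rejects an out-of-range index.

```python
def active_ttt_layers(ttt_layers, num_layers):
    """Split configured layer indices into in-range and out-of-range ones."""
    active = [i for i in ttt_layers if 0 <= i < num_layers]
    skipped = [i for i in ttt_layers if not 0 <= i < num_layers]
    return active, skipped


# Assumption: the model has 36 decoder layers, indexed 0..35.
active, skipped = active_ttt_layers([0, 6, 12, 18, 24, 30, 36], num_layers=36)
print(active)   # [0, 6, 12, 18, 24, 30]
print(skipped)  # [36]
```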

At around step 500, we observe:

- `ttt_conv` norm grows from zero but remains small.
- `ttt_proj` norm appears almost unchanged from initialization.
- The base model remains healthy if the checkpoint is loaded with `ttt_mode=False`.
- With `ttt_mode=True`, long-context inference collapses: NIAH-style probes produce empty / repeated / newline outputs.
- The collapse appears to come from accumulated `present_w` drift over multiple TTT chunks, not from base-weight corruption.
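
To make "accumulated drift" concrete, here is a minimal sketch of one way to quantify it: the relative distance between the current fast weights and their value before the first chunk. It assumes `present_w` can be flattened into a plain vector; `relative_drift` and the toy numbers are illustrative only and are not taken from the codebase or from our logs.

```python
import math


def relative_drift(w_init, w_now):
    """Return ||w_now - w_init|| / ||w_init|| for two flat weight vectors."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_now, w_init)))
    base = math.sqrt(sum(b * b for b in w_init))
    # Fall back to the absolute norm when the reference vector is all zeros.
    return diff / base if base > 0 else diff


# Toy data: the same small update is applied once per TTT chunk.
w_init = [3.0, 4.0]   # norm 5.0
delta = [0.3, 0.4]    # norm 0.5 per chunk
w = list(w_init)
drifts = []
for chunk in range(4):
    w = [x + d for x, d in zip(w, delta)]
    drifts.append(relative_drift(w_init, w))

print([round(d, 3) for d in drifts])
```

In this toy case the drift grows linearly with the number of chunks, which is the pattern we would look for when logging the same quantity per chunk on a 32K-token input.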

We would appreciate guidance on whether this matches a known failure mode.

Specific questions:

1. Should `ttt_proj` receive a separate learning rate or warmup schedule?
2. In your successful runs, how quickly do `ttt_conv` and `ttt_proj` norms move during the first 500–1000 steps?
3. Do you have an expected training-loss curve for a small public reproduction run?
4. Is PG19-style 32K long-document training sufficient, or is ProLong 64K strongly required?
5. Is there a recommended sequence-length curriculum, e.g. 32K → 64K → 128K?
6. Is `ttt_layers=[0,6,12,18,24,30,36]` intended for 36-layer Qwen models, even though layer index 36 is out of range?
7. Are there any required parameter groups, optimizer settings, or initialization details not reflected in the public config?
8. Is there a minimal known-good public setup, such as Qwen3-4B + ProLong subset + exact command, that should show non-collapsing TTT behavior within 500–1000 steps?
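
To clarify what we mean by a separate learning rate in questions 1 and 7, here is a minimal sketch of splitting parameters into two optimizer groups by name. It assumes the TTT parameters can be identified by the substrings `ttt_conv` and `ttt_proj` in their names; `build_param_groups`, the example names, and the learning-rate values are hypothetical placeholders, not settings from the public config.

```python
def build_param_groups(named_params, base_lr, ttt_lr,
                       ttt_keys=("ttt_conv", "ttt_proj")):
    """Split (name, param) pairs into a base group and a TTT group."""
    base, ttt = [], []
    for name, param in named_params:
        # A parameter is treated as a TTT parameter if its name contains any key.
        (ttt if any(key in name for key in ttt_keys) else base).append(param)
    return [
        {"params": base, "lr": base_lr},
        {"params": ttt, "lr": ttt_lr},
    ]


# Toy stand-in for model.named_parameters(); names and values are illustrative.
named = [
    ("layers.0.mlp.down_proj.weight", "p0"),
    ("layers.0.ttt_proj.weight", "p1"),
    ("layers.0.ttt_conv.weight", "p2"),
]
groups = build_param_groups(named, base_lr=1e-5, ttt_lr=1e-4)
print([len(g["params"]) for g in groups])  # [1, 2]
```

The returned list has the shape `torch.optim.AdamW` accepts as its first argument (a list of dicts, each with `params` and per-group options such as `lr`), so it could be passed to the optimizer directly when `named_params` comes from a real model.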

We are happy to provide more exact logs if useful. Our main goal is to determine whether the public recipe is sufficient to reproduce a functional TTT update path on public models and long-context data.

Thanks!
