Make the per-frame time embedding registry-based and config-selectable

amazloumi · amazloumi · commit bcb435f0eaa2 · 2026-06-26T13:25:17.000-04:00
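
The dispatch this commit introduces can be sketched in plain Python without torch. This is a minimal illustration under stated assumptions, not the code in `kempnerforge/model/frame_time.py`: `_TIME_EMBEDDINGS`, `SketchConfig`, and `SinusoidalSketch` are hypothetical stand-ins, the geometric spacing of periods is an assumed reading of "log-spaced", and a single scalar time with Python lists replaces the `(B, F)` tensor and `nn.Linear` projection.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical stand-in for the project's registry (name -> builder).
_TIME_EMBEDDINGS: dict[str, Callable[..., Any]] = {}


def register_time_embedding(name: str) -> Callable:
    """Decorator that stores a builder under ``name``."""

    def decorator(fn: Callable) -> Callable:
        _TIME_EMBEDDINGS[name] = fn
        return fn

    return decorator


@dataclass
class SketchConfig:
    """Hypothetical mirror of the ``[time_embedding]`` fields shown in the diff."""

    type: str = "sinusoidal"
    num_bands: int = 16
    min_period: float = 0.5
    max_period: float = 256.0

    @property
    def enabled(self) -> bool:
        return self.type != "none"

    def extra_kwargs(self) -> dict[str, Any]:
        return {
            "num_bands": self.num_bands,
            "min_period": self.min_period,
            "max_period": self.max_period,
        }


class SinusoidalSketch:
    """sin/cos features at assumed geometrically spaced periods, then a zero projection."""

    def __init__(self, dim: int, num_bands: int, min_period: float, max_period: float):
        ratio = max_period / min_period
        steps = max(num_bands - 1, 1)
        # Assumption: periods grow geometrically from min_period to max_period.
        self.periods = [min_period * ratio ** (i / steps) for i in range(num_bands)]
        # Zero-initialized projection of shape (dim, 2 * num_bands).
        self.weight = [[0.0] * (2 * num_bands) for _ in range(dim)]

    def features(self, t: float) -> list[float]:
        angles = [2.0 * math.pi * t / p for p in self.periods]
        return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

    def __call__(self, t: float) -> list[float]:
        feats = self.features(t)
        return [sum(w * f for w, f in zip(row, feats)) for row in self.weight]


@register_time_embedding("sinusoidal")
def _build_sinusoidal(dim: int, *, num_bands: int = 16, min_period: float = 0.5,
                      max_period: float = 256.0, **_: Any) -> SinusoidalSketch:
    # Unused kwargs are swallowed by **_ so other builders can share one config.
    return SinusoidalSketch(dim, num_bands, min_period, max_period)


def build_time_embedding(config: Any, dim: int) -> Any:
    """Return None when disabled; a None config falls back to the default."""
    if config is None:
        config = SketchConfig()
    if not config.enabled:
        return None
    return _TIME_EMBEDDINGS[config.type](dim, **config.extra_kwargs())
```

With the projection at zero, the additive term is all zeros for any timestamp, which is why the embedding leaves the visual tokens unchanged at step 0. Adding a technique in this sketch means decorating one more builder and selecting its key through `type`.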
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -60,10 +60,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - **Frame-aware model + training wiring.** `kempnerforge/model/vlm.py`: `_project_image_features` → `_project_visual_features` folds the frame axis through the encoder + pooler to `(B, F·P′, dim)` (a single image is the `F == 1` case). `VLMWrapper` gains `frames_per_clip`, threaded through `build_parallel_model` / `_build_vlm` / `build_vlm_wrapper` so the static visual-token count equals `F·P′` (drives the residual budget and MoT's positional split; static == runtime). `scripts/train.py` builds the video dataset/collator when `[video]` is set. Adds `configs/train/vlm_video_webvid.toml` (SigLIP2 + avgpool + WebVid).
   - Tests: `tests/unit/test_video_io.py`, `test_video_dataset.py`, `test_video_config.py`; video-forward cases (all four archs) + image-path regression in `test_vlm.py`; pooling-adapter cases in `test_adapter.py`. Docs: `docs/how-to/train-on-video.md`.
   - Deferred (follow-ups): grounding (`<points>`/`<tracks>` outputs with point-F1 / track-J&F eval), frame-mask-aware attention, bidirectional visual attention, VLM sequence packing, long-context (blocked on context-parallel being wired), and warm-start from a converted image-VLM checkpoint.
-- **Per-frame timestamps for video.** Each sampled frame carries its actual presentation time (seconds), embedded and added to that frame's visual tokens so the model can reason about *when* events occur, not just frame order. Zero-initialized so it is identity at step 0 (warm-start) and learned from there.
+- **Per-frame timestamps for video.** Each sampled frame carries its actual presentation time (seconds), embedded and added to that frame's visual tokens so the model can reason about *when* events occur, not just frame order. The embedding is registry-driven and selected via `[time_embedding].type`, so a new technique is a small registered addition. It is zero-initialized, so it is an identity at step 0 (warm-start) and is learned from there.
   - `kempnerforge/data/video_io.py`: `decode_video_frames` returns `(frames, times)` (the matched frames' presentation times); `kempnerforge/data/video_dataset.py` emits a `frame_times` `(F,)` tensor and `VideoCollator` stacks it to `(B, F)`.
-  - `kempnerforge/model/frame_time.py`: `FrameTimeEmbedding` (sinusoidal features at log-spaced periods → zero-init projection), a `VLMWrapper` submodule applied per frame in `_project_visual_features` (video only; `None` for the image path) and built + FSDP-sharded + meta-materialized at both build sites (`build_vlm_wrapper`, `_build_vlm`). `scripts/train.py` threads `frame_times` into the VLM forward.
-  - Tests: `tests/unit/test_frame_time.py` + frame-time forward cases in `test_vlm.py`.
+  - `kempnerforge/model/frame_time.py`: a `TimeEmbedding` base (the additive `(B, F) seconds → (B, F, dim)` contract) + the `"sinusoidal"` implementation, registered via `@registry.register_time_embedding` and built through `build_time_embedding`. Applied per frame in `_project_visual_features` as a `VLMWrapper` submodule (video only; `None` for the image path) and built + FSDP-sharded + meta-materialized at both build sites (`build_vlm_wrapper`, `_build_vlm`).
+  - `kempnerforge/config/time_embedding.py`: the `[time_embedding]` `TimeEmbeddingConfig` (`type` selects the registered builder; `type = "none"` disables it), wired into `JobConfig` and threaded through `build_parallel_model`; `scripts/train.py` passes `config.time_embedding` and threads `frame_times` into the forward.
+  - Sequence-*modifying* time encodings (e.g. Molmo2-style interleaved text time-tokens) are a separate future hook at the sequence-assembly layer, gated on interleaved/variable-length sequence support — out of scope for this additive registry.
+  - Tests: `tests/unit/test_frame_time.py`, `test_time_embedding_config.py`; frame-time forward + `type="none"` cases in `test_vlm.py`; the video build path in `test_distributed.py`.
 - `install-and-verify` plugin skill: runs `uv sync`, asserts Python ≥ 3.12, then runs the four CI gate checks (`ruff check`, `ruff format --check`, `pyright`, `pytest tests/unit/`). Canonical first command after cloning.
 - `.python-version` pinned to `>=3.12` so uv resolves the interpreter explicitly. Teammates on 3.13 use 3.13 (no download); 3.11-only users get 3.12 auto-fetched.
 - **Dynamic-checkpointing window** (`[checkpoint.dyn_ckpt_window]`). Opt-in dense save phase: inside `[start, stop]` a registered strategy decides which steps to save; outside the window the regular `interval` cadence applies. The default strategy, `"power2"`, saves at `start` and at every `start + 2^k` while `<= stop` — tight near the start of the window, doubling thereafter. Useful for analyzing early-training dynamics, where the loss moves fastest. The default `CheckpointConfig` is unchanged (no `dyn_ckpt_window`, interval-only saves).
diff --git a/docs/how-to/train-on-video.md b/docs/how-to/train-on-video.md
@@ -26,10 +26,13 @@ A clip of `F` frames becomes `F × P′` visual tokens:
      `max_seq_len`).
 
 Temporal order is carried by frame order (sequential positions). On top of that,
-each frame's **timestamp in seconds** is embedded (sinusoidal features → a
-zero-initialized projection) and added to that frame's visual tokens, so the
-model sees *when* each frame occurs, not just its order. Grounding outputs are a
-separate follow-up (see below).
+each frame's **timestamp in seconds** is embedded and added to that frame's
+visual tokens, so the model sees *when* each frame occurs, not just its order.
+The embedding is registry-driven: `[time_embedding].type` selects it. The
+default, `sinusoidal`, passes sinusoidal features at log-spaced periods through
+a zero-initialized projection, and `none` disables it. A new technique
+(learned, Fourier, …) is registered as a small addition and selected via
+config. Grounding outputs are a separate follow-up (see below).
 
 ## Token budget
 
@@ -94,6 +97,11 @@ time, so it is set in the TOML, not via a `--vlm.arch=` CLI override.)
 - **Grounding outputs are a follow-up** — per-frame timestamps are encoded (see
   above), but structured grounding (`<points>`/`<tracks>` outputs with point-F1
   / track-J&F eval) is not yet implemented.
+- **Sequence-modifying time encodings are a separate hook** — the
+  `[time_embedding]` registry is for *additive* per-frame embeddings (no change
+  to sequence length). Molmo2-style interleaved text time-tokens change the
+  token sequence and need interleaved/variable-length sequence support KF does
+  not have yet; they would hook the sequence-assembly layer, not this registry.
 - **Padded frames are not yet masked from attention** — short clips pad to
   `max_frames` with blank frames; a `frame_mask` is produced but not yet
   consumed by the attention mask.
diff --git a/kempnerforge/config/job.py b/kempnerforge/config/job.py
@@ -14,6 +14,7 @@
 from kempnerforge.config.optimizer import OptimizerConfig
 from kempnerforge.config.profiling import ProfilingConfig
 from kempnerforge.config.scheduler import SchedulerConfig
+from kempnerforge.config.time_embedding import TimeEmbeddingConfig
 from kempnerforge.config.training import TrainConfig
 from kempnerforge.config.video import VideoConfig
 from kempnerforge.config.vision import VisionEncoderConfig
@@ -53,6 +54,7 @@ class JobConfig:
     adapter: AdapterConfig | None = None
     vlm: VLMConfig | None = None
     video: VideoConfig | None = None
+    time_embedding: TimeEmbeddingConfig | None = None
 
     def __post_init__(self) -> None:
         """Cross-section invariants that fire at construction time.
diff --git a/kempnerforge/config/registry.py b/kempnerforge/config/registry.py
@@ -186,6 +186,27 @@ def get_adapter(self, name: str) -> Callable:
     def list_adapters(self) -> list[str]:
         return self.list("adapter")
 
+    def register_time_embedding(self, name: str) -> Callable:
+        """Decorator to register a time-embedding builder.
+
+        Builders take ``(dim, **kwargs)`` and return an ``nn.Module`` mapping
+        per-frame timestamps ``(B, F)`` in seconds to an additive embedding
+        ``(B, F, dim)`` (and exposing ``reset_parameters()`` for meta-device
+        builds). Selected by ``[time_embedding].type`` on the VLM video path.
+        """
+
+        def decorator(fn: Callable) -> Callable:
+            self.register("time_embedding", name, fn)
+            return fn
+
+        return decorator
+
+    def get_time_embedding(self, name: str) -> Callable:
+        return self.get("time_embedding", name)
+
+    def list_time_embeddings(self) -> list[str]:
+        return self.list("time_embedding")
+
     def register_dyn_ckpt_strategy(self, name: str) -> Callable:
         """Decorator to register a dynamic-checkpointing-window strategy.
 
diff --git a/kempnerforge/config/schema.py b/kempnerforge/config/schema.py
@@ -15,6 +15,7 @@
 from kempnerforge.config.optimizer import OptimizerConfig  # noqa: F401
 from kempnerforge.config.profiling import ProfilingConfig  # noqa: F401
 from kempnerforge.config.scheduler import SchedulerConfig, SchedulerType  # noqa: F401
+from kempnerforge.config.time_embedding import TimeEmbeddingConfig  # noqa: F401
 from kempnerforge.config.training import ActivationCheckpointing, TrainConfig  # noqa: F401
 from kempnerforge.config.vision import VisionEncoderConfig  # noqa: F401
 from kempnerforge.config.vlm import (  # noqa: F401
diff --git a/kempnerforge/config/time_embedding.py b/kempnerforge/config/time_embedding.py
@@ -0,0 +1,73 @@
+"""Time-embedding (per-frame timestamp) configuration.
+
+``TimeEmbeddingConfig`` selects which per-frame timestamp embedding the VLM
+video path uses and parameterizes it. Dispatched via the ``time_embedding``
+registry at build time (see ``kempnerforge/model/frame_time.py``).
+
+In TOML, ``[time_embedding]`` is a top-level section parallel to ``[adapter]``.
+It is only consumed for video (``frames_per_clip > 1``); the image and text
+paths never build one. ``type = "none"`` disables the embedding even for video.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any
+
+from kempnerforge.config.registry import registry
+
+
+@dataclass
+class TimeEmbeddingConfig:
+    """Selects the time-embedding type and parameterizes it.
+
+    Register a new technique via ``@registry.register_time_embedding`` and select
+    it with ``type``; ``type = "none"`` disables the embedding entirely.
+
+    Fields:
+        type: Registry key for the builder (``"sinusoidal"`` default, or ``"none"``).
+        num_bands: Number of sinusoidal frequency bands (``"sinusoidal"`` only).
+        min_period: Shortest period in seconds (finest temporal resolution).
+        max_period: Longest period in seconds (coarsest temporal scale).
+    """
+
+    type: str = "sinusoidal"
+    num_bands: int = 16
+    min_period: float = 0.5
+    max_period: float = 256.0
+
+    def __post_init__(self) -> None:
+        if self.type == "none":
+            return
+        # Late import: importing the module triggers the
+        # ``@registry.register_time_embedding`` decorators. Doing it at module
+        # scope would create a circular import via the config/model graph.
+        import kempnerforge.model.frame_time  # noqa: F401, PLC0415
+
+        registered = tuple(registry.list_time_embeddings())
+        if self.type not in registered:
+            raise ValueError(
+                f"Unknown time_embedding.type: {self.type!r}. "
+                f"Registered: {sorted(registered)} (or 'none' to disable)."
+            )
+        if self.num_bands <= 0:
+            raise ValueError(f"time_embedding.num_bands must be positive (got {self.num_bands})")
+        if not 0.0 < self.min_period < self.max_period:
+            raise ValueError(
+                f"time_embedding requires 0 < min_period < max_period "
+                f"(got min_period={self.min_period}, max_period={self.max_period})"
+            )
+
+    @property
+    def enabled(self) -> bool:
+        """Whether a module should be built (``type != "none"``)."""
+        return self.type != "none"
+
+    def extra_kwargs(self) -> dict[str, Any]:
+        """Builder kwargs beyond ``dim``. Type-specific builders take what they
+        need and swallow the rest via ``**_`` (mirrors ``AdapterConfig``)."""
+        return {
+            "num_bands": self.num_bands,
+            "min_period": self.min_period,
+            "max_period": self.max_period,
+        }
diff --git a/kempnerforge/distributed/parallel.py b/kempnerforge/distributed/parallel.py
@@ -381,6 +381,7 @@ def _build_vlm(
     compile_model: bool,
     fp8: bool,
     frames_per_clip: int = 1,
+    time_embedding_config=None,
 ) -> torch.nn.Module:
     """Build a VLM wrapper with parallelism applied in the correct order.
 
@@ -406,7 +407,7 @@ def _build_vlm(
     from kempnerforge.distributed.expert_parallel import apply_expert_parallel
     from kempnerforge.distributed.tensor_parallel import apply_tensor_parallel
     from kempnerforge.model.adapter import build_adapter
-    from kempnerforge.model.frame_time import FrameTimeEmbedding
+    from kempnerforge.model.frame_time import build_time_embedding
     from kempnerforge.model.transformer import Transformer
     from kempnerforge.model.vlm import (
         VLMWrapper,
@@ -442,9 +443,14 @@ def _build_vlm(
         transformer = Transformer(
             model_config, vlm_config=vlm_config, num_image_tokens=visual_tokens
         )
-        # Video gets a per-frame timestamp embedding; built alongside the adapter
-        # so it shares the meta/CPU build + materialize path below.
-        frame_time_embed = FrameTimeEmbedding(model_config.dim) if frames_per_clip > 1 else None
+        # Video gets a per-frame timestamp embedding (registry-selected via
+        # [time_embedding]); built alongside the adapter so it shares the
+        # meta/CPU build + materialize path below.
+        frame_time_embed = (
+            build_time_embedding(time_embedding_config, model_config.dim)
+            if frames_per_clip > 1
+            else None
+        )
 
     strategy = build_modality_strategy(vlm_config)
     wrapper = VLMWrapper(
@@ -535,6 +541,7 @@ def build_parallel_model(
     compile_model: bool = False,
     fp8: bool = False,
     frames_per_clip: int = 1,
+    time_embedding_config=None,
 ) -> torch.nn.Module:
     """Build a Transformer (or a VLMWrapper) with parallelism applied.
 
@@ -579,6 +586,7 @@ def build_parallel_model(
             compile_model=compile_model,
             fp8=fp8,
             frames_per_clip=frames_per_clip,
+            time_embedding_config=time_embedding_config,
         )
 
     from kempnerforge.distributed.tensor_parallel import apply_tensor_parallel
diff --git a/kempnerforge/model/frame_time.py b/kempnerforge/model/frame_time.py
@@ -23,12 +23,42 @@
 from __future__ import annotations
 
 import math
+from typing import Any
 
 import torch
 import torch.nn as nn
 
+from kempnerforge.config.registry import registry
 
-class FrameTimeEmbedding(nn.Module):
+
+class TimeEmbedding(nn.Module):
+    """Base for per-frame timestamp embeddings (the *additive* family).
+
+    Contract: ``forward(times: (B, F) seconds) -> (B, F, dim)`` — an additive
+    embedding added to each frame's visual tokens, with **no change to sequence
+    length** — plus ``reset_parameters()`` so meta-device builds can re-init
+    after ``to_empty``. Register a new technique with
+    ``@registry.register_time_embedding`` and select it via
+    ``[time_embedding].type``; ``build_time_embedding`` dispatches through the
+    registry.
+
+    Out of scope (a separate, future integration point): sequence-*modifying*
+    time encodings — e.g. Molmo2-style textual time-tokens interleaved between
+    frame groups — change the token sequence (count / ``output_slice`` /
+    ``modality_ids`` / MoT split) and need tokenizer + interleaved-sequence
+    support KF does not have yet. Those would hook the sequence-assembly layer
+    (``ModalityStrategy.prepare``), not this additive registry; a run that uses
+    them would set ``[time_embedding].type = "none"`` to disable the additive one.
+    """
+
+    def forward(self, times: torch.Tensor) -> torch.Tensor:  # pragma: no cover - interface
+        raise NotImplementedError
+
+    def reset_parameters(self) -> None:  # pragma: no cover - interface
+        raise NotImplementedError
+
+
+class FrameTimeEmbedding(TimeEmbedding):
     """Sinusoidal embedding of a per-frame timestamp (seconds) -> model dim.
 
     Args:
@@ -93,3 +123,37 @@ def forward(self, times: torch.Tensor) -> torch.Tensor:
         ang = times.to(torch.float32).unsqueeze(-1) * (2.0 * math.pi / periods)  # (B, F, bands)
         feats = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)  # (B, F, 2*bands)
         return self.proj(feats.to(self.proj.weight.dtype))
+
+
+@registry.register_time_embedding("sinusoidal")
+def _build_sinusoidal(
+    dim: int,
+    *,
+    num_bands: int = 16,
+    min_period: float = 0.5,
+    max_period: float = 256.0,
+    **_: Any,
+) -> FrameTimeEmbedding:
+    """Registry builder for the sinusoidal time embedding."""
+    return FrameTimeEmbedding(
+        dim, num_bands=num_bands, min_period=min_period, max_period=max_period
+    )
+
+
+def build_time_embedding(time_embedding_config: Any, dim: int) -> TimeEmbedding | None:
+    """Build the per-frame time embedding from a ``TimeEmbeddingConfig``.
+
+    Returns ``None`` when disabled (``type == "none"``). A ``None`` config falls
+    back to the default (sinusoidal) so video callers that pass nothing keep the
+    default behavior. The config is duck-typed (``.enabled`` / ``.type`` /
+    ``.extra_kwargs()``) to avoid a model->config import cycle, matching
+    ``build_adapter``.
+    """
+    if time_embedding_config is None:
+        from kempnerforge.config.time_embedding import TimeEmbeddingConfig  # noqa: PLC0415
+
+        time_embedding_config = TimeEmbeddingConfig()
+    if not time_embedding_config.enabled:
+        return None
+    builder = registry.get_time_embedding(time_embedding_config.type)
+    return builder(dim, **time_embedding_config.extra_kwargs())
diff --git a/kempnerforge/model/vlm.py b/kempnerforge/model/vlm.py
@@ -45,10 +45,11 @@
 from kempnerforge.config.adapter import AdapterConfig
 from kempnerforge.config.registry import registry
 from kempnerforge.config.schema import ModelConfig
+from kempnerforge.config.time_embedding import TimeEmbeddingConfig
 from kempnerforge.config.vision import VisionEncoderConfig
 from kempnerforge.config.vlm import FreezeSpec, VLMConfig
 from kempnerforge.model.adapter import VisionAdapter, build_adapter
-from kempnerforge.model.frame_time import FrameTimeEmbedding
+from kempnerforge.model.frame_time import TimeEmbedding, build_time_embedding
 from kempnerforge.model.modality import ModalityContext
 from kempnerforge.model.transformer import Transformer
 from kempnerforge.model.vision import VisionEncoder
@@ -309,7 +310,7 @@ def __init__(
         transformer: Transformer,
         strategy: ModalityStrategy,
         frames_per_clip: int = 1,
-        frame_time_embed: FrameTimeEmbedding | None = None,
+        frame_time_embed: TimeEmbedding | None = None,
     ) -> None:
         super().__init__()
         self.vision_encoder = vision_encoder
@@ -391,6 +392,7 @@ def build_vlm_wrapper(
     adapter_config: AdapterConfig,
     vlm_config: VLMConfig,
     frames_per_clip: int = 1,
+    time_embedding_config: TimeEmbeddingConfig | None = None,
 ) -> VLMWrapper:
     """Build a ``VLMWrapper`` from the four top-level configs.
 
@@ -434,8 +436,13 @@ def build_vlm_wrapper(
         )
     transformer = Transformer(model_config, vlm_config=vlm_config, num_image_tokens=visual_tokens)
     strategy = build_modality_strategy(vlm_config)
-    # Video clips get a per-frame timestamp embedding; the image path (F=1) does not.
-    frame_time_embed = FrameTimeEmbedding(model_config.dim) if frames_per_clip > 1 else None
+    # Video clips get a per-frame timestamp embedding (registry-selected via
+    # [time_embedding]); the image path (F=1) does not. type="none" disables it.
+    frame_time_embed = (
+        build_time_embedding(time_embedding_config, model_config.dim)
+        if frames_per_clip > 1
+        else None
+    )
     return VLMWrapper(
         encoder,
         adapter,
diff --git a/scripts/train.py b/scripts/train.py
@@ -175,6 +175,7 @@ def main() -> None:
             adapter_config=adapter_cfg,
             vlm_config=vlm_cfg,
             frames_per_clip=(config.video.max_frames if config.video is not None else 1),
+            time_embedding_config=config.time_embedding,
             ac_mode=tc.activation_checkpointing,
             mp_policy=mp_policy,
             param_dtype=tc.param_dtype,
diff --git a/tests/unit/test_frame_time.py b/tests/unit/test_frame_time.py
diff --git a/tests/unit/test_time_embedding_config.py b/tests/unit/test_time_embedding_config.py
diff --git a/tests/unit/test_vlm.py b/tests/unit/test_vlm.py