KempnerInstitute · amazloumi · Jun 26, 2026 · Jun 26, 2026 · Jun 27, 2026 · Jun 29, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -65,7 +65,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - **Video data path (pluggable, registry-driven).** `kempnerforge/data/video_io.py`: timestamp-based frame sampling (target fps, first & last frame kept — Molmo2 §3.1/§A) registered as the `"uniform"` sampling policy and selectable via `[video].sampling_policy`; PyAV decode (lazy-imported). `kempnerforge/data/video_dataset.py`: a `VideoDataset` base + the WebVid-style `WebVidVideoDataset` (CSV manifest + `id[:2]/id[:4]/id[:6]/id.mp4` mapping) registered as `"webvid"` via `@registry.register_video_dataset`, plus a `build_video_dataset` dispatch — so other dataset styles are additive registrations selected by `[video].dataset_type`. The WebVid corpus directory is parameterized by `[video].dataset_name` (no longer hardcoded to `webvid-10M`). `VideoCollator` → `(B, F, 3, H, W)` + a frame-validity mask; an undecodable clip is masked out (no loss). `kempnerforge/config/registry.py`: `register_video_dataset` / `register_sampling_policy` registries. `kempnerforge/config/video.py`: the `[video]` `VideoConfig` section (`data_root`, `dataset_type`, `dataset_name`, `sampling_policy`, `split`, `fps`, `max_frames`, `min_frames`, `frame_size`, `max_samples`), wired into `JobConfig` (+ `is_video`). `av` is an optional `video` dependency group (`uv sync --group video`); CI installs it for the lint + unit-test jobs.
   - **Frame-aware model + training wiring.** `kempnerforge/model/vlm.py`: `_project_image_features` → `_project_visual_features` folds the frame axis through the encoder + pooler to `(B, F·P′, dim)` (a single image is the `F == 1` case). `VLMWrapper` gains `frames_per_clip`, threaded through `build_parallel_model` / `_build_vlm` / `build_vlm_wrapper` so the static visual-token count equals `F·P′` (drives the residual budget and MoT's positional split; static == runtime). `scripts/train.py` builds the video dataset/collator when `[video]` is set. Adds `configs/train/vlm_video_webvid.toml` (SigLIP2 + avgpool + WebVid).
   - Tests: `tests/unit/test_video_io.py`, `test_video_dataset.py`, `test_video_config.py`; video-forward cases (all four archs) + image-path regression in `test_vlm.py`; pooling-adapter cases in `test_adapter.py`. Docs: `docs/how-to/train-on-video.md`.
-  - Deferred (follow-ups; the registries make these additive): more video dataset styles (HuggingFace video sets, flat folders, alternate manifests) and frame-sampling policies; per-frame timestamp tokens + grounding (`<points>`/`<tracks>` outputs with point-F1 / track-J&F eval), bidirectional visual attention, VLM sequence packing, long-context (blocked on context-parallel being wired), and warm-start from a converted image-VLM checkpoint.
+  - Deferred (follow-ups; the registries make these additive): more video dataset styles (HuggingFace video sets, flat folders, alternate manifests) and frame-sampling policies; interleaved text time-tokens (Molmo2-style, sequence-modifying), grounding (`<points>`/`<tracks>` outputs with point-F1 / track-J&F eval), bidirectional visual attention, VLM sequence packing, long-context (blocked on context-parallel being wired), and warm-start from a converted image-VLM checkpoint.
+- **Per-frame timestamps for video.** Each sampled frame carries its actual presentation time (seconds), embedded and added to that frame's visual tokens so the model can reason about *when* events occur, not just frame order. The embedding is registry-driven and config-selected (so new techniques drop in as small additions), and its projection is zero-initialized, so it is the identity at step 0 (warm-start safe) and learned from there.
+  - `kempnerforge/data/video_io.py`: `decode_video_frames` returns `(frames, times)` (the matched frames' presentation times); `kempnerforge/data/video_dataset.py` emits a `frame_times` `(F,)` tensor and `VideoCollator` stacks it to `(B, F)`.
+  - `kempnerforge/model/frame_time.py`: a `TimeEmbedding` base (the additive `(B, F) seconds → (B, F, dim)` contract) + the `"sinusoidal"` implementation, registered via `@registry.register_time_embedding` and built through `build_time_embedding`. Applied per frame in `_project_visual_features` as a `VLMWrapper` submodule (video only; `None` for the image path) and built + FSDP-sharded + meta-materialized at both build sites (`build_vlm_wrapper`, `_build_vlm`).
+  - `kempnerforge/config/time_embedding.py`: the `[time_embedding]` `TimeEmbeddingConfig` (`type` selects the registered builder; `type = "none"` disables it), wired into `JobConfig` and threaded through `build_parallel_model`; `scripts/train.py` passes `config.time_embedding` and threads `frame_times` into the forward. A non-video config that still sets an enabled `[time_embedding]` (any `type` other than `"none"`) gets a set-but-ineffective warning (`config/job.py`, mirroring the HF-encoder-override warning).
+  - `kempnerforge/config/vlm.py`: `frame_time_embed` added to `DEFAULT_MODULE_PATTERNS`, so the submodule is freeze-addressable (a freeze spec or schedule can stage it, like the adapter).
+  - Sequence-*modifying* time encodings (e.g. Molmo2-style interleaved text time-tokens) are a separate future hook at the sequence-assembly layer, gated on interleaved/variable-length sequence support — out of scope for this additive registry.
+  - Tests: `tests/unit/test_frame_time.py`, `test_time_embedding_config.py`; frame-time forward + `type="none"` + state_dict round-trip + freeze-addressability cases in `test_vlm.py`; the set-but-ineffective warning in `test_config.py`; the video build path in `test_distributed.py`.
 - **Padded frames masked from attention (all four archs).** Short/undecodable video clips pad to `max_frames` with blank frames; the `frame_mask` is now consumed so real tokens never attend to padded-frame visual tokens. One per-token validity mask, `ModalityContext.key_padding_mask` `(B, S)`, threads through the model: the shared `Attention` ANDs it with the causal (and doc) mask — covering Joint-Decoder and MoMa — `MoTAttention` builds an explicit causal-AND-valid mask, and Cross-Attention masks the padded image K/V via its existing `image_mask`. A NaN guard unmasks fully-masked query rows (an all-padded clip) so softmax stays finite. It is a **pure mask — no new state-dict keys**, so checkpoints stay compatible both ways; the image (`F=1`) and text paths are unchanged (no mask is built, so they keep the FlashAttention-2 path). Note: for the image-prefix arches (Joint-Decoder/MoT/MoMa) video self-attention always takes the explicit-mask SDPA path (FA2 disabled, a `(B,1,S,S)` mask materialized) even for fully-decoded clips — the result is identical to causal-only but not free; a deliberate `torch.compile`/DP-friendly trade-off (one graph, no host sync), with FA2-recovery / FlexAttention left as a follow-up. (Cross-Attention keeps FA2 on its text self-attention and only masks padded image K/V in the cross-attention blocks.) Foundation for variable-length / mixed image+video batches.
   - `kempnerforge/model/modality.py`: `ModalityContext.key_padding_mask` field (+ invariant). `kempnerforge/model/vlm.py`: `_visual_token_mask` expands `frame_mask (B,F)` → `(B, F·P′)`; the four strategies place it (image-prefix arches → `key_padding_mask`, Cross-Attention → `image_mask`). `attention.py` / `mot.py` / `cross_attention.py` consume the mask (+ NaN guard); `moma.py`'s `MoMaFFN` also excludes padded positions from expert-choice routing (so padded tokens never consume expert capacity). `scripts/train.py` threads `batch["frame_mask"]` into the forward.
   - Deferred: MoT configured with an *MoE* FFN still routes padded tokens through the shared `MoEMLP` — a follow-up ("generic token-validity in MoE") would mask that and padded text alike. MoT-dense (the default) and MoMa are fully masked.
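
The masking entry above can be illustrated with a small pure-Python sketch. It is not the project's implementation: the function names are hypothetical, plain nested lists stand in for tensors, and unmasking an entire fully-masked row is an assumption about how the NaN guard behaves. It shows the two steps the entry names, expanding `frame_mask (B, F)` to per-token validity and combining that validity with the causal mask.

```python
def visual_token_mask(frame_mask, tokens_per_frame):
    """Expand a per-frame validity mask (B, F) to per-token (B, F * P')."""
    return [
        [valid for valid in row for _ in range(tokens_per_frame)]
        for row in frame_mask
    ]


def causal_and_valid(key_valid):
    """Build an (S, S) boolean attention mask for one sequence.

    Entry [q][k] is True when query q may attend to key k: k must not lie in
    the future (causal) and must be a valid, non-padded position.
    """
    seq_len = len(key_valid)
    mask = [
        [k <= q and key_valid[k] for k in range(seq_len)]
        for q in range(seq_len)
    ]
    # Guard (assumed behavior): a query row with no allowed key would make
    # softmax return NaN, so such rows are unmasked entirely.
    return [row if any(row) else [True] * seq_len for row in mask]


# One clip with two frames, the second one padded, two tokens per frame.
tokens = visual_token_mask([[True, False]], tokens_per_frame=2)

# Visual prefix followed by two text tokens.
mask = causal_and_valid(tokens[0] + [True, True])

# An all-padded clip (two padded visual tokens, one text token).
guarded = causal_and_valid([False, False, True])
```

In this sketch the last text token attends to the two real visual tokens and to the text, but never to the padded-frame positions. In the all-padded case the first two query rows would otherwise have no allowed key, which is where the guard applies.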

diff --git a/docs/how-to/train-on-video.md b/docs/how-to/train-on-video.md
@@ -25,8 +25,14 @@ A clip of `F` frames becomes `F × P′` visual tokens:
      blocks; the residual stays text-only (so it fits more frames per
      `max_seq_len`).
 
-Temporal order is carried by frame order (sequential positions). Per-frame
-timestamp tokens and grounding outputs are a separate follow-up (see below).
+Temporal order is carried by frame order (sequential positions). On top of that,
+each frame's **timestamp in seconds** is embedded and added to that frame's
+visual tokens, so the model sees *when* each frame occurs, not just its order.
+The embedding is registry-driven: `[time_embedding].type` selects it
+(`sinusoidal` by default — sinusoidal features at log-spaced periods through a
+zero-initialized projection; `none` disables it), so new techniques (learned,
+Fourier, …) register as small additions and switch via config. Grounding
+outputs are a separate follow-up (see below).
 
 ## Token budget
 
@@ -99,9 +105,22 @@ time, so it is set in the TOML, not via a `--vlm.arch=` CLI override.)
 
 ## Constraints and follow-ups
 
-- **Causal attention; no per-frame timestamps yet** — temporal order is frame
-  order. Per-frame timestamp tokens + grounding (`<points>`/`<tracks>` outputs
-  with point-F1 / track-J&F eval) are a follow-up.
+- **Grounding outputs are a follow-up** — per-frame timestamps are encoded (see
+  above), but structured grounding (`<points>`/`<tracks>` outputs with point-F1
+  / track-J&F eval) is not yet implemented.
+- **Sequence-modifying time encodings are a separate hook** — the
+  `[time_embedding]` registry is for *additive* per-frame embeddings (no change
+  to sequence length). Molmo2-style interleaved text time-tokens change the
+  token sequence and need interleaved/variable-length sequence support KF does
+  not have yet; they would hook the sequence-assembly layer, not this registry.
+- **Inference must pass `frame_times`** — a video model silently drops the
+  learned temporal signal if `frame_times` is `None` (no error is raised).
+  Training threads it automatically; eval/generate paths must pass it for video
+  models.
+- **Resuming a pre-timestamp video checkpoint** — a checkpoint trained before
+  per-frame timestamps lacks the `frame_time_embed` keys, so loading it into the
+  current (default-on) video model needs `[time_embedding].type = "none"` or a
+  warm-start key-fill.
 - **Padded frames are masked from attention** — short/undecodable clips pad to
   `max_frames` with blank frames, and the `frame_mask` is consumed so real
   tokens never attend to padded-frame visual tokens (MoMa also drops them from

diff --git a/kempnerforge/config/job.py b/kempnerforge/config/job.py
@@ -14,6 +14,7 @@
 from kempnerforge.config.optimizer import OptimizerConfig
 from kempnerforge.config.profiling import ProfilingConfig
 from kempnerforge.config.scheduler import SchedulerConfig
+from kempnerforge.config.time_embedding import TimeEmbeddingConfig
 from kempnerforge.config.training import TrainConfig
 from kempnerforge.config.video import VideoConfig
 from kempnerforge.config.vision import VisionEncoderConfig
@@ -53,6 +54,7 @@ class JobConfig:
     adapter: AdapterConfig | None = None
     vlm: VLMConfig | None = None
     video: VideoConfig | None = None
+    time_embedding: TimeEmbeddingConfig | None = None
 
     def __post_init__(self) -> None:
         """Cross-section invariants that fire at construction time.
@@ -124,6 +126,22 @@ def __post_init__(self) -> None:
                 "the VLM wrapper, so a [vlm] section (and [vision_encoder]) is required."
             )
 
+        # Set-but-ineffective [time_embedding] warning. The per-frame time
+        # embedding is built only for video (frames_per_clip > 1); an explicit,
+        # enabled [time_embedding] on a non-video config is silently ignored, so
+        # warn (mirrors the HF-encoder-override warning above). type="none" is
+        # an intentional disable and stays quiet.
+        if self.time_embedding is not None and self.time_embedding.enabled and self.video is None:
+            import logging
+
+            logging.getLogger(__name__).warning(
+                "[time_embedding] is set (type=%r) but no [video] section is present; "
+                "the time embedding is built only for video (frames_per_clip > 1), so it "
+                'will be ignored. Set [time_embedding].type = "none" or remove the section '
+                "to silence this.",
+                self.time_embedding.type,
+            )
+
     @property
     def is_vlm(self) -> bool:
         """Whether this job builds a ``VLMWrapper`` around the text backbone."""

diff --git a/kempnerforge/config/registry.py b/kempnerforge/config/registry.py
@@ -227,6 +227,27 @@ def get_sampling_policy(self, name: str) -> Callable:
     def list_sampling_policies(self) -> list[str]:
         return self.list("sampling_policy")
 
+    def register_time_embedding(self, name: str) -> Callable:
+        """Decorator to register a time-embedding builder.
+
+        Builders take ``(dim, **kwargs)`` and return an ``nn.Module`` mapping
+        per-frame timestamps ``(B, F)`` in seconds to an additive embedding
+        ``(B, F, dim)`` (and exposing ``reset_parameters()`` for meta-device
+        builds). Selected by ``[time_embedding].type`` on the VLM video path.
+        """
+
+        def decorator(fn: Callable) -> Callable:
+            self.register("time_embedding", name, fn)
+            return fn
+
+        return decorator
+
+    def get_time_embedding(self, name: str) -> Callable:
+        return self.get("time_embedding", name)
+
+    def list_time_embeddings(self) -> list[str]:
+        return self.list("time_embedding")
+
     def register_dyn_ckpt_strategy(self, name: str) -> Callable:
         """Decorator to register a dynamic-checkpointing-window strategy.
 

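The registration pattern in this hunk can be sketched in isolation. `MiniRegistry`, its `names` method, and the `"constant"` builder are hypothetical stand-ins, not the project's registry. Real builders return an `nn.Module`, while the toy builder here returns a plain function so the example runs without PyTorch.

```python
from collections.abc import Callable


class MiniRegistry:
    """Hypothetical stand-in for the project's registry object."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], Callable] = {}

    def register(self, category: str, name: str, fn: Callable) -> None:
        self._entries[(category, name)] = fn

    def get(self, category: str, name: str) -> Callable:
        return self._entries[(category, name)]

    def names(self, category: str) -> list[str]:
        return sorted(n for (c, n) in self._entries if c == category)

    def register_time_embedding(self, name: str) -> Callable:
        """Decorator form: registers the builder and returns it unchanged."""

        def decorator(fn: Callable) -> Callable:
            self.register("time_embedding", name, fn)
            return fn

        return decorator


registry = MiniRegistry()


@registry.register_time_embedding("constant")
def build_constant(dim: int, value: float = 0.0, **_: object) -> Callable:
    """Toy builder: maps (B, F) timestamps to a constant (B, F, dim) output."""

    def embed(times: list[list[float]]) -> list[list[list[float]]]:
        return [[[value] * dim for _t in row] for row in times]

    return embed


# Config-style dispatch: look the builder up by name, pass dim plus extra kwargs.
builder = registry.get("time_embedding", "constant")
embed = builder(4, num_bands=16)  # kwargs the builder does not use land in **_
out = embed([[0.0, 0.5, 1.0]])    # B = 1, F = 3
```

Because the decorator returns the function unchanged, the builder stays importable and directly callable. Lookup by name is what lets a config string such as `[time_embedding].type` select it.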
diff --git a/kempnerforge/config/schema.py b/kempnerforge/config/schema.py
@@ -15,6 +15,7 @@
 from kempnerforge.config.optimizer import OptimizerConfig  # noqa: F401
 from kempnerforge.config.profiling import ProfilingConfig  # noqa: F401
 from kempnerforge.config.scheduler import SchedulerConfig, SchedulerType  # noqa: F401
+from kempnerforge.config.time_embedding import TimeEmbeddingConfig  # noqa: F401
 from kempnerforge.config.training import ActivationCheckpointing, TrainConfig  # noqa: F401
 from kempnerforge.config.vision import VisionEncoderConfig  # noqa: F401
 from kempnerforge.config.vlm import (  # noqa: F401

diff --git a/kempnerforge/config/time_embedding.py b/kempnerforge/config/time_embedding.py
@@ -0,0 +1,73 @@
+"""Time-embedding (per-frame timestamp) configuration.
+
+``TimeEmbeddingConfig`` selects which per-frame timestamp embedding the VLM
+video path uses and parameterizes it. Dispatched via the ``time_embedding``
+registry at build time (see ``kempnerforge/model/frame_time.py``).
+
+In TOML, ``[time_embedding]`` is a top-level section parallel to ``[adapter]``.
+It is only consumed for video (``frames_per_clip > 1``); the image and text
+paths never build one. ``type = "none"`` disables the embedding even for video.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any
+
+from kempnerforge.config.registry import registry
+
+
+@dataclass
+class TimeEmbeddingConfig:
+    """Selects the time-embedding type and parameterizes it.
+
+    Register a new technique via ``@registry.register_time_embedding`` and select
+    it with ``type``; ``type = "none"`` disables the embedding entirely.
+
+    Fields:
+        type: Registry key for the builder (``"sinusoidal"`` default, or ``"none"``).
+        num_bands: Number of sinusoidal frequency bands (``"sinusoidal"`` only).
+        min_period: Shortest period in seconds (finest temporal resolution).
+        max_period: Longest period in seconds (coarsest temporal scale).
+    """
+
+    type: str = "sinusoidal"
+    num_bands: int = 16
+    min_period: float = 0.5
+    max_period: float = 256.0
+
+    def __post_init__(self) -> None:
+        if self.type == "none":
+            return
+        # Late import: importing the module triggers the
+        # ``@registry.register_time_embedding`` decorators. Doing it at module
+        # scope would create a circular import via the config/model graph.
+        import kempnerforge.model.frame_time  # noqa: F401, PLC0415
+
+        registered = tuple(registry.list_time_embeddings())
+        if self.type not in registered:
+            raise ValueError(
+                f"Unknown time_embedding.type: {self.type!r}. "
+                f"Registered: {sorted(registered)} (or 'none' to disable)."
+            )
+        if self.num_bands <= 0:
+            raise ValueError(f"time_embedding.num_bands must be positive (got {self.num_bands})")
+        if not 0.0 < self.min_period < self.max_period:
+            raise ValueError(
+                f"time_embedding requires 0 < min_period < max_period "
+                f"(got min_period={self.min_period}, max_period={self.max_period})"
+            )
+
+    @property
+    def enabled(self) -> bool:
+        """Whether a module should be built (``type != "none"``)."""
+        return self.type != "none"
+
+    def extra_kwargs(self) -> dict[str, Any]:
+        """Builder kwargs beyond ``dim``. Type-specific builders take what they
+        need and swallow the rest via ``**_`` (mirrors ``AdapterConfig``)."""
+        return {
+            "num_bands": self.num_bands,
+            "min_period": self.min_period,
+            "max_period": self.max_period,
+        }
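
To make the three numeric fields concrete, here is a pure-Python sketch of sinusoidal features at log-spaced periods followed by a zero-initialized projection. It is an illustration under stated assumptions, not the contents of `kempnerforge/model/frame_time.py`: the geometric spacing formula, the sin/cos pairing, the `2π·t/period` convention, and all function names are assumed.

```python
import math


def log_spaced_periods(num_bands=16, min_period=0.5, max_period=256.0):
    """Periods spaced geometrically from min_period to max_period (assumed)."""
    if num_bands == 1:
        return [min_period]
    ratio = max_period / min_period
    return [min_period * ratio ** (i / (num_bands - 1)) for i in range(num_bands)]


def sinusoidal_features(t_seconds, periods):
    """One (sin, cos) pair per period for a single timestamp in seconds."""
    feats = []
    for period in periods:
        angle = 2.0 * math.pi * t_seconds / period
        feats.extend((math.sin(angle), math.cos(angle)))
    return feats


def project(features, weight):
    """Linear projection; weight has dim rows of len(features) columns."""
    return [sum(w * f for w, f in zip(row, features)) for row in weight]


periods = log_spaced_periods()
feats = sinusoidal_features(1.5, periods)

# A zero-initialized projection maps any timestamp to the zero vector, so
# adding the result to the visual tokens changes nothing at step 0.
dim = 8
zero_weight = [[0.0] * len(feats) for _ in range(dim)]
embedding = project(feats, zero_weight)
```

With the default values the periods span 0.5 s to 256 s, so short bands resolve sub-second differences while long bands still vary across multi-minute clips.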
diff --git a/kempnerforge/config/vlm.py b/kempnerforge/config/vlm.py
@@ -47,6 +47,7 @@
     "transformer": ["transformer", "transformer.*"],
     "vision_encoder": ["vision_encoder", "vision_encoder.*"],
     "adapter": ["adapter", "adapter.*"],
+    "frame_time_embed": ["frame_time_embed", "frame_time_embed.*"],
 }