9 changes: 8 additions & 1 deletion CHANGELOG.md
@@ -65,7 +65,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **Video data path (pluggable, registry-driven).** `kempnerforge/data/video_io.py`: timestamp-based frame sampling (target fps, first & last frame kept — Molmo2 §3.1/§A) registered as the `"uniform"` sampling policy and selectable via `[video].sampling_policy`; PyAV decode (lazy-imported). `kempnerforge/data/video_dataset.py`: a `VideoDataset` base + the WebVid-style `WebVidVideoDataset` (CSV manifest + `id[:2]/id[:4]/id[:6]/id.mp4` mapping) registered as `"webvid"` via `@registry.register_video_dataset`, plus a `build_video_dataset` dispatch — so other dataset styles are additive registrations selected by `[video].dataset_type`. The WebVid corpus directory is parameterized by `[video].dataset_name` (no longer hardcoded to `webvid-10M`). `VideoCollator` → `(B, F, 3, H, W)` + a frame-validity mask; an undecodable clip is masked out (no loss). `kempnerforge/config/registry.py`: `register_video_dataset` / `register_sampling_policy` registries. `kempnerforge/config/video.py`: the `[video]` `VideoConfig` section (`data_root`, `dataset_type`, `dataset_name`, `sampling_policy`, `split`, `fps`, `max_frames`, `min_frames`, `frame_size`, `max_samples`), wired into `JobConfig` (+ `is_video`). `av` is an optional `video` dependency group (`uv sync --group video`); CI installs it for the lint + unit-test jobs.
- **Frame-aware model + training wiring.** `kempnerforge/model/vlm.py`: `_project_image_features` → `_project_visual_features` folds the frame axis through the encoder + pooler to `(B, F·P′, dim)` (a single image is the `F == 1` case). `VLMWrapper` gains `frames_per_clip`, threaded through `build_parallel_model` / `_build_vlm` / `build_vlm_wrapper` so the static visual-token count equals `F·P′` (drives the residual budget and MoT's positional split; static == runtime). `scripts/train.py` builds the video dataset/collator when `[video]` is set. Adds `configs/train/vlm_video_webvid.toml` (SigLIP2 + avgpool + WebVid).
- Tests: `tests/unit/test_video_io.py`, `test_video_dataset.py`, `test_video_config.py`; video-forward cases (all four archs) + image-path regression in `test_vlm.py`; pooling-adapter cases in `test_adapter.py`. Docs: `docs/how-to/train-on-video.md`.
- Deferred (follow-ups; the registries make these additive): more video dataset styles (HuggingFace video sets, flat folders, alternate manifests) and frame-sampling policies; per-frame timestamp tokens + grounding (`<points>`/`<tracks>` outputs with point-F1 / track-J&F eval), bidirectional visual attention, VLM sequence packing, long-context (blocked on context-parallel being wired), and warm-start from a converted image-VLM checkpoint.
- Deferred (follow-ups; the registries make these additive): more video dataset styles (HuggingFace video sets, flat folders, alternate manifests) and frame-sampling policies; interleaved text time-tokens (Molmo2-style, sequence-modifying), grounding (`<points>`/`<tracks>` outputs with point-F1 / track-J&F eval), bidirectional visual attention, VLM sequence packing, long-context (blocked on context-parallel being wired), and warm-start from a converted image-VLM checkpoint.
- **Per-frame timestamps for video.** Each sampled frame carries its actual presentation time (seconds), embedded and added to that frame's visual tokens so the model can reason about *when* events occur, not just frame order. Registry-driven and config-selected (so new techniques drop in as small additions); zero-initialized so it is identity at step 0 (warm-start) and learned from there.
- `kempnerforge/data/video_io.py`: `decode_video_frames` returns `(frames, times)` (the matched frames' presentation times); `kempnerforge/data/video_dataset.py` emits a `frame_times` `(F,)` tensor and `VideoCollator` stacks it to `(B, F)`.
- `kempnerforge/model/frame_time.py`: a `TimeEmbedding` base (the additive `(B, F) seconds → (B, F, dim)` contract) + the `"sinusoidal"` implementation, registered via `@registry.register_time_embedding` and built through `build_time_embedding`. Applied per frame in `_project_visual_features` as a `VLMWrapper` submodule (video only; `None` for the image path) and built + FSDP-sharded + meta-materialized at both build sites (`build_vlm_wrapper`, `_build_vlm`).
- `kempnerforge/config/time_embedding.py`: the `[time_embedding]` `TimeEmbeddingConfig` (`type` selects the registered builder; `type = "none"` disables it), wired into `JobConfig` and threaded through `build_parallel_model`; `scripts/train.py` passes `config.time_embedding` and threads `frame_times` into the forward. A non-video config that sets an enabled `[time_embedding]` (any `type` other than `"none"`, including the default `"sinusoidal"`) gets a set-but-ineffective warning (`config/job.py`, mirroring the HF-encoder-override warning).
- `kempnerforge/config/vlm.py`: `frame_time_embed` added to `DEFAULT_MODULE_PATTERNS`, so the submodule is freeze-addressable (a freeze spec or schedule can stage it, like the adapter).
- Sequence-*modifying* time encodings (e.g. Molmo2-style interleaved text time-tokens) are a separate future hook at the sequence-assembly layer, gated on interleaved/variable-length sequence support — out of scope for this additive registry.
- Tests: `tests/unit/test_frame_time.py`, `test_time_embedding_config.py`; frame-time forward + `type="none"` + state_dict round-trip + freeze-addressability cases in `test_vlm.py`; the set-but-ineffective warning in `test_config.py`; the video build path in `test_distributed.py`.
- **Padded frames masked from attention (all four archs).** Short/undecodable video clips pad to `max_frames` with blank frames; the `frame_mask` is now consumed so real tokens never attend to padded-frame visual tokens. One per-token validity mask, `ModalityContext.key_padding_mask` `(B, S)`, threads through the model: the shared `Attention` ANDs it with the causal (and doc) mask — covering Joint-Decoder and MoMa — `MoTAttention` builds an explicit causal-AND-valid mask, and Cross-Attention masks the padded image K/V via its existing `image_mask`. A NaN guard unmasks fully-masked query rows (an all-padded clip) so softmax stays finite. It is a **pure mask — no new state-dict keys**, so checkpoints stay compatible both ways; the image (`F=1`) and text paths are unchanged (no mask is built, so they keep the FlashAttention-2 path). Note: for the image-prefix arches (Joint-Decoder/MoT/MoMa) video self-attention always takes the explicit-mask SDPA path (FA2 disabled, a `(B,1,S,S)` mask materialized) even for fully-decoded clips — the result is identical to causal-only but not free; a deliberate `torch.compile`/DP-friendly trade-off (one graph, no host sync), with FA2-recovery / FlexAttention left as a follow-up. (Cross-Attention keeps FA2 on its text self-attention and only masks padded image K/V in the cross-attention blocks.) Foundation for variable-length / mixed image+video batches.
- `kempnerforge/model/modality.py`: `ModalityContext.key_padding_mask` field (+ invariant). `kempnerforge/model/vlm.py`: `_visual_token_mask` expands `frame_mask (B,F)` → `(B, F·P′)`; the four strategies place it (image-prefix arches → `key_padding_mask`, Cross-Attention → `image_mask`). `attention.py` / `mot.py` / `cross_attention.py` consume the mask (+ NaN guard); `moma.py`'s `MoMaFFN` also excludes padded positions from expert-choice routing (so padded tokens never consume expert capacity). `scripts/train.py` threads `batch["frame_mask"]` into the forward.
- Deferred: MoT configured with an *MoE* FFN still routes padded tokens through the shared `MoEMLP` — a follow-up ("generic token-validity in MoE") would mask that and padded text alike. MoT-dense (the default) and MoMa are fully masked.
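
The masking rule described above (causal AND token-valid, plus a guard for fully-masked query rows) can be sketched in plain Python. This is an illustration only, not the project's implementation: `expand_frame_mask`, `build_attention_mask`, and `masked_softmax` are hypothetical names, the frame-major token order is an assumption, and the guard here falls back to the plain causal row, which is one possible way to "unmask" a row.

```python
import math


def expand_frame_mask(frame_mask: list[bool], tokens_per_frame: int) -> list[bool]:
    """Repeat each frame's validity flag once per visual token: (F,) -> (F*P',)."""
    return [ok for ok in frame_mask for _ in range(tokens_per_frame)]


def build_attention_mask(valid: list[bool]) -> list[list[bool]]:
    """allowed[q][k] is True when query q may attend to key k (causal AND valid)."""
    size = len(valid)
    allowed = [[k <= q and valid[k] for k in range(size)] for q in range(size)]
    for q in range(size):
        if not any(allowed[q]):
            # NaN guard (assumed behavior): a row with no allowed key would make
            # softmax divide 0 by 0, so fall back to the plain causal row.
            allowed[q] = [k <= q for k in range(size)]
    return allowed


def masked_softmax(scores: list[float], allowed_row: list[bool]) -> list[float]:
    """Softmax over one query row; disallowed keys get -inf and weight 0."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed_row)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# One clip: F=2 frames (second one padded), P'=2 tokens per frame, then 2 text tokens.
visual_valid = expand_frame_mask([True, False], tokens_per_frame=2)
mask = build_attention_mask(visual_valid + [True, True])
last_text_weights = masked_softmax([0.1] * 6, mask[5])

# All-padded clip: the visual query rows would be empty without the guard.
all_padded = build_attention_mask(expand_frame_mask([False, False], 2) + [True, True])
```

In this sketch the last text token spreads its attention over the two real visual tokens and the two text tokens, and the padded-frame tokens receive weight zero.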
29 changes: 24 additions & 5 deletions docs/how-to/train-on-video.md
@@ -25,8 +25,14 @@ A clip of `F` frames becomes `F × P′` visual tokens:
blocks; the residual stays text-only (so it fits more frames per
`max_seq_len`).

Temporal order is carried by frame order (sequential positions). Per-frame
timestamp tokens and grounding outputs are a separate follow-up (see below).
Temporal order is carried by frame order (sequential positions). On top of that,
each frame's **timestamp in seconds** is embedded and added to that frame's
visual tokens, so the model sees *when* each frame occurs, not just its order.
The embedding is registry-driven: `[time_embedding].type` selects it
(`sinusoidal` by default — sinusoidal features at log-spaced periods through a
zero-initialized projection; `none` disables it), so new techniques (learned,
Fourier, …) register as small additions and switch via config. Grounding
outputs are a separate follow-up (see below).
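
As a rough illustration of the `sinusoidal` option, the sketch below computes sin/cos features at log-spaced periods and passes them through a zero-initialized projection. It is a minimal pure-Python sketch under stated assumptions, not the code in `kempnerforge/model/frame_time.py`: the function names, the geometric spacing formula, and the `[sin, cos]` feature layout are all hypothetical; only the defaults (`num_bands = 16`, `min_period = 0.5`, `max_period = 256.0`) come from `TimeEmbeddingConfig`.

```python
import math


def log_spaced_periods(num_bands: int, min_period: float, max_period: float) -> list[float]:
    """Return num_bands periods (seconds), geometrically spaced from min to max."""
    if num_bands == 1:
        return [min_period]
    ratio = (max_period / min_period) ** (1.0 / (num_bands - 1))
    return [min_period * ratio**i for i in range(num_bands)]


def sinusoidal_features(t_seconds: float, periods: list[float]) -> list[float]:
    """Map one timestamp to (sin, cos) pairs, one pair per period."""
    feats: list[float] = []
    for period in periods:
        angle = 2.0 * math.pi * t_seconds / period
        feats.extend((math.sin(angle), math.cos(angle)))
    return feats


def project(feats: list[float], weight: list[list[float]]) -> list[float]:
    """Bias-free linear projection: out[j] = sum_i feats[i] * weight[i][j]."""
    dim = len(weight[0])
    return [sum(f * row[j] for f, row in zip(feats, weight)) for j in range(dim)]


periods = log_spaced_periods(num_bands=16, min_period=0.5, max_period=256.0)
feats = sinusoidal_features(t_seconds=1.25, periods=periods)

# Zero-initialized projection: the additive embedding is all zeros at step 0,
# so adding it to a frame's visual tokens leaves them unchanged until training
# moves the weights away from zero.
dim = 8
zero_weight = [[0.0] * dim for _ in feats]
embedding = project(feats, zero_weight)
```

Because the embedding is added rather than concatenated, the visual-token count stays at `F × P′` whichever `type` is selected.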

## Token budget

@@ -99,9 +105,22 @@ time, so it is set in the TOML, not via a `--vlm.arch=` CLI override.)

## Constraints and follow-ups

- **Causal attention; no per-frame timestamps yet** — temporal order is frame
order. Per-frame timestamp tokens + grounding (`<points>`/`<tracks>` outputs
with point-F1 / track-J&F eval) are a follow-up.
- **Grounding outputs are a follow-up** — per-frame timestamps are encoded (see
above), but structured grounding (`<points>`/`<tracks>` outputs with point-F1
/ track-J&F eval) is not yet implemented.
- **Sequence-modifying time encodings are a separate hook** — the
`[time_embedding]` registry is for *additive* per-frame embeddings (no change
to sequence length). Molmo2-style interleaved text time-tokens change the
token sequence and need interleaved/variable-length sequence support that
KempnerForge does not have yet; they would hook into the sequence-assembly
layer, not this registry.
- **Inference must pass `frame_times`** — a video model silently drops the
learned temporal signal if `frame_times` is `None` (no error is raised).
Training threads it automatically; eval/generate paths must pass it for video
models.
- **Resuming a pre-timestamp video checkpoint** — a checkpoint trained before
per-frame timestamps lacks the `frame_time_embed` keys, so loading it into the
current (default-on) video model needs `[time_embedding].type = "none"` or a
warm-start key-fill.
- **Padded frames are masked from attention** — short/undecodable clips pad to
`max_frames` with blank frames, and the `frame_mask` is consumed so real
tokens never attend to padded-frame visual tokens (MoMa also drops them from
18 changes: 18 additions & 0 deletions kempnerforge/config/job.py
@@ -14,6 +14,7 @@
from kempnerforge.config.optimizer import OptimizerConfig
from kempnerforge.config.profiling import ProfilingConfig
from kempnerforge.config.scheduler import SchedulerConfig
from kempnerforge.config.time_embedding import TimeEmbeddingConfig
from kempnerforge.config.training import TrainConfig
from kempnerforge.config.video import VideoConfig
from kempnerforge.config.vision import VisionEncoderConfig
Expand Down Expand Up @@ -53,6 +54,7 @@ class JobConfig:
adapter: AdapterConfig | None = None
vlm: VLMConfig | None = None
video: VideoConfig | None = None
time_embedding: TimeEmbeddingConfig | None = None

def __post_init__(self) -> None:
"""Cross-section invariants that fire at construction time.
Expand Down Expand Up @@ -124,6 +126,22 @@ def __post_init__(self) -> None:
"the VLM wrapper, so a [vlm] section (and [vision_encoder]) is required."
)

# Set-but-ineffective [time_embedding] warning. The per-frame time
# embedding is built only for video (frames_per_clip > 1); an explicit,
# enabled [time_embedding] on a non-video config is silently ignored, so
# warn (mirrors the HF-encoder-override warning above). type="none" is
# an intentional disable and stays quiet.
if self.time_embedding is not None and self.time_embedding.enabled and self.video is None:
import logging

logging.getLogger(__name__).warning(
"[time_embedding] is set (type=%r) but no [video] section is present; "
"the time embedding is built only for video (frames_per_clip > 1), so it "
'will be ignored. Set [time_embedding].type = "none" or remove the section '
"to silence this.",
self.time_embedding.type,
)

@property
def is_vlm(self) -> bool:
"""Whether this job builds a ``VLMWrapper`` around the text backbone."""
21 changes: 21 additions & 0 deletions kempnerforge/config/registry.py
@@ -227,6 +227,27 @@ def get_sampling_policy(self, name: str) -> Callable:
    def list_sampling_policies(self) -> list[str]:
        return self.list("sampling_policy")

    def register_time_embedding(self, name: str) -> Callable:
        """Decorator to register a time-embedding builder.

        Builders take ``(dim, **kwargs)`` and return an ``nn.Module`` mapping
        per-frame timestamps ``(B, F)`` in seconds to an additive embedding
        ``(B, F, dim)`` (and exposing ``reset_parameters()`` for meta-device
        builds). Selected by ``[time_embedding].type`` on the VLM video path.
        """

        def decorator(fn: Callable) -> Callable:
            self.register("time_embedding", name, fn)
            return fn

        return decorator

    def get_time_embedding(self, name: str) -> Callable:
        return self.get("time_embedding", name)

    def list_time_embeddings(self) -> list[str]:
        return self.list("time_embedding")

    def register_dyn_ckpt_strategy(self, name: str) -> Callable:
        """Decorator to register a dynamic-checkpointing-window strategy.

1 change: 1 addition & 0 deletions kempnerforge/config/schema.py
@@ -15,6 +15,7 @@
from kempnerforge.config.optimizer import OptimizerConfig # noqa: F401
from kempnerforge.config.profiling import ProfilingConfig # noqa: F401
from kempnerforge.config.scheduler import SchedulerConfig, SchedulerType # noqa: F401
from kempnerforge.config.time_embedding import TimeEmbeddingConfig # noqa: F401
from kempnerforge.config.training import ActivationCheckpointing, TrainConfig # noqa: F401
from kempnerforge.config.vision import VisionEncoderConfig # noqa: F401
from kempnerforge.config.vlm import ( # noqa: F401
73 changes: 73 additions & 0 deletions kempnerforge/config/time_embedding.py
@@ -0,0 +1,73 @@
"""Time-embedding (per-frame timestamp) configuration.

``TimeEmbeddingConfig`` selects which per-frame timestamp embedding the VLM
video path uses and parameterizes it. Dispatched via the ``time_embedding``
registry at build time (see ``kempnerforge/model/frame_time.py``).

In TOML, ``[time_embedding]`` is a top-level section parallel to ``[adapter]``.
It is only consumed for video (``frames_per_clip > 1``); the image and text
paths never build one. ``type = "none"`` disables the embedding even for video.
"""

from __future__ import annotations

from dataclasses import dataclass
from typing import Any

from kempnerforge.config.registry import registry


@dataclass
class TimeEmbeddingConfig:
    """Selects the time-embedding type and parameterizes it.

    Register a new technique via ``@registry.register_time_embedding`` and select
    it with ``type``; ``type = "none"`` disables the embedding entirely.

    Fields:
        type: Registry key for the builder (``"sinusoidal"`` default, or ``"none"``).
        num_bands: Number of sinusoidal frequency bands (``"sinusoidal"`` only).
        min_period: Shortest period in seconds (finest temporal resolution).
        max_period: Longest period in seconds (coarsest temporal scale).
    """

    type: str = "sinusoidal"
    num_bands: int = 16
    min_period: float = 0.5
    max_period: float = 256.0

    def __post_init__(self) -> None:
        if self.type == "none":
            return
        # Late import: importing the module triggers the
        # ``@registry.register_time_embedding`` decorators. Doing it at module
        # scope would create a circular import via the config/model graph.
        import kempnerforge.model.frame_time  # noqa: F401, PLC0415

        registered = tuple(registry.list_time_embeddings())
        if self.type not in registered:
            raise ValueError(
                f"Unknown time_embedding.type: {self.type!r}. "
                f"Registered: {sorted(registered)} (or 'none' to disable)."
            )
        if self.num_bands <= 0:
            raise ValueError(f"time_embedding.num_bands must be positive (got {self.num_bands})")
        if not 0.0 < self.min_period < self.max_period:
            raise ValueError(
                f"time_embedding requires 0 < min_period < max_period "
                f"(got min_period={self.min_period}, max_period={self.max_period})"
            )

    @property
    def enabled(self) -> bool:
        """Whether a module should be built (``type != "none"``)."""
        return self.type != "none"

    def extra_kwargs(self) -> dict[str, Any]:
        """Builder kwargs beyond ``dim``. Type-specific builders take what they
        need and swallow the rest via ``**_`` (mirrors ``AdapterConfig``)."""
        return {
            "num_bands": self.num_bands,
            "min_period": self.min_period,
            "max_period": self.max_period,
        }
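
The registry-plus-config pattern in this file and in `registry.py` can be tried out in isolation with the standalone sketch below. It does not import `kempnerforge`; every name in it (`_BUILDERS`, `register_time_embedding` as a free function, the `"constant"` builder, `SketchTimeEmbeddingConfig`, `build`) is hypothetical and only mirrors the shape of the real code: a decorator registers a builder under a key, the config validates `type` against the registered keys, and `type = "none"` skips building.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical stand-in for the project's registry: name -> builder.
_BUILDERS: dict[str, Callable[..., Any]] = {}


def register_time_embedding(name: str) -> Callable:
    """Decorator that stores a builder under ``name``."""

    def decorator(fn: Callable) -> Callable:
        _BUILDERS[name] = fn
        return fn

    return decorator


@register_time_embedding("constant")
def build_constant(dim: int, **_: Any) -> Callable[[float], list[float]]:
    """Toy builder: ignores extra kwargs and always embeds to zeros."""
    return lambda t_seconds: [0.0] * dim


@dataclass
class SketchTimeEmbeddingConfig:
    type: str = "constant"
    num_bands: int = 16
    min_period: float = 0.5
    max_period: float = 256.0

    def __post_init__(self) -> None:
        if self.type == "none":
            return  # intentional disable: nothing to validate
        if self.type not in _BUILDERS:
            raise ValueError(f"Unknown type {self.type!r}; registered: {sorted(_BUILDERS)}")
        if self.num_bands <= 0:
            raise ValueError("num_bands must be positive")
        if not 0.0 < self.min_period < self.max_period:
            raise ValueError("requires 0 < min_period < max_period")

    @property
    def enabled(self) -> bool:
        return self.type != "none"

    def build(self, dim: int) -> Any:
        """Return the built embedding, or None when disabled."""
        if not self.enabled:
            return None
        return _BUILDERS[self.type](
            dim,
            num_bands=self.num_bands,
            min_period=self.min_period,
            max_period=self.max_period,
        )


embed = SketchTimeEmbeddingConfig().build(dim=4)
disabled = SketchTimeEmbeddingConfig(type="none").build(dim=4)
```

Validating in `__post_init__` means a mistyped `type` or an inverted period range fails when the config is constructed rather than later at model build time.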
1 change: 1 addition & 0 deletions kempnerforge/config/vlm.py
@@ -47,6 +47,7 @@
    "transformer": ["transformer", "transformer.*"],
    "vision_encoder": ["vision_encoder", "vision_encoder.*"],
    "adapter": ["adapter", "adapter.*"],
    "frame_time_embed": ["frame_time_embed", "frame_time_embed.*"],
}

