
Commit 013634a

InfantLab and claude committed
release: v1.4.2 — JOSS review version
Migrate CLIP to open_clip (LAION-2B ViT-B-32), update deprecated HuggingFace auth parameters, harden GUID handling, remove superseded voice_emotion_baseline pipeline, and add JOSS cover letter. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d24bae7 commit 013634a

14 files changed

Lines changed: 97 additions & 103 deletions


.devcontainer/devcontainer.json

Lines changed: 2 additions & 15 deletions
@@ -16,24 +16,11 @@
 
   },
   "forwardPorts": [
-    18011,
-    18012,
-    18013,
-    18014,
-    18015,
-    19011,
-    19012,
-    19013,
-    19014,
-    19015
+    18011
   ],
   "portsAttributes": {
     "18011": {
-      "label": "VideoAnnotator API (default)",
-      "onAutoForward": "notify"
-    },
-    "19011": {
-      "label": "Video Annotation Viewer",
+      "label": "VideoAnnotator API",
       "onAutoForward": "notify"
     }
   },

CHANGELOG.md

Lines changed: 26 additions & 0 deletions
@@ -15,6 +15,32 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Benchmark results and performance validation
 - Additional contributor documentation improvements
 
+## [1.4.2] - 2026-03-04
+
+### JOSS Review Version
+
+This release accompanies the JOSS submission of VideoAnnotator and its companion project Video Annotation Viewer.
+
+#### Changed
+
+- **CLIP migration**: Migrated scene-classification pipeline from `clip` to `open_clip`, using the LAION-2B pretrained `ViT-B-32` model for improved availability and reproducibility.
+- **HuggingFace auth**: Updated diarization and Whisper pipelines to use the current `token` parameter instead of the deprecated `use_auth_token`.
+- **Devcontainer**: Simplified forwarded-port list to the single default API port (18011).
+
+#### Fixed
+
+- **Database GUID handling**: Added defensive `try/except` in the `GUID` type decorator to gracefully handle malformed UUID values.
+- **Diarization init**: Wrapped model loading in explicit error handling with a clear log message on failure.
+
+#### Removed
+
+- **Voice emotion baseline**: Removed `voice_emotion_baseline` pipeline metadata and associated tests (superseded by LAION EmoNet voice pipeline).
+
+#### Documentation
+
+- Added JOSS cover letter (`paper/cover_letter.md`).
+- Updated paper bibliography version to v1.4.2.
+
 ## [1.4.1] - 2025-12-26
 
 ### Release Quality, Docs, and Developer Experience

CITATION.cff

Lines changed: 2 additions & 2 deletions
@@ -28,5 +28,5 @@ authors:
     orcid: "https://orcid.org/0000-0001-5846-3444"
 license: "MIT"
 repository-code: "https://github.com/InfantLab/VideoAnnotator"
-version: "1.4.1"
-date-released: "2025-12-18"
+version: "1.4.2"
+date-released: "2026-03-04"

paper/cover_letter.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+**Cover letter — JOSS submission**
+**VideoAnnotator: an extensible, reproducible toolkit for automated video annotation in behavioral research**
+
+Dear Editor,
+
+We are pleased to submit *VideoAnnotator* for consideration by the Journal of Open Source Software. This is an open-source Python toolkit that provides a unified, locally deployed framework for automated multi-modal video annotation — covering person tracking, facial analysis, scene detection, and audio processing — aimed at behavioral, social, and health researchers.
+
+We believe the submission addresses the JOSS review criteria as follows:
+
+- **Open license**: The software is released under the MIT license.
+- **Repository and archival**: The source is hosted on GitHub at [InfantLab/VideoAnnotator](https://github.com/InfantLab/VideoAnnotator) and we will generate a versioned Zenodo DOI upon acceptance.
+- **Contribution and community guidelines**: The repository includes `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, issue templates, and `CITATION.cff`.
+- **Automated tests and CI**: A pytest suite of 74 test files (unit, integration, and performance) runs via GitHub Actions on Ubuntu, Windows, and macOS with Python 3.12, alongside ruff linting, mypy type-checking, Trivy security scanning, and Codecov reporting.
+- **Functionality documentation**: Full API documentation and usage guides are provided in the repository and rendered online.
+- **Statement of need and state of the field**: The paper includes both sections, positioning VideoAnnotator relative to existing tools (ELAN, Datavyu, DeepLabCut, Py-Feat, OpenFace, openSMILE, PySceneDetect, pyannote) and explaining the gap it fills.
+- **References**: All key upstream models and comparable tools are cited in the bibliography.
+- **Research application**: The paper includes a research-impact statement describing current use at Stellenbosch University and the University of Oxford for caregiver–child interaction studies under the Global Parenting Initiative.
+- **AI disclosure**: Included per JOSS policy.
+
+We would also like to note that a parallel JOSS submission is being prepared for the companion project, **Video Annotation Viewer** ([InfantLab/video-annotation-viewer](https://github.com/InfantLab/video-annotation-viewer)), which provides the interactive browser-based interface for reviewing and validating VideoAnnotator outputs. The two packages are designed to work together but are independently installable and have distinct codebases.
+
+This is our first submission to JOSS, so we very much appreciate any guidance you can offer throughout the review process. We are happy to address any feedback promptly.
+
+Thank you for your time and consideration.
+
+Sincerely,
+
+Caspar Addyman (corresponding author), Jeremiah Ishaya, Irene Uwerikowe, Daniel Stamate, Jamie Lachman, and Mark Tomlinson

paper/paper.bib

Lines changed: 2 additions & 2 deletions
@@ -3,7 +3,7 @@ @misc{videoannotator
   author = {Addyman, Caspar and Ishaya, Jeremiah and Uwerikowe, Irene and Stamate, Daniel and Lachman, Jamie and Tomlinson, Mark},
   year = {2026},
   howpublished = {\url{https://github.com/InfantLab/VideoAnnotator}},
-  note = {Version v1.4.1}
+  note = {Version v1.4.2}
 }
 
 @inproceedings{openface3,
@@ -113,7 +113,7 @@ @software{viewer
   author = {Addyman, Caspar and Uwerikowe, Irene and Ishaya, Jeremiah and Stamate, Daniel and Lachman, Jamie and Tomlinson, Mark},
   year = {2026},
   url = {https://github.com/InfantLab/video-annotation-viewer},
-  version = {0.4.2}
+  version = {0.6.2}
 }
 
 @book{observer,

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "videoannotator"
-version = "1.4.1"
+version = "1.4.2"
 description = "A modern, modular toolkit for analyzing, processing, and visualizing human interaction videos"
 readme = "README.md"
 license = "MIT"

src/videoannotator/database/models.py

Lines changed: 5 additions & 3 deletions
@@ -53,9 +53,11 @@ def process_result_value(self, value, dialect):
         """Convert database values back into UUID instances."""
         if value is None:
             return value
-        else:
-            if not isinstance(value, uuid.UUID):
-                return uuid.UUID(value)
+        if isinstance(value, uuid.UUID):
+            return value
+        try:
+            return uuid.UUID(value)
+        except (ValueError, AttributeError):
             return value
 
 
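The hardened `process_result_value` now passes malformed values through instead of raising mid-query. A minimal behavior sketch of the same logic as a standalone function; the enclosing SQLAlchemy `TypeDecorator` boilerplate and the sample values are illustrative, not taken from the repository:

```python
import uuid


def process_result_value(value, dialect=None):
    """Sketch of the patched GUID.process_result_value logic."""
    if value is None:
        return value
    if isinstance(value, uuid.UUID):
        return value
    try:
        return uuid.UUID(value)
    except (ValueError, AttributeError):
        # Malformed strings (or non-string junk) are returned unchanged
        # rather than aborting result-row processing.
        return value


print(process_result_value("0d9f2c4e-9b1a-4f6b-8c3d-2a7e5f1b9c0d"))  # -> uuid.UUID instance
print(process_result_value("not-a-uuid"))                            # -> "not-a-uuid", unchanged
print(process_result_value(None))                                    # -> None
```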

src/videoannotator/pipelines/audio_processing/diarization_pipeline.py

Lines changed: 12 additions & 8 deletions
@@ -65,14 +65,18 @@ def initialize(self) -> None:
 
         self.logger.info(f"Loading diarization model: {self.config['model']}")
 
-        if hf_token:
-            self.diarization_model = PyAnnotePipeline.from_pretrained(
-                self.config["model"], use_auth_token=hf_token
-            )
-        else:
-            self.diarization_model = PyAnnotePipeline.from_pretrained(
-                self.config["model"]
-            )
+        try:
+            if hf_token:
+                self.diarization_model = PyAnnotePipeline.from_pretrained(
+                    self.config["model"], token=hf_token
+                )
+            else:
+                self.diarization_model = PyAnnotePipeline.from_pretrained(
+                    self.config["model"]
+                )
+        except Exception as e:
+            self.logger.error(f"Failed to load diarization model: {e}")
+            raise
 
         self.is_initialized = True
         self.logger.info("DiarizationPipeline initialized")
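Because `initialize()` now logs the failure and re-raises, callers can decide whether a missing or gated pyannote model should abort a job or just disable diarization. A hypothetical caller-side sketch; the constructor arguments and how the pipeline resolves its HuggingFace token are assumptions, not shown in this diff:

```python
from videoannotator.pipelines.audio_processing.diarization_pipeline import (
    DiarizationPipeline,
)

pipeline = DiarizationPipeline()  # assumed default config; token resolution not shown here
try:
    pipeline.initialize()
except Exception as exc:
    # initialize() has already logged "Failed to load diarization model: ..."
    # before re-raising, so a job runner can skip diarization instead of crashing.
    print(f"Skipping diarization: {exc}")
```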

src/videoannotator/pipelines/audio_processing/whisper_base_pipeline.py

Lines changed: 2 additions & 2 deletions
@@ -255,7 +255,7 @@ def _load_hf_whisper_model(
         # Load processor
         processor_kwargs = {"cache_dir": cache_dir}
         if auth_token:
-            processor_kwargs["use_auth_token"] = auth_token
+            processor_kwargs["token"] = auth_token
 
         self.whisper_processor = WhisperProcessor.from_pretrained(
             model_id, **processor_kwargs
@@ -264,7 +264,7 @@ def _load_hf_whisper_model(
         # Load model
         model_kwargs = {"cache_dir": cache_dir}
         if auth_token:
-            model_kwargs["use_auth_token"] = auth_token
+            model_kwargs["token"] = auth_token
 
         # Add FP16 if requested and on GPU
         if self.config.get("use_fp16", True) and self.device.type == "cuda":
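The `use_auth_token` to `token` rename follows the current `transformers` `from_pretrained` API, where `use_auth_token` is deprecated in favor of `token`. A minimal standalone sketch of the same pattern; the model id, cache directory, and model class here are placeholders, not the pipeline's actual configuration:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "openai/whisper-base"  # placeholder; the pipeline supplies its own model_id
auth_token = None                 # or a HuggingFace access token string for gated models

kwargs = {"cache_dir": "./models"}
if auth_token:
    kwargs["token"] = auth_token  # current parameter name; use_auth_token is deprecated

processor = WhisperProcessor.from_pretrained(model_id, **kwargs)
model = WhisperForConditionalGeneration.from_pretrained(model_id, **kwargs)
```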

src/videoannotator/pipelines/scene_detection/scene_pipeline.py

Lines changed: 12 additions & 5 deletions
@@ -29,7 +29,7 @@
     SCENEDETECT_AVAILABLE = False
 
 try:
-    import clip
+    import open_clip
     import torch
 
     CLIP_AVAILABLE = True
@@ -59,7 +59,7 @@ def __init__(self, config: dict[str, Any] | None = None):
                 "office",
                 "playground",
             ],
-            "clip_model": "ViT-B/32",
+            "clip_model": "ViT-B-32",
             "use_gpu": True,
             "keyframe_extraction": "middle",  # Extract keyframe from middle of scene
         }
@@ -70,6 +70,7 @@ def __init__(self, config: dict[str, Any] | None = None):
         self.logger = logging.getLogger(__name__)
         self.clip_model = None
         self.clip_preprocess = None
+        self.clip_tokenizer = None
         self.device = None
 
     def process(
@@ -230,7 +231,7 @@ def _classify_scenes(
 
         # Prepare text prompts
         text_prompts = [f"a {prompt}" for prompt in self.config["scene_prompts"]]
-        text = clip.tokenize(text_prompts).to(self.device)
+        text = self.clip_tokenizer(text_prompts).to(self.device)
 
         classified_segments = []
         cap = cv2.VideoCapture(video_path)
@@ -312,9 +313,14 @@ def _initialize_clip(self):
         self.device = (
             "cuda" if self.config["use_gpu"] and torch.cuda.is_available() else "cpu"
         )
-        self.clip_model, self.clip_preprocess = clip.load(
-            self.config["clip_model"], device=self.device
+        self.clip_model, _, self.clip_preprocess = (
+            open_clip.create_model_and_transforms(
+                self.config["clip_model"],
+                pretrained="laion2b_s34b_b79k",
+                device=self.device,
+            )
         )
+        self.clip_tokenizer = open_clip.get_tokenizer(self.config["clip_model"])
         self.logger.info(
             f"CLIP model loaded: {self.config['clip_model']} on {self.device}"
         )
@@ -417,6 +423,7 @@ def cleanup(self) -> None:
 
         self.clip_model = None
         self.clip_preprocess = None
+        self.clip_tokenizer = None
         self.device = None
         self.is_initialized = False
         self.logger.info("Scene Detection Pipeline cleaned up")
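For anyone checking the CLIP migration locally, the new `_initialize_clip` and `_classify_scenes` code follows the standard `open_clip` zero-shot classification recipe with the same model name and pretrained tag as the pipeline default. A standalone sketch; the prompt list and image path are placeholders:

```python
import open_clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k", device=device
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

prompts = ["a living room", "an office", "a playground"]  # placeholder scene prompts
text = tokenizer(prompts).to(device)
image = preprocess(Image.open("keyframe.jpg")).unsqueeze(0).to(device)  # placeholder frame

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(prompts, probs[0].tolist())))  # per-prompt probabilities for the keyframe
```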
