Skip to content

Feature/video transcriptions#589

Open
parse-nip wants to merge 7 commits into
siddharthvaddem:mainfrom
parse-nip:feature/video-transcriptions
Open

Feature/video transcriptions#589
parse-nip wants to merge 7 commits into
siddharthvaddem:mainfrom
parse-nip:feature/video-transcriptions

Conversation

@parse-nip
Copy link
Copy Markdown

@parse-nip parse-nip commented May 14, 2026

Pull Request Template

Description

Adds the new in-browser auto captions feature to OpenScreen. This generates caption annotations from recorded video audio inside the editor, makes them editable on the timeline, and includes follow-up fixes to improve timing accuracy and preserve visible gaps during pauses in speech.

How it works (implementation)

  • Speech recognition: captions are produced in the browser with Transformers.js (@xenova/transformers), using the Xenova/whisper-tiny automatic-speech-recognition model (Whisper “tiny” weights). The first run downloads model assets (~75 MB) to the client; later runs reuse the cache.
  • Audio path: the editor decodes the video’s audio track to mono 16 kHz float PCM (Whisper’s expected input). It prefers decodeAudioData when the container is supported; otherwise it uses the same web-demuxer + AudioDecoder stack as export. Long clips are truncated to a configured max duration for captioning; duration metadata reflects the decoded slice.
  • Pre-processing: optional leading-silence trim before inference (with a full-buffer retry if the trimmed buffer yields no segments, so we do not false-negative on quiet intros). User trim regions are applied when mapping timestamps so trimmed audio is not transcribed.
  • Timestamps → annotations: Whisper can return word-level or phrase-level timestamps. The pipeline merges/dedupes chunk artifacts, groups tokens into on-screen lines using your min/max words per caption settings, reconciles timeline gaps between auto-caption regions, and inserts text annotations tagged as auto-generated so span edits stay consistent.

Motivation

OpenScreen did not have a built-in captioning workflow, so users had no way to generate subtitles directly inside the app. This change adds that feature and addresses the key usability issues that came up during testing: inaccurate timing, captions collapsing across pauses, and occasional empty transcription results on spoken clips.

Type of Change

  • New Feature
  • Bug Fix
  • Refactor / Code Cleanup
  • Documentation Update
  • Other (please specify)

Screenshots / Video

Video (if applicable):

Screen.Recording.2026-05-14.094520.mp4

Testing

Reviewer flow:

Start the app in dev mode with npm run dev
Open a recording with clear spoken audio
Generate auto captions from the editor
Verify caption annotations are created on the timeline
Verify captions disappear during real pauses and resume when speech starts again
Verify caption timing stays aligned with speech and does not collapse everything into one continuous run

Checklist

  • I have performed a self-review of my code.
  • I have added any necessary screenshots or videos.
  • I have linked related issue(s) and updated the changelog if applicable.

Thank you for contributing!

Summary by CodeRabbit

  • New Features

    • Automatic caption generation from video audio with a toolbar button, configurable min/max words, progress/status toasts, and auto-inserted captions that sync edits across caption-driven annotations.
    • Enhanced on-device transcription and audio extraction for more reliable captioning.
  • Localization

    • Added auto-caption UI strings for Arabic, English, Spanish, French, Japanese, Korean, Turkish, Simplified and Traditional Chinese.
  • Bug Fixes

    • Fixed dropdown layering so select menus render above other UI.
    • Updated toast/notification styling to a dark theme.

Review Change Stack

parse-nip and others added 2 commits May 13, 2026 22:33
Decode mono 16k audio from the editor video, run Whisper via Transformers.js, and insert linked text annotations with timing and layout helpers.

Adds Vite resolve shims for Node-only imports used by the model stack, optional leading-silence trim for the caption buffer, timeline gap reconciliation for auto-caption regions, and editor i18n. Raises Select content z-index so caption controls stay usable over the video surface.

Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds in-browser Whisper transcription types and a robust transcribeMono16kToSegments entrypoint (slice handling, dedupe, retries, Node-version masking), plus small UI tweaks: Sonner Toaster theme/class merging and removal of a Toaster prop.

Changes

Auto-captions Feature

Layer / File(s) Summary
Toaster & Sonner UI tweaks
src/components/ui/sonner.tsx, src/App.tsx
Switches Sonner wrapper to dark theme, merges caller className via cn, updates toast classNames, and removes className="pointer-events-auto" from App's Toaster.
Whisper transcription core
src/lib/captioning/transcribe.ts
Adds exported caption types and implements transcribeMono16kToSegments with UI-yielding, withoutNodeVersion masking, slice-based transcription for long audio, chunk→segment conversion, deduplication, trim filtering, and retry strategies.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • siddharthvaddem
  • FabLrc

Poem

whisper wakens from the hush,
slices hum and timings flush,
toast goes dark, the UI sips tea,
segments stitch where words should be,
tiny miracles in JS, lowkey.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Feature/video transcriptions' accurately describes the main change—adding auto captions/transcription via Whisper. It's concise and clear about the primary feature, though somewhat broad.
Description check ✅ Passed The description includes all required sections: a detailed breakdown of the feature, clear motivation, type of change checkboxes (New Feature and Refactor marked), testing instructions, demo video, and a completed checklist. It goes above and beyond with implementation specifics.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c431fe473

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/components/video-editor/VideoEditor.tsx
@siddharthvaddem
Copy link
Copy Markdown
Owner

someone beat me to it lol - i was going to work on this soon.

can you add more detailed description of the implementation - is this using whisper?

also glad that you added a recording of it. I'll take a more closer look later in the week when I have some downtime.

But thanks for working on this!

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Nitpick comments (4)
src/components/video-editor/VideoEditor.tsx (1)

1022-1040: 💤 Low value

nit: reconcile runs for every annotation span change, including blur/non-caption ones

handleAnnotationSpanChange is also wired up as the blur span handler (line 2253), and even regular text/figure annotations flow through here. for those, reconcileAutoCaptionTimelineGaps is presumably a no-op (only auto-caption siblings produce gaps), but it still walks the whole array on every span tweak — and span changes can fire pretty frequently during drag.

cheap shortcut: only reconcile when the edited region is actually annotationSource === "auto-caption" (or when the array contains any). totally optional, won't move any numbers on small projects, just leaving it here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/components/video-editor/VideoEditor.tsx` around lines 1022 - 1040,
handleAnnotationSpanChange currently always calls
reconcileAutoCaptionTimelineGaps after updating annotationRegions, causing
unnecessary full-array work on every drag/blur; change it to only run
reconciliation when relevant: inside the pushState update, determine whether the
edited region (matched by id) has annotationSource === "auto-caption" (or
short-circuit if any region in prev.annotationRegions already has
annotationSource === "auto-caption"), and only call
reconcileAutoCaptionTimelineGaps(next) in that case; otherwise return {
annotationRegions: next } directly. Refer to handleAnnotationSpanChange,
pushState, annotationRegions, reconcileAutoCaptionTimelineGaps and the region.id
/ annotationSource fields to locate the change.
src/lib/captioning/leadingSilence.ts (1)

66-83: 💤 Low value

nit: could combine the two maps into one pass

lines 71-81 do two separate maps when you could do both operations in one:

return regions
  .map((r) => ({
    ...r,
    startMs: Math.max(0, r.startMs - trimMs),
    endMs: Math.max(0, r.endMs - trimMs),
  }))
  .filter((r) => r.endMs > r.startMs);

current version is totally fine tho, just a tiny micro-optimization that probably doesn't matter.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/captioning/leadingSilence.ts` around lines 66 - 83, The two
successive .map passes in shiftTrimRegionsMsForCaptionBuffer can be combined
into a single pass: in the mapping for each TrimRegion (refer to function
shiftTrimRegionsMsForCaptionBuffer and parameter trimMs) compute startMs and
endMs as Math.max(0, r.startMs - trimMs) and Math.max(0, r.endMs - trimMs)
respectively, return the updated object, then keep the existing .filter(r =>
r.endMs > r.startMs); this removes the extra iteration while preserving
behavior.
src/lib/captioning/extractMono16k.ts (1)

87-103: 💤 Low value

durationSec after truncation might be confusing

when truncated is true, you're returning the original durationSec but the samples array is shorter. this could trip up downstream code that assumes samples.length / 16000 ≈ durationSec.

maybe rename to originalDurationSec or return the actual truncated duration? or add a comment explaining why original is preserved.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/captioning/extractMono16k.ts` around lines 87 - 103, The
truncateAndResampleTo16k function currently returns the original durationSec
even when audio was truncated, which can mislead callers; change the return
value to reflect the actual post-truncation duration (e.g., compute durationSec
= samples.length / 16000 or set to Math.min(originalDurationSec,
MAX_CAPTION_AUDIO_SEC) after resampling) and update any callers expecting the
old semantics (or alternatively rename the original parameter to
originalDurationSec if you prefer preserving both values) — locate this logic in
truncateAndResampleTo16k and adjust the returned durationSec to match the
truncated/resampled samples.
src/lib/captioning/extractMono16kWebDemuxer.ts (1)

45-74: 💤 Low value

frames are consumed (closed) by this function

line 65 closes each frame after processing. if the caller still needs these frames, this will break. might want to add a comment or rename to something like mergeAndConsumeDecodedAudio to make it explicit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/captioning/extractMono16kWebDemuxer.ts` around lines 45 - 74, The
function mergeDecodedAudioToMonoLinear currently calls frame.close() and thus
consumes frames; change its API to make consumption explicit by adding a boolean
parameter (e.g., consumeFrames = false) to mergeDecodedAudioToMonoLinear and
only call frame.close() when consumeFrames is true, or alternatively rename to
mergeAndConsumeDecodedAudio if you want a consuming variant; update any callers
to pass the flag or use the renamed function and add a short comment noting the
behavior; reference mergeDecodedAudioToMonoLinear, the frame.close() call, and
audioDataFrameToMono when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/components/video-editor/VideoEditor.tsx`:
- Around line 1835-1852: The retry path transcribes the full `samples` buffer
but the post-processing always adds `trimSec` to `segmentsRaw`, causing retryed
segments to be double-shifted; change the logic around the
`transcribeMono16kToSegments` call(s) so you track which buffer produced
`segmentsRaw` (e.g., a boolean like `usedTrimmed = true/false` when calling with
`speechSamples` vs `samples`) and only add `trimSec` to segment timestamps when
`usedTrimmed` is true; update the block that builds `segments` (the mapping that
adjusts `startSec`/`endSec`) to consult that flag instead of unconditionally
checking `trimSec`, leaving the retry results unshifted when they came from the
untouched `samples`.

In `@src/i18n/locales/ar/editor.json`:
- Around line 45-62: The autoCaptions block currently contains English strings;
replace each value under "autoCaptions" (keys: button, dialogTitle,
dialogDescription, minWords, maxWords, wordsCount, generate, dialogCancel,
generating, loadingModel, busy, done, noneHeard, noAudio, failed, truncated)
with Arabic translations—for example use: button/dialogTitle: "التسميات
التوضيحية التلقائية", dialogDescription: "اختر تقريبا كم عدد الكلمات التي تظهر
في كل تسمية توضيحية. يتم توزيع التوقيت عبر الكلمات في تلك العبارة.", minWords:
"الحد الأدنى من الكلمات لكل تسمية", maxWords: "الحد الأقصى من الكلمات لكل
تسمية", wordsCount: "{{count}} كلمة", generate: "توليد", dialogCancel: "إلغاء",
generating: "جارٍ توليد التسميات من الصوت…", loadingModel: "جارٍ تحميل نموذج
الكلام (سيتم تنزيل ~75 ميغابايت عند الاستخدام الأول)…", busy: "توليد التسميات
قيد التنفيذ بالفعل.", done: "تمت إضافة {{count}} تسمية.", noneHeard: "لم يتم
الكشف عن أي كلام.", noAudio: "لا يحتوي هذا الفيديو على صوت صالح للنسخ.", failed:
"تعذّر توليد التسميات.", truncated: "تم نسخ الدقائق الأولى فقط: {{minutes}}
دقيقة."; keep the keys unchanged so localization lookup (autoCaptions) continues
to work.

In `@src/i18n/locales/es/editor.json`:
- Around line 45-62: Replace the English strings under the "autoCaptions" object
with Spanish translations for every key (button, dialogTitle, dialogDescription,
minWords, maxWords, wordsCount, generate, dialogCancel, generating,
loadingModel, busy, done, noneHeard, noAudio, failed, truncated); update the
values so UI shows Spanish text (e.g., "Auto captions" -> "Subtítulos
automáticos", "Generate" -> "Generar", "Cancel" -> "Cancelar", status/error
messages and the templated strings like "{{count}} words" and "Only the first
{{minutes}} minutes were transcribed." should be translated preserving the
placeholders like {{count}} and {{minutes}). Ensure punctuation and ellipses
match original intent (e.g., "Generating captions from audio…" keeps ellipsis)
and verify plural/template grammar in wordsCount/done/truncated.

In `@src/i18n/locales/fr/editor.json`:
- Around line 45-62: Translate every English string inside the "autoCaptions"
object to French while preserving all JSON keys and interpolation placeholders
(e.g., {{count}}, {{minutes}}), punctuation, ellipses and parentheses; update
values for "button", "dialogTitle", "dialogDescription", "minWords", "maxWords",
"wordsCount", "generate", "dialogCancel", "generating", "loadingModel" (keep the
"~75 MB" note and parentheses), "busy", "done", "noneHeard", "noAudio",
"failed", and "truncated" so the UI displays fully French text for the
autoCaptions section.

In `@src/i18n/locales/ja-JP/editor.json`:
- Around line 45-62: The autoCaptions block in the ja-JP editor locale is still
English; translate each value for keys under "autoCaptions" (e.g., "button",
"dialogTitle", "dialogDescription", "minWords", "maxWords", "wordsCount",
"generate", "dialogCancel", "generating", "loadingModel", "busy", "done",
"noneHeard", "noAudio", "failed", "truncated") into natural Japanese and replace
the English strings in the "autoCaptions" object in
src/i18n/locales/ja-JP/editor.json so the caption-generation UI displays
Japanese text for Japanese users.

In `@src/i18n/locales/ko-KR/editor.json`:
- Around line 45-62: The autoCaptions translation block currently contains
English strings; update each key under "autoCaptions" (button, dialogTitle,
dialogDescription, minWords, maxWords, wordsCount, generate, dialogCancel,
generating, loadingModel, busy, done, noneHeard, noAudio, failed, truncated)
with proper Korean translations preserving interpolation tokens like {{count}}
and {{minutes}} and ellipses; ensure punctuation and spacing match locale
conventions and that message semantics (e.g., informational vs. error) are
preserved.

In `@src/i18n/locales/tr/editor.json`:
- Around line 45-62: Replace the English strings under the autoCaptions object
with Turkish translations so the UI is fully localized; update the values for
keys button, dialogTitle, dialogDescription, minWords, maxWords, wordsCount,
generate, dialogCancel, generating, loadingModel, busy, done, noneHeard,
noAudio, failed, and truncated by providing appropriate Turkish text (e.g.,
"Otomatik altyazılar" for titles/buttons, Turkish sentences for descriptions and
status messages, and localized plural form for wordsCount like "{{count}}
kelime" and minutes in truncated like "{{minutes}} dakika"). Ensure the existing
key names remain unchanged and preserve interpolation tokens like {{count}} and
{{minutes}} exactly as they appear.

In `@src/i18n/locales/zh-CN/editor.json`:
- Around line 45-62: The autoCaptions locale block is still in English; replace
each string value under the "autoCaptions" object (keys: button, dialogTitle,
dialogDescription, minWords, maxWords, wordsCount, generate, dialogCancel,
generating, loadingModel, busy, done, noneHeard, noAudio, failed, truncated)
with proper Simplified Chinese translations, preserving all interpolation tokens
like {{count}} and {{minutes}} and punctuation/ellipsis, and do not change the
key names or JSON structure so the app consumes the translated strings
correctly.

In `@src/i18n/locales/zh-TW/editor.json`:
- Around line 45-62: The autoCaptions translation block is still in English
across locales; update the "autoCaptions" object (keys: button, dialogTitle,
dialogDescription, minWords, maxWords, wordsCount, generate, dialogCancel,
generating, loadingModel, busy, done, noneHeard, noAudio, failed, truncated) in
every non-English locale file (ar, es, fr, ja-JP, ko-KR, tr, zh-CN, zh-TW) with
proper localized strings matching the tone and placeholders (e.g., keep
{{count}} and {{minutes}} intact), ensure grammar/placeholder placement is
correct for each language, and run/verify the i18n validation script to catch
missing keys or formatting issues.

In `@src/lib/captioning/annotationsFromCaptions.ts`:
- Around line 535-546: The initial call to
dedupeAdjacentCaptionRepeats(segments) is removing legitimate repeated tokens
before line assembly in word mode; stop deduping the raw input and instead run
dedupe only after grouping. Remove the pre-group dedupe (the dedupedIn variable)
and pass the original segments into groupPhraseCaptionSegmentsIntoLines or
groupTimedCaptionWordsIntoLines, then call dedupeAdjacentCaptionRepeats on the
grouped result (as you already do) before finalizeCaptionSegmentsForPlayback;
keep finalizeCaptionSegmentsForPlayback unchanged.

In `@src/lib/captioning/transcribe.ts`:
- Around line 125-137: The raw chunk dedupe logic in transcribe.ts (variables:
segments, rawDeduped, CHUNK_DUP_MAX_GAP_SEC) is too aggressive and collapses
repeated words separated by small gaps; change the merge condition so
identical-text segments are only merged when they truly overlap or immediately
touch (e.g., require seg.startSec <= prev.endSec) instead of allowing a
configurable gap (remove or set CHUNK_DUP_MAX_GAP_SEC to 0), keeping the rest of
the merge logic (updating prev.startSec/prev.endSec) intact.

---

Nitpick comments:
In `@src/components/video-editor/VideoEditor.tsx`:
- Around line 1022-1040: handleAnnotationSpanChange currently always calls
reconcileAutoCaptionTimelineGaps after updating annotationRegions, causing
unnecessary full-array work on every drag/blur; change it to only run
reconciliation when relevant: inside the pushState update, determine whether the
edited region (matched by id) has annotationSource === "auto-caption" (or
short-circuit if any region in prev.annotationRegions already has
annotationSource === "auto-caption"), and only call
reconcileAutoCaptionTimelineGaps(next) in that case; otherwise return {
annotationRegions: next } directly. Refer to handleAnnotationSpanChange,
pushState, annotationRegions, reconcileAutoCaptionTimelineGaps and the region.id
/ annotationSource fields to locate the change.

In `@src/lib/captioning/extractMono16k.ts`:
- Around line 87-103: The truncateAndResampleTo16k function currently returns
the original durationSec even when audio was truncated, which can mislead
callers; change the return value to reflect the actual post-truncation duration
(e.g., compute durationSec = samples.length / 16000 or set to
Math.min(originalDurationSec, MAX_CAPTION_AUDIO_SEC) after resampling) and
update any callers expecting the old semantics (or alternatively rename the
original parameter to originalDurationSec if you prefer preserving both values)
— locate this logic in truncateAndResampleTo16k and adjust the returned
durationSec to match the truncated/resampled samples.

In `@src/lib/captioning/extractMono16kWebDemuxer.ts`:
- Around line 45-74: The function mergeDecodedAudioToMonoLinear currently calls
frame.close() and thus consumes frames; change its API to make consumption
explicit by adding a boolean parameter (e.g., consumeFrames = false) to
mergeDecodedAudioToMonoLinear and only call frame.close() when consumeFrames is
true, or alternatively rename to mergeAndConsumeDecodedAudio if you want a
consuming variant; update any callers to pass the flag or use the renamed
function and add a short comment noting the behavior; reference
mergeDecodedAudioToMonoLinear, the frame.close() call, and audioDataFrameToMono
when making the change.

In `@src/lib/captioning/leadingSilence.ts`:
- Around line 66-83: The two successive .map passes in
shiftTrimRegionsMsForCaptionBuffer can be combined into a single pass: in the
mapping for each TrimRegion (refer to function
shiftTrimRegionsMsForCaptionBuffer and parameter trimMs) compute startMs and
endMs as Math.max(0, r.startMs - trimMs) and Math.max(0, r.endMs - trimMs)
respectively, return the updated object, then keep the existing .filter(r =>
r.endMs > r.startMs); this removes the extra iteration while preserving
behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fc01dfd4-7cce-43c9-9fa2-e07e89630c21

📥 Commits

Reviewing files that changed from the base of the PR and between b0293e7 and 1c431fe.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (25)
  • package.json
  • src/components/ui/select.tsx
  • src/components/video-editor/VideoEditor.tsx
  • src/components/video-editor/projectPersistence.ts
  • src/components/video-editor/types.ts
  • src/i18n/locales/ar/editor.json
  • src/i18n/locales/en/editor.json
  • src/i18n/locales/es/editor.json
  • src/i18n/locales/fr/editor.json
  • src/i18n/locales/ja-JP/editor.json
  • src/i18n/locales/ko-KR/editor.json
  • src/i18n/locales/tr/editor.json
  • src/i18n/locales/zh-CN/editor.json
  • src/i18n/locales/zh-TW/editor.json
  • src/lib/captioning/annotationsFromCaptions.test.ts
  • src/lib/captioning/annotationsFromCaptions.ts
  • src/lib/captioning/captionConstants.ts
  • src/lib/captioning/extractMono16k.ts
  • src/lib/captioning/extractMono16kWebDemuxer.ts
  • src/lib/captioning/index.ts
  • src/lib/captioning/leadingSilence.ts
  • src/lib/captioning/transcribe.ts
  • src/lib/vite-stubs/empty-node-module.ts
  • src/lib/vite-stubs/onnxruntime-node-stub.ts
  • vite.config.ts

Comment thread src/components/video-editor/VideoEditor.tsx
Comment thread src/i18n/locales/ar/editor.json
Comment thread src/i18n/locales/es/editor.json
Comment thread src/i18n/locales/fr/editor.json
Comment thread src/i18n/locales/ja-JP/editor.json
Comment thread src/i18n/locales/tr/editor.json
Comment thread src/i18n/locales/zh-CN/editor.json
Comment thread src/i18n/locales/zh-TW/editor.json
Comment thread src/lib/captioning/annotationsFromCaptions.ts Outdated
Comment thread src/lib/captioning/transcribe.ts
- Gate trimSec segment shift when transcription used full buffer retry
- Restore UTF-8 autoCaptions strings (ar, es, fr, ja-JP, ko-KR, tr, zh-CN, zh-TW)
- Dedupe caption segments after grouping only; stricter chunk dedupe in transcribe
- Post-truncate duration, explicit consume merge for web demuxer, trim region shift cleanup
- Tests for caption annotation pipeline

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/captioning/annotationsFromCaptions.ts`:
- Around line 459-466: The single-span (group length === 1) case currently only
inserts line breaks and leaves the full original time range intact, so
phrase-mode captions never get time-sliced; change the branch handling the lone
span (the `only` variable in annotationsFromCaptions.ts) to call
wrapCaptionTextByWordBounds(only.text, minWords, maxWords) to produce an array
of caption segments, then split the time range (only.startSec..only.endSec)
evenly across those segments and push one caption object per segment with the
segment text and its computed startSec/endSec; ensure you still use minWords and
maxWords and preserve the existing out.push shape.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8fcd92b9-825c-454f-bcb1-804d9aa1f23d

📥 Commits

Reviewing files that changed from the base of the PR and between 1c431fe and 2b7e6f1.

📒 Files selected for processing (15)
  • src/components/video-editor/VideoEditor.tsx
  • src/i18n/locales/ar/editor.json
  • src/i18n/locales/es/editor.json
  • src/i18n/locales/fr/editor.json
  • src/i18n/locales/ja-JP/editor.json
  • src/i18n/locales/ko-KR/editor.json
  • src/i18n/locales/tr/editor.json
  • src/i18n/locales/zh-CN/editor.json
  • src/i18n/locales/zh-TW/editor.json
  • src/lib/captioning/annotationsFromCaptions.test.ts
  • src/lib/captioning/annotationsFromCaptions.ts
  • src/lib/captioning/extractMono16k.ts
  • src/lib/captioning/extractMono16kWebDemuxer.ts
  • src/lib/captioning/leadingSilence.ts
  • src/lib/captioning/transcribe.ts
✅ Files skipped from review due to trivial changes (5)
  • src/i18n/locales/tr/editor.json
  • src/i18n/locales/ko-KR/editor.json
  • src/i18n/locales/ar/editor.json
  • src/i18n/locales/zh-CN/editor.json
  • src/i18n/locales/fr/editor.json
🚧 Files skipped from review as they are similar to previous changes (8)
  • src/i18n/locales/ja-JP/editor.json
  • src/lib/captioning/extractMono16kWebDemuxer.ts
  • src/i18n/locales/zh-TW/editor.json
  • src/lib/captioning/annotationsFromCaptions.test.ts
  • src/lib/captioning/extractMono16k.ts
  • src/lib/captioning/leadingSilence.ts
  • src/lib/captioning/transcribe.ts
  • src/components/video-editor/VideoEditor.tsx

Comment thread src/lib/captioning/annotationsFromCaptions.ts Outdated
Phrase mode with one merged span now splits wrapCaptionTextByWordBounds
lines into separate CaptionSegments with even time allocation across
the phrase span (fallback when duration is too short for min spans).
Update unit test accordingly.

Co-authored-by: Cursor <cursoragent@cursor.com>
@parse-nip
Copy link
Copy Markdown
Author

parse-nip commented May 14, 2026

@siddharthvaddem Thanks for the review, no problem hahah. I have updated the PR: expanded the description with how the pipeline works (yes, it uses Whisper tiny via Transformers.js in the browser), and pushed the follow-up fixes we discussed. Appreciate you taking a closer look when you have time.

Merge siddharthvaddem/openscreen main and resolve conflicts in VideoEditor.tsx (export diagnostics, save-diagnostic handler, auto captions, toolbar chrome) and package-lock.json. Apply Biome formatting after merge so hooks pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/components/video-editor/VideoEditor.tsx (1)

1300-1311: 💤 Low value

three near-identical sync handlers — fine for now, but a tiny helper would save future eyes

each of handleAnnotationStyleChange / handleAnnotationPositionChange / handleAnnotationSizeChange runs the same dance: find the touched region, check annotationSource === "auto-caption", then map-with-conditional. it works, and the empty-find case correctly no-ops because syncAutoCaptions becomes false. nothing actually broken here.

if you ever touch this surface again, something like:

function applyToCaptionFamily<R extends AnnotationRegion>(
  regions: R[],
  id: string,
  patch: (region: R) => R,
): R[] {
  const touched = regions.find((r) => r.id === id);
  const syncAll = touched?.annotationSource === "auto-caption";
  return regions.map((region) => {
    if (syncAll && region.annotationSource === "auto-caption") return patch(region);
    return region.id === id ? patch(region) : region;
  });
}

would collapse all three handlers to one-liners. low-priority though, totally fine to defer.

Also applies to: 1374-1385, 1392-1403

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/components/video-editor/VideoEditor.tsx` around lines 1300 - 1311,
Extract the repeated map/find logic into a small helper (e.g.,
applyToCaptionFamily) that accepts regions, id, and a patch function and
implements the described syncAutoCaptions behavior; then replace the duplicate
blocks inside handleAnnotationStyleChange, handleAnnotationPositionChange, and
handleAnnotationSizeChange (the pushState callers where you currently find
touched and map regions) to call applyToCaptionFamily(prev.annotationRegions,
id, region => ({ ...region, style: { ...region.style, ...style } }) or the
appropriate patch for position/size), returning the new annotationRegions array
from pushState.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/components/video-editor/VideoEditor.tsx`:
- Around line 1300-1311: Extract the repeated map/find logic into a small helper
(e.g., applyToCaptionFamily) that accepts regions, id, and a patch function and
implements the described syncAutoCaptions behavior; then replace the duplicate
blocks inside handleAnnotationStyleChange, handleAnnotationPositionChange, and
handleAnnotationSizeChange (the pushState callers where you currently find
touched and map regions) to call applyToCaptionFamily(prev.annotationRegions,
id, region => ({ ...region, style: { ...region.style, ...style } }) or the
appropriate patch for position/size), returning the new annotationRegions array
from pushState.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bdcd45dd-e161-4b47-a334-66ed990eb1d2

📥 Commits

Reviewing files that changed from the base of the PR and between 3a48e54 and 5417581.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (4)
  • package.json
  • src/components/video-editor/VideoEditor.tsx
  • src/components/video-editor/projectPersistence.ts
  • src/components/video-editor/types.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/components/video-editor/types.ts
  • package.json

@FabLrc
Copy link
Copy Markdown
Collaborator

FabLrc commented May 14, 2026

This is a very cool feature, nice work !
Can you add something that show that transcriptions is loading ?

Use one Sonner id for the full caption flow, show a distinct transcribing step, yield before Whisper so updates paint, match editor dark chrome on toasts, and keep pointer-events only on toast bodies. Add transcribing strings across locales.

Co-authored-by: Cursor <cursoragent@cursor.com>
@parse-nip
Copy link
Copy Markdown
Author

Hey @FabLrc I just added in the toast notifcations updates for the transcriptions.

Screen.Recording.2026-05-15.122648.mp4

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/captioning/transcribe.ts`:
- Around line 298-302: The fallback path currently calls transcribeOne(true,
...) which bypasses trimming and can reintroduce captions inside user-trimmed
regions; modify the logic after each retry (the transcribeOne(true, true,
timestampMode) and transcribeOne(true, false, timestampMode) calls) to reapply
the existing trimRegions filter (or call the same trimming utility used
elsewhere) to the returned segments before assigning to segments, or
alternatively split the retry so it only disables silence-based trimming but
still enforces trimRegions; ensure to reference transcribeOne, segments, trims
and the trimRegions trimming function/utility when making the change.
- Around line 242-253: After each awaited transcriber call (calls to
runTranscriberOnSlice in the short-audio branch that uses
padTailSliceForTranscribe and in the multi-slice branch that builds/pushes
segments), check the AbortSignal (signal.aborted) immediately before any
post-processing (extractChunksFromAsrResult, segmentsFromTranscriberChunks, or
pushing results). If the signal is aborted, stop processing and propagate
cancellation (throw an AbortError or rethrow a cancellation) instead of
returning/pushing segments so cancellation cannot resolve with captions; update
the short-audio path around TRANSCRIBE_SLICE_SAMPLES and the multi-slice path
that handles pushing slices accordingly.
- Around line 38-56: withoutNodeVersion currently removes process.versions.node
for the entire transcription Promise which can hide Node for unrelated async
work; change the usage so that withoutNodeVersion only wraps the ORT
initialization steps (the minimal synchronous/async calls that detect Node)
instead of the whole transcription flow: refactor callers to call
withoutNodeVersion around the ORT init function/block (replace broad fn usage
with a short-lived wrapper around the ORT init routine) or, if multiple
overlapping transcriptions must be prevented, implement a simple mutex/semaphore
around the ORT init path so only one transcription can clear node-version at a
time; keep the withoutNodeVersion helper the same but limit its scope to the ORT
init symbol(s) rather than the full transcription function.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 26da2b9b-1a54-4b39-b904-e33ec145f12d

📥 Commits

Reviewing files that changed from the base of the PR and between 5417581 and a0bcd9f.

📒 Files selected for processing (13)
  • src/App.tsx
  • src/components/ui/sonner.tsx
  • src/components/video-editor/VideoEditor.tsx
  • src/i18n/locales/ar/editor.json
  • src/i18n/locales/en/editor.json
  • src/i18n/locales/es/editor.json
  • src/i18n/locales/fr/editor.json
  • src/i18n/locales/ja-JP/editor.json
  • src/i18n/locales/ko-KR/editor.json
  • src/i18n/locales/tr/editor.json
  • src/i18n/locales/zh-CN/editor.json
  • src/i18n/locales/zh-TW/editor.json
  • src/lib/captioning/transcribe.ts
✅ Files skipped from review due to trivial changes (4)
  • src/i18n/locales/ko-KR/editor.json
  • src/i18n/locales/zh-TW/editor.json
  • src/i18n/locales/es/editor.json
  • src/i18n/locales/ar/editor.json
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/i18n/locales/fr/editor.json
  • src/i18n/locales/zh-CN/editor.json
  • src/i18n/locales/tr/editor.json
  • src/i18n/locales/ja-JP/editor.json
  • src/i18n/locales/en/editor.json
  • src/components/video-editor/VideoEditor.tsx

Comment thread src/lib/captioning/transcribe.ts
Comment thread src/lib/captioning/transcribe.ts Outdated
Comment thread src/lib/captioning/transcribe.ts Outdated
Scope withoutNodeVersion to Transformers import/pipeline only; reapply trim-region filtering after ignore-trims retries; check AbortSignal after each slice inference before chunk processing.

Co-authored-by: Cursor <cursoragent@cursor.com>

const trims = options?.trimRegions ?? [];
options?.onStatus?.("transcribe");
if (options?.signal?.aborted) throw new DOMException("Aborted", "AbortError");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need both checks on 248 and 250 for throwing these errors?


if (options?.signal?.aborted) throw new DOMException("Aborted", "AbortError");

await yieldForUiPaint();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why run whisper on the main renderer thread? Is there a way to do this either through web worker so that it doesn't block UI?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah fair enough, i will move it.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also move auto captioning button to timeline?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@siddharthvaddem
Copy link
Copy Markdown
Owner

@parse-nip I would like to get this in before it further diverges, making it a lot more work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants