Commit d8c5524

aramb-dev and claude committed
feat(transcribe): switch from OpenAI Whisper to WhisperX model
Replace openai/whisper with victor-upmeet/whisperx for 70x realtime transcription with word-level timestamps and speaker diarization support. Update audio input param from `audio` to `audio_file`, add diarization config with HuggingFace token, and update output parsing for WhisperX segment format with fallback for legacy Whisper output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent d4bca3a commit d8c5524
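
The parsing change described in the commit message can be pictured with a small sketch. The code files of this commit are not shown here, so everything below is an assumption for illustration: the `Segment` shape, the `normalizeOutput` name, and the guess that legacy output may arrive as a plain string or as an object with a `transcription` or `text` field.

```typescript
// Hypothetical normalized shape used by the UI (not taken from the commit).
interface Segment {
  start: number; // seconds
  end: number; // seconds
  text: string;
  speaker?: string; // only set when diarization produced a label
}

// Wraps a bare transcript string in a single untimed segment.
function singleSegment(text: string): Segment[] {
  const trimmed = text.trim();
  return trimmed === "" ? [] : [{ start: 0, end: 0, text: trimmed }];
}

// Accepts a segment-style payload first, then falls back to assumed legacy shapes.
function normalizeOutput(raw: unknown): Segment[] {
  if (typeof raw === "string") return singleSegment(raw);
  if (raw === null || typeof raw !== "object") return [];
  const obj = raw as Record<string, unknown>;
  const segments = obj.segments;

  if (Array.isArray(segments)) {
    const out: Segment[] = [];
    for (const item of segments as Array<Record<string, unknown>>) {
      if (item === null || typeof item !== "object") continue;
      const text = item.text;
      if (typeof text !== "string") continue; // skip malformed entries
      const start = item.start;
      const end = item.end;
      const speaker = item.speaker;
      const seg: Segment = {
        start: typeof start === "number" ? start : 0,
        end: typeof end === "number" ? end : 0,
        text: text.trim(),
      };
      if (typeof speaker === "string") seg.speaker = speaker;
      out.push(seg);
    }
    return out;
  }

  // Assumed legacy fallback: a single transcript string without timing.
  const transcription = obj.transcription;
  const text = obj.text;
  if (typeof transcription === "string") return singleSegment(transcription);
  if (typeof text === "string") return singleSegment(text);
  return [];
}
```

Keeping the fallback in one function means the rest of the UI only ever sees `Segment[]`, whichever model produced the output.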

4 files changed

Lines changed: 442 additions & 25 deletions

STUDIO_ROADMAP.md

Lines changed: 371 additions & 0 deletions
@@ -0,0 +1,371 @@
# Transcription Studio Roadmap

> **Vision**: Transform Transcription Studio into a world-class, professional-grade audio transcription workspace that rivals dedicated desktop applications.

---

## Table of Contents

1. [Phase 1: Foundation & Quick Wins](#phase-1-foundation--quick-wins)
2. [Phase 2: Professional Audio Experience](#phase-2-professional-audio-experience)
3. [Phase 3: Advanced Editing & Collaboration](#phase-3-advanced-editing--collaboration)
4. [Phase 4: AI-Powered Features](#phase-4-ai-powered-features)
5. [Phase 5: Enterprise & Scale](#phase-5-enterprise--scale)
6. [Technical Debt & Infrastructure](#technical-debt--infrastructure)

---

## Phase 1: Foundation & Quick Wins

**Timeline**: 1-2 weeks
**Goal**: Fix existing issues and establish a solid foundation

### 1.1 Standalone Studio Page

- [ ] Create `/studio` route as dedicated page (not just modal)
- [ ] Deep linking support with session ID (`/studio?session=abc123`)
- [ ] Browser history integration (back/forward navigation)
- [ ] SEO meta tags for the studio page
- [ ] Open Graph preview for shared links
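
As a rough sketch of the deep-linking item, the session ID can be read from and written to the URL with the standard `URL` API. The `session` parameter and `/studio` path come from the list above; the function names and the allowed-character rule are assumptions for illustration.

```typescript
// Returns the session ID from a studio URL, or null when absent or malformed.
// The character whitelist is an assumed validation rule, not a project decision.
function getSessionId(href: string): string | null {
  let url: URL;
  try {
    url = new URL(href, "http://localhost"); // base lets relative paths parse
  } catch {
    return null;
  }
  if (url.pathname !== "/studio") return null;
  const id = url.searchParams.get("session");
  return id !== null && /^[A-Za-z0-9_-]+$/.test(id) ? id : null;
}

// Builds a shareable deep link for a session.
function buildStudioLink(origin: string, sessionId: string): string {
  const url = new URL("/studio", origin);
  url.searchParams.set("session", sessionId);
  return url.toString();
}
```

In the browser, `getSessionId(window.location.href)` on page load and `history.pushState` with the result of `buildStudioLink` would cover both deep linking and back/forward navigation.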

### 1.2 Fix Existing Bugs

- [ ] **"Download All Formats" button** - Currently shows toast but does nothing
- [ ] **DOCX export** - Currently exports plain text, not real DOCX format
- [ ] **Audio URL persistence** - Improve localStorage handling for audio URLs
- [ ] **Dark mode inconsistencies** - Fix contrast issues in segments view
- [ ] **Native audio controls showing** - Hide native `<audio controls>` element

### 1.3 Mobile Responsiveness

- [ ] Responsive layout for tablets and phones
- [ ] Stacked layout on mobile (audio player → transcript → controls)
- [ ] Touch-friendly segment tapping
- [ ] Swipe gestures for navigation
- [ ] Bottom sheet for export options on mobile

### 1.4 Keyboard Shortcuts

| Shortcut | Action |
|----------|--------|
| `Space` | Play/Pause |
| `←` / `→` | Skip -5s / +5s |
| `Shift + ←` / `Shift + →` | Skip -30s / +30s |
| `↑` / `↓` | Volume up/down |
| `M` | Mute/Unmute |
| `Ctrl/Cmd + C` | Copy transcript |
| `Ctrl/Cmd + F` | Focus search |
| `Escape` | Close modal / Clear search |
| `1-9` | Jump to 10%-90% of audio |
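
One way to keep the playback shortcuts testable is to map key input to actions in a pure function and let a thin `keydown` listener apply the result. This is a minimal sketch assuming arrow keys for skip and volume; the `Action` type, the `resolveShortcut` name, and the 0.1 volume step are hypothetical. Ctrl/Cmd combinations and `Escape` are left to other handlers.

```typescript
type Action =
  | { type: "togglePlay" }
  | { type: "seekBy"; seconds: number }
  | { type: "volumeBy"; delta: number }
  | { type: "toggleMute" }
  | { type: "seekToFraction"; fraction: number };

// Minimal subset of KeyboardEvent, so the mapping runs without a DOM.
interface KeyInput {
  key: string;
  shiftKey: boolean;
  ctrlKey: boolean;
  metaKey: boolean;
}

function resolveShortcut(e: KeyInput): Action | null {
  if (e.ctrlKey || e.metaKey) return null; // copy/search combos handled elsewhere
  switch (e.key) {
    case " ":
      return { type: "togglePlay" };
    case "ArrowLeft":
      return { type: "seekBy", seconds: e.shiftKey ? -30 : -5 };
    case "ArrowRight":
      return { type: "seekBy", seconds: e.shiftKey ? 30 : 5 };
    case "ArrowUp":
      return { type: "volumeBy", delta: 0.1 }; // assumed step size
    case "ArrowDown":
      return { type: "volumeBy", delta: -0.1 };
    case "m":
    case "M":
      return { type: "toggleMute" };
  }
  // Digits 1-9 jump to 10%-90% of the audio.
  if (/^[1-9]$/.test(e.key)) {
    return { type: "seekToFraction", fraction: Number(e.key) / 10 };
  }
  return null;
}
```

A real listener would also skip events whose target is a text input, so typing in the search box does not toggle playback.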

### 1.5 Loading & Empty States

- [ ] Skeleton loaders for audio player
- [ ] Skeleton loaders for transcript segments
- [ ] Empty state illustrations
- [ ] Error state with retry options

---

## Phase 2: Professional Audio Experience

**Timeline**: 2-3 weeks
**Goal**: Create a best-in-class audio playback experience

### 2.1 Advanced Audio Player

- [ ] **Playback speed control** (0.5x, 0.75x, 1x, 1.25x, 1.5x, 2x)
- [ ] **Loop selection** - Loop a specific time range
- [ ] **A-B repeat** - Set start/end points for repetition
- [ ] **Pitch correction** - Maintain pitch at different speeds
- [ ] **Audio normalization** - Consistent volume levels
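
The speed presets and A-B repeat reduce to two small pure helpers. This is a sketch only: the helper names and the choice to fall back to 1x for an unknown speed are assumptions.

```typescript
// Presets from the roadmap item above.
const SPEEDS = [0.5, 0.75, 1, 1.25, 1.5, 2];

// Returns the next faster (+1) or slower (-1) preset, clamped at the ends.
function stepSpeed(current: number, direction: 1 | -1): number {
  const found = SPEEDS.indexOf(current);
  const base = found === -1 ? SPEEDS.indexOf(1) : found; // unknown speed: start from 1x
  const next = Math.min(SPEEDS.length - 1, Math.max(0, base + direction));
  return SPEEDS[next];
}

interface LoopRange {
  a: number; // loop start, seconds
  b: number; // loop end, seconds
}

// Given the current playback time, returns the time to seek to,
// or null when playback should simply continue.
function loopSeekTarget(time: number, loop: LoopRange | null): number | null {
  if (loop === null || loop.b <= loop.a) return null; // no loop or invalid range
  return time >= loop.b ? loop.a : null;
}
```

In the browser these would be applied through `audio.playbackRate`, with `audio.preservesPitch = true` to keep pitch stable, and `loopSeekTarget` called from a `timeupdate` listener that assigns `audio.currentTime` when it returns a number.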

### 2.2 Waveform Visualization

- [ ] Real-time waveform display using Web Audio API
- [ ] Zoomable waveform (pinch to zoom on mobile)
- [ ] Click-to-seek on waveform
- [ ] Segment regions highlighted on waveform
- [ ] Current position indicator
- [ ] Mini-map for long audio files
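
Drawing a waveform usually starts by reducing raw samples to one min/max pair per horizontal pixel. The sketch below shows that reduction on a plain `Float32Array`; the `computePeaks` name and `Peak` shape are assumptions, and zooming would amount to calling it again on a narrower sample range.

```typescript
interface Peak {
  min: number;
  max: number;
}

// Reduces PCM samples to `buckets` min/max pairs, one per drawn column.
function computePeaks(samples: Float32Array, buckets: number): Peak[] {
  const peaks: Peak[] = [];
  if (samples.length === 0) return peaks;
  // Never create more buckets than samples, and always at least one.
  const count = Math.max(1, Math.min(Math.floor(buckets), samples.length));
  const size = samples.length / count;
  for (let b = 0; b < count; b++) {
    const start = Math.floor(b * size);
    const end = Math.max(start + 1, Math.floor((b + 1) * size));
    let min = samples[start];
    let max = samples[start];
    for (let i = start + 1; i < end && i < samples.length; i++) {
      if (samples[i] < min) min = samples[i];
      if (samples[i] > max) max = samples[i];
    }
    peaks.push({ min, max });
  }
  return peaks;
}
```

With the Web Audio API, the input would typically come from `audioBuffer.getChannelData(0)` after `AudioContext.decodeAudioData`; for long files this loop is a natural candidate for a Web Worker.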

### 2.3 Segment Navigation

- [ ] Previous/Next segment buttons
- [ ] Segment list with jump-to functionality
- [ ] Auto-scroll transcript to current segment
- [ ] Segment bookmarking
- [ ] Quick navigation panel (timestamps sidebar)
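
Auto-scroll and previous/next navigation both need the index of the segment at the current playback time. A binary search keeps that lookup cheap on long transcripts. This sketch assumes segments are sorted by `start` and do not overlap; the names are hypothetical.

```typescript
interface TimedSegment {
  start: number; // seconds, inclusive
  end: number; // seconds, exclusive
}

// Returns the index of the segment containing `time`, or -1 if the time
// falls in a gap or outside the transcript.
function findActiveSegment(segments: TimedSegment[], time: number): number {
  let lo = 0;
  let hi = segments.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const seg = segments[mid];
    if (time < seg.start) {
      hi = mid - 1;
    } else if (time >= seg.end) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return -1;
}
```

The returned index can drive `element.scrollIntoView({ block: "nearest" })` on the matching transcript row, and previous/next buttons become `index - 1` and `index + 1`.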

### 2.4 Audio Quality Enhancements

- [ ] Noise reduction toggle (client-side)
- [ ] Bass/Treble equalizer
- [ ] Audio ducking for background music
- [ ] Stereo/Mono toggle

---

## Phase 3: Advanced Editing & Collaboration

**Timeline**: 3-4 weeks
**Goal**: Enable professional transcript editing workflows

### 3.1 Inline Transcript Editing

- [ ] Click-to-edit segment text
- [ ] Real-time character count
- [ ] Undo/Redo stack (Ctrl+Z / Ctrl+Y)
- [ ] Edit history with timestamps
- [ ] Diff view showing original vs edited
- [ ] Batch find & replace

### 3.2 Segment Management

- [ ] Split segments at cursor position
- [ ] Merge adjacent segments
- [ ] Adjust segment timestamps manually
- [ ] Delete segments
- [ ] Add new segments
- [ ] Drag-and-drop segment reordering
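
Splitting and merging can be written as pure functions over segment objects. In this sketch the split time is interpolated linearly from the cursor's character offset, which is an assumption; word-level timestamps, where available, would give a more accurate boundary. All names are hypothetical.

```typescript
interface EditableSegment {
  start: number; // seconds
  end: number; // seconds
  text: string;
  speaker?: string;
}

// Splits a segment at a character offset inside its text.
function splitSegment(
  seg: EditableSegment,
  offset: number
): [EditableSegment, EditableSegment] {
  if (offset <= 0 || offset >= seg.text.length) {
    throw new RangeError("offset must fall inside the segment text");
  }
  // Assumed heuristic: time advances in proportion to characters.
  const ratio = offset / seg.text.length;
  const boundary = seg.start + (seg.end - seg.start) * ratio;
  return [
    { ...seg, end: boundary, text: seg.text.slice(0, offset).trim() },
    { ...seg, start: boundary, text: seg.text.slice(offset).trim() },
  ];
}

// Merges two adjacent segments; the first segment's speaker is kept.
function mergeSegments(a: EditableSegment, b: EditableSegment): EditableSegment {
  return { ...a, end: b.end, text: (a.text + " " + b.text).trim() };
}
```

Because both functions return new objects instead of mutating their inputs, each result can be pushed directly onto an undo/redo stack.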

### 3.3 Speaker Diarization UI

- [ ] Visual speaker labels (Speaker 1, Speaker 2, etc.)
- [ ] Custom speaker names (editable)
- [ ] Color-coded speakers throughout transcript
- [ ] Speaker timeline view
- [ ] Filter transcript by speaker
- [ ] Speaker statistics (word count, speaking time)
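
Speaker statistics follow directly from diarized segments. A minimal sketch, assuming each segment carries an optional `speaker` label and that unlabeled segments are grouped under "Unknown" (both names are illustrative):

```typescript
interface SpokenSegment {
  start: number; // seconds
  end: number; // seconds
  text: string;
  speaker?: string;
}

interface SpeakerStats {
  words: number;
  seconds: number;
}

// Aggregates word count and speaking time per speaker label.
function speakerStats(segments: SpokenSegment[]): Record<string, SpeakerStats> {
  // Null-prototype object so labels like "constructor" cannot collide.
  const stats: Record<string, SpeakerStats> = Object.create(null);
  for (const seg of segments) {
    const name = seg.speaker ?? "Unknown";
    const entry = stats[name] ?? (stats[name] = { words: 0, seconds: 0 });
    const trimmed = seg.text.trim();
    entry.words += trimmed === "" ? 0 : trimmed.split(/\s+/).length;
    entry.seconds += Math.max(0, seg.end - seg.start);
  }
  return stats;
}
```

The same per-speaker grouping can back the filter-by-speaker view and the speaker timeline.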

### 3.4 Annotations & Comments

- [ ] Add notes to specific timestamps
- [ ] Highlight important sections
- [ ] Tag segments (e.g., "action item", "question", "decision")
- [ ] Export annotations separately
- [ ] Comment threads on segments

### 3.5 Version Control

- [ ] Auto-save drafts to IndexedDB
- [ ] Version history with restore
- [ ] Compare versions side-by-side
- [ ] Export specific versions

---

## Phase 4: AI-Powered Features

**Timeline**: 4-6 weeks
**Goal**: Leverage AI to add intelligent features

### 4.1 Smart Summarization

- [ ] One-click transcript summary
- [ ] Key points extraction
- [ ] Action items detection
- [ ] Meeting minutes generation
- [ ] Custom summary length (brief/detailed)

### 4.2 Translation

- [ ] Translate transcript to 50+ languages
- [ ] Side-by-side original + translation view
- [ ] Export translated versions
- [ ] Auto-detect source language

### 4.3 Intelligent Search

- [ ] Semantic search (find by meaning, not just keywords)
- [ ] "Find similar segments"
- [ ] Question answering ("What did they say about X?")
- [ ] Topic clustering

### 4.4 Auto-Correction

- [ ] Grammar and spelling suggestions
- [ ] Punctuation improvement
- [ ] Filler word removal (um, uh, like)
- [ ] Sentence boundary detection
- [ ] Proper noun capitalization
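
Filler word removal does not necessarily need a model for the simplest cases. The sketch below is a regex-only baseline and an assumption about scope: it strips standalone "um" and "uh" (including elongated forms) and tidies spacing, but deliberately leaves "like" alone, since telling the filler from the verb needs context.

```typescript
// Removes standalone "um"/"uh" fillers and the punctuation that trails them.
function removeFillers(text: string): string {
  return text
    .replace(/\b(?:u+m+|u+h+)\b[,.]?\s*/gi, "") // filler plus optional comma/period
    .replace(/\s+([,.!?])/g, "$1") // no space before punctuation
    .replace(/\s{2,}/g, " ") // collapse repeated whitespace
    .trim();
}
```

Note that a sentence that began with a filler will start in lowercase afterwards, so this step would pair naturally with the sentence boundary and capitalization items above.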

### 4.5 Content Analysis

- [ ] Sentiment analysis per segment
- [ ] Topic detection and tagging
- [ ] Named entity recognition (people, places, organizations)
- [ ] Keyword extraction
- [ ] Word cloud generation

### 4.6 Voice Commands

- [ ] "Play", "Pause", "Skip forward"
- [ ] "Go to minute 5"
- [ ] "Find [keyword]"
- [ ] "Summarize this"

---

## Phase 5: Enterprise & Scale

**Timeline**: 6-8 weeks
**Goal**: Features for teams and power users

### 5.1 User Accounts & Cloud Sync

- [ ] User authentication (OAuth, email/password)
- [ ] Cloud storage for transcriptions
- [ ] Sync across devices
- [ ] Transcription history dashboard
- [ ] Usage analytics

### 5.2 Team Collaboration

- [ ] Shared workspaces
- [ ] Real-time collaborative editing
- [ ] Role-based permissions (viewer, editor, admin)
- [ ] Assignment and task tracking
- [ ] Activity feed

### 5.3 Batch Processing

- [ ] Upload multiple files at once
- [ ] Queue management
- [ ] Bulk export
- [ ] Folder organization
- [ ] Batch operations (delete, move, tag)

### 5.4 Integrations

- [ ] **Google Drive** - Import/export
- [ ] **Dropbox** - Import/export
- [ ] **Notion** - Export as page
- [ ] **Slack** - Share transcripts
- [ ] **Zapier/Make** - Automation workflows
- [ ] **Zoom/Teams/Meet** - Direct recording import
- [ ] **YouTube** - Transcribe from URL
- [ ] **Podcast RSS** - Batch transcribe episodes

### 5.5 API & Webhooks

- [ ] Public REST API for transcriptions
- [ ] Webhook notifications (transcription complete, etc.)
- [ ] API key management
- [ ] Rate limiting dashboard
- [ ] SDK for common languages

### 5.6 Advanced Export Options

- [ ] **PDF** - Professional formatted document with timestamps
- [ ] **DOCX** - Proper Word document with styles
- [ ] **SRT/VTT** - Subtitle formats (already implemented)
- [ ] **JSON** - Full data export with all metadata
- [ ] **XML** - Structured export
- [ ] **EDL** - Edit Decision List for video editors
- [ ] **Markdown** - With timestamps and speaker labels
- [ ] **HTML** - Interactive web page
- [ ] **CSV** - Spreadsheet format
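
Most of the timed export formats share the same core: format each segment's start and end, then emit the text. As a reference point, here is a sketch of an SRT serializer. It is an illustration only, not the project's existing SRT/VTT implementation, and the names are hypothetical.

```typescript
interface CueSegment {
  start: number; // seconds
  end: number; // seconds
  text: string;
}

// Left-pads a non-negative integer with zeros.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.max(0, Math.round(seconds * 1000));
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "," + pad(ms % 1000, 3);
}

// Serializes segments as numbered SRT cues separated by blank lines.
function toSrt(segments: CueSegment[]): string {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`
    )
    .join("\n");
}
```

A WebVTT variant differs mainly in starting with a `WEBVTT` header line and using `.` instead of `,` before the milliseconds.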

---

## Technical Debt & Infrastructure

### Performance Optimizations

- [ ] Virtualized segment list for long transcripts (react-window)
- [ ] Lazy load audio waveform
- [ ] Web Workers for audio processing
- [ ] Service Worker for offline support
- [ ] Optimize bundle size (code splitting)

### Code Quality

- [ ] Extract AudioPlayer into reusable component
- [ ] Create custom hooks for audio state management
- [ ] Add comprehensive unit tests
- [ ] Add E2E tests with Playwright
- [ ] Storybook for component documentation

### Accessibility (a11y)

- [ ] Full keyboard navigation
- [ ] Screen reader support (ARIA labels)
- [ ] High contrast mode
- [ ] Reduced motion support
- [ ] Focus indicators

### Internationalization (i18n)

- [ ] UI translation support
- [ ] RTL language support
- [ ] Locale-aware formatting (dates, numbers)

---

## Success Metrics

| Metric | Target |
|--------|--------|
| Time to first transcription | < 30 seconds |
| Studio load time | < 2 seconds |
| Mobile usability score | > 90 |
| Lighthouse performance | > 90 |
| User satisfaction (NPS) | > 50 |
| Export success rate | > 99% |
| Audio playback reliability | > 99.5% |

---

## Priority Matrix

```
                      HIGH IMPACT

       ┌───────────────────┼───────────────────┐
       │                   │                   │
       │ • Standalone      │ • Waveform        │
       │   page            │ • AI Summary      │
       │ • Mobile          │ • Collaboration   │
       │ • Keyboard        │ • Cloud sync      │
       │   shortcuts       │                   │
       │ • Fix exports     │                   │
   LOW ├───────────────────┼───────────────────┤ HIGH
EFFORT │                   │                   │ EFFORT
       │                   │                   │
       │ • Dark mode       │ • Voice commands  │
       │   fixes           │ • Video editor    │
       │ • Loading         │   integration     │
       │   states          │ • Real-time       │
       │                   │   collab          │
       │                   │                   │
       └───────────────────┼───────────────────┘

                      LOW IMPACT
```

---

## Getting Started

**Recommended order of implementation:**

1. **Week 1-2**: Phase 1 (Foundation) - Fix bugs, add standalone page, keyboard shortcuts
2. **Week 3-4**: Phase 2.1-2.2 (Audio) - Playback speed, waveform visualization
3. **Week 5-6**: Phase 3.1-3.3 (Editing) - Inline editing, speaker diarization
4. **Week 7-8**: Phase 4.1-4.2 (AI) - Summarization, translation
5. **Week 9+**: Phase 5 (Enterprise) - Based on user feedback and demand

---

## Notes

- All features should maintain backward compatibility
- Progressive enhancement - basic functionality works without JS
- Privacy-first - no data sent to servers without explicit consent
- Offline-capable where possible
- Mobile-first responsive design

---

*Last updated: January 2026*
