TabVision

⚠️ HISTORICAL — NOT CANONICAL (deprecated 2026-06-01). This is the original v0 product spec for the Electron + Flask desktop app. The canonical specification is SPEC.md (the v1 Python-CLI spec with the §8 contracts and §1.4 acceptance gate). This file is retained only for historical reference and is linked from AUDIT.md / README.md. Do not treat anything below as a current requirement.

Automatic Guitar Tab Transcription from Video

A desktop application that analyzes video recordings of guitar playing and generates accurate tablature by combining audio pitch detection with visual finger tracking.

Overview
Core Features
System Architecture
Tech Stack
Data Models
Phased Build Plan
Technical Risks & Mitigations

Overview

The Problem

Writing guitar tabs manually is tedious. Existing transcription tools either rely solely on audio (which can't distinguish between the same note played at different fret positions) or require expensive professional software.

The Solution

TabVision uses a multi-modal approach:

Audio analysis detects which pitches are being played and when
Video analysis confirms which fret/string position was actually used
Fusion engine combines both signals for accurate transcription with confidence scoring

Target User

Guitarists who want to transcribe their own playing quickly and accurately.

Core Features

Input

Upload video files (MP4, MOV)
In-app webcam recording
Maximum duration: ~5 minutes
Requirements: Guitar neck visible and roughly centered in frame, horizontal orientation

Processing

Cloud-based async processing
Progress tracking with status updates
Stages: audio extraction → pitch detection → video analysis → fusion

Output

Interactive tab editor synced to video playback
Confidence highlighting (green/yellow/red)
Direct fret number editing for corrections
Export to plain text (Ultimate Guitar format) and PDF
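
As an illustration of the plain-text export, the sketch below renders notes into the six-line ASCII layout commonly used for text tabs. The spec does not define the exact output format, so the function name `render_ascii_tab`, the `(timestamp, string, fret)` tuple shape, and the one-column-per-onset spacing are all assumptions made for this example.

```python
# Row labels keyed by string number (1 = high E ... 6 = low E).
STRING_LABELS = {1: "e", 2: "B", 3: "G", 4: "D", 5: "A", 6: "E"}


def render_ascii_tab(notes):
    """Render (timestamp, string, fret) tuples as six lines of ASCII tab.

    Notes that share a timestamp are stacked in the same column.
    Spacing does not reflect note duration (a simplifying assumption).
    """
    notes = list(notes)
    timestamps = sorted({t for t, _, _ in notes})
    rows = {string: [] for string in STRING_LABELS}
    for t in timestamps:
        # Fret text for every string sounding at this timestamp.
        column = {s: str(f) for ts, s, f in notes if ts == t}
        width = max(len(cell) for cell in column.values())
        for string in rows:
            # Pad with dashes so every row keeps the same column width.
            rows[string].append(column.get(string, "").ljust(width, "-"))
    return "\n".join(
        f"{STRING_LABELS[string]}|-" + "-".join(rows[string]) + "-|"
        for string in sorted(rows)
    )


print(render_ascii_tab([(0.0, 6, 0), (0.5, 5, 2), (1.0, 4, 2), (1.0, 1, 12)]))
```

Muted notes pass through unchanged because the fret value `"X"` is rendered with `str()`.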

Assumptions

Standard tuning (EADGBE)
Clean guitar audio (no backing track/vocals)
Right-handed playing (left-handed support planned for later)
User specifies capo position if applicable

System Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              ELECTRON APP                                   │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                         React Frontend                               │   │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐   │   │
│  │  │ Video Input  │  │ Tab Editor   │  │ Export Panel             │   │   │
│  │  │ - Webcam     │  │ - Synced     │  │ - Plain text             │   │   │
│  │  │ - Upload     │  │   playback   │  │ - PDF                    │   │   │
│  │  │ - Preview    │  │ - Confidence │  │                          │   │   │
│  │  │              │  │   colors     │  │                          │   │   │
│  │  │              │  │ - Direct     │  │                          │   │   │
│  │  │              │  │   editing    │  │                          │   │   │
│  │  └──────────────┘  └──────────────┘  └──────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    │                                        │
│                         Electron Main Process                               │
│                         - File system access                                │
│                         - Video encoding (ffmpeg)                           │
│                         - API communication                                 │
└────────────────────────────────────┬────────────────────────────────────────┘
                                     │ HTTPS
                                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              CLOUD BACKEND (Modal)                          │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                         Flask API                                    │   │
│  │  POST /jobs         - Upload video, create job                       │   │
│  │  GET  /jobs/:id     - Poll job status                                │   │
│  │  GET  /jobs/:id/result - Download completed tab data                 │   │
│  └──────────────────────────────────┬──────────────────────────────────┘   │
│                                     │                                       │
│  ┌──────────────────────────────────▼──────────────────────────────────┐   │
│  │                      Processing Pipeline                             │   │
│  │                                                                      │   │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │   │
│  │  │ Video/Audio │    │ Audio       │    │ Video                   │  │   │
│  │  │ Splitter    │───▶│ Pipeline    │    │ Pipeline                │  │   │
│  │  │ (ffmpeg)    │    │             │    │                         │  │   │
│  │  └─────────────┘    │ Basic Pitch │    │ MediaPipe Hands         │  │   │
│  │        │            │ ──────────▶ │    │ ──────────────────────▶ │  │   │
│  │        │            │ Pitch/onset │    │ Finger landmarks        │  │   │
│  │        │            │ detection   │    │                         │  │   │
│  │        │            └──────┬──────┘    │ Fretboard detection     │  │   │
│  │        │                   │           │ ──────────────────────▶ │  │   │
│  │        └───────────────────┼──────────▶│ Fret geometry mapping   │  │   │
│  │                            │           └───────────┬─────────────┘  │   │
│  │                            │                       │                │   │
│  │                            ▼                       ▼                │   │
│  │                     ┌──────────────────────────────────────┐        │   │
│  │                     │         Fusion Engine                │        │   │
│  │                     │  - Match pitches to fret positions   │        │   │
│  │                     │  - Resolve ambiguities               │        │   │
│  │                     │  - Calculate confidence scores       │        │   │
│  │                     │  - Handle open strings (fret 0)      │        │   │
│  │                     │  - Detect muted notes (X)            │        │   │
│  │                     └──────────────────┬───────────────────┘        │   │
│  │                                        │                            │   │
│  │                                        ▼                            │   │
│  │                              TabDocument JSON                       │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  Storage                                                             │   │
│  │  - Job queue (Redis or SQLite for MVP)                               │   │
│  │  - Video file storage (local disk or S3)                             │   │
│  │  - Results cache                                                     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

Pipeline Details

Audio Pipeline

Extract audio track from video using ffmpeg
Run through Basic Pitch (Spotify's polyphonic pitch detection)
Get list of (timestamp, MIDI note, confidence) tuples
Map each MIDI note to candidate fret/string positions
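
The last step, mapping a MIDI note to candidate positions, can be sketched as follows. The open-string MIDI numbers for standard tuning are standard values (E2 = 40 up to E4 = 64). The function name `candidate_positions`, the default `max_fret=22`, and counting frets from the nut even when a capo is present are assumptions made for this example.

```python
# Open-string MIDI notes in standard tuning, keyed by string number
# (1 = high E ... 6 = low E, matching the TabNote convention).
OPEN_STRING_MIDI = {1: 64, 2: 59, 3: 55, 4: 50, 5: 45, 6: 40}


def candidate_positions(midi_note, capo_fret=0, max_fret=22):
    """Return every (string, fret) pair that sounds `midi_note`.

    Frets are counted from the nut. Positions behind the capo are
    unplayable and dropped. max_fret=22 is an assumed fretboard length.
    """
    candidates = []
    for string, open_note in OPEN_STRING_MIDI.items():
        fret = midi_note - open_note
        if capo_fret <= fret <= max_fret:
            candidates.append((string, fret))
    return candidates


# E4 (MIDI 64) can be played on five different strings.
print(candidate_positions(64))
```

The length of the returned list is a direct measure of how ambiguous a pitch is from audio alone, which is what the video pipeline is meant to resolve.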

Video Pipeline

Extract frames at audio onset timestamps (not every frame)
Run MediaPipe Hands to get finger landmark positions
Detect fretboard geometry (one-time at video start, with minor tracking corrections):
- Edge detection to find neck outline
- Fret wire detection (vertical lines)
- Perspective correction to normalize coordinate space
Map finger landmarks to fret/string positions using detected geometry
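
A minimal sketch of the final mapping step, assuming the fretboard has already been detected and perspective-corrected into a normalized coordinate space. The names `locate_finger` and `equal_tempered_fret_xs`, the axis conventions, and the evenly spaced strings are assumptions for illustration. In practice `fret_xs` would come from the detected fret wires; the equal-tempered helper only generates synthetic geometry for testing, using the standard rule that fret n sits at a fraction 1 - 2^(-n/12) of the scale length from the nut.

```python
from bisect import bisect_left


def equal_tempered_fret_xs(n_frets=12):
    """Nut and fret-wire positions as fractions of the scale length (nut = 0.0)."""
    return [1 - 2 ** (-n / 12) for n in range(n_frets + 1)]


def locate_finger(x, y, fret_xs, n_strings=6):
    """Map a fingertip in rectified neck coordinates to (string, fret).

    x runs along the neck starting at the nut; y runs across the neck,
    with 0.0 at string 1 and 1.0 at string n_strings (assumed convention).
    Returns None when the point lies outside the detected fretboard.
    """
    if not (fret_xs[0] < x <= fret_xs[-1]) or not (0.0 <= y <= 1.0):
        return None
    # A fingertip between wire k-1 and wire k is fretting fret k.
    fret = bisect_left(fret_xs, x)
    # Snap to the nearest of n_strings evenly spaced strings.
    string = int(y * (n_strings - 1) + 0.5) + 1
    return string, fret


fret_xs = equal_tempered_fret_xs(12)
print(locate_finger(0.30, 0.42, fret_xs))
```

A MediaPipe fingertip landmark would first need to be transformed into this rectified space; that transform is outside the scope of the sketch.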

Fusion Engine

For each detected pitch from audio:
- Get candidate fret/string positions (same note can be played multiple places)
- Get observed finger position from video at that timestamp
- Match audio candidate to video observation
Handle edge cases:
- Open strings: pitch detected but no finger → fret 0
- Muted notes: finger detected but no pitch → mark as X
Calculate confidence:
- High (green): audio and video agree clearly
- Medium (yellow): plausible but some ambiguity
- Low (red): conflict between audio/video or missing data
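
The matching rules above can be sketched as a single function. This is an illustration under stated assumptions, not the project's fusion logic: the name `fuse_note`, the string-plus-fret distance metric, and every numeric weight (1.0, 0.7, 0.4, 0.5, 0.3) are hypothetical placeholders.

```python
def fuse_note(candidates, observed, audio_confidence):
    """Pick a (string, fret) for one onset and score it.

    candidates: (string, fret) pairs that produce the detected pitch;
                empty when no pitch was detected.
    observed:   (string, fret) seen in the video, or None if no finger.
    Returns (string, fret, confidence), or None when there is nothing
    to report. All weights below are illustrative assumptions.
    """
    if not candidates:
        if observed is None:
            return None
        # Muted note: finger detected but no pitch.
        return observed[0], "X", 0.3
    if observed is None:
        # Open string: pitch detected but no finger.
        open_strings = [c for c in candidates if c[1] == 0]
        if open_strings:
            return (*open_strings[0], audio_confidence)
        # Missing video data: fall back to the lowest fret, penalized.
        best = min(candidates, key=lambda c: c[1])
        return (*best, audio_confidence * 0.5)

    def distance(candidate):
        return abs(candidate[0] - observed[0]) + abs(candidate[1] - observed[1])

    best = min(candidates, key=distance)
    # Exact agreement keeps full confidence; larger mismatches lose more.
    agreement = {0: 1.0, 1: 0.7}.get(distance(best), 0.4)
    return (*best, audio_confidence * agreement)


print(fuse_note([(1, 0), (2, 5), (3, 9)], (2, 5), 0.9))
```

The audio candidate always wins on pitch; the video observation only selects among candidates and adjusts the score, which matches the mitigation of treating audio as the primary source when video is ambiguous.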

Tech Stack

Frontend (Electron + React)

Purpose	Library	Notes
Framework	React 18	Industry standard, large ecosystem
State management	Zustand	Simpler than Redux, good for complex state
Video player	Custom HTML5 video	Need fine-grained control for timestamp syncing
Tab rendering	Canvas or SVG	Pixel-level control for click targets
Styling	Tailwind CSS	Fast iteration
PDF export	jsPDF or pdfmake	Client-side generation
Webcam	navigator.mediaDevices.getUserMedia	Web API available in Electron's renderer process
Video encoding	ffmpeg-static	npm package that ships an ffmpeg binary; bundled with the app

Backend (Python + Flask)

Purpose	Library	Notes
Framework	Flask	Simple, lightweight
Async jobs	RQ (Redis Queue) or Celery	Background processing
Audio ML	basic-pitch	Spotify's polyphonic pitch detection
Video ML	mediapipe	Google's hand tracking
Video processing	opencv-python	Frame extraction, image processing
Audio extraction	ffmpeg-python or pydub	Split audio from video
Fretboard geometry	numpy, opencv	Edge detection, perspective transforms
Storage	SQLite (MVP) → PostgreSQL	Job tracking
File storage	Local disk (MVP) → S3/R2	Video uploads

Infrastructure

Purpose	Choice	Notes
Hosting	Modal	GPU-capable Flask API and background processing
File storage (scale)	Cloudflare R2 or AWS S3	When needed
Domain	Any registrar	Namecheap, Cloudflare, etc.

Data Models

TabDocument

The core data structure returned by the backend and manipulated by the frontend editor.

interface TabDocument {
  id: string;
  createdAt: string;              // ISO timestamp
  duration: number;               // video duration in seconds
  capoFret: number;               // 0 = no capo
  tuning: string[];               // ["E", "A", "D", "G", "B", "E"] for standard, low to high (string 6 to string 1)
  
  notes: TabNote[];
}

interface TabNote {
  id: string;
  timestamp: number;              // seconds from video start
  
  // What was detected
  string: 1 | 2 | 3 | 4 | 5 | 6;  // 1 = high E, 6 = low E
  fret: number | "X";             // 0 = open, "X" = muted
  
  // Confidence scoring
  confidence: number;             // 0.0 - 1.0
  confidenceLevel: "high" | "medium" | "low";  // high: >0.8, medium: 0.5-0.8, low: <0.5
  
  // Editor state
  isEdited: boolean;
  originalFret?: number | "X";    // preserve original if user edits
  
  // Debug data (optional)
  detectedPitch?: number;         // Hz
  detectedMidiNote?: number;      // MIDI note number
}
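
On the backend side, the `confidenceLevel` buckets could be derived from the numeric score as sketched below. The thresholds come from the comment on `confidenceLevel`; treating exactly 0.8 as "medium" and exactly 0.5 as "medium" is an assumption, as are the names `confidence_level` and `LEVEL_COLORS`.

```python
# Editor highlight color for each bucket (green/yellow/red per the spec).
LEVEL_COLORS = {"high": "green", "medium": "yellow", "low": "red"}


def confidence_level(confidence):
    """Bucket a 0.0-1.0 confidence score into high / medium / low."""
    if confidence > 0.8:
        return "high"
    if confidence >= 0.5:
        return "medium"
    return "low"


print(confidence_level(0.65), LEVEL_COLORS[confidence_level(0.65)])
```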

Job (Backend)

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Job:
    id: str                       # UUID
    status: str                   # pending | processing | completed | failed
    created_at: datetime
    updated_at: datetime
    
    # Input
    video_path: str               # uploaded file location
    capo_fret: int                # 0 = no capo
    
    # Progress tracking
    progress: float               # 0.0 - 1.0
    current_stage: str            # uploading | extracting_audio | 
                                  # analyzing_audio | analyzing_video | 
                                  # fusing | complete
    
    # Output
    result_path: str | None       # TabDocument JSON path when complete
    error_message: str | None     # if failed

API Endpoints

POST /jobs
  Body: multipart/form-data with video file + capo_fret
  Returns: { job_id: string }

GET /jobs/:id
  Returns: { 
    status: string, 
    progress: float, 
    current_stage: string,
    error_message?: string 
  }

GET /jobs/:id/result
  Returns: TabDocument JSON (only when status == "completed")
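
A client would typically call `GET /jobs/:id` repeatedly until the job reaches a terminal status, then fetch the result. The sketch below shows that loop with the HTTP call injected as a callable so it runs without a server. The names `poll_job` and `fetch_status`, and the default interval and timeout, are assumptions for this example.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def poll_job(fetch_status, job_id, interval=2.0, timeout=600.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status(job_id) until the job completes or fails.

    fetch_status is any callable returning the GET /jobs/:id payload
    as a dict. Raises TimeoutError if no terminal status is seen
    within `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        payload = fetch_status(job_id)
        if payload["status"] in TERMINAL_STATUSES:
            return payload
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {payload['status']}")
        sleep(interval)


# Simulated server responses instead of real HTTP calls.
fake = iter([
    {"status": "pending", "progress": 0.0, "current_stage": "uploading"},
    {"status": "processing", "progress": 0.5, "current_stage": "analyzing_audio"},
    {"status": "completed", "progress": 1.0, "current_stage": "complete"},
])
print(poll_job(lambda job_id: next(fake), "demo", sleep=lambda s: None))
```

Only when the returned status is "completed" should the client request `GET /jobs/:id/result`; on "failed" the payload's `error_message` is the thing to surface.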

Phased Build Plan

Phase 0: Skeleton (Week 1)

Goal: End-to-end hello world: upload a video, get a dummy response back.

Electron + React boilerplate with hot reload working
Basic UI: upload button, status display, placeholder editor area
Flask backend with /jobs endpoints (returns fake data, no real processing)
File upload flow working
Deploy backend to Modal
Verify Electron app can communicate with deployed backend

Deliverable: Upload a video → see fake TabDocument rendered.

Phase 1: Audio Pipeline (Weeks 2-3)

Goal: Given a video, extract pitches and timestamps.

ffmpeg integration: extract audio track from uploaded video
Basic Pitch integration: process audio, get pitch/onset data
Build MIDI-to-guitar mapping: for each detected note, list candidate fret/string positions
Implement "best guess" fingering heuristic (prefer lower positions, common chord shapes)
Output preliminary TabDocument based on audio alone
Display results in frontend (read-only, no editing yet)

Deliverable: Upload video → see tab output based purely on audio analysis.
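
One way the "best guess" heuristic from this phase could look is sketched below: a greedy pass that prefers low frets and penalizes large jumps from the previously fretted position. The name `choose_fingering`, the cost formula, and the `jump_weight` parameter are assumptions; chord-shape awareness is not modeled here.

```python
def choose_fingering(candidates_per_note, jump_weight=1.0):
    """Greedily pick one (string, fret) per note from its candidates.

    Cost = fret + jump_weight * |fret - previous fretted position|.
    Open strings cost nothing to reach and do not move the hand.
    Notes with no candidates are skipped.
    """
    chosen = []
    previous_fret = None
    for candidates in candidates_per_note:
        if not candidates:
            continue

        def cost(position):
            _, fret = position
            if previous_fret is None or fret == 0:
                return fret
            return fret + jump_weight * abs(fret - previous_fret)

        best = min(candidates, key=cost)
        chosen.append(best)
        if best[1] != 0:
            previous_fret = best[1]
    return chosen


# After playing at fret 12, staying near fret 8 beats jumping down to fret 3.
print(choose_fingering([[(3, 12)], [(1, 3), (2, 8)]], jump_weight=2.0))
```

A greedy pass can paint itself into a corner on longer phrases; a dynamic-programming search over the same cost would be the natural next step if that becomes a problem.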

Phase 2: Video Pipeline (Weeks 4-6)

Goal: Detect fretboard geometry and finger positions.

Deliverable: Given a video frame, output "finger detected at fret N, string M area" with reasonable accuracy.

Phase 3: Fusion (Weeks 7-8)

Goal: Combine audio and video data for accurate transcription.

Deliverable: Accurate tab output with confidence scores for each note.

Phase 4: Editor UI (Weeks 9-10)

Goal: Interactive tab editor with video synchronization.

Deliverable: Full editing workflow with video sync.

Phase 5: Recording & Export (Weeks 11-12)

Goal: Complete feature set for v1.

Deliverable: Feature-complete MVP.

Phase 6: Polish & Portfolio-Ready (Weeks 13-14)

Goal: Make it presentation-worthy.

Deliverable: Portfolio-ready project with documentation.

Technical Risks & Mitigations

Risk	Severity	Likelihood	Mitigation Strategy
Fretboard detection fails on varied lighting/angles	High	Medium	Require consistent camera setup for MVP. Add "draw box around fretboard" fallback if automatic detection fails. Consider calibration step.
Basic Pitch struggles with guitar timbre	Medium	Low	Test early with real guitar recordings. Preprocess audio (noise gate, EQ). Evaluate alternatives (Omnizart, CREPE) if needed.
MediaPipe loses tracking when fingers overlap	Medium	Medium	Use audio as primary source when video is ambiguous. Flag as low confidence rather than guessing.
Video/audio sync drift over long recordings	Medium	Low	Use video's embedded audio track, not separate recording. Sync to detected onsets, not wall clock time. Re-sync periodically.
Processing time too long for good UX	Medium	Medium	Process only frames near audio onsets, not every frame. Downsample video resolution. Show granular progress updates.
Large video uploads fail or timeout	Low	Medium	Chunked upload with resume capability. Client-side compression before upload. Enforce 5-minute limit strictly.
Electron app bundle too large	Low	High	Accept ~150-200MB as reasonable tradeoff. Optimize with electron-builder if needed.
Polyphonic passages (fast arpeggios) overwhelm system	Medium	Medium	Increase frame analysis rate for dense passages. Accept lower confidence on very fast playing.
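
The "process only frames near audio onsets" mitigation amounts to converting onset times into a small set of frame indices. A minimal sketch, assuming a constant frame rate; the name `frames_near_onsets` and the `window` parameter (frames kept on each side of an onset) are hypothetical.

```python
def frames_near_onsets(onsets, fps, window=1):
    """Return sorted, de-duplicated frame indices around each onset.

    onsets: onset times in seconds; fps: video frame rate;
    window: number of frames to keep on each side of the onset frame.
    """
    frames = set()
    for t in onsets:
        center = round(t * fps)
        for index in range(max(0, center - window), center + window + 1):
            frames.add(index)
    return sorted(frames)


# Two onsets in a 30 fps clip: 5 frames analyzed instead of 30+.
print(frames_near_onsets([0.0, 1.0], fps=30))
```

Raising `window` for dense passages is one way to realize the "increase frame analysis rate" mitigation without decoding the whole video.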

Future Enhancements (Post-MVP)

Left-handed player support (video mirroring)
Custom tuning support
Multi-track support (rhythm + lead)
Playback audio synthesis
Direct Ultimate Guitar upload integration (if API becomes available)
Mobile app version
Batch processing for multiple videos
Collaborative editing / sharing
Training custom ML models on user-corrected data

Development Notes

Local Development Setup

# Frontend (Electron + React)
cd tabvision-client
npm install
npm run dev

# Backend (Flask)
cd tabvision-server
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
flask run

Key Dependencies to Install

# Backend
pip install flask basic-pitch mediapipe opencv-python ffmpeg-python numpy

# Frontend
npm install react react-dom zustand
npm install --save-dev electron electron-builder tailwindcss  # electron-builder for packaging

Testing Strategy

Unit tests: Fusion logic, fret mapping calculations
Integration tests: Full pipeline with sample videos
Manual testing: Various guitars, lighting conditions, playing styles
Build test dataset: Record yourself playing known passages, verify output accuracy
