# Audio Model — Acceptance Criteria

Each feature below defines the exact input→output contract. A test passes **only** if the output matches the expectation precisely.

---

## Phase 1 — Audio Input Pipeline

### Feature 1: `--audio` CLI flag accepted
- **Input**: Launch SwiftLM with the `--audio` flag
- **Expected**: Flag is parsed without error; server starts (may warn "no audio model loaded" if no model is specified)
- **FAIL if**: Flag causes an argument-parsing error or crash

### Feature 2: Base64 WAV data URI extraction
- **Input**: Message content part with `{"type": "input_audio", "input_audio": {"data": "<base64-wav>", "format": "wav"}}`
- **Expected**: `extractAudio()` returns valid PCM sample data
- **FAIL if**: Returns nil, crashes, or silently ignores the audio part

### Feature 3: WAV header parsing
- **Input**: 16-bit, 16kHz, mono WAV file (44-byte header + PCM data)
- **Expected**: Parser extracts: `sampleRate=16000`, `channels=1`, `bitsPerSample=16`, `dataOffset=44`
- **FAIL if**: Any header field is wrong, or parser crashes on valid WAV
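
The byte offsets follow the canonical 44-byte RIFF layout (channels at offset 22, sample rate at 24, bits per sample at 34). A Python sketch of the parse that a fixture test can run against a synthesized header:

```python
import struct

def parse_wav_header(data: bytes) -> dict:
    """Parse a canonical 44-byte PCM WAV header (RIFF / fmt / data chunks)."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    # "<HI" at offset 22: channels (u16), then sampleRate (u32)
    channels, sample_rate = struct.unpack_from("<HI", data, 22)
    bits_per_sample, = struct.unpack_from("<H", data, 34)
    return {"sampleRate": sample_rate, "channels": channels,
            "bitsPerSample": bits_per_sample, "dataOffset": 44}

# Synthesize the exact fixture from Feature 3: 16-bit, 16 kHz, mono.
hdr = (b"RIFF" + struct.pack("<I", 36 + 32000) + b"WAVE"
       + b"fmt " + struct.pack("<IHHIIHH", 16, 1, 1, 16000, 32000, 2, 16)
       + b"data" + struct.pack("<I", 32000))
assert parse_wav_header(hdr) == {"sampleRate": 16000, "channels": 1,
                                 "bitsPerSample": 16, "dataOffset": 44}
```

Note the fixed `dataOffset=44` assumes the canonical header; WAV files with extra chunks (e.g. `LIST`) put the `data` chunk later, which is out of scope for this criterion.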

### Feature 4: Mel spectrogram generation
- **Input**: 1 second of a 440Hz sine wave at 16kHz sample rate (16000 samples)
- **Expected**: Output is a 2D MLXArray with shape `[80, N]`, where N = number of frames
- **FAIL if**: Output shape is wrong, values are all zero, or the function crashes
- **NOTE**: Use `Accelerate.framework` vDSP FFT for efficiency

### Feature 5: Mel spectrogram dimensions
- **Input**: 30 seconds of audio at 16kHz
- **Expected**: Output shape matches Whisper's expected `[80, 3000]` (80 mel bins, 3000 frames for 30s)
- **FAIL if**: Frame count doesn't match Whisper's `hop_length=160` convention
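
The expected frame count follows directly from the hop length: with a centered STFT there is one frame per hop, so 30 s × 16000 Hz / 160 = 3000 frames. A quick arithmetic check (constants are Whisper's published convention):

```python
SAMPLE_RATE = 16000
HOP_LENGTH = 160   # Whisper's hop_length convention
N_MELS = 80

def mel_shape(n_samples: int) -> tuple[int, int]:
    """Expected log-mel shape: one frame per hop with a centered STFT."""
    return (N_MELS, n_samples // HOP_LENGTH)

assert mel_shape(30 * SAMPLE_RATE) == (80, 3000)   # Feature 5's exact case
assert mel_shape(1 * SAMPLE_RATE) == (80, 100)     # Feature 4's 1 s input
```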

### Feature 6: Long audio chunking
- **Input**: 90 seconds of audio
- **Expected**: Audio is split into 3 × 30-second chunks, each producing a `[80, 3000]` mel spectrogram
- **FAIL if**: A single oversized tensor is created, or chunks overlap or drop samples
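
The chunking rule can be sketched as follows. Zero-padding a ragged final chunk is an assumption (it matches Whisper's pad-to-30-s behavior); the spec only fixes the evenly divisible 90 s → 3 × 30 s case:

```python
CHUNK_SAMPLES = 30 * 16000  # 30 s at 16 kHz

def chunk_audio(samples: list[float]) -> list[list[float]]:
    """Split into non-overlapping 30 s chunks; zero-pad the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks

chunks = chunk_audio([0.1] * (90 * 16000))
assert len(chunks) == 3
assert all(len(c) == CHUNK_SAMPLES for c in chunks)
```

Non-overlapping slices with `start` advancing by exactly `CHUNK_SAMPLES` is what guarantees the "no overlap, no dropped samples" clause.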

### Feature 7: Silent audio handling
- **Input**: 1 second of all-zero PCM samples
- **Expected**: Returns valid mel spectrogram (all low-energy values); no crash, no division-by-zero
- **FAIL if**: Function crashes, returns NaN, or throws
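
The usual guard is to clamp mel power to a small floor before taking the log, so silence maps to a finite low-energy value instead of `-inf`/NaN. A sketch of the guard (the floor value is illustrative, not from the spec):

```python
import math

LOG_FLOOR = 1e-10  # illustrative floor; keeps log of silence finite

def log_mel(power: float) -> float:
    """log10 of a mel power value, clamped so all-zero input stays finite."""
    return math.log10(max(power, LOG_FLOOR))

assert abs(log_mel(0.0) + 10.0) < 1e-9   # finite, low-energy, not -inf
assert not math.isnan(log_mel(0.0))
```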

---

## Phase 2 — Speech-to-Text (STT)

### Feature 8: Whisper model type registered
- **Input**: Check `ALMTypeRegistry.shared` for key `"whisper"`
- **Expected**: Registry contains a valid model creator for `"whisper"`
- **FAIL if**: Key not found or creator returns nil

### Feature 9: Whisper encoder output
- **Input**: `[80, 3000]` mel spectrogram tensor
- **Expected**: Encoder returns a hidden-states tensor of shape `[1, 1500, encoder_dim]`
- **FAIL if**: Output shape is wrong or values are all zero

### Feature 10: Whisper decoder output
- **Input**: Encoder hidden states + start-of-transcript token
- **Expected**: Decoder generates a token ID sequence terminated by end-of-transcript
- **FAIL if**: Returns an empty sequence, hangs, or crashes

### Feature 11: Transcription endpoint
- **Input**: POST `/v1/audio/transcriptions` with base64 WAV body
- **Expected**: Response JSON: `{"text": "..."}`
- **FAIL if**: Endpoint returns 404, 500, or malformed JSON

### Feature 12: Transcription accuracy
- **Input**: Known fixture WAV of "the quick brown fox"
- **Expected**: `text` field contains words matching the spoken content (fuzzy match acceptable)
- **FAIL if**: Completely wrong transcription or empty text
- **Fixture**: `fixtures/quick_brown_fox.wav`
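
One way a test harness can implement the fuzzy match: normalize case and punctuation, then require that most expected words appear in the transcription. The threshold and normalization are assumptions, not part of the spec:

```python
def fuzzy_match(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """True if at least `threshold` of the expected words appear in the
    transcription, ignoring case and trailing punctuation."""
    def norm(s: str) -> list[str]:
        return [w.strip(".,!?").lower() for w in s.split()]
    exp, act = norm(expected), set(norm(actual))
    hits = sum(1 for w in exp if w in act)
    return hits / len(exp) >= threshold

assert fuzzy_match("the quick brown fox", "The quick brown fox.")
assert not fuzzy_match("the quick brown fox", "hello world")
```

Word-level matching tolerates casing and punctuation differences while still failing a "completely wrong transcription", which is the contract above.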

---

## Phase 3 — Multimodal Audio Fusion

### Feature 13: Gemma 4 audio_config parsed
- **Input**: Gemma 4 `config.json` with `audio_config.model_type: "gemma4_audio"`
- **Expected**: Configuration struct correctly populates the audio encoder fields (`hidden_size=1024`, `num_hidden_layers=12`, `num_attention_heads=8`)
- **FAIL if**: Audio config is nil or fields are zero/default

### Feature 14: Audio token interleaving
- **Input**: Text tokens `[101, 102]` + audio embeddings `[A1, A2, A3]` + `boa_token_id=255010` + `eoa_token_id=255011`
- **Expected**: Combined sequence: `[101, 102, 255010, A1, A2, A3, 255011]`
- **FAIL if**: Audio tokens are appended instead of interleaved at the correct position

### Feature 15: Audio token boundaries
- **Input**: Audio segment with known `boa_token_id` and `eoa_token_id`
- **Expected**: `boa` token appears immediately before the first audio embedding; `eoa` token appears immediately after the last
- **FAIL if**: Boundary tokens are missing, duplicated, or in the wrong position
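
Features 14 and 15 together pin down the splice. A sketch using the token IDs from the spec (splicing after the text tokens matches the example above; in a real message the insertion point depends on where the audio part appears):

```python
BOA_TOKEN_ID = 255010  # begin-of-audio marker (from Feature 14)
EOA_TOKEN_ID = 255011  # end-of-audio marker

def interleave(text_tokens: list, audio_embeddings: list) -> list:
    """Wrap audio embeddings in boa/eoa boundary tokens and splice them
    into the sequence at the audio part's position."""
    return text_tokens + [BOA_TOKEN_ID] + audio_embeddings + [EOA_TOKEN_ID]

seq = interleave([101, 102], ["A1", "A2", "A3"])
assert seq == [101, 102, 255010, "A1", "A2", "A3", 255011]
# Boundary checks from Feature 15: boa immediately before the first
# embedding, eoa immediately after the last, each exactly once.
assert seq.count(BOA_TOKEN_ID) == 1 and seq.count(EOA_TOKEN_ID) == 1
```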

### Feature 16: Trimodal request (text + vision + audio)
- **Input**: POST with text prompt + base64 image + base64 WAV audio
- **Expected**: All three modalities are parsed, encoded, and fused without a crash; model produces output
- **FAIL if**: Any modality is silently dropped, or the server crashes

---

## Phase 4 — Text-to-Speech (TTS) Output

### Feature 17: TTS endpoint accepts input
- **Input**: POST `/v1/audio/speech` with `{"input": "Hello world", "voice": "default"}`
- **Expected**: Response status 200 with `Content-Type: audio/wav`
- **FAIL if**: Returns 404, 500, or a non-audio content type

### Feature 18: Vocoder output
- **Input**: Sequence of audio output tokens from the language model
- **Expected**: Vocoder produces a PCM waveform with valid sample values (not all zero, not NaN)
- **FAIL if**: Output is silence, contains NaN, or has the wrong sample rate

### Feature 19: Valid WAV output
- **Input**: Generated PCM from the vocoder
- **Expected**: Output has a valid 44-byte WAV header with correct `sampleRate`, `bitsPerSample`, `dataSize`
- **FAIL if**: Header is malformed, file size doesn't match the header, or the file is not playable
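
The header/size consistency check is mechanical: `dataSize` at offset 40 must equal the PCM byte count, and the RIFF size at offset 4 must be 36 + `dataSize`. A sketch of building and verifying the header (the 24 kHz default is an assumed vocoder rate, not from the spec):

```python
import struct

def wav_header(n_bytes: int, sample_rate: int = 24000, bits: int = 16,
               channels: int = 1) -> bytes:
    """Build the canonical 44-byte PCM WAV header for n_bytes of data."""
    block_align = channels * bits // 8
    return (b"RIFF" + struct.pack("<I", 36 + n_bytes) + b"WAVE"
            + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, sample_rate,
                                    sample_rate * block_align, block_align, bits)
            + b"data" + struct.pack("<I", n_bytes))

pcm = b"\x00\x01" * 1000
wav = wav_header(len(pcm)) + pcm
assert len(wav) == 44 + len(pcm)                          # file size matches
assert struct.unpack_from("<I", wav, 40)[0] == len(pcm)   # dataSize matches
```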

### Feature 20: Streaming TTS output
- **Input**: POST `/v1/audio/speech` with `"stream": true`
- **Expected**: Response uses chunked transfer encoding with progressive PCM/WAV chunks
- **FAIL if**: Entire response is buffered before sending, or chunks have invalid boundaries