Add live audio transcription streaming support to Foundry Local JS SDK (#486)

rui-ren · ruiren_microsoft · Copilot · Prathik Rao · commit 5069277d4636 · 2026-04-02T10:12:05.000-07:00

Adds real-time audio streaming support to the Foundry Local JS SDK,
enabling live microphone-to-text transcription via ONNX Runtime GenAI
ASR.

The existing AudioClient only supports file-based transcription. This PR
introduces `LiveAudioTranscriptionSession` that accepts continuous PCM
audio chunks (e.g., from a microphone) and returns partial/final
transcription results as an async iterable.

- `src/openai/liveAudioTranscriptionClient.ts` — Streaming session with
`start()`, `append()`, `getTranscriptionStream()`, `stop()`, `dispose()`
- `src/openai/liveAudioTranscriptionTypes.ts` —
`LiveAudioTranscriptionResponse` and `CoreErrorResponse` interfaces,
`tryParseCoreError()` helper
- `src/detail/coreInterop.ts` — Added `executeCommandWithBinary()`
method and `StreamingRequestBuffer` struct for binary PCM data transport
- `samples/js/live-audio-transcription-example/app.js` — E2E example with microphone capture (naudiodon2) and
synthetic audio fallback
- `test/openai/liveAudioTranscription.test.ts` — Unit tests for
types/settings and E2E test with synthetic PCM audio
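
The `tryParseCoreError()` helper listed above is described further down as accepting both raw JSON and CoreInterop-prefixed error messages. The following is only a rough sketch of one way to do that, not the helper from this PR: the function name `tryParseCoreErrorSketch` and the `{ code, message }` payload shape are assumptions made for illustration, and the real `CoreErrorResponse` fields may differ.

```javascript
// Sketch only. Assumes (hypothetically) that a core error is a JSON object
// with a string `message` field, optionally preceded by a plain-text prefix.
function tryParseCoreErrorSketch(text) {
  if (typeof text !== 'string') return null;

  // Skip any prefix by starting at the first '{'; raw JSON starts at index 0.
  const start = text.indexOf('{');
  if (start === -1) return null;

  try {
    const parsed = JSON.parse(text.slice(start));
    if (parsed && typeof parsed === 'object' && typeof parsed.message === 'string') {
      return parsed;
    }
  } catch {
    // Not valid JSON: fall through and report "no structured error".
  }
  return null;
}
```

Returning `null` instead of throwing lets a caller fall back to the original error text when no structured payload is present.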

- `src/imodel.ts` — Added `createLiveTranscriptionSession()` to
interface
- `src/detail/model.ts` — Delegates to
`selectedVariant.createLiveTranscriptionSession()`
- `src/detail/modelVariant.ts` and `src/openai/audioClient.ts` — Implementation (each creates a new
`LiveAudioTranscriptionSession(modelId, coreInterop)`)
- `src/index.ts` — Exports `LiveAudioTranscriptionSession`,
`LiveAudioTranscriptionOptions`, `LiveAudioTranscriptionResponse`,
`TranscriptionContentPart`

```js
const session = model.createLiveTranscriptionSession();

session.settings.sampleRate = 16000;
session.settings.channels = 1;
session.settings.language = "en";

await session.start();

// Push audio from microphone callback
await session.append(pcmBytes);

// Read results as async iterable
for await (const result of session.getTranscriptionStream()) {
    console.log(result.content[0].text);
}

await session.stop();
```
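
Two behaviors behind this usage are easy to miss: settings are snapshotted and frozen at `start()`, and `append()` copies the caller's buffer. The sketch below models only those two behaviors with a hypothetical stand-in class (`SketchSession`, `activeSettings`, and `queued` are illustrative names, and the default values other than the 16000 Hz sample rate are assumptions); it is not the SDK's implementation.

```javascript
// Hypothetical stand-in for the session; only snapshot/copy behavior is modeled.
class SketchSession {
  constructor() {
    // Mutable until start(). Defaults here are assumed for illustration.
    this.settings = { sampleRate: 16000, channels: 1, bitsPerSample: 16, language: undefined };
    this.activeSettings = null; // frozen snapshot used while the session runs
    this.queued = [];           // copies of appended chunks
  }

  start() {
    // Copy first, then freeze, so later edits to this.settings cannot
    // change the format of a session that is already running.
    this.activeSettings = Object.freeze({ ...this.settings });
  }

  append(pcm) {
    if (!this.activeSettings) throw new Error('call start() first');
    // new Uint8Array(typedArray) allocates fresh storage and copies the bytes,
    // so the caller is free to reuse or overwrite `pcm` afterwards.
    this.queued.push(new Uint8Array(pcm));
  }
}
```

The copy matters for microphone callbacks, where the capture library may recycle the same buffer for the next frame.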

- **Internal async push queue** — Bounded `AsyncQueue<T>` serializes
audio pushes from any context (safe for mic callbacks) and provides
backpressure via FIFO resolver queue. Mirrors C#'s `Channel<T>` pattern.
- **Binary data transport** — `executeCommandWithBinary()` sends PCM
bytes alongside JSON params via `StreamingRequestBuffer`, with
transcription results parsed from push responses.
- **Settings freeze** — Audio format settings are snapshot-copied and
`Object.freeze()`d at `start()`, immutable during the session
- **Buffer copy** — `append()` copies the input `Uint8Array` before
queueing, safe when caller reuses buffers
- **Drain-on-stop** — `stop()` completes the push queue, waits for the
push loop to drain, parses final transcription from stop response, then
completes the output stream
- **Error propagation** — `start()` failures are propagated to
`outputQueue` so `getTranscriptionStream()` consumers see the error;
`tryParseCoreError()` handles both raw JSON and CoreInterop-prefixed
error messages
- **Dispose safety** — `dispose()` wraps `stop()` in try/catch, never
throws
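
To make the push-queue bullet concrete, here is a minimal sketch of a bounded async FIFO queue with backpressure. It is an illustration under assumptions, not the SDK's `AsyncQueue<T>`: the class name `BoundedAsyncQueue`, its method names, and the choice to reject pushes that are still waiting when the queue is completed are all hypothetical.

```javascript
// Sketch of a bounded async queue; names and completion policy are assumptions.
class BoundedAsyncQueue {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];         // accepted values, oldest first
    this.pendingPushes = []; // { item, resolve, reject } for pushes waiting on space
    this.pendingReads = [];  // resolvers for reads waiting on data
    this.completed = false;
  }

  // Resolves once the item is accepted; stays pending while the queue is full.
  push(item) {
    if (this.completed) return Promise.reject(new Error('queue completed'));
    const reader = this.pendingReads.shift();
    if (reader) {
      reader({ value: item, done: false }); // hand straight to a waiting reader
      return Promise.resolve();
    }
    if (this.items.length < this.capacity) {
      this.items.push(item);
      return Promise.resolve();
    }
    return new Promise((resolve, reject) => {
      this.pendingPushes.push({ item, resolve, reject });
    });
  }

  // Resolves with { value, done } in the style of an iterator result.
  shift() {
    if (this.items.length > 0) {
      const value = this.items.shift();
      const next = this.pendingPushes.shift();
      if (next) {
        this.items.push(next.item); // admit the oldest waiting push (FIFO)
        next.resolve();
      }
      return Promise.resolve({ value, done: false });
    }
    if (this.completed) return Promise.resolve({ value: undefined, done: true });
    return new Promise((resolve) => this.pendingReads.push(resolve));
  }

  // No more pushes are accepted; already-accepted items can still be read.
  complete() {
    this.completed = true;
    for (const p of this.pendingPushes.splice(0)) p.reject(new Error('queue completed'));
    for (const r of this.pendingReads.splice(0)) r({ value: undefined, done: true });
  }

  async *[Symbol.asyncIterator]() {
    for (;;) {
      const { value, done } = await this.shift();
      if (done) return;
      yield value;
    }
  }
}
```

A producer that awaits `push()` slows down automatically when the consumer falls behind, which is the backpressure the bullet refers to; a producer that cannot wait (such as a mic callback) can instead drop frames while a push is pending.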

This PR adds the JS SDK surface. The 3 native commands
(`audio_stream_start`, `audio_stream_push`, `audio_stream_stop`) are
routed through `execute_command` and the new
`execute_command_with_binary` exports. The code compiles with zero
TypeScript errors without the native library.
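
The ordering of the three commands can be illustrated without the native library. The sketch below uses a fake core that only records command names; `createFakeCore`, `SketchLifecycle`, and the `execute(command)` signature are hypothetical and stand in for the real interop layer. It shows one way `stop()` can wait for queued chunks to be pushed before `audio_stream_stop` is issued.

```javascript
// Fake core that records the order in which commands arrive (illustration only).
function createFakeCore() {
  const calls = [];
  return {
    calls,
    async execute(command) {
      await new Promise((resolve) => setTimeout(resolve, 1)); // simulated latency
      calls.push(command);
    },
  };
}

class SketchLifecycle {
  constructor(core) {
    this.core = core;
    this.pending = [];   // chunks waiting to be pushed
    this.wake = null;    // resolver that wakes an idle push loop
    this.stopping = false;
    this.loop = null;    // promise for the running push loop
  }

  async start() {
    await this.core.execute('audio_stream_start');
    this.loop = this.runPushLoop();
  }

  append(chunk) {
    if (this.stopping) throw new Error('session is stopping');
    this.pending.push(new Uint8Array(chunk)); // copy so the caller can reuse its buffer
    if (this.wake) { this.wake(); this.wake = null; }
  }

  async runPushLoop() {
    for (;;) {
      while (this.pending.length > 0) {
        this.pending.shift(); // a real session would send these bytes with the command
        await this.core.execute('audio_stream_push');
      }
      if (this.stopping) return; // queue drained and no more input expected
      await new Promise((resolve) => { this.wake = resolve; });
    }
  }

  async stop() {
    this.stopping = true;
    if (this.wake) { this.wake(); this.wake = null; }
    await this.loop; // drain queued chunks first
    await this.core.execute('audio_stream_stop');
  }
}
```

Because `stop()` awaits the push loop before sending the stop command, chunks appended just before shutdown still reach the core.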

- ✅ TypeScript compilation — 0 errors across all source files
- ✅ Unit tests for `parseTranscriptionResult()`, `tryParseCoreError()`,
`LiveAudioTranscriptionOptions`
- ✅ E2E test with synthetic PCM audio (skips gracefully if native core
unavailable)

This implementation mirrors the C# `LiveAudioTranscriptionSession` with
identical logic:
- Same session lifecycle: `start` → `append` → `getTranscriptionStream` → `stop`
- Same push loop with error handling and binary data transport
- Same settings freeze and buffer copy semantics
- Same drain-before-stop ordering with final result parsing
- Same E2E test pattern (synthetic 440Hz sine wave, 100ms chunks,
ConversationItem-shaped response validation)
- Same renamed types: `LiveAudioTranscription*` (matching C# rename)
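
For reference, the synthetic test input mentioned above (a 440 Hz sine wave split into 100 ms chunks) can be produced with a couple of small helpers. This is a sketch rather than the test code from this PR; the helper names `generateSinePcm16` and `chunkPcm` and the 0.5 amplitude default are illustrative assumptions.

```javascript
// Generates 16-bit signed little-endian mono PCM for a sine tone.
function generateSinePcm16(sampleRate, seconds, frequencyHz, amplitude = 0.5) {
  const totalSamples = Math.round(sampleRate * seconds);
  const bytes = new Uint8Array(totalSamples * 2); // 2 bytes per 16-bit sample
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < totalSamples; i++) {
    const phase = (2 * Math.PI * frequencyHz * i) / sampleRate;
    const sample = Math.round(32767 * amplitude * Math.sin(phase));
    view.setInt16(i * 2, sample, true); // true selects little-endian byte order
  }
  return bytes;
}

// Splits PCM bytes into views that each cover `chunkMs` milliseconds of audio.
function chunkPcm(bytes, sampleRate, chunkMs, bytesPerSample = 2) {
  const chunkBytes = Math.round((sampleRate * chunkMs) / 1000) * bytesPerSample;
  const chunks = [];
  for (let offset = 0; offset < bytes.length; offset += chunkBytes) {
    // subarray() creates a view without copying the underlying bytes.
    chunks.push(bytes.subarray(offset, Math.min(offset + chunkBytes, bytes.length)));
  }
  return chunks;
}
```

At 16 kHz, two seconds of audio is 64000 bytes, and each 100 ms chunk is 3200 bytes, so the helpers yield 20 chunks that can be passed to `append()` one at a time.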

---

Names corrected relative to the original PR description:

| Old (incorrect) | New (matches code) |
|---|---|
| `LiveAudioTranscriptionClient` | `LiveAudioTranscriptionSession` |
| `LiveAudioTranscriptionSettings` | `LiveAudioTranscriptionOptions` |
| `LiveAudioTranscriptionResult` | `LiveAudioTranscriptionResponse` |
| `createLiveTranscriptionClient()` | `createLiveTranscriptionSession()` |

---------

Co-authored-by: ruiren_microsoft <ruiren@microsoft.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Kunal Vaishnavi <kvaishnavi@microsoft.com>
diff --git a/samples/js/live-audio-transcription-example/README.md b/samples/js/live-audio-transcription-example/README.md
@@ -0,0 +1,58 @@
+# Live Audio Transcription Example
+
+Real-time microphone-to-text transcription using the Foundry Local JS SDK with Nemotron ASR.
+
+## Prerequisites
+
+- [Foundry Local](https://github.com/microsoft/Foundry-Local) installed
+- Node.js 18+
+- A microphone (optional — falls back to synthetic audio)
+
+## Setup
+
+```bash
+npm install foundry-local-sdk naudiodon2
+```
+
+> **Note:** `naudiodon2` is optional; it provides cross-platform microphone capture. Without it, the example falls back to synthetic audio for testing.
+
+## Run
+
+```bash
+node app.js
+```
+
+Speak into your microphone. Transcription appears in real time. Press `Ctrl+C` to stop.
+
+## How it works
+
+1. Initializes the Foundry Local SDK and loads the Nemotron ASR model
+2. Creates a `LiveAudioTranscriptionSession` with 16kHz/16-bit/mono PCM settings
+3. Captures microphone audio via `naudiodon2` (or generates synthetic audio as fallback)
+4. Pushes PCM chunks to the SDK via `session.append()`
+5. Reads transcription results via `for await (const result of session.getTranscriptionStream())`
+6. Accesses the text via `result.content[0].text` (OpenAI Realtime ConversationItem pattern)
+
+## API
+
+```javascript
+const audioClient = model.createAudioClient();
+const session = audioClient.createLiveTranscriptionSession();
+session.settings.sampleRate = 16000;
+session.settings.channels = 1;
+session.settings.language = 'en';
+
+await session.start();
+
+// Push audio
+await session.append(pcmBytes);
+
+// Read results
+for await (const result of session.getTranscriptionStream()) {
+    console.log(result.content[0].text);       // transcribed text
+    console.log(result.content[0].transcript); // alias (OpenAI compat)
+    console.log(result.is_final);              // true for final results
+}
+
+await session.stop();
+```
diff --git a/samples/js/live-audio-transcription-example/app.js b/samples/js/live-audio-transcription-example/app.js
@@ -0,0 +1,157 @@
+// Live Audio Transcription Example — Foundry Local JS SDK
+//
+// Demonstrates real-time microphone-to-text using the JS SDK.
+// Requires: npm install foundry-local-sdk naudiodon2
+//
+// Usage: node app.js
+
+import { FoundryLocalManager } from 'foundry-local-sdk';
+
+console.log('╔══════════════════════════════════════════════════════════╗');
+console.log('║   Foundry Local — Live Audio Transcription (JS SDK)     ║');
+console.log('╚══════════════════════════════════════════════════════════╝');
+console.log();
+
+// Initialize the Foundry Local SDK
+console.log('Initializing Foundry Local SDK...');
+const manager = FoundryLocalManager.create({
+    appName: 'foundry_local_live_audio',
+    logLevel: 'info'
+});
+console.log('✓ SDK initialized');
+
+// Get and load the nemotron model
+const modelAlias = 'nemotron';
+let model = await manager.catalog.getModel(modelAlias);
+if (!model) {
+    console.error(`ERROR: Model "${modelAlias}" not found in catalog.`);
+    process.exit(1);
+}
+
+console.log(`Found model: ${model.id}`);
+console.log('Downloading model (if needed)...');
+await model.download((progress) => {
+    process.stdout.write(`\rDownloading... ${progress.toFixed(2)}%`);
+});
+console.log('\n✓ Model downloaded');
+
+console.log('Loading model...');
+await model.load();
+console.log('✓ Model loaded');
+
+// Create live transcription session
+const audioClient = model.createAudioClient();
+const session = audioClient.createLiveTranscriptionSession();
+session.settings.sampleRate = 16000;  // Default is 16000; shown here for clarity
+session.settings.channels = 1;
+session.settings.bitsPerSample = 16;
+session.settings.language = 'en';
+
+console.log('Starting streaming session...');
+await session.start();
+console.log('✓ Session started');
+
+// Read transcription results in background
+const readPromise = (async () => {
+    try {
+        for await (const result of session.getTranscriptionStream()) {
+            const text = result.content?.[0]?.text;
+            if (result.is_final) {
+                console.log();
+                console.log(`  [FINAL] ${text}`);
+            } else if (text) {
+                process.stdout.write(text);
+            }
+        }
+    } catch (err) {
+        if (err.name !== 'AbortError') {
+            console.error('Stream error:', err.message);
+        }
+    }
+})();
+
+// --- Microphone capture ---
+// This example uses naudiodon2 for cross-platform audio capture.
+// Install with: npm install naudiodon2
+//
+// If you prefer a different audio library, just push PCM bytes
+// (16-bit signed LE, mono, 16kHz) via session.append().
+
+let audioInput;
+try {
+    const { default: portAudio } = await import('naudiodon2');
+
+    audioInput = portAudio.AudioIO({
+        inOptions: {
+            channelCount: session.settings.channels,
+            sampleFormat: session.settings.bitsPerSample === 16
+                ? portAudio.SampleFormat16Bit
+                : portAudio.SampleFormat32Bit,
+            sampleRate: session.settings.sampleRate,
+            framesPerBuffer: 1600,  // 100ms chunks
+            maxQueue: 15            // buffer during event-loop blocks from sync FFI calls
+        }
+    });
+
+    let appendPending = false;
+    audioInput.on('data', (buffer) => {
+        if (appendPending) return; // drop frame while backpressured
+        const pcm = new Uint8Array(buffer);
+        appendPending = true;
+        session.append(pcm).then(() => {
+            appendPending = false;
+        }).catch((err) => {
+            appendPending = false;
+            console.error('append error:', err.message);
+        });
+    });
+
+    console.log();
+    console.log('════════════════════════════════════════════════════════════');
+    console.log('  LIVE TRANSCRIPTION ACTIVE');
+    console.log('  Speak into your microphone.');
+    console.log('  Press Ctrl+C to stop.');
+    console.log('════════════════════════════════════════════════════════════');
+    console.log();
+
+    audioInput.start();
+} catch (err) {
+    console.warn('⚠ Could not initialize microphone (naudiodon2 may not be installed).');
+    console.warn('  Install with: npm install naudiodon2');
+    console.warn('  Falling back to synthetic audio test...');
+    console.warn();
+
+    // Fallback: push 2 seconds of synthetic PCM (440Hz sine wave)
+    const sampleRate = session.settings.sampleRate;
+    const duration = 2;
+    const totalSamples = sampleRate * duration;
+    const pcmBytes = new Uint8Array(totalSamples * 2);
+    for (let i = 0; i < totalSamples; i++) {
+        const t = i / sampleRate;
+        const sample = Math.round(32767 * 0.5 * Math.sin(2 * Math.PI * 440 * t));
+        pcmBytes[i * 2] = sample & 0xFF;
+        pcmBytes[i * 2 + 1] = (sample >> 8) & 0xFF;
+    }
+
+    // Push in 100ms chunks
+    const chunkSize = (sampleRate / 10) * 2;
+    for (let offset = 0; offset < pcmBytes.length; offset += chunkSize) {
+        const len = Math.min(chunkSize, pcmBytes.length - offset);
+        await session.append(pcmBytes.slice(offset, offset + len));
+    }
+
+    console.log('✓ Synthetic audio pushed');
+}
+
+// Handle graceful shutdown
+process.on('SIGINT', async () => {
+    console.log('\n\nStopping...');
+    if (audioInput) {
+        audioInput.quit();
+    }
+    await session.stop();
+    await readPromise;
+    await model.unload();
+    console.log('✓ Done');
+    process.exit(0);
+});
diff --git a/sdk/js/src/detail/coreInterop.ts b/sdk/js/src/detail/coreInterop.ts
@@ -19,6 +19,16 @@ koffi.struct('ResponseBuffer', {
     ErrorLength: 'int32_t',
 });
 
+// Extended request struct for binary data (audio streaming)
+koffi.struct('StreamingRequestBuffer', {
+    Command: 'char*',
+    CommandLength: 'int32_t',
+    Data: 'char*',              // JSON params
+    DataLength: 'int32_t',
+    BinaryData: 'void*',        // raw PCM audio bytes
+    BinaryDataLength: 'int32_t',
+});
+
 const CallbackType = koffi.proto('void CallbackType(void *data, int32_t length, void *userData)');
 
 const __filename = fileURLToPath(import.meta.url);
@@ -28,6 +38,7 @@ export class CoreInterop {
     private lib: any;
     private execute_command: any;
     private execute_command_with_callback: any;
+    private execute_command_with_binary: any = null;
 
     private static _getLibraryExtension(): string {
         const platform = process.platform;
@@ -93,6 +104,7 @@ export class CoreInterop {
 
         this.execute_command = this.lib.func('void execute_command(RequestBuffer *request, _Inout_ ResponseBuffer *response)');
         this.execute_command_with_callback = this.lib.func('void execute_command_with_callback(RequestBuffer *request, _Inout_ ResponseBuffer *response, CallbackType *callback, void *userData)');
+        this.execute_command_with_binary = this.lib.func('void execute_command_with_binary(StreamingRequestBuffer *request, _Inout_ ResponseBuffer *response)');
     }
 
     public executeCommand(command: string, params?: any): string {
diff --git a/sdk/js/src/detail/model.ts b/sdk/js/src/detail/model.ts
@@ -177,6 +177,14 @@ export class Model implements IModel {
         return this.selectedVariant.createAudioClient();
     }
 
+    /**
+     * Creates a LiveAudioTranscriptionSession for real-time audio streaming ASR.
+     * @returns A LiveAudioTranscriptionSession instance.
+     */
+    public createLiveTranscriptionSession(): LiveAudioTranscriptionSession {
+        return this.selectedVariant.createLiveTranscriptionSession();
+    }
+
     /**
      * Creates a ResponsesClient for interacting with the model via the Responses API.
      * @param baseUrl - The base URL of the Foundry Local web service.
diff --git a/sdk/js/src/detail/modelVariant.ts b/sdk/js/src/detail/modelVariant.ts
@@ -170,6 +170,14 @@ export class ModelVariant implements IModel {
         return new AudioClient(this._modelInfo.id, this.coreInterop);
     }
 
+    /**
+     * Creates a LiveAudioTranscriptionSession for real-time audio streaming ASR.
+     * @returns A LiveAudioTranscriptionSession instance.
+     */
+    public createLiveTranscriptionSession(): LiveAudioTranscriptionSession {
+        return new LiveAudioTranscriptionSession(this._modelInfo.id, this.coreInterop);
+    }
+
     /**
      * Creates a ResponsesClient for interacting with the model via the Responses API.
      * @param baseUrl - The base URL of the Foundry Local web service.
diff --git a/sdk/js/src/imodel.ts b/sdk/js/src/imodel.ts
@@ -1,5 +1,6 @@
 import { ChatClient } from './openai/chatClient.js';
 import { AudioClient } from './openai/audioClient.js';
+import { LiveAudioTranscriptionSession } from './openai/liveAudioTranscriptionClient.js';
 import { ResponsesClient } from './openai/responsesClient.js';
 import { ModelInfo } from './types.js';
 
@@ -24,6 +25,13 @@ export interface IModel {
 
     createChatClient(): ChatClient;
     createAudioClient(): AudioClient;
+
+    /**
+     * Creates a LiveAudioTranscriptionSession for real-time audio streaming ASR.
+     * The model must be loaded before calling this method.
+     * @returns A LiveAudioTranscriptionSession instance.
+     */
+    createLiveTranscriptionSession(): LiveAudioTranscriptionSession;
     /**
      * Creates a ResponsesClient for interacting with the model via the Responses API.
      * Unlike createChatClient/createAudioClient (which use FFI), the Responses API
diff --git a/sdk/js/src/index.ts b/sdk/js/src/index.ts
@@ -8,6 +8,8 @@ export { ModelVariant } from './detail/modelVariant.js';
 export type { IModel } from './imodel.js';
 export { ChatClient, ChatClientSettings } from './openai/chatClient.js';
 export { AudioClient, AudioClientSettings } from './openai/audioClient.js';
+export { LiveAudioTranscriptionSession, LiveAudioTranscriptionOptions } from './openai/liveAudioTranscriptionClient.js';
+export type { LiveAudioTranscriptionResponse, TranscriptionContentPart } from './openai/liveAudioTranscriptionTypes.js';
 export { ResponsesClient, ResponsesClientSettings, getOutputText } from './openai/responsesClient.js';
 export { ModelLoadManager } from './detail/modelLoadManager.js';
 /** @internal */
diff --git a/sdk/js/src/openai/audioClient.ts b/sdk/js/src/openai/audioClient.ts
@@ -1,4 +1,5 @@
 import { CoreInterop } from '../detail/coreInterop.js';
+import { LiveAudioTranscriptionSession } from './liveAudioTranscriptionClient.js';
 
 export class AudioClientSettings {
     language?: string;
@@ -56,6 +57,14 @@ export class AudioClient {
         this.coreInterop = coreInterop;
     }
 
+    /**
+     * Creates a LiveAudioTranscriptionSession for real-time audio streaming ASR.
+     * @returns A LiveAudioTranscriptionSession instance.
+     */
+    public createLiveTranscriptionSession(): LiveAudioTranscriptionSession {
+        return new LiveAudioTranscriptionSession(this.modelId, this.coreInterop);
+    }
+
     /**
      * Validates that the audio file path is a non-empty string.
      * @internal
diff --git a/sdk/js/src/openai/liveAudioTranscriptionClient.ts b/sdk/js/src/openai/liveAudioTranscriptionClient.ts
diff --git a/sdk/js/src/openai/liveAudioTranscriptionTypes.ts b/sdk/js/src/openai/liveAudioTranscriptionTypes.ts
diff --git a/sdk/js/test/openai/liveAudioTranscription.test.ts b/sdk/js/test/openai/liveAudioTranscription.test.ts