
Commit c94bb03

rui-ren (ruiren_microsoft) and Copilot authored
Add Nemotron-ASR streaming inference to Python SDK (#612)
## Add Nemotron-ASR streaming inference to Python SDK

### Description

Adds real-time audio streaming support to the Foundry Local Python SDK, enabling live microphone-to-text transcription via ONNX Runtime GenAI's StreamingProcessor API (Nemotron ASR). This is the Python port of C# PR #485 with full feature parity.

The existing `AudioClient` supports only file-based transcription. This PR introduces `LiveAudioTranscriptionSession`, which accepts continuous PCM audio chunks (e.g. from a microphone) and returns partial/final transcription results as a synchronous generator.

### What's included

**New files**

- `src/openai/live_audio_transcription_client.py` — Streaming session with `start()`, `append()`, `get_transcription_stream()`, `stop()`
- `src/openai/live_audio_transcription_types.py` — `LiveAudioTranscriptionResponse` (ConversationItem-shaped), `LiveAudioTranscriptionOptions`, `CoreErrorResponse`, `TranscriptionContentPart`
- `test/openai/test_live_audio_transcription.py` — 22 unit tests for deserialization, settings, state guards, and the streaming pipeline
- `test/openai/test_live_audio_transcription_e2e.py` — E2E test with real native DLLs and the nemotron model
- `test/openai/conftest.py` — DLL preload for E2E tests
- `samples/python/live-audio-transcription/src/app.py` — Live microphone transcription demo

**Modified files**

- `src/openai/audio_client.py` — Added the `create_live_transcription_session()` factory method
- `src/detail/core_interop.py` — Added the `StreamingRequestBuffer` struct, `execute_command_with_binary()`, the `start_audio_stream`, `push_audio_data`, and `stop_audio_stream` methods, and `_load_dll_win()` for robust DLL loading on Windows
- `src/openai/__init__.py` — Exported the new live transcription types
- `test/conftest.py` — Pre-load ORT/GenAI DLLs before the brotli import to avoid Windows DLL search conflicts

### API surface

```python
audio_client = model.get_audio_client()

session = audio_client.create_live_transcription_session()
session.settings.sample_rate = 16000
session.settings.channels = 1
session.settings.language = "en"

session.start()

# Push audio from a microphone callback (thread-safe)
session.append(pcm_bytes)

# Read results as a synchronous generator
for result in session.get_transcription_stream():
    print(result.content[0].text)

session.stop()
```

### C# parity

| C# API | Python API | Notes |
|---|---|---|
| `CreateLiveTranscriptionSession()` | `create_live_transcription_session()` | ✅ |
| `StartAsync(ct)` | `start()` | Sync (matches Python SDK convention) |
| `AppendAsync(ReadOnlyMemory<byte>, ct)` | `append(bytes)` | Thread-safe, copies data |
| `GetTranscriptionStream()` | `get_transcription_stream()` | Generator (sync equivalent of `IAsyncEnumerable`) |
| `StopAsync(ct)` | `stop()` | Drains the push queue, sends the native stop, surfaces the final result |
| `IAsyncDisposable` | Context manager (`with`) | Idiomatic Python equivalent |
| `LiveAudioTranscriptionOptions` | `LiveAudioTranscriptionOptions` | Same fields: `sample_rate`, `channels`, `bits_per_sample`, `language`, `push_queue_capacity` |
| `LiveAudioTranscriptionResponse` | `LiveAudioTranscriptionResponse` | ConversationItem-shaped: `content[0].text/transcript`, `is_final`, `start_time`, `end_time` |

### Design highlights

- **Output type alignment** — `LiveAudioTranscriptionResponse` uses the OpenAI Realtime `ConversationItem` shape (`content[0].text/transcript`) for forward compatibility
- **Internal push queue** — A bounded `queue.Queue` serializes audio pushes from any thread (safe for mic callbacks) with backpressure
- **Fail-fast on errors** — The push loop terminates immediately on any native error (no retry logic)
- **Settings freeze** — Audio format settings are snapshot-copied at `start()` and immutable during the session
- **Buffer copy** — `append()` copies input data to avoid issues with callers reusing buffers (e.g. PyAudio)
- **Routes through existing exports** — `start_audio_stream` and `stop_audio_stream` route through `execute_command`; `push_audio_data` routes through `execute_command_with_binary` — no new native entry points required
- **DLL loading fix** — Uses `LoadLibraryExW` with `LOAD_WITH_ALTERED_SEARCH_PATH` on Windows to prevent conflicts with stale system-level ORT DLLs

### Verified working

- ✅ 22 unit tests passing (deserialization, settings, state guards, streaming pipeline with mocked core)
- ✅ E2E test passing (SDK → Core.dll → onnxruntime-genai.dll → onnxruntime.dll with the nemotron model)
- ✅ Full session lifecycle: start → push synthetic PCM → stop → verify results
- ✅ Existing tests unaffected

---------

Co-authored-by: ruiren_microsoft <ruiren@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
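The "Internal push queue" design above — a bounded queue that accepts audio from any thread and applies backpressure when the consumer falls behind — can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual implementation; the class name `PushQueueSketch` and its methods are hypothetical:

```python
import queue

class PushQueueSketch:
    """Minimal sketch of a bounded, thread-safe audio push queue.

    append() may be called from any thread (e.g. a microphone callback).
    queue.Queue.put() blocks once the queue is full, which is where the
    backpressure comes from; a None sentinel signals end of audio.
    """

    def __init__(self, capacity: int = 16) -> None:
        self._queue: "queue.Queue" = queue.Queue(maxsize=capacity)

    def append(self, pcm_chunk: bytes) -> None:
        # Copy the chunk so callers may safely reuse their buffer
        # (the same reason the SDK's append() copies, e.g. for PyAudio)
        self._queue.put(bytes(pcm_chunk))

    def close(self) -> None:
        # Sentinel: no more audio will be pushed
        self._queue.put(None)

    def drain(self):
        # Consumer side: yield chunks in push order until the sentinel
        while (chunk := self._queue.get()) is not None:
            yield chunk
```

Because `queue.Queue` does its own locking, no extra synchronization is needed even when `append()` runs on an audio-driver thread while the consumer drains on another.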
1 parent ec94cba commit c94bb03

6 files changed

Lines changed: 973 additions & 1 deletion

File tree

sdk/python/src/detail/core_interop.py

Lines changed: 87 additions & 0 deletions
```diff
@@ -46,6 +46,23 @@ class RequestBuffer(ctypes.Structure):
     ]
 
 
+class StreamingRequestBuffer(ctypes.Structure):
+    """ctypes Structure matching the native ``StreamingRequestBuffer`` C struct.
+
+    Extends ``RequestBuffer`` with binary data fields for sending raw payloads
+    (e.g. PCM audio bytes) alongside JSON parameters.
+    """
+
+    _fields_ = [
+        ("Command", ctypes.c_void_p),
+        ("CommandLength", ctypes.c_int),
+        ("Data", ctypes.c_void_p),
+        ("DataLength", ctypes.c_int),
+        ("BinaryData", ctypes.c_void_p),
+        ("BinaryDataLength", ctypes.c_int),
+    ]
+
+
 class ResponseBuffer(ctypes.Structure):
     """ctypes Structure matching the native ``ResponseBuffer`` C struct."""
 
@@ -173,6 +190,16 @@ def _initialize_native_libraries() -> 'NativeBinaryPaths':
                                                      ctypes.c_void_p]  # user_data
         lib.execute_command_with_callback.restype = None
 
+        # execute_command_with_binary is required for live audio streaming.
+        # Guard with try/except until Core packages with this symbol are released.
+        try:
+            lib.execute_command_with_binary.argtypes = [ctypes.POINTER(StreamingRequestBuffer),
+                                                        ctypes.POINTER(ResponseBuffer)]
+            lib.execute_command_with_binary.restype = None
+        except AttributeError:
+            logger.debug("execute_command_with_binary not exported by Core — "
+                         "live audio streaming will not be available until Core is updated")
+
         return paths
 
     @staticmethod
@@ -295,6 +322,66 @@ def execute_command_with_callback(self, command_name: str, command_input: Option
         response = self._execute_command(command_name, command_input, callback)
         return response
 
+    def execute_command_with_binary(self, command_name: str,
+                                    command_input: Optional[InteropRequest],
+                                    binary_data: bytes) -> Response:
+        """Execute a command with both JSON parameters and a raw binary payload.
+
+        Used for operations like pushing PCM audio data alongside JSON metadata.
+
+        Args:
+            command_name: The native command name (e.g. ``"audio_stream_push"``).
+            command_input: Optional request parameters (serialized as JSON).
+            binary_data: Raw binary payload (e.g. PCM audio bytes).
+
+        Returns:
+            A ``Response`` with ``data`` on success or ``error`` on failure.
+        """
+        logger.debug("Executing command with binary: %s Input: %s BinaryLen: %d",
+                     command_name, command_input.params if command_input else None, len(binary_data))
+
+        cmd_ptr, cmd_len, cmd_buf = CoreInterop._to_c_buffer(command_name)
+        data_ptr, data_len, data_buf = CoreInterop._to_c_buffer(
+            command_input.to_json() if command_input else None
+        )
+
+        # Keep binary data alive for the duration of the native call
+        binary_buf = ctypes.create_string_buffer(binary_data)
+        binary_ptr = ctypes.cast(binary_buf, ctypes.c_void_p)
+
+        req = StreamingRequestBuffer(
+            Command=cmd_ptr, CommandLength=cmd_len,
+            Data=data_ptr, DataLength=data_len,
+            BinaryData=binary_ptr, BinaryDataLength=len(binary_data),
+        )
+        resp = ResponseBuffer()
+        lib = CoreInterop._flcore_library
+
+        lib.execute_command_with_binary(ctypes.byref(req), ctypes.byref(resp))
+
+        req = None  # Free Python reference to the request
+
+        response_str = ctypes.string_at(resp.Data, resp.DataLength).decode("utf-8") if resp.Data else None
+        error_str = ctypes.string_at(resp.Error, resp.ErrorLength).decode("utf-8") if resp.Error else None
+
+        lib.free_response(resp)
+
+        return Response(data=response_str, error=error_str)
+
+    # --- Audio streaming session support ---
+
+    def start_audio_stream(self, command_input: InteropRequest) -> Response:
+        """Start a real-time audio streaming session via ``audio_stream_start``."""
+        return self.execute_command("audio_stream_start", command_input)
+
+    def push_audio_data(self, command_input: InteropRequest, audio_data: bytes) -> Response:
+        """Push a chunk of raw PCM audio data via ``audio_stream_push``."""
+        return self.execute_command_with_binary("audio_stream_push", command_input, audio_data)
+
+    def stop_audio_stream(self, command_input: InteropRequest) -> Response:
+        """Stop a real-time audio streaming session via ``audio_stream_stop``."""
+        return self.execute_command("audio_stream_stop", command_input)
 
 
 def get_cached_model_ids(core_interop: CoreInterop) -> list[str]:
     """Get the list of models that have been downloaded and are cached."""
```

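The central ctypes pattern in `execute_command_with_binary` — packing raw pointers into a struct while keeping the backing `create_string_buffer` objects alive in Python scope — can be exercised standalone. The sketch below mirrors the `StreamingRequestBuffer` struct from the diff above; `pack_request` is a hypothetical helper, and reading the fields back with `ctypes.string_at` stands in for what native code would do:

```python
import ctypes

class StreamingRequestBuffer(ctypes.Structure):
    """Mirror of the ctypes struct added in core_interop.py."""
    _fields_ = [
        ("Command", ctypes.c_void_p),
        ("CommandLength", ctypes.c_int),
        ("Data", ctypes.c_void_p),
        ("DataLength", ctypes.c_int),
        ("BinaryData", ctypes.c_void_p),
        ("BinaryDataLength", ctypes.c_int),
    ]

def pack_request(command: str, json_params: str, binary_data: bytes):
    """Fill a StreamingRequestBuffer and return the buffers as keep-alives.

    The c_void_p fields store raw addresses only: if the Python buffer
    objects were garbage-collected, the pointers would dangle. That is
    why the SDK keeps binary_buf in scope across the native call.
    """
    cmd = command.encode("utf-8")
    params = json_params.encode("utf-8")
    cmd_buf = ctypes.create_string_buffer(cmd)
    data_buf = ctypes.create_string_buffer(params)
    bin_buf = ctypes.create_string_buffer(binary_data, len(binary_data))
    req = StreamingRequestBuffer(
        Command=ctypes.cast(cmd_buf, ctypes.c_void_p), CommandLength=len(cmd),
        Data=ctypes.cast(data_buf, ctypes.c_void_p), DataLength=len(params),
        BinaryData=ctypes.cast(bin_buf, ctypes.c_void_p), BinaryDataLength=len(binary_data),
    )
    return req, (cmd_buf, data_buf, bin_buf)
```

Returning the buffers alongside the struct makes the caller responsible for their lifetime, which is exactly the contract the native `execute_command_with_binary` call depends on.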
sdk/python/src/openai/__init__.py

Lines changed: 18 additions & 1 deletion
```diff
@@ -7,5 +7,22 @@
 from .chat_client import ChatClient, ChatClientSettings
 from .audio_client import AudioClient
 from .embedding_client import EmbeddingClient
+from .live_audio_transcription_client import LiveAudioTranscriptionSession
+from .live_audio_transcription_types import (
+    CoreErrorResponse,
+    LiveAudioTranscriptionOptions,
+    LiveAudioTranscriptionResponse,
+    TranscriptionContentPart,
+)
 
-__all__ = ["AudioClient", "ChatClient", "ChatClientSettings", "EmbeddingClient"]
+__all__ = [
+    "AudioClient",
+    "ChatClient",
+    "ChatClientSettings",
+    "CoreErrorResponse",
+    "EmbeddingClient",
+    "LiveAudioTranscriptionOptions",
+    "LiveAudioTranscriptionResponse",
+    "LiveAudioTranscriptionSession",
+    "TranscriptionContentPart",
+]
```

sdk/python/src/openai/audio_client.py

Lines changed: 20 additions & 0 deletions
```diff
@@ -14,6 +14,7 @@
 
 from ..detail.core_interop import CoreInterop, InteropRequest
 from ..exception import FoundryLocalException
+from .live_audio_transcription_client import LiveAudioTranscriptionSession
 
 logger = logging.getLogger(__name__)
 
@@ -61,6 +62,25 @@ def __init__(self, model_id: str, core_interop: CoreInterop):
         self.settings = AudioSettings()
         self._core_interop = core_interop
 
+    def create_live_transcription_session(self) -> LiveAudioTranscriptionSession:
+        """Create a real-time streaming transcription session.
+
+        Audio data is pushed in as PCM chunks and transcription results are
+        returned as a synchronous generator.
+
+        Returns:
+            A streaming session that should be stopped when done.
+            Supports use as a context manager::
+
+                with audio_client.create_live_transcription_session() as session:
+                    session.settings.sample_rate = 16000
+                    session.start()
+                    session.append(pcm_bytes)
+                    for result in session.get_transcription_stream():
+                        print(result.content[0].text)
+        """
+        return LiveAudioTranscriptionSession(self.model_id, self._core_interop)
+
     @staticmethod
     def _validate_audio_file_path(audio_file_path: str) -> None:
         """Validate that the audio file path is a non-empty string."""
```

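The context-manager lifecycle promised by that docstring (the Python stand-in for C#'s `IAsyncDisposable`) can be illustrated with a stub. `StubTranscriptionSession` below is hypothetical and only records events; the real `LiveAudioTranscriptionSession` talks to the native core. The point is that `__exit__` guarantees `stop()` runs even if the body raises:

```python
class StubTranscriptionSession:
    """Hypothetical stand-in showing the lifecycle start -> append -> stop."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.stop()       # always stop on exit, mirroring the real session
        return False      # never swallow exceptions from the with-body

    def start(self) -> None:
        self.events.append("start")

    def append(self, pcm: bytes) -> None:
        self.events.append(f"append:{len(pcm)}")

    def stop(self) -> None:
        if "stop" not in self.events:   # idempotent: stopping twice is a no-op
            self.events.append("stop")

session = StubTranscriptionSession()
with session:
    session.start()
    # 320 bytes = 10 ms of 16 kHz, 16-bit mono PCM
    session.append(b"\x00" * 320)
```

Making `stop()` idempotent means a caller who stops explicitly inside the `with` block does not trigger a second native stop when the block exits.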