Commit 429dd1c

Python: support (Azure) OpenAI realtime audio models (#13291)
### Motivation and Context

As of 2/27/2026, this PR has been revived, and the latest code works for both Azure OpenAI realtime models and OpenAI realtime models.

SK has support for realtime-preview models; however, since they went GA we have not added support for the latest library abstractions. This PR brings in the changes needed to run models like `gpt-realtime-1.5`, `gpt-realtime`, `gpt-realtime-mini` or `gpt-audio`.

### Description

- Closes #13267
- Code now relies on `openai` >= 2.0
- Considered a breaking change due to some new config added to the execution settings.

### Contribution Checklist

- [X] The code builds clean without any errors or warnings
- [X] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations
- [X] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄
1 parent 74d4310 commit 429dd1c

14 files changed

Lines changed: 1062 additions & 769 deletions

python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ dependencies = [
     "numpy >= 1.25.0; python_version < '3.12'",
     "numpy >= 1.26.0; python_version >= '3.12'",
     # openai connector
-    "openai >= 1.98.0,<2",
+    "openai >= 2.0.0",
     # openapi and swagger
     "openapi_core >= 0.18,<0.20",
     "websockets >= 13, < 16",

python/samples/concepts/realtime/README.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ These samples are more complex then most because of the nature of these API's. T
 To run these samples, you will need to have the following setup:

 - Environment variables for OpenAI (websocket or WebRTC), with your key and OPENAI_REALTIME_MODEL_ID set.
-- Environment variables for Azure (websocket only), set with your endpoint, optionally a key and AZURE_OPENAI_REALTIME_DEPLOYMENT_NAME set. The API version needs to be at least `2024-10-01-preview`.
+- Environment variables for Azure (websocket only), set with your endpoint, optionally a key and AZURE_OPENAI_REALTIME_DEPLOYMENT_NAME set. The API version needs to be at least `2025-08-28`.
 - To run the sample with a simple version of a class that handles the incoming and outgoing sound you need to install the following packages in your environment:
   - semantic-kernel[realtime]
   - pyaudio
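
The "at least `2025-08-28`" requirement can be checked before a sample is started. The following is a minimal sketch, assuming API versions are date strings with an optional suffix such as `-preview`. The helper name `meets_minimum_api_version` and the `AZURE_OPENAI_API_VERSION` environment variable are assumptions made for illustration; neither is part of this commit.

```python
import os
from datetime import date

# Minimum version taken from the README change above.
MINIMUM_API_VERSION = "2025-08-28"


def _version_date(api_version: str) -> date:
    """Parse the leading YYYY-MM-DD part of an API version string.

    Raises ValueError for strings that do not start with a date.
    """
    return date.fromisoformat(api_version[:10])


def meets_minimum_api_version(api_version: str, minimum: str = MINIMUM_API_VERSION) -> bool:
    """Return True when api_version is dated on or after the minimum (hypothetical helper)."""
    return _version_date(api_version) >= _version_date(minimum)


if __name__ == "__main__":
    # Assumed variable name; adjust to however your environment is configured.
    configured = os.environ.get("AZURE_OPENAI_API_VERSION", MINIMUM_API_VERSION)
    if not meets_minimum_api_version(configured):
        raise SystemExit(f"API version {configured} is older than {MINIMUM_API_VERSION}")
    print(f"API version {configured} is recent enough")
```

Comparing parsed dates rather than raw strings keeps a suffix like `-preview` from affecting the ordering.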

python/samples/concepts/realtime/realtime_agent_with_function_calling_webrtc.py

Lines changed: 6 additions & 3 deletions
@@ -5,8 +5,6 @@
 from datetime import datetime
 from random import randint

-from azure.identity import AzureCliCredential
-
 from samples.concepts.realtime.utils import AudioPlayerWebRTC, AudioRecorderWebRTC, check_audio_devices
 from semantic_kernel.connectors.ai import FunctionChoiceBehavior
 from semantic_kernel.connectors.ai.open_ai import (
@@ -81,8 +79,12 @@ async def main() -> None:
     # and can also be passed in the receive method
     # You can also pass in kernel, plugins, chat_history or settings here.
     # For WebRTC the audio_track is required
+
+    # Note: api_version (either through settings or directly in the client) must be set to "2025-08-28"
+    # for Azure OpenAI deployments realtime deployments.
     realtime_agent = AzureRealtimeWebRTC(
-        audio_track=AudioRecorderWebRTC(), region="swedencentral", plugins=[Helpers()], credential=AzureCliCredential()
+        audio_track=AudioRecorderWebRTC(),
+        plugins=[Helpers()],
     )

     # Create the settings for the session
@@ -103,6 +105,7 @@ async def main() -> None:
     flowery prose.
     """,
         voice="alloy",
+        output_modalities=["text", "audio"],
         turn_detection=TurnDetection(type="server_vad", create_response=True, silence_duration_ms=800, threshold=0.8),
         function_choice_behavior=FunctionChoiceBehavior.Auto(),
     )

python/samples/concepts/realtime/realtime_agent_with_function_calling_websocket.py

Lines changed: 3 additions & 0 deletions
@@ -82,6 +82,9 @@ async def main() -> None:
     # to signal the end of the user's turn and start the response.
     # manual VAD is not part of this sample
     # for more info: https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-turn_detection
+
+    # Note: api_version (either through settings or directly in the client) must be set to "2025-08-28"
+    # for Azure OpenAI deployments realtime deployments.
     settings = AzureRealtimeExecutionSettings(
         instructions="""
     You are a chat bot. Your name is Mosscap and

python/samples/concepts/realtime/simple_realtime_chat_webrtc.py

Lines changed: 28 additions & 15 deletions
@@ -5,10 +5,11 @@

 from samples.concepts.realtime.utils import AudioPlayerWebRTC, AudioRecorderWebRTC, check_audio_devices
 from semantic_kernel.connectors.ai.open_ai import (
+    AzureRealtimeExecutionSettings,
     ListenEvents,
-    OpenAIRealtimeExecutionSettings,
-    OpenAIRealtimeWebRTC,
 )
+from semantic_kernel.connectors.ai.open_ai.services.azure_realtime import AzureRealtimeWebRTC
+from semantic_kernel.contents import RealtimeTextEvent

 logging.basicConfig(level=logging.WARNING)
 utils_log = logging.getLogger("samples.concepts.realtime.utils")
@@ -42,7 +43,7 @@ async def main() -> None:
     # create the realtime client and optionally add the audio output function, this is optional
     # you can define the protocol to use, either "websocket" or "webrtc"
     # they will behave the same way, even though the underlying protocol is quite different
-    settings = OpenAIRealtimeExecutionSettings(
+    settings = AzureRealtimeExecutionSettings(
         instructions="""
     You are a chat bot. Your name is Mosscap and
     you have one goal: figure out what people need.
@@ -55,28 +56,40 @@ async def main() -> None:
         # see https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-voice
         # for more details.
         voice="alloy",
+        # Enable both text and audio output to get transcripts
+        output_modalities=["text", "audio"],
+    )
+    # Note: api_version (either through settings or directly in the client) must be set to "2025-08-28"
+    # for Azure OpenAI deployments realtime deployments.
+    realtime_client = AzureRealtimeWebRTC(
+        audio_track=AudioRecorderWebRTC(),
+        settings=settings,
     )
-    realtime_client = OpenAIRealtimeWebRTC(audio_track=AudioRecorderWebRTC(), settings=settings)
     # Create the settings for the session
     audio_player = AudioPlayerWebRTC()
     # the context manager calls the create_session method on the client and starts listening to the audio stream
     async with audio_player, realtime_client:
         async for event in realtime_client.receive(audio_output_callback=audio_player.client_callback):
-            match event.event_type:
-                case "text":
-                    # the model returns both audio and transcript of the audio, which we will print
-                    print(event.text.text, end="")
-                case "service":
-                    # OpenAI Specific events
-                    if event.service_type == ListenEvents.SESSION_UPDATED:
-                        print("Session updated")
-                    if event.service_type == ListenEvents.RESPONSE_CREATED:
-                        print("\nMosscap (transcript): ", end="")
+            match event:
+                case RealtimeTextEvent():
+                    # Only process delta events for streaming, skip done events to avoid duplication
+                    if event.service_type and "delta" in event.service_type and event.text.text:
+                        print(event.text.text, end="", flush=True)
+                    # Add newline when transcript is complete (done event)
+                    elif event.service_type and "done" in event.service_type:
+                        print()  # Add newline for readability
+                case _:
+                    # Handle service events
+                    if event.event_type == "service" and event.service_type:
+                        if event.service_type == ListenEvents.SESSION_UPDATED:
+                            print("Session updated")
+                        elif event.service_type == ListenEvents.RESPONSE_CREATED:
+                            print("\nMosscap (transcript): ", end="")


 if __name__ == "__main__":
     print(
-        "Instructions: start speaking. "
+        "Instructions: start speaking when you see 'Session updated.' "
         "The model will detect when you stop and automatically start responding. "
         "Press ctrl + c to stop the program."
     )

python/samples/concepts/realtime/simple_realtime_chat_websocket.py

Lines changed: 6 additions & 4 deletions
@@ -3,8 +3,6 @@
 import asyncio
 import logging

-from azure.identity import AzureCliCredential
-
 from samples.concepts.realtime.utils import AudioPlayerWebsocket, AudioRecorderWebsocket, check_audio_devices
 from semantic_kernel.connectors.ai.open_ai import (
     AzureRealtimeExecutionSettings,
@@ -59,7 +57,11 @@ async def main() -> None:
         # for more details.
         voice="shimmer",
     )
-    realtime_client = AzureRealtimeWebsocket(settings=settings, credential=AzureCliCredential())
+    # Note: api_version (either through settings or directly in the client) must be set to "2025-08-28"
+    # for Azure OpenAI deployments realtime deployments.
+    realtime_client = AzureRealtimeWebsocket(
+        settings=settings,
+    )
     audio_player = AudioPlayerWebsocket()
     audio_recorder = AudioRecorderWebsocket(realtime_client=realtime_client)
     # Create the settings for the session
@@ -84,7 +86,7 @@ async def main() -> None:

 if __name__ == "__main__":
     print(
-        "Instructions: Start speaking. "
+        "Instructions: Start speaking when you see 'Session updated.' "
         "The model will detect when you stop and automatically start responding. "
         "Press ctrl + c to stop the program."
     )

python/samples/concepts/realtime/utils.py

Lines changed: 1 addition & 0 deletions
@@ -321,6 +321,7 @@ def _sounddevice_callback(self, outdata, frames, time, status):
             logger.debug(f"Audio output status: {status}")
         if self._queue:
             if self._queue.empty():
+                outdata[:] = 0
                 return
             data = self._queue.get_nowait()
             outdata[:] = data.reshape(outdata.shape)
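
The added `outdata[:] = 0` fills the output buffer with silence when no audio is queued. A likely motivation, stated here as an assumption, is that a buffer left unwritten can still hold whatever was in it before. The sketch below reproduces the pattern with the standard library only: a `bytearray` stands in for the NumPy array that the callback receives, and `fill_output` is a hypothetical helper, not code from this commit.

```python
import queue


def fill_output(outdata: bytearray, pending: "queue.Queue[bytes]") -> None:
    """Write the next queued chunk into outdata, or silence if nothing is queued.

    Chunks are assumed to be exactly len(outdata) bytes long.
    """
    if pending.empty():
        # Zero-fill so stale samples from an earlier call are not replayed.
        outdata[:] = bytes(len(outdata))
        return
    outdata[:] = pending.get_nowait()


buffer = bytearray(b"\x07" * 4)  # pretend this holds leftover samples
chunks: "queue.Queue[bytes]" = queue.Queue()
chunks.put(b"\x01\x02\x03\x04")

fill_output(buffer, chunks)  # queue has data: buffer now holds the chunk
first = bytes(buffer)
fill_output(buffer, chunks)  # queue is empty: buffer is zeroed
second = bytes(buffer)
print(first, second)
```

Without the zero-fill branch, the second call would leave the first chunk in the buffer.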

python/semantic_kernel/connectors/ai/open_ai/const.py

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@

 from typing import Final

-DEFAULT_AZURE_API_VERSION: Final[str] = "2024-10-21"
+DEFAULT_AZURE_API_VERSION: Final[str] = "2025-08-28"

python/semantic_kernel/connectors/ai/open_ai/prompt_execution_settings/open_ai_realtime_execution_settings.py

Lines changed: 45 additions & 3 deletions
@@ -54,7 +54,7 @@ class TurnDetection(KernelBaseModel):
 class OpenAIRealtimeExecutionSettings(PromptExecutionSettings):
     """Request settings for OpenAI realtime services."""

-    modalities: Sequence[Literal["audio", "text"]] | None = None
+    output_modalities: Sequence[Literal["audio", "text"]] | None = None
     ai_model_id: Annotated[str | None, Field(None, serialization_alias="model")] = None
     instructions: str | None = None
     voice: str | None = None
@@ -76,10 +76,52 @@ class OpenAIRealtimeExecutionSettings(PromptExecutionSettings):
             "on the function choice configuration.",
         ),
     ] = None
-    temperature: Annotated[float | None, Field(ge=0.6, le=1.2)] = None
-    max_response_output_tokens: Annotated[int | Literal["inf"] | None, Field(gt=0)] = None
+    max_output_tokens: Annotated[int | Literal["inf"] | None, Field(gt=0)] = None
     input_audio_noise_reduction: dict[Literal["type"], Literal["near_field", "far_field"]] | None = None

+    def prepare_settings_dict(self, **kwargs) -> dict[str, Any]:
+        """Prepare the settings as a dictionary for sending to the AI service.
+
+        For realtime settings, we need to properly structure the audio configuration
+        to match the OpenAI API expectations where voice and turn_detection are nested
+        under the audio field.
+        """
+        # Get the base settings dict (excludes service_id, extension_data, etc.)
+        settings_dict = super().prepare_settings_dict(**kwargs)
+
+        # Build the audio configuration object
+        audio_config: dict[str, Any] = {}
+
+        # Handle voice (goes in audio.output.voice)
+        if "voice" in settings_dict:
+            audio_config.setdefault("output", {})["voice"] = settings_dict.pop("voice")
+
+        # Handle turn_detection (goes in audio.input.turn_detection)
+        if "turn_detection" in settings_dict:
+            audio_config.setdefault("input", {})["turn_detection"] = settings_dict.pop("turn_detection")
+
+        # Handle input audio format
+        if "input_audio_format" in settings_dict:
+            audio_config.setdefault("input", {})["format"] = settings_dict.pop("input_audio_format")
+
+        # Handle output audio format
+        if "output_audio_format" in settings_dict:
+            audio_config.setdefault("output", {})["format"] = settings_dict.pop("output_audio_format")
+
+        # Handle input audio transcription
+        if "input_audio_transcription" in settings_dict:
+            audio_config.setdefault("input", {})["transcription"] = settings_dict.pop("input_audio_transcription")
+
+        # Handle input audio noise reduction
+        if "input_audio_noise_reduction" in settings_dict:
+            audio_config.setdefault("input", {})["noise_reduction"] = settings_dict.pop("input_audio_noise_reduction")
+
+        # Add the audio config if it has any content
+        if audio_config:
+            settings_dict["audio"] = audio_config
+
+        return settings_dict
+

 class AzureRealtimeExecutionSettings(OpenAIRealtimeExecutionSettings):
     """Request settings for Azure OpenAI realtime services."""
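
The effect of `prepare_settings_dict` is easiest to see on a plain dictionary. The sketch below applies the same flat-to-nested mapping as the diff above, but as a standalone function without the Semantic Kernel base class. `nest_audio_settings` and the `_AUDIO_FIELDS` table are hypothetical names introduced for illustration.

```python
from typing import Any

# flat key -> (audio section, nested key), mirroring the mapping in the diff above
_AUDIO_FIELDS: dict[str, tuple[str, str]] = {
    "voice": ("output", "voice"),
    "turn_detection": ("input", "turn_detection"),
    "input_audio_format": ("input", "format"),
    "output_audio_format": ("output", "format"),
    "input_audio_transcription": ("input", "transcription"),
    "input_audio_noise_reduction": ("input", "noise_reduction"),
}


def nest_audio_settings(flat: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of flat with the audio-related keys moved under "audio"."""
    settings = dict(flat)  # shallow copy so the caller's dict is left untouched
    audio: dict[str, Any] = {}
    for flat_key, (section, nested_key) in _AUDIO_FIELDS.items():
        if flat_key in settings:
            audio.setdefault(section, {})[nested_key] = settings.pop(flat_key)
    # Only add the "audio" key when at least one audio field was present.
    if audio:
        settings["audio"] = audio
    return settings


flat_settings = {
    "instructions": "Be brief.",
    "voice": "alloy",
    "output_modalities": ["text", "audio"],
    "turn_detection": {"type": "server_vad"},
}
nested_settings = nest_audio_settings(flat_settings)
print(nested_settings)
```

A table-driven loop is used here only to keep the sketch short; the commit spells out each field with its own `if` block, which behaves the same way.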
