3 | 3 | [](https://buildwithfern.com?utm_source=github&utm_medium=github&utm_campaign=readme&utm_source=https%3A%2F%2Fgithub.com%2FAgoraIO-Conversational-AI%2Fagent-server-sdk-python) |
4 | 4 | [](https://pypi.python.org/pypi/agent-server-sdk-python) |
5 | 5 |
6 | | -The Agora Conversational AI SDK provides convenient access to the Agora Conversational AI APIs, |
7 | | -enabling you to build voice-powered AI agents with support for both cascading flows (ASR -> LLM -> TTS) |
| 6 | +The Agora Conversational AI SDK provides convenient access to the Agora Conversational AI APIs, |
| 7 | +enabling you to build voice-powered AI agents with support for both cascading flows (ASR -> LLM -> TTS) |
8 | 8 | and multimodal flows (MLLM) for real-time audio processing. |
9 | 9 |
| 10 | + |
10 | 11 | ## Table of Contents |
11 | 12 |
12 | 13 | - [Installation](#installation) |
13 | 14 | - [Quick Start](#quick-start) |
14 | 15 | - [Documentation](#documentation) |
15 | 16 | - [Reference](#reference) |
16 | 17 | - [Mllm Flow Multimodal](#mllm-flow-multimodal) |
| 18 | +- [Mllm Flow Multimodal](#mllm-flow-multimodal) |
17 | 19 | - [Usage](#usage) |
18 | 20 | - [Async Client](#async-client) |
19 | 21 | - [Exception Handling](#exception-handling) |
@@ -152,6 +154,71 @@ A full reference for this library is available [here](https://github.com/AgoraIO |
152 | 154 |
153 | 155 | For real-time audio processing using OpenAI's Realtime API or Google Gemini Live, use the MLLM (Multimodal Large Language Model) flow instead of the cascading ASR -> LLM -> TTS flow. See the [MLLM Overview](https://docs.agora.io/en/conversational-ai/models/mllm/overview) for more details. |
154 | 156 |
| 157 | +```python |
| 158 | +from agora_agent import Agora, Area |
| 159 | +from agora_agent.agentkit import ( |
| 160 | + AdvancedFeatures, |
| 161 | + TurnDetectionConfig, |
| 162 | + TurnDetectionTypeValues, |
| 163 | +) |
| 164 | +from agora_agent.agents import ( |
| 165 | + StartAgentsRequestProperties, |
| 166 | + StartAgentsRequestPropertiesMllm, |
| 167 | + StartAgentsRequestPropertiesMllmVendor, |
| 168 | + StartAgentsRequestPropertiesTts, |
| 169 | + StartAgentsRequestPropertiesTtsVendor, |
| 170 | + StartAgentsRequestPropertiesLlm, |
| 171 | +) |
| 172 | + |
| 173 | +client = Agora( |
| 174 | + area=Area.US, |
| 175 | + app_id="YOUR_APP_ID", |
| 176 | + app_certificate="YOUR_APP_CERTIFICATE", |
| 177 | +) |
| 178 | + |
| 179 | +client.agents.start( |
| 180 | + client.app_id, |
| 181 | + name="mllm_agent", |
| 182 | + properties=StartAgentsRequestProperties( |
| 183 | + channel="channel_name", |
| 184 | + token="your_token", |
| 185 | + agent_rtc_uid="1001", |
| 186 | + remote_rtc_uids=["1002"], |
| 187 | + idle_timeout=120, |
| 188 | + advanced_features=AdvancedFeatures(enable_mllm=True), |
| 189 | + mllm=StartAgentsRequestPropertiesMllm( |
| 190 | + url="wss://api.openai.com/v1/realtime", |
| 191 | + api_key="<your_openai_api_key>", |
| 192 | + vendor=StartAgentsRequestPropertiesMllmVendor.OPENAI, |
| 193 | + params={ |
| 194 | + "model": "gpt-4o-realtime-preview", |
| 195 | + "voice": "alloy", |
| 196 | + }, |
| 197 | + input_modalities=["audio"], |
| 198 | + output_modalities=["text", "audio"], |
| 199 | + greeting_message="Hello! I'm ready to chat in real-time.", |
| 200 | + ), |
| 201 | + turn_detection=TurnDetectionConfig( |
| 202 | + type=TurnDetectionTypeValues.SERVER_VAD, # deprecated; use config.end_of_speech instead |
| 203 | + threshold=0.5, |
| 204 | + silence_duration_ms=500, |
| 205 | + ), |
| 206 | + # TTS and LLM are still required but not used when MLLM is enabled |
| 207 | + tts=StartAgentsRequestPropertiesTts( |
| 208 | + vendor=StartAgentsRequestPropertiesTtsVendor.MICROSOFT, |
| 209 | + params={}, |
| 210 | + ), |
| 211 | + llm=StartAgentsRequestPropertiesLlm( |
| 212 | + url="https://api.openai.com/v1/chat/completions", |
| 213 | + ), |
| 214 | + ), |
| 215 | +) |
| 216 | +``` |
| 217 | + |
| 218 | +## MLLM Flow (Multimodal) |
| 219 | + |
| 220 | +For real-time audio processing using OpenAI's Realtime API or Google Gemini Live, use the MLLM (Multimodal Large Language Model) flow instead of the cascading ASR -> LLM -> TTS flow. See the [MLLM Overview](https://docs.agora.io/en/conversational-ai/models/mllm/overview) for more details. |
| 221 | + |
155 | 222 | ```python |
156 | 223 | from agora_agent import Agora
157 | 224 | from agora_agent.agents import (
@@ -212,6 +279,7 @@ client.agents.start( |
212 | 279 | ) |
213 | 280 | ``` |
214 | 281 |
| 282 | + |
215 | 283 | ## Usage |
216 | 284 |
217 | 285 | Instantiate and use the client with the following: |