sidebar_position: 2
title: MLLM Flow (Multimodal)
description: Use OpenAI Realtime, Gemini Live, Vertex AI, or xAI for end-to-end audio processing.

MLLM Flow (Multimodal)

The MLLM (Multimodal LLM) flow uses a single model to handle both audio input and output — no separate STT or TTS step. This gives the model direct access to voice tone, pacing, and emotion.

MLLM vendors supported by AgentKit:

  • OpenAI Realtime: gpt-4o-realtime-preview and related models
  • Gemini Live: direct Google AI API access for audio-native Gemini models
  • Vertex AI: Gemini Live through Google Cloud Vertex AI
  • xAI Grok: xAI Realtime API

Enable MLLM Mode

Call agent.with_mllm(vendor) to enable MLLM mode. The builder sets mllm.enable = True automatically, so you do not set the flag yourself. MLLM sessions do not require TTS, STT, or LLM vendors. Avatars are currently supported only with the cascading STT + LLM + TTS pipeline, not in MLLM mode.
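To make the "builder sets the flag for you" behavior concrete, here is a minimal sketch of the fluent-builder pattern. It is not the SDK's implementation: `SketchAgent`, its `mllm` dictionary, and the placeholder vendor string are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Any


@dataclass(frozen=True)
class SketchAgent:
    """Hypothetical stand-in for a fluent agent builder (not the SDK class)."""

    mllm: dict = field(default_factory=dict)

    def with_mllm(self, vendor: Any) -> "SketchAgent":
        # Store the vendor and flip the enable flag in one step,
        # so callers never have to set the flag by hand.
        return replace(self, mllm={"enable": True, "vendor": vendor})


# The vendor value is a placeholder; a real call would pass a vendor object.
agent = SketchAgent().with_mllm("vendor-placeholder")
print(agent.mllm["enable"])
```

Returning a new object from each builder call keeps partially configured agents from being mutated by later calls, which is one common reason to structure a builder this way.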

Set the agent instance name when you create the session:

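import time

from agora_agent import Agent

# `client` is an Agora client, created as in the vendor examples below.
agent = Agent(client=client)  # chain .with_mllm(...) here to enable MLLM mode
session = agent.create_session(
    channel=f"demo-channel-{int(time.time())}",
    agent_uid="1",
    remote_uids=["100"],
    name=f"conversation-{int(time.time())}",  # agent instance name
)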
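The snippet above calls `time.time()` twice, so the channel suffix and the instance-name suffix can differ if the clock ticks between the two calls. If you want them to match, one option is to derive both from a single timestamp. The helper below is a small sketch; `session_names` and its parameters are hypothetical and not part of the SDK.

```python
import time
from typing import Optional, Tuple


def session_names(
    channel_prefix: str = "demo-channel",
    name_prefix: str = "conversation",
    now: Optional[float] = None,
) -> Tuple[str, str]:
    """Return (channel, name) that share one timestamp suffix."""
    # `now` can be injected for tests; otherwise use the current time.
    stamp = int(time.time() if now is None else now)
    return f"{channel_prefix}-{stamp}", f"{name_prefix}-{stamp}"


channel, name = session_names(now=1700000000)
print(channel)  # demo-channel-1700000000
print(name)     # conversation-1700000000
```

The two values can then be passed as the `channel` and `name` arguments of `create_session`.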

OpenAI Realtime

Sync

from agora_agent import Agent, Agora, Area, OpenAIRealtime
import time

client = Agora(
    area=Area.US,
    app_id='your-app-id',
    app_certificate='your-app-certificate',
)

agent = (
    Agent(client=client)
    .with_mllm(OpenAIRealtime(
        api_key='your-openai-key',
        model='gpt-4o-realtime-preview',
    ))
)

session = agent.create_session(
    channel=f"demo-channel-{int(time.time())}",
    agent_uid='1',
    remote_uids=['100'],
    name=f"conversation-{int(time.time())}",
)
agent_id = session.start()
# Agent handles audio end-to-end — no separate STT/TTS needed
session.stop()
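In a real application, work happens between `start()` and `stop()`, and an exception there would skip the `stop()` call. A context manager is one way to guarantee cleanup. The sketch below relies only on the `start()` and `stop()` methods shown above; the `running` helper and `FakeSession` are hypothetical, and `FakeSession` exists only so the example runs without credentials.

```python
from contextlib import contextmanager


@contextmanager
def running(session):
    """Start a session and always stop it, even if the body raises."""
    agent_id = session.start()
    try:
        yield agent_id
    finally:
        session.stop()


class FakeSession:
    """Stand-in that records calls; replace with a real session object."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")
        return "agent-123"

    def stop(self):
        self.events.append("stop")


fake = FakeSession()
try:
    with running(fake) as agent_id:
        raise RuntimeError("simulated failure while the agent is running")
except RuntimeError:
    pass

print(fake.events)  # ['start', 'stop']
```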

Async

import asyncio
from agora_agent import Agent, AsyncAgora, Area, OpenAIRealtime
import time

async def main():
    client = AsyncAgora(
        area=Area.US,
        app_id='your-app-id',
        app_certificate='your-app-certificate',
    )

    agent = (
        Agent(client=client)
        .with_mllm(OpenAIRealtime(
            api_key='your-openai-key',
            model='gpt-4o-realtime-preview',
        ))
    )

    session = agent.create_session(
        channel=f"demo-channel-{int(time.time())}",
        agent_uid='1',
        remote_uids=['100'],
        name=f"conversation-{int(time.time())}",
    )
    agent_id = await session.start()
    await session.stop()

asyncio.run(main())
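The same cleanup concern applies to the async client: if the code between `await session.start()` and `await session.stop()` raises, the agent keeps running. A possible approach is an async context manager. This sketch assumes only the awaitable `start()` and `stop()` methods shown above; `running_async` and `FakeAsyncSession` are hypothetical names used for illustration.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def running_async(session):
    """Await start(), yield the agent ID, and always await stop()."""
    agent_id = await session.start()
    try:
        yield agent_id
    finally:
        await session.stop()


class FakeAsyncSession:
    """Stand-in that records calls; replace with a real async session."""

    def __init__(self):
        self.events = []

    async def start(self):
        self.events.append("start")
        return "async-agent-123"

    async def stop(self):
        self.events.append("stop")


async def demo():
    fake = FakeAsyncSession()
    async with running_async(fake) as agent_id:
        await asyncio.sleep(0)  # placeholder for real work
    return agent_id, fake.events


result = asyncio.run(demo())
print(result)  # ('async-agent-123', ['start', 'stop'])
```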

Gemini Live

Gemini Live uses a Google AI API key:

from agora_agent import Agent, Agora, Area, GeminiLive
import time

client = Agora(
    area=Area.AP,
    app_id='your-app-id',
    app_certificate='your-app-certificate',
)

agent = (
    Agent(client=client)
    .with_mllm(GeminiLive(
        api_key='your-google-ai-api-key',
        model='gemini-live-2.5-flash',
        voice='Aoede',
    ))
)

session = agent.create_session(
    channel=f"demo-channel-{int(time.time())}",
    agent_uid='1',
    remote_uids=['100'],
    name=f"conversation-{int(time.time())}",
)
agent_id = session.start()
session.stop()
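The examples on this page inline placeholder keys such as 'your-google-ai-api-key'. Outside of a demo, you would typically load keys from the environment instead of committing them to source. The helper below is a minimal sketch; `require_env` and the variable name `GOOGLE_AI_API_KEY` are assumptions, not names the SDK defines.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a non-empty environment value or fail with a clear message."""
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before starting the agent")
    return value


# A dictionary is injected here so the example runs without real credentials;
# omit `env` to read from the process environment.
api_key = require_env("GOOGLE_AI_API_KEY", env={"GOOGLE_AI_API_KEY": "demo-key"})
print(api_key)  # demo-key
```

The returned string can then be passed as the `api_key` argument of the vendor constructor.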

xAI Grok

from agora_agent import Agent, Agora, Area, XaiGrok
import time

client = Agora(area=Area.US, app_id='your-app-id', app_certificate='your-app-certificate')

agent = (
    Agent(client=client)
    .with_mllm(XaiGrok(
        api_key='your-xai-key',
        voice='eve',
        language='en',
        sample_rate=24000,
        output_modalities=['audio', 'text'],
    ))
)

session = agent.create_session(
    channel=f"demo-channel-{int(time.time())}",
    agent_uid='1',
    remote_uids=['100'],
    name=f"conversation-{int(time.time())}",
)
agent_id = session.start()
session.stop()

For xAI turn detection, use mllm.turn_detection with agora_vad or server_vad.
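Because only two turn-detection modes are named here, a typo in the mode string is easy to catch before a session starts. The check below is a sketch: `check_turn_detection_mode` is a hypothetical helper, and how the validated value is attached to `mllm.turn_detection` depends on the SDK, which this page does not show.

```python
# The two mode names come from the sentence above; nothing else is assumed
# about the shape of the turn-detection configuration.
ALLOWED_TURN_DETECTION_MODES = ("agora_vad", "server_vad")


def check_turn_detection_mode(mode: str) -> str:
    """Return the mode unchanged if it is one of the named modes."""
    if mode not in ALLOWED_TURN_DETECTION_MODES:
        allowed = ", ".join(ALLOWED_TURN_DETECTION_MODES)
        raise ValueError(f"Unknown turn detection mode {mode!r}; expected one of: {allowed}")
    return mode


print(check_turn_detection_mode("server_vad"))  # server_vad
```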

OpenAI Realtime with Custom Options

from agora_agent import OpenAIRealtime

mllm = OpenAIRealtime(
    api_key='your-openai-key',
    model='gpt-4o-realtime-preview',
    url='wss://custom-endpoint.example.com',
    greeting_message='Hello! I am ready to help.',
    input_modalities=['audio', 'text'],
    output_modalities=['audio', 'text'],
    params={'temperature': 0.8},
)

When to Use MLLM vs. Cascading

| Consideration | MLLM | Cascading |
| --- | --- | --- |
| Latency | Lower: single model, no pipeline | Higher: three models in sequence |
| Voice control | Model-dependent | Full vendor choice for TTS |
| Vendor flexibility | Limited to supported MLLM providers (OpenAI Realtime, Gemini Live, Vertex AI, xAI Grok) | Mix and match LLM, TTS, and STT vendors |
| Audio understanding | Model hears tone, pacing, emotion | STT produces text only |
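The comparison above can be condensed into a rough rule of thumb. The function below is only a sketch of that reasoning: `choose_flow` and its flags are hypothetical, and a real decision would also weigh cost, language coverage, and model quality.

```python
def choose_flow(
    *,
    needs_avatar: bool = False,
    needs_custom_tts_voice: bool = False,
    needs_vendor_mixing: bool = False,
) -> str:
    """Return "cascading" or "mllm" based on the trade-offs in the table."""
    # Avatars, a specific TTS vendor, or mixing LLM/TTS/STT vendors
    # are only available with the cascading pipeline.
    if needs_avatar or needs_custom_tts_voice or needs_vendor_mixing:
        return "cascading"
    # Otherwise prefer MLLM for lower latency and direct audio understanding.
    return "mllm"


print(choose_flow(needs_avatar=True))  # cascading
print(choose_flow())                   # mllm
```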

Next Steps

  • For the cascading pipeline, see Cascading Flow
  • To add a visual avatar, use the cascading pipeline and see Avatars