
feat: add audio input support across providers and chatui recording issue fix #7378

Open

Soulter wants to merge 3 commits into master from feat/llm-audio

Conversation

@Soulter
Member

@Soulter Soulter commented Apr 5, 2026

  • Introduced audio_urls parameter in Provider class and related methods to handle audio input.
  • Updated ProviderAnthropic, ProviderGoogleGenAI, and ProviderOpenAIOfficial to process audio URLs.
  • Enhanced media_utils with functions to ensure audio format compatibility and detect audio types.
  • Modified dashboard components to display audio input support and handle audio attachments in messages.
  • Updated localization files to include audio as a supported modality.
  • Added new icons for audio input in the dashboard UI.

Modifications

  • This is NOT a breaking change.

Screenshots or Test Results


Checklist

  • 😊 If the PR adds new features, I have discussed them with the authors via issues, email, etc.

  • 👀 My changes have been well tested, and "Verification Steps" and "Screenshots" have been provided above.

  • 🤓 I have ensured that no new dependencies are introduced, OR, if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.

  • 😮 My changes do not introduce malicious code.

Summary by Sourcery

Add end-to-end audio input support across providers and dashboard, and improve multimedia handling and recording flows.

New Features:

  • Support passing audio URLs/paths through ProviderRequest and provider APIs for OpenAI, Gemini, Anthropic, and generic provider flows.
  • Enable sending recorded audio from the chat UI as attachments, wiring uploaded audio into backend message parts and LLM requests.
  • Expose provider audio-input capability in the dashboard via model metadata, configuration schema, and new audio icons.

Bug Fixes:

  • Fix chat UI recording upload so it preserves the correct MIME type, filename extension, and attachment identifier when posting audio files.

Enhancements:

  • Normalize and convert audio inputs to compatible formats (notably WAV/MP3) with new media utilities and preprocessing of Record components.
  • Extend modality sanitization and placeholder logic to handle audio alongside images and tools, improving behavior with providers that lack audio support.
  • Simplify message component models by removing unused WeChat-specific and caching fields and updating respond-stage handling accordingly.
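The WAV normalization described above presumably shells out to a converter. A minimal sketch, assuming ffmpeg is available on PATH — the real ensure_wav in media_utils may be implemented differently:

```python
import asyncio
from pathlib import Path

async def ensure_wav(audio_path: str) -> str:
    """Return a path to a WAV version of the input, converting with ffmpeg if needed."""
    path = Path(audio_path)
    if path.suffix.lower() == ".wav":
        return str(path)  # already WAV, nothing to do
    target = path.with_suffix(".wav")
    proc = await asyncio.create_subprocess_exec(
        "ffmpeg", "-y", "-i", str(path), str(target),
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    await proc.wait()
    if proc.returncode != 0:
        raise RuntimeError(f"ffmpeg failed to convert {audio_path} to WAV")
    return str(target)
```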

Documentation:

  • Update localized configuration metadata to document audio as a supported modality in provider capabilities.

…ssue fix

- Introduced audio_urls parameter in Provider class and related methods to handle audio input.
- Updated ProviderAnthropic, ProviderGoogleGenAI, and ProviderOpenAIOfficial to process audio URLs.
- Enhanced media_utils with functions to ensure audio format compatibility and detect audio types.
- Modified dashboard components to display audio input support and handle audio attachments in messages.
- Updated localization files to include audio as a supported modality.
- Added new icons for audio input in the dashboard UI.
@auto-assign auto-assign bot requested review from Fridemn and LIghtJUNction April 5, 2026 17:31
@dosubot dosubot bot added the size:XL, area:provider, and feature:chatui labels Apr 5, 2026
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 5 issues and left some high-level feedback:

  • In gemini_source.process_image_url the list comprehension now routes all non-text items that are not explicitly image_url through process_audio_url(item["audio_url"]), which will raise a KeyError for other part types (e.g. input_audio or any future multimodal types); consider branching explicitly on item["type"] and only calling process_audio_url when the type is audio_url.
  • In ProviderRequest._encode_audio_bs64 the returned data URI is always labeled as audio/wav regardless of the real source format, while other code paths (e.g. dashboard recording uploads, Gemini/OpenAI resolution) may pass through MP3/OGG/WEBM without conversion; it would be safer either to normalize to WAV before encoding or to detect and use the correct MIME type when building the data URI.

## Individual Comments

### Comment 1
<location path="astrbot/core/provider/entities.py" line_range="280-289" />
<code_context>
             return "data:image/jpeg;base64," + image_bs64
-        return ""
+
+    async def _encode_audio_bs64(
+        self,
+        audio_path: str,
+        source_ref: str | None = None,
+    ) -> str:
+        """将音频转换为 base64"""
+        mime_type = "audio/wav"
+
+        if audio_path.startswith("base64://"):
+            return audio_path.replace("base64://", f"data:{mime_type};base64,", 1)
+
+        with open(audio_path, "rb") as f:
+            audio_bs64 = base64.b64encode(f.read()).decode("utf-8")
+            return f"data:{mime_type};base64," + audio_bs64

</code_context>
<issue_to_address>
**suggestion (bug_risk):** Audio is always labeled as audio/wav even when the original format is different.

`_encode_audio_bs64` hardcodes `mime_type = "audio/wav"`, but `audio_path` may point to MP3/OGG/etc., and for HTTP URLs in `assemble_context` you don’t normalize to WAV before encoding. This can yield a data URL whose MIME type doesn’t match the actual bytes. Please either convert non-WAV inputs to WAV before calling this helper (as with `ensure_wav`) or infer the MIME type (e.g., from extension or magic) so the MIME and content stay consistent.
</issue_to_address>
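One way to keep the data-URI label consistent with the bytes is to infer the MIME type from the extension with the stdlib mimetypes module; the helper name and fallback behavior below are illustrative, not the PR's code:

```python
import mimetypes

def guess_audio_mime(audio_path: str, fallback: str = "audio/wav") -> str:
    """Infer an audio MIME type from the file extension; fall back to WAV."""
    mime, _ = mimetypes.guess_type(audio_path)
    if mime and mime.startswith("audio/"):
        return mime
    return fallback  # unknown or non-audio extension
```

Magic-byte sniffing would be more robust than extension lookup when the upload's filename cannot be trusted.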

### Comment 2
<location path="astrbot/core/provider/entities.py" line_range="291-293" />
<code_context>
+        if audio_path.startswith("base64://"):
+            return audio_path.replace("base64://", f"data:{mime_type};base64,", 1)
+
+        with open(audio_path, "rb") as f:
+            audio_bs64 = base64.b64encode(f.read()).decode("utf-8")
+            return f"data:{mime_type};base64," + audio_bs64


</code_context>
<issue_to_address>
**issue (bug_risk):** Audio encoding lacks error handling, unlike the image path, which can cause request assembly to fail on I/O errors.

Here we don't catch `FileNotFoundError`/`OSError`, so any bad audio path will raise from `assemble_context` and fail the whole request. Consider matching the image helper’s behavior by handling these errors (e.g., log and return empty/None) so a single bad audio file doesn’t abort an otherwise valid call.
</issue_to_address>
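Matching the image helper's error handling could look like the following sketch (function and logger names are assumptions, not the PR's code):

```python
import base64
import logging

logger = logging.getLogger(__name__)

def encode_audio_b64_safe(audio_path: str) -> str:
    """Base64-encode an audio file; log and return "" on I/O errors instead of raising."""
    try:
        with open(audio_path, "rb") as f:
            return base64.b64encode(f.read()).decode("utf-8")
    except OSError as exc:  # covers FileNotFoundError, PermissionError, etc.
        logger.warning("Failed to read audio file %s: %s", audio_path, exc)
        return ""
```

The caller can then skip empty results with a warning, so one bad attachment does not abort the whole request.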

### Comment 3
<location path="astrbot/core/provider/sources/openai_source.py" line_range="317-326" />
<code_context>
+        return url
+
+    async def _audio_ref_to_local_path(self, audio_ref: str) -> str:
+        if audio_ref.startswith("http"):
+            suffix = Path(urlparse(audio_ref).path).suffix or ".wav"
+            temp_dir = Path(get_astrbot_temp_path())
+            temp_dir.mkdir(parents=True, exist_ok=True)
+            target_path = temp_dir / f"provider_audio_{uuid.uuid4().hex}{suffix}"
+            await download_file(audio_ref, str(target_path))
+            return str(target_path)
+        if audio_ref.startswith("file://"):
+            return self._file_uri_to_path(audio_ref)
+        return audio_ref
+
+    async def _resolve_audio_part(self, audio_ref: str) -> dict | None:
</code_context>
<issue_to_address>
**question:** Audio ref resolution only supports http/file paths and treats all other schemes as filesystem paths.

`_audio_ref_to_local_path` only treats `http` and `file://` specially and otherwise returns `audio_ref` unchanged. If `AudioURLPart.audio_url.url` is ever a `data:` / `base64://` or other non-file URL, `_resolve_audio_part` will pass that into `ensure_wav`, which expects a real filesystem path and will fail with a confusing I/O error. If such URLs are in scope, we should either decode/handle them here or explicitly reject them (with logging/skip) rather than letting them propagate as if they were local paths.
</issue_to_address>

### Comment 4
<location path="astrbot/core/provider/sources/gemini_source.py" line_range="343" />
<code_context>
                         (
                             types.Part.from_text(text=item["text"] or " ")
                             if item["type"] == "text"
-                            else process_image_url(item["image_url"])
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting small helpers for building content parts and for converting audio paths to data URLs to keep the Gemini adapter simpler and easier to extend.

You can reduce the added complexity in two small, focused steps without changing behavior.

---

### 1. Replace nested ternary in the list comprehension with a small helper

The current comprehension:

```python
parts = [
    (
        types.Part.from_text(text=item["text"] or " ")
        if item["type"] == "text"
        else (
            process_image_url(item["image_url"])
            if item["type"] == "image_url"
            else process_audio_url(item["audio_url"])
        )
    )
    for item in content
]
```

is now hard to scan and extend. A tiny `build_part` helper makes the intent explicit and keeps the comprehension simple:

```python
def build_part(item: dict) -> types.Part:
    t = item.get("type")
    if t == "text":
        return types.Part.from_text(text=item.get("text") or " ")
    if t == "image_url":
        return process_image_url(item["image_url"])
    if t == "audio_url":
        return process_audio_url(item["audio_url"])
    raise ValueError(f"Unsupported content item type: {t}")

# usage
if isinstance(content, list):
    parts = [build_part(item) for item in content]
else:
    parts = [create_text_part(content)]
```

This keeps the existing behavior but removes the nested ternary chain and makes future media types easier to add.

---

### 2. Extract a reusable audio → data-URL helper to avoid duplicating the pipeline

`resolve_audio_part` is implementing a full “path/URL → temp file → wav/mp3 → bytes → base64 → data URL” pipeline inline. That logic is likely to be reused by other providers and is conceptually separate from “Gemini content assembly”.

You can move the generic part into a shared helper (e.g. in `astrbot.core.utils.media_utils`) and keep `resolve_audio_part` thin:

```python
# in astrbot.core.utils.media_utils (or similar)
from pathlib import Path
from urllib.parse import urlparse
import base64
import uuid

from .astrbot_path import get_astrbot_temp_path
from .io import download_file
from .media_utils import ensure_wav  # if this is the same module, rename to avoid cycle
from astrbot import logger

async def audio_path_to_data_url(audio_path: str) -> str | None:
    if audio_path.startswith("http"):
        suffix = Path(urlparse(audio_path).path).suffix or ".wav"
        temp_dir = Path(get_astrbot_temp_path())
        temp_dir.mkdir(parents=True, exist_ok=True)
        resolved_path = str(temp_dir / f"provider_audio_{uuid.uuid4().hex}{suffix}")
        await download_file(audio_path, resolved_path)
    elif audio_path.startswith("file:///"):
        resolved_path = audio_path.replace("file:///", "")
    else:
        resolved_path = audio_path

    suffix = Path(resolved_path).suffix.lower()
    if suffix != ".mp3":
        resolved_path = await ensure_wav(resolved_path)
        suffix = ".wav"

    try:
        audio_bytes = Path(resolved_path).read_bytes()
    except OSError as exc:
        logger.warning(f"Failed to read audio file {resolved_path}, skipping. Error: {exc}")
        return None

    mime_type = {
        ".wav": "audio/wav",
        ".mp3": "audio/mp3",
    }.get(suffix, "audio/wav")

    audio_data = base64.b64encode(audio_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{audio_data}"
```

Then `resolve_audio_part` in this provider becomes a small Gemini-specific wrapper:

```python
from astrbot.core.utils.media_utils import audio_path_to_data_url

async def resolve_audio_part(audio_path: str) -> dict | None:
    audio_url = await audio_path_to_data_url(audio_path)
    if not audio_url:
        return None
    return {
        "type": "audio_url",
        "audio_url": {"url": audio_url},
    }
```

This keeps the Gemini adapter focused on content formatting, avoids duplicating the audio pipeline that other providers already implement, and makes it straightforward to fix or extend audio handling in one place.
</issue_to_address>

### Comment 5
<location path="astrbot/core/provider/entities.py" line_range="188" />
<code_context>

         return "\n".join(result_parts)

     async def assemble_context(self) -> dict:
-        """将请求(prompt 和 image_urls)包装成 OpenAI 的消息格式。"""
+        """将请求(prompt、image_urls 和 audio_urls)包装成统一消息格式。"""
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting image/audio handling into dedicated helper methods so `assemble_context` stays linear and focused on assembling content blocks rather than media-specific branching details.

You can keep `ProviderRequest` simpler by pushing the media‑specific branching into small helpers and reusing common logic, instead of inlining everything in `assemble_context`.

Two concrete steps:

---

### 1. Extract image/audio block helpers to keep `assemble_context` linear

This keeps `assemble_context` focused on “what goes into the content list”, not “how each media is downloaded/encoded”.

```python
async def assemble_context(self) -> dict:
    """将请求(prompt、image_urls 和 audio_urls)包装成统一消息格式。"""
    content_blocks: list[dict] = []

    # 1. 用户原始发言(OpenAI 建议:用户发言在前)
    if self.prompt and self.prompt.strip():
        content_blocks.append({"type": "text", "text": self.prompt})
    elif self.image_urls:
        content_blocks.append({"type": "text", "text": "[图片]"})
    elif self.audio_urls:
        content_blocks.append({"type": "text", "text": "[音频]"})

    # 2. 额外的内容块(系统提醒、指令等)
    for part in self.extra_user_content_parts or []:
        content_blocks.append(part.model_dump())

    # 3. 图片内容
    await self._append_image_blocks(content_blocks)

    # 4. 音频内容
    await self._append_audio_blocks(content_blocks)

    # 兼容简单格式
    if (
        len(content_blocks) == 1
        and content_blocks[0]["type"] == "text"
        and not self.extra_user_content_parts
        and not self.image_urls
        and not self.audio_urls
    ):
        return {"role": "user", "content": content_blocks[0]["text"]}

    return {"role": "user", "content": content_blocks}

async def _append_image_blocks(self, content_blocks: list[dict]) -> None:
    if not self.image_urls:
        return
    for image_url in self.image_urls:
        # 保持现有逻辑不变
        if image_url.startswith("http"):
            image_path = await download_image_by_url(image_url)
            image_data = await self._encode_image_bs64(image_path)
        else:
            image_data = await self._encode_image_bs64(image_url)

        if not image_data:
            logger.warning(f"图片 {image_url} 得到的结果为空,将忽略。")
            continue

        content_blocks.append({"type": "image_url", "image_url": {"url": image_data}})

async def _append_audio_blocks(self, content_blocks: list[dict]) -> None:
    if not self.audio_urls:
        return
    for audio_url in self.audio_urls:
        audio_data = await self._load_and_encode_audio(audio_url)
        if not audio_data:
            logger.warning(f"音频 {audio_url} 得到的结果为空,将忽略。")
            continue
        content_blocks.append({"type": "audio_url", "audio_url": {"url": audio_data}})
```

---

### 2. Centralize audio path/download handling into a small helper

This avoids repeating the `http` / `file:///` / local path branching inside `assemble_context` and makes it more testable.

```python
async def _load_and_encode_audio(self, audio_url: str) -> str:
    if audio_url.startswith("http"):
        parsed_url = urlparse(audio_url)
        suffix = Path(parsed_url.path).suffix
        temp_audio_path = (
            Path(get_astrbot_temp_path())
            / f"provider_request_audio_{uuid.uuid4().hex}{suffix}"
        )
        await download_file(audio_url, str(temp_audio_path))
        return await self._encode_audio_bs64(str(temp_audio_path), source_ref=audio_url)

    if audio_url.startswith("file:///"):
        audio_path = audio_url.replace("file:///", "")
        return await self._encode_audio_bs64(audio_path, source_ref=audio_url)

    # 默认视为本地路径或 base64://
    return await self._encode_audio_bs64(audio_url, source_ref=audio_url)
```

This keeps all existing behavior and media support, but makes `ProviderRequest` easier to scan and reduces the multimodal branching in `assemble_context` to a couple of clearly‑named calls.
</issue_to_address>


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements comprehensive audio modality support across the core agent, LLM providers (OpenAI, Gemini, Anthropic), and the dashboard UI. It introduces audio URL handling in ProviderRequest, adds utilities for audio format detection and WAV conversion, and updates the message pipeline to handle record components. Feedback primarily addresses resource management, noting that temporary audio files are not deleted after processing, which could lead to disk exhaustion. Other points include correcting MIDI magic byte detection mislabeled as MP4, improving MIME type handling during audio encoding, and suggesting that temporary files be tracked within the event lifecycle for automatic cleanup.
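The temp-file cleanup concern above could be addressed by registering files as they are created and deleting them when the event finishes; a minimal sketch — the class name and the event-lifecycle hook it would plug into are assumptions:

```python
import os

class TempFileRegistry:
    """Collect temp files created while handling one event; delete them at the end."""

    def __init__(self) -> None:
        self._paths: list[str] = []

    def register(self, path: str) -> str:
        """Record a temp file for later cleanup and return the path unchanged."""
        self._paths.append(path)
        return path

    def cleanup(self) -> None:
        """Best-effort removal of all registered files."""
        for path in self._paths:
            try:
                os.remove(path)
            except OSError:
                pass  # already gone or not removable; nothing more to do
        self._paths.clear()
```

Download helpers would wrap their temp paths in `registry.register(...)`, and the event pipeline would call `cleanup()` in a finally block.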


Labels

  • area:provider: The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner.
  • feature:chatui: The bug / feature is about astrbot's chatui, webchat.
  • size:XL: This PR changes 500-999 lines, ignoring generated files.
