refactor: remove default web search #7416

Merged

Soulter merged 2 commits into master from refactor/websearch on Apr 8, 2026
Conversation

@Soulter Soulter (Member) commented Apr 8, 2026

Modifications

  • This is NOT a breaking change.

Screenshots or Test Results


Checklist

  • 😊 If the PR adds new features, I have discussed them with the authors via issues, email, etc.

  • 👀 My changes have been well tested, and "Verification Steps" and "Screenshots" have been provided above.

  • 🤓 I have ensured that no new dependencies are introduced, or, if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.

  • 😮 My changes do not introduce malicious code.

Summary by Sourcery

Refactor web search integration to use a new built-in tool set and drop the legacy default web search provider.

New Features:

  • Add first-class web search tools for Tavily (including page extraction), BoCha, Brave, and Baidu AI Search that integrate with the main agent tool system.

Enhancements:

  • Automatically apply web search tools in the main agent based on provider configuration, including normalization of legacy web search settings and key formats.
  • Extend chat dashboards and agent hooks to recognize Baidu web search results for citation rendering and reference extraction.
  • Update default configuration to use Tavily as the web search provider and simplify provider options by removing the deprecated default provider.
  • Revise English and Chinese user and developer documentation to describe the new set of supported web search providers and configuration options.
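The legacy-settings normalization mentioned above could look roughly like the sketch below; the setting names and the string-to-list key convention are assumptions for illustration, not the actual AstrBot migration code.

```python
# Hypothetical sketch: the setting names and the string-vs-list convention
# are assumptions, not AstrBot's real migration logic.
def normalize_web_search_keys(cfg: dict) -> None:
    # Older configs stored a single API key as a plain string; the tool
    # layer expects a list so keys can be rotated.
    for name in (
        "websearch_tavily_key",
        "websearch_bocha_key",
        "websearch_brave_key",
    ):
        value = cfg.get(name)
        if isinstance(value, str) and value:
            cfg[name] = [value]
```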

Chores:

  • Remove the legacy built-in web_searcher plugin and its engine implementations for the deprecated default web search.

@auto-assign auto-assign bot requested review from Fridemn and anka-afk April 8, 2026 05:21
@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files), area:core (The bug / feature is about astrbot's core, backend), and area:provider (The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner) labels on Apr 8, 2026
@sourcery-ai sourcery-ai bot (Contributor) left a comment

Hey - I've found 2 issues, and left some high-level feedback:

  • In _cache_favicon, accessing sp.temporary_cache["_ws_favicon"] assumes the sub-dict already exists; consider using setdefault or an existence check to avoid a KeyError on first use.
  • The API key presence checks in the tool call methods (e.g., Tavily/BoCha/Brave) are duplicated by the _KeyRotator.get validation; you could simplify by centralizing the validation in _KeyRotator.get and removing the redundant checks in each tool.
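The `setdefault` fix suggested in the first point could be sketched as follows; `SettingsProxy` and `cache_favicon` are stand-ins for the names in the review comment, not the real AstrBot classes.

```python
# Hypothetical sketch of the setdefault fix; SettingsProxy and cache_favicon
# are stand-ins for the real AstrBot objects mentioned in the review.
class SettingsProxy:
    def __init__(self) -> None:
        self.temporary_cache: dict = {}

def cache_favicon(sp: SettingsProxy, url: str, favicon: str) -> None:
    # setdefault creates the "_ws_favicon" sub-dict on first use,
    # so the lookup can never raise KeyError.
    bucket = sp.temporary_cache.setdefault("_ws_favicon", {})
    bucket[url] = favicon
```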
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `_cache_favicon`, accessing `sp.temporary_cache["_ws_favicon"]` assumes the sub-dict already exists; consider using `setdefault` or an existence check to avoid a `KeyError` on first use.
- The API key presence checks in the tool `call` methods (e.g., Tavily/BoCha/Brave) are duplicated by the `_KeyRotator.get` validation; you could simplify by centralizing the validation in `_KeyRotator.get` and removing the redundant checks in each tool.
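Centralizing the validation in the rotator, as the second point suggests, could look like this sketch. The class name echoes `_KeyRotator` from the review, but the internals (and the synchronous `get`, where the real one is awaited) are assumptions.

```python
# Hypothetical sketch of centralizing key validation in the rotator;
# the real _KeyRotator.get is async and its internals may differ.
class KeyRotator:
    def __init__(self, setting_name: str, label: str) -> None:
        self.setting_name = setting_name
        self.label = label
        self._index = 0

    def get(self, provider_settings: dict) -> str:
        keys = provider_settings.get(self.setting_name) or []
        if isinstance(keys, str):
            keys = [keys]
        if not keys:
            # Single place that reports a missing key, so each tool's
            # call method no longer needs its own presence check.
            raise ValueError(f"{self.label} API key is not configured in AstrBot.")
        key = keys[self._index % len(keys)]
        self._index += 1
        return key
```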

## Individual Comments

### Comment 1
<location path="astrbot/core/tools/web_search_tools.py" line_range="149-135" />
<code_context>
+            ]
+
+
+async def _tavily_extract(provider_settings: dict, payload: dict) -> list[dict]:
+    tavily_key = await _TAVILY_KEY_ROTATOR.get(provider_settings)
+    header = {
+        "Authorization": f"Bearer {tavily_key}",
+        "Content-Type": "application/json",
+    }
+    async with aiohttp.ClientSession(trust_env=True) as session:
+        async with session.post(
+            "https://api.tavily.com/extract",
+            json=payload,
+            headers=header,
+        ) as response:
+            if response.status != 200:
+                reason = await response.text()
+                raise Exception(
+                    f"Tavily web search failed: {reason}, status: {response.status}",
+                )
+            data = await response.json()
</code_context>
<issue_to_address>
**nitpick:** Error message for Tavily extract endpoint still refers to "web search" which can be misleading.

Since this helper targets the Tavily extract endpoint, using "Tavily web search failed" in the exception is misleading and makes it harder to distinguish extract vs search issues in logs. Please update the message to mention extract (or the specific endpoint) so failures are clearly identifiable.
</issue_to_address>

### Comment 2
<location path="astrbot/core/tools/web_search_tools.py" line_range="117" />
<code_context>
+    return json.dumps({"results": ret_ls}, ensure_ascii=False)
+
+
+async def _tavily_search(
+    provider_settings: dict,
+    payload: dict,
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared HTTP/JSON handling and common provider key/result checks into small helpers to reduce duplication and make each web search function and tool call easier to read.

You can keep all functionality as-is and still trim a lot of complexity by extracting a couple of tight internal helpers. Two places where the code is currently quite duplicated:

---

### 1. Centralize HTTP request/JSON handling

The search functions (`_tavily_search`, `_tavily_extract`, `_bocha_search`, `_brave_search`, `_baidu_search`) all duplicate `ClientSession` creation, status checking, and JSON parsing.

You can extract a small internal helper and keep provider-specific mapping logic local:

```python
async def _request_json(
    method: str,
    url: str,
    *,
    headers: dict,
    json_payload: dict | None = None,
    params: dict | None = None,
    error_prefix: str,
) -> dict:
    async with aiohttp.ClientSession(trust_env=True) as session:
        request = session.post if method.lower() == "post" else session.get
        async with request(url, json=json_payload, params=params, headers=headers) as resp:
            if resp.status != 200:
                reason = await resp.text()
                raise Exception(f"{error_prefix} failed: {reason}, status: {resp.status}")
            return await resp.json()
```

Then each provider becomes much smaller, e.g.:

```python
async def _tavily_search(provider_settings: dict, payload: dict) -> list[SearchResult]:
    tavily_key = await _TAVILY_KEY_ROTATOR.get(provider_settings)
    headers = {
        "Authorization": f"Bearer {tavily_key}",
        "Content-Type": "application/json",
    }
    data = await _request_json(
        "post",
        "https://api.tavily.com/search",
        headers=headers,
        json_payload=payload,
        error_prefix="Tavily web search",
    )
    return [
        SearchResult(
            title=item.get("title"),
            url=item.get("url"),
            snippet=item.get("content"),
            favicon=item.get("favicon"),
        )
        for item in data.get("results", [])
    ]
```

And similarly for `_tavily_extract`, `_bocha_search`, `_brave_search`, `_baidu_search`. This pulls out the noisy, repeated error/HTTP boilerplate and lets you focus on payload building + result mapping.

---

### 2. Normalize “provider key missing” & empty-result handling

Each tool’s `call` repeats:

- Get `provider_settings`.
- Check for presence of API key and return similar error strings.
- After calling the search function, check `if not results: return "Error: … does not return any results."`.

You can hide this pattern behind very small helpers per provider without changing behavior:

```python
def _ensure_provider_key(provider_settings: dict, key_name: str, label: str) -> str | None:
    value = provider_settings.get(key_name)
    if not value:
        # Preserve exact error wording
        return f"Error: {label} API key is not configured in AstrBot."
    return None

def _ensure_results(results: list[SearchResult], label: str) -> str | None:
    if not results:
        return f"Error: {label} web searcher does not return any results."
    return None
```

Usage example in `BochaWebSearchTool.call`:

```python
async def call(self, context, **kwargs) -> ToolExecResult:
    _, provider_settings, _ = _get_runtime(context)

    err = _ensure_provider_key(provider_settings, "websearch_bocha_key", "BoCha")
    if err:
        return err

    payload = {
        "query": kwargs["query"],
        "count": kwargs.get("count", 10),
        "summary": bool(kwargs.get("summary", False)),
    }
    if kwargs.get("freshness"):
        payload["freshness"] = kwargs["freshness"]
    if kwargs.get("include"):
        payload["include"] = kwargs["include"]
    if kwargs.get("exclude"):
        payload["exclude"] = kwargs["exclude"]

    results = await _bocha_search(provider_settings, payload)

    err = _ensure_results(results, "BoCha")
    if err:
        return err

    return _search_result_payload(results)
```

You can do the same for Tavily/Brave/Baidu, keeping provider-specific messages intact while cutting repeated scaffolding and making the happy-path logic easier to scan.

---

These two small extractions keep everything in one module, don’t change observable behavior, and remove a good chunk of duplication that contributes to the “600+ line blob” feel.
</issue_to_address>


@Soulter Soulter merged commit 94a529d into master Apr 8, 2026
6 checks passed
@Soulter Soulter deleted the refactor/websearch branch April 8, 2026 05:26
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request refactors the web search functionality by removing the legacy "default" search engine scrapers and introducing a modular tool-based system for API providers including Tavily, BoCha, Brave, and Baidu AI Search. The changes include a new web_search_tools.py file, configuration migration logic, and updates to the dashboard and documentation to support these providers. Feedback focuses on optimizing the configuration migration path to avoid redundant disk writes during LLM requests, improving error handling for API responses in the BoCha tool, and removing unnecessary query truncation in the Baidu search implementation.

```python
    plugin_context: Context,
) -> None:
    cfg = plugin_context.get_config(umo=event.unified_msg_origin)
    normalize_legacy_web_search_config(cfg)
```
medium

normalize_legacy_web_search_config(cfg) is called on every LLM request. This function performs configuration migration and potentially saves the config to disk. Executing this logic in the request hot path is inefficient. It should be moved to a startup or configuration loading phase.
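One way to move the migration out of the hot path, sketched under the assumptions that configs are plain dicts and that normalization is idempotent (the real `normalize_legacy_web_search_config` may differ):

```python
# Hypothetical sketch: run the legacy-config migration once per config
# object instead of on every LLM request.
_migrated: set[int] = set()

def normalize_once(cfg: dict, normalize) -> None:
    # id() keys the cache by config object identity; a startup hook that
    # normalizes every loaded config would avoid even this check.
    if id(cfg) in _migrated:
        return
    normalize(cfg)
    _migrated.add(id(cfg))
```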

```python
                    f"BoCha web search failed: {reason}, status: {response.status}",
                )
            data = await response.json()
            rows = data["data"]["webPages"]["value"]
```
medium

Accessing data["data"]["webPages"]["value"] directly might raise a KeyError if the BoCha API returns an unexpected response structure or an error message in a successful HTTP response. Using .get() or verifying the structure before access would be safer.

Suggested change:

```diff
- rows = data["data"]["webPages"]["value"]
+ rows = data.get("data", {}).get("webPages", {}).get("value", [])
```

```python
        top_k = 50

        payload = {
            "messages": [{"role": "user", "content": str(kwargs["query"])[:72]}],
```
medium

The Baidu AI Search query is truncated to 72 characters. This limit seems quite low for modern LLM-driven search queries and might lead to loss of context. If there isn't a strict API requirement for this specific length, consider increasing it or removing the truncation.

Suggested change:

```diff
- "messages": [{"role": "user", "content": str(kwargs["query"])[:72]}],
+ "messages": [{"role": "user", "content": str(kwargs["query"])}],
```

NekyuuYa pushed a commit to NekyuuYa/AstrBot-R that referenced this pull request Apr 8, 2026
* refactor: remove default web search

* chore: ruff format
