
Enhance backend configurations with revision support and chat options #16

Merged
NullPointerDepressiveDisorder merged 10 commits into main from bug/chat-mode on Apr 20, 2026

Conversation

@NullPointerDepressiveDisorder
Owner

This pull request introduces significant improvements to backend configuration and prompt formatting, with a focus on consistent "thinking" mode control across all supported backends. The main changes add support for disabling "thinking" (reasoning) in model prompts, propagate this option through all backend adapters, and improve compatibility with Ollama and other OpenAI-compatible servers. Additionally, the transformers library is added as a dependency, and the documentation is updated to reflect correct server URLs.

Backend configuration and prompt formatting improvements:

  • Added the disable_thinking and hf_revision fields to BackendConfig, and ensured these options are propagated to all backend adapters (mlx-lm, llama-cpp, vllm-mlx, openai-compat). This enables consistent control over whether the model uses "thinking" (reasoning) mode.
  • Updated MLXBackend, LlamaCppBackend, and OpenAICompatBackend to accept and use the disable_thinking and revision parameters, and to apply prompt formatting consistently via the new format_prompt utility.
  • In OpenAICompatBackend, implemented logic to strip "thinking" trigger tokens from prompts when disable_thinking is set, and to add backend-specific hints/parameters to the payload for disabling reasoning/thinking mode. The backend automatically falls back if the server rejects these parameters.

Ollama and OpenAI-compatibility enhancements:

  • For Ollama, added a dedicated code path that uses the native /api/chat endpoint with think: false to reliably disable "thinking" mode, and improved logprobs parsing for Ollama's native response format.
  • Updated documentation and error messages to reflect the correct default base URL for OpenAI-compatible and Ollama servers (removed the /v1 suffix).
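The Ollama path above might build its request roughly like this; a minimal sketch assuming Ollama's documented top-level think flag, not the backend's actual payload code:

```python
# Illustrative payload construction for Ollama's native /api/chat endpoint.
# The "think": False flag is Ollama's documented way to disable reasoning;
# the helper name and exact payload shape here are assumptions.
def build_ollama_chat_payload(model: str, prompt: str, disable_thinking: bool) -> dict:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    if disable_thinking:
        payload["think"] = False
    return payload


payload = build_ollama_chat_payload("qwen3:4b", "Hello", disable_thinking=True)
print(payload["think"])  # False
```

Routing through /api/chat instead of the OpenAI-compatible /v1 endpoint is what makes the flag reliable, since the compatibility layer does not expose it.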

Dependency and utility updates:

  • Added transformers>=5.3.0 to the required dependencies in pyproject.toml.

These changes make backend selection and prompt formatting more robust, configurable, and consistent across local and remote inference servers.

…ll backends

- Add `hf_revision`/`revision` parameter to model config, resolve, and backend constructors
- Parse and propagate revision from model spec (e.g. repo/model@main)
- Implement `format_prompt` utility to apply chat templates using tokenizer or HuggingFace model
- Use `format_prompt` in mlx-lm, llama-cpp, and openai-compat backends for consistent prompt formatting
- Add `transformers` as a required dependency
…ackends

- Introduce `disable_thinking` flag to all backend configs and CLI, defaulting to True for consistent output comparison.
- Implement cross-backend support for disabling reasoning/thinking mode (Qwen3, DeepSeek-R1, Ollama, vLLM, OpenAI/OpenRouter).
- Strip reasoning-trigger tokens from prompts when disabled.
- Route to Ollama's native /api/chat endpoint with `think: false` when appropriate.
- Add tests for prompt formatting, payload construction, and retry logic when disabling thinking.
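The token-stripping step listed above can be sketched as follows. The trigger-token list here is an assumption for illustration; the real utility in src/infer_check/utils.py likely covers more model families:

```python
# Minimal sketch: remove reasoning-trigger tokens that chat templates append
# to force "thinking" mode. THINKING_TRIGGERS is a hypothetical list.
THINKING_TRIGGERS = ("<think>", "<|thinking|>")


def strip_thinking_tokens(prompt: str) -> str:
    for tok in THINKING_TRIGGERS:
        prompt = prompt.replace(tok, "")
    return prompt.rstrip()


print(strip_thinking_tokens("Answer briefly.\n<think>"))  # Answer briefly.
```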
Copilot AI review requested due to automatic review settings April 19, 2026 20:29
Contributor

Copilot AI left a comment


Pull request overview

This PR enhances infer-check backend configuration and prompt formatting to support consistent “thinking/reasoning” disabling across backends, adds HuggingFace revision support in model specs, improves Ollama/OpenAI-compat handling, and updates defaults/docs for base URLs.

Changes:

  • Introduces disable_thinking and hf_revision in backend configuration and threads them through model resolution and backend factories.
  • Adds shared prompt utilities (strip_thinking_tokens, format_prompt) and updates backends/tests to use them.
  • Improves OpenAI-compat behavior for Ollama (native /api/chat routing when thinking is disabled) and updates default base URLs/docs to remove the /v1 suffix.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 13 comments.

File Description
src/infer_check/utils.py Adds shared prompt formatting + thinking-token stripping utilities.
src/infer_check/resolve.py Updates default OpenAI-compat base URL; adds @revision parsing and propagation via ResolvedModel.
src/infer_check/backends/base.py Adds hf_revision + disable_thinking to BackendConfig; propagates options to backend constructors.
src/infer_check/backends/mlx_lm.py Uses shared format_prompt and stores revision/disable_thinking in the backend.
src/infer_check/backends/llama_cpp.py Formats prompts before sending to llama-server.
src/infer_check/backends/openai_compat.py Adds disable-thinking payload hints, think-token stripping, and Ollama /api/chat native path.
src/infer_check/backends/vllm_mlx.py Threads disable_thinking through vllm-mlx wrapper.
src/infer_check/cli.py Adds --disable-thinking/--enable-thinking option and propagates it into command configs.
tests/unit/test_utils.py Adds unit tests for prompt formatting and thinking-token stripping behavior.
tests/unit/test_openai_compat.py Adds tests for disable-thinking payload/routing behavior and retry-on-400 logic.
tests/unit/test_resolve.py Updates expected default base URL for Ollama/openai-compat resolution.
tests/unit/test_mlx_backend.py Updates test to call shared format_prompt.
tests/unit/test_llama_cpp_payload.py Patches format_prompt for llama-cpp tests to avoid tokenizer loading.
tests/unit/test_llama_cpp_fallback.py Same as above for fallback test.
docs/backends.md Updates documented default base URL for openai-compat (no /v1).
pyproject.toml Adds transformers dependency.


@codecov

codecov Bot commented Apr 20, 2026

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.



Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.



Comment thread src/infer_check/utils.py
Comment on lines +110 to +117
if tokenizer is None and model_id:
    # Only attempt to load from HF if it looks like a HF repo (owner/repo)
    # or an absolute/relative path. Ollama tags (name:tag) or local GGUF
    # files should be skipped as they'll fail or hang from_pretrained.
    is_hf_id = "/" in model_id
    if is_hf_id:
        with contextlib.suppress(Exception):
            tokenizer = _get_tokenizer(model_id, revision)

Copilot AI Apr 20, 2026


format_prompt()'s HF tokenizer-load heuristic only checks for '/' in model_id, but local GGUF paths (e.g. /models/foo.gguf) also contain / and will trigger repeated AutoTokenizer.from_pretrained(...) attempts (then get swallowed). Consider explicitly excluding .gguf (and/or file paths) from the HF path so llama-cpp/OpenAI-compat with GGUF model_ids don't pay this overhead every prompt.
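One possible fix along the lines the reviewer suggests — an assumption, not the project's actual code — is a stricter predicate that rules out .gguf files and filesystem paths before attempting a HuggingFace load:

```python
# Hypothetical replacement for the "/" heuristic: exclude GGUF files and
# existing local paths, and require the exact "owner/repo" shape.
from pathlib import Path


def looks_like_hf_repo(model_id: str) -> bool:
    if model_id.endswith(".gguf"):
        return False          # local GGUF file, never a HF repo id
    if Path(model_id).exists():
        return False          # local file or directory path
    return model_id.count("/") == 1  # HF repo ids are exactly "owner/repo"


print(looks_like_hf_repo("Qwen/Qwen3-4B"))     # True
print(looks_like_hf_repo("/models/foo.gguf"))  # False
print(looks_like_hf_repo("qwen3:4b"))          # False
```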

Comment thread src/infer_check/utils.py
Comment on lines +115 to +117
if is_hf_id:
with contextlib.suppress(Exception):
tokenizer = _get_tokenizer(model_id, revision)

Copilot AI Apr 20, 2026


format_prompt() suppresses all exceptions when trying to load a tokenizer (contextlib.suppress(Exception)), which can silently skip chat templating when (for example) transformers isn't installed or the local cache is missing. It would be easier to debug/operate if this narrowed the exception handling (e.g., handle ImportError with an actionable message about installing infer-check[http], and log/propagate unexpected errors).

Comment on lines +172 to +179
# Retry shedding unsupported params only on 400/422.
if exc.status_code in (400, 422) and (self._chat_logprobs_supported or self._thinking_keys_supported):
    if self._chat_logprobs_supported:
        self._chat_logprobs_supported = False
        payload.pop("logprobs", None)
        payload.pop("top_logprobs", None)
    if self._thinking_keys_supported:
        self._thinking_keys_supported = False

Copilot AI Apr 20, 2026


In _generate_chat(), any 400/422 response causes both logprobs and the thinking-disable keys to be dropped if their respective “supported” flags are still true. This means a server that rejects only the thinking-related fields will unnecessarily lose chat logprobs for all subsequent requests. Consider retrying in two steps (drop thinking keys first; if it still fails, then drop logprobs) so partial support is preserved.
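The two-step retry the reviewer proposes could be sketched as below. The flag names mirror the snippet above, but the standalone function and the list of thinking-related keys are assumptions:

```python
# Sketch: drop at most one capability per rejected request, thinking keys
# first, so a server that only rejects those keeps its chat logprobs.
def shed_params_on_reject(payload: dict, state: dict) -> bool:
    """Mutate payload/state for one retry; return True if retrying makes sense."""
    if state.get("thinking_keys_supported"):
        state["thinking_keys_supported"] = False
        for key in ("chat_template_kwargs", "reasoning", "think"):
            payload.pop(key, None)
        return True
    if state.get("chat_logprobs_supported"):
        state["chat_logprobs_supported"] = False
        payload.pop("logprobs", None)
        payload.pop("top_logprobs", None)
        return True
    return False


state = {"thinking_keys_supported": True, "chat_logprobs_supported": True}
payload = {"logprobs": True, "think": False}
shed_params_on_reject(payload, state)  # first 400: drop thinking keys only
print("logprobs" in payload)           # True — logprobs preserved for retry
```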

@NullPointerDepressiveDisorder NullPointerDepressiveDisorder merged commit bf8cd7d into main Apr 20, 2026
9 checks passed
@NullPointerDepressiveDisorder NullPointerDepressiveDisorder deleted the bug/chat-mode branch April 20, 2026 02:24