Enhance backend configurations with revision support and chat options #16
Conversation
…hat mode default to True
…ll backends
- Add `hf_revision`/`revision` parameter to model config, resolve, and backend constructors
- Parse and propagate revision from model spec (e.g. `repo/model@main`)
- Implement `format_prompt` utility to apply chat templates using tokenizer or HuggingFace model
- Use `format_prompt` in mlx-lm, llama-cpp, and openai-compat backends for consistent prompt formatting
- Add `transformers` as a required dependency
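The `repo/model@main` split described in this commit could look roughly like the following (a minimal, self-contained sketch; `split_revision` is a hypothetical name, not necessarily the helper used in `resolve.py`):

```python
def split_revision(spec: str) -> tuple[str, str | None]:
    """Split "repo/model@main" into ("repo/model", "main")."""
    model_id, sep, revision = spec.rpartition("@")
    if not sep:  # no "@" in the spec: the whole string is the model id
        return spec, None
    return model_id, revision or None
```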
…ackends
- Introduce `disable_thinking` flag to all backend configs and CLI, defaulting to True for consistent output comparison.
- Implement cross-backend support for disabling reasoning/thinking mode (Qwen3, DeepSeek-R1, Ollama, vLLM, OpenAI/OpenRouter).
- Strip reasoning-trigger tokens from prompts when disabled.
- Route to Ollama's native `/api/chat` endpoint with `think: false` when appropriate.
- Add tests for prompt formatting, payload construction, and retry logic when disabling thinking.
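For the Ollama route, the request this commit describes would look roughly like this (a sketch assuming a local Ollama server and the model name `qwen3:8b`; the real backend builds the payload from its config):

```python
import requests

payload = {
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "think": False,  # Ollama-native flag: suppress reasoning/thinking output
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/chat", json=payload, timeout=60)
print(resp.json()["message"]["content"])
```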
Pull request overview
This PR enhances infer-check backend configuration and prompt formatting to support consistent “thinking/reasoning” disabling across backends, adds HuggingFace revision support in model specs, improves Ollama/OpenAI-compat handling, and updates defaults/docs for base URLs.
Changes:
- Introduces `disable_thinking` and `hf_revision` in backend configuration and threads them through model resolution and backend factories.
- Adds shared prompt utilities (`strip_thinking_tokens`, `format_prompt`) and updates backends/tests to use them.
- Improves OpenAI-compat behavior for Ollama (native `/api/chat` routing when thinking is disabled) and updates default base URLs/docs to remove the `/v1` suffix.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| `src/infer_check/utils.py` | Adds shared prompt formatting + thinking-token stripping utilities. |
| `src/infer_check/resolve.py` | Updates the default OpenAI-compat base URL; adds `@revision` parsing and propagation via `ResolvedModel`. |
| `src/infer_check/backends/base.py` | Adds `hf_revision` + `disable_thinking` to `BackendConfig`; propagates options to backend constructors. |
| `src/infer_check/backends/mlx_lm.py` | Uses shared `format_prompt` and stores revision/`disable_thinking` in the backend. |
| `src/infer_check/backends/llama_cpp.py` | Formats prompts before sending to llama-server. |
| `src/infer_check/backends/openai_compat.py` | Adds disable-thinking payload hints, think-token stripping, and an Ollama `/api/chat` native path. |
| `src/infer_check/backends/vllm_mlx.py` | Threads `disable_thinking` through the vllm-mlx wrapper. |
| `src/infer_check/cli.py` | Adds a `--disable-thinking/--enable-thinking` option and propagates it into command configs. |
| `tests/unit/test_utils.py` | Adds unit tests for prompt formatting and thinking-token stripping behavior. |
| `tests/unit/test_openai_compat.py` | Adds tests for disable-thinking payload/routing behavior and retry-on-400 logic. |
| `tests/unit/test_resolve.py` | Updates the expected default base URL for Ollama/openai-compat resolution. |
| `tests/unit/test_mlx_backend.py` | Updates the test to call shared `format_prompt`. |
| `tests/unit/test_llama_cpp_payload.py` | Patches `format_prompt` for llama-cpp tests to avoid tokenizer loading. |
| `tests/unit/test_llama_cpp_fallback.py` | Same as above for the fallback test. |
| `docs/backends.md` | Updates the documented default base URL for openai-compat (no `/v1`). |
| `pyproject.toml` | Adds the `transformers` dependency. |
…he tokenizer loading
…d revision in diff command
Codecov Report: ❌ Patch coverage is …
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
```python
if tokenizer is None and model_id:
    # Only attempt to load from HF if it looks like a HF repo (owner/repo)
    # or an absolute/relative path. Ollama tags (name:tag) or local GGUF
    # files should be skipped as they'll fail or hang from_pretrained.
    is_hf_id = "/" in model_id
    if is_hf_id:
        with contextlib.suppress(Exception):
            tokenizer = _get_tokenizer(model_id, revision)
```
`format_prompt()`'s HF tokenizer-load heuristic only checks for `/` in `model_id`, but local GGUF paths (e.g. `/models/foo.gguf`) also contain `/`, so they trigger a fresh `AutoTokenizer.from_pretrained(...)` attempt (which is then swallowed) on every call. Consider explicitly excluding `.gguf` files (and/or filesystem paths) from the HF path so llama-cpp/OpenAI-compat backends with GGUF model_ids don't pay this overhead on every prompt, e.g. with a guard like the one sketched below.
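One way the suggested guard could look (a sketch only; `_looks_like_hf_repo` is a hypothetical helper name, not part of the PR):

```python
from pathlib import Path

def _looks_like_hf_repo(model_id: str) -> bool:
    # Local GGUF artifacts and filesystem paths should never hit the Hub.
    if model_id.lower().endswith(".gguf"):
        return False
    if model_id.startswith((".", "/", "~")) or Path(model_id).exists():
        return False
    # Hugging Face repo ids have exactly one "/" (owner/name).
    return model_id.count("/") == 1
```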
```python
if is_hf_id:
    with contextlib.suppress(Exception):
        tokenizer = _get_tokenizer(model_id, revision)
```
`format_prompt()` suppresses all exceptions when trying to load a tokenizer (`contextlib.suppress(Exception)`), which can silently skip chat templating when, for example, `transformers` isn't installed or the local cache is missing. It would be easier to debug and operate if the exception handling were narrowed: handle `ImportError` with an actionable message about installing `infer-check[http]`, and log or propagate unexpected errors.
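A sketch of the narrower handling (assuming `format_prompt`'s surrounding variables and a module-level logger; the `infer-check[http]` extra is the one suggested above):

```python
import logging

logger = logging.getLogger(__name__)

try:
    tokenizer = _get_tokenizer(model_id, revision)
except ImportError as exc:
    raise RuntimeError(
        "transformers is required for chat templating; "
        "install it with: pip install 'infer-check[http]'"
    ) from exc
except OSError as exc:
    # Repo not found / no local cache: fall back to the raw prompt,
    # but leave a trace instead of failing silently.
    logger.warning("Tokenizer load failed for %s (%s); skipping chat template", model_id, exc)
    tokenizer = None
```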
```python
# Retry shedding unsupported params only on 400/422.
if exc.status_code in (400, 422) and (self._chat_logprobs_supported or self._thinking_keys_supported):
    if self._chat_logprobs_supported:
        self._chat_logprobs_supported = False
        payload.pop("logprobs", None)
        payload.pop("top_logprobs", None)
    if self._thinking_keys_supported:
        self._thinking_keys_supported = False
```
In `_generate_chat()`, any 400/422 response causes both the logprobs and the thinking-disable keys to be dropped if their respective "supported" flags are still true. This means a server that rejects only the thinking-related fields unnecessarily loses chat logprobs for all subsequent requests. Consider retrying in two steps (drop the thinking keys first; only if the request still fails, drop logprobs) so partial support is preserved.
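The two-step retry could be structured roughly like this (a sketch restructuring the fragment above; the exact thinking-key names and the `_retry_request` helper are assumptions, not the backend's actual API):

```python
if exc.status_code in (400, 422):
    # Step 1: shed only the thinking-disable hints and retry.
    if self._thinking_keys_supported:
        self._thinking_keys_supported = False
        for key in ("chat_template_kwargs", "enable_thinking", "reasoning"):
            payload.pop(key, None)
        return self._retry_request(payload)
    # Step 2: only if that also failed, shed logprobs and retry.
    if self._chat_logprobs_supported:
        self._chat_logprobs_supported = False
        payload.pop("logprobs", None)
        payload.pop("top_logprobs", None)
        return self._retry_request(payload)
raise
```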
This pull request introduces significant improvements to backend configuration and prompt formatting, with a focus on consistent "thinking" mode control across all supported backends. The main changes add support for disabling "thinking" (reasoning) in model prompts, propagate this option through all backend adapters, and improve compatibility with Ollama and other OpenAI-compatible servers. Additionally, the `transformers` library is added as a dependency, and the documentation is updated to reflect correct server URLs.

**Backend configuration and prompt formatting improvements:**

- Added `disable_thinking` and `hf_revision` fields to `BackendConfig`, and ensured these options are propagated to all backend adapters (mlx-lm, llama-cpp, vllm-mlx, openai-compat). This enables consistent control over whether the model uses "thinking" (reasoning) mode. [1] [2] [3] [4]
- Updated `MLXBackend`, `LlamaCppBackend`, and `OpenAICompatBackend` to accept and use `disable_thinking` and `revision` parameters, and to consistently apply prompt formatting via the new `format_prompt` utility (see the sketch after this description). [1] [2] [3] [4] [5] [6] [7]
- In `OpenAICompatBackend`, implemented logic to strip "thinking" trigger tokens from prompts when `disable_thinking` is set, and to add backend-specific hints/parameters to the payload for disabling reasoning/thinking mode. The backend automatically falls back if the server rejects these parameters.

**Ollama and OpenAI-compatibility enhancements:**

- Routed requests to Ollama's native `/api/chat` endpoint with `think: false` for reliable disabling of "thinking" mode, and improved logprobs parsing for Ollama's native response format. [1] [2]
- Updated the default base URLs and documentation to drop the `/v1` suffix. [1] [2] [3]

**Dependency and utility updates:**

- Added `transformers>=5.3.0` to the required dependencies in `pyproject.toml`.

These changes make backend selection and prompt formatting more robust, configurable, and consistent across local and remote inference servers.
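The chat templating behind `format_prompt` boils down to a `transformers` call along these lines (a sketch assuming a Qwen3 checkpoint, whose chat template accepts the `enable_thinking` kwarg; the utility's actual signature is not shown here):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B", revision="main")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # template kwarg that suppresses the <think> block
)
# `prompt` is now a plain string ready to send to any completion endpoint.
```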