Enhance backend configurations with revision support and chat options #16
Conversation
…hat mode default to True
…ll backends
- Add `hf_revision`/`revision` parameter to model config, resolve, and backend constructors
- Parse and propagate revision from model spec (e.g. `repo/model@main`)
- Implement `format_prompt` utility to apply chat templates using tokenizer or HuggingFace model
- Use `format_prompt` in mlx-lm, llama-cpp, and openai-compat backends for consistent prompt formatting
- Add `transformers` as a required dependency
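The `repo/model@main` split described in this commit could look roughly like the following (a minimal, self-contained sketch; `split_revision` is a hypothetical name, not necessarily the helper used in `resolve.py`):

```python
def split_revision(spec: str) -> tuple[str, str | None]:
    """Split "repo/model@main" into ("repo/model", "main")."""
    model_id, sep, revision = spec.rpartition("@")
    if not sep:  # no "@" in the spec: the whole string is the model id
        return spec, None
    return model_id, revision or None
```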
…ackends
- Introduce `disable_thinking` flag to all backend configs and CLI, defaulting to True for consistent output comparison.
- Implement cross-backend support for disabling reasoning/thinking mode (Qwen3, DeepSeek-R1, Ollama, vLLM, OpenAI/OpenRouter).
- Strip reasoning-trigger tokens from prompts when disabled.
- Route to Ollama's native `/api/chat` endpoint with `think: false` when appropriate.
- Add tests for prompt formatting, payload construction, and retry logic when disabling thinking.
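For the Ollama route, the request this commit describes would look roughly like this (a sketch assuming a local Ollama server and the model name `qwen3:8b`; the real backend builds the payload from its config):

```python
import requests

payload = {
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "think": False,  # Ollama-native flag: suppress reasoning/thinking output
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/chat", json=payload, timeout=60)
print(resp.json()["message"]["content"])
```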
Pull request overview
This PR enhances infer-check backend configuration and prompt formatting to support consistent “thinking/reasoning” disabling across backends, adds HuggingFace revision support in model specs, improves Ollama/OpenAI-compat handling, and updates defaults/docs for base URLs.
Changes:
- Introduces `disable_thinking` and `hf_revision` in backend configuration and threads them through model resolution and backend factories.
- Adds shared prompt utilities (`strip_thinking_tokens`, `format_prompt`) and updates backends/tests to use them.
- Improves OpenAI-compat behavior for Ollama (native `/api/chat` routing when thinking is disabled) and updates default base URLs/docs to remove the `/v1` suffix.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| `src/infer_check/utils.py` | Adds shared prompt formatting + thinking-token stripping utilities. |
| `src/infer_check/resolve.py` | Updates the default OpenAI-compat base URL; adds `@revision` parsing and propagation via `ResolvedModel`. |
| `src/infer_check/backends/base.py` | Adds `hf_revision` + `disable_thinking` to `BackendConfig`; propagates options to backend constructors. |
| `src/infer_check/backends/mlx_lm.py` | Uses shared `format_prompt` and stores revision/`disable_thinking` in the backend. |
| `src/infer_check/backends/llama_cpp.py` | Formats prompts before sending to llama-server. |
| `src/infer_check/backends/openai_compat.py` | Adds disable-thinking payload hints, think-token stripping, and an Ollama `/api/chat` native path. |
| `src/infer_check/backends/vllm_mlx.py` | Threads `disable_thinking` through the vllm-mlx wrapper. |
| `src/infer_check/cli.py` | Adds a `--disable-thinking/--enable-thinking` option and propagates it into command configs. |
| `tests/unit/test_utils.py` | Adds unit tests for prompt formatting and thinking-token stripping behavior. |
| `tests/unit/test_openai_compat.py` | Adds tests for disable-thinking payload/routing behavior and retry-on-400 logic. |
| `tests/unit/test_resolve.py` | Updates the expected default base URL for Ollama/openai-compat resolution. |
| `tests/unit/test_mlx_backend.py` | Updates the test to call shared `format_prompt`. |
| `tests/unit/test_llama_cpp_payload.py` | Patches `format_prompt` for llama-cpp tests to avoid tokenizer loading. |
| `tests/unit/test_llama_cpp_fallback.py` | Same as above for the fallback test. |
| `docs/backends.md` | Updates the documented default base URL for openai-compat (no `/v1`). |
| `pyproject.toml` | Adds the `transformers` dependency. |
…he tokenizer loading
…d revision in diff command
Codecov Report: ❌ Patch coverage is …
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
```python
if tokenizer is None and model_id:
    # Only attempt to load from HF if it looks like a HF repo (owner/repo)
    # or an absolute/relative path. Ollama tags (name:tag) or local GGUF
    # files should be skipped as they'll fail or hang from_pretrained.
    is_hf_id = "/" in model_id
    if is_hf_id:
        with contextlib.suppress(Exception):
            tokenizer = _get_tokenizer(model_id, revision)
```
`format_prompt()`'s HF tokenizer-load heuristic only checks for `/` in `model_id`, but local GGUF paths (e.g. `/models/foo.gguf`) also contain `/`, so they trigger a fresh `AutoTokenizer.from_pretrained(...)` attempt (which is then swallowed) on every call. Consider explicitly excluding `.gguf` files (and/or filesystem paths) from the HF path so llama-cpp/OpenAI-compat backends with GGUF model_ids don't pay this overhead on every prompt, e.g. with a guard like the one sketched below.
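One way the suggested guard could look (a sketch only; `_looks_like_hf_repo` is a hypothetical helper name, not part of the PR):

```python
from pathlib import Path

def _looks_like_hf_repo(model_id: str) -> bool:
    # Local GGUF artifacts and filesystem paths should never hit the Hub.
    if model_id.lower().endswith(".gguf"):
        return False
    if model_id.startswith((".", "/", "~")) or Path(model_id).exists():
        return False
    # Hugging Face repo ids have exactly one "/" (owner/name).
    return model_id.count("/") == 1
```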
```python
if is_hf_id:
    with contextlib.suppress(Exception):
        tokenizer = _get_tokenizer(model_id, revision)
```
`format_prompt()` suppresses all exceptions when trying to load a tokenizer (`contextlib.suppress(Exception)`), which can silently skip chat templating when, for example, `transformers` isn't installed or the local cache is missing. It would be easier to debug and operate if the exception handling were narrowed: handle `ImportError` with an actionable message about installing `infer-check[http]`, and log or propagate unexpected errors.
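A sketch of the narrower handling (assuming `format_prompt`'s surrounding variables and a module-level logger; the `infer-check[http]` extra is the one suggested above):

```python
import logging

logger = logging.getLogger(__name__)

try:
    tokenizer = _get_tokenizer(model_id, revision)
except ImportError as exc:
    raise RuntimeError(
        "transformers is required for chat templating; "
        "install it with: pip install 'infer-check[http]'"
    ) from exc
except OSError as exc:
    # Repo not found / no local cache: fall back to the raw prompt,
    # but leave a trace instead of failing silently.
    logger.warning("Tokenizer load failed for %s (%s); skipping chat template", model_id, exc)
    tokenizer = None
```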
```python
# Retry shedding unsupported params only on 400/422.
if exc.status_code in (400, 422) and (self._chat_logprobs_supported or self._thinking_keys_supported):
    if self._chat_logprobs_supported:
        self._chat_logprobs_supported = False
        payload.pop("logprobs", None)
        payload.pop("top_logprobs", None)
    if self._thinking_keys_supported:
        self._thinking_keys_supported = False
```
In `_generate_chat()`, any 400/422 response causes both the logprobs and the thinking-disable keys to be dropped if their respective "supported" flags are still true. This means a server that rejects only the thinking-related fields unnecessarily loses chat logprobs for all subsequent requests. Consider retrying in two steps (drop the thinking keys first; only if the request still fails, drop logprobs) so partial support is preserved.
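The two-step retry could be structured roughly like this (a sketch restructuring the fragment above; the exact thinking-key names and the `_retry_request` helper are assumptions, not the backend's actual API):

```python
if exc.status_code in (400, 422):
    # Step 1: shed only the thinking-disable hints and retry.
    if self._thinking_keys_supported:
        self._thinking_keys_supported = False
        for key in ("chat_template_kwargs", "enable_thinking", "reasoning"):
            payload.pop(key, None)
        return self._retry_request(payload)
    # Step 2: only if that also failed, shed logprobs and retry.
    if self._chat_logprobs_supported:
        self._chat_logprobs_supported = False
        payload.pop("logprobs", None)
        payload.pop("top_logprobs", None)
        return self._retry_request(payload)
raise
```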
This pull request introduces significant improvements to backend configuration and prompt formatting, with a focus on consistent "thinking" mode control across all supported backends. The main changes add support for disabling "thinking" (reasoning) in model prompts, propagate this option through all backend adapters, and improve compatibility with Ollama and other OpenAI-compatible servers. Additionally, the `transformers` library is added as a dependency, and the documentation is updated to reflect correct server URLs.

**Backend configuration and prompt formatting improvements:**

- Added `disable_thinking` and `hf_revision` fields to `BackendConfig`, and ensured these options are propagated to all backend adapters (mlx-lm, llama-cpp, vllm-mlx, openai-compat). This enables consistent control over whether the model uses "thinking" (reasoning) mode. [1] [2] [3] [4]
- Updated `MLXBackend`, `LlamaCppBackend`, and `OpenAICompatBackend` to accept and use `disable_thinking` and `revision` parameters, and to consistently apply prompt formatting via the new `format_prompt` utility (see the sketch after this description). [1] [2] [3] [4] [5] [6] [7]
- In `OpenAICompatBackend`, implemented logic to strip "thinking" trigger tokens from prompts when `disable_thinking` is set, and to add backend-specific hints/parameters to the payload for disabling reasoning/thinking mode. The backend automatically falls back if the server rejects these parameters.

**Ollama and OpenAI-compatibility enhancements:**

- Routed requests to Ollama's native `/api/chat` endpoint with `think: false` for reliable disabling of "thinking" mode, and improved logprobs parsing for Ollama's native response format. [1] [2]
- Updated the default base URLs and documentation to drop the `/v1` suffix. [1] [2] [3]

**Dependency and utility updates:**

- Added `transformers>=5.3.0` to the required dependencies in `pyproject.toml`.

These changes make backend selection and prompt formatting more robust, configurable, and consistent across local and remote inference servers.
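The chat templating behind `format_prompt` boils down to a `transformers` call along these lines (a sketch assuming a Qwen3 checkpoint, whose chat template accepts the `enable_thinking` kwarg; the utility's actual signature is not shown here):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B", revision="main")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # template kwarg that suppresses the <think> block
)
# `prompt` is now a plain string ready to send to any completion endpoint.
```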