Implementation status (this repo). The XS/S/M recommendations below are implemented on
net.ladenthin.llama.server.OpenAiCompatServer:POST /infill(FIM autocomplete),POST /v1/rerank(RAG),stream_options.include_usagepassthrough + acached_tokenssafety net,response_format(structured outputs), CORS/OPTIONSpreflight, bare-path (/v1-less) aliases, acache_prompt=truedefault, and--mmproj/--embedding/--rerankingCLI flags. Agentic tool-calling is the primary target and is verified wire-correct by a C++ guard pinningtool_calls.function.argumentsas a JSON string (llama.cpp #20198). Open items that need a product decision (Ollama native-API emulation, AnthropicPOST /v1/messages+ OpenAIPOST /v1/responsesshims, Continue's native/completion, a per-model FIM template registry,/propscapability reporting) are tracked in../TODO.md. The verbatim deep-research report follows.
- The single highest-leverage change is to add a llama.cpp-native
POST /infillendpoint (fieldsinput_prefix,input_suffix,input_extra,prompt,n_predict), because every high-quality local ghost-text client (llama.vscode, llama.vim, Tabby, Twinny, and Continue'sllama.cppprovider) drives FIM through/infillor a raw/v1/completionssuffixtemplate — NOT through/v1/chat/completions. A chat-only server unlocks chat/agent but currently unlocks zero first-class autocomplete. - For chat + agent (Copilot BYOK, Cline, Roo Code, Continue, ProxyAI, Zed, Aider), your existing
/v1/chat/completionsis already the right surface — but the make-or-break details are: streameddelta.tool_callswith correctindex/id/function.name/function.argumentsfragments,finish_reason:"tool_calls"on the terminating chunk, astream_options.include_usagefinal usage chunk with an emptychoicesarray, and never emittingtool_calls.function.argumentsas a JSON object (it must be a JSON-encoded string). Copilot's VS Code custom-endpoint feature also readsusage.prompt_tokens_details.cached_tokensand crashes if it is absent. - As of VS Code 1.122 (released May 28, 2026), the generic OpenAI-compatible "Custom Endpoint" provider (apiTypes
chat-completions/responses/messages) is now in VS Code Stable — so a plain OpenAI-compatible server is now a first-class Copilot chat/agent backend without Insiders or Ollama emulation. Copilot inline completions, however, remain closed to all local endpoints ("Inline suggestions and next edit suggestions still require a GitHub sign-in. BYOK powers chat, tools, and MCP servers only").
-
Two protocol families, not one. Autocomplete/FIM and chat/agent are almost entirely disjoint wire contracts. Chat/agent is OpenAI
/v1/chat/completions(or Anthropic/v1/messages, or OpenAI/v1/responses). Autocomplete is either llama.cpp/infill, Ollama/api/generate, or raw/v1/completionswith asuffixfield and a model-specific FIM template. Your server implements the chat side well and the FIM side not at all. -
Copilot inline completion is closed to local models. Per the VS Code 1.122 release notes, "Inline suggestions and next edit suggestions (NES) still require a GitHub sign-in. BYOK powers chat, tools, and MCP servers only." VS Code's language-models docs add: "Currently, you cannot connect to a local model for inline suggestions. VS Code provides an extension API
InlineCompletionItemProviderthat enables extensions to contribute a custom completion provider." So no llama.cpp server can power Copilot ghost text — you can only target Copilot's chat + agent surfaces (or ship your own inline-completion VS Code extension). -
Copilot's OpenAI-compatible path went Stable in May 2026. VS Code 1.122 (May 28, 2026) notes: "The Custom Endpoint provider lets you connect models that implement Chat Completions, Responses, or Messages APIs… The Custom Endpoint provider is now available in VS Code Stable." This supersedes the earlier Insiders-only status and reduces the urgency of emulating Ollama's native API. The built-in Ollama provider (native
/api/version,/api/tags,/api/show) and the deprecatedgithub.copilot.chat.customOAIModelssettings object remain as alternative paths. BYOK "now works without GitHub sign-in… in air-gapped or restricted environments" (GitHub Changelog, Apr 22, 2026), though model selection in the UI generally still prompts a GitHub login. -
Cline and Roo Code diverge on tool-calling. Roo Code forces native OpenAI tool calling: per the Roo Code blog ("Sorry we didn't listen sooner: Native Tool Calling"), "In 3.36.0 we introduced native tool calling… In 3.37.0 we made native tool calling the default and removed XML tool calling entirely." If your endpoint doesn't fully implement
tools/tool_calls, Roo (≥3.37) cannot be used unless the user rolls back to 3.36.16 and selects XML in advanced settings. Cline historically inlines XML-style tool instructions into the prompt and parses tool calls out of plain text, so it is tolerant of weak native tool support. This is a critical compatibility fork. -
Real-world SSE bugs cluster around three things: the trailing usage chunk (
stream_options.include_usage), thefinish_reasonafter streamed tool calls (must be"tool_calls", not"stop"), and Copilot's hard dependency onusage.prompt_tokens_details.cached_tokens. -
KV-cache prefix reuse is a latency feature clients actively rely on. llama.vscode warms the server with a fire-and-forget
/infilln_predict:0request DeepWiki and setscache_prompt:true; GitHub--cache-reuse 256is a standard launch flag. For acceptable repeated-prefix latency you must supportcache_prompt/prompt-prefix reuse.
| Client | IDEs | License | Local-endpoint mechanism |
|---|---|---|---|
| GitHub Copilot (VS Code) | VS Code | Proprietary (Copilot sub; BYOK works on Free) | Chat/agent only. Generic Custom Endpoint (chat-completions/responses/messages) — Stable since 1.122 (May 28 2026); built-in Ollama provider (native /api/*); legacy github.copilot.chat.customOAIModels (OpenAI base URL). Inline completion NOT available locally. |
| GitHub Copilot (Visual Studio / JetBrains) | VS, JetBrains | Proprietary | Model picker; local/BYOK parity with VS Code is unverified from a primary source (treat as lagging). |
| GitHub Copilot CLI | Terminal | Proprietary | COPILOT_PROVIDER_BASE_URL Ofox + COPILOT_MODEL (+COPILOT_PROVIDER_TYPE=azure/anthropic); any OpenAI-compatible endpoint; requires tool calling + streaming; "for best results, use a model with a context window of at least 128k tokens." |
| Continue.dev | VS Code, JetBrains | Apache-2.0 | provider: openai + apiBase; Continue native provider: llama.cpp; provider: ollama; roles chat/edit/apply/autocomplete/embed/rerank. |
| Cline | VS Code | Apache-2.0 | "OpenAI Compatible" base URL + key + model ID; Cline tolerant XML-ish tool parsing. |
| Roo Code | VS Code | Apache-2.0 | "OpenAI Compatible" base URL; native tool calling only (≥3.37). |
| Kilo Code | VS Code | Apache-2.0 | OpenAI-compatible; XML tool-call option still present in advanced settings (later versions). |
| Twinny | VS Code, VSCodium | MIT | OpenAI-compatible chat; /infill or FIM template for completion; Open VSX Registry llama.cpp/Ollama/LM Studio/Oobabooga presets. |
| llama.vscode / llama.vim | VS Code, Vim | MIT | llama.cpp /infill for FIM (required); /v1/chat/completions for chat/agent; /v1/embeddings. |
| ProxyAI (formerly CodeGPT) | JetBrains | Apache-2.0 | "Custom OpenAI" provider; FIM template for code completion; dedicated LLaMA C/C++ offline provider. |
| Cursor | Cursor (own) | Proprietary | "Override OpenAI Base URL" (+/v1); chat/agent; not local-friendly without a public/tunnel URL. |
| Zed | Zed (own) | GPL/Apache | language_models.openai_compatible with api_url, available_models[].capabilities.{tools,images}. |
| Aider | Terminal | Apache-2.0 | OPENAI_API_BASE + OPENAI_API_KEY, --model openai/<name>. Aider |
| Void | Void (own) | Apache-2.0 | OpenAI-compatible base URL (detailed behavior unverified). |
| Tabby | VS Code, JetBrains | Apache-2.0 (core) | config.toml: kind="llama.cpp/completion" (FIM via prompt_template), kind="openai/chat", kind="llama.cpp/before_b4356_embedding". Tabby |
| Tabnine, Qodo/Codium, Windsurf, Augment, Sourcegraph Cody, Pieces, Refact, Goose, OpenHands | various | mixed | Most accept an OpenAI-compatible base URL for chat; FIM/autocomplete typically proprietary or model-specific (verify per-tool). Goose & OpenHands are agent frameworks consuming /v1/chat/completions + tools. |
GitHub Copilot (VS Code). Three configuration paths today:
- Custom Endpoint provider (Stable since 1.122): added via Chat: Manage Language Models → Add Models → Custom Endpoint. Supports per-model
apiType∈chat-completions|responses|messages. The Insiders-erachatLanguageModels.jsonfile usedvendor: "customendpoint"with the sameapiTypeselector. - Legacy
github.copilot.chat.customOAIModels(still works in stable), object keyed by model id:
"github.copilot.chat.customOAIModels": {
"my-model": {
"name": "My Model",
"url": "http://127.0.0.1:8080/v1/chat/completions",
"toolCalling": true, "vision": false, "thinking": false,
"maxInputTokens": 128000, "maxOutputTokens": 16000,
"requiresAPIKey": false
}
}- Built-in Ollama provider: requires native endpoints
GET /api/version,GET /api/tags(model list),POST /api/show(capabilities incl. context length,tools/vision). LM Studio issue #526 documents that emulating these is what unlocks Copilot's "Ollama" provider for non-Ollama servers. GitHub
Copilot reads capability flags toolCalling, vision, thinking, maxInputTokens/maxOutputTokens. It sends standard OpenAI chat bodies with messages, tools, tool_choice, stream:true. A documented crash — microsoft/vscode issue #273482 ("OpenAI Compatible models return TypeError: Cannot read properties of undefined (reading 'cached_tokens')"), shows TypeError: Cannot read properties of undefined (reading 'cached_tokens') at SX.push (…github.copilot-chat-0.33.2025102701…) reproduced with LM Studio models in agent and ask mode — occurs when the streamed usage lacks prompt_tokens_details.cached_tokens. GitHub
Continue.dev (config.yaml, schema v1): provider: openai + apiBase: http://127.0.0.1:8080/v1 + model + roles: [chat, edit, apply]. For OpenAI-compatible non-chat completion: useLegacyCompletionsEndpoint: true forces /v1/completions. Continue Continue's native llama.cpp provider posts to /completion (singular), not /completions (issue #4991). requestOptions.headers carries auth; capabilities: [tool_use, image_input] can be declared.
Cline / Roo Code: Settings → "OpenAI Compatible" → Base URL (must include /v1), API key, model ID. Roo internally uses Anthropic message format then transforms to OpenAI ChatCompletionTool; it accumulates streamed fragments by index; finalizes on finish_reason:"tool_calls". parallelToolCalls:true is the default.
llama.cpp /infill contract (the target to implement):
POST /infill, fields:input_prefix(string, code before cursor),input_suffix(string, code after cursor), GitLabinput_extra(array of context chunks, prepended toward prompt start),prompt(optional raw text appended after the FIM middle marker), plus all/completionoptions; common paramsn_predict,temperature,top_p,top_k,stop,samplers(e.g.["top_k","top_p","infill"]),cache_prompt:true.- Response: JSON with
content(the completion — the only field clients require), plusstop,tokens_predicted,timings, etc. Streaming supported. - The model's own FIM tokens are applied server-side from GGUF metadata, so clients send raw prefix/suffix.
--spm-infilltoggles SPM vs PSM ordering. Debian Manpages
FIM control tokens by model family (verbatim — character precision matters):
| Model | Tokens (verbatim) | Char notes | Order |
|---|---|---|---|
| Qwen2.5-Coder | <|fim_prefix|> <|fim_suffix|> <|fim_middle|> <|fim_pad|> <|repo_name|> <|file_sep|> |
ASCII pipes (ids 151659–151664) | PSM: prefix·suffix·middle |
| Code Llama | ▁<PRE> ▁<SUF> ▁<MID> ▁<EOT> |
▁ = U+2581 (not ASCII underscore); ids 32007–32010 |
PSM: <PRE>pre<SUF>suf<MID> (paper recommends PSM over SPM) |
| DeepSeek-Coder (v1, 6.7b) | <|fim▁begin|> <|fim▁hole|> <|fim▁end|> |
|=U+FF5C full-width pipe; ▁=U+2581 |
PSM: begin·pre·hole·suf·end |
| DeepSeek-Coder-V2 | <|fim_begin|> <|fim_hole|> <|fim_end|> |
ASCII pipe + ASCII underscore — NOT byte-compatible with v1 | PSM |
| StarCoder2 | <fim_prefix> <fim_suffix> <fim_middle> <fim_pad> <file_sep> <repo_name> |
ASCII <> + underscore |
PSM: prefix<fim_suffix>suffix<fim_middle> |
| Codestral | [PREFIX] [SUFFIX] ([MIDDLE]) |
ASCII brackets; build via mistral_common.encode_fim, not by hand |
SPM internal: [SUFFIX]suf[PREFIX]pre; API uses prompt+suffix |
Character-precision warnings: Code Llama and DeepSeek-Coder-v1 use the SentencePiece ▁ (U+2581) glyph, not an ASCII underscore; DeepSeek-Coder-v1 uses the full-width pipe | (U+FF5C) while DeepSeek-Coder-V2 uses ASCII | + ASCII _ (the two are not interchangeable — match the exact checkpoint); Codestral uses square-bracket [PREFIX]/[SUFFIX] (the widely-circulated <PREFIX>/<SUFFIX> angle-bracket claim is incorrect) and its FIM API is POST /v1/fim/completions with prompt+suffix.
Per-client FIM behavior:
- llama.vscode / llama.vim:
POST /infill, readscontent; defaultscache_prompt:true,samplers:["top_k","top_p","infill"],top_k:40,top_p:0.99,stream:false; DeepWiki warms cache with a fire-and-forgetn_predict:0/infill. Requires llama.cpp (only server with/infill). Recommended launch:llama-server -hf ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF --port 8012 -ub 1024 -b 1024 --ctx-size 0 --cache-reuse 256. - Twinny: OpenAI-compatible; per-model FIM template; CodeLlama uses
<PRE>{prefix}<SUF>{suffix}<MID>, DeepSeek uses its FIM template; base (not instruct) models for FIM. "Twinny supports the OpenAI API specification so in theory any API should work." - Tabby:
kind="llama.cpp/completion",prompt_template="<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"(Qwen2.5) Tabby or"<PRE> {prefix} <SUF>{suffix} <MID>"(CodeLlama); endpoint must NOT include the/v1suffix. - Continue.dev autocomplete:
roles:[autocomplete];provider: llama.cppdrives FIM; orprovider: openaiwith atemplateMustache string ({{{prefix}}},{{{suffix}}},{{{filename}}},{{{reponame}}},{{{language}}}).autocompleteOptions:debounceDelay:250,maxPromptTokens:1024, ContinuemodelTimeout,maxSuffixPercentage:0.2,prefixPercentage:0.3,onlyMyCode:true. - ProxyAI: Custom OpenAI → Code Completions → "FIM Template (OpenAI)" + URL; Medium uses
/v1/completionsor/v1/chat/completions. - Cline/Roo: no ghost-text autocomplete; chat/agent only.
OpenAI shape required: tools:[{type:"function",function:{name,description,parameters}}], GitHub tool_choice ∈ auto|none|required|{type:"function",function:{name}}. Streaming delta.tool_calls[] carry index, id, function.name, and incremental function.arguments string fragments; finish_reason:"tool_calls" terminates. OpenAI API Reference function.arguments MUST be a JSON-encoded string, not an object. ggml-org/llama.cpp issue #20198 ("llama-server tool_calls returns arguments as JSON object instead of string, breaking OpenAI compatibility") documents that after the Autoparser refactoring (PR #18675), llama-server returned arguments as a parsed object (root cause in common/chat.cpp ~line 132: {"arguments", json::parse(tool_call.arguments)}), which crashes the official OpenAI Python SDK (Pydantic) with a TypeError. Your server must serialize arguments as a string.
- Roo Code: native only (≥3.37); transforms to OpenAI
ChatCompletionTool;parallelToolCalls:truedefault; finalizes onfinish_reason:"tool_calls"or stream end. Removal of XML tool calling broke some local stacks (issue #10319, SGLang gpt-oss 500 errors); rollback to 3.36.16 restores the XML selector. - Cline / Kilo Code: historically XML-in-prompt tool calling parsed from text; tolerant of weak native support. The
native_tool_call_adapterproxy exists specifically to translate Cline/Roo XML into OpenAItool_calls. - Copilot agent: native OpenAI tools via chat-completions; needs
toolCalling:trueon the model entry (a model that appears in chat but not agent mode usually hastoolCallingmissing/false). - llama.cpp tool support requires
--jinja(and often--chat-template-filefor a tool-capable template; worst case--chat-template chatml).chat_template_kwargs(e.g.{"enable_thinking":false}),parallel_tool_calls, andreasoning_format(deepseek →message.reasoning_content) Debian Manpages are supported. Fossies No client in scope strictly requires/v1/responses; Copilot's custom-endpoint can useresponsesor Anthropicmessagesbutchat-completionssuffices. Structured outputs (response_format:{type:"json_schema"}) are supported by llama.cpp via grammar but are not universally required by these clients.
GET /v1/modelsclients readid,object,owned_by. Continue can use the specialAUTODETECTmodel name. Roo/Cline mostly take an explicit model ID.- Copilot's Ollama path reads context length and
tools/visionfromPOST /api/show(microsoft/vscode issue #295659 shows Copilot's Manage Models UI expecting capability + context fields there, e.g.262144context,Tools/Vision). The OpenAI custom path takesmaxInputTokens/maxOutputTokens/toolCalling/visionfrom settings, not from/v1/models. - Zed reads
max_tokens,max_output_tokens,capabilities.{tools,images}Ofox from its own settings, not the server. - A single advertised model is fine for most clients; multi-model is optional. Non-standard capability fields on
/v1/modelsare largely ignored — capabilities are configured client-side.
- Trailing usage chunk: when
stream_options:{include_usage:true}, emit a final chunk withchoices:[]and a populatedusage, LiteLLM thendata: [DONE]. All non-final chunks should carryusage:null(per OpenAI's documented streaming shape and LiteLLM docs). - Copilot
cached_tokens: includeusage.prompt_tokens_details.cached_tokensor Copilot's custom-OAI path throwsCannot read properties of undefined (reading 'cached_tokens')(vscode #273482). - finish_reason after tool calls: must be
"tool_calls"on the terminating chunk in streaming, else agent loops terminate early GitHub (the open-webui #21768 pattern: "finish_reason incorrectly returned as 'stop' after streaming tool_calls"). - First delta with role: emit an initial
delta:{role:"assistant",content:""}chunk (matches OpenAI's documented first streamed event). data: [DONE]terminator is expected by OpenAI-style consumers; always send it last. LiteLLM #25389 shows consumers that stop atfinish_reasonlose the trailing usage chunk — keep the stream open until[DONE].- CORS / preflight: browser/webview clients send
OPTIONSpreflights and anAuthorizationheader; respond toOPTIONSwithAccess-Control-Allow-Origin,Access-Control-Allow-Methods: GET,POST,OPTIONS,Access-Control-Allow-Headers: Content-Type, Authorization. Ollama's default of restricting origins/headers is a documented friction source ("Request header field 'authorization' is not allowed by Access-Control-Allow-Headers in preflight response"). - Path /
/v1differences: some clients append/v1(Continueopenai, Cline, Zed), some must NOT (Tabbyllama.cpp/completion, llama.vscodeendpoint_chatexcludesv1). Continue'sllama.cppprovider uses/completionsingular. Support both trailing-slash and non-slash forms. - Keep-alive / timeouts: long prefill needs SSE heartbeats (you already emit these) and generous read timeouts (llama.cpp server default is 600s; Continue defaults
requestOptions.timeoutto tens of seconds — local guides raise it to 60000 ms for CPU). - gzip: accept but don't require; some clients send
Accept-Encoding: gzip. - arguments-as-string (Section D) is the single most damaging non-spec deviation.
- Embeddings:
POST /v1/embeddings(inputstring or array,model,encoding_format). Used by Continue (roles:[embed]), Twinny (workspace embeddings, defaultall-minilm:latest), llama.vscode (semantic re-rank). Response must be OpenAIdata:[{embedding,...}]shaped; llama.cpp's native/embeddingis non-OAI GitHub , so clients want/v1/embeddings. - Reranking: llama.cpp exposes
POST /v1/rerankGitHub (also/rerank,/reranking) requiring--reranking/--pooling rank; request{model,query,documents,top_n}, response{results:[{index,relevance_score}]}. Continue expects adataarray and errors on llama.cpp'sresultsshape (continue #6478: "Expected 'data' array but got: ['id','model','usage','results']") — a real interop gap to document or shim. - Prompt / KV cache:
cache_prompt:trueand--cache-reuse Nreuse common prefixes — essential for repeated-prefix latency (chat turns, FIM).--cache-reusehas regressed before (llama.cpp #15082) — pin a known-good build. - Vision: chat
contentparts withimage_url/base64; llama.cpp multimodal needs--mmproj; GitHub clients gate onvision:true/capabilities.images. Used by Cline/Roo screenshots and ProxyAI image chat. Per the llama.cpp server README, "A client must not specify [media] unless the server has the multimodal capability. Clients should check /models or /v1/models for the multimodal capability before a multimodal request."
Must-have for broad compatibility (do these first):
- Implement
POST /infillwithinput_prefix/input_suffix/input_extra/prompt/n_predict/cache_prompt, returningcontent. Unlocks llama.vscode, llama.vim, Twinny (llama.cpp mode), Tabby, Continuellama.cppautocomplete. This is your biggest gap. - Serialize
tool_calls.function.argumentsas a JSON string (never an object). Unlocks Roo Code, Copilot agent, any OpenAI-SDK consumer. Acceptance benchmark: the OpenAI Python SDK must parse your tool response withoutTypeError. - Streaming tool-call correctness:
delta.tool_callswithindex/id/function.name/incrementalarguments, terminatingfinish_reason:"tool_calls". Unlocks all agent modes. stream_options.include_usagetrailing chunk with emptychoices+ populatedusage, always ending withdata: [DONE], and includeusage.prompt_tokens_details.cached_tokens. Unlocks Copilot custom-endpoint without crashes.- CORS/OPTIONS handling allowing
Authorization+Content-Type. Unlocks webview/browser clients. - Tolerant routing: accept both
/v1/...and bare paths, with and without trailing slash; accept/completionand/completions.
High-value:
7. Emulate the Ollama native API (GET /api/version, GET /api/tags, POST /api/show with capabilities incl. tools/vision + context length, and POST /api/chat//api/generate). Downgraded from earlier priority: because the OpenAI Custom Endpoint provider reached VS Code Stable in 1.122 (May 28, 2026), a clean OpenAI-compatible surface now covers Copilot chat/agent. Ollama emulation is still worthwhile to support older VS Code versions and tools hard-coded to Ollama's native endpoints, but it is no longer on the critical path for current Copilot.
8. POST /v1/rerank ({query,documents,top_n} → {results:[{index,relevance_score}]}) and consider also returning a data array alias for Continue. Unlocks RAG / "chat with codebase."
9. cache_prompt + prefix reuse and SSE heartbeats during prefill (you have heartbeats; add prefix reuse).
10. Advertise capabilities in /v1/models and (if Ollama-emulating) /api/show so agent modes light up.
Nice-to-have:
11. Vision via image_url content parts (needs mmproj).
12. Anthropic /v1/messages and OpenAI /v1/responses shims for Copilot's other apiTypes and for Claude-shaped clients (Claude Code, etc.).
13. Per-model FIM template registry (Qwen / CodeLlama / DeepSeek v1 & V2 / StarCoder2 / Codestral) if you also expose /v1/completions-with-suffix for clients that don't use /infill.
14. /props / /v1/models context-length reporting so clients auto-size prompts.
Staged rollout: Ship (1)–(6) → validate against Continue (autocomplete + chat + agent), Twinny (FIM), Roo Code (native tools), Copilot Custom Endpoint (chat/agent). Then (7)–(10) → validate Copilot Ollama provider + RAG. Then (11)–(14).
GitHub Copilot (VS Code ≥1.122) — Custom Endpoint (preferred): Chat: Manage Language Models → Add Models → Custom Endpoint → display name, API key (any string if none), Base URL http://127.0.0.1:8080/v1, model ID, apiType chat-completions. For older stable builds, use the legacy object in settings.json:
"github.copilot.chat.customOAIModels": {
"local-qwen": {
"name": "Local Qwen (llama.cpp)",
"url": "http://127.0.0.1:8080/v1/chat/completions",
"toolCalling": true,
"vision": false,
"thinking": false,
"maxInputTokens": 32768,
"maxOutputTokens": 8192,
"requiresAPIKey": false
}
}(For Copilot's built-in Ollama provider, run an Ollama-emulation layer on :11434 and use Chat: Manage Language Models → Add → Ollama. Inline completions cannot be served locally in any case.)
Continue.dev — ~/.continue/config.yaml:
name: Local llama.cpp
version: 1.0.0
schema: v1
models:
- name: local-fim
provider: llama.cpp
model: your-model.gguf
apiBase: http://127.0.0.1:8080
roles: [autocomplete]
autocompleteOptions:
debounceDelay: 250
maxPromptTokens: 1024
- name: local-chat
provider: openai
model: your-model
apiBase: http://127.0.0.1:8080/v1
apiKey: sk-local
roles: [chat, edit, apply]Cline / Roo Code (VS Code settings UI):
- API Provider: OpenAI Compatible
- Base URL:
http://127.0.0.1:8080/v1 - API Key:
sk-local(any string if no auth) - Model ID:
your-model - (Roo Code ≥3.37: model must support native tool calling, or roll back to 3.36.16 for the XML option.)
Twinny (Providers → Code/FIM provider):
- Provider:
llamacpp - Hostname/port:
127.0.0.1/8080 - FIM endpoint path:
/infill - FIM Template: match the model (e.g. Qwen2.5-Coder / DeepSeek / CodeLlama)
Tabby — ~/.tabby/config.toml:
[model.completion.http]
kind = "llama.cpp/completion"
api_endpoint = "http://127.0.0.1:8080" # no /v1
prompt_template = "<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
[model.chat.http]
kind = "openai/chat"
api_endpoint = "http://127.0.0.1:8080/v1"- Copilot Custom Endpoint (OpenAI) reached Stable in VS Code 1.122 (May 28, 2026). This is recent; behavior on older VS Code (and the exact apiType handling) may differ. The legacy
github.copilot.chat.customOAIModelsobject is slated to change to an array form (microsoft/vscode issue #277102) — track this. - Copilot inline completion remains closed to local models — a stable, documented limitation as of 2026 ("Inline suggestions and next edit suggestions still require a GitHub sign-in"). The only local autocomplete in VS Code is via third-party
InlineCompletionItemProviderextensions (Continue, Twinny, llama.vscode). - Roo Code removed the XML tool-calling selector in 3.37 and forces native; this broke some local stacks (issue #10319). A fallback may return in future versions — verify per release.
- llama.cpp build-dependent behavior: the
tool_callsarguments-as-object regression (#20198, from PR #18675) and--cache-reuseregressions (#15082) mean you should pin a known-good commit and add regression tests for both. - llama.cpp rerank path aliases (
/rerankvs/v1/rerankvs/reranking) have shifted across releases; Continue expects adataarray, notresultsGitHub (#6478). Reranker score quality also varies with GGUF conversion/quantization (#16407). - Visual Studio / JetBrains Copilot local-model parity with VS Code is unverified from a primary source — treat as "lagging/uncertain" and test directly before claiming support.
- Copilot CLI BYOK (GitHub Changelog, Apr 7 2026) requires tool calling + streaming and recommends ≥128k context;
COPILOT_OFFLINE=trueenables air-gapped use with a local provider.
Primary sources cited inline include: VS Code docs (code.visualstudio.com/docs/agent-customization/language-models) and the v1.122 release notes; GitHub Changelog (Apr 7 & Apr 22, 2026) and GitHub Docs BYOK pages; ggml-org/llama.cpp tools/server/README.md, docs/function-calling.md, and issues #20198, #15082, #16407, #16498, #21415; ggml-org/llama.vscode repo/wiki and DeepWiki; Continue.dev docs (autocomplete, yaml-reference, openai provider) and issues #4991, #2330, #6478; Roo Code docs and issues #4047, #10319; Cline docs; Tabby docs (llama.cpp/llamafile/model config); Twinny repo/docs; ProxyAI repo/docs; microsoft/vscode issues #273482, #277102, #295659; Hugging Face model cards/papers for Qwen2.5-Coder, Code Llama, DeepSeek-Coder, StarCoder2, and Codestral. Claims about Visual Studio/JetBrains Copilot parity and the Void editor are flagged as unverified.