Describe the bug
When using either the completions or chat completions endpoint with a configured OpenAI server that is not a vLLM instance (e.g. Ollama), a request with input detections returns 404:
Request
curl --location 'http://localhost:8033/api/v2/text/completions-detection' \
--header 'Content-Type: application/json' \
--data '{
"model": "qwen3:0.6b",
"prompt": "I hate aliens",
"detectors": {
"input": {
"hap": {}
}
}
}'
Response
{"code":404,"details":"tokenize request failed for `qwen3:0.6b`: unknown error occurred"}
This happens because the /tokenize endpoint is invoked whenever there are input detections, in order to gather usage data (e.g. here). However, this endpoint is not part of the OpenAI API; it is specific to vLLM.
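A minimal Python sketch of this failure mode, for illustration only. All names (`TokenizeNotFound`, `Usage`, `handle_request` and the two tokenizer stubs) are hypothetical and are not the orchestrator's actual code. The stubs only mimic a server with and without a /tokenize route, and the word count in the vLLM stub is a toy stand-in, not real tokenization.

```python
from dataclasses import dataclass
from typing import Callable, List


class TokenizeNotFound(Exception):
    """Hypothetical error for an HTTP 404 returned by /tokenize."""


@dataclass
class Usage:
    prompt_tokens: int


def vllm_tokenize(model: str, prompt: str) -> List[int]:
    # Toy stand-in for vLLM: one "token" per whitespace-separated word.
    return list(range(len(prompt.split())))


def ollama_tokenize(model: str, prompt: str) -> List[int]:
    # Toy stand-in for a server (e.g. Ollama) with no /tokenize route.
    raise TokenizeNotFound(f"tokenize request failed for `{model}`")


def handle_request(
    tokenize: Callable[[str, str], List[int]], model: str, prompt: str
) -> dict:
    # Input detections are present, so usage is gathered via /tokenize.
    try:
        usage = Usage(prompt_tokens=len(tokenize(model, prompt)))
    except TokenizeNotFound as err:
        # The tokenize failure becomes the response for the whole request.
        return {"code": 404, "details": str(err)}
    return {"code": 200, "usage": {"prompt_tokens": usage.prompt_tokens}}
```

With the vLLM stub, `handle_request(vllm_tokenize, "qwen3:0.6b", "I hate aliens")` returns usage data; with the Ollama stub, the same call returns a 404 even though the detection itself never ran into a problem.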
Discussion
I'm wondering what would be the best approach in this scenario. I've thought of two ideas, but both have limitations:
- When the /tokenize endpoint returns 404, return usage as empty. The limitation with this idea is that the missing usage might not be obvious unless a warning is provided (and I think there was an ongoing discussion about deprecating the warnings field in the orchestrator response).
- Create an additional config option to accept 404 responses from /tokenize. This would behave the same as the previous option, except that it would only apply when this config parameter is set to allow such responses. The drawback is that the orchestrator would need an extra config option.
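A rough Python sketch of how the two ideas differ, under stated assumptions: `allow_tokenize_not_found` is a hypothetical config flag (not an existing orchestrator setting), the `USAGE_UNAVAILABLE` warning type is invented for illustration, and the tokenizer stubs are toy stand-ins for a server with and without /tokenize. Both options share one fallback path; they differ only in whether a 404 is always tolerated or only when the operator opts in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


class TokenizeNotFound(Exception):
    """Hypothetical error raised when /tokenize answers with HTTP 404."""


@dataclass
class OrchestratorConfig:
    # Hypothetical option-2 flag; not an existing orchestrator setting.
    allow_tokenize_not_found: bool = False


Tokenizer = Callable[[str, str], List[int]]


def vllm_tokenize(model: str, prompt: str) -> List[int]:
    # Toy stand-in for vLLM: one "token" per whitespace-separated word.
    return list(range(len(prompt.split())))


def ollama_tokenize(model: str, prompt: str) -> List[int]:
    # Toy stand-in for a non-vLLM server without a /tokenize route.
    raise TokenizeNotFound(f"tokenize request failed for `{model}`")


def compute_usage(
    tokenize: Tokenizer, model: str, prompt: str, allow_not_found: bool
) -> Tuple[Optional[dict], List[dict]]:
    """Return (usage, warnings); usage is None when it cannot be computed."""
    try:
        tokens = tokenize(model, prompt)
    except TokenizeNotFound:
        if not allow_not_found:
            raise  # today's behaviour: the whole request fails with 404
        # Fall back to empty usage, with a warning so the gap is not silent.
        warning = {
            "type": "USAGE_UNAVAILABLE",  # hypothetical warning type
            "message": "/tokenize is not supported by the configured server",
        }
        return None, [warning]
    return {"prompt_tokens": len(tokens)}, []


def usage_option_1(tokenize: Tokenizer, model: str, prompt: str):
    # Option 1: always tolerate a 404 from /tokenize.
    return compute_usage(tokenize, model, prompt, allow_not_found=True)


def usage_option_2(
    tokenize: Tokenizer, model: str, prompt: str, config: OrchestratorConfig
):
    # Option 2: tolerate a 404 only when the operator opted in.
    return compute_usage(
        tokenize, model, prompt, allow_not_found=config.allow_tokenize_not_found
    )
```

Returning a warning alongside the empty usage is one way to address the visibility concern from the first idea; if the warnings field is deprecated, a log line on the orchestrator side would be the remaining signal.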