
Add support for Chat response parsing #1639

Open
xenova wants to merge 7 commits into main from chat-response-parsing

Conversation

@xenova
Collaborator

xenova commented Apr 10, 2026

more info: https://huggingface.co/docs/transformers/en/chat_response_parsing

transformers reference PR: huggingface/transformers#44674

Example usage:

import {
  AutoProcessor,
  Gemma4ForConditionalGeneration,
  TextStreamer,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/gemma-4-E2B-it-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Gemma4ForConditionalGeneration.from_pretrained(model_id, {
  dtype: {
    audio_encoder: "q4f16",
    vision_encoder: "q4f16",
    embed_tokens: "q4f16",
    decoder_model_merged: "q4f16",
  },
  device: "webgpu",
});

// Define tools

const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get current weather information for a location",
      parameters: {
        type: "object",
        properties: {
          location: {
            type: "string",
            description: "The city and state, e.g. San Francisco, CA",
          },
          unit: {
            type: "string",
            enum: ["celsius", "fahrenheit"],
            description: "The unit of temperature to use",
          },
        },
        required: ["location"],
      },
    },
  },
];

// Prepare messages
const messages = [
  {
    role: "user",
    content: "What is the weather like in New York?",
  },
];
const prompt = processor.apply_chat_template(messages, {
  add_generation_prompt: true,
  tools,
});

// Prepare inputs
const inputs = await processor(prompt, null, null, {
  add_special_tokens: false,
});

// Helper to simulate tool execution
function executeTool(name, _args) {
  if (name === "get_weather") {
    // Simulate a weather API response
    return {
      temperature: 25,
      unit: "celsius",
      description: "Sunny with a few clouds",
    };
  }
  return { error: `Unknown tool: ${name}` };
}

// First generation: model should produce a tool call
console.log("=== First generation (expecting tool call) ===");
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 512,
  do_sample: false,
  streamer: new TextStreamer(processor.tokenizer, {
    skip_prompt: true,
    skip_special_tokens: false,
  }),
});

// Decode the first output to extract tool calls
const firstOutput = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: false },
)[0];

const parsed = processor.parse_response(firstOutput);
console.log("\nParsed response:", JSON.stringify(parsed, null, 2));
const toolCalls = parsed.tool_calls ?? [];

if (toolCalls.length > 0) {
  // Execute tools and collect responses
  const toolResponses = toolCalls.map((tc) => ({
    name: tc.function.name,
    response: executeTool(tc.function.name, tc.function.arguments),
  }));
  console.log("Tool responses:", JSON.stringify(toolResponses, null, 2));

  // Build the full conversation with tool call + tool response
  messages.push({
    role: "assistant",
    tool_calls: toolCalls,
  });
  messages.push({
    role: "user",
    tool_responses: toolResponses,
  });

  // Re-apply chat template with the full conversation
  const prompt2 = processor.apply_chat_template(messages, {
    add_generation_prompt: true,
    tools,
  });

  const inputs2 = await processor(prompt2, null, null, {
    add_special_tokens: false,
  });

  // Second generation: model should produce a final answer
  console.log("\n=== Second generation (expecting final answer) ===");
  const outputs2 = await model.generate({
    ...inputs2,
    max_new_tokens: 512,
    do_sample: false,
    streamer: new TextStreamer(processor.tokenizer, {
      skip_prompt: true,
      skip_special_tokens: false,
    }),
  });

  const finalOutput = processor.batch_decode(
    outputs2.slice(null, [inputs2.input_ids.dims.at(-1), null]),
    { skip_special_tokens: true },
  )[0];
  console.log("\nFinal answer:", finalOutput);
}
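
For reference, parsed in the first generation is consumed as roughly the following shape (inferred from how the code above uses it, not from the library docs; the exact fields depend on the model's parsing schema):

{
  "tool_calls": [
    {
      "function": {
        "name": "get_weather",
        "arguments": { "location": "New York" }
      }
    }
  ]
}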

@Rocketknight1 @nico-martin

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@xenova
Collaborator Author

xenova commented Apr 12, 2026

I only implemented a minimal set of the required jmespath functionality, so as not to bloat the library... @Rocketknight1 lmk what functionality you think is needed in addition to the current set (which basically only covers the cases outlined in your original tests).
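
For context, that subset boils down to dotted field access, numeric indexing, and [*] projections. A minimal sketch of what such an evaluator looks like (illustrative only, not the actual implementation):

// Minimal jmespath-like evaluator: dotted field access, numeric indexing,
// and [*] projections (a sketch, not the library code).
function evalPath(obj, path) {
  const tokens = path.match(/[^.[\]]+|\[\*\]/g) ?? [];
  let current = [obj];
  let projected = false;
  for (const tok of tokens) {
    if (tok === "[*]") {
      // Wildcard projection: flatten one level of arrays.
      projected = true;
      current = current.flatMap((v) => (Array.isArray(v) ? v : []));
    } else {
      // Plain field access or numeric index.
      current = current.map((v) => v?.[tok]);
    }
  }
  return projected ? current : current[0];
}

// e.g. evalPath(parsed, "tool_calls[*].function.name") -> ["get_weather"]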

@Rocketknight1
Member

API looks good! One thing I'll say is that we probably won't have a perfectly clean implementation like we do with jinja, where we almost never need to extend the spec for new models. The "cascading regex plus some predefined parsers" approach works in most cases, but future models will occasionally require us to add an extra custom parser because they have a very weird tool call format. In that case I'll try to remember to ping you on the Python PR so you can implement it here, but those parsers shouldn't be long (Gemma4JsontoJson is an example of exactly this).
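
Conceptually, the cascade is just an ordered list of candidate parsers tried until one succeeds, roughly like this (hypothetical delimiters and parser bodies, purely for illustration):

// Try model-specific parsers first, then fall back to generic extraction.
const TOOL_CALL_PARSERS = [
  // Model-specific: a JSON payload between custom delimiters.
  (text) => {
    const m = text.match(/<tool_call>([\s\S]*?)<\/tool_call>/);
    return m ? [JSON.parse(m[1])] : null;
  },
  // Generic fallback: a bare JSON object containing a "name" key.
  (text) => {
    const m = text.match(/\{[\s\S]*"name"[\s\S]*\}/);
    return m ? [JSON.parse(m[0])] : null;
  },
];

function parseToolCalls(text) {
  for (const parser of TOOL_CALL_PARSERS) {
    try {
      const result = parser(text);
      if (result) return result;
    } catch {
      // Malformed candidate; fall through to the next parser.
    }
  }
  return [];
}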

@xenova
Collaborator Author

xenova commented Apr 13, 2026

Thanks @Rocketknight1! Yeah that sounds good.

@sroussey
Contributor

This would be great. I wrote a bunch of trash just trying to figure out the different ways different models do things (deepseek 2 vs 3.1, vs llama's three ways, vs hermes/qwen, etc --- oh don't forget functiongema).

https://github.com/workglow-dev/workglow/blob/main/packages/ai-provider/src/provider-hf-transformers/common/HFT_ToolParser.ts

@xenova
Collaborator Author

xenova commented Apr 14, 2026

different models do things (deepseek 2 vs 3.1, vs llama's three ways, vs hermes/qwen, etc --- oh don't forget functiongema).

(image)

@nico-martin
Collaborator

Gave it a little deep dive and I like the approach. Could we also add this to the TextGenerationPipeline so it returns a clean object?
Also, I think it would make sense if we could parse streamed chunks: if a response contains multiple tool calls, or tool calls plus extra text, applications could then start executing a tool while the rest of the response is still being generated.
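
Something along these lines, just to sketch the idea (tryParsePartial is hypothetical; today parse_response only takes the full text):

let buffer = "";
const started = new Set();
const streamer = new TextStreamer(processor.tokenizer, {
  skip_prompt: true,
  callback_function: (chunk) => {
    buffer += chunk;
    // tryParsePartial is a hypothetical incremental parser that reports
    // which tool calls are already complete in the partial text.
    const calls = tryParsePartial(buffer)?.tool_calls ?? [];
    for (const [i, call] of calls.entries()) {
      if (!started.has(i)) {
        started.add(i);
        executeTool(call.function.name, call.function.arguments);
      }
    }
  },
});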
