Skip to content

Retrieve thinking msg from the metadata reasoning_content field#532

Open
Chibukach wants to merge 3 commits into
vllm-project:mainfrom
neuralmagic:add_reasoning_content_support
Open

Retrieve thinking msg from the metadata reasoning_content field#532
Chibukach wants to merge 3 commits into
vllm-project:mainfrom
neuralmagic:add_reasoning_content_support

Conversation

@Chibukach
Copy link
Copy Markdown

Purpose

Recent VLLM output separate thinking content in the metadata instead of tags

Description

This PR adds support for retrieving thinking content from the "reasoning_content" field inside the metadata during a model response generation

Tests

Tested locally on Qwen/Qwen3-8B and the output prints thinking_msg

2026-05-19 11:29:50,075 - speculators.data_generation.preprocessing - INFO -   HIGHLIGHTED TEXT (BLUE = trainable, GREY = masked)
2026-05-19 11:29:50,089 - speculators.data_generation.preprocessing - INFO -   <|im_start|>user
Here is the problem:
Suppose that Alice and Bob are participating in an online auction.  They can either bid or pass.  Assuming that each person acts independently and that the probability of bidding is 1/2 for each participant,  then the probability that both Alice and Bob bid is $\frac{1}{4}$.  Can you explain this answer.

The answer is indeed $\frac{1}{4}$.  But, I don't understand what is meant by independent events.  Does this mean that their actions (bidding or passing) are independent of each other.
Yes, exactly.  In this case, "independent events" means that Alice's decision to bid or pass does not affect Bob's decision to bid or pass, and vice versa.

Thank you.  So, the probability that both Alice and Bob bid is calculated by multiplying their individual probabilities of bidding, which is 1/2, so 1/2 * 1/2 = 1/4, which I can understand.
That's correct.  The probability of independent events occurring together is the product of the individual probabilities.  This is a basic rule of probability, and it applies to any situation where the events are independent.

Okay, I think I have a good understanding of this problem now.  Is there anything else I should know about independent events?<|im_end|>
<|im_start|>assistant
<think>
Here's a thinking process that leads to the suggested response:

1.  **Analyze the User's Request:**
    *   **Context:** The user is asking about a probability problem involving Alice and Bob in an auction.
    *   **Current Understanding:** They understand *why* the answer is 1/4 (multiplying probabilities) and *what* independence means in this context (one person's choice doesn't affect the other's).
    *   **Current Question:** "Is there anything else I should know about independent events?"
    *   **Goal:** Provide a helpful, educational, and concise expansion on the concept of independent events without overwhelming the user. I need to reinforce their correct understanding while adding valuable nuance or common pitfalls.

2.  **Determine Key Concepts to Cover:**
    *   **Definition Reinforcement:** Briefly reiterate the formal definition ($P(A \cap B) = P(A) \cdot P(B)$).
    *   **Distinction from Mutually Exclusive:** This is the most common confusion point. Students often mix up "independent" (doesn't affect probability) with "mutually exclusive" (can't happen together).
    *   **Conditional Probability:** Briefly mention how independence relates to conditional probability ($P(A|B) = P(A)$).
    *   **Real-world Nuance:** Mention that true independence is often an assumption in models (like the auction example) but might not hold in reality (e.g., if they talk to each other).
    *   **More than two events:** Briefly note that this extends to 3+ events.

3.  **Drafting the Response (Iterative Process):**

    *   *Draft 1 (Internal Monologue):* Okay, they get the basics. I should tell them about mutually exclusive events because people always confuse that. If events are mutually exclusive, they can't both happen, so probability is 0. But independent events *can* both happen. Also, maybe mention conditional probability. If A and B are independent, knowing B happened doesn't change the chance of A. Oh, and maybe warn them that in real life, independence is an assumption.

    *   *Draft 2 (Closer to final):* You have a great grasp. Here are a few extra things. First, don't confuse independent with mutually exclusive. Mutually exclusive means they can't both happen. Independent means one doesn't change the other's odds. Second, think about conditional probability. If they are independent, P(Alice bids | Bob bids) is still 1/2. Third, remember this is a model. In real auctions, maybe they know each other. Finally, this works for more than two people too.

    *   *Refining for Tone and Clarity:* The tone should be encouraging and instructional. I should use formatting (bullet points) to make it readable. I need to make sure the distinction between "Independent" and "Mutually Exclusive" is very clear, as that's the biggest learning opportunity here.

4.  **Finalizing the Content (incorporating into the final output):**
    *   **Acknowledge and Validate:** Start by confirming their understanding is solid.
    *   **Key Point 1: Independent vs. Mutually Exclusive:** This is crucial. Explain the difference clearly.
        *   Independent: Can both happen, doesn't affect odds.
        *   Mutually Exclusive: Cannot both happen.
    *   **Key Point 2: Conditional Probability:** Explain $P(A|B) = P(A)$. This is the mathematical definition of independence.
    *   **Key Point 3: Assumption vs. Reality:** Contextualize the auction example. In game theory, decisions might *not* be independent (strategic interaction), but in this probability problem, we assume they are.
    *   **Key Point 4: Extension:** Mention it works for 3+ events.

5.  **Review against Constraints:**
    *   The user is asking a follow-up question in a dialogue format.
    *   Keep it conversational but informative.
    *   Ensure mathematical accuracy.

6.  **Final Polish:** (Self-Correction during drafting) *Wait, I should be careful not to introduce Game Theory too deeply, or it might confuse the probability concept. Just mention that in real auctions, independence might not hold, but for this math problem, it does.*

    *Let's structure the answer:*
    1. Validation.
    2. The "Mutually Exclusive" trap.
    3. The Conditional Probability perspective.
    4. Real-world context (Assumptions).
    5. Closing encouragement.

7.  **Final Output Generation** (similar to the provided good response).
</think>

You have a solid grasp of the basics! Since you are interested in deepening your understanding, here are a few key concepts and common pitfalls regarding independent events that are worth knowing:

**1. Independent vs. Mutually Exclusive**
This is the most common point of confusion in probability.
*   **Independent Events:** The occurrence of one does *not* change the probability of the other. They **can** happen at the same time (like Alice and Bob both bidding).
*   **Mutually Exclusive Events:** The two events **cannot** happen at the same time. (For example, if a coin flip results in "Heads," it cannot also be "Tails").
*   **Why it matters:** If events are mutually exclusive, the probability of both happening is **0**. If they are independent, you **multiply** their probabilities.

**2. The Conditional Probability Perspective**
Another way to define independence is using conditional probability (the probability of A happening *given* that B has already happened).
*   If Alice and Bob are independent, knowing that Bob bid does not change Alice's odds.
*   Mathematically: $P(\text{Alice bids} \mid \text{Bob bids}) = P(\text{Alice bids})$.
*   If knowing Bob bid made Alice *more* likely to bid (perhaps they are competing rivals), they would **not** be independent.

**3. Independence is Often an Assumption**
In textbook problems like this auction scenario, independence is usually given as a rule to make the math solvable. However, in real-world scenarios (especially in economics or game theory), actions are rarely truly independent.
*   **Real World:** If Alice sees Bob bid, she might be more likely to bid to win the item, or less likely to bid to save money.
*   **Math Problem:** We assume "independence" to isolate the probability calculation from psychological or strategic factors.

**4. It Extends to More Than Two Events**
The rule you used ($P(A) \times P(B)$) works for any number of independent events.
*   If Charlie also joins the auction with a 1/2 probability of bidding, the chance that **all three** bid is $\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}$.

You are asking the right questions to build a strong foundation in probability. Keep distinguishing between *what* happens (the events) and *how* they influence each other (the dependence)!<|im_end|>

I have filled in:

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan/results, such as providing test command and pasting the results.
  • (Optional) The necessary documentation update.
  • I (a human) have written or reviewed the code in this pr to the best of my ability.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a69d1c0b-1d66-4d55-9f4d-8787ba09a3f5

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

The preprocessing pipeline is extended to propagate metadata through conversation normalization. The _normalize_conversation function now accepts a metadata parameter and copies reasoning_content from metadata into each normalized turn when available. The _preprocess_batch function extracts the per-example metadata field from input batches, validates both conversations and metadata, and passes metadata to normalization.

Changes

Metadata reasoning_content propagation

Layer / File(s) Summary
Normalize conversation with metadata reasoning_content
src/speculators/data_generation/preprocessing.py
_normalize_conversation accepts metadata: dict parameter and injects metadata["reasoning_content"] into each normalized turn. _preprocess_batch extracts metadata from batch input, zips with conversations, validates both, and passes to normalization.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

data-generation, enhancement

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding support for retrieving thinking content from the metadata reasoning_content field during preprocessing.
Description check ✅ Passed The description is related to the changeset, explaining the purpose (VLLM now separates thinking content into metadata), what was implemented (retrieving reasoning_content from metadata), and providing test results with example output.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added enhancement New feature or request data-generation labels May 19, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/speculators/data_generation/preprocessing.py`:
- Around line 101-103: The code currently copies metadata["reasoning_content"]
into normalized_turn for all roles; change this so reasoning_content is only
attached when the turn is an assistant turn by checking
normalized_turn.get("role") == "assistant" (or equivalent role field) before
assigning normalized_turn["reasoning_content"] = metadata["reasoning_content"];
keep the existing truthy check on metadata["reasoning_content"] and update the
block around normalized_turn and metadata to guard the assignment with that role
check.
- Around line 274-287: The batch drops rows because metadatas =
examples.get("metadata", []) produces an empty list and zip(conversations,
metadatas) truncates pairs and because the code rejects empty dicts with if not
metadata or not isinstance(metadata, dict): continue; fix by: when building
metadatas, if examples lacks "metadata" or its list is shorter than
conversations, create/pad a metadatas list of the same length filled with {} (so
missing/empty metadata is treated as {}), remove the truthiness check and only
validate metadata with isinstance(metadata, dict) before calling
_normalize_conversation, and add a warning log when metadata length mismatches
conversations to surface potential data issues (keep using
_normalize_conversation for normalization).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f3c22b20-c51b-4804-9119-ef0061526fb4

📥 Commits

Reviewing files that changed from the base of the PR and between 4f80f5d and c897305.

📒 Files selected for processing (1)
  • src/speculators/data_generation/preprocessing.py

Comment on lines +101 to +103
if "reasoning_content" in metadata and metadata["reasoning_content"]:
normalized_turn["reasoning_content"] = metadata["reasoning_content"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Attach reasoning_content only to assistant turns.

This currently copies the same reasoning_content into user/system turns too, which can pollute role-specific templating and semantics.

💡 Suggested fix
-        if "reasoning_content" in metadata and metadata["reasoning_content"]:
+        if (
+            role == "assistant"
+            and "reasoning_content" in metadata
+            and metadata["reasoning_content"]
+        ):
             normalized_turn["reasoning_content"] = metadata["reasoning_content"]
🧰 Tools
🪛 Ruff (0.15.13)

[warning] 101-101: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/speculators/data_generation/preprocessing.py` around lines 101 - 103, The
code currently copies metadata["reasoning_content"] into normalized_turn for all
roles; change this so reasoning_content is only attached when the turn is an
assistant turn by checking normalized_turn.get("role") == "assistant" (or
equivalent role field) before assigning normalized_turn["reasoning_content"] =
metadata["reasoning_content"]; keep the existing truthy check on
metadata["reasoning_content"] and update the block around normalized_turn and
metadata to guard the assignment with that role check.

Comment on lines +274 to +287
metadatas= examples.get("metadata", [])

if not conversations:
log.warning(f"No conversations key found. Keys: {list(examples.keys())}")
return results

for idx, conv in enumerate(conversations):
for idx, (conv,metadata) in enumerate(zip(conversations,metadatas)):
if not conv or not isinstance(conv, list):
continue
if not metadata or not isinstance(metadata, dict):
continue

# Normalize to standard format with optional turn dropout
normalized_conv = _normalize_conversation(conv, turn_dropout)
normalized_conv = _normalize_conversation(conv, metadata, turn_dropout)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's see the file structure
find . -name "preprocessing.py" | head -20

Repository: vllm-project/speculators

Length of output: 119


🏁 Script executed:

#!/bin/bash
# Read the preprocessing.py file around the target lines (274-287)
cat -n src/speculators/data_generation/preprocessing.py | sed -n '260,300p'

Repository: vllm-project/speculators

Length of output: 1821


🏁 Script executed:

#!/bin/bash
# Get more context on the _normalize_conversation function signature and usage
cat -n src/speculators/data_generation/preprocessing.py | sed -n '1,150p' | tail -100

Repository: vllm-project/speculators

Length of output: 4078


🏁 Script executed:

#!/bin/bash
# Look at the _normalize_conversation function definition to understand metadata usage
grep -n "def _normalize_conversation" src/speculators/data_generation/preprocessing.py

Repository: vllm-project/speculators

Length of output: 100


🏁 Script executed:

#!/bin/bash
# Read the _normalize_conversation function implementation
cat -n src/speculators/data_generation/preprocessing.py | sed -n '55,120p'

Repository: vllm-project/speculators

Length of output: 2706


🏁 Script executed:

#!/bin/bash
# Check for tests or examples showing how this preprocessing is used
find . -name "*.py" -type f | xargs grep -l "_preprocess_batch\|preprocessing" | head -10

Repository: vllm-project/speculators

Length of output: 330


🏁 Script executed:

#!/bin/bash
# Check if metadata is always present in data samples
grep -r "metadata" --include="*.py" src/speculators/data_generation/ | grep -E "(\.json|from|import|examples\[|get\()" | head -20

Repository: vllm-project/speculators

Length of output: 161


🏁 Script executed:

#!/bin/bash
# Look at how this is called - check main processing flow
grep -B 5 -A 5 "_preprocess_batch" src/speculators/data_generation/preprocessing.py | tail -50

Repository: vllm-project/speculators

Length of output: 1131


🏁 Script executed:

#!/bin/bash
# Check if metadata is required or optional in dataset loading
grep -n "metadata" src/speculators/data_generation/preprocessing.py | head -30

Repository: vllm-project/speculators

Length of output: 530


🏁 Script executed:

#!/bin/bash
# Check the test file to see how preprocessing is tested
cat -n tests/integration/datagen/test_preprocessing.py | head -100

Repository: vllm-project/speculators

Length of output: 3861


🏁 Script executed:

#!/bin/bash
# Look at how metadata is provided in tests
grep -A 10 -B 5 "metadata" tests/integration/datagen/test_preprocessing.py | head -60

Repository: vllm-project/speculators

Length of output: 50


🏁 Script executed:

#!/antml:bash
# Check how the preprocessing is called in the data loading pipeline
cat -n src/speculators/train/data.py | head -100

Repository: vllm-project/speculators

Length of output: 3552


🏁 Script executed:

#!/bin/bash
# Look for how examples dict is constructed when calling _preprocess_batch
grep -B 20 "_preprocess_batch" src/speculators/train/data.py | head -50

Repository: vllm-project/speculators

Length of output: 50


🏁 Script executed:

#!/bin/bash
# Check if metadata is always provided or if it's optional
grep -r "conversations" --include="*.py" src/speculators/ | grep -v "normalized_conv\|assistant_pattern" | head -20

Repository: vllm-project/speculators

Length of output: 1024


🏁 Script executed:

#!/bin/bash
# Check if the test file passes metadata parameter to _normalize_conversation
grep -A 15 "test_normalize_conversation" tests/integration/datagen/test_preprocessing.py | head -40

Repository: vllm-project/speculators

Length of output: 1445


🏁 Script executed:

#!/bin/bash
# Look for how _normalize_conversation is called in tests - check the actual call signatures
grep "_normalize_conversation" tests/integration/datagen/test_preprocessing.py

Repository: vllm-project/speculators

Length of output: 408


🏁 Script executed:

#!/bin/bash
# Check the actual test calls to see if metadata is provided
cat -n tests/integration/datagen/test_preprocessing.py | sed -n '27,80p'

Repository: vllm-project/speculators

Length of output: 2136


🏁 Script executed:

#!/bin/bash
# Look at the actual Python to understand the behavior of zip with mismatched lengths
python3 << 'EOF'
# Test what happens with zip when lists have different lengths
conversations = ["conv1", "conv2", "conv3"]
metadatas = []

result = list(zip(conversations, metadatas))
print(f"Result of zip with empty metadata list: {result}")
print(f"Number of pairs: {len(result)}")

# Test with empty dict
metadatas_with_empty = [{}]
result2 = list(zip(conversations, metadatas_with_empty))
print(f"\nWith one empty dict metadata: {result2}")
print(f"Number of pairs: {len(result2)}")

# Test 'not' operator on empty dict
empty_dict = {}
print(f"\nTruth value of empty dict {{}}: {bool(empty_dict)}")
print(f"'not {{}}' evaluates to: {not empty_dict}")
EOF

Repository: vllm-project/speculators

Length of output: 261


🏁 Script executed:

#!/bin/bash
# Check if the tests actually pass by running them
cd /tmp && python3 -m pytest /dev/stdin 2>&1 << 'EOF'
# Quick test to see if _normalize_conversation can be called without metadata
import sys
sys.path.insert(0, '/root/vllm-project/speculators/src')

from speculators.data_generation.preprocessing import _normalize_conversation

conv = [
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "Hi!"},
]

# Try calling without metadata parameter
try:
    result = _normalize_conversation(conv)
    print("SUCCESS: Can call _normalize_conversation with just conv")
except TypeError as e:
    print(f"ERROR: {e}")
EOF

Repository: vllm-project/speculators

Length of output: 109


🏁 Script executed:

#!/bin/bash
# Check if there's a recent change that broke signature compatibility
# Look at the function definition more carefully
cat -n src/speculators/data_generation/preprocessing.py | sed -n '61,75p'

Repository: vllm-project/speculators

Length of output: 662


🏁 Script executed:

#!/bin/bash
# Check if metadata has a default value that I missed
grep -A 10 "def _normalize_conversation" src/speculators/data_generation/preprocessing.py

Repository: vllm-project/speculators

Length of output: 399


🏁 Script executed:

#!/bin/bash
# Look for all calls to _normalize_conversation to see how it's called
grep -n "_normalize_conversation" src/speculators/data_generation/preprocessing.py

Repository: vllm-project/speculators

Length of output: 184


🏁 Script executed:

#!/bin/bash
# Check git to see if there were recent changes to the test file
git log --oneline -20 tests/integration/datagen/test_preprocessing.py 2>/dev/null || echo "No git history available"

Repository: vllm-project/speculators

Length of output: 112


🏁 Script executed:

#!/antml:bash
# Check if metadata parameter has a default value I might have missed
python3 << 'EOF'
import inspect
import sys
sys.path.insert(0, 'src')

try:
    from speculators.data_generation.preprocessing import _normalize_conversation
    sig = inspect.signature(_normalize_conversation)
    print("Function signature:")
    print(sig)
    print("\nParameters:")
    for param_name, param in sig.parameters.items():
        print(f"  {param_name}: default={param.default}")
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: vllm-project/speculators

Length of output: 99


🏁 Script executed:

#!/bin/bash
# Check the actual file to see what the data looks like - look for example datasets
find . -name "*.json" -o -name "*.jsonl" | head -5

Repository: vllm-project/speculators

Length of output: 50


🏁 Script executed:

#!/bin/bash
# Check the configs to understand expected data structure
cat -n src/speculators/data_generation/configs.py | head -100

Repository: vllm-project/speculators

Length of output: 1800


🏁 Script executed:

#!/bin/bash
# Check the git diff for the recent commit that added metadata support
git log -1 --oneline --name-only
git show c897305 --stat

Repository: vllm-project/speculators

Length of output: 23396


🏁 Script executed:

#!/bin/bash
# Check if metadata parameter had a default value that was removed or changed
git show c897305:src/speculators/data_generation/preprocessing.py | grep -A 10 "def _normalize_conversation"

Repository: vllm-project/speculators

Length of output: 399


🏁 Script executed:

#!/antml:bash
# See what the old signature was
git show c897305~1:src/speculators/data_generation/preprocessing.py 2>/dev/null | grep -A 10 "def _normalize_conversation" || echo "Previous version not found"

Repository: vllm-project/speculators

Length of output: 95


🏁 Script executed:

#!/bin/bash
# Check if tests were updated in that commit
git show c897305 tests/integration/datagen/test_preprocessing.py | head -100

Repository: vllm-project/speculators

Length of output: 3267


🏁 Script executed:

#!/bin/bash
# Check the actual test assertions to see if they expect metadata handling
cat -n tests/integration/datagen/test_preprocessing.py | sed -n '27,80p'

Repository: vllm-project/speculators

Length of output: 2136


🏁 Script executed:

#!/bin/bash
# Search for test_preprocess_batch to see how the batch function is tested
grep -A 30 "def test_preprocess_batch" tests/integration/datagen/test_preprocessing.py | head -50

Repository: vllm-project/speculators

Length of output: 1839


🏁 Script executed:

#!/antml:bash
# Check if metadata parameter has a default in the actual function definition
python3 << 'EOF'
with open('src/speculators/data_generation/preprocessing.py', 'r') as f:
    lines = f.readlines()
    for i, line in enumerate(lines[60:75], start=61):
        print(f"{i:3d}: {line}", end='')
EOF

Repository: vllm-project/speculators

Length of output: 632


🏁 Script executed:

#!/bin/bash
# Look at how tests call _normalize_conversation - find all test calls
grep -n "_normalize_conversation(" tests/integration/datagen/test_preprocessing.py | head -20

Repository: vllm-project/speculators

Length of output: 206


🏁 Script executed:

#!/bin/bash
# This is critical - test calls _normalize_conversation with only 1 arg but function requires 2 (metadata is mandatory)
# Let me verify the actual test calls to be sure
cat -n tests/integration/datagen/test_preprocessing.py | sed -n '29,45p'

Repository: vllm-project/speculators

Length of output: 772


🏁 Script executed:

#!/antml:bash
# Check if metadata has a default value that wasn't shown
grep -A 5 "def _normalize_conversation" src/speculators/data_generation/preprocessing.py | cat -n

Repository: vllm-project/speculators

Length of output: 303


🏁 Script executed:

#!/bin/bash
# Let's look at the complete function signature definition to see if metadata has a default
sed -n '61,66p' src/speculators/data_generation/preprocessing.py | cat -n

Repository: vllm-project/speculators

Length of output: 303


🏁 Script executed:

#!/antml:bash
# Check if there are TWO _normalize_conversation functions or if metadata has a default we missed
rg "def _normalize_conversation" -A 8 src/speculators/data_generation/preprocessing.py

Repository: vllm-project/speculators

Length of output: 309


🏁 Script executed:

#!/antml:bash
# The test calls must be broken. Let me check if tests actually import and use this
cat -n tests/integration/datagen/test_preprocessing.py | sed -n '1,50p'

Repository: vllm-project/speculators

Length of output: 1822


🏁 Script executed:

#!/bash
# The tests call _normalize_conversation(conv) but the signature requires metadata
# Let me check if metadata has a default value (=None or ={})
python3 << 'EOF'
with open('src/speculators/data_generation/preprocessing.py', 'r') as f:
    content = f.read()
    
# Find the _normalize_conversation definition
import re
match = re.search(r'def _normalize_conversation\([^)]+\)', content, re.DOTALL)
if match:
    print("Found function signature:")
    print(match.group(0))
else:
    print("Not found")
EOF

Repository: vllm-project/speculators

Length of output: 199


🏁 Script executed:

#!/antml:bash
# Check if metadata has a default value = None or = {}
grep -A 10 "def _normalize_conversation" src/speculators/data_generation/preprocessing.py | grep -E "metadata|="

Repository: vllm-project/speculators

Length of output: 120


🏁 Script executed:

#!/bash
# Let me check the exact bytes to ensure no default is hidden
cat src/speculators/data_generation/preprocessing.py | sed -n '61,66p' | od -c | head -60

Repository: vllm-project/speculators

Length of output: 109


🏁 Script executed:

#!/antml:bash
# Check if maybe metadata has a default value that's on the next line or if the function was redefined
cat -n src/speculators/data_generation/preprocessing.py | sed -n '61,70p'

Repository: vllm-project/speculators

Length of output: 468


🏁 Script executed:

#!/bash
# The most direct approach - just check what the actual full definition is with no truncation
head -n 70 src/speculators/data_generation/preprocessing.py | tail -n 20

Repository: vllm-project/speculators

Length of output: 551


🏁 Script executed:

#!/bash
# This reveals a critical issue: the tests call _normalize_conversation(conv) 
# but the function signature requires metadata as a MANDATORY parameter with NO DEFAULT
# This means the tests are BROKEN - they would fail at runtime
# Let's verify this by checking if maybe the test assertions should also fail
grep -A 20 "test_normalize_conversation_sharegpt_format" tests/integration/datagen/test_preprocessing.py | head -25

Repository: vllm-project/speculators

Length of output: 722


🏁 Script executed:

#!/bash
# Check if there's perhaps an overloaded version or if metadata has a default that's set differently
grep -B 5 "metadata: dict" src/speculators/data_generation/preprocessing.py | head -20

Repository: vllm-project/speculators

Length of output: 167


🏁 Script executed:

#!/bash
# Let me check if maybe the parameter was supposed to have a default value
# by looking at line 63 specifically for any = sign
sed -n '63p' src/speculators/data_generation/preprocessing.py | cat -A

Repository: vllm-project/speculators

Length of output: 89


🏁 Script executed:

#!/bash
# Verify the linked repository findings about existing usages in vllm
# Check if reasoning_content is optional in vllm
grep -r "reasoning_content" /root/vllm-project/vllm/vllm/entrypoints/chat_utils.py 2>/dev/null | head -10 || echo "File not found"

Repository: vllm-project/speculators

Length of output: 50


🏁 Script executed:

#!/bash
# Verify the zip behavior with mismatched lengths
python3 << 'EOF'
# Demonstrate the issue
conversations = ["conv1", "conv2", "conv3"]
metadatas = []  # This is what happens when metadata key is absent

result = list(zip(conversations, metadatas))
print(f"conversations: {len(conversations)} items")
print(f"metadatas: {len(metadatas)} items")
print(f"zip result: {len(result)} pairs")
print(f"Rows processed: {len(result)}, Rows lost: {len(conversations) - len(result)}")

# Demonstrate empty dict behavior
print(f"\nTruth value of empty dict: bool({{}}) = {bool({})}")
print(f"'not {{}}' = {not {}}")
EOF

Repository: vllm-project/speculators

Length of output: 223


🏁 Script executed:

#!/bash
# Check if there are any tests that test the batch function WITH metadata
grep -A 50 "test_preprocess_batch" tests/integration/datagen/test_preprocessing.py | grep -E "metadata|examples\s*=" | head -20

Repository: vllm-project/speculators

Length of output: 204


🏁 Script executed:

#!/bash
# Check if the batched tests provide metadata in the examples dict
cat -n tests/integration/datagen/test_preprocessing.py | sed -n '123,180p'

Repository: vllm-project/speculators

Length of output: 2665


🏁 Script executed:

#!/bash
# Look for any test that actually includes metadata in the examples dict
grep -B 5 -A 30 "def test_preprocess_batch_basic" tests/integration/datagen/test_preprocessing.py

Repository: vllm-project/speculators

Length of output: 1205


🏁 Script executed:

#!/bash
# Search for any test that uses metadata key
grep -n "metadata" tests/integration/datagen/test_preprocessing.py | head -20

Repository: vllm-project/speculators

Length of output: 50


🏁 Script executed:

#!/bash
# Final verification - check if metadata parameter was supposed to have a default
# Let me see the complete function one more time to be absolutely certain
grep -A 50 "def _normalize_conversation" src/speculators/data_generation/preprocessing.py | head -60

Repository: vllm-project/speculators

Length of output: 1863


Metadata handling causes silent data loss in the batch preprocessing pipeline.

When the "metadata" key is missing from examples, metadatas = examples.get("metadata", []) returns an empty list, and zip(conversations, metadatas) produces zero pairs, dropping all rows without warning. Additionally, if not metadata or not isinstance(metadata, dict): continue skips even valid empty metadata {} (since not {} = True), and length mismatches between conversations and metadatas are silently truncated by zip.

Since reasoning_content is optional and conditionally accessed (line 101-102 checks if "reasoning_content" in metadata), samples without this field should still process normally with empty or missing metadata rather than being dropped entirely.

💡 Suggested fix
-    metadatas= examples.get("metadata", [])
+    metadatas = examples.get("metadata")
+    if metadatas is None:
+        metadatas = [{} for _ in conversations]
+    elif not isinstance(metadatas, list):
+        log.warning("Invalid metadata column type; falling back to empty metadata")
+        metadatas = [{} for _ in conversations]

-    for idx, (conv,metadata) in enumerate(zip(conversations,metadatas)):
+    if len(metadatas) != len(conversations):
+        log.warning(
+            f"Metadata/conversation length mismatch: "
+            f"{len(metadatas)} vs {len(conversations)}; padding metadata"
+        )
+        if len(metadatas) < len(conversations):
+            metadatas = metadatas + [{} for _ in range(len(conversations) - len(metadatas))]
+        else:
+            metadatas = metadatas[: len(conversations)]
+
+    for idx, (conv, metadata) in enumerate(zip(conversations, metadatas, strict=True)):
         if not conv or not isinstance(conv, list):
             continue
-        if not metadata or not isinstance(metadata, dict):
+        if not isinstance(metadata, dict):
             continue
🧰 Tools
🪛 Ruff (0.15.13)

[warning] 280-280: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/speculators/data_generation/preprocessing.py` around lines 274 - 287, The
batch drops rows because metadatas = examples.get("metadata", []) produces an
empty list and zip(conversations, metadatas) truncates pairs and because the
code rejects empty dicts with if not metadata or not isinstance(metadata, dict):
continue; fix by: when building metadatas, if examples lacks "metadata" or its
list is shorter than conversations, create/pad a metadatas list of the same
length filled with {} (so missing/empty metadata is treated as {}), remove the
truthiness check and only validate metadata with isinstance(metadata, dict)
before calling _normalize_conversation, and add a warning log when metadata
length mismatches conversations to surface potential data issues (keep using
_normalize_conversation for normalization).

@mergify
Copy link
Copy Markdown

mergify Bot commented May 20, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Copy link
Copy Markdown
Collaborator

@fynnsu fynnsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Chibukach, my main issue with this is I don't fully understand the structure of the reasoning content you're trying to support.

I think I saw an example from Alex where we had a single reasoning_content value for a full conversation, is this the same as that?

The code seems to be assuming we have:

{
  "conversations": list of dicts,
  "metadata": list of dicts of same length
}

Is that correct?

Comment on lines +288 to +291
metadatas = metadatas + [{} for _ in range(len(conversations) - len(metadatas))]
else:
metadatas = metadatas[: len(conversations)]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels pretty unsafe. We're just assuming that if we have less metadata that the metadata lines up with first part of the conversation?

The conversation contains system, user, and assistant messages. Presumably we only have reasoning content for assistant responses.

Comment on lines +293 to +296
if not conv or not isinstance(conv, list):
continue
if not isinstance(metadata, dict):
continue
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're constructing things this way we don't need to check for it. Also if the sample structure doesn't meet our expectations it would be better to fail with an error than just silently skip it.

@shanjiaz
Copy link
Copy Markdown
Collaborator

@Chibukach Thanks so much for putting up this change! Just trying to understand this issue better. Could you maybe provide an example json file for what we actually get when we run response regeneration? I tested on vllm 0.21.0 and this is what I got:

{
    "id": "cf428475-a0ef-5fd8-a35e-25abbf57e895",
    "conversations": [
        {
            "from": "human",
            "value": "Has the SoundCloud Generation been broke down into further age-defined groups any?"
        },
        {
            "from": "gpt",
            "value": "<think>\nOkay, the user is asking if the SoundCloud Generation has been broken down into more specific age groups. \n</think>\n\nThe term \"SoundCloud Generation\" ... strict demographic segmentation."
        }
    ],
    "metadata": {
        "idx": 2,
        "finish_reason": "stop",
        "latency_s": 6.904,
        "usage": {
            "prompt_tokens": 23,
            "total_tokens": 973,
            "completion_tokens": 950,
            "prompt_tokens_details": null
        },
        "endpoint": "http://127.0.0.1:8000/v1/chat/completions"
    }
}

This looks like what we had before. Perhaps you could provide more information on which model & which version of vllm that requires this change? Thank you thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants