Retrieve thinking msg from the metadata reasoning_content field by Chibukach · Pull Request #532 · vllm-project/speculators

Chibukach · 2026-05-19T13:54:47Z

Purpose

Recent VLLM output separate thinking content in the metadata instead of tags

Description

This PR adds support for retrieving thinking content from the "reasoning_content" field inside the metadata during a model response generation

Tests

Tested locally on Qwen/Qwen3-8B and the output prints thinking_msg

2026-05-19 11:29:50,075 - speculators.data_generation.preprocessing - INFO -   HIGHLIGHTED TEXT (BLUE = trainable, GREY = masked)
2026-05-19 11:29:50,089 - speculators.data_generation.preprocessing - INFO -   <|im_start|>user
Here is the problem:
Suppose that Alice and Bob are participating in an online auction.  They can either bid or pass.  Assuming that each person acts independently and that the probability of bidding is 1/2 for each participant,  then the probability that both Alice and Bob bid is $\frac{1}{4}$.  Can you explain this answer.

The answer is indeed $\frac{1}{4}$.  But, I don't understand what is meant by independent events.  Does this mean that their actions (bidding or passing) are independent of each other.
Yes, exactly.  In this case, "independent events" means that Alice's decision to bid or pass does not affect Bob's decision to bid or pass, and vice versa.

Thank you.  So, the probability that both Alice and Bob bid is calculated by multiplying their individual probabilities of bidding, which is 1/2, so 1/2 * 1/2 = 1/4, which I can understand.
That's correct.  The probability of independent events occurring together is the product of the individual probabilities.  This is a basic rule of probability, and it applies to any situation where the events are independent.

Okay, I think I have a good understanding of this problem now.  Is there anything else I should know about independent events?<|im_end|>
<|im_start|>assistant
<think>
Here's a thinking process that leads to the suggested response:

1.  **Analyze the User's Request:**
    *   **Context:** The user is asking about a probability problem involving Alice and Bob in an auction.
    *   **Current Understanding:** They understand *why* the answer is 1/4 (multiplying probabilities) and *what* independence means in this context (one person's choice doesn't affect the other's).
    *   **Current Question:** "Is there anything else I should know about independent events?"
    *   **Goal:** Provide a helpful, educational, and concise expansion on the concept of independent events without overwhelming the user. I need to reinforce their correct understanding while adding valuable nuance or common pitfalls.

2.  **Determine Key Concepts to Cover:**
    *   **Definition Reinforcement:** Briefly reiterate the formal definition ($P(A \cap B) = P(A) \cdot P(B)$).
    *   **Distinction from Mutually Exclusive:** This is the most common confusion point. Students often mix up "independent" (doesn't affect probability) with "mutually exclusive" (can't happen together).
    *   **Conditional Probability:** Briefly mention how independence relates to conditional probability ($P(A|B) = P(A)$).
    *   **Real-world Nuance:** Mention that true independence is often an assumption in models (like the auction example) but might not hold in reality (e.g., if they talk to each other).
    *   **More than two events:** Briefly note that this extends to 3+ events.

3.  **Drafting the Response (Iterative Process):**

    *   *Draft 1 (Internal Monologue):* Okay, they get the basics. I should tell them about mutually exclusive events because people always confuse that. If events are mutually exclusive, they can't both happen, so probability is 0. But independent events *can* both happen. Also, maybe mention conditional probability. If A and B are independent, knowing B happened doesn't change the chance of A. Oh, and maybe warn them that in real life, independence is an assumption.

    *   *Draft 2 (Closer to final):* You have a great grasp. Here are a few extra things. First, don't confuse independent with mutually exclusive. Mutually exclusive means they can't both happen. Independent means one doesn't change the other's odds. Second, think about conditional probability. If they are independent, P(Alice bids | Bob bids) is still 1/2. Third, remember this is a model. In real auctions, maybe they know each other. Finally, this works for more than two people too.

    *   *Refining for Tone and Clarity:* The tone should be encouraging and instructional. I should use formatting (bullet points) to make it readable. I need to make sure the distinction between "Independent" and "Mutually Exclusive" is very clear, as that's the biggest learning opportunity here.

4.  **Finalizing the Content (incorporating into the final output):**
    *   **Acknowledge and Validate:** Start by confirming their understanding is solid.
    *   **Key Point 1: Independent vs. Mutually Exclusive:** This is crucial. Explain the difference clearly.
        *   Independent: Can both happen, doesn't affect odds.
        *   Mutually Exclusive: Cannot both happen.
    *   **Key Point 2: Conditional Probability:** Explain $P(A|B) = P(A)$. This is the mathematical definition of independence.
    *   **Key Point 3: Assumption vs. Reality:** Contextualize the auction example. In game theory, decisions might *not* be independent (strategic interaction), but in this probability problem, we assume they are.
    *   **Key Point 4: Extension:** Mention it works for 3+ events.

5.  **Review against Constraints:**
    *   The user is asking a follow-up question in a dialogue format.
    *   Keep it conversational but informative.
    *   Ensure mathematical accuracy.

6.  **Final Polish:** (Self-Correction during drafting) *Wait, I should be careful not to introduce Game Theory too deeply, or it might confuse the probability concept. Just mention that in real auctions, independence might not hold, but for this math problem, it does.*

    *Let's structure the answer:*
    1. Validation.
    2. The "Mutually Exclusive" trap.
    3. The Conditional Probability perspective.
    4. Real-world context (Assumptions).
    5. Closing encouragement.

7.  **Final Output Generation** (similar to the provided good response).
</think>

You have a solid grasp of the basics! Since you are interested in deepening your understanding, here are a few key concepts and common pitfalls regarding independent events that are worth knowing:

**1. Independent vs. Mutually Exclusive**
This is the most common point of confusion in probability.
*   **Independent Events:** The occurrence of one does *not* change the probability of the other. They **can** happen at the same time (like Alice and Bob both bidding).
*   **Mutually Exclusive Events:** The two events **cannot** happen at the same time. (For example, if a coin flip results in "Heads," it cannot also be "Tails").
*   **Why it matters:** If events are mutually exclusive, the probability of both happening is **0**. If they are independent, you **multiply** their probabilities.

**2. The Conditional Probability Perspective**
Another way to define independence is using conditional probability (the probability of A happening *given* that B has already happened).
*   If Alice and Bob are independent, knowing that Bob bid does not change Alice's odds.
*   Mathematically: $P(\text{Alice bids} \mid \text{Bob bids}) = P(\text{Alice bids})$.
*   If knowing Bob bid made Alice *more* likely to bid (perhaps they are competing rivals), they would **not** be independent.

**3. Independence is Often an Assumption**
In textbook problems like this auction scenario, independence is usually given as a rule to make the math solvable. However, in real-world scenarios (especially in economics or game theory), actions are rarely truly independent.
*   **Real World:** If Alice sees Bob bid, she might be more likely to bid to win the item, or less likely to bid to save money.
*   **Math Problem:** We assume "independence" to isolate the probability calculation from psychological or strategic factors.

**4. It Extends to More Than Two Events**
The rule you used ($P(A) \times P(B)$) works for any number of independent events.
*   If Charlie also joins the auction with a 1/2 probability of bidding, the chance that **all three** bid is $\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}$.

You are asking the right questions to build a strong foundation in probability. Keep distinguishing between *what* happens (the events) and *how* they influence each other (the dependence)!<|im_end|>

I have filled in:

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan/results, such as providing test command and pasting the results.
(Optional) The necessary documentation update.
I (a human) have written or reviewed the code in this pr to the best of my ability.

coderabbitai · 2026-05-19T13:55:07Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a69d1c0b-1d66-4d55-9f4d-8787ba09a3f5

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Walkthrough

The preprocessing pipeline is extended to propagate metadata through conversation normalization. The _normalize_conversation function now accepts a metadata parameter and copies reasoning_content from metadata into each normalized turn when available. The _preprocess_batch function extracts the per-example metadata field from input batches, validates both conversations and metadata, and passes metadata to normalization.

Changes

Metadata reasoning_content propagation

Layer / File(s)	Summary
Normalize conversation with metadata reasoning_content `src/speculators/data_generation/preprocessing.py`	`_normalize_conversation` accepts `metadata: dict` parameter and injects `metadata["reasoning_content"]` into each normalized turn. `_preprocess_batch` extracts metadata from batch input, zips with conversations, validates both, and passes to normalization.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

data-generation, enhancement

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: adding support for retrieving thinking content from the metadata reasoning_content field during preprocessing.
Description check	✅ Passed	The description is related to the changeset, explaining the purpose (VLLM now separates thinking content into metadata), what was implemented (retrieving reasoning_content from metadata), and providing test results with example output.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/speculators/data_generation/preprocessing.py`:
- Around line 101-103: The code currently copies metadata["reasoning_content"]
into normalized_turn for all roles; change this so reasoning_content is only
attached when the turn is an assistant turn by checking
normalized_turn.get("role") == "assistant" (or equivalent role field) before
assigning normalized_turn["reasoning_content"] = metadata["reasoning_content"];
keep the existing truthy check on metadata["reasoning_content"] and update the
block around normalized_turn and metadata to guard the assignment with that role
check.
- Around line 274-287: The batch drops rows because metadatas =
examples.get("metadata", []) produces an empty list and zip(conversations,
metadatas) truncates pairs and because the code rejects empty dicts with if not
metadata or not isinstance(metadata, dict): continue; fix by: when building
metadatas, if examples lacks "metadata" or its list is shorter than
conversations, create/pad a metadatas list of the same length filled with {} (so
missing/empty metadata is treated as {}), remove the truthiness check and only
validate metadata with isinstance(metadata, dict) before calling
_normalize_conversation, and add a warning log when metadata length mismatches
conversations to surface potential data issues (keep using
_normalize_conversation for normalization).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f3c22b20-c51b-4804-9119-ef0061526fb4

📥 Commits

Reviewing files that changed from the base of the PR and between 4f80f5d and c897305.

📒 Files selected for processing (1)

src/speculators/data_generation/preprocessing.py

coderabbitai · 2026-05-19T13:59:39Z

+        if "reasoning_content" in metadata and metadata["reasoning_content"]:
+            normalized_turn["reasoning_content"] = metadata["reasoning_content"]
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Attach reasoning_content only to assistant turns.

This currently copies the same reasoning_content into user/system turns too, which can pollute role-specific templating and semantics.

💡 Suggested fix

- if "reasoning_content" in metadata and metadata["reasoning_content"]: + if ( + role == "assistant" + and "reasoning_content" in metadata + and metadata["reasoning_content"] + ): normalized_turn["reasoning_content"] = metadata["reasoning_content"]

🧰 Tools

🪛 Ruff (0.15.13)

[warning] 101-101: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/speculators/data_generation/preprocessing.py` around lines 101 - 103, The code currently copies metadata["reasoning_content"] into normalized_turn for all roles; change this so reasoning_content is only attached when the turn is an assistant turn by checking normalized_turn.get("role") == "assistant" (or equivalent role field) before assigning normalized_turn["reasoning_content"] = metadata["reasoning_content"]; keep the existing truthy check on metadata["reasoning_content"] and update the block around normalized_turn and metadata to guard the assignment with that role check.

coderabbitai · 2026-05-19T13:59:39Z

+    metadatas= examples.get("metadata", [])

    if not conversations:
        log.warning(f"No conversations key found. Keys: {list(examples.keys())}")
        return results

-    for idx, conv in enumerate(conversations):
+    for idx, (conv,metadata) in enumerate(zip(conversations,metadatas)):
        if not conv or not isinstance(conv, list):
            continue
+        if not metadata or not isinstance(metadata, dict):
+            continue

        # Normalize to standard format with optional turn dropout
-        normalized_conv = _normalize_conversation(conv, turn_dropout)
+        normalized_conv = _normalize_conversation(conv, metadata, turn_dropout)


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # First, let's see the file structure find . -name "preprocessing.py" | head -20

Repository: vllm-project/speculators

Length of output: 119

🏁 Script executed:

#!/bin/bash # Read the preprocessing.py file around the target lines (274-287) cat -n src/speculators/data_generation/preprocessing.py | sed -n '260,300p'

Repository: vllm-project/speculators

Length of output: 1821

🏁 Script executed:

#!/bin/bash # Get more context on the _normalize_conversation function signature and usage cat -n src/speculators/data_generation/preprocessing.py | sed -n '1,150p' | tail -100

Repository: vllm-project/speculators

Length of output: 4078

🏁 Script executed:

#!/bin/bash # Look at the _normalize_conversation function definition to understand metadata usage grep -n "def _normalize_conversation" src/speculators/data_generation/preprocessing.py

Repository: vllm-project/speculators

Length of output: 100

🏁 Script executed:

#!/bin/bash # Read the _normalize_conversation function implementation cat -n src/speculators/data_generation/preprocessing.py | sed -n '55,120p'

Repository: vllm-project/speculators

Length of output: 2706

🏁 Script executed:

#!/bin/bash # Check for tests or examples showing how this preprocessing is used find . -name "*.py" -type f | xargs grep -l "_preprocess_batch\|preprocessing" | head -10

Repository: vllm-project/speculators

Length of output: 330

🏁 Script executed:

#!/bin/bash # Check if metadata is always present in data samples grep -r "metadata" --include="*.py" src/speculators/data_generation/ | grep -E "(\.json|from|import|examples\[|get\()" | head -20

Repository: vllm-project/speculators

Length of output: 161

🏁 Script executed:

#!/bin/bash # Look at how this is called - check main processing flow grep -B 5 -A 5 "_preprocess_batch" src/speculators/data_generation/preprocessing.py | tail -50

Repository: vllm-project/speculators

Length of output: 1131

🏁 Script executed:

#!/bin/bash # Check if metadata is required or optional in dataset loading grep -n "metadata" src/speculators/data_generation/preprocessing.py | head -30

Repository: vllm-project/speculators

Length of output: 530

🏁 Script executed:

#!/bin/bash # Check the test file to see how preprocessing is tested cat -n tests/integration/datagen/test_preprocessing.py | head -100

Repository: vllm-project/speculators

Length of output: 3861

🏁 Script executed:

#!/bin/bash # Look at how metadata is provided in tests grep -A 10 -B 5 "metadata" tests/integration/datagen/test_preprocessing.py | head -60

Repository: vllm-project/speculators

Length of output: 50

🏁 Script executed:

#!/antml:bash # Check how the preprocessing is called in the data loading pipeline cat -n src/speculators/train/data.py | head -100

Repository: vllm-project/speculators

Length of output: 3552

🏁 Script executed:

#!/bin/bash # Look for how examples dict is constructed when calling _preprocess_batch grep -B 20 "_preprocess_batch" src/speculators/train/data.py | head -50

Repository: vllm-project/speculators

Length of output: 50

🏁 Script executed:

#!/bin/bash # Check if metadata is always provided or if it's optional grep -r "conversations" --include="*.py" src/speculators/ | grep -v "normalized_conv\|assistant_pattern" | head -20

Repository: vllm-project/speculators

Length of output: 1024

🏁 Script executed:

#!/bin/bash # Check if the test file passes metadata parameter to _normalize_conversation grep -A 15 "test_normalize_conversation" tests/integration/datagen/test_preprocessing.py | head -40

Repository: vllm-project/speculators

Length of output: 1445

🏁 Script executed:

#!/bin/bash # Look for how _normalize_conversation is called in tests - check the actual call signatures grep "_normalize_conversation" tests/integration/datagen/test_preprocessing.py

Repository: vllm-project/speculators

Length of output: 408

🏁 Script executed:

#!/bin/bash # Check the actual test calls to see if metadata is provided cat -n tests/integration/datagen/test_preprocessing.py | sed -n '27,80p'

Repository: vllm-project/speculators

Length of output: 2136

🏁 Script executed:

#!/bin/bash # Look at the actual Python to understand the behavior of zip with mismatched lengths python3 << 'EOF' # Test what happens with zip when lists have different lengths conversations = ["conv1", "conv2", "conv3"] metadatas = [] result = list(zip(conversations, metadatas)) print(f"Result of zip with empty metadata list: {result}") print(f"Number of pairs: {len(result)}") # Test with empty dict metadatas_with_empty = [{}] result2 = list(zip(conversations, metadatas_with_empty)) print(f"\nWith one empty dict metadata: {result2}") print(f"Number of pairs: {len(result2)}") # Test 'not' operator on empty dict empty_dict = {} print(f"\nTruth value of empty dict {{}}: {bool(empty_dict)}") print(f"'not {{}}' evaluates to: {not empty_dict}") EOF

Repository: vllm-project/speculators

Length of output: 261

🏁 Script executed:

#!/bin/bash # Check if the tests actually pass by running them cd /tmp && python3 -m pytest /dev/stdin 2>&1 << 'EOF' # Quick test to see if _normalize_conversation can be called without metadata import sys sys.path.insert(0, '/root/vllm-project/speculators/src') from speculators.data_generation.preprocessing import _normalize_conversation conv = [ {"from": "human", "value": "Hello"}, {"from": "gpt", "value": "Hi!"}, ] # Try calling without metadata parameter try: result = _normalize_conversation(conv) print("SUCCESS: Can call _normalize_conversation with just conv") except TypeError as e: print(f"ERROR: {e}") EOF

Repository: vllm-project/speculators

Length of output: 109

🏁 Script executed:

#!/bin/bash # Check if there's a recent change that broke signature compatibility # Look at the function definition more carefully cat -n src/speculators/data_generation/preprocessing.py | sed -n '61,75p'

Repository: vllm-project/speculators

Length of output: 662

🏁 Script executed:

#!/bin/bash # Check if metadata has a default value that I missed grep -A 10 "def _normalize_conversation" src/speculators/data_generation/preprocessing.py

Repository: vllm-project/speculators

Length of output: 399

🏁 Script executed:

#!/bin/bash # Look for all calls to _normalize_conversation to see how it's called grep -n "_normalize_conversation" src/speculators/data_generation/preprocessing.py

Repository: vllm-project/speculators

Length of output: 184

🏁 Script executed:

#!/bin/bash # Check git to see if there were recent changes to the test file git log --oneline -20 tests/integration/datagen/test_preprocessing.py 2>/dev/null || echo "No git history available"

Repository: vllm-project/speculators

Length of output: 112

🏁 Script executed:

#!/antml:bash # Check if metadata parameter has a default value I might have missed python3 << 'EOF' import inspect import sys sys.path.insert(0, 'src') try: from speculators.data_generation.preprocessing import _normalize_conversation sig = inspect.signature(_normalize_conversation) print("Function signature:") print(sig) print("\nParameters:") for param_name, param in sig.parameters.items(): print(f" {param_name}: default={param.default}") except Exception as e: print(f"Error: {e}") EOF

Repository: vllm-project/speculators

Length of output: 99

🏁 Script executed:

#!/bin/bash # Check the actual file to see what the data looks like - look for example datasets find . -name "*.json" -o -name "*.jsonl" | head -5

Repository: vllm-project/speculators

Length of output: 50

🏁 Script executed:

#!/bin/bash # Check the configs to understand expected data structure cat -n src/speculators/data_generation/configs.py | head -100

Repository: vllm-project/speculators

Length of output: 1800

🏁 Script executed:

#!/bin/bash # Check the git diff for the recent commit that added metadata support git log -1 --oneline --name-only git show c897305 --stat

Repository: vllm-project/speculators

Length of output: 23396

🏁 Script executed:

#!/bin/bash # Check if metadata parameter had a default value that was removed or changed git show c897305:src/speculators/data_generation/preprocessing.py | grep -A 10 "def _normalize_conversation"

Repository: vllm-project/speculators

Length of output: 399

🏁 Script executed:

#!/antml:bash # See what the old signature was git show c897305~1:src/speculators/data_generation/preprocessing.py 2>/dev/null | grep -A 10 "def _normalize_conversation" || echo "Previous version not found"

Repository: vllm-project/speculators

Length of output: 95

🏁 Script executed:

#!/bin/bash # Check if tests were updated in that commit git show c897305 tests/integration/datagen/test_preprocessing.py | head -100

Repository: vllm-project/speculators

Length of output: 3267

🏁 Script executed:

#!/bin/bash # Check the actual test assertions to see if they expect metadata handling cat -n tests/integration/datagen/test_preprocessing.py | sed -n '27,80p'

Repository: vllm-project/speculators

Length of output: 2136

🏁 Script executed:

#!/bin/bash # Search for test_preprocess_batch to see how the batch function is tested grep -A 30 "def test_preprocess_batch" tests/integration/datagen/test_preprocessing.py | head -50

Repository: vllm-project/speculators

Length of output: 1839

🏁 Script executed:

#!/antml:bash # Check if metadata parameter has a default in the actual function definition python3 << 'EOF' with open('src/speculators/data_generation/preprocessing.py', 'r') as f: lines = f.readlines() for i, line in enumerate(lines[60:75], start=61): print(f"{i:3d}: {line}", end='') EOF

Repository: vllm-project/speculators

Length of output: 632

🏁 Script executed:

#!/bin/bash # Look at how tests call _normalize_conversation - find all test calls grep -n "_normalize_conversation(" tests/integration/datagen/test_preprocessing.py | head -20

Repository: vllm-project/speculators

Length of output: 206

🏁 Script executed:

#!/bin/bash # This is critical - test calls _normalize_conversation with only 1 arg but function requires 2 (metadata is mandatory) # Let me verify the actual test calls to be sure cat -n tests/integration/datagen/test_preprocessing.py | sed -n '29,45p'

Repository: vllm-project/speculators

Length of output: 772

🏁 Script executed:

#!/antml:bash # Check if metadata has a default value that wasn't shown grep -A 5 "def _normalize_conversation" src/speculators/data_generation/preprocessing.py | cat -n

Repository: vllm-project/speculators

Length of output: 303

🏁 Script executed:

#!/bin/bash # Let's look at the complete function signature definition to see if metadata has a default sed -n '61,66p' src/speculators/data_generation/preprocessing.py | cat -n

Repository: vllm-project/speculators

Length of output: 303

🏁 Script executed:

#!/antml:bash # Check if there are TWO _normalize_conversation functions or if metadata has a default we missed rg "def _normalize_conversation" -A 8 src/speculators/data_generation/preprocessing.py

Repository: vllm-project/speculators

Length of output: 309

🏁 Script executed:

#!/antml:bash # The test calls must be broken. Let me check if tests actually import and use this cat -n tests/integration/datagen/test_preprocessing.py | sed -n '1,50p'

Repository: vllm-project/speculators

Length of output: 1822

🏁 Script executed:

#!/bash # The tests call _normalize_conversation(conv) but the signature requires metadata # Let me check if metadata has a default value (=None or ={}) python3 << 'EOF' with open('src/speculators/data_generation/preprocessing.py', 'r') as f: content = f.read() # Find the _normalize_conversation definition import re match = re.search(r'def _normalize_conversation\([^)]+\)', content, re.DOTALL) if match: print("Found function signature:") print(match.group(0)) else: print("Not found") EOF

Repository: vllm-project/speculators

Length of output: 199

🏁 Script executed:

#!/antml:bash # Check if metadata has a default value = None or = {} grep -A 10 "def _normalize_conversation" src/speculators/data_generation/preprocessing.py | grep -E "metadata|="

Repository: vllm-project/speculators

Length of output: 120

🏁 Script executed:

#!/bash # Let me check the exact bytes to ensure no default is hidden cat src/speculators/data_generation/preprocessing.py | sed -n '61,66p' | od -c | head -60

Repository: vllm-project/speculators

Length of output: 109

🏁 Script executed:

#!/antml:bash # Check if maybe metadata has a default value that's on the next line or if the function was redefined cat -n src/speculators/data_generation/preprocessing.py | sed -n '61,70p'

Repository: vllm-project/speculators

Length of output: 468

🏁 Script executed:

#!/bash # The most direct approach - just check what the actual full definition is with no truncation head -n 70 src/speculators/data_generation/preprocessing.py | tail -n 20

Repository: vllm-project/speculators

Length of output: 551

🏁 Script executed:

#!/bash # This reveals a critical issue: the tests call _normalize_conversation(conv) # but the function signature requires metadata as a MANDATORY parameter with NO DEFAULT # This means the tests are BROKEN - they would fail at runtime # Let's verify this by checking if maybe the test assertions should also fail grep -A 20 "test_normalize_conversation_sharegpt_format" tests/integration/datagen/test_preprocessing.py | head -25

Repository: vllm-project/speculators

Length of output: 722

🏁 Script executed:

#!/bash # Check if there's perhaps an overloaded version or if metadata has a default that's set differently grep -B 5 "metadata: dict" src/speculators/data_generation/preprocessing.py | head -20

Repository: vllm-project/speculators

Length of output: 167

🏁 Script executed:

#!/bash # Let me check if maybe the parameter was supposed to have a default value # by looking at line 63 specifically for any = sign sed -n '63p' src/speculators/data_generation/preprocessing.py | cat -A

Repository: vllm-project/speculators

Length of output: 89

🏁 Script executed:

#!/bash # Verify the linked repository findings about existing usages in vllm # Check if reasoning_content is optional in vllm grep -r "reasoning_content" /root/vllm-project/vllm/vllm/entrypoints/chat_utils.py 2>/dev/null | head -10 || echo "File not found"

Repository: vllm-project/speculators

Length of output: 50

🏁 Script executed:

#!/bash # Verify the zip behavior with mismatched lengths python3 << 'EOF' # Demonstrate the issue conversations = ["conv1", "conv2", "conv3"] metadatas = [] # This is what happens when metadata key is absent result = list(zip(conversations, metadatas)) print(f"conversations: {len(conversations)} items") print(f"metadatas: {len(metadatas)} items") print(f"zip result: {len(result)} pairs") print(f"Rows processed: {len(result)}, Rows lost: {len(conversations) - len(result)}") # Demonstrate empty dict behavior print(f"\nTruth value of empty dict: bool({{}}) = {bool({})}") print(f"'not {{}}' = {not {}}") EOF

Repository: vllm-project/speculators

Length of output: 223

🏁 Script executed:

#!/bash # Check if there are any tests that test the batch function WITH metadata grep -A 50 "test_preprocess_batch" tests/integration/datagen/test_preprocessing.py | grep -E "metadata|examples\s*=" | head -20

Repository: vllm-project/speculators

Length of output: 204

🏁 Script executed:

#!/bash # Check if the batched tests provide metadata in the examples dict cat -n tests/integration/datagen/test_preprocessing.py | sed -n '123,180p'

Repository: vllm-project/speculators

Length of output: 2665

🏁 Script executed:

#!/bash # Look for any test that actually includes metadata in the examples dict grep -B 5 -A 30 "def test_preprocess_batch_basic" tests/integration/datagen/test_preprocessing.py

Repository: vllm-project/speculators

Length of output: 1205

🏁 Script executed:

#!/bash # Search for any test that uses metadata key grep -n "metadata" tests/integration/datagen/test_preprocessing.py | head -20

Repository: vllm-project/speculators

Length of output: 50

🏁 Script executed:

#!/bash # Final verification - check if metadata parameter was supposed to have a default # Let me see the complete function one more time to be absolutely certain grep -A 50 "def _normalize_conversation" src/speculators/data_generation/preprocessing.py | head -60

Repository: vllm-project/speculators

Length of output: 1863

Metadata handling causes silent data loss in the batch preprocessing pipeline.

When the "metadata" key is missing from examples, metadatas = examples.get("metadata", []) returns an empty list, and zip(conversations, metadatas) produces zero pairs, dropping all rows without warning. Additionally, if not metadata or not isinstance(metadata, dict): continue skips even valid empty metadata {} (since not {} = True), and length mismatches between conversations and metadatas are silently truncated by zip.

Since reasoning_content is optional and conditionally accessed (line 101-102 checks if "reasoning_content" in metadata), samples without this field should still process normally with empty or missing metadata rather than being dropped entirely.

💡 Suggested fix

- metadatas= examples.get("metadata", []) + metadatas = examples.get("metadata") + if metadatas is None: + metadatas = [{} for _ in conversations] + elif not isinstance(metadatas, list): + log.warning("Invalid metadata column type; falling back to empty metadata") + metadatas = [{} for _ in conversations] - for idx, (conv,metadata) in enumerate(zip(conversations,metadatas)): + if len(metadatas) != len(conversations): + log.warning( + f"Metadata/conversation length mismatch: " + f"{len(metadatas)} vs {len(conversations)}; padding metadata" + ) + if len(metadatas) < len(conversations): + metadatas = metadatas + [{} for _ in range(len(conversations) - len(metadatas))] + else: + metadatas = metadatas[: len(conversations)] + + for idx, (conv, metadata) in enumerate(zip(conversations, metadatas, strict=True)): if not conv or not isinstance(conv, list): continue - if not metadata or not isinstance(metadata, dict): + if not isinstance(metadata, dict): continue

🧰 Tools

🪛 Ruff (0.15.13)

[warning] 280-280: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/speculators/data_generation/preprocessing.py` around lines 274 - 287, The batch drops rows because metadatas = examples.get("metadata", []) produces an empty list and zip(conversations, metadatas) truncates pairs and because the code rejects empty dicts with if not metadata or not isinstance(metadata, dict): continue; fix by: when building metadatas, if examples lacks "metadata" or its list is shorter than conversations, create/pad a metadatas list of the same length filled with {} (so missing/empty metadata is treated as {}), remove the truthiness check and only validate metadata with isinstance(metadata, dict) before calling _normalize_conversation, and add a warning log when metadata length mismatches conversations to surface potential data issues (keep using _normalize_conversation for normalization).

mergify · 2026-05-20T17:03:13Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

fynnsu

Hi @Chibukach, my main issue with this is I don't fully understand the structure of the reasoning content you're trying to support.

I think I saw an example from Alex where we had a single reasoning_content value for a full conversation, is this the same as that?

The code seems to be assuming we have:

{
  "conversations": list of dicts,
  "metadata": list of dicts of same length
}

Is that correct?

fynnsu · 2026-05-20T17:20:28Z

+            metadatas = metadatas + [{} for _ in range(len(conversations) - len(metadatas))]
+        else:
+            metadatas = metadatas[: len(conversations)]
+


This feels pretty unsafe. We're just assuming that if we have less metadata that the metadata lines up with first part of the conversation?

The conversation contains system, user, and assistant messages. Presumably we only have reasoning content for assistant responses.

fynnsu · 2026-05-20T17:21:33Z

+        if not conv or not isinstance(conv, list):
+            continue
+        if not isinstance(metadata, dict):
+            continue


If we're constructing things this way we don't need to check for it. Also if the sample structure doesn't meet our expectations it would be better to fail with an error than just silently skip it.

shanjiaz · 2026-05-20T20:31:58Z

@Chibukach Thanks so much for putting up this change! Just trying to understand this issue better. Could you maybe provide an example json file for what we actually get when we run response regeneration? I tested on vllm 0.21.0 and this is what I got:

{
    "id": "cf428475-a0ef-5fd8-a35e-25abbf57e895",
    "conversations": [
        {
            "from": "human",
            "value": "Has the SoundCloud Generation been broke down into further age-defined groups any?"
        },
        {
            "from": "gpt",
            "value": "<think>\nOkay, the user is asking if the SoundCloud Generation has been broken down into more specific age groups. \n</think>\n\nThe term \"SoundCloud Generation\" ... strict demographic segmentation."
        }
    ],
    "metadata": {
        "idx": 2,
        "finish_reason": "stop",
        "latency_s": 6.904,
        "usage": {
            "prompt_tokens": 23,
            "total_tokens": 973,
            "completion_tokens": 950,
            "prompt_tokens_details": null
        },
        "endpoint": "http://127.0.0.1:8000/v1/chat/completions"
    }
}

This looks like what we had before. Perhaps you could provide more information on which model & which version of vllm that requires this change? Thank you thank you!

add reasoning content field support

c897305

coderabbitai Bot added enhancement New feature or request data-generation labels May 19, 2026

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Chibu Ukachi and others added 2 commits May 19, 2026 17:54

add more checks

cd76f41

Merge branch 'main' into add_reasoning_content_support

43bec5e

mergify Bot added the quality-failed label May 20, 2026

fynnsu requested changes May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retrieve thinking msg from the metadata reasoning_content field#532

Retrieve thinking msg from the metadata reasoning_content field#532
Chibukach wants to merge 3 commits into
vllm-project:mainfrom
neuralmagic:add_reasoning_content_support

Chibukach commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Review skipped

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

mergify Bot commented May 20, 2026

Uh oh!

fynnsu left a comment

Uh oh!

fynnsu May 20, 2026

Uh oh!

fynnsu May 20, 2026

Uh oh!

shanjiaz commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		if "reasoning_content" in metadata and metadata["reasoning_content"]:
		normalized_turn["reasoning_content"] = metadata["reasoning_content"]

Conversation

Chibukach commented May 19, 2026

Purpose

Description

Tests

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Suggested labels

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

mergify Bot commented May 20, 2026

Uh oh!

fynnsu left a comment

Choose a reason for hiding this comment

Uh oh!

fynnsu May 20, 2026

Choose a reason for hiding this comment

Uh oh!

fynnsu May 20, 2026

Choose a reason for hiding this comment

Uh oh!

shanjiaz commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coderabbitai Bot commented May 19, 2026 •

edited

Loading