Skip to content

add GRPO fine-tuning notebook for JSON invoice extraction using Fireworks Training API#243

Open
namanvirk18 wants to merge 2 commits into
patchy631:mainfrom
namanvirk18:add-fireworks-grpo-notebook
Open

add GRPO fine-tuning notebook for JSON invoice extraction using Fireworks Training API#243
namanvirk18 wants to merge 2 commits into
patchy631:mainfrom
namanvirk18:add-fireworks-grpo-notebook

Conversation

@namanvirk18
Copy link
Copy Markdown
Contributor

@namanvirk18 namanvirk18 commented Jun 1, 2026

Summary by CodeRabbit

  • New Features

    • End-to-end GRPO fine-tuning workflow for structured JSON invoice extraction, including training, evaluation, and demo/run scripts
    • Included training (200) and evaluation (50) example prompts plus sample invoices and CLI demo for batch inference
    • Automated reward validation utility and notebook with evaluation charts and baseline comparison
  • Documentation

    • Detailed README and skill documentation with setup, environment variables, usage steps, and contribution notes

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds an end-to-end GRPO fine-tuning workflow for JSON invoice extraction (notebook, training/eval datasets, reward scorer), plus agent-skill docs, a reward validator, orchestration scripts, a CLI demo, and sample invoice inputs.

Changes

GRPO Training Pipeline

Layer / File(s) Summary
Project setup & notebook init
grpo-finetuning-qwen3/.env.example, grpo-finetuning-qwen3/README.md, grpo-finetuning-qwen3/grpo_json_extraction.ipynb (Step 0)
Environment example entries, README with prerequisites and run instructions, and notebook initialization (dependency install and API-key loading/assertions).
Reward function (score) implementation
grpo-finetuning-qwen3/grpo_json_extraction.ipynb (Step 1)
JSON parsing + JSON-schema scoring function score() with 1.0/0.5/0.0 semantics and inline self-tests.
Training & evaluation datasets
grpo-finetuning-qwen3/train_prompts.jsonl, grpo-finetuning-qwen3/eval_prompts.jsonl
200-line JSONL training dataset and 50-line eval JSONL with extraction prompts and ground-truth vendor/date/amount/currency.
Dataset upload & baseline evaluation
grpo-finetuning-qwen3/grpo_json_extraction.ipynb (Steps 2–3)
Upload training dataset to Fireworks (poll until READY) and run baseline run_eval() over eval prompts using OpenRouter client and score().
GRPO training execution
grpo-finetuning-qwen3/grpo_json_extraction.ipynb (Step 4)
Wire and run Fireworks GRPO RL loop with invoice_reward, model/dataset/GRPO config, infra/deployment settings, and notebook-compatible asyncio patching.
Post-training evaluation & visualization
grpo-finetuning-qwen3/grpo_json_extraction.ipynb (Steps 5–7)
Evaluate deployed fine-tuned model, evaluate openai/gpt-4.1, compute accuracies, generate eval_chart.png, and run single inference example with schema validation.

Agent Skill & Pipeline

Layer / File(s) Summary
Skill documentation & demo
grpo-finetuning-qwen3/agent-skill/grpo-finetune/SKILL.md, grpo-finetuning-qwen3/agent-skill/grpo-finetune/agent_demo.py
Skill README with trigger/workflow/commands and a CLI demo that calls a deployment, sanitizes/parses JSON, validates required fields, and prints per-invoice results.
Reward validator
grpo-finetuning-qwen3/agent-skill/grpo-finetune/generate_reward.py
Validator that dynamically imports reward.py, ensures score(completion, row=None) exists, probes outputs are numeric ∈ [0,1], and optionally runs SELF_TESTS, returning process exit codes.
Run pipeline orchestration
grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py
CLI orchestration: validate reward, upload or reuse dataset, run GRPO training via RL-loop wiring, poll readiness, evaluate deployment with run_eval, and optionally execute local demo using invoices.txt.
Demo inputs
grpo-finetuning-qwen3/invoices.txt
Five sample invoice lines used for demo/inference runs.

Sequence Diagram

sequenceDiagram
  participant Notebook
  participant Fireworks as Fireworks (Training & Inference)
  participant OpenRouter as OpenRouter (Inference)
  participant Local as Local (Reward Scoring)
  
  Notebook->>Fireworks: Upload train_prompts.jsonl dataset
  Fireworks-->>Notebook: Dataset ready
  
  Notebook->>OpenRouter: Baseline eval on eval_prompts
  OpenRouter-->>Notebook: Base model outputs
  Notebook->>Local: Score outputs with reward function
  Local-->>Notebook: Schema-valid accuracy
  
  Notebook->>Fireworks: Execute GRPO training loop
  Fireworks->>Fireworks: Train with invoice_reward function
  Fireworks-->>Notebook: Deployment endpoint ready
  
  Notebook->>Fireworks: Eval fine-tuned model
  Fireworks-->>Notebook: Fine-tuned outputs
  Notebook->>Local: Score outputs
  Local-->>Notebook: Fine-tuned accuracy
  
  Notebook->>OpenRouter: Eval GPT-4.1 baseline
  OpenRouter-->>Notebook: GPT-4.1 outputs
  Notebook->>Local: Score outputs
  Local-->>Notebook: GPT-4.1 accuracy
  
  Notebook->>Notebook: Generate & save eval_chart.png
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through prompts and JSON lines,
Reward tests, uploads, and training times,
Baseline, tuned, and GPT side-by-side,
Charts saved, demos run, all verified,
A tiny rabbit cheers this pipeline's signs.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately summarizes the main addition: a GRPO fine-tuning notebook for JSON invoice extraction using the Fireworks Training API, which aligns with the primary file addition (grpo_json_extraction.ipynb) and core objective.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpo-finetuning-qwen3/eval_prompts.jsonl`:
- Around line 1-50: The eval_prompts.jsonl file contains corrupted ground_truth
labels that don't match the prompt text inside each "messages" array (e.g.,
amounts/currencies in the "user" content disagree with the "ground_truth"
object); update each record's ground_truth fields (vendor, date, amount,
currency) to exactly match the invoice text in the corresponding "messages" ->
user content (fix entries like the first record where "654 GBP" should be
amount: 654 and currency: "GBP", the sixth record that incorrectly adds "GBP",
and the 14th that lists 130 JPY but has 130.5), or remove/regenerate any rows
where you cannot confidently determine the correct label; ensure every JSON
object’s "ground_truth" matches the parsed values from its "messages" content
before running run_eval().

In `@grpo-finetuning-qwen3/grpo_json_extraction.ipynb`:
- Around line 564-567: The reward function invoice_reward currently ignores the
ground-truth row and only calls score(completion), so the policy can game output
shape; update invoice_reward to parse the completion JSON, extract fields
(vendor, date, amount, currency), compare them against the corresponding values
in row (normalizing formats for date/amount/currency), compute a numeric reward
that gives full credit for exact matches and partial credit for close/normalized
matches and penalties for missing/invalid fields, and return that float; replace
the current rl_loop.reward_fn binding with this new invoice_reward and ensure
parsing errors return a low reward rather than a crash.
- Around line 257-288: run_eval currently only measures schema validity (via
score/content) and never uses entry["ground_truth"]; update run_eval to compute
task accuracy against the true labels by parsing the model output JSON and
comparing it to entry["ground_truth"]. Specifically, inside run_eval (and where
you call score), first validate/parse content into a dict (handle the existing
</think> strip and JSON exceptions), then call a new or extended comparator
(e.g., score or compare_to_ground_truth) that accepts (parsed_output,
entry["ground_truth"]) and returns 1.0 for a correct match and 0.0 otherwise;
append that result to scores, keep schema-valid logging if desired, and change
the printed/returned accuracy to be based on matches to entry["ground_truth"]
rather than just schema validity. Ensure you reference run_eval, eval_data,
entry["ground_truth"], and score/compare_to_ground_truth when implementing.
- Around line 548-552: The notebook globally replaces asyncio.run with
_patched_run but never restores it if main(cfg) raises; wrap the patch and the
call to main(cfg) in a try/finally: save the original in _original_run, assign
asyncio.run = _patched_run, call main(cfg) inside try, and always restore
asyncio.run = _original_run in the finally block (referencing the existing
symbols _original_run, _patched_run, asyncio.run and the main(cfg) invocation)
so later cells see the original asyncio.run regardless of exceptions.
- Around line 84-88: The code currently assigns os.environ["FIREWORKS_API_KEY"]
= FIREWORKS_KEY before checking FIREWORKS_KEY is non-None, which can raise
TypeError; move or replace the assertion so that assert FIREWORKS_KEY, "Missing
FIREWORKS_API_KEY" (and similar asserts for OPENROUTER_KEY and ACCOUNT_ID) run
before any os.environ[...] = ... assignments, and ensure the checks verify the
values are non-empty strings (not None) before injecting them into environment
variables (refer to FIREWORKS_KEY, OPENROUTER_KEY, ACCOUNT_ID and the
os.environ[...] assignments).

In `@grpo-finetuning-qwen3/README.md`:
- Around line 19-40: Update the README to explicitly instruct users to start
Jupyter (or run the notebook) from the grpo-finetuning-qwen3 directory so the
notebook’s relative imports resolve; mention that grpo_json_extraction.ipynb
expects process-relative paths to ./cookbook/training, ./train_prompts.jsonl,
and ./eval_prompts.jsonl and provide the exact command to change into the
directory (e.g., cd grpo-finetuning-qwen3) or an alternative note about
launching Jupyter from the repo root after adjusting those paths in the
notebook.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ef7d615e-aabd-43f1-be50-ed875fc1a589

📥 Commits

Reviewing files that changed from the base of the PR and between 42eedf8 and 0d30d20.

⛔ Files ignored due to path filters (1)
  • grpo-finetuning-qwen3/eval_chart.png is excluded by !**/*.png
📒 Files selected for processing (5)
  • grpo-finetuning-qwen3/.env.example
  • grpo-finetuning-qwen3/README.md
  • grpo-finetuning-qwen3/eval_prompts.jsonl
  • grpo-finetuning-qwen3/grpo_json_extraction.ipynb
  • grpo-finetuning-qwen3/train_prompts.jsonl

Comment on lines +1 to +50
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Please find invoice from Acme Corp dated April 15, 2024. Grand total: 654 GBP.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Acme Corp", "date": "2024-04-15", "amount": 653.75, "currency": "GBP"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Outstanding payment to Crestline Distributors, due as of 05/02/2024, totalling \u20ac7,999.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Crestline Distributors", "date": "2024-05-02", "amount": 7999.0, "currency": "EUR"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Redwood Staffing Inc on 09/17/2024. Amount due: 3,696 GBP.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Redwood Staffing Inc", "date": "2024-09-17", "amount": 3695.99, "currency": "GBP"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Nexus Solutions Pvt Ltd. Invoice date: Jan 10 2023. Total payable: $69,552.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Nexus Solutions Pvt Ltd", "date": "2023-01-10", "amount": 69552.5, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Please find invoice from Global Logistics Ltd dated Apr 6 2023. Grand total: $1,043.75.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Global Logistics Ltd", "date": "2023-04-06", "amount": 1043.75, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Accounts payable entry \u2014 Nexus Solutions Pvt Ltd, 16/10/2023, 16731.25.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Nexus Solutions Pvt Ltd", "date": "2023-10-16", "amount": 16731.25, "currency": "GBP"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Bill from Sterling Parts & Co, dated 22/02/2023, total 83,860 GBP.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Sterling Parts & Co", "date": "2023-02-22", "amount": 83860.0, "currency": "GBP"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Ironclad Manufacturing on 18/12/2024. Amount due: 825.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Ironclad Manufacturing", "date": "2024-12-18", "amount": 825.99, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice\nFrom: Clearwater Utilities\nIssued: Feb 7 2023\nAmount owed: 9,094 EUR\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Clearwater Utilities", "date": "2023-02-07", "amount": 9094.5, "currency": "EUR"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Sterling Parts & Co on Dec 7 2023. Amount due: $55,158.00 USD.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Sterling Parts & Co", "date": "2023-12-07", "amount": 55158.0, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Please find invoice from Pinnacle Hardware Co dated 04/23/2023. Grand total: 78,517 JPY.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Pinnacle Hardware Co", "date": "2023-04-23", "amount": 78517.0, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "We have received a bill from Global Logistics Ltd. Date of service: 2023-02-05. Invoice total: 3,621.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Global Logistics Ltd", "date": "2023-02-05", "amount": 3621.5, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Outstanding payment to Apex Print Works, due as of 1 October 2024, totalling 69.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Apex Print Works", "date": "2024-10-01", "amount": 69.5, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Clearwater Utilities on 02/24/2024. Amount due: 130 JPY.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Clearwater Utilities", "date": "2024-02-24", "amount": 130.5, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Vendor: BlueSky Rentals\nDate: 10/06/2024\nTotal: \u00a3890.75\n\nExtract the fields and return valid JSON only."}], "ground_truth": {"vendor": "BlueSky Rentals", "date": "2024-06-10", "amount": 890.75, "currency": "GBP"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "We have received a bill from Apex Print Works. Date of service: 12/08/2023. Invoice total: 392 JPY.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Apex Print Works", "date": "2023-08-12", "amount": 391.99, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Bright Media LLC, issue date 12/06/2023, for the amount of 580.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Bright Media LLC", "date": "2023-12-06", "amount": 580.0, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Accounts payable entry \u2014 Nexus Solutions Pvt Ltd, 10/23/2024, \u20ac6,147.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Nexus Solutions Pvt Ltd", "date": "2024-10-23", "amount": 6147.0, "currency": "EUR"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Bill from Apex Print Works, dated 2024-06-25, total 7,211.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Apex Print Works", "date": "2024-06-25", "amount": 7211.0, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Vendor: Apex Print Works\nDate: 9 April 2023\nTotal: \u00a347,880.00\n\nExtract the fields and return valid JSON only."}], "ground_truth": {"vendor": "Apex Print Works", "date": "2023-04-09", "amount": 47880.0, "currency": "GBP"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Bill from Atlas Freight Services, dated 2023-04-10, total 85,914 JPY.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Atlas Freight Services", "date": "2023-04-10", "amount": 85914.5, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Pinnacle Hardware Co on December 19, 2023. Amount due: EUR 8,744.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Pinnacle Hardware Co", "date": "2023-12-19", "amount": 8744.99, "currency": "EUR"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Vendor: BlueSky Rentals\nDate: 09/13/2024\nTotal: 309.00 JPY\n\nExtract the fields and return valid JSON only."}], "ground_truth": {"vendor": "BlueSky Rentals", "date": "2024-09-13", "amount": 309.0, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice\nFrom: Nexus Solutions Pvt Ltd\nIssued: 09/17/2024\nAmount owed: 215.99 JPY\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Nexus Solutions Pvt Ltd", "date": "2024-09-17", "amount": 215.99, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Please find invoice from Ironclad Manufacturing dated 2023-07-28. Grand total: $8,138.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Ironclad Manufacturing", "date": "2023-07-28", "amount": 8138.5, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Cascade Engineering. Invoice date: May 21, 2023. Total payable: 18,142 GBP.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Cascade Engineering", "date": "2023-05-21", "amount": 18142.0, "currency": "GBP"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Accounts payable entry \u2014 Cascade Engineering, Jan 3 2023, $706.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Cascade Engineering", "date": "2023-01-03", "amount": 706.99, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Acme Corp, issue date July 2, 2024, for the amount of 6,927.75 EUR.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Acme Corp", "date": "2024-07-02", "amount": 6927.75, "currency": "EUR"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "We have received a bill from BlueSky Rentals. Date of service: Jan 10 2024. Invoice total: JPY 447.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "BlueSky Rentals", "date": "2024-01-10", "amount": 447.5, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Global Logistics Ltd, issue date 04/21/2024, for the amount of EUR 7,482.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Global Logistics Ltd", "date": "2024-04-21", "amount": 7482.0, "currency": "EUR"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Ironclad Manufacturing. Invoice date: 4 April 2023. Total payable: 32,740.00 EUR.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Ironclad Manufacturing", "date": "2023-04-04", "amount": 32740.0, "currency": "EUR"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Acme Corp, issue date 10/24/2023, for the amount of JPY 40,162.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Acme Corp", "date": "2023-10-24", "amount": 40162.99, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Accounts payable entry \u2014 Horizon Digital Agency, 10/09/2024, 93,238 USD.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Horizon Digital Agency", "date": "2024-09-10", "amount": 93238.0, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Vendor: Meridian Office Supplies\nDate: 09/14/2023\nTotal: JPY 9,683.00\n\nExtract the fields and return valid JSON only."}], "ground_truth": {"vendor": "Meridian Office Supplies", "date": "2023-09-14", "amount": 9683.0, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Ironclad Manufacturing on 28 March 2023. Amount due: 629.75 EUR.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Ironclad Manufacturing", "date": "2023-03-28", "amount": 629.75, "currency": "EUR"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice\nFrom: Cascade Engineering\nIssued: 21/12/2024\nAmount owed: 346.00\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Cascade Engineering", "date": "2024-12-21", "amount": 346.0, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Meridian Office Supplies. Invoice date: 12/10/2023. Total payable: $6,314.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Meridian Office Supplies", "date": "2023-12-10", "amount": 6314.0, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Global Logistics Ltd. Invoice date: February 17, 2024. Total payable: USD 727.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Global Logistics Ltd", "date": "2024-02-17", "amount": 727.5, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Vortex Software Ltd on Nov 4 2024. Amount due: JPY 779.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Vortex Software Ltd", "date": "2024-11-04", "amount": 779.0, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "We have received a bill from Ironclad Manufacturing. Date of service: Dec 28 2023. Invoice total: USD 2,374.25.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Ironclad Manufacturing", "date": "2023-12-28", "amount": 2374.25, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "We have received a bill from TechSupplies Inc. Date of service: 08/04/2024. Invoice total: 9807.25.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "TechSupplies Inc", "date": "2024-08-04", "amount": 9807.25, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Summit Consulting Group, issue date 2023-12-10, for the amount of GBP 67,422.25.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Summit Consulting Group", "date": "2023-12-10", "amount": 67422.25, "currency": "GBP"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Accounts payable entry \u2014 Clearwater Utilities, 9 February 2023, 398 JPY.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Clearwater Utilities", "date": "2023-02-09", "amount": 398.0, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Cascade Engineering. Invoice date: September 24, 2023. Total payable: $985.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Cascade Engineering", "date": "2023-09-24", "amount": 985.0, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Bill from Orbit Telecom LLC, dated 11/25/2024, total \u20ac92,242.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Orbit Telecom LLC", "date": "2024-11-25", "amount": 92242.0, "currency": "EUR"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Meridian Office Supplies, issue date May 15 2023, for the amount of \u20ac2,586.25.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Meridian Office Supplies", "date": "2023-05-15", "amount": 2586.25, "currency": "EUR"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Nexus Solutions Pvt Ltd. Invoice date: April 22, 2024. Total payable: 56357.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Nexus Solutions Pvt Ltd", "date": "2024-04-22", "amount": 56357.99, "currency": "JPY"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Summit Consulting Group. Invoice date: 2024-01-05. Total payable: 7001.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Summit Consulting Group", "date": "2024-01-05", "amount": 7001.5, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Clearwater Utilities, issue date 2023-06-08, for the amount of USD 715.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Clearwater Utilities", "date": "2023-06-08", "amount": 715.99, "currency": "USD"}}
{"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Apex Print Works on 20 March 2024. Amount due: 61,645.50 EUR.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Apex Print Works", "date": "2024-03-20", "amount": 61645.5, "currency": "EUR"}}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Several eval labels do not match the prompt text.

This held-out set has corrupted targets. For example, Line 1 says 654 GBP but ground_truth.amount is 653.75; Line 6 invents GBP even though the prompt gives no currency; Line 14 says 130 JPY but the label is 130.5. Once run_eval() starts comparing against ground_truth, these rows will penalize correct extraction.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/eval_prompts.jsonl` around lines 1 - 50, The
eval_prompts.jsonl file contains corrupted ground_truth labels that don't match
the prompt text inside each "messages" array (e.g., amounts/currencies in the
"user" content disagree with the "ground_truth" object); update each record's
ground_truth fields (vendor, date, amount, currency) to exactly match the
invoice text in the corresponding "messages" -> user content (fix entries like
the first record where "654 GBP" should be amount: 654 and currency: "GBP", the
sixth record that incorrectly adds "GBP", and the 14th that lists 130 JPY but
has 130.5), or remove/regenerate any rows where you cannot confidently determine
the correct label; ensure every JSON object’s "ground_truth" matches the parsed
values from its "messages" content before running run_eval().

Comment on lines +84 to +88
"os.environ[\"FIREWORKS_API_KEY\"] = FIREWORKS_KEY\n",
"\n",
"assert FIREWORKS_KEY, \"Missing FIREWORKS_API_KEY\"\n",
"assert OPENROUTER_KEY, \"Missing OPENROUTER_API_KEY\"\n",
"assert ACCOUNT_ID, \"Missing FIREWORKS_ACCOUNT_ID\"\n",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import os
try:
    os.environ["FIREWORKS_API_KEY"] = None
except Exception as exc:
    print(type(exc).__name__)
    print(exc)
PY

Repository: patchy631/ai-engineering-hub

Length of output: 109


Assert FIREWORKS_KEY is a non-None string before assigning os.environ["FIREWORKS_API_KEY"]

When FIREWORKS_API_KEY is unset, FIREWORKS_KEY becomes None; writing it into os.environ throws TypeError: str expected, not NoneType before the notebook reaches assert FIREWORKS_KEY, "Missing FIREWORKS_API_KEY". Move/replace the assert so it runs before the os.environ[...] = FIREWORKS_KEY assignment (and similarly ensure other injected env values are not None).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/grpo_json_extraction.ipynb` around lines 84 - 88, The
code currently assigns os.environ["FIREWORKS_API_KEY"] = FIREWORKS_KEY before
checking FIREWORKS_KEY is non-None, which can raise TypeError; move or replace
the assertion so that assert FIREWORKS_KEY, "Missing FIREWORKS_API_KEY" (and
similar asserts for OPENROUTER_KEY and ACCOUNT_ID) run before any
os.environ[...] = ... assignments, and ensure the checks verify the values are
non-empty strings (not None) before injecting them into environment variables
(refer to FIREWORKS_KEY, OPENROUTER_KEY, ACCOUNT_ID and the os.environ[...]
assignments).

Comment on lines +257 to +288
"def run_eval(client, model_id: str, label: str) -> float:\n",
" scores = []\n",
" for i, entry in enumerate(eval_data):\n",
" messages = [m for m in entry[\"messages\"] if m[\"role\"] != \"assistant\"]\n",
" # Suppress thinking mode for structured output\n",
" messages = [\n",
" {**m, \"content\": m[\"content\"] + \" /no-think\"}\n",
" if m[\"role\"] == \"system\" else m\n",
" for m in messages\n",
" ]\n",
" try:\n",
" response = client.chat.completions.create(\n",
" model=model_id,\n",
" messages=messages,\n",
" temperature=0.0,\n",
" max_tokens=512,\n",
" )\n",
" content = response.choices[0].message.content\n",
" if \"</think>\" in content:\n",
" content = content.split(\"</think>\")[-1].strip()\n",
" s = score(content)\n",
" except Exception as e:\n",
" print(f\" Error on prompt {i}: {e}\")\n",
" s = 0.0\n",
" scores.append(s)\n",
" if (i + 1) % 10 == 0:\n",
" acc = sum(1 for s in scores if s == 1.0) / len(scores)\n",
" print(f\" [{label}] {i+1}/{len(eval_data)} running: {acc:.1%}\")\n",
"\n",
" accuracy = sum(1 for s in scores if s == 1.0) / len(scores)\n",
" print(f\"\\n[{label}] Schema-valid: {accuracy:.1%} ({sum(1 for s in scores if s == 1.0)}/{len(scores)})\")\n",
" return accuracy\n",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use ground_truth for task accuracy, not just schema validity.

run_eval() never reads entry["ground_truth"], so the reported 62%/82% numbers only measure whether the output is valid JSON for the schema. A model can return the same well-formed placeholder object for every prompt and still score perfectly here.

📌 Suggested direction
 def run_eval(client, model_id: str, label: str) -> float:
-    scores = []
+    schema_scores = []
+    exact_matches = 0
     for i, entry in enumerate(eval_data):
         messages = [m for m in entry["messages"] if m["role"] != "assistant"]
         # Suppress thinking mode for structured output
         messages = [
             {**m, "content": m["content"] + " /no-think"}
             if m["role"] == "system" else m
             for m in messages
         ]
         try:
             response = client.chat.completions.create(
                 model=model_id,
                 messages=messages,
                 temperature=0.0,
                 max_tokens=512,
             )
             content = response.choices[0].message.content
             if "</think>" in content:
                 content = content.split("</think>")[-1].strip()
-            s = score(content)
+            s = score(content)
+            if s == 1.0 and json.loads(content) == entry["ground_truth"]:
+                exact_matches += 1
         except Exception as e:
             print(f"  Error on prompt {i}: {e}")
             s = 0.0
-        scores.append(s)
+        schema_scores.append(s)
         if (i + 1) % 10 == 0:
-            acc = sum(1 for s in scores if s == 1.0) / len(scores)
-            print(f"  [{label}] {i+1}/{len(eval_data)} running: {acc:.1%}")
+            acc = exact_matches / len(schema_scores)
+            print(f"  [{label}] {i+1}/{len(eval_data)} running exact-match: {acc:.1%}")
 
-    accuracy = sum(1 for s in scores if s == 1.0) / len(scores)
-    print(f"\n[{label}] Schema-valid: {accuracy:.1%} ({sum(1 for s in scores if s == 1.0)}/{len(scores)})")
-    return accuracy
+    schema_valid = sum(1 for s in schema_scores if s == 1.0) / len(schema_scores)
+    exact_accuracy = exact_matches / len(schema_scores)
+    print(f"\n[{label}] Schema-valid: {schema_valid:.1%}")
+    print(f"[{label}] Exact-match:  {exact_accuracy:.1%} ({exact_matches}/{len(schema_scores)})")
+    return exact_accuracy
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/grpo_json_extraction.ipynb` around lines 257 - 288,
run_eval currently only measures schema validity (via score/content) and never
uses entry["ground_truth"]; update run_eval to compute task accuracy against the
true labels by parsing the model output JSON and comparing it to
entry["ground_truth"]. Specifically, inside run_eval (and where you call score),
first validate/parse content into a dict (handle the existing </think> strip and
JSON exceptions), then call a new or extended comparator (e.g., score or
compare_to_ground_truth) that accepts (parsed_output, entry["ground_truth"]) and
returns 1.0 for a correct match and 0.0 otherwise; append that result to scores,
keep schema-valid logging if desired, and change the printed/returned accuracy
to be based on matches to entry["ground_truth"] rather than just schema
validity. Ensure you reference run_eval, eval_data, entry["ground_truth"], and
score/compare_to_ground_truth when implementing.

Comment on lines +548 to +552
"_original_run = asyncio.run\n",
"def _patched_run(coro, **kwargs):\n",
" loop = asyncio.get_event_loop()\n",
" return loop.run_until_complete(coro)\n",
"asyncio.run = _patched_run\n",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore asyncio.run in a finally block.

If main(cfg) throws, Line 601 never executes and the notebook leaves a globally patched asyncio.run behind for later cells. That makes subsequent failures hard to reason about.

#!/bin/bash
python - <<'PY'
import json
from pathlib import Path

nb = json.loads(Path("grpo-finetuning-qwen3/grpo_json_extraction.ipynb").read_text())
for cell in nb["cells"]:
    src = "".join(cell.get("source", []))
    if "_original_run = asyncio.run" in src:
        print(src)
        break
PY

Also applies to: 599-601

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/grpo_json_extraction.ipynb` around lines 548 - 552, The
notebook globally replaces asyncio.run with _patched_run but never restores it
if main(cfg) raises; wrap the patch and the call to main(cfg) in a try/finally:
save the original in _original_run, assign asyncio.run = _patched_run, call
main(cfg) inside try, and always restore asyncio.run = _original_run in the
finally block (referencing the existing symbols _original_run, _patched_run,
asyncio.run and the main(cfg) invocation) so later cells see the original
asyncio.run regardless of exceptions.

Comment on lines +564 to +567
"def invoice_reward(completion: str, row: dict) -> float:\n",
" return score(completion)\n",
"\n",
"rl_loop.reward_fn = invoice_reward\n",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The GRPO reward never checks extraction correctness.

invoice_reward() ignores row and only rewards schema-conforming JSON. Combined with the current training rows, that means the policy can optimize for output shape alone instead of learning to extract the vendor, date, amount, and currency from the invoice text.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/grpo_json_extraction.ipynb` around lines 564 - 567, The
reward function invoice_reward currently ignores the ground-truth row and only
calls score(completion), so the policy can game output shape; update
invoice_reward to parse the completion JSON, extract fields (vendor, date,
amount, currency), compare them against the corresponding values in row
(normalizing formats for date/amount/currency), compute a numeric reward that
gives full credit for exact matches and partial credit for close/normalized
matches and penalties for missing/invalid fields, and return that float; replace
the current rl_loop.reward_fn binding with this new invoice_reward and ensure
parsing errors return a low reward rather than a crash.

Comment on lines +19 to +40
**Clone the Fireworks cookbook**:
```bash
git clone https://github.com/fw-ai/cookbook.git
```

**Install Dependencies**:

Ensure you have Python 3.10 or later installed.

```bash
uv venv
source .venv/bin/activate
uv pip install python-dotenv jsonschema openai fireworks-ai matplotlib
uv pip install -e "cookbook/training[training]"
uv pip install eval-protocol nest_asyncio
```

Select the virtual environment as the kernel in the notebook.

**Run the notebook**:

Open and run `grpo_json_extraction.ipynb` end-to-end. The notebook covers:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Document the required notebook working directory.

These steps never say to start Jupyter from grpo-finetuning-qwen3, but the notebook resolves ./cookbook/training, ./train_prompts.jsonl, and ./eval_prompts.jsonl with process-relative paths. Launching the notebook from the repo root will make the setup and data-loading cells fail immediately.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/README.md` around lines 19 - 40, Update the README to
explicitly instruct users to start Jupyter (or run the notebook) from the
grpo-finetuning-qwen3 directory so the notebook’s relative imports resolve;
mention that grpo_json_extraction.ipynb expects process-relative paths to
./cookbook/training, ./train_prompts.jsonl, and ./eval_prompts.jsonl and provide
the exact command to change into the directory (e.g., cd grpo-finetuning-qwen3)
or an alternative note about launching Jupyter from the repo root after
adjusting those paths in the notebook.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
grpo-finetuning-qwen3/README.md (1)

19-22: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Specify the cookbook clone location relative to the working directory.

Line 21 clones the cookbook repository, but doesn't specify where to clone it. Line 32 then references cookbook/training[training], assuming the cookbook directory exists in the current working directory.

If users clone the cookbook while in the repository root but run the notebook from grpo-finetuning-qwen3/, the relative path ./cookbook/training in the notebook won't resolve correctly.

📝 Suggested fix

Either instruct users to clone into the grpo-finetuning-qwen3 directory, or adjust the install command to reference the correct relative path:

Option 1: Clone inside grpo-finetuning-qwen3

 **Clone the Fireworks cookbook**:
 ```bash
+cd grpo-finetuning-qwen3
 git clone https://github.com/fw-ai/cookbook.git

**Option 2: Adjust the pip install path**
```diff
 ```bash
 uv venv
 source .venv/bin/activate
 uv pip install python-dotenv jsonschema openai fireworks-ai matplotlib
-uv pip install -e "cookbook/training[training]"
+uv pip install -e "../cookbook/training[training]"
 uv pip install eval-protocol nest_asyncio
</details>


Also applies to: 32-32

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @grpo-finetuning-qwen3/README.md around lines 19 - 22, The README instructs
cloning the cookbook without a specific target, causing the later pip install -e
"cookbook/training[training]" to fail if the repo is cloned elsewhere; update
instructions to either (A) tell users to cd into grpo-finetuning-qwen3 before
cloning so the cookbook lands at ./cookbook (e.g., run cd grpo-finetuning-qwen3
then git clone https://github.com/fw-ai/cookbook.git), or (B) change the
editable install path in the setup commands to point to the parent folder where
users are likely to clone (replace pip install -e "cookbook/training[training]"
with pip install -e "../cookbook/training[training]"); ensure the README clearly
states which option you choose so the path used by the pip install matches where
the cookbook was cloned.


</details>

</blockquote></details>

</blockquote></details>
♻️ Duplicate comments (1)
grpo-finetuning-qwen3/README.md (1)

19-40: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clarify the working directory for setup and notebook execution.

The setup steps don't specify where to run these commands. The notebook resolves paths like ./train_prompts.jsonl, ./eval_prompts.jsonl, and ./cookbook/training relative to the process working directory, but the instructions never tell users to cd grpo-finetuning-qwen3 first.

Without this clarification, users starting from the repository root will encounter import failures and missing file errors when running the notebook.

📝 Suggested fix

Add an explicit working directory instruction before the setup steps:

 ## Setup and installations
 
+**Navigate to the project directory**:
+```bash
+cd grpo-finetuning-qwen3
+```
+
 **Get API Keys**:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/README.md` around lines 19 - 40, The README lacks an
explicit working-directory instruction so running the setup and the notebook
(which references files like ./train_prompts.jsonl, ./eval_prompts.jsonl and the
./cookbook/training package) from the repository root will fail; update
README.md to instruct users to first change into the project folder (e.g., cd
grpo-finetuning-qwen3) before running the listed setup commands and launching
grpo_json_extraction.ipynb so the relative paths and imports resolve correctly.
🧹 Nitpick comments (1)
grpo-finetuning-qwen3/README.md (1)

66-71: 💤 Low value

Consider providing a more concrete example for the --output-id parameter.

Line 70 uses <your-model-id> as a placeholder, which requires users to invent their own identifier. Providing a concrete example (e.g., invoice-extraction-v1) would make it easier for users to understand the expected format and get started quickly.

📝 Suggested enhancement
 python agent-skill/grpo-finetune/run_pipeline.py \
     --train ./train_prompts.jsonl \
     --eval  ./eval_prompts.jsonl \
     --task  invoice-extraction \
-    --output-id <your-model-id>
+    --output-id invoice-extraction-v1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/README.md` around lines 66 - 71, Replace the
placeholder "<your-model-id>" in the example command that runs
agent-skill/grpo-finetune/run_pipeline.py with a concrete example model id
(e.g., "invoice-extraction-v1") so users see the expected format for the
--output-id parameter; update the README example invocation to: python
agent-skill/grpo-finetune/run_pipeline.py --train ./train_prompts.jsonl --eval
./eval_prompts.jsonl --task invoice-extraction --output-id invoice-extraction-v1
and keep the placeholder mention elsewhere if you still want to show users they
can choose their own id.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/agent_demo.py`:
- Around line 97-132: The loop can divide by zero when raw_docs is empty (total
== 0); update the logic after reading raw_docs (where total and passed are set)
to check if total == 0 and return/exit early with a clear message instead of
proceeding to the loop and final percentage computation—e.g., detect total == 0
and print a user-friendly "no documents found" message (referencing raw_docs,
total, passed, and filepath) and skip the pct calculation and final summary so
pct = round(passed / total * 100) is never executed on zero.
- Around line 83-84: The current validate(result: dict) uses REQUIRED_FIELDS and
only checks for non-empty values, so replace its body to call the canonical
schema validator used by the training/eval pipeline (i.e., reuse the same schema
validation function/class used for training/eval rather than ad-hoc
REQUIRED_FIELDS checks), import that validator into this module, and ensure
validate(result) returns the validator's boolean result so numeric 0 is allowed,
types are enforced, and extra/missing fields are handled consistently with the
reward contract.

In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/generate_reward.py`:
- Around line 43-53: The SELF_TESTS loop can raise on malformed entries or when
mod.score(...) throws, so change the loop over getattr(mod, "SELF_TESTS", []) in
generate_reward.py to catch exceptions per-test: inside the for i, (completion,
row, expected) in enumerate(...) wrap the unpack and call to
mod.score(completion, row) in a try/except Exception as e, mark the test as
failed (increment failures), and print a clear failure line that includes the
exception message/type instead of letting it traceback; keep the existing
numeric tolerance check when score returns normally and ensure the existing
failures check (printing "FAIL ... self-test(s) failed" and returning 1) still
runs after the loop.

In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py`:
- Around line 55-61: The completion content handling currently only strips
"</think>" before scoring; update the logic in run_pipeline.py where the
variable content is prepared (after resp.choices[0].message.content and before
calling score_fn) to also remove Markdown fenced code blocks by trimming a
leading triple-backtick fence (optionally followed by "json") and a trailing
triple-backtick from content.strip(), then pass the cleaned string to score_fn
so fenced JSON is scored correctly; refer to the variables/content handling
around model_id, resp, content and the call to score_fn when making the change.
- Around line 181-184: The subprocess call uses a hard-coded relative path;
change it to resolve agent_demo.py relative to this module's file: import Path
from pathlib, compute script_path = Path(__file__).resolve().parent /
"agent_demo.py" (since run_pipeline.py and agent_demo.py are siblings), and pass
str(script_path) to subprocess.run instead of the literal
"agent-skill/grpo-finetune/agent_demo.py"; keep the existing argv list
(sys.executable, str(invoices_path), "--deployment", deployment).
- Around line 107-124: The try/except around fw.datasets.get(...) currently
catches all exceptions and wrongly treats any error as "dataset not found";
change the except Exception to catch only the specific fireworks.NotFoundError
(e.g., except fireworks.NotFoundError as e:) so you only fall back to
fw.datasets.create(...) and fw.datasets.upload(...) when the SDK reports a 404,
and for any other exception re-raise or log and propagate it instead of
proceeding with create/upload; update references around fw.datasets.get,
fw.datasets.create, fw.datasets.upload and ensure the NotFoundError import or
fully-qualified name is used.

In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/SKILL.md`:
- Around line 94-95: The statement "The pipeline automatically runs the agent
demo on sample invoices at the end." is inaccurate; update the doc to state the
demo is conditional: describe that run_pipeline.py calls agent_demo.py only if
the invoices file exists and otherwise prints that the demo is skipped.
Reference the filenames run_pipeline.py and agent_demo.py and mention the
invoices file existence check so readers understand the exact conditional
behavior.

---

Outside diff comments:
In `@grpo-finetuning-qwen3/README.md`:
- Around line 19-22: The README instructs cloning the cookbook without a
specific target, causing the later pip install -e "cookbook/training[training]"
to fail if the repo is cloned elsewhere; update instructions to either (A) tell
users to cd into grpo-finetuning-qwen3 before cloning so the cookbook lands at
./cookbook (e.g., run cd grpo-finetuning-qwen3 then git clone
https://github.com/fw-ai/cookbook.git), or (B) change the editable install path
in the setup commands to point to the parent folder where users are likely to
clone (replace pip install -e "cookbook/training[training]" with pip install -e
"../cookbook/training[training]"); ensure the README clearly states which option
you choose so the path used by the pip install matches where the cookbook was
cloned.

---

Duplicate comments:
In `@grpo-finetuning-qwen3/README.md`:
- Around line 19-40: The README lacks an explicit working-directory instruction
so running the setup and the notebook (which references files like
./train_prompts.jsonl, ./eval_prompts.jsonl and the ./cookbook/training package)
from the repository root will fail; update README.md to instruct users to first
change into the project folder (e.g., cd grpo-finetuning-qwen3) before running
the listed setup commands and launching grpo_json_extraction.ipynb so the
relative paths and imports resolve correctly.

---

Nitpick comments:
In `@grpo-finetuning-qwen3/README.md`:
- Around line 66-71: Replace the placeholder "<your-model-id>" in the example
command that runs agent-skill/grpo-finetune/run_pipeline.py with a concrete
example model id (e.g., "invoice-extraction-v1") so users see the expected
format for the --output-id parameter; update the README example invocation to:
python agent-skill/grpo-finetune/run_pipeline.py --train ./train_prompts.jsonl
--eval ./eval_prompts.jsonl --task invoice-extraction --output-id
invoice-extraction-v1 and keep the placeholder mention elsewhere if you still
want to show users they can choose their own id.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 86f36cd1-c636-4c99-9412-dfcc70a4cbbb

📥 Commits

Reviewing files that changed from the base of the PR and between 0d30d20 and 897d1b3.

📒 Files selected for processing (6)
  • grpo-finetuning-qwen3/README.md
  • grpo-finetuning-qwen3/agent-skill/grpo-finetune/SKILL.md
  • grpo-finetuning-qwen3/agent-skill/grpo-finetune/agent_demo.py
  • grpo-finetuning-qwen3/agent-skill/grpo-finetune/generate_reward.py
  • grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py
  • grpo-finetuning-qwen3/invoices.txt
✅ Files skipped from review due to trivial changes (1)
  • grpo-finetuning-qwen3/invoices.txt

Comment on lines +83 to +84
def validate(result: dict) -> bool:
return all(result.get(f) not in (None, "", 0) for f in REQUIRED_FIELDS)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make demo validation match the training/eval schema.

This only checks for non-empty keys, so it can mark wrong types or extra fields as Schema valid, and it rejects amount: 0 even though the notebook schema accepts any JSON number. Reuse the same schema validation here so the demo's pass/fail matches the reward contract.

Suggested fix
 import argparse
 import json
 import os
 import re
 import sys
 import time
 
 from dotenv import load_dotenv
+from jsonschema import ValidationError, validate as validate_json
 from openai import OpenAI
 
 REQUIRED_FIELDS = {"vendor", "date", "amount", "currency"}
+SCHEMA = {
+    "type": "object",
+    "required": ["vendor", "date", "amount", "currency"],
+    "properties": {
+        "vendor": {"type": "string"},
+        "date": {"type": "string"},
+        "amount": {"type": "number"},
+        "currency": {"type": "string"},
+    },
+    "additionalProperties": False,
+}
@@
 def validate(result: dict) -> bool:
-    return all(result.get(f) not in (None, "", 0) for f in REQUIRED_FIELDS)
+    try:
+        validate_json(instance=result, schema=SCHEMA)
+        return True
+    except ValidationError:
+        return False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/agent_demo.py` around lines
83 - 84, The current validate(result: dict) uses REQUIRED_FIELDS and only checks
for non-empty values, so replace its body to call the canonical schema validator
used by the training/eval pipeline (i.e., reuse the same schema validation
function/class used for training/eval rather than ad-hoc REQUIRED_FIELDS
checks), import that validator into this module, and ensure validate(result)
returns the validator's boolean result so numeric 0 is allowed, types are
enforced, and extra/missing fields are handled consistently with the reward
contract.

Comment on lines +97 to +132
with open(filepath) as f:
raw_docs = [line.strip() for line in f if line.strip()]

total = len(raw_docs)
passed = 0

for i, doc in enumerate(raw_docs, 1):
print(f"\n{GRAY}#{i} of {total}{RESET}")
print(f"{CYAN}{doc}{RESET}")
print(DIVIDER_MID)

t0 = time.time()
try:
result = extract(client, deployment, doc)
elapsed = round(time.time() - t0, 2)
valid = validate(result)
if valid:
passed += 1

for field in REQUIRED_FIELDS:
val = result.get(field, "—")
print(f" {GRAY}{field:<10}{RESET} {val}")

print()
if valid:
print(f" {GREEN}✓ Schema valid{RESET} {GRAY}{elapsed}s{RESET}")
else:
print(f" {RED}✗ Schema mismatch{RESET} {GRAY}{elapsed}s{RESET}")

except Exception as e:
print(f" {RED}✗ Error: {e}{RESET}")

print(DIVIDER)

pct = round(passed / total * 100)
print(f"\n {BOLD}Results{RESET} {GREEN}{passed}/{total} valid{RESET} Schema match: {GREEN}{pct}%{RESET}\n")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle an empty invoices.txt before computing percentages.

If the file is empty or only whitespace, total becomes 0 and Line 131 raises ZeroDivisionError. Return early with a clear message instead of crashing after the banner.

🧰 Tools
🪛 Ruff (0.15.15)

[warning] 126-126: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/agent_demo.py` around lines
97 - 132, The loop can divide by zero when raw_docs is empty (total == 0);
update the logic after reading raw_docs (where total and passed are set) to
check if total == 0 and return/exit early with a clear message instead of
proceeding to the loop and final percentage computation—e.g., detect total == 0
and print a user-friendly "no documents found" message (referencing raw_docs,
total, passed, and filepath) and skip the pct calculation and final summary so
pct = round(passed / total * 100) is never executed on zero.

Comment on lines +43 to +53
for i, (completion, row, expected) in enumerate(getattr(mod, "SELF_TESTS", [])):
got = mod.score(completion, row)
ok = abs(float(got) - float(expected)) < 1e-6
print(f" [{'ok ' if ok else 'FAIL'}] test {i}: got {got} expected {expected}")
if not ok:
failures += 1
if failures:
print(f"FAIL {failures} self-test(s) failed")
return 1
tests = len(getattr(mod, "SELF_TESTS", []))
print("PASS" + (f" {tests} self-tests green" if tests else " (add SELF_TESTS to pin the task)"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail self-tests cleanly instead of letting them traceback.

A malformed SELF_TESTS entry, or a score() exception on one of those cases, currently escapes this validator as an unhandled exception. That defeats the point of validating reward.py before GPU spend; this path should print FAIL and return 1 just like the probe checks above.

Suggested fix
     failures = 0
-    for i, (completion, row, expected) in enumerate(getattr(mod, "SELF_TESTS", [])):
-        got = mod.score(completion, row)
+    for i, test_case in enumerate(getattr(mod, "SELF_TESTS", [])):
+        try:
+            completion, row, expected = test_case
+            got = mod.score(completion, row)
+        except Exception as e:
+            print(f"FAIL  self-test {i} crashed: {e}")
+            return 1
         ok = abs(float(got) - float(expected)) < 1e-6
         print(f"  [{'ok  ' if ok else 'FAIL'}] test {i}: got {got}  expected {expected}")
         if not ok:
             failures += 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/generate_reward.py` around
lines 43 - 53, The SELF_TESTS loop can raise on malformed entries or when
mod.score(...) throws, so change the loop over getattr(mod, "SELF_TESTS", []) in
generate_reward.py to catch exceptions per-test: inside the for i, (completion,
row, expected) in enumerate(...) wrap the unpack and call to
mod.score(completion, row) in a try/except Exception as e, mark the test as
failed (increment failures), and print a clear failure line that includes the
exception message/type instead of letting it traceback; keep the existing
numeric tolerance check when score returns normally and ensure the existing
failures check (printing "FAIL ... self-test(s) failed" and returning 1) still
runs after the loop.

Comment on lines +55 to +61
try:
resp = client.chat.completions.create(model=model_id, messages=msgs,
temperature=0.0, max_tokens=512)
content = resp.choices[0].message.content
if "</think>" in content:
content = content.split("</think>")[-1].strip()
s = score_fn(content, row)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Strip fenced JSON in eval before scoring it.

agent_demo.py already removes both </think> and Markdown ```json fences, but run_eval() only removes ``. A completion wrapped in fences gets scored as `0.0` here even when the JSON body is otherwise valid, so the reported accuracy can be artificially low.

Suggested fix
 import argparse
 import json
 import os
+import re
 import subprocess
 import sys
 import time
@@
             content = resp.choices[0].message.content
             if "</think>" in content:
                 content = content.split("</think>")[-1].strip()
+            content = re.sub(r"^```(?:json)?\s*|\s*```$", "", content.strip())
             s = score_fn(content, row)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py` around lines
55 - 61, The completion content handling currently only strips "</think>" before
scoring; update the logic in run_pipeline.py where the variable content is
prepared (after resp.choices[0].message.content and before calling score_fn) to
also remove Markdown fenced code blocks by trimming a leading triple-backtick
fence (optionally followed by "json") and a trailing triple-backtick from
content.strip(), then pass the cleaned string to score_fn so fenced JSON is
scored correctly; refer to the variables/content handling around model_id, resp,
content and the call to score_fn when making the change.

Comment on lines +107 to +124
try:
ds = fw.datasets.get(dataset_id=dataset_id)
if ds.state == "READY":
checkpoint(f"Dataset ready · {row_count} prompts")
else:
print("Waiting for dataset", end="", flush=True)
while True:
ds = fw.datasets.get(dataset_id=dataset_id)
if ds.state == "READY":
print()
break
print(".", end="", flush=True)
time.sleep(3)
checkpoint(f"Dataset ready · {row_count} prompts")
except Exception:
print(f"Uploading {row_count} prompts...", flush=True)
fw.datasets.create(dataset_id=dataset_id, dataset={"exampleCount": str(row_count)})
fw.datasets.upload(dataset_id=dataset_id, file=train_path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What exception does the Fireworks Python SDK raise when fw.datasets.get(dataset_id=...) is called for a nonexistent dataset, and how should callers distinguish that from auth or transport failures?

💡 Result:

When the Fireworks Python SDK encounters a nonexistent dataset during a call to fw.datasets.get(dataset_id=...), it raises a fireworks.NotFoundError exception [1][2]. To distinguish this from other failures, you should handle the hierarchical exception structure provided by the SDK [1][2]: 1. NotFoundError: This is a subclass of fireworks.APIStatusError that specifically indicates the requested resource was not found (HTTP 404) [1][2]. 2. APIStatusError: This class is raised for any non-success API response (4xx or 5xx) [1][2]. It contains status_code and response properties, which allow you to programmatically inspect the HTTP status code (e.g., 401 for authentication, 403 for permission denied) [1][2]. 3. APIConnectionError: This class is raised when the library cannot connect to the API, such as during network failures or timeouts [1][2]. All of these exceptions inherit from the base fireworks.APIError class, which can be used to catch any SDK-related error [1][2]. Example handling pattern: try: dataset = client.datasets.get(dataset_id=dataset_id) except fireworks.NotFoundError: print("The requested dataset does not exist.") except fireworks.APIStatusError as e: if e.status_code == 401: print("Authentication failure.") else: print(f"API request failed with status code {e.status_code}") except fireworks.APIConnectionError: print("Failed to connect to the Fireworks API.") except fireworks.APIError: print("An unexpected API error occurred.")

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the referenced file around the snippet and its imports/SDK usage
sed -n '1,220p' grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py | nl -ba

Repository: patchy631/ai-engineering-hub

Length of output: 115


🌐 Web query:

Fireworks Python SDK exception class for nonexistent dataset datasets.get NotFoundError APIStatusError APIConnectionError

💡 Result:

In the Fireworks Python SDK, when a requested resource—such as a dataset—does not exist, the API returns a 404 status code, which triggers a fireworks.NotFoundError exception [1][2][3]. The Fireworks Python SDK uses a structured hierarchy for error handling, where all exceptions inherit from the base fireworks.APIError class [1][2][3]. The specific exception classes relevant to your query are defined as follows: 1. NotFoundError: Raised when the API returns a 404 status code, indicating that the requested resource (e.g., a specific dataset) could not be found [1][2][4]. 2. APIStatusError: This is the base class for all non-success status codes (4xx or 5xx responses) [1][2][3]. It contains properties for status_code and response, allowing you to inspect the details of the error [1][2][3]. 3. APIConnectionError: This exception is raised when the library is unable to connect to the API, such as in cases of network connectivity issues or timeouts (where no HTTP status code is received) [1][2][4]. Other common exception classes include: - BadRequestError (400) [1][2] - AuthenticationError (401) [1][2] - PermissionDeniedError (403) [1][2] - RateLimitError (429) [1][2] - InternalServerError (>=500) [1][2] When writing code to handle these, you can catch them specifically to implement appropriate retry or logging logic [1][2]. For example: import fireworks try: dataset = client.datasets.get(dataset_id="nonexistent-id") except fireworks.NotFoundError: print("The requested dataset was not found.") except fireworks.APIConnectionError: print("Failed to connect to the Fireworks API.") except fireworks.APIStatusError as e: print(f"API returned status {e.status_code}: {e.response}")

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Print the relevant section with line numbers (cat -n works without nl)
cat -n grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py | sed -n '70,170p'

# Check how `fw` is instantiated / what SDK is imported
rg -n "import fireworks|fw\s*=|fireworks\." grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py || true

Repository: patchy631/ai-engineering-hub

Length of output: 4957


Limit the dataset-upload fallback to fireworks.NotFoundError (404)

In grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py (lines 107-124), the except Exception: around fw.datasets.get(dataset_id=...) will also swallow auth/permission failures and transport/transient API errors—then it incorrectly proceeds to fw.datasets.create(...) + fw.datasets.upload(...) as if the dataset were missing. The SDK raises fireworks.NotFoundError for a nonexistent dataset; narrow the fallback to that exception only and re-raise/log everything else.

🧰 Tools
🪛 Ruff (0.15.15)

[warning] 121-121: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py` around lines
107 - 124, The try/except around fw.datasets.get(...) currently catches all
exceptions and wrongly treats any error as "dataset not found"; change the
except Exception to catch only the specific fireworks.NotFoundError (e.g.,
except fireworks.NotFoundError as e:) so you only fall back to
fw.datasets.create(...) and fw.datasets.upload(...) when the SDK reports a 404,
and for any other exception re-raise or log and propagate it instead of
proceeding with create/upload; update references around fw.datasets.get,
fw.datasets.create, fw.datasets.upload and ensure the NotFoundError import or
fully-qualified name is used.

Comment on lines +181 to +184
subprocess.run([sys.executable,
"agent-skill/grpo-finetune/agent_demo.py",
str(invoices_path),
"--deployment", deployment])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Resolve agent_demo.py from __file__, not the caller's working directory.

The subprocess target is hard-coded as agent-skill/grpo-finetune/agent_demo.py, which only works when the current working directory is grpo-finetuning-qwen3. Running this script from the repo root or from its own directory breaks the demo step.

Suggested fix
     if invoices_path.exists():
         banner("STEP 4 · sample inference")
         print("Running the fine-tuned model on sample invoices...\n", flush=True)
         time.sleep(1)
+        demo_script = Path(__file__).resolve().parent / "agent_demo.py"
         subprocess.run([sys.executable,
-                        "agent-skill/grpo-finetune/agent_demo.py",
+                        str(demo_script),
                         str(invoices_path),
                         "--deployment", deployment])
🧰 Tools
🪛 Ruff (0.15.15)

[error] 181-181: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py` around lines
181 - 184, The subprocess call uses a hard-coded relative path; change it to
resolve agent_demo.py relative to this module's file: import Path from pathlib,
compute script_path = Path(__file__).resolve().parent / "agent_demo.py" (since
run_pipeline.py and agent_demo.py are siblings), and pass str(script_path) to
subprocess.run instead of the literal "agent-skill/grpo-finetune/agent_demo.py";
keep the existing argv list (sys.executable, str(invoices_path), "--deployment",
deployment).

Comment on lines +94 to +95
The pipeline automatically runs the agent demo on sample invoices at the end.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document the demo step as conditional.

run_pipeline.py only invokes agent_demo.py when the invoices file exists; otherwise it prints that the demo is being skipped. This line currently overpromises the behavior the user will see.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/SKILL.md` around lines 94 -
95, The statement "The pipeline automatically runs the agent demo on sample
invoices at the end." is inaccurate; update the doc to state the demo is
conditional: describe that run_pipeline.py calls agent_demo.py only if the
invoices file exists and otherwise prints that the demo is skipped. Reference
the filenames run_pipeline.py and agent_demo.py and mention the invoices file
existence check so readers understand the exact conditional behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant