add GRPO fine-tuning notebook for JSON invoice extraction using Fireworks Training API#243
add GRPO fine-tuning notebook for JSON invoice extraction using Fireworks Training API#243namanvirk18 wants to merge 2 commits into
Conversation
… Fireworks Training API
📝 WalkthroughWalkthroughThis PR adds an end-to-end GRPO fine-tuning workflow for JSON invoice extraction (notebook, training/eval datasets, reward scorer), plus agent-skill docs, a reward validator, orchestration scripts, a CLI demo, and sample invoice inputs. ChangesGRPO Training Pipeline
Agent Skill & Pipeline
Sequence DiagramsequenceDiagram
participant Notebook
participant Fireworks as Fireworks (Training & Inference)
participant OpenRouter as OpenRouter (Inference)
participant Local as Local (Reward Scoring)
Notebook->>Fireworks: Upload train_prompts.jsonl dataset
Fireworks-->>Notebook: Dataset ready
Notebook->>OpenRouter: Baseline eval on eval_prompts
OpenRouter-->>Notebook: Base model outputs
Notebook->>Local: Score outputs with reward function
Local-->>Notebook: Schema-valid accuracy
Notebook->>Fireworks: Execute GRPO training loop
Fireworks->>Fireworks: Train with invoice_reward function
Fireworks-->>Notebook: Deployment endpoint ready
Notebook->>Fireworks: Eval fine-tuned model
Fireworks-->>Notebook: Fine-tuned outputs
Notebook->>Local: Score outputs
Local-->>Notebook: Fine-tuned accuracy
Notebook->>OpenRouter: Eval GPT-4.1 baseline
OpenRouter-->>Notebook: GPT-4.1 outputs
Notebook->>Local: Score outputs
Local-->>Notebook: GPT-4.1 accuracy
Notebook->>Notebook: Generate & save eval_chart.png
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@grpo-finetuning-qwen3/eval_prompts.jsonl`:
- Around line 1-50: The eval_prompts.jsonl file contains corrupted ground_truth
labels that don't match the prompt text inside each "messages" array (e.g.,
amounts/currencies in the "user" content disagree with the "ground_truth"
object); update each record's ground_truth fields (vendor, date, amount,
currency) to exactly match the invoice text in the corresponding "messages" ->
user content (fix entries like the first record where "654 GBP" should be
amount: 654 and currency: "GBP", the sixth record that incorrectly adds "GBP",
and the 14th that lists 130 JPY but has 130.5), or remove/regenerate any rows
where you cannot confidently determine the correct label; ensure every JSON
object’s "ground_truth" matches the parsed values from its "messages" content
before running run_eval().
In `@grpo-finetuning-qwen3/grpo_json_extraction.ipynb`:
- Around line 564-567: The reward function invoice_reward currently ignores the
ground-truth row and only calls score(completion), so the policy can game output
shape; update invoice_reward to parse the completion JSON, extract fields
(vendor, date, amount, currency), compare them against the corresponding values
in row (normalizing formats for date/amount/currency), compute a numeric reward
that gives full credit for exact matches and partial credit for close/normalized
matches and penalties for missing/invalid fields, and return that float; replace
the current rl_loop.reward_fn binding with this new invoice_reward and ensure
parsing errors return a low reward rather than a crash.
- Around line 257-288: run_eval currently only measures schema validity (via
score/content) and never uses entry["ground_truth"]; update run_eval to compute
task accuracy against the true labels by parsing the model output JSON and
comparing it to entry["ground_truth"]. Specifically, inside run_eval (and where
you call score), first validate/parse content into a dict (handle the existing
</think> strip and JSON exceptions), then call a new or extended comparator
(e.g., score or compare_to_ground_truth) that accepts (parsed_output,
entry["ground_truth"]) and returns 1.0 for a correct match and 0.0 otherwise;
append that result to scores, keep schema-valid logging if desired, and change
the printed/returned accuracy to be based on matches to entry["ground_truth"]
rather than just schema validity. Ensure you reference run_eval, eval_data,
entry["ground_truth"], and score/compare_to_ground_truth when implementing.
- Around line 548-552: The notebook globally replaces asyncio.run with
_patched_run but never restores it if main(cfg) raises; wrap the patch and the
call to main(cfg) in a try/finally: save the original in _original_run, assign
asyncio.run = _patched_run, call main(cfg) inside try, and always restore
asyncio.run = _original_run in the finally block (referencing the existing
symbols _original_run, _patched_run, asyncio.run and the main(cfg) invocation)
so later cells see the original asyncio.run regardless of exceptions.
- Around line 84-88: The code currently assigns os.environ["FIREWORKS_API_KEY"]
= FIREWORKS_KEY before checking FIREWORKS_KEY is non-None, which can raise
TypeError; move or replace the assertion so that assert FIREWORKS_KEY, "Missing
FIREWORKS_API_KEY" (and similar asserts for OPENROUTER_KEY and ACCOUNT_ID) run
before any os.environ[...] = ... assignments, and ensure the checks verify the
values are non-empty strings (not None) before injecting them into environment
variables (refer to FIREWORKS_KEY, OPENROUTER_KEY, ACCOUNT_ID and the
os.environ[...] assignments).
In `@grpo-finetuning-qwen3/README.md`:
- Around line 19-40: Update the README to explicitly instruct users to start
Jupyter (or run the notebook) from the grpo-finetuning-qwen3 directory so the
notebook’s relative imports resolve; mention that grpo_json_extraction.ipynb
expects process-relative paths to ./cookbook/training, ./train_prompts.jsonl,
and ./eval_prompts.jsonl and provide the exact command to change into the
directory (e.g., cd grpo-finetuning-qwen3) or an alternative note about
launching Jupyter from the repo root after adjusting those paths in the
notebook.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ef7d615e-aabd-43f1-be50-ed875fc1a589
⛔ Files ignored due to path filters (1)
grpo-finetuning-qwen3/eval_chart.pngis excluded by!**/*.png
📒 Files selected for processing (5)
grpo-finetuning-qwen3/.env.examplegrpo-finetuning-qwen3/README.mdgrpo-finetuning-qwen3/eval_prompts.jsonlgrpo-finetuning-qwen3/grpo_json_extraction.ipynbgrpo-finetuning-qwen3/train_prompts.jsonl
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Please find invoice from Acme Corp dated April 15, 2024. Grand total: 654 GBP.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Acme Corp", "date": "2024-04-15", "amount": 653.75, "currency": "GBP"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Outstanding payment to Crestline Distributors, due as of 05/02/2024, totalling \u20ac7,999.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Crestline Distributors", "date": "2024-05-02", "amount": 7999.0, "currency": "EUR"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Redwood Staffing Inc on 09/17/2024. Amount due: 3,696 GBP.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Redwood Staffing Inc", "date": "2024-09-17", "amount": 3695.99, "currency": "GBP"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Nexus Solutions Pvt Ltd. Invoice date: Jan 10 2023. Total payable: $69,552.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Nexus Solutions Pvt Ltd", "date": "2023-01-10", "amount": 69552.5, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Please find invoice from Global Logistics Ltd dated Apr 6 2023. Grand total: $1,043.75.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Global Logistics Ltd", "date": "2023-04-06", "amount": 1043.75, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Accounts payable entry \u2014 Nexus Solutions Pvt Ltd, 16/10/2023, 16731.25.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Nexus Solutions Pvt Ltd", "date": "2023-10-16", "amount": 16731.25, "currency": "GBP"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Bill from Sterling Parts & Co, dated 22/02/2023, total 83,860 GBP.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Sterling Parts & Co", "date": "2023-02-22", "amount": 83860.0, "currency": "GBP"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Ironclad Manufacturing on 18/12/2024. Amount due: 825.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Ironclad Manufacturing", "date": "2024-12-18", "amount": 825.99, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice\nFrom: Clearwater Utilities\nIssued: Feb 7 2023\nAmount owed: 9,094 EUR\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Clearwater Utilities", "date": "2023-02-07", "amount": 9094.5, "currency": "EUR"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Sterling Parts & Co on Dec 7 2023. Amount due: $55,158.00 USD.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Sterling Parts & Co", "date": "2023-12-07", "amount": 55158.0, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Please find invoice from Pinnacle Hardware Co dated 04/23/2023. Grand total: 78,517 JPY.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Pinnacle Hardware Co", "date": "2023-04-23", "amount": 78517.0, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "We have received a bill from Global Logistics Ltd. Date of service: 2023-02-05. Invoice total: 3,621.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Global Logistics Ltd", "date": "2023-02-05", "amount": 3621.5, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Outstanding payment to Apex Print Works, due as of 1 October 2024, totalling 69.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Apex Print Works", "date": "2024-10-01", "amount": 69.5, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Clearwater Utilities on 02/24/2024. Amount due: 130 JPY.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Clearwater Utilities", "date": "2024-02-24", "amount": 130.5, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Vendor: BlueSky Rentals\nDate: 10/06/2024\nTotal: \u00a3890.75\n\nExtract the fields and return valid JSON only."}], "ground_truth": {"vendor": "BlueSky Rentals", "date": "2024-06-10", "amount": 890.75, "currency": "GBP"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "We have received a bill from Apex Print Works. Date of service: 12/08/2023. Invoice total: 392 JPY.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Apex Print Works", "date": "2023-08-12", "amount": 391.99, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Bright Media LLC, issue date 12/06/2023, for the amount of 580.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Bright Media LLC", "date": "2023-12-06", "amount": 580.0, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Accounts payable entry \u2014 Nexus Solutions Pvt Ltd, 10/23/2024, \u20ac6,147.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Nexus Solutions Pvt Ltd", "date": "2024-10-23", "amount": 6147.0, "currency": "EUR"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Bill from Apex Print Works, dated 2024-06-25, total 7,211.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Apex Print Works", "date": "2024-06-25", "amount": 7211.0, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Vendor: Apex Print Works\nDate: 9 April 2023\nTotal: \u00a347,880.00\n\nExtract the fields and return valid JSON only."}], "ground_truth": {"vendor": "Apex Print Works", "date": "2023-04-09", "amount": 47880.0, "currency": "GBP"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Bill from Atlas Freight Services, dated 2023-04-10, total 85,914 JPY.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Atlas Freight Services", "date": "2023-04-10", "amount": 85914.5, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Pinnacle Hardware Co on December 19, 2023. Amount due: EUR 8,744.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Pinnacle Hardware Co", "date": "2023-12-19", "amount": 8744.99, "currency": "EUR"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Vendor: BlueSky Rentals\nDate: 09/13/2024\nTotal: 309.00 JPY\n\nExtract the fields and return valid JSON only."}], "ground_truth": {"vendor": "BlueSky Rentals", "date": "2024-09-13", "amount": 309.0, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice\nFrom: Nexus Solutions Pvt Ltd\nIssued: 09/17/2024\nAmount owed: 215.99 JPY\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Nexus Solutions Pvt Ltd", "date": "2024-09-17", "amount": 215.99, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Please find invoice from Ironclad Manufacturing dated 2023-07-28. Grand total: $8,138.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Ironclad Manufacturing", "date": "2023-07-28", "amount": 8138.5, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Cascade Engineering. Invoice date: May 21, 2023. Total payable: 18,142 GBP.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Cascade Engineering", "date": "2023-05-21", "amount": 18142.0, "currency": "GBP"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Accounts payable entry \u2014 Cascade Engineering, Jan 3 2023, $706.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Cascade Engineering", "date": "2023-01-03", "amount": 706.99, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Acme Corp, issue date July 2, 2024, for the amount of 6,927.75 EUR.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Acme Corp", "date": "2024-07-02", "amount": 6927.75, "currency": "EUR"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "We have received a bill from BlueSky Rentals. Date of service: Jan 10 2024. Invoice total: JPY 447.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "BlueSky Rentals", "date": "2024-01-10", "amount": 447.5, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Global Logistics Ltd, issue date 04/21/2024, for the amount of EUR 7,482.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Global Logistics Ltd", "date": "2024-04-21", "amount": 7482.0, "currency": "EUR"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Ironclad Manufacturing. Invoice date: 4 April 2023. Total payable: 32,740.00 EUR.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Ironclad Manufacturing", "date": "2023-04-04", "amount": 32740.0, "currency": "EUR"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Acme Corp, issue date 10/24/2023, for the amount of JPY 40,162.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Acme Corp", "date": "2023-10-24", "amount": 40162.99, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Accounts payable entry \u2014 Horizon Digital Agency, 10/09/2024, 93,238 USD.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Horizon Digital Agency", "date": "2024-09-10", "amount": 93238.0, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Vendor: Meridian Office Supplies\nDate: 09/14/2023\nTotal: JPY 9,683.00\n\nExtract the fields and return valid JSON only."}], "ground_truth": {"vendor": "Meridian Office Supplies", "date": "2023-09-14", "amount": 9683.0, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Ironclad Manufacturing on 28 March 2023. Amount due: 629.75 EUR.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Ironclad Manufacturing", "date": "2023-03-28", "amount": 629.75, "currency": "EUR"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice\nFrom: Cascade Engineering\nIssued: 21/12/2024\nAmount owed: 346.00\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Cascade Engineering", "date": "2024-12-21", "amount": 346.0, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Meridian Office Supplies. Invoice date: 12/10/2023. Total payable: $6,314.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Meridian Office Supplies", "date": "2023-12-10", "amount": 6314.0, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Global Logistics Ltd. Invoice date: February 17, 2024. Total payable: USD 727.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Global Logistics Ltd", "date": "2024-02-17", "amount": 727.5, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Vortex Software Ltd on Nov 4 2024. Amount due: JPY 779.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Vortex Software Ltd", "date": "2024-11-04", "amount": 779.0, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "We have received a bill from Ironclad Manufacturing. Date of service: Dec 28 2023. Invoice total: USD 2,374.25.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Ironclad Manufacturing", "date": "2023-12-28", "amount": 2374.25, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "We have received a bill from TechSupplies Inc. Date of service: 08/04/2024. Invoice total: 9807.25.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "TechSupplies Inc", "date": "2024-08-04", "amount": 9807.25, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Summit Consulting Group, issue date 2023-12-10, for the amount of GBP 67,422.25.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Summit Consulting Group", "date": "2023-12-10", "amount": 67422.25, "currency": "GBP"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Accounts payable entry \u2014 Clearwater Utilities, 9 February 2023, 398 JPY.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Clearwater Utilities", "date": "2023-02-09", "amount": 398.0, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Cascade Engineering. Invoice date: September 24, 2023. Total payable: $985.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Cascade Engineering", "date": "2023-09-24", "amount": 985.0, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Bill from Orbit Telecom LLC, dated 11/25/2024, total \u20ac92,242.00.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Orbit Telecom LLC", "date": "2024-11-25", "amount": 92242.0, "currency": "EUR"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Meridian Office Supplies, issue date May 15 2023, for the amount of \u20ac2,586.25.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Meridian Office Supplies", "date": "2023-05-15", "amount": 2586.25, "currency": "EUR"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Nexus Solutions Pvt Ltd. Invoice date: April 22, 2024. Total payable: 56357.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Nexus Solutions Pvt Ltd", "date": "2024-04-22", "amount": 56357.99, "currency": "JPY"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Invoice details: Supplier \u2014 Summit Consulting Group. Invoice date: 2024-01-05. Total payable: 7001.50.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Summit Consulting Group", "date": "2024-01-05", "amount": 7001.5, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Payment request from Clearwater Utilities, issue date 2023-06-08, for the amount of USD 715.99.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Clearwater Utilities", "date": "2023-06-08", "amount": 715.99, "currency": "USD"}} | ||
| {"messages": [{"role": "system", "content": "Extract the following fields from this invoice: vendor, date, amount, currency."}, {"role": "user", "content": "Received invoice from Apex Print Works on 20 March 2024. Amount due: 61,645.50 EUR.\n\nReturn valid JSON only."}], "ground_truth": {"vendor": "Apex Print Works", "date": "2024-03-20", "amount": 61645.5, "currency": "EUR"}} |
There was a problem hiding this comment.
Several eval labels do not match the prompt text.
This held-out set has corrupted targets. For example, Line 1 says 654 GBP but ground_truth.amount is 653.75; Line 6 invents GBP even though the prompt gives no currency; Line 14 says 130 JPY but the label is 130.5. Once run_eval() starts comparing against ground_truth, these rows will penalize correct extraction.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/eval_prompts.jsonl` around lines 1 - 50, The
eval_prompts.jsonl file contains corrupted ground_truth labels that don't match
the prompt text inside each "messages" array (e.g., amounts/currencies in the
"user" content disagree with the "ground_truth" object); update each record's
ground_truth fields (vendor, date, amount, currency) to exactly match the
invoice text in the corresponding "messages" -> user content (fix entries like
the first record where "654 GBP" should be amount: 654 and currency: "GBP", the
sixth record that incorrectly adds "GBP", and the 14th that lists 130 JPY but
has 130.5), or remove/regenerate any rows where you cannot confidently determine
the correct label; ensure every JSON object’s "ground_truth" matches the parsed
values from its "messages" content before running run_eval().
| "os.environ[\"FIREWORKS_API_KEY\"] = FIREWORKS_KEY\n", | ||
| "\n", | ||
| "assert FIREWORKS_KEY, \"Missing FIREWORKS_API_KEY\"\n", | ||
| "assert OPENROUTER_KEY, \"Missing OPENROUTER_API_KEY\"\n", | ||
| "assert ACCOUNT_ID, \"Missing FIREWORKS_ACCOUNT_ID\"\n", |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python - <<'PY'
import os
try:
os.environ["FIREWORKS_API_KEY"] = None
except Exception as exc:
print(type(exc).__name__)
print(exc)
PYRepository: patchy631/ai-engineering-hub
Length of output: 109
Assert FIREWORKS_KEY is a non-None string before assigning os.environ["FIREWORKS_API_KEY"]
When FIREWORKS_API_KEY is unset, FIREWORKS_KEY becomes None; writing it into os.environ throws TypeError: str expected, not NoneType before the notebook reaches assert FIREWORKS_KEY, "Missing FIREWORKS_API_KEY". Move/replace the assert so it runs before the os.environ[...] = FIREWORKS_KEY assignment (and similarly ensure other injected env values are not None).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/grpo_json_extraction.ipynb` around lines 84 - 88, The
code currently assigns os.environ["FIREWORKS_API_KEY"] = FIREWORKS_KEY before
checking FIREWORKS_KEY is non-None, which can raise TypeError; move or replace
the assertion so that assert FIREWORKS_KEY, "Missing FIREWORKS_API_KEY" (and
similar asserts for OPENROUTER_KEY and ACCOUNT_ID) run before any
os.environ[...] = ... assignments, and ensure the checks verify the values are
non-empty strings (not None) before injecting them into environment variables
(refer to FIREWORKS_KEY, OPENROUTER_KEY, ACCOUNT_ID and the os.environ[...]
assignments).
| "def run_eval(client, model_id: str, label: str) -> float:\n", | ||
| " scores = []\n", | ||
| " for i, entry in enumerate(eval_data):\n", | ||
| " messages = [m for m in entry[\"messages\"] if m[\"role\"] != \"assistant\"]\n", | ||
| " # Suppress thinking mode for structured output\n", | ||
| " messages = [\n", | ||
| " {**m, \"content\": m[\"content\"] + \" /no-think\"}\n", | ||
| " if m[\"role\"] == \"system\" else m\n", | ||
| " for m in messages\n", | ||
| " ]\n", | ||
| " try:\n", | ||
| " response = client.chat.completions.create(\n", | ||
| " model=model_id,\n", | ||
| " messages=messages,\n", | ||
| " temperature=0.0,\n", | ||
| " max_tokens=512,\n", | ||
| " )\n", | ||
| " content = response.choices[0].message.content\n", | ||
| " if \"</think>\" in content:\n", | ||
| " content = content.split(\"</think>\")[-1].strip()\n", | ||
| " s = score(content)\n", | ||
| " except Exception as e:\n", | ||
| " print(f\" Error on prompt {i}: {e}\")\n", | ||
| " s = 0.0\n", | ||
| " scores.append(s)\n", | ||
| " if (i + 1) % 10 == 0:\n", | ||
| " acc = sum(1 for s in scores if s == 1.0) / len(scores)\n", | ||
| " print(f\" [{label}] {i+1}/{len(eval_data)} running: {acc:.1%}\")\n", | ||
| "\n", | ||
| " accuracy = sum(1 for s in scores if s == 1.0) / len(scores)\n", | ||
| " print(f\"\\n[{label}] Schema-valid: {accuracy:.1%} ({sum(1 for s in scores if s == 1.0)}/{len(scores)})\")\n", | ||
| " return accuracy\n", |
There was a problem hiding this comment.
Use ground_truth for task accuracy, not just schema validity.
run_eval() never reads entry["ground_truth"], so the reported 62%/82% numbers only measure whether the output is valid JSON for the schema. A model can return the same well-formed placeholder object for every prompt and still score perfectly here.
📌 Suggested direction
def run_eval(client, model_id: str, label: str) -> float:
- scores = []
+ schema_scores = []
+ exact_matches = 0
for i, entry in enumerate(eval_data):
messages = [m for m in entry["messages"] if m["role"] != "assistant"]
# Suppress thinking mode for structured output
messages = [
{**m, "content": m["content"] + " /no-think"}
if m["role"] == "system" else m
for m in messages
]
try:
response = client.chat.completions.create(
model=model_id,
messages=messages,
temperature=0.0,
max_tokens=512,
)
content = response.choices[0].message.content
if "</think>" in content:
content = content.split("</think>")[-1].strip()
- s = score(content)
+ s = score(content)
+ if s == 1.0 and json.loads(content) == entry["ground_truth"]:
+ exact_matches += 1
except Exception as e:
print(f" Error on prompt {i}: {e}")
s = 0.0
- scores.append(s)
+ schema_scores.append(s)
if (i + 1) % 10 == 0:
- acc = sum(1 for s in scores if s == 1.0) / len(scores)
- print(f" [{label}] {i+1}/{len(eval_data)} running: {acc:.1%}")
+ acc = exact_matches / len(schema_scores)
+ print(f" [{label}] {i+1}/{len(eval_data)} running exact-match: {acc:.1%}")
- accuracy = sum(1 for s in scores if s == 1.0) / len(scores)
- print(f"\n[{label}] Schema-valid: {accuracy:.1%} ({sum(1 for s in scores if s == 1.0)}/{len(scores)})")
- return accuracy
+ schema_valid = sum(1 for s in schema_scores if s == 1.0) / len(schema_scores)
+ exact_accuracy = exact_matches / len(schema_scores)
+ print(f"\n[{label}] Schema-valid: {schema_valid:.1%}")
+ print(f"[{label}] Exact-match: {exact_accuracy:.1%} ({exact_matches}/{len(schema_scores)})")
+ return exact_accuracy🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/grpo_json_extraction.ipynb` around lines 257 - 288,
run_eval currently only measures schema validity (via score/content) and never
uses entry["ground_truth"]; update run_eval to compute task accuracy against the
true labels by parsing the model output JSON and comparing it to
entry["ground_truth"]. Specifically, inside run_eval (and where you call score),
first validate/parse content into a dict (handle the existing </think> strip and
JSON exceptions), then call a new or extended comparator (e.g., score or
compare_to_ground_truth) that accepts (parsed_output, entry["ground_truth"]) and
returns 1.0 for a correct match and 0.0 otherwise; append that result to scores,
keep schema-valid logging if desired, and change the printed/returned accuracy
to be based on matches to entry["ground_truth"] rather than just schema
validity. Ensure you reference run_eval, eval_data, entry["ground_truth"], and
score/compare_to_ground_truth when implementing.
| "_original_run = asyncio.run\n", | ||
| "def _patched_run(coro, **kwargs):\n", | ||
| " loop = asyncio.get_event_loop()\n", | ||
| " return loop.run_until_complete(coro)\n", | ||
| "asyncio.run = _patched_run\n", |
There was a problem hiding this comment.
Restore asyncio.run in a finally block.
If main(cfg) throws, Line 601 never executes and the notebook leaves a globally patched asyncio.run behind for later cells. That makes subsequent failures hard to reason about.
#!/bin/bash
python - <<'PY'
import json
from pathlib import Path
nb = json.loads(Path("grpo-finetuning-qwen3/grpo_json_extraction.ipynb").read_text())
for cell in nb["cells"]:
src = "".join(cell.get("source", []))
if "_original_run = asyncio.run" in src:
print(src)
break
PYAlso applies to: 599-601
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/grpo_json_extraction.ipynb` around lines 548 - 552, The
notebook globally replaces asyncio.run with _patched_run but never restores it
if main(cfg) raises; wrap the patch and the call to main(cfg) in a try/finally:
save the original in _original_run, assign asyncio.run = _patched_run, call
main(cfg) inside try, and always restore asyncio.run = _original_run in the
finally block (referencing the existing symbols _original_run, _patched_run,
asyncio.run and the main(cfg) invocation) so later cells see the original
asyncio.run regardless of exceptions.
| "def invoice_reward(completion: str, row: dict) -> float:\n", | ||
| " return score(completion)\n", | ||
| "\n", | ||
| "rl_loop.reward_fn = invoice_reward\n", |
There was a problem hiding this comment.
The GRPO reward never checks extraction correctness.
invoice_reward() ignores row and only rewards schema-conforming JSON. Combined with the current training rows, that means the policy can optimize for output shape alone instead of learning to extract the vendor, date, amount, and currency from the invoice text.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/grpo_json_extraction.ipynb` around lines 564 - 567, The
reward function invoice_reward currently ignores the ground-truth row and only
calls score(completion), so the policy can game output shape; update
invoice_reward to parse the completion JSON, extract fields (vendor, date,
amount, currency), compare them against the corresponding values in row
(normalizing formats for date/amount/currency), compute a numeric reward that
gives full credit for exact matches and partial credit for close/normalized
matches and penalties for missing/invalid fields, and return that float; replace
the current rl_loop.reward_fn binding with this new invoice_reward and ensure
parsing errors return a low reward rather than a crash.
| **Clone the Fireworks cookbook**: | ||
| ```bash | ||
| git clone https://github.com/fw-ai/cookbook.git | ||
| ``` | ||
|
|
||
| **Install Dependencies**: | ||
|
|
||
| Ensure you have Python 3.10 or later installed. | ||
|
|
||
| ```bash | ||
| uv venv | ||
| source .venv/bin/activate | ||
| uv pip install python-dotenv jsonschema openai fireworks-ai matplotlib | ||
| uv pip install -e "cookbook/training[training]" | ||
| uv pip install eval-protocol nest_asyncio | ||
| ``` | ||
|
|
||
| Select the virtual environment as the kernel in the notebook. | ||
|
|
||
| **Run the notebook**: | ||
|
|
||
| Open and run `grpo_json_extraction.ipynb` end-to-end. The notebook covers: |
There was a problem hiding this comment.
Document the required notebook working directory.
These steps never say to start Jupyter from grpo-finetuning-qwen3, but the notebook resolves ./cookbook/training, ./train_prompts.jsonl, and ./eval_prompts.jsonl with process-relative paths. Launching the notebook from the repo root will make the setup and data-loading cells fail immediately.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/README.md` around lines 19 - 40, Update the README to
explicitly instruct users to start Jupyter (or run the notebook) from the
grpo-finetuning-qwen3 directory so the notebook’s relative imports resolve;
mention that grpo_json_extraction.ipynb expects process-relative paths to
./cookbook/training, ./train_prompts.jsonl, and ./eval_prompts.jsonl and provide
the exact command to change into the directory (e.g., cd grpo-finetuning-qwen3)
or an alternative note about launching Jupyter from the repo root after
adjusting those paths in the notebook.
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
grpo-finetuning-qwen3/README.md (1)
19-22:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSpecify the cookbook clone location relative to the working directory.
Line 21 clones the cookbook repository, but doesn't specify where to clone it. Line 32 then references
cookbook/training[training], assuming the cookbook directory exists in the current working directory.If users clone the cookbook while in the repository root but run the notebook from
grpo-finetuning-qwen3/, the relative path./cookbook/trainingin the notebook won't resolve correctly.📝 Suggested fix
Either instruct users to clone into the grpo-finetuning-qwen3 directory, or adjust the install command to reference the correct relative path:
Option 1: Clone inside grpo-finetuning-qwen3
**Clone the Fireworks cookbook**: ```bash +cd grpo-finetuning-qwen3 git clone https://github.com/fw-ai/cookbook.git**Option 2: Adjust the pip install path** ```diff ```bash uv venv source .venv/bin/activate uv pip install python-dotenv jsonschema openai fireworks-ai matplotlib -uv pip install -e "cookbook/training[training]" +uv pip install -e "../cookbook/training[training]" uv pip install eval-protocol nest_asyncio</details> Also applies to: 32-32 <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@grpo-finetuning-qwen3/README.mdaround lines 19 - 22, The README instructs
cloning the cookbook without a specific target, causing the later pip install -e
"cookbook/training[training]" to fail if the repo is cloned elsewhere; update
instructions to either (A) tell users to cd into grpo-finetuning-qwen3 before
cloning so the cookbook lands at ./cookbook (e.g., run cd grpo-finetuning-qwen3
then git clone https://github.com/fw-ai/cookbook.git), or (B) change the
editable install path in the setup commands to point to the parent folder where
users are likely to clone (replace pip install -e "cookbook/training[training]"
with pip install -e "../cookbook/training[training]"); ensure the README clearly
states which option you choose so the path used by the pip install matches where
the cookbook was cloned.</details> </blockquote></details> </blockquote></details>♻️ Duplicate comments (1)
grpo-finetuning-qwen3/README.md (1)
19-40:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClarify the working directory for setup and notebook execution.
The setup steps don't specify where to run these commands. The notebook resolves paths like
./train_prompts.jsonl,./eval_prompts.jsonl, and./cookbook/trainingrelative to the process working directory, but the instructions never tell users tocd grpo-finetuning-qwen3first.Without this clarification, users starting from the repository root will encounter import failures and missing file errors when running the notebook.
📝 Suggested fix
Add an explicit working directory instruction before the setup steps:
## Setup and installations +**Navigate to the project directory**: +```bash +cd grpo-finetuning-qwen3 +``` + **Get API Keys**:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@grpo-finetuning-qwen3/README.md` around lines 19 - 40, The README lacks an explicit working-directory instruction so running the setup and the notebook (which references files like ./train_prompts.jsonl, ./eval_prompts.jsonl and the ./cookbook/training package) from the repository root will fail; update README.md to instruct users to first change into the project folder (e.g., cd grpo-finetuning-qwen3) before running the listed setup commands and launching grpo_json_extraction.ipynb so the relative paths and imports resolve correctly.🧹 Nitpick comments (1)
grpo-finetuning-qwen3/README.md (1)
66-71: 💤 Low valueConsider providing a more concrete example for the
--output-idparameter.Line 70 uses
<your-model-id>as a placeholder, which requires users to invent their own identifier. Providing a concrete example (e.g.,invoice-extraction-v1) would make it easier for users to understand the expected format and get started quickly.📝 Suggested enhancement
python agent-skill/grpo-finetune/run_pipeline.py \ --train ./train_prompts.jsonl \ --eval ./eval_prompts.jsonl \ --task invoice-extraction \ - --output-id <your-model-id> + --output-id invoice-extraction-v1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@grpo-finetuning-qwen3/README.md` around lines 66 - 71, Replace the placeholder "<your-model-id>" in the example command that runs agent-skill/grpo-finetune/run_pipeline.py with a concrete example model id (e.g., "invoice-extraction-v1") so users see the expected format for the --output-id parameter; update the README example invocation to: python agent-skill/grpo-finetune/run_pipeline.py --train ./train_prompts.jsonl --eval ./eval_prompts.jsonl --task invoice-extraction --output-id invoice-extraction-v1 and keep the placeholder mention elsewhere if you still want to show users they can choose their own id.🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. Inline comments: In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/agent_demo.py`: - Around line 97-132: The loop can divide by zero when raw_docs is empty (total == 0); update the logic after reading raw_docs (where total and passed are set) to check if total == 0 and return/exit early with a clear message instead of proceeding to the loop and final percentage computation—e.g., detect total == 0 and print a user-friendly "no documents found" message (referencing raw_docs, total, passed, and filepath) and skip the pct calculation and final summary so pct = round(passed / total * 100) is never executed on zero. - Around line 83-84: The current validate(result: dict) uses REQUIRED_FIELDS and only checks for non-empty values, so replace its body to call the canonical schema validator used by the training/eval pipeline (i.e., reuse the same schema validation function/class used for training/eval rather than ad-hoc REQUIRED_FIELDS checks), import that validator into this module, and ensure validate(result) returns the validator's boolean result so numeric 0 is allowed, types are enforced, and extra/missing fields are handled consistently with the reward contract. In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/generate_reward.py`: - Around line 43-53: The SELF_TESTS loop can raise on malformed entries or when mod.score(...) throws, so change the loop over getattr(mod, "SELF_TESTS", []) in generate_reward.py to catch exceptions per-test: inside the for i, (completion, row, expected) in enumerate(...) wrap the unpack and call to mod.score(completion, row) in a try/except Exception as e, mark the test as failed (increment failures), and print a clear failure line that includes the exception message/type instead of letting it traceback; keep the existing numeric tolerance check when score returns normally and ensure the existing failures check (printing "FAIL ... self-test(s) failed" and returning 1) still runs after the loop. In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py`: - Around line 55-61: The completion content handling currently only strips "</think>" before scoring; update the logic in run_pipeline.py where the variable content is prepared (after resp.choices[0].message.content and before calling score_fn) to also remove Markdown fenced code blocks by trimming a leading triple-backtick fence (optionally followed by "json") and a trailing triple-backtick from content.strip(), then pass the cleaned string to score_fn so fenced JSON is scored correctly; refer to the variables/content handling around model_id, resp, content and the call to score_fn when making the change. - Around line 181-184: The subprocess call uses a hard-coded relative path; change it to resolve agent_demo.py relative to this module's file: import Path from pathlib, compute script_path = Path(__file__).resolve().parent / "agent_demo.py" (since run_pipeline.py and agent_demo.py are siblings), and pass str(script_path) to subprocess.run instead of the literal "agent-skill/grpo-finetune/agent_demo.py"; keep the existing argv list (sys.executable, str(invoices_path), "--deployment", deployment). - Around line 107-124: The try/except around fw.datasets.get(...) currently catches all exceptions and wrongly treats any error as "dataset not found"; change the except Exception to catch only the specific fireworks.NotFoundError (e.g., except fireworks.NotFoundError as e:) so you only fall back to fw.datasets.create(...) and fw.datasets.upload(...) when the SDK reports a 404, and for any other exception re-raise or log and propagate it instead of proceeding with create/upload; update references around fw.datasets.get, fw.datasets.create, fw.datasets.upload and ensure the NotFoundError import or fully-qualified name is used. In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/SKILL.md`: - Around line 94-95: The statement "The pipeline automatically runs the agent demo on sample invoices at the end." is inaccurate; update the doc to state the demo is conditional: describe that run_pipeline.py calls agent_demo.py only if the invoices file exists and otherwise prints that the demo is skipped. Reference the filenames run_pipeline.py and agent_demo.py and mention the invoices file existence check so readers understand the exact conditional behavior. --- Outside diff comments: In `@grpo-finetuning-qwen3/README.md`: - Around line 19-22: The README instructs cloning the cookbook without a specific target, causing the later pip install -e "cookbook/training[training]" to fail if the repo is cloned elsewhere; update instructions to either (A) tell users to cd into grpo-finetuning-qwen3 before cloning so the cookbook lands at ./cookbook (e.g., run cd grpo-finetuning-qwen3 then git clone https://github.com/fw-ai/cookbook.git), or (B) change the editable install path in the setup commands to point to the parent folder where users are likely to clone (replace pip install -e "cookbook/training[training]" with pip install -e "../cookbook/training[training]"); ensure the README clearly states which option you choose so the path used by the pip install matches where the cookbook was cloned. --- Duplicate comments: In `@grpo-finetuning-qwen3/README.md`: - Around line 19-40: The README lacks an explicit working-directory instruction so running the setup and the notebook (which references files like ./train_prompts.jsonl, ./eval_prompts.jsonl and the ./cookbook/training package) from the repository root will fail; update README.md to instruct users to first change into the project folder (e.g., cd grpo-finetuning-qwen3) before running the listed setup commands and launching grpo_json_extraction.ipynb so the relative paths and imports resolve correctly. --- Nitpick comments: In `@grpo-finetuning-qwen3/README.md`: - Around line 66-71: Replace the placeholder "<your-model-id>" in the example command that runs agent-skill/grpo-finetune/run_pipeline.py with a concrete example model id (e.g., "invoice-extraction-v1") so users see the expected format for the --output-id parameter; update the README example invocation to: python agent-skill/grpo-finetune/run_pipeline.py --train ./train_prompts.jsonl --eval ./eval_prompts.jsonl --task invoice-extraction --output-id invoice-extraction-v1 and keep the placeholder mention elsewhere if you still want to show users they can choose their own id.🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
86f36cd1-c636-4c99-9412-dfcc70a4cbbb📒 Files selected for processing (6)
grpo-finetuning-qwen3/README.mdgrpo-finetuning-qwen3/agent-skill/grpo-finetune/SKILL.mdgrpo-finetuning-qwen3/agent-skill/grpo-finetune/agent_demo.pygrpo-finetuning-qwen3/agent-skill/grpo-finetune/generate_reward.pygrpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.pygrpo-finetuning-qwen3/invoices.txt✅ Files skipped from review due to trivial changes (1)
- grpo-finetuning-qwen3/invoices.txt
| def validate(result: dict) -> bool: | ||
| return all(result.get(f) not in (None, "", 0) for f in REQUIRED_FIELDS) |
There was a problem hiding this comment.
Make demo validation match the training/eval schema.
This only checks for non-empty keys, so it can mark wrong types or extra fields as Schema valid, and it rejects amount: 0 even though the notebook schema accepts any JSON number. Reuse the same schema validation here so the demo's pass/fail matches the reward contract.
Suggested fix
import argparse
import json
import os
import re
import sys
import time
from dotenv import load_dotenv
+from jsonschema import ValidationError, validate as validate_json
from openai import OpenAI
REQUIRED_FIELDS = {"vendor", "date", "amount", "currency"}
+SCHEMA = {
+ "type": "object",
+ "required": ["vendor", "date", "amount", "currency"],
+ "properties": {
+ "vendor": {"type": "string"},
+ "date": {"type": "string"},
+ "amount": {"type": "number"},
+ "currency": {"type": "string"},
+ },
+ "additionalProperties": False,
+}
@@
def validate(result: dict) -> bool:
- return all(result.get(f) not in (None, "", 0) for f in REQUIRED_FIELDS)
+ try:
+ validate_json(instance=result, schema=SCHEMA)
+ return True
+ except ValidationError:
+ return False🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/agent_demo.py` around lines
83 - 84, The current validate(result: dict) uses REQUIRED_FIELDS and only checks
for non-empty values, so replace its body to call the canonical schema validator
used by the training/eval pipeline (i.e., reuse the same schema validation
function/class used for training/eval rather than ad-hoc REQUIRED_FIELDS
checks), import that validator into this module, and ensure validate(result)
returns the validator's boolean result so numeric 0 is allowed, types are
enforced, and extra/missing fields are handled consistently with the reward
contract.
| with open(filepath) as f: | ||
| raw_docs = [line.strip() for line in f if line.strip()] | ||
|
|
||
| total = len(raw_docs) | ||
| passed = 0 | ||
|
|
||
| for i, doc in enumerate(raw_docs, 1): | ||
| print(f"\n{GRAY}#{i} of {total}{RESET}") | ||
| print(f"{CYAN}{doc}{RESET}") | ||
| print(DIVIDER_MID) | ||
|
|
||
| t0 = time.time() | ||
| try: | ||
| result = extract(client, deployment, doc) | ||
| elapsed = round(time.time() - t0, 2) | ||
| valid = validate(result) | ||
| if valid: | ||
| passed += 1 | ||
|
|
||
| for field in REQUIRED_FIELDS: | ||
| val = result.get(field, "—") | ||
| print(f" {GRAY}{field:<10}{RESET} {val}") | ||
|
|
||
| print() | ||
| if valid: | ||
| print(f" {GREEN}✓ Schema valid{RESET} {GRAY}{elapsed}s{RESET}") | ||
| else: | ||
| print(f" {RED}✗ Schema mismatch{RESET} {GRAY}{elapsed}s{RESET}") | ||
|
|
||
| except Exception as e: | ||
| print(f" {RED}✗ Error: {e}{RESET}") | ||
|
|
||
| print(DIVIDER) | ||
|
|
||
| pct = round(passed / total * 100) | ||
| print(f"\n {BOLD}Results{RESET} {GREEN}{passed}/{total} valid{RESET} Schema match: {GREEN}{pct}%{RESET}\n") |
There was a problem hiding this comment.
Handle an empty invoices.txt before computing percentages.
If the file is empty or only whitespace, total becomes 0 and Line 131 raises ZeroDivisionError. Return early with a clear message instead of crashing after the banner.
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 126-126: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/agent_demo.py` around lines
97 - 132, The loop can divide by zero when raw_docs is empty (total == 0);
update the logic after reading raw_docs (where total and passed are set) to
check if total == 0 and return/exit early with a clear message instead of
proceeding to the loop and final percentage computation—e.g., detect total == 0
and print a user-friendly "no documents found" message (referencing raw_docs,
total, passed, and filepath) and skip the pct calculation and final summary so
pct = round(passed / total * 100) is never executed on zero.
| for i, (completion, row, expected) in enumerate(getattr(mod, "SELF_TESTS", [])): | ||
| got = mod.score(completion, row) | ||
| ok = abs(float(got) - float(expected)) < 1e-6 | ||
| print(f" [{'ok ' if ok else 'FAIL'}] test {i}: got {got} expected {expected}") | ||
| if not ok: | ||
| failures += 1 | ||
| if failures: | ||
| print(f"FAIL {failures} self-test(s) failed") | ||
| return 1 | ||
| tests = len(getattr(mod, "SELF_TESTS", [])) | ||
| print("PASS" + (f" {tests} self-tests green" if tests else " (add SELF_TESTS to pin the task)")) |
There was a problem hiding this comment.
Fail self-tests cleanly instead of letting them traceback.
A malformed SELF_TESTS entry, or a score() exception on one of those cases, currently escapes this validator as an unhandled exception. That defeats the point of validating reward.py before GPU spend; this path should print FAIL and return 1 just like the probe checks above.
Suggested fix
failures = 0
- for i, (completion, row, expected) in enumerate(getattr(mod, "SELF_TESTS", [])):
- got = mod.score(completion, row)
+ for i, test_case in enumerate(getattr(mod, "SELF_TESTS", [])):
+ try:
+ completion, row, expected = test_case
+ got = mod.score(completion, row)
+ except Exception as e:
+ print(f"FAIL self-test {i} crashed: {e}")
+ return 1
ok = abs(float(got) - float(expected)) < 1e-6
print(f" [{'ok ' if ok else 'FAIL'}] test {i}: got {got} expected {expected}")
if not ok:
failures += 1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/generate_reward.py` around
lines 43 - 53, The SELF_TESTS loop can raise on malformed entries or when
mod.score(...) throws, so change the loop over getattr(mod, "SELF_TESTS", []) in
generate_reward.py to catch exceptions per-test: inside the for i, (completion,
row, expected) in enumerate(...) wrap the unpack and call to
mod.score(completion, row) in a try/except Exception as e, mark the test as
failed (increment failures), and print a clear failure line that includes the
exception message/type instead of letting it traceback; keep the existing
numeric tolerance check when score returns normally and ensure the existing
failures check (printing "FAIL ... self-test(s) failed" and returning 1) still
runs after the loop.
| try: | ||
| resp = client.chat.completions.create(model=model_id, messages=msgs, | ||
| temperature=0.0, max_tokens=512) | ||
| content = resp.choices[0].message.content | ||
| if "</think>" in content: | ||
| content = content.split("</think>")[-1].strip() | ||
| s = score_fn(content, row) |
There was a problem hiding this comment.
Strip fenced JSON in eval before scoring it.
agent_demo.py already removes both </think> and Markdown ```json fences, but run_eval() only removes ``. A completion wrapped in fences gets scored as `0.0` here even when the JSON body is otherwise valid, so the reported accuracy can be artificially low.
Suggested fix
import argparse
import json
import os
+import re
import subprocess
import sys
import time
@@
content = resp.choices[0].message.content
if "</think>" in content:
content = content.split("</think>")[-1].strip()
+ content = re.sub(r"^```(?:json)?\s*|\s*```$", "", content.strip())
s = score_fn(content, row)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py` around lines
55 - 61, The completion content handling currently only strips "</think>" before
scoring; update the logic in run_pipeline.py where the variable content is
prepared (after resp.choices[0].message.content and before calling score_fn) to
also remove Markdown fenced code blocks by trimming a leading triple-backtick
fence (optionally followed by "json") and a trailing triple-backtick from
content.strip(), then pass the cleaned string to score_fn so fenced JSON is
scored correctly; refer to the variables/content handling around model_id, resp,
content and the call to score_fn when making the change.
| try: | ||
| ds = fw.datasets.get(dataset_id=dataset_id) | ||
| if ds.state == "READY": | ||
| checkpoint(f"Dataset ready · {row_count} prompts") | ||
| else: | ||
| print("Waiting for dataset", end="", flush=True) | ||
| while True: | ||
| ds = fw.datasets.get(dataset_id=dataset_id) | ||
| if ds.state == "READY": | ||
| print() | ||
| break | ||
| print(".", end="", flush=True) | ||
| time.sleep(3) | ||
| checkpoint(f"Dataset ready · {row_count} prompts") | ||
| except Exception: | ||
| print(f"Uploading {row_count} prompts...", flush=True) | ||
| fw.datasets.create(dataset_id=dataset_id, dataset={"exampleCount": str(row_count)}) | ||
| fw.datasets.upload(dataset_id=dataset_id, file=train_path) |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
What exception does the Fireworks Python SDK raise when fw.datasets.get(dataset_id=...) is called for a nonexistent dataset, and how should callers distinguish that from auth or transport failures?
💡 Result:
When the Fireworks Python SDK encounters a nonexistent dataset during a call to fw.datasets.get(dataset_id=...), it raises a fireworks.NotFoundError exception [1][2]. To distinguish this from other failures, you should handle the hierarchical exception structure provided by the SDK [1][2]: 1. NotFoundError: This is a subclass of fireworks.APIStatusError that specifically indicates the requested resource was not found (HTTP 404) [1][2]. 2. APIStatusError: This class is raised for any non-success API response (4xx or 5xx) [1][2]. It contains status_code and response properties, which allow you to programmatically inspect the HTTP status code (e.g., 401 for authentication, 403 for permission denied) [1][2]. 3. APIConnectionError: This class is raised when the library cannot connect to the API, such as during network failures or timeouts [1][2]. All of these exceptions inherit from the base fireworks.APIError class, which can be used to catch any SDK-related error [1][2]. Example handling pattern: try: dataset = client.datasets.get(dataset_id=dataset_id) except fireworks.NotFoundError: print("The requested dataset does not exist.") except fireworks.APIStatusError as e: if e.status_code == 401: print("Authentication failure.") else: print(f"API request failed with status code {e.status_code}") except fireworks.APIConnectionError: print("Failed to connect to the Fireworks API.") except fireworks.APIError: print("An unexpected API error occurred.")
Citations:
- 1: https://github.com/fw-ai-external/python-sdk/blob/main/README.md
- 2: https://pypi.org/project/fireworks-ai/1.0.0a3/
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the referenced file around the snippet and its imports/SDK usage
sed -n '1,220p' grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py | nl -baRepository: patchy631/ai-engineering-hub
Length of output: 115
🌐 Web query:
Fireworks Python SDK exception class for nonexistent dataset datasets.get NotFoundError APIStatusError APIConnectionError
💡 Result:
In the Fireworks Python SDK, when a requested resource—such as a dataset—does not exist, the API returns a 404 status code, which triggers a fireworks.NotFoundError exception [1][2][3]. The Fireworks Python SDK uses a structured hierarchy for error handling, where all exceptions inherit from the base fireworks.APIError class [1][2][3]. The specific exception classes relevant to your query are defined as follows: 1. NotFoundError: Raised when the API returns a 404 status code, indicating that the requested resource (e.g., a specific dataset) could not be found [1][2][4]. 2. APIStatusError: This is the base class for all non-success status codes (4xx or 5xx responses) [1][2][3]. It contains properties for status_code and response, allowing you to inspect the details of the error [1][2][3]. 3. APIConnectionError: This exception is raised when the library is unable to connect to the API, such as in cases of network connectivity issues or timeouts (where no HTTP status code is received) [1][2][4]. Other common exception classes include: - BadRequestError (400) [1][2] - AuthenticationError (401) [1][2] - PermissionDeniedError (403) [1][2] - RateLimitError (429) [1][2] - InternalServerError (>=500) [1][2] When writing code to handle these, you can catch them specifically to implement appropriate retry or logging logic [1][2]. For example: import fireworks try: dataset = client.datasets.get(dataset_id="nonexistent-id") except fireworks.NotFoundError: print("The requested dataset was not found.") except fireworks.APIConnectionError: print("Failed to connect to the Fireworks API.") except fireworks.APIStatusError as e: print(f"API returned status {e.status_code}: {e.response}")
Citations:
- 1: https://github.com/fw-ai-external/python-sdk/blob/main/README.md
- 2: https://pypi.org/project/fireworks-ai/1.2.0a67/
- 3: https://pypi.org/project/fireworks-ai/1.0.0a54/
- 4: https://pypi.org/project/fireworks-ai/1.0.0a59/
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Print the relevant section with line numbers (cat -n works without nl)
cat -n grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py | sed -n '70,170p'
# Check how `fw` is instantiated / what SDK is imported
rg -n "import fireworks|fw\s*=|fireworks\." grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py || trueRepository: patchy631/ai-engineering-hub
Length of output: 4957
Limit the dataset-upload fallback to fireworks.NotFoundError (404)
In grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py (lines 107-124), the except Exception: around fw.datasets.get(dataset_id=...) will also swallow auth/permission failures and transport/transient API errors—then it incorrectly proceeds to fw.datasets.create(...) + fw.datasets.upload(...) as if the dataset were missing. The SDK raises fireworks.NotFoundError for a nonexistent dataset; narrow the fallback to that exception only and re-raise/log everything else.
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 121-121: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py` around lines
107 - 124, The try/except around fw.datasets.get(...) currently catches all
exceptions and wrongly treats any error as "dataset not found"; change the
except Exception to catch only the specific fireworks.NotFoundError (e.g.,
except fireworks.NotFoundError as e:) so you only fall back to
fw.datasets.create(...) and fw.datasets.upload(...) when the SDK reports a 404,
and for any other exception re-raise or log and propagate it instead of
proceeding with create/upload; update references around fw.datasets.get,
fw.datasets.create, fw.datasets.upload and ensure the NotFoundError import or
fully-qualified name is used.
| subprocess.run([sys.executable, | ||
| "agent-skill/grpo-finetune/agent_demo.py", | ||
| str(invoices_path), | ||
| "--deployment", deployment]) |
There was a problem hiding this comment.
Resolve agent_demo.py from __file__, not the caller's working directory.
The subprocess target is hard-coded as agent-skill/grpo-finetune/agent_demo.py, which only works when the current working directory is grpo-finetuning-qwen3. Running this script from the repo root or from its own directory breaks the demo step.
Suggested fix
if invoices_path.exists():
banner("STEP 4 · sample inference")
print("Running the fine-tuned model on sample invoices...\n", flush=True)
time.sleep(1)
+ demo_script = Path(__file__).resolve().parent / "agent_demo.py"
subprocess.run([sys.executable,
- "agent-skill/grpo-finetune/agent_demo.py",
+ str(demo_script),
str(invoices_path),
"--deployment", deployment])🧰 Tools
🪛 Ruff (0.15.15)
[error] 181-181: subprocess call: check for execution of untrusted input
(S603)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/run_pipeline.py` around lines
181 - 184, The subprocess call uses a hard-coded relative path; change it to
resolve agent_demo.py relative to this module's file: import Path from pathlib,
compute script_path = Path(__file__).resolve().parent / "agent_demo.py" (since
run_pipeline.py and agent_demo.py are siblings), and pass str(script_path) to
subprocess.run instead of the literal "agent-skill/grpo-finetune/agent_demo.py";
keep the existing argv list (sys.executable, str(invoices_path), "--deployment",
deployment).
| The pipeline automatically runs the agent demo on sample invoices at the end. | ||
|
|
There was a problem hiding this comment.
Document the demo step as conditional.
run_pipeline.py only invokes agent_demo.py when the invoices file exists; otherwise it prints that the demo is being skipped. This line currently overpromises the behavior the user will see.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@grpo-finetuning-qwen3/agent-skill/grpo-finetune/SKILL.md` around lines 94 -
95, The statement "The pipeline automatically runs the agent demo on sample
invoices at the end." is inaccurate; update the doc to state the demo is
conditional: describe that run_pipeline.py calls agent_demo.py only if the
invoices file exists and otherwise prints that the demo is skipped. Reference
the filenames run_pipeline.py and agent_demo.py and mention the invoices file
existence check so readers understand the exact conditional behavior.
Summary by CodeRabbit
New Features
Documentation