diff --git a/Demos/Agentic_RFT_PrivatePreview/RFT_Best_Practice.md b/Demos/Agentic_RFT_PrivatePreview/RFT_Best_Practice.md
index e84bb5f..4cdbb1a 100644
--- a/Demos/Agentic_RFT_PrivatePreview/RFT_Best_Practice.md
+++ b/Demos/Agentic_RFT_PrivatePreview/RFT_Best_Practice.md
@@ -40,7 +40,7 @@ The grader is the primary driver of RFT success. Invest disproportionate effort
 - **Use the simplest grader that works**: If validating an exact match answer (for example, a number or multiple‑choice letter), use a **string‑match grader** rather than a model‑based or Python grader — even if those alternatives could also work.
 - **Prefer deterministic checks**: String validation, code or Python‑based graders, and endpoint‑based graders are more reliable than model‑based grading.
 - **Aim for well‑distributed rewards**: Rewards that are too sparse or too uniform produce weak learning signals that limit model improvement.
-- **Validate on diverse, real‑world inputs**: Use [Foundry evaluations](https://learn.microsoft.com/en-us/azure/foundry/how-to/evaluate-generative-ai-app) to test graders on existing datasets to ensure they behave as expected.
+- **Validate on diverse, real‑world inputs**: Validate graders on real‑world datasets rather than relying only on synthetic data.
 
 ### Start Small and Iterate
 
@@ -226,4 +226,4 @@ RFT pipeline supports tool use through function-calling, however MCP is preferre
 
 ### Grader Robustness and Reward Integrity
 
-Bad graders can lead models to learn shortcuts (reward hacking). Don’t grade only the final text, grade the tool trace and verify outcomes. In practice that means giving partial credit (outcome vs. tool use vs. safety), explicitly requiring critical steps (for example, lookups before writes), and keeping grading deterministic so improvements reflect policy changes , not grader noise.
\ No newline at end of file
+Bad graders can lead models to learn shortcuts (reward hacking). Don’t grade only the final text: grade the tool trace and verify outcomes. In practice that means giving partial credit (outcome vs. tool use vs. safety), explicitly requiring critical steps (for example, lookups before writes), and keeping grading deterministic so improvements reflect policy changes, not grader noise.
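The grading advice in the final hunk — partial credit split across outcome, tool use, and safety, a required critical step (lookup before write), and fully deterministic scoring — can be sketched as a small Python grader. This is an illustration only, not the Foundry RFT grader API: the rollout dict shape (`answer`, `reference`, `trace`), the tool names, and the 0.6/0.3/0.1 weights are all hypothetical choices for this sketch.

```python
# Illustrative sketch only. The rollout format, tool names, and weights
# below are hypothetical, not part of any RFT or Foundry grader API.

def grade(sample: dict) -> float:
    """Return a deterministic reward in [0, 1] for a completed rollout."""
    score = 0.0

    # Outcome (weight 0.6): exact string match against the reference answer,
    # per the "use the simplest grader that works" guidance.
    if sample["answer"].strip() == sample["reference"].strip():
        score += 0.6

    # Tool use (weight 0.3): require the critical step — a lookup must
    # occur before any write appears in the tool trace.
    calls = [step["tool"] for step in sample["trace"]]
    if "lookup" in calls:
        first_write = calls.index("write") if "write" in calls else len(calls)
        if calls.index("lookup") < first_write:
            score += 0.3

    # Safety (weight 0.1): no forbidden tools anywhere in the trace.
    if not any(tool in {"delete", "shell"} for tool in calls):
        score += 0.1

    return score
```

Because every check is a pure function of the rollout, repeated grading of the same trace always yields the same reward, so score movements during training reflect policy changes rather than grader noise; the split weights also give the smoother reward distribution the first hunk recommends over all-or-nothing scoring.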