Demos/Agentic_RFT_PrivatePreview/RFT_Best_Practice.md (4 changes: 2 additions & 2 deletions)
@@ -40,7 +40,7 @@ The grader is the primary driver of RFT success. Invest disproportionate effort
- **Use the simplest grader that works**: If validating an exact match answer (for example, a number or multiple‑choice letter), use a **string‑match grader** rather than a model‑based or Python grader, even if those alternatives could also work (a minimal sketch of such a grader follows this list).
- **Prefer deterministic checks**: String validation, code or Python‑based graders, and endpoint‑based graders are more reliable than model‑based grading.
- **Aim for well‑distributed rewards**: Rewards that are too sparse or too uniform produce weak learning signals that limit model improvement.
- **Validate on diverse, real‑world inputs**: Use [Foundry evaluations](https://learn.microsoft.com/en-us/azure/foundry/how-to/evaluate-generative-ai-app) to test graders on existing datasets to ensure they behave as expected.
- **Validate on diverse, real‑world inputs**: Test graders on real‑world datasets rather than relying only on synthetic data.
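
To make the string‑match and deterministic‑check guidance concrete, the snippet below is a minimal sketch of an exact‑match grader in plain Python. It is illustrative only: the function name, signature, and normalization are assumptions made for this example, not the configuration schema of any hosted grader type.

```python
# Minimal sketch of a deterministic, exact-match grader (illustrative only;
# hosted string-match grader types have their own configuration schema).
def string_match_grader(model_answer: str, reference: str) -> float:
    """Return 1.0 on an exact match after trimming and lowercasing, else 0.0."""
    normalize = lambda s: s.strip().lower()
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0

# Example: grading a multiple-choice answer.
assert string_match_grader(" B ", "b") == 1.0
assert string_match_grader("C", "b") == 0.0
```

Because the check is deterministic, rerunning the same output always yields the same reward, so score changes during training can be attributed to the policy rather than to grader noise.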

### Start Small and Iterate

@@ -226,4 +226,4 @@ RFT pipeline supports tool use through function-calling, however MCP is preferred

### Grader Robustness and Reward Integrity

Bad graders can lead models to learn shortcuts (reward hacking). Don’t grade only the final text; grade the tool trace and verify outcomes. In practice, that means giving partial credit (outcome vs. tool use vs. safety), explicitly requiring critical steps (for example, lookups before writes), and keeping grading deterministic so improvements reflect policy changes, not grader noise.
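
As one way to picture trace‑level, partial‑credit grading, the sketch below scores outcome, tool use, and safety separately over a trace recorded as an ordered list of tool‑call names. The tool names (`lookup_record`, `write_record`, `delete_record`), the weights, and the trace format are hypothetical illustrations, not a prescribed rubric or API.

```python
# Hypothetical partial-credit grader over a tool trace (illustrative weights
# and tool names; not a prescribed rubric).
def grade_trace(tool_calls: list[str], final_answer: str, reference: str) -> float:
    score = 0.0
    # Outcome (0.6): deterministic exact-match check on the final answer.
    if final_answer.strip().lower() == reference.strip().lower():
        score += 0.6
    # Tool use (0.3): require the critical step -- a lookup must precede any write.
    writes = [i for i, name in enumerate(tool_calls) if name == "write_record"]
    lookups = [i for i, name in enumerate(tool_calls) if name == "lookup_record"]
    if not writes or (lookups and min(lookups) < min(writes)):
        score += 0.3
    # Safety (0.1): no disallowed tool appears anywhere in the trace.
    if "delete_record" not in tool_calls:
        score += 0.1
    return score

# Example: a trace that looks up before writing and answers correctly scores 1.0.
print(grade_trace(["lookup_record", "write_record"], "Approved", "approved"))
```

Keeping each component separate also makes reward hacking easier to spot: a model that maximizes the outcome term while skipping required lookups will show up as a persistent gap in the tool‑use component.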