Commit 65714ad
feat: add done-gate to prevent premature task completion (#110)
* feat: add done-gate to prevent agents from prematurely declaring task complete
When enabled via --done-gate, the evaluation runner calls adapter.evaluate()
when the agent signals "done" to verify the task is actually complete. If the
score is below the threshold (default 1.0), the runner overrides the "done"
signal, appends a continuation message to the task instruction, and lets the
agent continue. Overrides are capped at a configurable maximum (default 3) to
prevent infinite loops.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
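The done-gate loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `run_with_done_gate`, `GateResult`, the `act` callback, the `max_steps` limit, and the wording of `CONTINUE_MESSAGE` are all hypothetical names and values chosen for the example. Only the threshold default (1.0), the override cap default (3), and the "append a continuation message and keep going" behavior come from the commit message. The `evaluate` callback stands in for `adapter.evaluate()`.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical wording; the real continuation message is not shown in the commit.
CONTINUE_MESSAGE = "\n\nThe task is not yet complete. Please continue."


@dataclass
class GateResult:
    steps: int          # agent steps taken
    overrides: int      # number of "done" signals that were overridden
    final_score: float  # score from the last evaluation (0.0 if never evaluated)
    instruction: str    # instruction text including appended continuation messages


def run_with_done_gate(
    act: Callable[[str], str],
    evaluate: Callable[[], float],
    instruction: str,
    threshold: float = 1.0,
    max_overrides: int = 3,
    max_steps: int = 50,
) -> GateResult:
    """Run an agent loop, overriding premature "done" signals.

    `act` returns the agent's next action for the given instruction;
    the string "done" is treated as the completion signal (an assumption).
    """
    overrides = 0
    score = 0.0
    steps = 0
    for steps in range(1, max_steps + 1):
        action = act(instruction)
        if action != "done":
            continue  # ordinary action; keep stepping
        score = evaluate()
        # Accept "done" if the task passes, or if the override budget is spent.
        if score >= threshold or overrides >= max_overrides:
            break
        overrides += 1
        instruction += CONTINUE_MESSAGE
    return GateResult(steps, overrides, score, instruction)


if __name__ == "__main__":
    scores = iter([0.0, 0.0, 1.0])
    result = run_with_done_gate(lambda _: "done", lambda: next(scores), "Open the file")
    print(result.steps, result.overrides, result.final_score)
```

The override cap is what guarantees termination when the evaluator never reports success: once the cap is reached, the next "done" is accepted regardless of score.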
* feat: add core4 trial wrapper, north-star updater, and parity plan doc
- core4_eval.py: deterministic wrapper for running repeated Core4 trials
- update_weekly_north_star.py: compute hard-task success rates for STATUS.md
- waa_execution_parity_plan.md: phased plan for WAA execution reliability
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
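As a rough illustration of what "success rates over repeated trials" can mean, the sketch below aggregates per-task pass fractions. It is not taken from `update_weekly_north_star.py` or `core4_eval.py`, whose contents are not shown here: the `(task_id, score)` record format, the function name `success_rates`, and the pass threshold of 1.0 are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(
    trials: Iterable[Tuple[str, float]],
    threshold: float = 1.0,
) -> Dict[str, float]:
    """Return the fraction of passing trials per task.

    Each trial is a hypothetical (task_id, score) pair; a trial passes
    when its score is at least `threshold`.
    """
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for task_id, score in trials:
        total[task_id] += 1
        if score >= threshold:
            passed[task_id] += 1
    return {task: passed[task] / total[task] for task in total}


if __name__ == "__main__":
    demo = [("task_a", 1.0), ("task_a", 0.0), ("task_b", 1.0)]
    print(success_rates(demo))
```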
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 9de5f39 commit 65714ad
6 files changed
Lines changed: 722 additions & 9 deletions
File tree
- docs
- openadapt_evals/benchmarks
- scripts