|
60 | 60 | {"actor":"sjarmak","comment":null,"created_at":"2026-03-10T12:13:47Z","event_type":"status_changed","id":60,"issue_id":"CodeScaleBench-yb4","new_value":"{\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-yb4\",\"title\":\"Investigate OH/Harbor infrastructure failures before rerun\",\"description\":\"Three distinct infra failures need fixing before rerunning OH verification tasks:\\n\\n1. Harbor FileNotFoundError: django-select-for-update agent ran successfully (614 lines output, 0 crashes) but Harbor crashed writing command-2/return-code.txt. Likely Daytona sandbox cleanup race in ccb_harbor.daytona:GuardedDaytonaEnvironment.\\n\\n2. DinD build failure: bustub-hyperloglog baseline (Claude Haiku sentinel, csb_sdlc_feature_haiku_20260309_223654) — DinD build never completed, no task-level result dir created.\\n\\n3. MCP 6.5hr exception: bustub-hyperloglog MCP (same sentinel run) — ran 6.5 hours then exception_raised. flagged.json shows deepsearch_unused + only 7.86% MCP ratio.\\n\\nAfter fixing these, rerun all 12 tasks using configs/oh_full_rerun_20260310.json. The 9 original verification subset tasks crashed due to jupyter/fget bugs (now fixed in d0fab95). The 3 extra tasks (compliance-124, agentic-122, django-select-for-update) also need rerun. Note: 3 tasks are csb_org_* — verify OH launcher handles org tasks (prior rerun silently skipped them).\\n\\nAlso audit official runs for false positives from the no_changes_guard verifier bug (fixed in c5f261f):\\n grep -rl no_changes_guard runs/official/*/validation_result.json\\n\\nTainted runs (do NOT promote): openhands_sonnet46_20260309_{210054,223658,232947}\",\"status\":\"open\",\"priority\":2,\"issue_type\":\"bug\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-10T11:27:18Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-10T11:27:18Z\"}"} |
61 | 61 | {"actor":"sjarmak","comment":null,"created_at":"2026-03-10T12:14:18Z","event_type":"updated","id":61,"issue_id":"CodeScaleBench-yb4","new_value":"{\"notes\":\"## Investigation Results (2026-03-10)\\n\\n### Issue 1: Harbor FileNotFoundError (django-select-for-update)\\n**Root cause**: NOT a Harbor/Daytona sandbox race. The actual error is `FileNotFoundError: [Errno 2] No such file or directory: 'pkill'` in `/tmp/oh_launcher.py` line 262. Some container images don't have `pkill` installed.\\n**Fix**: Added `shutil.which('pkill')` guard in `agent.py` — falls back to `os.system('kill $(ps aux | ...)')` when pkill is unavailable.\\n\\n### Issue 2: Jupyter fget crash (AttributeError: 'list' object has no attribute 'fget')\\n**Status**: Already fixed in d0fab95. Current code on main correctly uses list comprehension to filter sandbox_plugins.\\n\\n### Issue 3: Bustub-hyperloglog MCP 6.5hr timeout\\n**Root cause**: AgentTimeoutError after hitting 24000s max. Haiku sentinel run with only 7.86% MCP usage. Task-level/model issue, not infra bug. No code fix needed.\\n\\n### Issue 4: Bustub-hyperloglog DinD build failure\\n**Status**: Haiku sentinel run — DinD build never completed. Likely transient. Will be retried in rerun.\\n\\n### no_changes_guard audit\\n**Result**: No `no_changes_guard` references found in any official run result files. No false-positive contamination.\\n\\n### OH launcher org task support\\n**Verified**: `openhands_2config.sh` reads task_dir/benchmark from JSON directly. No filtering that skips csb_org_* tasks. The 3 org tasks in oh_full_rerun_20260310.json will work.\\n\\n### Remaining\\n- The pkill fix needs commit+push\\n- Then rerun all 12 tasks via: `--subset oh_full_rerun_20260310.json`\\n- Tainted staging runs (openhands_sonnet46_20260309_{210054,223658,232133,232947,233609}) must NOT be promoted\"}","old_value":"{\"id\":\"CodeScaleBench-yb4\",\"title\":\"Investigate OH/Harbor infrastructure failures before rerun\",\"description\":\"Three distinct infra failures need fixing before rerunning OH verification tasks:\\n\\n1. Harbor FileNotFoundError: django-select-for-update agent ran successfully (614 lines output, 0 crashes) but Harbor crashed writing command-2/return-code.txt. Likely Daytona sandbox cleanup race in ccb_harbor.daytona:GuardedDaytonaEnvironment.\\n\\n2. DinD build failure: bustub-hyperloglog baseline (Claude Haiku sentinel, csb_sdlc_feature_haiku_20260309_223654) — DinD build never completed, no task-level result dir created.\\n\\n3. MCP 6.5hr exception: bustub-hyperloglog MCP (same sentinel run) — ran 6.5 hours then exception_raised. flagged.json shows deepsearch_unused + only 7.86% MCP ratio.\\n\\nAfter fixing these, rerun all 12 tasks using configs/oh_full_rerun_20260310.json. The 9 original verification subset tasks crashed due to jupyter/fget bugs (now fixed in d0fab95). The 3 extra tasks (compliance-124, agentic-122, django-select-for-update) also need rerun. Note: 3 tasks are csb_org_* — verify OH launcher handles org tasks (prior rerun silently skipped them).\\n\\nAlso audit official runs for false positives from the no_changes_guard verifier bug (fixed in c5f261f):\\n grep -rl no_changes_guard runs/official/*/validation_result.json\\n\\nTainted runs (do NOT promote): openhands_sonnet46_20260309_{210054,223658,232947}\",\"status\":\"in_progress\",\"priority\":2,\"issue_type\":\"bug\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-10T11:27:18Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-10T12:13:47Z\"}"} |
62 | 62 | {"actor":"sjarmak","comment":null,"created_at":"2026-03-10T17:22:52Z","event_type":"closed","id":62,"issue_id":"CodeScaleBench-yb4","new_value":"Investigated and fixed 3 OH infrastructure bugs:\n1. pkill FileNotFoundError — guard with shutil.which(), fallback to os.system()\n2. agent_skills plugin timeout — stripped all sandbox_plugins (jupyter + agent_skills)\n3. chown -R /workspace timeout — patched installed runtime_init.py source to replace chown with no-op\n\nAlso: removed bustub-hyperloglog-impl-001 from active selection (TAC infra incompatible), fixed $DEVICE_NAME in teleport instruction.\n\nSmoke test (3 tasks paired on Daytona) passes: all baselines and MCP configs produce real scores. Ready for 12-task rerun.","old_value":""} |
| 63 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:15:58Z","event_type":"created","id":63,"issue_id":"CodeScaleBench-zrs","new_value":"","old_value":""} |
| 64 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:18:08Z","event_type":"created","id":64,"issue_id":"CodeScaleBench-44x","new_value":"","old_value":""} |
| 65 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:18:13Z","event_type":"created","id":65,"issue_id":"CodeScaleBench-izn","new_value":"","old_value":""} |
| 66 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:18:17Z","event_type":"created","id":66,"issue_id":"CodeScaleBench-6cv","new_value":"","old_value":""} |
| 67 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:18:21Z","event_type":"created","id":67,"issue_id":"CodeScaleBench-csg","new_value":"","old_value":""} |
| 68 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:18:24Z","event_type":"created","id":68,"issue_id":"CodeScaleBench-iv9","new_value":"","old_value":""} |
| 69 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:18:26Z","event_type":"created","id":69,"issue_id":"CodeScaleBench-6or","new_value":"","old_value":""} |
| 70 | +{"actor":"sjarmak","comment":null,"created_at":"2026-03-11T01:40:35Z","event_type":"updated","id":70,"issue_id":"CodeScaleBench-zrs","new_value":"{\"notes\":\"Suite merge map: security(39)=sdlc_secure+org_security+org_compliance | debug(26)=sdlc_debug+org_incident | fix(19)=sdlc_fix | feature(34)=sdlc_feature+org_org | refactor(43)=sdlc_refactor+org_migration | understand(44)=sdlc_understand+sdlc_design+org_domain+org_onboarding | document(11)=sdlc_document | test(12)=sdlc_test | crossrepo(47)=org_crossrepo+org_crossrepo_tracing+org_crossorg+org_platform\"}","old_value":"{\"id\":\"CodeScaleBench-zrs\",\"title\":\"Unified dual-score benchmark: agent always produces both direct edits and answer.json\",\"description\":\"Epic: Every task run yields two independent scores (reward_direct from file edits, reward_artifact from answer.json). No mode switching — agent always does both. Requires changes to agent instructions, verifier infrastructure, result extraction, and all 275 task verifiers.\",\"status\":\"open\",\"priority\":1,\"issue_type\":\"feature\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-11T01:15:58Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-11T01:15:58Z\"}"} |