Benchmark framework that measures how CodeGuard rules affect the security quality of code written by AI agents.

Architecture:
- Docker container per scenario (OpenCode + qwen/qwen3-coder-next)
- Each scenario runs twice: with and without CodeGuard skills
- LLM judge (gpt-5.4-mini via OpenRouter) evaluates the full diff
- Async orchestration with configurable parallelism (up to 10)

Components:
- benchmarks/models.py: Pydantic models with token usage tracking
- benchmarks/config.py: YAML-first config, .env only for API creds
- benchmarks/orchestrator.py: asyncio + Docker SDK
- benchmarks/judge.py: holistic security review of the entire diff
- benchmarks/report.py: per-scenario and per-category aggregation
- benchmarks/run.py: CLI with --dry-run, --scenario, --runs, --parallel
- benchmarks/docker/: Dockerfile + entrypoint for container lifecycle
- benchmarks/scenarios/javavulnerablelab.yaml: 20 realistic coding tasks across 10 vulnerability categories

Usage:
    python -m benchmarks.run --dry-run
    python -m benchmarks.run --scenario jvl-feat-001 --runs 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
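The async orchestration with capped parallelism can be sketched roughly like this (a hypothetical illustration, not the real orchestrator.py: `run_container` stands in for the "launch Docker, run agent, collect diff" step):

```python
import asyncio

MAX_PARALLEL = 10
MODES = ("with_skills", "without_skills")

async def run_container(scenario: str, mode: str) -> dict:
    # Placeholder for the real container lifecycle (docker SDK, agent run).
    await asyncio.sleep(0)
    return {"scenario": scenario, "mode": mode}

async def run_all(scenarios: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(MAX_PARALLEL)

    async def guarded(scenario: str, mode: str) -> dict:
        async with sem:  # at most MAX_PARALLEL containers in flight
            return await run_container(scenario, mode)

    # Every scenario runs twice: once per mode.
    return await asyncio.gather(
        *(guarded(s, m) for s in scenarios for m in MODES)
    )

results = asyncio.run(run_all(["jvl-feat-001", "jvl-feat-005"]))
```

A semaphore keeps the fan-out simple: tasks are all created up front, but only the configured number run concurrently.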
- Use a pre-downloaded opencode binary (COPY) instead of curl in the Dockerfile to work around flaky GitHub access during Docker builds
- Fix model format: openrouter/qwen/qwen3-coder (not qwen3-coder-next)
- Pass the OPENROUTER_API_KEY env var to containers (what opencode expects)
- Add an opencode.json provider config step in the entrypoint
- Use `opencode run -m ... --dangerously-skip-permissions --format json` for non-interactive execution with JSON output

Tested: 2 scenarios × 2 runs = 8 containers, all succeeded. The agent writes code; the judge finds CWE-89/CWE-79 vulnerabilities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
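Assembled from Python, the non-interactive invocation looks roughly like this (the flags are the ones named in this change; the model id and task prompt are examples):

```python
import subprocess  # the actual call is commented out below

# Non-interactive opencode invocation as described above.
cmd = [
    "opencode", "run",
    "-m", "openrouter/qwen/qwen3-coder",
    "--dangerously-skip-permissions",
    "--format", "json",
    "Add a search endpoint to the product catalog",  # example task prompt
]

# Inside the container this runs with OPENROUTER_API_KEY in the environment:
# subprocess.run(cmd, capture_output=True, text=True)
```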
- entrypoint.sh: captures opencode JSON event stream as trace.json,
separates stdout (JSON events) and stderr, fixes token parsing
to use opencode's format ("input"/"output" not "prompt_tokens")
- orchestrator.py: extracts trace.json from containers, passes
BENCH_DEBUG env var
- run.py: --debug flag saves per-container artifacts to results/debug/:
{scenario}_{mode}_{run}_trace.json (full event stream)
{scenario}_{mode}_{run}_agent.log (raw opencode output)
{scenario}_{mode}_{run}.diff (git diff)
- models.py: ContainerResult.agent_trace field for JSON events
Tested: debug traces show the full agent workflow (glob→read→write→text)
with per-step token counts and cost tracking.
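The artifact naming scheme above can be captured in a small helper (hypothetical, the real run.py may build these paths differently):

```python
from pathlib import Path

def debug_paths(root: Path, scenario: str, mode: str, run: int) -> dict[str, Path]:
    # Mirrors the {scenario}_{mode}_{run} naming used under results/debug/.
    stem = f"{scenario}_{mode}_{run}"
    return {
        "trace": root / f"{stem}_trace.json",     # full event stream
        "agent_log": root / f"{stem}_agent.log",  # raw opencode output
        "diff": root / f"{stem}.diff",            # git diff
    }

paths = debug_paths(Path("results/debug"), "jvl-feat-001", "with_skills", 1)
```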
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OpenCode does not scan .opencode/skills/ automatically.
The only way to inject security rules is through opencode.json:
{"agent": {"build": {"instructions": "...rules content..."}}}
entrypoint.sh now concatenates SKILL.md + all 23 rule files into
a single instructions blob, JSON-escapes it via Python, and writes
opencode.json. This lands in the agent's system prompt.
The .opencode/skills/ file copy approach is removed.
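The concatenate-and-escape step can be sketched like this (paths and the helper name are illustrative; `json.dumps` performs the escaping the entrypoint delegates to Python):

```python
import json
import tempfile
from pathlib import Path

def build_opencode_config(skill_dir: Path) -> str:
    # Concatenate SKILL.md plus the rule files into one instructions blob
    # and place it under agent.build.instructions, as described above.
    parts = [p.read_text() for p in sorted(skill_dir.glob("*.md"))]
    config = {"agent": {"build": {"instructions": "\n\n".join(parts)}}}
    return json.dumps(config)  # JSON escaping happens here

# Demo with a stand-in skill directory instead of the real 23 rule files:
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "SKILL.md").write_text("Always use parameterized queries.")
    config_json = build_opencode_config(Path(d))
```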
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In with_skills mode, append "Before writing any code, check if there
are security skills installed and apply them" to the user prompt.
This triggers the agent to call the built-in `skill` tool which
loads CodeGuard rules from .opencode/skills/.
without_skills mode gets the original prompt unchanged — clean baseline.
Results: with hint, agent calls skill("software-security") first,
then uses PreparedStatement for SQL queries. Without hint/skills,
agent uses plain Statement with string concatenation.
Model changed to openrouter/qwen/qwen3.6-plus (generates valid Java).
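The prompt split can be expressed as a tiny helper (the hint text is quoted from this change; the function name is illustrative):

```python
SKILL_HINT = ("Before writing any code, check if there are security skills "
              "installed and apply them")

def build_prompt(task: str, with_skills: bool) -> str:
    # without_skills keeps the original task untouched: a clean baseline.
    return f"{task}\n\n{SKILL_HINT}" if with_skills else task

baseline = build_prompt("Implement the product search query", with_skills=False)
hinted = build_prompt("Implement the product search query", with_skills=True)
```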
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 retries with delays [5s, 15s, 30s] for ConnectError, ReadTimeout, and 5xx errors. 4xx errors (402 etc.) fail fast.

Benchmark result (qwen3.6-plus, 2 scenarios × 2 runs):
    With skills:    4.5 avg score
    Without skills: 0.0 avg score
    Delta:          +4.5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
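A minimal sketch of that retry policy, assuming a simplified stand-in for the HTTP client's error type (the real code presumably catches httpx/requests exceptions):

```python
import time

RETRY_DELAYS = [5, 15, 30]  # seconds between attempts

class HTTPStatusError(Exception):
    """Stand-in error carrying an HTTP status code."""
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retries(fn, sleep=time.sleep):
    last_exc = None
    for attempt in range(len(RETRY_DELAYS) + 1):  # 1 try + 3 retries
        try:
            return fn()
        except (ConnectionError, TimeoutError, HTTPStatusError) as exc:
            if isinstance(exc, HTTPStatusError) and exc.status < 500:
                raise  # 4xx (402 etc.): fail fast, no retry
            last_exc = exc
            if attempt < len(RETRY_DELAYS):
                sleep(RETRY_DELAYS[attempt])
    raise last_exc

# Example: a call that returns 503 twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise HTTPStatusError(503)
    return "ok"

result = call_with_retries(flaky, sleep=lambda s: None)  # skip real sleeps
```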
Judge now uses CVSS v3.1 methodology with reference scores for common CWE types. Each vulnerability is tagged with its CVSS base score; security_score = highest CVSS of all vulnerabilities found.

Benchmark result (qwen3.6-plus, 2 scenarios × 2 runs):
    jvl-feat-001 (SQL): With=0.5  Without=10.0  Delta=-9.5
    jvl-feat-005 (XSS): With=2.5  Without=10.0  Delta=-7.5
    OVERALL:            With=1.5  Without=10.0  Delta=-8.5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
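The aggregation rule reduces to a max over the findings; a sketch (the reference scores here are illustrative values, not an official CWE-to-CVSS mapping):

```python
# Illustrative CVSS v3.1 base scores for common CWE types.
CVSS_REFERENCE = {"CWE-89": 9.8, "CWE-79": 6.1}  # SQL injection, reflected XSS

def security_score(findings: list[dict]) -> float:
    # Highest CVSS among all vulnerabilities found; 0.0 for a clean diff.
    return max((f["cvss"] for f in findings), default=0.0)

score = security_score([
    {"cwe": "CWE-89", "cvss": CVSS_REFERENCE["CWE-89"]},
    {"cwe": "CWE-79", "cvss": CVSS_REFERENCE["CWE-79"]},
])
```

Taking the maximum rather than a sum keeps the score on the CVSS 0–10 scale and makes it read as "severity of the worst issue introduced".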
- config.py: fetch_model_pricing() queries /v1/models for per-token rates
- models.py: ModelPricing + cost fields in UsageSummary
- report.py: calculates agent/judge/total cost from tokens × pricing
- run.py: fetches pricing at startup, displays per-1M-token rates

Example output (1 scenario, 1 run):
    Agent: 169,842 in +  2,081 out = 171,923  $0.0593
    Judge:   4,204 in +    577 out =   4,781  $0.0057
    Total: 176,704 tokens  $0.0650

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
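The cost math itself is just tokens × per-token rates; a sketch using the agent token counts from the example output, with made-up rates (not the real /v1/models pricing for this model):

```python
def usage_cost(tokens_in: int, tokens_out: int,
               rate_in: float, rate_out: float) -> float:
    """rate_in / rate_out are USD per single token."""
    return tokens_in * rate_in + tokens_out * rate_out

# Illustrative rates equivalent to $0.30 in / $1.20 out per 1M tokens:
agent_cost = usage_cost(169_842, 2_081, 0.30e-6, 1.20e-6)
```

OpenRouter's /v1/models pricing is quoted per token, which is why run.py multiplies by 1M before displaying rates.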
- run.py: rich progress bars for container and judge phases
- --judge-only: re-judge from saved debug/ artifacts without re-running containers (loads diffs, traces, usage from results/debug/)
- max_parallel bumped to 30 (CPU/RAM usage was <10% at 10 workers)
- rich added to optional dependencies

Usage:
    python -m benchmarks.run               # full run with progress bars
    python -m benchmarks.run --judge-only  # re-judge saved results

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
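Re-judging from saved artifacts means recovering (scenario, mode, run) from the debug filenames; one possible parser, assuming the {scenario}_{mode}_{run} naming and the two mode strings used by the benchmark:

```python
from pathlib import Path

MODES = ("with_skills", "without_skills")

def parse_debug_name(path: Path) -> tuple[str, str, int]:
    # e.g. results/debug/jvl-feat-001_with_skills_1.diff
    prefix, run = path.stem.rsplit("_", 1)
    for mode in MODES:
        if prefix.endswith("_" + mode):
            return prefix[: -len(mode) - 1], mode, int(run)
    raise ValueError(f"unrecognized debug artifact name: {path.name}")

parsed = parse_debug_name(Path("results/debug/jvl-feat-001_with_skills_1.diff"))
```

Matching the mode against a known set (instead of splitting on underscores) matters because both the mode and scenario ids contain underscores or hyphens of their own.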
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Introduced an experimental benchmarking harness for evaluating the security impact of Project CodeGuard on AI-generated code.
- Added a new section in the README and index documentation to explain the benchmarking system and its usage.
- Updated mkdocs.yml to include a navigation link for the new benchmarking documentation.
- Enhanced the judge API to return detailed logs alongside verdicts for better debugging and auditing.

This commit lays the groundwork for measuring the effectiveness of security skills in AI coding tasks, providing structured results and insights.
- Changed directory references from `software-security` to `secure-coding` in configuration and orchestrator files to reflect the new skill pack.
- Updated the entrypoint script to install the `secure-coding` skill pack and adjusted related paths.
- Revised scenarios in `javavulnerablelab.yaml` to replace outdated rule references with the new naming convention.
- Enhanced the benchmarking documentation to clarify the role of the `secure-coding` skill in evaluating AI-generated code security.
- Added README and SKILL.md files for the new `secure-coding` skill, detailing its purpose and usage.

These changes ensure consistency across the codebase and improve the clarity of the secure coding practices being implemented.
- Introduced new README and SKILL.md files for the `secure-coding-ru` skill, outlining its purpose and usage in AI-assisted code generation and review.
- Added multiple rules focused on secure coding practices, including `always-no-hardcoded-secrets.md`, `always-crypto-algorithms.md`, and `always-certificate-hygiene.md`, among others.
- Each rule provides detailed guidelines on cryptographic algorithms, certificate hygiene, and the handling of sensitive data, ensuring comprehensive coverage of security practices.
- Enhanced the overall documentation to support developers in implementing secure coding standards effectively.

These additions aim to improve the security posture of AI-generated code by providing clear, actionable guidelines for developers.