[Proposal] Optional AI-Enhanced Assessment Checks #273

@vinayada1


Proposal: Optional AI-Enhanced Assessment Checks

Summary

Add optional AI support to improve the quality and accuracy of assessment checks that currently rely on brittle heuristics or require manual human review, and to provide post-evaluation remediation guidance when useful. Users opt in per evaluation with supported provider credentials. AI is never required.

Deterministic evaluation remains the baseline behavior. AI is an optional enhancement layer used only when configured and only for checks where deterministic logic is inconclusive or where advisory remediation guidance would be useful.

Non-Goals

This proposal does not attempt to do the following:

  • Replace deterministic checks as the primary evaluation model.
  • Introduce AI as a mandatory dependency for running the plugin.
  • Commit to identical behavior across every provider or model family in the initial implementation.
  • Solve broader evidence discovery or unrelated workflow changes outside the scope of this plugin proposal.

Motivation

Of the 57 assessment steps currently implemented in the plugin:

  • 2 unconditionally return NeedsReview, offering zero automated analysis:
    • DocumentsTestExecution (OSPS-QA-06.02)
    • DocumentsTestMaintenancePolicy (OSPS-QA-06.03)
  • 8 rely on brittle heuristics that miss valid cases or produce false results.
  • 18 currently return NotRun; some may be candidates for future AI-assisted implementation.

Examples of current gaps:

| Assessment ID | Current Approach | Limitation |
| --- | --- | --- |
| OSPS-BR-04.01 | Case-sensitive match for "Change Log" or "Changelog" | Misses "What's Changed", "Release Notes", "CHANGELOG", or links to CHANGELOG.md |
| OSPS-DO-04.01 | Exact heading match for "Support" in README | Misses "Getting Help", "Community Support", and similar headings |
| OSPS-SA-01.01 | Filename and directory matching | Finding docs/ does not prove design documentation exists |
| OSPS-GV-03.01 | File existence check only | Does not verify quality or completeness of a contribution guide |
| OSPS-QA-06.02 | Always returns NeedsReview | No automated analysis at all |
| OSPS-QA-06.03 | Always returns NeedsReview | No automated analysis at all |

Additionally, the output today is a flat list of per-check verdicts with no advisory remediation guidance. Users who see failed or NeedsReview checks must research the OSPS Baseline themselves to understand what to fix and how.

Proposed Capabilities

1. Per-step AI enhancement

Improve selected checks by adding AI when the deterministic result is inconclusive.

  • If AI is not configured, behavior is unchanged.
  • If AI is configured, deterministic logic still runs first.
  • AI is invoked only when a check remains inconclusive or when the heuristic is likely to miss valid evidence.
  • Each call returns a simple structured response.
  • On timeout or error, the step falls back to the deterministic result.
  • AI-assisted results are clearly labeled in output.
  • Each AI-assisted check defines which files it sends, how large content is trimmed, and how prompt size is kept bounded.
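The deterministic-first flow above can be sketched as follows. All names here (runStep, Verdict, aiAnalyze) are illustrative, not the plugin's actual API; the point is that an unconfigured or failing AI call always leaves the deterministic verdict intact:

```go
package main

import (
	"errors"
	"fmt"
)

// Verdict mirrors the plugin's step outcomes (names are illustrative).
type Verdict string

const (
	Passed      Verdict = "Passed"
	Failed      Verdict = "Failed"
	NeedsReview Verdict = "NeedsReview"
)

// aiAnalyze stands in for the provider call; it may error or time out.
type aiAnalyze func(prompt string) (Verdict, error)

// runStep runs the deterministic check first and consults AI only when the
// result is inconclusive. On AI error it falls back to the deterministic
// verdict, so behavior without AI is unchanged. The bool reports whether
// the result was AI-assisted, so the caller can label it in output.
func runStep(deterministic func() Verdict, ai aiAnalyze, prompt string) (Verdict, bool) {
	v := deterministic()
	if v != NeedsReview || ai == nil {
		return v, false // conclusive, or AI not configured
	}
	aiV, err := ai(prompt)
	if err != nil {
		return v, false // timeout/error: keep deterministic result
	}
	return aiV, true
}

func main() {
	inconclusive := func() Verdict { return NeedsReview }
	flaky := aiAnalyze(func(string) (Verdict, error) { return "", errors.New("timeout") })
	v, assisted := runStep(inconclusive, flaky, "check README for test instructions")
	fmt.Println(v, assisted) // NeedsReview false — fell back on AI error
}
```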

Initial high-impact candidates:

| Assessment ID | AI-enhanced behavior |
| --- | --- |
| OSPS-QA-06.02 | Analyze README and CONTRIBUTING content for test execution instructions |
| OSPS-QA-06.03 | Analyze repository docs for test maintenance expectations |
| OSPS-BR-04.01 | Detect changelog content semantically, not just by exact heading match |
| OSPS-SA-01.01 | Verify whether candidate docs actually contain architecture or design content |

2. AI-assisted remediation guidance

After deterministic checks complete, the plugin can optionally generate advisory remediation guidance for Failed and NeedsReview results.

This capability does not override verdicts or re-evaluate repository evidence. It turns existing results into more actionable guidance by generating:

  • a short explanation of why the requirement likely failed
  • specific remediation steps tied to the OSPS requirement
  • GitHub-specific guidance where relevant
  • a small set of recommended next actions

The input to this pass is intentionally narrow: requirement ID, current verdict, existing step message, recommendation text from the control catalog, and minimal supporting metadata when needed.
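That narrow input could be carried by a type like the following; field names are hypothetical, not the plugin's final structs:

```go
package main

import "fmt"

// RemediationInput carries the intentionally narrow payload for the
// remediation pass: no repository content, only existing results and
// catalog text. Field names are illustrative.
type RemediationInput struct {
	RequirementID  string            // e.g. "OSPS-QA-06.02"
	Verdict        string            // "Failed" or "NeedsReview"
	StepMessage    string            // existing message from the deterministic step
	Recommendation string            // recommendation text from the control catalog
	Metadata       map[string]string // minimal supporting metadata, optional
}

func main() {
	in := RemediationInput{
		RequirementID:  "OSPS-QA-06.02",
		Verdict:        "NeedsReview",
		StepMessage:    "No automated analysis performed",
		Recommendation: "Document how to run the project's tests",
	}
	fmt.Println(in.RequirementID, in.Verdict)
}
```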

Structured Outputs

Per-step AI checks and remediation guidance can use structured outputs when the configured provider supports them. This is the preferred response mode because it makes responses easier to validate and parse.

Initial per-step schema:

```json
{
  "type": "object",
  "properties": {
    "verdict": {
      "type": "string",
      "enum": ["PASS", "FAIL", "UNCERTAIN"]
    },
    "confidence": {
      "type": "number",
      "minimum": 0.0,
      "maximum": 1.0
    },
    "reasoning": {
      "type": "string"
    },
    "evidence_location": {
      "type": "string"
    }
  },
  "required": ["verdict", "confidence", "reasoning"],
  "additionalProperties": false
}
```
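A Go type mirroring this schema, with a validation pass enforcing the enum and range constraints even when the provider cannot guarantee structured outputs (the struct and `validate` helper are a sketch, not the SDK's actual code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StepResult mirrors the per-step schema; json tags match the property names.
type StepResult struct {
	Verdict          string  `json:"verdict"`
	Confidence       float64 `json:"confidence"`
	Reasoning        string  `json:"reasoning"`
	EvidenceLocation string  `json:"evidence_location,omitempty"`
}

// validate enforces the schema's enum, range, and required-field constraints,
// which matters for providers that only emit prompt-shaped JSON.
func validate(r StepResult) error {
	switch r.Verdict {
	case "PASS", "FAIL", "UNCERTAIN":
	default:
		return fmt.Errorf("invalid verdict %q", r.Verdict)
	}
	if r.Confidence < 0 || r.Confidence > 1 {
		return fmt.Errorf("confidence %v out of [0,1]", r.Confidence)
	}
	if r.Reasoning == "" {
		return fmt.Errorf("reasoning is required")
	}
	return nil
}

func main() {
	raw := `{"verdict":"PASS","confidence":0.92,"reasoning":"README documents a test command"}`
	var r StepResult
	if err := json.Unmarshal([]byte(raw), &r); err != nil {
		panic(err)
	}
	fmt.Println(validate(r) == nil) // true
}
```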

Provider Compatibility

Both capabilities share the same AI client infrastructure, and the SDK will be provider-neutral rather than centered on one AI provider.

The initial support model stays narrow:

| Backend | Scope |
| --- | --- |
| OpenAI | First-class native adapter |
| Anthropic | First-class native adapter |
| Gemini | First-class native adapter |

The SDK will expose one provider-neutral Analyze(prompt, content, schema) contract while keeping the initial backend list intentionally small. Structured outputs, authentication, request formats, and tool-calling behavior vary by provider, so broader compatibility should be added only after the first end-to-end checks prove useful.

User Configuration

AI support remains fully opt-in through plugin configuration:

```yaml
plugins:
  github-repo:
    vars:
      owner: "my-org"
      repo: "my-repo"
      token: "ghp_..."
      ai_provider: "openai" # openai | anthropic | gemini
      ai_api_key: "sk-..."
      ai_model: "gpt-4o"
      ai_timeout: "30s"
      ai_max_tokens: 256
      ai_max_calls: 20
```

AI is disabled unless ai_provider, ai_model, and ai_api_key are set. In dry-run mode, only ai_provider and ai_model are required.
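That gating rule reduces to a small predicate; this sketch uses an assumed AIConfig type, with field names following the ai_* vars above:

```go
package main

import "fmt"

// AIConfig holds the ai_* vars from plugin configuration (illustrative type).
type AIConfig struct {
	Provider string // ai_provider
	Model    string // ai_model
	APIKey   string // ai_api_key
}

// aiEnabled reports whether AI calls may be made. In dry-run mode the key
// is not required because no API calls are issued.
func aiEnabled(c AIConfig, dryRun bool) bool {
	if c.Provider == "" || c.Model == "" {
		return false
	}
	return dryRun || c.APIKey != ""
}

func main() {
	fmt.Println(aiEnabled(AIConfig{Provider: "openai", Model: "gpt-4o"}, false)) // false: no key
	fmt.Println(aiEnabled(AIConfig{Provider: "openai", Model: "gpt-4o"}, true))  // true: dry-run
}
```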

Where the Work Lives

The implementation spans two repos:

| Layer | Repo | What |
| --- | --- | --- |
| AI client infrastructure | privateerproj/privateer-sdk | Provider-neutral AIClient with native adapters for OpenAI, Anthropic, and Gemini |
| Per-step AI enhancements | ossf/pvtr-github-repo-scanner | Step-specific prompts, content extraction, fallback behavior, and [AI-Assisted] result labeling |
| Post-evaluation remediation | ossf/pvtr-github-repo-scanner | Advisory remediation guidance generated from evaluation results |

Security Considerations

  • API keys are handled like GitHub tokens: read from config and never logged.
  • Only publicly accessible repository content and evaluation results are included in prompts.
  • AI responses are constrained to structured verdicts for per-step checks.
  • AI calls go only to the configured provider endpoint for the selected native backend.

Cost Considerations

  • AI is invoked only when configured and only for inconclusive checks or remediation guidance.
  • Focused prompts and bounded ai_max_tokens keep calls constrained.
  • ai_max_calls provides a hard cap per evaluation.
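A per-evaluation hard cap can be a simple counter; the callBudget type here is a sketch, assuming steps that fail to reserve budget simply keep their deterministic result:

```go
package main

import (
	"errors"
	"fmt"
)

// callBudget enforces the ai_max_calls hard cap per evaluation.
type callBudget struct {
	max, used int
}

var errBudgetExhausted = errors.New("ai_max_calls budget exhausted")

// take reserves one call, or fails once the cap is hit; a step that cannot
// get budget falls back to its deterministic result.
func (b *callBudget) take() error {
	if b.used >= b.max {
		return errBudgetExhausted
	}
	b.used++
	return nil
}

func main() {
	b := &callBudget{max: 2}
	fmt.Println(b.take(), b.take(), b.take()) // <nil> <nil> ai_max_calls budget exhausted
}
```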

Implementation Phases

Phase 1: AI client infrastructure (privateer-sdk)

  • Add provider-neutral AI client configuration and backend selection
  • Add Analyze(prompt, content, schema) with structured output support when available and prompt-shaped JSON fallback otherwise
  • Add native adapters for OpenAI, Anthropic, and Gemini
  • Add shared handling for common failures so all checks fall back the same way
  • Add result metadata so AI-assisted results show which provider, model, prompt version, and schema version were used
  • Add call-budget controls with ai_max_calls
  • Add --dry-run-ai for prompt inspection and cost estimation without API calls

Phase 2: High-impact per-step checks (pvtr-github-repo-scanner)

  • DocumentsTestExecution
  • DocumentsTestMaintenancePolicy
  • EnsureLatestReleaseHasChangelog
  • HasDesignDocumentation

Phase 3: AI-assisted remediation guidance (pvtr-github-repo-scanner)

  • Add a post-processing pass over the output payload
  • Generate advisory remediation guidance for failed and NeedsReview results
  • Keep deterministic results unchanged

Phase 4: Medium-impact per-step checks

  • HasSupportDocs
  • HasContributionGuide
  • NoBinariesInRepo
  • CicdSanitizedInputParameters

Phase 5: Audit NotRun checks for AI-first candidates

Some NotRun checks may be better implemented as AI-first checks rather than heuristic checks. That audit should happen only after the first phases prove useful.

Open Questions

  1. Should AI-assisted results be weighted differently from deterministic results in aggregate scoring?
  2. Should remediation output live in a separate companion file or be embedded into the standard output?
  3. What is the right default budget for ai_max_calls once real usage data exists?
