evaluator/CLAUDE.md

The Lab 4 schema summary evaluator. An Express server that accepts schema summaries, grades them with Claude against a six-criterion rubric, and issues a Credly badge on pass.

Architecture

TypeScript, compiled to dist/ via tsc. npm run dev uses tsx --watch for hot-reload from backend/*.ts; npm start runs the compiled dist/backend/server.js. Dockerfile is multi-stage: build stage compiles TS, runtime stage runs node dist/backend/server.js.

  • backend/server.ts: Express. POST /api/evaluate accepts {session, name, email, summary}. Calls Anthropic via the Grove gateway (grove-gateway-prod.azure-api.net/grove-foundry-prod/anthropic/v1/messages, authenticated with the api-key header) with the rubric as system prompt. On pass, calls backend/credly.ts (unless CREDLY_DRY_RUN=1).
  • backend/credly.ts: Credly v1 API integration. HTTP Basic auth (token as username, blank password). Up to 3 retries on 5xx. 422 with duplicate-badge is treated as success.
  • frontend/index.html: single-file vanilla JS form. Requires ?session=... query param; refuses to render without it. Submits to /api/evaluate and renders the verdict.
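The gateway call in the backend/server.ts bullet can be sketched as a pure request builder. This is a minimal sketch, not the code in backend/server.ts. The gateway host and path, the api-key header, and the rubric-as-system-prompt come from the notes above. The function name buildGroveRequest, the https:// scheme, the max_tokens value, and sending the summary as a single user message are assumptions. The body uses the standard Anthropic Messages fields (model, max_tokens, system, messages); any extra headers the gateway may expect are not shown.

```typescript
// Hypothetical request builder for the Grove gateway call (illustration only).
interface GroveRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Host and path are from the notes above; the https:// scheme is assumed.
const GROVE_URL =
  "https://grove-gateway-prod.azure-api.net/grove-foundry-prod/anthropic/v1/messages";

function buildGroveRequest(
  apiKey: string,
  model: string,
  rubricPrompt: string,
  summary: string,
): GroveRequest {
  return {
    url: GROVE_URL,
    method: "POST",
    headers: {
      "content-type": "application/json",
      // Grove authenticates with api-key, not Anthropic's x-api-key.
      "api-key": apiKey,
    },
    body: JSON.stringify({
      model,
      max_tokens: 2048, // assumed value, not taken from the source
      system: rubricPrompt, // the rubric is the system prompt
      messages: [{ role: "user", content: summary }],
    }),
  };
}
```

Keeping the builder free of I/O makes the header choice easy to unit-test; the caller would pass url, method, headers, and body to whatever HTTP client the server uses.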

Environment variables

  • GROVE_API_KEY: required. Authenticates to the Grove gateway, which proxies Anthropic. Passed as the api-key header (not x-api-key). The key is static and is not shared with any other credential.
  • ANTHROPIC_MODEL: defaults to claude-opus-4-8. Don't downgrade. Model IDs stay Anthropic-shaped because Grove forwards to Anthropic.
  • CREDLY_TOKEN: Credly Acclaim API token. Required in production.
  • CREDLY_ORG_ID: Credly organization UUID. Required in production.
  • CREDLY_BADGE_TEMPLATE_ID: Badge template UUID. Required in production.
  • CREDLY_DRY_RUN: set to 1 for local rehearsal. Logs "would issue badge" without calling Credly.
  • PORT: defaults to 8080.
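The variables above could be read into a typed config object along these lines. This is a sketch under stated assumptions: loadConfig and EvaluatorConfig are hypothetical names, and "required in production" is interpreted here as "required unless CREDLY_DRY_RUN=1", which the source does not spell out.

```typescript
// Hypothetical config loader; variable names and defaults mirror the list above.
interface EvaluatorConfig {
  groveApiKey: string;
  anthropicModel: string;
  credlyDryRun: boolean;
  // null only when running in dry-run mode without Credly settings.
  credly: { token: string; orgId: string; badgeTemplateId: string } | null;
  port: number;
}

type Env = Record<string, string | undefined>;

function loadConfig(env: Env): EvaluatorConfig {
  const groveApiKey = env["GROVE_API_KEY"];
  if (!groveApiKey) throw new Error("GROVE_API_KEY is required");

  const credlyDryRun = env["CREDLY_DRY_RUN"] === "1";
  const token = env["CREDLY_TOKEN"];
  const orgId = env["CREDLY_ORG_ID"];
  const badgeTemplateId = env["CREDLY_BADGE_TEMPLATE_ID"];
  const credly =
    token && orgId && badgeTemplateId ? { token, orgId, badgeTemplateId } : null;

  // Assumption: the Credly settings are mandatory whenever badges are really issued.
  if (!credlyDryRun && credly === null) {
    throw new Error("CREDLY_TOKEN, CREDLY_ORG_ID and CREDLY_BADGE_TEMPLATE_ID are required");
  }

  const rawPort = env["PORT"];
  return {
    groveApiKey,
    anthropicModel: env["ANTHROPIC_MODEL"] || "claude-opus-4-8",
    credlyDryRun,
    credly,
    port: rawPort ? Number(rawPort) : 8080,
  };
}
```

In the server this would be called once at startup with process.env, so a missing key fails fast instead of surfacing on the first submission.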

Rubric coordination

The rubric lives in two places. /rubric.md is the human-readable version. The RUBRIC_PROMPT constant in backend/server.ts is what the grading model sees. Both must stay in sync.

Six criteria, weights sum to 100. Pass threshold: 80, with no criterion at 0.

| # | Criterion | Weight |
|---|-----------|--------|
| 1 | Schema reflects access patterns, not relational normalization habits | 20 |
| 2 | Embed vs. reference decisions are justified with explicit reasoning | 20 |
| 3 | Indexes are present and tied to specific query patterns | 15 |
| 4 | No naive relational translation | 15 |
| 5 | MongoDB-native features are used | 15 |
| 6 | The schema could evolve without a migration | 15 |

Verdict shape

The grading model is instructed to return JSON only, in this exact shape:

{
  "overall_score": 0,
  "overall_verdict": "pass" | "needs-revision",
  "criteria": [
    { "name": "...", "weight": 20, "score": 0, "verdict": "pass" | "partial" | "needs-revision", "feedback": "..." }
  ]
}

Server-side, overall_score is recomputed from the criterion scores (don't trust the model's sum) and overall_verdict is enforced as pass only when overall_score >= 80 AND no criterion scored 0.
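The server-side enforcement described above might look like the following. It is a sketch, not the actual code in backend/server.ts: enforceVerdict and PASS_THRESHOLD are hypothetical names, and it assumes each criterion score is the number of points earned out of that criterion's weight, so the overall score is a plain sum.

```typescript
type CriterionVerdict = "pass" | "partial" | "needs-revision";

interface CriterionResult {
  name: string;
  weight: number;
  score: number; // assumed: points earned, between 0 and weight
  verdict: CriterionVerdict;
  feedback: string;
}

interface Verdict {
  overall_score: number;
  overall_verdict: "pass" | "needs-revision";
  criteria: CriterionResult[];
}

const PASS_THRESHOLD = 80;

// Recompute the total and the verdict; the model's own values are ignored.
function enforceVerdict(fromModel: Verdict): Verdict {
  const overall = fromModel.criteria.reduce((sum, c) => sum + c.score, 0);
  const anyZero = fromModel.criteria.some((c) => c.score === 0);
  return {
    overall_score: overall,
    overall_verdict: overall >= PASS_THRESHOLD && !anyZero ? "pass" : "needs-revision",
    criteria: fromModel.criteria,
  };
}
```

With scores of 20, 20, 15, 15, 15, and 0 the recomputed total is 85, yet the verdict is needs-revision because one criterion scored 0, whatever the model itself reported.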

Calibration

This is the part that needs the most human judgment. Generate a deliberately mediocre summary and a deliberately excellent one, submit both, and tune the rubric prompt until the scores match your own judgment of each. Don't delegate this calibration to the agent. You are the only one who knows what good schema design looks like for this workshop.

Things to be careful about

Don't change the rubric weights or the pass threshold without an explicit conversation with the workshop owner. Other MongoDB skill badges use 80/100 and the calibration is consistent across them.

Don't change the verdict JSON shape without updating the frontend renderer to match. The frontend assumes the exact field names above.

Don't print sensitive data (Grove key, Credly token) in logs. The submission log includes email, name, session, scores, and badge issuance result; it never includes the raw summary text or API credentials.
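One way to keep the log within those limits is to build each entry from an explicit allowlist of fields instead of serializing the request. The sketch below is illustrative only; the names toLogEntry and SubmissionLogEntry and the exact field layout are assumptions, not the evaluator's real log format.

```typescript
// Shape of the POST /api/evaluate payload, as described in the architecture notes.
interface Submission {
  session: string;
  name: string;
  email: string;
  summary: string;
}

// Hypothetical log entry: only the fields the notes allow in the log.
interface SubmissionLogEntry {
  session: string;
  name: string;
  email: string;
  overall_score: number;
  criterion_scores: number[];
  badge_result: string;
}

function toLogEntry(
  submission: Submission,
  overallScore: number,
  criterionScores: number[],
  badgeResult: string,
): SubmissionLogEntry {
  // Fields are copied one by one, so the summary text and any credentials
  // can never reach the log by accident.
  return {
    session: submission.session,
    name: submission.name,
    email: submission.email,
    overall_score: overallScore,
    criterion_scores: criterionScores,
    badge_result: badgeResult,
  };
}
```

Copying fields explicitly, instead of spreading the submission object, means a field added to the request later stays out of the log until someone adds it on purpose.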

The session ID is opaque to the evaluator. It exists for tracking which event a submission came from (e.g., ai-coding-with-mongodb-devday-20260120-newyork). The evaluator records it but doesn't validate it.

Credly issuance is idempotent. A repeated submission with the same email and session should not double-issue. Credly's 422 duplicate response is treated as success, and the dry-run path skips the call entirely.
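The issuance rules (dry-run skips the call, a 422 duplicate counts as success, 5xx responses are retried up to 3 times) can be sketched as a small decision function plus a loop. This is an assumption-laden illustration, not backend/credly.ts: classifyCredlyResponse, issueWithRetry, and the isDuplicateBadge flag are hypothetical, and the sketch leaves open how a duplicate is recognized in Credly's 422 body, so the caller supplies that flag.

```typescript
type CredlyOutcome = "issued" | "already-issued" | "retry" | "failed";

const MAX_RETRIES = 3;

// Pure decision: what to do with one Credly response.
function classifyCredlyResponse(
  status: number,
  isDuplicateBadge: boolean, // hypothetical flag derived from the 422 body
  retriesUsed: number,
): CredlyOutcome {
  if (status >= 200 && status < 300) return "issued";
  // A duplicate means the badge already exists, which is the desired end state.
  if (status === 422 && isDuplicateBadge) return "already-issued";
  if (status >= 500 && status < 600) {
    return retriesUsed < MAX_RETRIES ? "retry" : "failed";
  }
  return "failed";
}

interface CredlyResponse {
  status: number;
  isDuplicateBadge: boolean;
}

// The sender is injected so the loop does not depend on a specific HTTP client.
async function issueWithRetry(
  send: () => Promise<CredlyResponse>,
  dryRun: boolean,
): Promise<CredlyOutcome | "dry-run"> {
  if (dryRun) return "dry-run"; // CREDLY_DRY_RUN=1: never call Credly
  let retriesUsed = 0;
  for (;;) {
    const res = await send();
    const outcome = classifyCredlyResponse(res.status, res.isDuplicateBadge, retriesUsed);
    if (outcome !== "retry") return outcome;
    retriesUsed += 1;
  }
}
```

Reading "up to 3 retries" as one initial attempt plus three more, a persistently failing endpoint is called four times before the outcome becomes failed. A real implementation would likely wait between attempts; the sketch omits any delay.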