Skip to content

feat(wave4): Full Techniques graph, enhanced grader, scoring, technique API, BYOK integration + UI (#234-#243)#259

Merged
ComBba merged 3 commits into
mainfrom
feat/group-c-wave4
Feb 9, 2026
Merged

feat(wave4): Full Techniques graph, enhanced grader, scoring, technique API, BYOK integration + UI (#234-#243)#259
ComBba merged 3 commits into
mainfrom
feat/group-c-wave4

Conversation

@ComBba

@ComBba ComBba commented Feb 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Wave 4 of the Fairthon 75-Technique Evaluation System migration (Epic #223).

Implements Group C (LangGraph full techniques pipeline), Group D (BYOK integration + UI), and Group E (API, scoring, enhanced grading).

10 issues resolved. 159 new tests. 1003 total tests passing, 0 failures.

⚠️ Depends on PR #258 (Wave 2 + Wave 3). Merge #258 first, then this PR.

Issues Resolved

Group C — LangGraph Full Techniques Pipeline

Issue Title Key Files
#234 full_techniques graph graph/full_techniques_graph.py — 8 parallel category nodes + deep_synthesis + finalize
#235 8 category nodes graph/nodes/technique_categories/ — BaseCategoryNode + 8 concrete nodes
#236 Deep Synthesis + Finalize deep_synthesis.py (conflict detection), finalize.py (quality gate, scores)

Group D — BYOK

Issue Title Key Files
#239 BYOK Settings UI frontend/src/app/settings/api-keys/page.tsx, components/settings/APIKey*.tsx
#240 BYOK pipeline integration services/evaluation_service.py — stored key lookup with priority chain

Group E — API + Grading

Issue Title Key Files
#237 Scoring aggregation criteria/scoring.py — confidence adjustment, exclusion normalization
#241 Technique API api/routes/techniques.py — 3 GET endpoints
#242 Enhanced CodeGrader agents/code_grader.py — expanded 5→17 BMAD items
#243 LLM Grader criteria/llm_grader.py — subjective item grading with retry

Architecture

Full Techniques Graph

START → [RAG Enrich] → [Web Search] → [Code Analysis]
                          │
     ┌────────────────────┼────────────────────┐
     │    │    │    │    │    │    │    │
     ▼    ▼    ▼    ▼    ▼    ▼    ▼    ▼
   Aroma Palate Body Finish Balance Vintage Terroir Cellar
     │    │    │    │    │    │    │    │
     └────┴────┴────┴────┴────┴────┴────┘
                    │
              Deep Synthesis (conflict detection)
                    │
              Finalize (quality gate: PASS/CONCERNS/FAIL/INCOMPLETE)
                    │
                   END

BYOK Key Priority

Request API key (explicit) → Stored BYOK key (from /settings) → System default

Grading Architecture

CodeGraderAgent.evaluate_all()
├── Objective (7 items): D1, D2, C1, C2, C4, B1, B2  → heuristic scoring
└── Subjective (11 items): A1-A4, B3-B4, C3, C5, D3-D4  → LLMGrader
     └── grade_item(item_id, repo_context, llm)
         ├── Assemble context (≤8000 chars)
         ├── Apply grading prompt from grading_prompts.py
         └── Parse + validate response (2 retries)

New Files (26)

Backend (22)

  • app/criteria/scoring.py — Confidence adjustment, exclusion normalization, coverage
  • app/criteria/grading_prompts.py — 11 structured prompt templates
  • app/criteria/llm_grader.py — LLMGrader class
  • app/api/routes/techniques.py — 3 GET endpoints
  • app/graph/full_techniques_graph.py — StateGraph definition
  • app/graph/nodes/technique_categories/ — 10 files (base + 8 nodes + init)
  • app/graph/nodes/technique_categories/deep_synthesis.py
  • app/graph/nodes/technique_categories/finalize.py
  • 11 test files (159 new tests total)

Frontend (4)

  • src/app/settings/api-keys/page.tsx — Protected settings page
  • src/components/settings/APIKeyManager.tsx — Orchestrator
  • src/components/settings/APIKeyForm.tsx — Key registration form
  • src/components/settings/APIKeyList.tsx — Key list with delete

Modified Files (6)

  • app/agents/code_grader.py — Expanded to 17 items (backward compatible)
  • app/graph/graph_factory.py — Registered full_techniques mode (3 modes total)
  • app/main.py — Registered techniques router
  • app/services/evaluation_service.py — BYOK stored key lookup + full_techniques progress
  • frontend/src/components/UserMenu.tsx — Added API Keys link
  • frontend/src/lib/api.ts — Added BYOK API methods

Testing

1003 passed, 0 failed
Category Tests New
Scoring 15 +15
Technique API 20 +20
Enhanced Grader 15 +15
LLM Grader 22 +22
Category Nodes 13 +13
Full Techniques Graph 10 +10
Deep Synthesis 3 +3
Finalize 6 +6
BYOK Integration 9 +9
Graph Factory (updated) 10 +1 (fix for 3 modes)
Existing (no regressions) 880

What's Next

Closes #234, Closes #235, Closes #236, Closes #237, Closes #239, Closes #240, Closes #241, Closes #242, Closes #243

Summary by CodeRabbit

릴리스 노트

  • 새로운 기능
    • API 키 관리 페이지 추가: 사용자가 Google Gemini API 키를 등록하고 관리할 수 있습니다.
    • 코드 평가 항목 확장: 5개에서 17개의 평가 항목으로 증가하여 더욱 포괄적인 분석을 제공합니다.
    • 기법 API 엔드포인트 추가: 카테고리, 모드, 우선순위별 기법 목록 조회 및 통계 조회 기능을 추가했습니다.
    • 새로운 평가 모드: 다층 평가 및 종합 분석을 포함한 완전한 기법 기반 평가 워크플로우를 지원합니다.

…ue API, BYOK integration + UI (#234-#243)

Wave 4: Groups C + D + E — 10 issues implemented

Group C — LangGraph Full Techniques Pipeline:
- #234: full_techniques_graph.py — 8 parallel category nodes + deep_synthesis + finalize
- #235: 8 BaseCategoryNode subclasses (aroma→cellar) with TechniqueRouter integration
- #236: deep_synthesis (conflict detection) + finalize (quality gate, category scores)

Group D — BYOK:
- #239: Frontend settings page /settings/api-keys with APIKeyManager, Form, List
- #240: Stored BYOK key integration into evaluation pipeline (priority: request > stored > system)

Group E — API + Grading:
- #237: scoring.py — confidence adjustment, exclusion normalization, coverage calculation
- #241: GET /api/techniques, /techniques/stats, /techniques/{id} endpoints
- #242: Enhanced CodeGraderAgent (5 → 17 BMAD items, 7 objective + 11 subjective)
- #243: LLMGrader class for subjective items with retry logic, context assembly

Also:
- Registered full_techniques mode in graph_factory.py (3 modes total)
- Updated evaluation_service.py for full_techniques progress tracking (10 steps)
- Added API Keys link to UserMenu

Tests: 1003 passed, 0 failed (159 new tests added)
@vercel

vercel Bot commented Feb 9, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
somm-dev Ready Ready Preview, Comment Feb 9, 2026 8:51am

Request Review

@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@coderabbitai

coderabbitai Bot commented Feb 9, 2026

Copy link
Copy Markdown

Walkthrough

8개의 병렬 범주 노드와 심층 합성 기능을 갖춘 전체 기법 그래프를 구현하고, BMAD 항목을 5개에서 17개로 확장하며, LLM 기반 주관적 채점, 기법 API 엔드포인트, 프론트엔드 API 키 관리 UI를 추가합니다.

Changes

Cohort / File(s) Summary
Code Grader Enhancement
backend/app/agents/code_grader.py, backend/tests/test_enhanced_code_grader.py
EVALUABLE_ITEMS를 5개에서 17개로 확장(A1-D4), objective/subjective 항목 분리, evaluate_all() 비동기 메서드 추가, LLMGrader 통합, _evaluate_code_quality/testing_coverage 내부 메서드 추가.
Criteria & Scoring Modules
backend/app/criteria/grading_prompts.py, backend/app/criteria/llm_grader.py, backend/app/criteria/scoring.py, backend/tests/test_llm_grader.py, backend/tests/test_scoring.py
주관적 BMAD 항목 평가용 구조화된 프롬프트 프레임워크, LLM 기반 채점 클래스(재시도 로직/컨텍스트 조립/응답 검증), 신뢰도 조정 및 제외 정규화 점수 계산 기능 구현.
Techniques API
backend/app/api/routes/techniques.py, backend/tests/test_technique_api.py
기법 목록/상세/통계 조회 엔드포인트(GET /techniques, /techniques/{id}, /techniques/stats), 모드/카테고리/해시 필터링, 75개 기법 모두 지원, TechniqueRegistry 통합.
Full Techniques Graph Core
backend/app/graph/full_techniques_graph.py, backend/app/graph/graph_factory.py, backend/tests/test_full_techniques_graph.py
8개 병렬 범주 노드 + 심층 합성 + 최종화 노드를 포함하는 StateGraph 생성, graph_factory에 full_techniques 모드 등록, 완전한 평가 파이프라인 구성.
Category Base & Nodes
backend/app/graph/nodes/technique_categories/base_category.py, backend/app/graph/nodes/technique_categories/{aroma,palate,body,finish,balance,vintage,terroir,cellar}_node.py, backend/app/graph/nodes/technique_categories/__init__.py, backend/tests/test_category_nodes.py
BaseCategoryNode 추상 기본 클래스(평가 파이프라인/이벤트 방출/결과 집계), 8개 구체적 카테고리 노드(각각 TechniqueCategory 및 display_name 설정), 공개 API로 노드 클래스 재내보내기.
Deep Synthesis & Finalize
backend/app/graph/nodes/technique_categories/deep_synthesis.py, backend/app/graph/nodes/technique_categories/finalize.py, backend/tests/test_deep_synthesis.py, backend/tests/test_finalize_node.py
Deep synthesis: 충돌하는 점수 감지 및 교차 참조, Finalize: 신뢰도 조정 적용/차원별 점수 계산/품질 게이트(PASS/CONCERNS/FAIL/INCOMPLETE) 생성.
Evaluation Service Integration
backend/app/services/evaluation_service.py, backend/tests/test_byok_integration.py
_get_stored_key 헬퍼 추가(암호화된 저장된 BYOK 키 검색), _prepare_repo_context 확장(user_id 매개변수/저장된 키 조회/api_key_source 추적), full_techniques 모드 평가 진행률 지원.
Frontend API Key Management
frontend/src/app/settings/api-keys/page.tsx, frontend/src/components/settings/{APIKeyManager,APIKeyForm,APIKeyList}.tsx, frontend/src/components/UserMenu.tsx, frontend/src/lib/api.ts, backend/app/main.py
API 키 설정 페이지 및 관리 컴포넌트(폼/목록/매니저), registerKey/getKeyStatus/deleteKey/validateKey API 메서드, 사용자 메뉴에 API 키 링크 추가, techniques 라우터 마운트.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant EvalService as Evaluation Service
    participant Graph as Full Techniques Graph
    participant CatNodes as Category Nodes<br/>(8 parallel)
    participant Router as Technique Router
    participant DeepSynth as Deep Synthesis
    participant Finalize as Finalize Node
    
    User->>EvalService: start_evaluation(repo, user_id, full_techniques)
    EvalService->>EvalService: _prepare_repo_context(user_id)
    EvalService->>EvalService: retrieve stored BYOK key if available
    EvalService->>Graph: invoke(state)
    Graph->>CatNodes: Send() → 8 parallel category nodes
    
    par Category Execution
        CatNodes->>Router: select(category=AROMA, mode=full_techniques)
        Router-->>CatNodes: techniques list
        CatNodes->>CatNodes: execute techniques sequentially
        CatNodes-->>Graph: partial item_scores + methodology_trace
    and
        CatNodes->>Router: select(category=PALATE, mode=full_techniques)
        Router-->>CatNodes: techniques list
        CatNodes->>CatNodes: execute techniques sequentially
        CatNodes-->>Graph: partial item_scores + methodology_trace
    and
        Note over CatNodes: ... (6 more categories in parallel)
    end
    
    Graph->>DeepSynth: merge all category results
    DeepSynth->>DeepSynth: cross-reference conflicting scores
    DeepSynth->>DeepSynth: flag conflicts (score_range > 2.0)
    DeepSynth-->>Graph: consolidated item_scores + conflict details
    
    Graph->>Finalize: apply scoring aggregation
    Finalize->>Finalize: adjust_score_by_confidence()
    Finalize->>Finalize: calculate per-dimension scores (A/B/C/D)
    Finalize->>Finalize: determine quality_gate (PASS/CONCERNS/FAIL)
    Finalize-->>Graph: final evaluation_summary + metadata
    
    Graph-->>EvalService: completed state
    EvalService-->>User: evaluation result with 17 BMAD items scored
Loading
sequenceDiagram
    participant CodeGrader as Code Grader Agent
    participant ObjEval as Objective Evaluators<br/>(C1, C2, etc.)
    participant LLMGrader as LLM Grader
    participant Gemini as Gemini LLM
    
    CodeGrader->>ObjEval: evaluate_code_quality(repo_context)
    ObjEval->>ObjEval: parse code metrics, linting config
    ObjEval-->>CodeGrader: C1, C2 ItemScores
    
    CodeGrader->>CodeGrader: iterate subjective items (A1-A4, B3-B4, C3, C5, D3-D4)
    
    loop For each subjective item
        CodeGrader->>LLMGrader: grade_item(item_id, repo_context, llm)
        LLMGrader->>LLMGrader: _assemble_context(repo_context, max_chars=8000)
        LLMGrader->>LLMGrader: format_prompt(item_id, context)
        
        rect rgba(200, 150, 100, 0.5)
            LLMGrader->>Gemini: ainvoke(prompt, json_mode=True)
            Gemini-->>LLMGrader: {score, confidence, evidence, rationale}
        end
        
        LLMGrader->>LLMGrader: _validate_response(response, item_id)
        alt Validation fails (retry up to 2x)
            LLMGrader->>Gemini: retry with clearer prompt
        end
        
        LLMGrader->>LLMGrader: _update_usage(tokens, cost)
        LLMGrader-->>CodeGrader: ItemScore (subjective)
    end
    
    CodeGrader->>CodeGrader: aggregate all 17 item_scores
    CodeGrader-->>CodeGrader: return item_scores + usage metrics
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

enhancement, graph, criteria, api, frontend, feature

Poem

🐰 열일곱 항목을 채점하는 토끼,
여덟 갈래 길로 기법들 흐르고,
심층 합성으로 점수를 모아,
API 키 설정도 우아하게—
전체 기법 평가, 이제 완성! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.83% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed Pull request title directly and comprehensively summarizes the main change: Wave 4 implementation covering full techniques graph, enhanced grader, scoring, technique API, BYOK integration and UI across multiple groups.
Linked Issues check ✅ Passed 모든 9개 링크된 이슈의 목표가 충족됨: Full Techniques graph (#234), 8개 카테고리 노드 (#235), Deep Synthesis/Finalize (#236), 채점 모듈 (#237), BYOK UI (#239), BYOK 통합 (#240), Technique API (#241), 향상된 Code Grader (#242), LLM Grader (#243). 파일 추가/수정 모두 해당 목표 달성을 위해 존재.
Out of Scope Changes check ✅ Passed 범위 밖의 변경 사항 없음: 모든 파일 변경은 Wave 4 구현(그래프, 채점, API, BYOK, UI)의 필수 요소. 파일 추가(26개 backend, 4개 frontend)와 수정(graph_factory, evaluation_service, main, code_grader, UserMenu)은 모두 명시적 목표와 직결됨.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/group-c-wave4

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist

Copy link
Copy Markdown

Summary of Changes

Hello @ComBba, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Fairthon 75-Technique Evaluation System by introducing a full-fledged evaluation pipeline capable of running all 75 techniques across various categories. It also upgrades the grading system to cover all 17 BMAD items, leveraging LLMs for subjective assessments and implementing advanced scoring logic. Furthermore, it provides a dedicated API for technique management and integrates a "Bring Your Own Key" (BYOK) feature, allowing users greater control over their evaluation resources. These changes collectively expand the system's capabilities, improve evaluation depth, and enhance user flexibility.

Highlights

  • Full Techniques Evaluation Graph: Introduced a comprehensive LangGraph pipeline (full_techniques_graph.py) that orchestrates 8 parallel category-specific evaluation nodes (e.g., Aroma, Palate), followed by deep synthesis for conflict detection and a finalization step for quality gating and scoring.
  • Enhanced Grader & Scoring: Expanded the CodeGraderAgent to evaluate all 17 BMAD items, integrating a new LLMGrader for subjective evaluations and a refined scoring.py module for confidence-adjusted and exclusion-normalized scores.
  • Technique API: Implemented a new API (api/routes/techniques.py) to list, filter, and retrieve detailed information about the 75 evaluation techniques, including aggregated statistics.
  • BYOK (Bring Your Own Key) Integration: Added functionality to allow users to register, store, and manage their own API keys (evaluation_service.py, frontend/src/app/settings/api-keys/page.tsx, etc.), with a priority chain for key usage.
Changelog
  • backend/app/agents/code_grader.py
    • Expanded EVALUABLE_ITEMS from 5 to 17 BMAD items.
    • Introduced OBJECTIVE_ITEMS and SUBJECTIVE_ITEMS lists.
    • Added evaluate_all asynchronous method for comprehensive 17-item evaluation, integrating LLMGrader for subjective items.
    • Added new private methods _evaluate_code_quality (C1) and _evaluate_testing_coverage (C2) for objective assessments.
  • backend/app/api/routes/techniques.py
    • New file: Implements API endpoints for /techniques.
    • Provides routes to list all techniques, filter by category, hat, or evaluation mode.
    • Includes a route to retrieve detailed information for a specific technique.
    • Offers a /stats endpoint for aggregated technique statistics (total, by category, priority, mode).
  • backend/app/criteria/grading_prompts.py
    • New file: Defines structured prompt templates and rubrics for 11 subjective BMAD evaluation items (A1-A4, B2-B4, C3, C5, D3-D4) to be used by the LLMGrader.
  • backend/app/criteria/llm_grader.py
    • New file: Implements the LLMGrader class for LLM-based subjective evaluation.
    • Handles context assembly from repository data, prompt formatting, LLM invocation, and structured response validation with retries.
    • Includes token usage and cost tracking.
  • backend/app/criteria/scoring.py
    • New file: Provides utility functions for score aggregation.
    • adjust_score_by_confidence: Modifies scores based on confidence levels (high, medium, low).
    • calculate_exclusion_normalized_score: Computes normalized scores, accounting for excluded or missing evaluation items.
    • is_information_absent: Helper to determine if evidence or rationale indicates missing information.
  • backend/app/graph/full_techniques_graph.py
    • New file: Defines the create_full_techniques_graph function, which constructs a LangGraph StateGraph.
    • Orchestrates a parallel execution of 8 category-specific nodes (Aroma, Palate, Body, Finish, Balance, Vintage, Terroir, Cellar).
    • Includes optional RAG, web search, and code analysis enrichment steps.
    • Integrates deep_synthesis for conflict detection and finalize for overall quality gating.
  • backend/app/graph/graph_factory.py
    • Modified to register the new create_full_techniques_graph builder under the FULL_TECHNIQUES evaluation mode.
  • backend/app/graph/nodes/technique_categories/init.py
    • New file: Consolidates imports and exports for the new category-specific graph nodes.
  • backend/app/graph/nodes/technique_categories/aroma_node.py
    • New file: Implements the Aroma category node for the full_techniques_graph.
  • backend/app/graph/nodes/technique_categories/balance_node.py
    • New file: Implements the Balance category node for the full_techniques_graph.
  • backend/app/graph/nodes/technique_categories/base_category.py
    • New file: Provides an abstract base class for the 8 category nodes, handling technique selection, execution, and result aggregation within their respective categories.
  • backend/app/graph/nodes/technique_categories/body_node.py
    • New file: Implements the Body category node for the full_techniques_graph.
  • backend/app/graph/nodes/technique_categories/cellar_node.py
    • New file: Implements the Cellar category node for the full_techniques_graph.
  • backend/app/graph/nodes/technique_categories/deep_synthesis.py
    • New file: Implements a node to detect conflicts in item scores from different techniques.
  • backend/app/graph/nodes/technique_categories/finalize.py
    • New file: Implements a node to calculate overall scores, coverage, and determine a quality gate status (PASS/CONCERNS/FAIL/INCOMPLETE).
  • backend/app/graph/nodes/technique_categories/finish_node.py
    • New file: Implements the Finish category node for the full_techniques_graph.
  • backend/app/graph/nodes/technique_categories/palate_node.py
    • New file: Implements the Palate category node for the full_techniques_graph.
  • backend/app/graph/nodes/technique_categories/terroir_node.py
    • New file: Implements the Terroir category node for the full_techniques_graph.
  • backend/app/graph/nodes/technique_categories/vintage_node.py
    • New file: Implements the Vintage category node for the full_techniques_graph.
  • backend/app/main.py
    • Registered the new techniques.router to expose the technique API endpoints.
  • backend/app/services/evaluation_service.py
    • Added _get_stored_key to retrieve user-specific BYOK API keys from the database.
    • Modified _prepare_repo_context to implement a priority chain for API key resolution: explicit request key > stored BYOK key > system default.
    • Updated start_evaluation and get_evaluation_progress to recognize and handle the new full_techniques evaluation mode, adjusting total steps accordingly.
  • backend/tests/test_byok_integration.py
    • New file: Contains unit tests for the BYOK key retrieval and priority logic in evaluation_service.py.
  • backend/tests/test_category_nodes.py
    • New file: Unit tests for the new BaseCategoryNode and its concrete implementations.
  • backend/tests/test_deep_synthesis.py
    • New file: Unit tests for the deep_synthesis graph node, including conflict detection.
  • backend/tests/test_enhanced_code_grader.py
    • New file: Unit tests for the expanded CodeGraderAgent, covering 17-item evaluation, objective/subjective item handling, and backward compatibility.
  • backend/tests/test_finalize_node.py
    • New file: Unit tests for the finalize graph node, including score aggregation and quality gate determination.
  • backend/tests/test_full_techniques_graph.py
    • New file: Unit tests for the full_techniques_graph creation and structure.
  • backend/tests/test_graph_factory.py
    • Modified to include tests for the full_techniques mode registration.
  • backend/tests/test_llm_grader.py
    • New file: Unit tests for the LLMGrader class, covering initialization, context assembly, response validation, and usage tracking.
  • backend/tests/test_scoring.py
    • New file: Unit tests for the scoring aggregation functions, including confidence adjustment and exclusion normalization.
  • backend/tests/test_technique_api.py
    • New file: Integration tests for the new /api/techniques endpoints, covering listing, filtering, detail retrieval, and statistics.
  • frontend/src/app/settings/api-keys/page.tsx
    • New file: Implements a new protected frontend page for managing user API keys.
  • frontend/src/components/UserMenu.tsx
    • Added a new menu item "API Keys" to the user dropdown menu, linking to the new settings page.
  • frontend/src/components/settings/APIKeyForm.tsx
    • New file: Frontend component for a form to register new API keys, including validation and success/error states.
  • frontend/src/components/settings/APIKeyList.tsx
    • New file: Frontend component to display a list of registered API keys and provide functionality to delete them.
  • frontend/src/components/settings/APIKeyManager.tsx
    • New file: Orchestrates the API key form and list components, handling data fetching and state management for the API key settings page.
  • frontend/src/lib/api.ts
    • Added new TypeScript interfaces (KeyStatusResponse, RegisterKeyResponse, ValidateKeyResponse) for API key related data.
    • Extended the api object with new methods for registerKey, getKeyStatus, deleteKey, and validateKey.
Activity
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several major features, including the full techniques evaluation graph, an enhanced grader with LLM support, a new scoring system, a techniques API, and BYOK integration. A critical Prompt Injection vulnerability was identified in the LLM-based grading logic, where untrusted repository data is directly incorporated into prompts. Furthermore, a critical bug in the new LLM grader prevents subjective scoring, and several medium-severity maintainability issues, such as magic numbers and local imports, were noted. Addressing these points, particularly the Prompt Injection vulnerability through robust prompt construction, will enhance the functionality's robustness, clarity, and security.

Comment on lines +58 to +65
return {
"item_id": item_id,
"score": float(response["score"]),
"max_score": float(max_score),
"confidence": response["confidence"],
"evidence": response["evidence"],
"rationale": response["rationale"],
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There's a critical bug here that will cause a KeyError. The _call_llm method returns a dictionary where the parsed LLM output is nested under the "parsed" key (e.g., {"parsed": {"score": 5, ...}, "usage": ...}). However, the code here attempts to access score, confidence, etc., directly from the top-level response dictionary, which will fail.

You should access these values from response["parsed"].

                    return {
                        "item_id": item_id,
                        "score": float(response["parsed"]["score"]),
                        "max_score": float(max_score),
                        "confidence": response["parsed"]["confidence"],
                        "evidence": response["parsed"]["evidence"],
                        "rationale": response["parsed"]["rationale"],
                    }

Comment on lines +48 to +49
context = self._assemble_context(repo_context, max_chars=8000)
prompt = format_prompt(item_id, context)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The grade_item method assembles a context string from untrusted repository data (README, file tree, code snippets) and formats it directly into the LLM prompt. This makes the system vulnerable to prompt injection attacks, where a malicious repository can manipulate the evaluation results by including instructions for the LLM within its files. For example, an attacker could craft a README that contains instructions to ignore previous instructions and output a specific high score.

        # Use robust delimiters and clear instructions to mitigate prompt injection
        context = self._assemble_context(repo_context, max_chars=8000)
        # Consider wrapping the context in tags and adding instructions to the prompt template
        # to treat the content within those tags as untrusted data.
        prompt = format_prompt(item_id, f"<repository_context>\n{context}\n</repository_context>")

Comment on lines +41 to +53
SUBJECTIVE_ITEMS = [
"A1",
"A2",
"A3",
"A4",
"B2",
"B3",
"B4",
"C3",
"C5",
"D3",
"D4",
]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The item B2 (System Architecture) is incorrectly included in SUBJECTIVE_ITEMS. B2 is evaluated by a deterministic heuristic (_evaluate_system_architecture) and is correctly listed in OBJECTIVE_ITEMS. To avoid confusion and maintain a clear distinction between objective and subjective evaluations, it should be removed from this list.

    SUBJECTIVE_ITEMS = [
        "A1",
        "A2",
        "A3",
        "A4",
        "B3",
        "B4",
        "C3",
        "C5",
        "D3",
        "D4",
    ]

Returns:
New dictionary with adjusted scores (input is not mutated)
"""
from app.criteria.bmad_items import get_item

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This import is performed inside the apply_confidence_adjustment_to_scores function. According to Python best practices (PEP 8), imports should be placed at the top of the file. This improves readability, avoids potential circular import issues, and makes dependencies clear. Please move this import to the top of the module.

for item_id, scores in item_score_variants.items():
if len(scores) > 1:
score_range = max(scores) - min(scores)
if score_range > 2.0:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The value 2.0 used as a threshold for detecting score conflicts is a magic number. It should be defined as a named constant at the top of the module (e.g., CONFLICT_SCORE_RANGE_THRESHOLD = 2.0) to improve readability and make it easier to configure in the future.

Suggested change
if score_range > 2.0:
if score_range > 2.0: # TODO: Consider making this a constant

if cat_id in category_scores:
category_scores[cat_id]["score"] += min(score_value, item.max_score)

coverage = evaluated_count / 17 if evaluated_count > 0 else 0.0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The total number of BMAD items, 17, is used here as a magic number. This makes the code harder to maintain if the number of items changes. You can make this more robust by deriving the count directly from the bmad_items module, which is already imported.

Suggested change
coverage = evaluated_count / 17 if evaluated_count > 0 else 0.0
coverage = evaluated_count / len(list_items()) if evaluated_count > 0 else 0.0

Comment on lines +40 to +41
from app.database.repositories.api_key import APIKeyRepository
from app.services.encryption import EncryptionService

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These imports are performed inside the _get_stored_key function. According to Python best practices (PEP 8), imports should be placed at the top of the file. This improves readability, avoids potential circular import issues, and makes dependencies clear. Please move these imports to the top of the module.

- Add asyncio.to_thread() to avoid blocking event loop in llm_grader.py
- Add progress events to BaseCategoryNode matching sommelier pattern
- Add input validation for category/hat params in techniques API
- Replace hardcoded BMAD item count with dynamic lookup in scoring.py
- Ensure immutable state access in deep_synthesis.py
- Remove B2 duplicate from SUBJECTIVE_ITEMS (already in OBJECTIVE)
- Update tests to reflect new validation behavior

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/services/evaluation_service.py (2)

570-579: ⚠️ Potential issue | 🟠 Major

full_techniques 모드에서 jeanpierre_result 접근은 항상 빈 결과를 반환

full_techniques 모드는 jeanpierre_result 대신 finalize 노드를 통해 total_score, quality_gate, normalized_score 등을 생성합니다. 그러나 여기서는 jeanpierre_result만 참조하므로, full_techniques 평가가 성공해도 score: 0, rating_tier: ""이 반환됩니다.

🐛 제안하는 수정
+        if evaluation_mode == "full_techniques":
+            return {
+                "evaluation_id": evaluation_id,
+                "status": "completed",
+                "score": result.get("normalized_score", 0),
+                "quality_gate": result.get("quality_gate", ""),
+                "rating_tier": result.get("quality_gate", ""),
+            }
+
         jeanpierre = result.get("jeanpierre_result") or {}
         return {
             "evaluation_id": evaluation_id,

289-384: ⚠️ Potential issue | 🟠 Major

save_evaluation_resultsfull_techniques 모드를 처리하지 않음

save_evaluation_resultsis_grand_tasting 조건으로만 분기하고, full_techniques 모드는 else 블록으로 빠집니다. 이 경우 코드는 jeanpierre_result를 찾으려 하지만, full_techniques 그래프는 jeanpierre_result 대신 total_score, quality_gate, category_scores, strengths, improvements 등의 필드를 생성하므로 jeanpierre_result가 없어 결과적으로 overall_score=0, summary="", sommelier_outputs=[]의 빈 평가가 저장됩니다.

full_techniques 모드에 맞는 결과 변환 로직을 추가하여 이 필드들을 올바르게 처리해야 합니다.

🤖 Fix all issues with AI agents
In `@backend/app/agents/code_grader.py`:
- Around line 156-169: The evaluate_all loop calls LLMGrader.grade_item and is
failing because grade_item in llm_grader.py reads response["score"] instead of
the actual parsed value response["parsed"]["score"]; update LLMGrader.grade_item
to extract score, max_score, confidence, rationale, and evidence from
response["parsed"] (e.g., use response["parsed"]["score"] etc.) and ensure
evaluate_all's usage of LLMGrader.grade_item remains the same so subjective
items use the corrected parsed fields rather than causing KeyError and fallback.
- Around line 812-821: The coverage detection currently uses substring matching
which causes false positives; change the logic that sets has_coverage to compare
only the filename portion against the exact set of coverage filenames
(coverage_configs) using a basename extraction (e.g., path.split("/")[-1] or
os.path.basename) and normalize to lower-case before equality checks so entries
like "src/coverage.ini.bak" no longer match; update the code around
coverage_configs and has_coverage to implement this exact filename comparison
using file_tree and the filename extraction.

In `@backend/app/api/routes/techniques.py`:
- Line 22: Replace the hardcoded VALID_HATS set with a dynamic derivation from
the Hat enum to keep a single source of truth: instead of the current VALID_HATS
constant, compute it the same way VALID_CATEGORIES is derived (i.e., build the
set from Hat members' .value), locating the CONSTANT named VALID_HATS and the
Hat enum definition used in hat_mappings so the set reflects any changes to Hat
automatically.

In `@backend/app/criteria/grading_prompts.py`:
- Around line 363-374: The current format_prompt function uses
prompt_template.format(context=context) which raises KeyError/ValueError when
the inserted context contains literal braces; change format_prompt to avoid
str.format parsing by replacing the placeholder explicitly: fetch
prompt_template via get_prompt(item_id)["prompt"] and return
prompt_template.replace("{context}", context) (or replace only the first
occurrence if desired) so literal `{` and `}` in context are preserved and no
format parsing occurs.

In `@backend/app/criteria/llm_grader.py`:
- Around line 56-66: grade_item currently accesses response["score"],
response["confidence"], response["evidence"], and response["rationale"] directly
but _call_llm returns {"parsed": parsed, "usage": usage} and _validate_response
checks response.get("parsed", {}); update grade_item to read these values from
response["parsed"] (e.g., response.get("parsed", {}).get("score") etc.), keep
the existing _update_usage(response.get("usage", {})) call, and ensure returned
keys use the parsed values (score, confidence, evidence, rationale) while
preserving types (float for score/max_score) so that grade_item, _call_llm, and
_validate_response are consistent.

In `@backend/app/criteria/scoring.py`:
- Around line 155-180: The status default in the loop inside the function that
builds adjusted_scores uses ItemStatus.EVALUATED but
calculate_exclusion_normalized_score uses ItemStatus.DATA_MISSING, causing
inconsistent handling when status is absent; update the code in this loop to use
the same default as calculate_exclusion_normalized_score (use
ItemStatus.DATA_MISSING) or require status to be present (e.g., assert/raise if
missing) so both calculate_exclusion_normalized_score and the adjusted_scores
logic treat missing status the same; change the line that sets status =
item_data.get("status", ItemStatus.EVALUATED) to use ItemStatus.DATA_MISSING (or
replace with an explicit validation) and ensure any callers conform.

In `@backend/app/graph/nodes/technique_categories/base_category.py`:
- Around line 112-123: The event currently uses
aggregated.total_token_usage.get("total_tokens", 0) which is always 0 because
TechniqueRouter.aggregate_results only accumulates prompt_tokens and
completion_tokens; update the tokens_used value before emitting the sommelier
event by computing total_tokens =
aggregated.total_token_usage.get("prompt_tokens", 0) +
aggregated.total_token_usage.get("completion_tokens", 0) (or choose existing
keys if named differently) and pass that total to create_sommelier_event; locate
this change in base_category (use symbols: aggregated.total_token_usage,
create_sommelier_event, TechniqueRouter.aggregate_results) so the emitted
tokens_used reflects actual usage.

In `@backend/app/graph/nodes/technique_categories/finalize.py`:
- Around line 15-20: The category_scores dict hardcodes per-category "max"
values; replace those static max numbers by calling get_category_max(category)
for each category key so the max values derive from the BMAD_CATEGORIES source;
update the code that builds category_scores (the variable named category_scores
in finalize.py) to iterate the category keys (or BMAD_CATEGORIES) and set
{"score": 0.0, "max": get_category_max(category)} using the imported
get_category_max function so changes to BMAD_CATEGORIES stay consistent.
- Line 49: Replace the hardcoded denominator 17 with a dynamic lookup using
list_items(): call list_items() to get the BMAD items collection (e.g., items =
list_items(...)), compute total_count = len(items), and then set coverage =
evaluated_count / total_count if total_count > 0 else 0.0; update the expression
that currently uses 17 to use total_count and preserve the existing
evaluated_count and zero-division guard.

In `@backend/tests/test_enhanced_code_grader.py`:
- Around line 264-284: The test TestC2TestingCoverage currently uses
asyncio.run() inside test_c2_detects_test_coverage_configs which prevents pytest
async handling; change the test to be an async pytest coroutine by adding
`@pytest.mark.asyncio` and making test_c2_detects_test_coverage_configs async,
remove the inner asyncio.run/test_all wrapper, and directly await
agent.evaluate_all(context, llm=None) (referencing the TestC2TestingCoverage
class, the test_c2_detects_test_coverage_configs function, and the
CodeGraderAgent.evaluate_all call) so assertions run on the awaited result.
- Around line 238-261: The test uses asyncio.run() inside
TestC1CodeQuality.test_c1_detects_linter_configs which will conflict with
pytest-asyncio; convert the inner async function into an async test by marking
the test with `@pytest.mark.asyncio` (or make a new async test method) and replace
asyncio.run(test_all()) with a direct await of CodeGraderAgent.evaluate_all
(i.e., await agent.evaluate_all(context, llm=None)); update imports to include
pytest if needed and remove asyncio.run usage so
TestC1CodeQuality.test_c1_detects_linter_configs (or a new async sibling) awaits
agent.evaluate_all directly.

In `@frontend/src/components/settings/APIKeyForm.tsx`:
- Around line 78-84: The toggle button that shows/hides the API key (uses
showKey state and setShowKey handler, rendering Eye/EyeOff icons) is missing an
accessible label; add a dynamic aria-label to the button (e.g.,
aria-label={showKey ? "Hide API key" : "Show API key"}) so screen readers can
convey its purpose, and ensure the attribute is applied on the same button
element that currently calls setShowKey.
- Around line 31-33: The setTimeout in APIKeyForm that calls onSuccess after
1500ms can fire after the component unmounts causing memory leaks or stale
state; store the timer ID (e.g., with a useRef) when calling setTimeout inside
the handler that triggers onSuccess, and clearTimeout(timerRef.current) in a
cleanup (useEffect return or component cleanup) or before setting a new timer so
the callback will not run after unmount; update the code paths that call
setTimeout to use this ref and ensure the timer is cleared when the component
unmounts or before scheduling a new timeout.
🧹 Nitpick comments (29)
frontend/src/lib/api.ts (1)

263-267: deleteKeyprovider 파라미터가 URL 경로에 직접 삽입됨

provider 값이 백엔드 응답에서 오므로 현재 위험은 낮지만, encodeURIComponent로 감싸는 것이 방어적 코딩으로 바람직합니다.

🛡️ 수정 제안
  deleteKey: async (provider: string): Promise<{ deleted: boolean; provider: string }> => {
-   return fetchWithConfig(`/api/keys/${provider}`, {
+   return fetchWithConfig(`/api/keys/${encodeURIComponent(provider)}`, {
      method: 'DELETE',
    });
  },
frontend/src/components/settings/APIKeyList.tsx (1)

13-28: confirm/alert 대신 커스텀 모달 사용 고려

window.confirmwindow.alert는 기능적으로 동작하지만, 앱의 나머지 UI와 시각적으로 일관되지 않습니다. 향후 커스텀 확인 대화상자로 교체하면 UX가 개선됩니다.

frontend/src/components/settings/APIKeyManager.tsx (1)

11-20: fetchKeys 재호출 시 에러 상태가 사용자에게 표시되지 않음

fetchKeys가 실패하면 console.error만 호출되고 사용자에게는 아무런 피드백이 없습니다. 키 목록 로드 실패 시 빈 목록이 표시되어 사용자가 키가 없다고 오해할 수 있습니다.

🛡️ 수정 제안: 에러 상태 추가
 export const APIKeyManager: React.FC = () => {
   const [keys, setKeys] = useState<KeyStatusResponse[]>([]);
   const [isLoading, setIsLoading] = useState(true);
+  const [error, setError] = useState<string | null>(null);

   const fetchKeys = async () => {
     try {
+      setError(null);
       const data = await api.getKeyStatus();
       setKeys(data);
     } catch (error) {
       console.error('Failed to fetch API keys:', error);
+      setError('Failed to load API keys. Please try again.');
     } finally {
       setIsLoading(false);
     }
   };

그리고 렌더링 부분에서 error 상태를 표시하세요.

backend/app/criteria/scoring.py (1)

185-206: is_information_absent에서 단일 키워드 매칭이 오탐(false positive)을 유발할 수 있습니다.

"missing", "absent" 같은 짧은 키워드가 rationale 텍스트 내에서 다른 문맥으로 사용될 수 있습니다 (예: "the missing link in their approach" → 데이터 부재가 아닌 의미). 현재 단순 in 매칭으로는 이런 경우를 구분할 수 없습니다.

현재 동작이 의도적으로 보수적(false positive 허용)이라면 괜찮지만, 정밀도가 중요해질 경우 문장 경계 또는 더 구체적인 패턴 매칭을 고려해 볼 수 있습니다.

backend/app/graph/nodes/technique_categories/base_category.py (2)

82-110: logger.error 대신 logger.exception을 사용하면 스택 트레이스가 포함됩니다.

Line 86에서 logger.error는 예외 발생 시 트레이스 정보를 기록하지 않습니다. logger.exception으로 변경하면 디버깅에 필요한 전체 스택 트레이스가 자동으로 포함됩니다.

♻️ 수정 제안
         except Exception as e:
-            logger.error(f"{self.category.value}: execution failed - {e}")
+            logger.exception(f"{self.category.value}: execution failed")

39-46: config 파라미터가 사용되지 않지만 LangGraph 인터페이스 준수를 위해 필요합니다.

Ruff의 ARG002 경고가 발생하지만, RunnableConfig은 LangGraph 그래프 노드 시그니처의 일부이므로 유지가 맞습니다. 필요하다면 _ 프리픽스를 붙이거나 # noqa: ARG002 주석으로 경고를 억제할 수 있습니다.

backend/app/graph/nodes/technique_categories/__init__.py (1)

11-21: __all__ 정렬 고려 (Ruff RUF022)

정적 분석 도구(Ruff)에서 __all__이 정렬되지 않았다고 경고하고 있습니다. isort 스타일 정렬을 적용하면 lint 경고를 해소할 수 있습니다.

♻️ 제안된 수정
 __all__ = [
+    "AromaCategoryNode",
+    "BalanceCategoryNode",
     "BaseCategoryNode",
-    "AromaCategoryNode",
-    "PalateCategoryNode",
     "BodyCategoryNode",
+    "CellarCategoryNode",
     "FinishCategoryNode",
-    "BalanceCategoryNode",
-    "VintageCategoryNode",
+    "PalateCategoryNode",
     "TerroirCategoryNode",
-    "CellarCategoryNode",
+    "VintageCategoryNode",
 ]
backend/tests/test_category_nodes.py (1)

164-181: 불필요한 @pytest.mark.asyncio 데코레이터

test_all_nodes_have_correct_enum_values 테스트는 await를 사용하지 않는 순수 동기 로직입니다. @pytest.mark.asyncio 마크와 async def는 불필요하며, 일반 동기 메서드로 변경하는 것이 더 명확합니다.

♻️ 제안된 수정
-    `@pytest.mark.asyncio`
-    async def test_all_nodes_have_correct_enum_values(self):
+    def test_all_nodes_have_correct_enum_values(self):
backend/app/main.py (1)

68-69: 다른 라우터 등록과의 일관성을 위해 설명 주석 추가를 권장합니다.

다른 라우터들은 모두 설명 주석이 있습니다 (예: # Admin routes, # API key management routes). techniques 라우터에도 동일하게 주석을 추가하면 일관성이 향상됩니다.

♻️ 제안 수정
 # API key management routes
 app.include_router(api_keys.router, prefix=settings.API_V1_STR)

+# Technique evaluation routes
 app.include_router(techniques.router, prefix=settings.API_V1_STR)
backend/tests/test_deep_synthesis.py (1)

1-7: 사용하지 않는 MagicMock import.

Line 3에서 MagicMock을 import하지만 이 테스트 파일에서 사용되지 않습니다.

♻️ 제안 수정
 import pytest
 from datetime import datetime, timezone
-from unittest.mock import MagicMock
 
 from app.graph.nodes.technique_categories.deep_synthesis import deep_synthesis
 from app.graph.state import EvaluationState
 from app.models.graph import ItemScore, TraceEvent
backend/tests/test_finalize_node.py (1)

53-72: 테스트에서 반복되는 item_scores 생성 로직이 있습니다 — 헬퍼 함수 추출을 고려하세요.

test_finalize_quality_gate_pass, test_finalize_quality_gate_concerns, test_finalize_quality_gate_fail 테스트가 동일한 17개 항목 리스트와 거의 동일한 item_scores 생성 패턴을 공유합니다. 점수 값만 다릅니다. 헬퍼 함수나 fixture로 추출하면 유지보수성이 향상됩니다.

♻️ 예시 헬퍼 함수
ALL_BMAD_ITEMS = ["A1","A2","A3","A4","B1","B2","B3","B4",
                  "C1","C2","C3","C4","C5","D1","D2","D3","D4"]

def _make_item_scores(score: float, max_score: float = 6) -> dict:
    return {
        item_id: ItemScore(
            item_id=item_id,
            score=score,
            max_score=max_score,
            evaluated_by="test",
            technique_id="tech1",
            timestamp=datetime.now(timezone.utc).isoformat(),
            status="evaluated",
        )
        for item_id in ALL_BMAD_ITEMS
    }
backend/app/graph/nodes/technique_categories/deep_synthesis.py (1)

1-5: 사용하지 않는 List import.

Line 2에서 List를 import하지만 함수 시그니처와 반환 타입에서 사용되지 않습니다.

♻️ 제안 수정
 from datetime import datetime, timezone
-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, Optional
 from langchain_core.runnables import RunnableConfig
backend/app/graph/full_techniques_graph.py (1)

1-4: __all__이 import 문 사이에 위치해 있습니다.

__all__은 관례상 모든 import 이후에 정의하는 것이 일반적입니다. 현재 Line 3에서 __all__이 선언된 후 Line 4-21에서 추가 import가 이어집니다.

♻️ 제안 수정
 from langgraph.graph import StateGraph, END
 
-__all__ = ["create_full_techniques_graph"]
 from app.core.config import settings
 from app.graph.state import EvaluationState
 from app.graph.checkpoint import get_checkpointer
@@ ...
 from app.graph.nodes.code_analysis_enrich import code_analysis_enrich
 
+__all__ = ["create_full_techniques_graph"]
+
backend/tests/test_byok_integration.py (1)

217-226: 사용하지 않는 resolved_key 변수 — _로 대체하세요.

Ruff(RUF059)에서 지적한 대로, resolved_key가 언팩되었지만 사용되지 않습니다. Line 260에서도 동일한 문제가 있습니다.

♻️ 제안 수정
-            repo_context, resolved_key = await _prepare_repo_context(
+            repo_context, _ = await _prepare_repo_context(
                 repo_url="https://github.com/owner/repo",
                 api_key=None,
                 github_token=None,
                 user_id=None,
             )

Line 260에도 동일하게 적용:

-            repo_context, resolved_key = await _prepare_repo_context(
+            repo_context, _ = await _prepare_repo_context(
backend/tests/test_full_techniques_graph.py (2)

72-88: settings mock 시 다른 속성 접근 문제 가능성

app.graph.full_techniques_graph.settings 전체를 패치하면, RAG_ENABLED 외에 그래프 생성 중 접근되는 다른 settings 속성들(web_search_enrich, code_analysis_enrich 등에서 사용하는 설정)이 MagicMock 기본값을 반환하게 됩니다. 이는 의도치 않은 동작을 유발할 수 있습니다.

settings 객체 전체를 mock하는 대신, RAG_ENABLED 속성만 선택적으로 패치하는 것이 더 안전합니다:

♻️ 제안하는 수정
-            with patch("app.graph.full_techniques_graph.settings") as mock_settings:
-                mock_settings.RAG_ENABLED = True
+            with patch("app.graph.full_techniques_graph.settings.RAG_ENABLED", True):

184-212: _graph_builders.clear() 호출이 다른 테스트에 영향을 줄 수 있음

_graph_builders는 모듈 레벨 상태이므로, clear() 호출 후 복구하지 않으면 테스트 실행 순서에 따라 다른 테스트가 실패할 수 있습니다. _register_graphs()가 내부적으로 빈 dict를 감지하여 재등록한다면 괜찮지만, 확실한 격리를 위해 teardown에서 원래 상태를 복구하는 것이 좋습니다.

#!/bin/bash
# Check how _register_graphs handles re-registration
ast-grep --pattern 'def _register_graphs($$$) {
  $$$
}'
rg -n "_register_graphs" --type=py -C5 -g '!**/test*'
backend/tests/test_llm_grader.py (2)

318-326: 하드코딩된 단가로 비용 계산을 검증하고 있어 취약함

0.000030.00006이라는 단가가 LLMGrader 내부의 상수와 동일해야 합니다. 이 단가가 변경되면 테스트가 조용히 실패합니다. LLMGrader에서 단가 상수를 노출하여 테스트에서 참조하는 방식이 더 유지보수에 유리합니다.


91-101: 매직 넘버 103의 의미가 불분명함

max_chars=100에 말줄임표 "..." 3글자를 더한 값이지만, 코드만 보면 의도가 명확하지 않습니다. 주석이나 변수를 통해 의도를 명시하면 가독성이 향상됩니다.

♻️ 제안
-        assert len(context) <= 103  # Allow for "..." truncation
+        max_chars_limit = 100
+        ellipsis_len = len("...")
+        assert len(context) <= max_chars_limit + ellipsis_len
backend/tests/test_technique_api.py (1)

264-282: test_technique_id_with_name_stats_does_not_conflict는 중복 테스트

이 테스트(Line 275-281)는 test_stats_route_not_treated_as_technique_id(Line 267-273)와 동일한 엔드포인트를 호출하고 동일한 assertion을 수행합니다. 테스트 이름은 "stats라는 이름의 technique이 존재할 경우"를 시사하지만, 실제로는 /api/techniques/stats가 stats 엔드포인트로 라우팅되는 것만 확인하므로 의미 있는 차별화가 없습니다.

backend/tests/test_enhanced_code_grader.py (1)

229-235: 미사용 루프 변수 item_code

Ruff(B007)에서 지적된 대로, item_code가 루프 본문에서 사용되지 않으므로 _item_code로 변경하여 의도를 명시해야 합니다.

♻️ 제안
-        for item_code, score_data in result["item_scores"].items():
+        for _item_code, score_data in result["item_scores"].items():
backend/tests/test_scoring.py (2)

231-233: rationaleNone인 경우의 테스트가 누락되어 있습니다.

is_information_absent 함수는 rationale: str | None을 받습니다. 빈 문자열("") 테스트는 있지만 None rationale + 유효한 evidence 조합에 대한 테스트가 없습니다. rationale is not None 분기를 명시적으로 검증하면 좋겠습니다.

🧪 None rationale 테스트 추가 제안
     def test_returns_false_with_evidence_and_empty_rationale(self):
         """Should return False with evidence even if rationale is empty."""
         assert is_information_absent(["evidence"], "") is False
+
+    def test_returns_false_with_evidence_and_none_rationale(self):
+        """Should return False with evidence even if rationale is None."""
+        assert is_information_absent(["evidence"], None) is False

138-168: apply_confidence_adjustment_to_scores 테스트가 bmad_items 모듈의 max_score 값에 암묵적으로 의존합니다.

apply_confidence_adjustment_to_scores 함수는 내부적으로 get_item(item_id)를 호출하여 bmad_items에서 실제 max_score를 가져옵니다. 현재 테스트 데이터의 max_score 값(A1=7, A2=6, A3=6)이 실제 BMAD_ITEMS의 정의와 일치하고 있지만, 이 값들이 변경되면 특히 low confidence 계산(neutral_score = max_score * 0.5를 사용)에서 테스트가 깨질 수 있습니다.

테스트의 격리성을 높이려면 get_item을 mock하거나, 테스트 docstring에 이 의존성을 명시적으로 문서화하는 것을 권장합니다.

backend/app/criteria/llm_grader.py (4)

137-148: JSON 파싱 폴백 로직에서 불완전한 코드 블록 처리 시 예외 발생 가능.

content```json이 있지만 닫는 ```이 없는 경우, split("```")[0]이 전체 나머지 문자열을 반환하여 json.loads가 실패합니다. 이 경우 예외가 grade_itemexcept 블록에서 포착되어 재시도되므로 치명적이지는 않지만, 보다 방어적인 처리를 고려할 수 있습니다.


270-272: 토큰당 비용이 하드코딩되어 있습니다.

prompt_cost = prompt_tokens * 0.00003, completion_cost = completion_tokens * 0.00006은 특정 모델(Gemini)의 가격에 고정되어 있습니다. 모델이 변경되거나 가격이 업데이트되면 이 값들도 수정해야 합니다. 설정이나 상수로 분리하는 것을 권장합니다.

♻️ 상수 분리 제안
+# Per-token cost (USD) — adjust when model changes
+COST_PER_PROMPT_TOKEN = 0.00003
+COST_PER_COMPLETION_TOKEN = 0.00006
+
+
 class LLMGrader:
     ...
     def _update_usage(self, usage: dict) -> None:
         ...
-        prompt_cost = prompt_tokens * 0.00003
-        completion_cost = completion_tokens * 0.00006
+        prompt_cost = prompt_tokens * COST_PER_PROMPT_TOKEN
+        completion_cost = completion_tokens * COST_PER_COMPLETION_TOKEN
         self.total_cost_usd += prompt_cost + completion_cost

82-93: 예외 메시지에 str(e) 대신 !s 변환 플래그 사용을 권장합니다.

Ruff RUF010 힌트에 따라, f"LLM call failed: {str(e)}"f"LLM call failed: {e!s}"로 변경하면 더 관용적입니다. 또한 Exception을 광범위하게 잡는 것에 대해 — 외부 LLM 클라이언트의 예외 유형을 예측하기 어려우므로 이 컨텍스트에서는 합리적입니다.


104-135: ainvoke / invoke / callable 분기에서 중복 코드가 있습니다.

contentusage 추출 로직이 ainvokeinvoke 분기에서 동일하게 반복됩니다. 응답에서 content/usage를 추출하는 헬퍼를 분리하면 중복을 줄일 수 있습니다.

backend/app/agents/code_grader.py (3)

19-54: 가변 클래스 속성에 ClassVar 어노테이션 추가를 권장합니다.

Ruff RUF012에 따라, 가변 클래스 속성(EVALUABLE_ITEMS, OBJECTIVE_ITEMS, SUBJECTIVE_ITEMS, ORIGINAL_ITEMS)은 typing.ClassVar로 어노테이션하는 것이 Python 모범 사례입니다.

♻️ ClassVar 어노테이션 추가
+from typing import Any, ClassVar
+
 class CodeGraderAgent:
-    EVALUABLE_ITEMS = [
+    EVALUABLE_ITEMS: ClassVar[list[str]] = [
         "A1",
         ...
     ]
-    OBJECTIVE_ITEMS = ["C1", "C2", "C4", "B1", "B2", "D1", "D2"]
+    OBJECTIVE_ITEMS: ClassVar[list[str]] = ["C1", "C2", "C4", "B1", "B2", "D1", "D2"]
     ...

113-182: evaluate_all에서 주관적 항목의 LLM 호출이 순차적으로 실행됩니다.

10개의 SUBJECTIVE_ITEMS에 대해 llm_grader.grade_item이 순차적으로 호출됩니다. LLM 호출은 I/O 바운드이므로 asyncio.gather를 사용하여 병렬화하면 전체 평가 시간을 크게 단축할 수 있습니다. 다만 LLM API의 rate limit을 고려해야 합니다.

♻️ 병렬 호출 제안
-        llm_grader = LLMGrader()
-        for item_code in self.SUBJECTIVE_ITEMS:
-            if llm is not None or item_code not in item_scores:
-                result = await llm_grader.grade_item(item_code, repo_context, llm)
-                item_scores[item_code] = {
-                    "item_code": result["item_id"],
-                    "score": result["score"],
-                    "max_score": result["max_score"],
-                    "evidence": result["evidence"],
-                    "rationale": result["rationale"],
-                    "confidence": result["confidence"],
-                    "metrics": {},
-                }
+        llm_grader = LLMGrader()
+        items_to_grade = [
+            code for code in self.SUBJECTIVE_ITEMS
+            if llm is not None or code not in item_scores
+        ]
+        if items_to_grade:
+            results = await asyncio.gather(
+                *(llm_grader.grade_item(code, repo_context, llm) for code in items_to_grade)
+            )
+            for result in results:
+                item_scores[result["item_id"]] = {
+                    "item_code": result["item_id"],
+                    "score": result["score"],
+                    "max_score": result["max_score"],
+                    "evidence": result["evidence"],
+                    "rationale": result["rationale"],
+                    "confidence": result["confidence"],
+                    "metrics": {},
+                }

682-780: _evaluate_code_quality_evaluate_system_architecture 사이에 상당한 코드 중복이 있습니다.

린터 설정 감지(ruff, flake8, pylint, eslint 등), 타입 체커 설정 감지(mypy, pyright, tsconfig), filenames 세트 구성 로직이 두 메서드에서 거의 동일하게 반복됩니다. 공통 감지 로직을 헬퍼 메서드로 추출하면 유지보수성이 향상됩니다.

Also applies to: 531-680

Comment on lines +156 to +169

llm_grader = LLMGrader()
for item_code in self.SUBJECTIVE_ITEMS:
if llm is not None or item_code not in item_scores:
result = await llm_grader.grade_item(item_code, repo_context, llm)
item_scores[item_code] = {
"item_code": result["item_id"],
"score": result["score"],
"max_score": result["max_score"],
"evidence": result["evidence"],
"rationale": result["rationale"],
"confidence": result["confidence"],
"metrics": {},
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

evaluate_all에서 llm_grader.grade_item 호출 시 llm_grader.py의 버그 영향을 받습니다.

llm_grader.pygrade_item에서 response["score"] 대신 response["parsed"]["score"]를 참조해야 하는 치명적 버그가 있습니다. LLM이 제공되면 이 코드 경로에서 KeyError가 발생하여 결국 재시도 후 fallback 결과(score=0, confidence="low")가 반환됩니다. 즉, LLM을 사용한 주관적 평가가 항상 실패하게 됩니다.

llm_grader.py의 수정이 선행되어야 합니다.

🤖 Prompt for AI Agents
In `@backend/app/agents/code_grader.py` around lines 156 - 169, The evaluate_all
loop calls LLMGrader.grade_item and is failing because grade_item in
llm_grader.py reads response["score"] instead of the actual parsed value
response["parsed"]["score"]; update LLMGrader.grade_item to extract score,
max_score, confidence, rationale, and evidence from response["parsed"] (e.g.,
use response["parsed"]["score"] etc.) and ensure evaluate_all's usage of
LLMGrader.grade_item remains the same so subjective items use the corrected
parsed fields rather than causing KeyError and fallback.

Comment on lines +812 to +821
coverage_configs = [
".coveragerc",
"coverage.ini",
".nycrc",
"codecov.yml",
".codecov.yml",
]
has_coverage = any(
cfg in path.lower() for path in file_tree for cfg in coverage_configs
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

커버리지 설정 감지 로직에서 오탐(false positive) 가능성이 있습니다.

any(cfg in path.lower() for path in file_tree for cfg in coverage_configs) 패턴은 경로의 일부 문자열과 매칭합니다. 예를 들어 "src/coverage.ini.bak" 같은 경로도 매칭됩니다. _evaluate_test_existence 등 다른 메서드에서는 파일명만 추출(path.split("/")[-1])하여 비교하는데, 여기서는 전체 경로에서 부분 문자열 매칭을 사용하여 일관성이 떨어집니다.

🔧 파일명 기반 매칭으로 수정
+        filenames = {path.split("/")[-1].lower() for path in file_tree}
         coverage_configs = [
             ".coveragerc",
             "coverage.ini",
             ".nycrc",
             "codecov.yml",
             ".codecov.yml",
         ]
-        has_coverage = any(
-            cfg in path.lower() for path in file_tree for cfg in coverage_configs
-        )
+        has_coverage = any(cfg in filenames for cfg in coverage_configs)
🤖 Prompt for AI Agents
In `@backend/app/agents/code_grader.py` around lines 812 - 821, The coverage
detection currently uses substring matching which causes false positives; change
the logic that sets has_coverage to compare only the filename portion against
the exact set of coverage filenames (coverage_configs) using a basename
extraction (e.g., path.split("/")[-1] or os.path.basename) and normalize to
lower-case before equality checks so entries like "src/coverage.ini.bak" no
longer match; update the code around coverage_configs and has_coverage to
implement this exact filename comparison using file_tree and the filename
extraction.

router = APIRouter(prefix="/techniques", tags=["techniques"])

VALID_CATEGORIES = {cat.value for cat in TechniqueCategory}
VALID_HATS = {"white", "red", "black", "yellow", "green", "blue"}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Search for canonical hat definitions and VALID_HATS usage
rg -n --type=py 'valid_hats|VALID_HATS|Hat.*Enum|hat.*enum|applicable_hats' -g '!*test*' | head -50

Repository: Two-Weeks-Team/somm.dev

Length of output: 1090


🏁 Script executed:

# Search for hat-related definitions more broadly
rg -n --type=py '"white"|"red"|"black"|"yellow"|"green"|"blue"' -B2 -A2 | head -80

Repository: Two-Weeks-Team/somm.dev

Length of output: 5161


🏁 Script executed:

# Check for Hat enum or class definitions
fd --type f -name '*.py' -exec grep -l 'class Hat\|enum.*Hat\|Hat.*=.*{' {} \; | head -20

Repository: Two-Weeks-Team/somm.dev

Length of output: 297


🏁 Script executed:

# Look at the techniques.py file more closely for context
wc -l backend/app/api/routes/techniques.py

Repository: Two-Weeks-Team/somm.dev

Length of output: 107


VALID_HATS를 Hat 열거형에서 동적으로 파생시키기

backend/app/criteria/hat_mappings.py에 정의된 Hat 열거형이 canonical 소스입니다. 현재 코드는 line 21의 VALID_CATEGORIES처럼 열거형에서 동적으로 값을 파생시키지 않고, line 22에서 VALID_HATS를 하드코딩하고 있어 유지보수 일관성이 떨어집니다. 같은 파일에서 VALID_CATEGORIES = {cat.value for cat in TechniqueCategory}를 사용하는 것과 달리, 하드코딩된 집합은 Hat 정의가 변경될 때 동기화 오류 위험이 있습니다.

VALID_HATS = {hat.value for hat in Hat}로 변경하여 Hat 열거형의 유일한 소스에서 유도하는 것이 권장됩니다.

🤖 Prompt for AI Agents
In `@backend/app/api/routes/techniques.py` at line 22, Replace the hardcoded
VALID_HATS set with a dynamic derivation from the Hat enum to keep a single
source of truth: instead of the current VALID_HATS constant, compute it the same
way VALID_CATEGORIES is derived (i.e., build the set from Hat members' .value),
locating the CONSTANT named VALID_HATS and the Hat enum definition used in
hat_mappings so the set reflects any changes to Hat automatically.

Comment thread backend/app/criteria/grading_prompts.py Outdated
Comment on lines +363 to +374
def format_prompt(item_id: str, context: str) -> str:
"""Format a grading prompt with the given repository context.

Args:
item_id: The BMAD item ID
context: Repository context string to insert into the prompt

Returns:
Formatted prompt string ready for LLM consumption
"""
prompt_template = get_prompt(item_id)["prompt"]
return prompt_template.format(context=context)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

str.format()은 중괄호가 포함된 context에서 KeyError/ValueError 발생

context에 코드 스니펫이 포함될 수 있고, Python/JavaScript/JSON 코드에는 {, } 문자가 빈번합니다. str.format(context=context) 호출 시 중괄호가 format 변수로 해석되어 예외가 발생합니다.

예: context = "function() { return {}; }"KeyError: ' return '

🐛 제안하는 수정
 def format_prompt(item_id: str, context: str) -> str:
     prompt_template = get_prompt(item_id)["prompt"]
-    return prompt_template.format(context=context)
+    return prompt_template.replace("{context}", context)

또는 프롬프트 템플릿에서 {context} 외의 중괄호를 이중 이스케이프({{, }})하는 방법도 있으나, replace()가 더 안전하고 간단합니다.

🤖 Prompt for AI Agents
In `@backend/app/criteria/grading_prompts.py` around lines 363 - 374, The current
format_prompt function uses prompt_template.format(context=context) which raises
KeyError/ValueError when the inserted context contains literal braces; change
format_prompt to avoid str.format parsing by replacing the placeholder
explicitly: fetch prompt_template via get_prompt(item_id)["prompt"] and return
prompt_template.replace("{context}", context) (or replace only the first
occurrence if desired) so literal `{` and `}` in context are preserved and no
format parsing occurs.

Comment on lines +56 to +66
if self._validate_response(response, item_id, max_score):
self._update_usage(response.get("usage", {}))

return {
"item_id": item_id,
"score": float(response["score"]),
"max_score": float(max_score),
"confidence": response["confidence"],
"evidence": response["evidence"],
"rationale": response["rationale"],
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n backend/app/criteria/llm_grader.py

Repository: Two-Weeks-Team/somm.dev

Length of output: 12496


치명적 버그: response["score"] 등이 아닌 response["parsed"]["score"]를 참조해야 합니다.

_call_llm은 라인 149에서 {"parsed": parsed, "usage": usage} 구조를 반환합니다. 하지만 grade_item의 라인 61~65에서 response["score"], response["confidence"], response["evidence"], response["rationale"]를 직접 접근하고 있어 KeyError가 발생합니다. _validate_response는 정확하게 response.get("parsed", {})를 사용하고 있으므로, grade_item도 동일하게 중첩된 parsed 키를 통해 접근해야 합니다.

수정 제안
                 if self._validate_response(response, item_id, max_score):
                     self._update_usage(response.get("usage", {}))
+                    parsed = response["parsed"]
 
                     return {
                         "item_id": item_id,
-                        "score": float(response["score"]),
+                        "score": float(parsed["score"]),
                         "max_score": float(max_score),
-                        "confidence": response["confidence"],
-                        "evidence": response["evidence"],
-                        "rationale": response["rationale"],
+                        "confidence": parsed["confidence"],
+                        "evidence": parsed["evidence"],
+                        "rationale": parsed["rationale"],
                     }
🤖 Prompt for AI Agents
In `@backend/app/criteria/llm_grader.py` around lines 56 - 66, grade_item
currently accesses response["score"], response["confidence"],
response["evidence"], and response["rationale"] directly but _call_llm returns
{"parsed": parsed, "usage": usage} and _validate_response checks
response.get("parsed", {}); update grade_item to read these values from
response["parsed"] (e.g., response.get("parsed", {}).get("score") etc.), keep
the existing _update_usage(response.get("usage", {})) call, and ensure returned
keys use the parsed values (score, confidence, evidence, rationale) while
preserving types (float for score/max_score) so that grade_item, _call_llm, and
_validate_response are consistent.

if cat_id in category_scores:
category_scores[cat_id]["score"] += min(score_value, item.max_score)

coverage = evaluated_count / 17 if evaluated_count > 0 else 0.0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

하드코딩된 17을 동적 조회로 교체해야 함

커밋 메시지에서 "replace hardcoded BMAD count with dynamic lookup"이라고 명시했으나, 이 코드는 여전히 17이 하드코딩되어 있습니다. BMAD 항목 수가 변경되면 이 로직이 잘못된 coverage를 계산합니다. list_items()가 이미 임포트되어 있으므로 이를 활용해야 합니다.

🐛 제안하는 수정
-    coverage = evaluated_count / 17 if evaluated_count > 0 else 0.0
+    total_items = len(list_items())
+    coverage = evaluated_count / total_items if total_items > 0 else 0.0
🤖 Prompt for AI Agents
In `@backend/app/graph/nodes/technique_categories/finalize.py` at line 49, Replace
the hardcoded denominator 17 with a dynamic lookup using list_items(): call
list_items() to get the BMAD items collection (e.g., items = list_items(...)),
compute total_count = len(items), and then set coverage = evaluated_count /
total_count if total_count > 0 else 0.0; update the expression that currently
uses 17 to use total_count and preserve the existing evaluated_count and
zero-division guard.

Comment on lines +238 to +261
class TestC1CodeQuality:
def test_c1_detects_linter_configs(self):
from app.agents.code_grader import CodeGraderAgent

agent = CodeGraderAgent()
context = {
"repo_context": {
"readme": "",
"file_tree": [".eslintrc", "src/main.js"],
"main_files": [{"path": "main.js", "content": "const x = 1;"}],
}
}
result = agent.evaluate(context)

assert "C1" not in result["item_scores"]

import asyncio

async def test_all():
all_result = await agent.evaluate_all(context, llm=None)
assert "C1" in all_result["item_scores"]
assert all_result["item_scores"]["C1"]["max_score"] == 7

asyncio.run(test_all())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

동기 테스트 내에서 asyncio.run() 사용은 이벤트 루프 충돌을 일으킬 수 있음

pytest-asyncio가 이미 이벤트 루프를 관리하고 있는 환경에서 asyncio.run()을 호출하면 RuntimeError: asyncio.run() cannot be called when another event loop is running 에러가 발생할 수 있습니다. @pytest.mark.asyncio를 사용하는 async 테스트 메서드로 변환해야 합니다.

🐛 제안하는 수정 (TestC1CodeQuality)
 class TestC1CodeQuality:
-    def test_c1_detects_linter_configs(self):
+    `@pytest.mark.asyncio`
+    async def test_c1_detects_linter_configs(self):
         from app.agents.code_grader import CodeGraderAgent
 
         agent = CodeGraderAgent()
         context = {
             "repo_context": {
                 "readme": "",
                 "file_tree": [".eslintrc", "src/main.js"],
                 "main_files": [{"path": "main.js", "content": "const x = 1;"}],
             }
         }
         result = agent.evaluate(context)
 
         assert "C1" not in result["item_scores"]
 
-        import asyncio
-
-        async def test_all():
-            all_result = await agent.evaluate_all(context, llm=None)
-            assert "C1" in all_result["item_scores"]
-            assert all_result["item_scores"]["C1"]["max_score"] == 7
-
-        asyncio.run(test_all())
+        all_result = await agent.evaluate_all(context, llm=None)
+        assert "C1" in all_result["item_scores"]
+        assert all_result["item_scores"]["C1"]["max_score"] == 7
🤖 Prompt for AI Agents
In `@backend/tests/test_enhanced_code_grader.py` around lines 238 - 261, The test
uses asyncio.run() inside TestC1CodeQuality.test_c1_detects_linter_configs which
will conflict with pytest-asyncio; convert the inner async function into an
async test by marking the test with `@pytest.mark.asyncio` (or make a new async
test method) and replace asyncio.run(test_all()) with a direct await of
CodeGraderAgent.evaluate_all (i.e., await agent.evaluate_all(context,
llm=None)); update imports to include pytest if needed and remove asyncio.run
usage so TestC1CodeQuality.test_c1_detects_linter_configs (or a new async
sibling) awaits agent.evaluate_all directly.

Comment on lines +264 to +284
class TestC2TestingCoverage:
def test_c2_detects_test_coverage_configs(self):
from app.agents.code_grader import CodeGraderAgent

agent = CodeGraderAgent()
context = {
"repo_context": {
"readme": "",
"file_tree": ["tests/test_main.py", ".coveragerc"],
"main_files": [],
}
}

import asyncio

async def test_all():
all_result = await agent.evaluate_all(context, llm=None)
assert "C2" in all_result["item_scores"]
assert all_result["item_scores"]["C2"]["max_score"] == 6

asyncio.run(test_all())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

동일한 asyncio.run() 문제 — TestC2TestingCoverage에서도 동일 패턴

TestC1CodeQuality와 같은 이유로 @pytest.mark.asyncio async 테스트로 변환이 필요합니다.

🐛 제안하는 수정 (TestC2TestingCoverage)
 class TestC2TestingCoverage:
-    def test_c2_detects_test_coverage_configs(self):
+    `@pytest.mark.asyncio`
+    async def test_c2_detects_test_coverage_configs(self):
         from app.agents.code_grader import CodeGraderAgent
 
         agent = CodeGraderAgent()
         context = {
             "repo_context": {
                 "readme": "",
                 "file_tree": ["tests/test_main.py", ".coveragerc"],
                 "main_files": [],
             }
         }
 
-        import asyncio
-
-        async def test_all():
-            all_result = await agent.evaluate_all(context, llm=None)
-            assert "C2" in all_result["item_scores"]
-            assert all_result["item_scores"]["C2"]["max_score"] == 6
-
-        asyncio.run(test_all())
+        all_result = await agent.evaluate_all(context, llm=None)
+        assert "C2" in all_result["item_scores"]
+        assert all_result["item_scores"]["C2"]["max_score"] == 6
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
class TestC2TestingCoverage:
def test_c2_detects_test_coverage_configs(self):
from app.agents.code_grader import CodeGraderAgent
agent = CodeGraderAgent()
context = {
"repo_context": {
"readme": "",
"file_tree": ["tests/test_main.py", ".coveragerc"],
"main_files": [],
}
}
import asyncio
async def test_all():
all_result = await agent.evaluate_all(context, llm=None)
assert "C2" in all_result["item_scores"]
assert all_result["item_scores"]["C2"]["max_score"] == 6
asyncio.run(test_all())
class TestC2TestingCoverage:
`@pytest.mark.asyncio`
async def test_c2_detects_test_coverage_configs(self):
from app.agents.code_grader import CodeGraderAgent
agent = CodeGraderAgent()
context = {
"repo_context": {
"readme": "",
"file_tree": ["tests/test_main.py", ".coveragerc"],
"main_files": [],
}
}
all_result = await agent.evaluate_all(context, llm=None)
assert "C2" in all_result["item_scores"]
assert all_result["item_scores"]["C2"]["max_score"] == 6
🤖 Prompt for AI Agents
In `@backend/tests/test_enhanced_code_grader.py` around lines 264 - 284, The test
TestC2TestingCoverage currently uses asyncio.run() inside
test_c2_detects_test_coverage_configs which prevents pytest async handling;
change the test to be an async pytest coroutine by adding `@pytest.mark.asyncio`
and making test_c2_detects_test_coverage_configs async, remove the inner
asyncio.run/test_all wrapper, and directly await agent.evaluate_all(context,
llm=None) (referencing the TestC2TestingCoverage class, the
test_c2_detects_test_coverage_configs function, and the
CodeGraderAgent.evaluate_all call) so assertions run on the awaited result.

Comment on lines +31 to +33
setTimeout(() => {
onSuccess();
}, 1500);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

setTimeout 정리 누락으로 언마운트 시 메모리 누수 및 상태 업데이트 가능

컴포넌트가 1.5초 이내에 언마운트되면 setTimeout 콜백이 이미 언마운트된 컴포넌트에서 onSuccess를 호출합니다. React 19에서는 경고가 발생하지 않지만, 부모 컴포넌트에서 예기치 않은 fetchKeys 호출이 발생할 수 있습니다.

🛡️ 수정 제안

useRef로 타이머 ID를 저장하고 cleanup하거나, useEffect cleanup 패턴을 사용하세요:

+import React, { useState, useRef, useEffect } from 'react';
-import React, { useState } from 'react';
...

 export const APIKeyForm: React.FC<APIKeyFormProps> = ({ onSuccess }) => {
+  const timerRef = useRef<ReturnType<typeof setTimeout>>(null);
   const [apiKey, setApiKey] = useState('');
   ...
+
+  useEffect(() => {
+    return () => {
+      if (timerRef.current) clearTimeout(timerRef.current);
+    };
+  }, []);
+
   ...
       setSuccess(true);
       setApiKey('');
-      setTimeout(() => {
+      timerRef.current = setTimeout(() => {
         onSuccess();
       }, 1500);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
setTimeout(() => {
onSuccess();
}, 1500);
import React, { useState, useRef, useEffect } from 'react';
export const APIKeyForm: React.FC<APIKeyFormProps> = ({ onSuccess }) => {
const timerRef = useRef<ReturnType<typeof setTimeout>>(null);
const [apiKey, setApiKey] = useState('');
const [isLoading, setIsLoading] = useState(false);
const [success, setSuccess] = useState(false);
const [showPassword, setShowPassword] = useState(false);
useEffect(() => {
return () => {
if (timerRef.current) clearTimeout(timerRef.current);
};
}, []);
// ... rest of the component code ...
const handleSubmit = async (e: React.FormEvent) => {
e.preventDefault();
setError(null);
setIsLoading(true);
try {
// API call logic
setSuccess(true);
setApiKey('');
timerRef.current = setTimeout(() => {
onSuccess();
}, 1500);
} finally {
setIsLoading(false);
}
};
// ... rest of component ...
}
🤖 Prompt for AI Agents
In `@frontend/src/components/settings/APIKeyForm.tsx` around lines 31 - 33, The
setTimeout in APIKeyForm that calls onSuccess after 1500ms can fire after the
component unmounts causing memory leaks or stale state; store the timer ID
(e.g., with a useRef) when calling setTimeout inside the handler that triggers
onSuccess, and clearTimeout(timerRef.current) in a cleanup (useEffect return or
component cleanup) or before setting a new timer so the callback will not run
after unmount; update the code paths that call setTimeout to use this ref and
ensure the timer is cleared when the component unmounts or before scheduling a
new timeout.

Comment on lines +78 to +84
<button
type="button"
onClick={() => setShowKey(!showKey)}
className="absolute right-3 top-1/2 -translate-y-1/2 p-1 text-gray-400 hover:text-[#722F37] transition-colors"
>
{showKey ? <EyeOff className="w-5 h-5" /> : <Eye className="w-5 h-5" />}
</button>

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

접근성: show/hide 토글 버튼에 aria-label 누락

스크린 리더 사용자가 이 버튼의 용도를 알 수 없습니다.

♿ 수정 제안
 <button
   type="button"
   onClick={() => setShowKey(!showKey)}
+  aria-label={showKey ? 'Hide API key' : 'Show API key'}
   className="absolute right-3 top-1/2 ..."
 >
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
<button
type="button"
onClick={() => setShowKey(!showKey)}
className="absolute right-3 top-1/2 -translate-y-1/2 p-1 text-gray-400 hover:text-[#722F37] transition-colors"
>
{showKey ? <EyeOff className="w-5 h-5" /> : <Eye className="w-5 h-5" />}
</button>
<button
type="button"
onClick={() => setShowKey(!showKey)}
aria-label={showKey ? 'Hide API key' : 'Show API key'}
className="absolute right-3 top-1/2 -translate-y-1/2 p-1 text-gray-400 hover:text-[`#722F37`] transition-colors"
>
{showKey ? <EyeOff className="w-5 h-5" /> : <Eye className="w-5 h-5" />}
</button>
🤖 Prompt for AI Agents
In `@frontend/src/components/settings/APIKeyForm.tsx` around lines 78 - 84, The
toggle button that shows/hides the API key (uses showKey state and setShowKey
handler, rendering Eye/EyeOff icons) is missing an accessible label; add a
dynamic aria-label to the button (e.g., aria-label={showKey ? "Hide API key" :
"Show API key"}) so screen readers can convey its purpose, and ensure the
attribute is applied on the same button element that currently calls setShowKey.

Critical fixes:
- Fix KeyError in llm_grader.py (access response['parsed'] instead of response directly)
- Fix str.format() crash with curly braces in code context (use str.replace)
- Handle full_techniques mode result in evaluation_service.py

Major fixes:
- Replace hardcoded 17 with len(list_items()) in finalize.py
- Fix total_tokens calculation in base_category.py
- Unify status default value to DATA_MISSING in scoring.py

Minor fixes:
- Add CONFLICT_SCORE_RANGE_THRESHOLD constant in deep_synthesis.py
- Fix coverage config detection to use filename matching in code_grader.py
- Add setTimeout cleanup and aria-label in APIKeyForm.tsx

Docs:
- Update README.md with 6 evaluation modes and Techniques API endpoints
@ComBba ComBba merged commit ecf9ec1 into main Feb 9, 2026
4 checks passed
@ComBba ComBba deleted the feat/group-c-wave4 branch February 9, 2026 08:56
ComBba added a commit that referenced this pull request Feb 10, 2026
feat(wave4): Full Techniques graph, enhanced grader, scoring, technique API, BYOK integration + UI (#234-#243)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment