feat(wave4): Full Techniques graph, enhanced grader, scoring, technique API, BYOK integration + UI (#234-#243)#259
Conversation
…ue API, BYOK integration + UI (#234-#243) Wave 4: Groups C + D + E — 10 issues implemented Group C — LangGraph Full Techniques Pipeline: - #234: full_techniques_graph.py — 8 parallel category nodes + deep_synthesis + finalize - #235: 8 BaseCategoryNode subclasses (aroma→cellar) with TechniqueRouter integration - #236: deep_synthesis (conflict detection) + finalize (quality gate, category scores) Group D — BYOK: - #239: Frontend settings page /settings/api-keys with APIKeyManager, Form, List - #240: Stored BYOK key integration into evaluation pipeline (priority: request > stored > system) Group E — API + Grading: - #237: scoring.py — confidence adjustment, exclusion normalization, coverage calculation - #241: GET /api/techniques, /techniques/stats, /techniques/{id} endpoints - #242: Enhanced CodeGraderAgent (5 → 17 BMAD items, 7 objective + 11 subjective) - #243: LLMGrader class for subjective items with retry logic, context assembly Also: - Registered full_techniques mode in graph_factory.py (3 modes total) - Updated evaluation_service.py for full_techniques progress tracking (10 steps) - Added API Keys link to UserMenu Tests: 1003 passed, 0 failed (159 new tests added)
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Walkthrough8개의 병렬 범주 노드와 심층 합성 기능을 갖춘 전체 기법 그래프를 구현하고, BMAD 항목을 5개에서 17개로 확장하며, LLM 기반 주관적 채점, 기법 API 엔드포인트, 프론트엔드 API 키 관리 UI를 추가합니다. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant EvalService as Evaluation Service
participant Graph as Full Techniques Graph
participant CatNodes as Category Nodes<br/>(8 parallel)
participant Router as Technique Router
participant DeepSynth as Deep Synthesis
participant Finalize as Finalize Node
User->>EvalService: start_evaluation(repo, user_id, full_techniques)
EvalService->>EvalService: _prepare_repo_context(user_id)
EvalService->>EvalService: retrieve stored BYOK key if available
EvalService->>Graph: invoke(state)
Graph->>CatNodes: Send() → 8 parallel category nodes
par Category Execution
CatNodes->>Router: select(category=AROMA, mode=full_techniques)
Router-->>CatNodes: techniques list
CatNodes->>CatNodes: execute techniques sequentially
CatNodes-->>Graph: partial item_scores + methodology_trace
and
CatNodes->>Router: select(category=PALATE, mode=full_techniques)
Router-->>CatNodes: techniques list
CatNodes->>CatNodes: execute techniques sequentially
CatNodes-->>Graph: partial item_scores + methodology_trace
and
Note over CatNodes: ... (6 more categories in parallel)
end
Graph->>DeepSynth: merge all category results
DeepSynth->>DeepSynth: cross-reference conflicting scores
DeepSynth->>DeepSynth: flag conflicts (score_range > 2.0)
DeepSynth-->>Graph: consolidated item_scores + conflict details
Graph->>Finalize: apply scoring aggregation
Finalize->>Finalize: adjust_score_by_confidence()
Finalize->>Finalize: calculate per-dimension scores (A/B/C/D)
Finalize->>Finalize: determine quality_gate (PASS/CONCERNS/FAIL)
Finalize-->>Graph: final evaluation_summary + metadata
Graph-->>EvalService: completed state
EvalService-->>User: evaluation result with 17 BMAD items scored
sequenceDiagram
participant CodeGrader as Code Grader Agent
participant ObjEval as Objective Evaluators<br/>(C1, C2, etc.)
participant LLMGrader as LLM Grader
participant Gemini as Gemini LLM
CodeGrader->>ObjEval: evaluate_code_quality(repo_context)
ObjEval->>ObjEval: parse code metrics, linting config
ObjEval-->>CodeGrader: C1, C2 ItemScores
CodeGrader->>CodeGrader: iterate subjective items (A1-A4, B3-B4, C3, C5, D3-D4)
loop For each subjective item
CodeGrader->>LLMGrader: grade_item(item_id, repo_context, llm)
LLMGrader->>LLMGrader: _assemble_context(repo_context, max_chars=8000)
LLMGrader->>LLMGrader: format_prompt(item_id, context)
rect rgba(200, 150, 100, 0.5)
LLMGrader->>Gemini: ainvoke(prompt, json_mode=True)
Gemini-->>LLMGrader: {score, confidence, evidence, rationale}
end
LLMGrader->>LLMGrader: _validate_response(response, item_id)
alt Validation fails (retry up to 2x)
LLMGrader->>Gemini: retry with clearer prompt
end
LLMGrader->>LLMGrader: _update_usage(tokens, cost)
LLMGrader-->>CodeGrader: ItemScore (subjective)
end
CodeGrader->>CodeGrader: aggregate all 17 item_scores
CodeGrader-->>CodeGrader: return item_scores + usage metrics
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @ComBba, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Fairthon 75-Technique Evaluation System by introducing a full-fledged evaluation pipeline capable of running all 75 techniques across various categories. It also upgrades the grading system to cover all 17 BMAD items, leveraging LLMs for subjective assessments and implementing advanced scoring logic. Furthermore, it provides a dedicated API for technique management and integrates a "Bring Your Own Key" (BYOK) feature, allowing users greater control over their evaluation resources. These changes collectively expand the system's capabilities, improve evaluation depth, and enhance user flexibility. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces several major features, including the full techniques evaluation graph, an enhanced grader with LLM support, a new scoring system, a techniques API, and BYOK integration. A critical Prompt Injection vulnerability was identified in the LLM-based grading logic, where untrusted repository data is directly incorporated into prompts. Furthermore, a critical bug in the new LLM grader prevents subjective scoring, and several medium-severity maintainability issues, such as magic numbers and local imports, were noted. Addressing these points, particularly the Prompt Injection vulnerability through robust prompt construction, will enhance the functionality's robustness, clarity, and security.
| return { | ||
| "item_id": item_id, | ||
| "score": float(response["score"]), | ||
| "max_score": float(max_score), | ||
| "confidence": response["confidence"], | ||
| "evidence": response["evidence"], | ||
| "rationale": response["rationale"], | ||
| } |
There was a problem hiding this comment.
There's a critical bug here that will cause a KeyError. The _call_llm method returns a dictionary where the parsed LLM output is nested under the "parsed" key (e.g., {"parsed": {"score": 5, ...}, "usage": ...}). However, the code here attempts to access score, confidence, etc., directly from the top-level response dictionary, which will fail.
You should access these values from response["parsed"].
return {
"item_id": item_id,
"score": float(response["parsed"]["score"]),
"max_score": float(max_score),
"confidence": response["parsed"]["confidence"],
"evidence": response["parsed"]["evidence"],
"rationale": response["parsed"]["rationale"],
}| context = self._assemble_context(repo_context, max_chars=8000) | ||
| prompt = format_prompt(item_id, context) |
There was a problem hiding this comment.
The grade_item method assembles a context string from untrusted repository data (README, file tree, code snippets) and formats it directly into the LLM prompt. This makes the system vulnerable to prompt injection attacks, where a malicious repository can manipulate the evaluation results by including instructions for the LLM within its files. For example, an attacker could craft a README that contains instructions to ignore previous instructions and output a specific high score.
# Use robust delimiters and clear instructions to mitigate prompt injection
context = self._assemble_context(repo_context, max_chars=8000)
# Consider wrapping the context in tags and adding instructions to the prompt template
# to treat the content within those tags as untrusted data.
prompt = format_prompt(item_id, f"<repository_context>\n{context}\n</repository_context>")| SUBJECTIVE_ITEMS = [ | ||
| "A1", | ||
| "A2", | ||
| "A3", | ||
| "A4", | ||
| "B2", | ||
| "B3", | ||
| "B4", | ||
| "C3", | ||
| "C5", | ||
| "D3", | ||
| "D4", | ||
| ] |
There was a problem hiding this comment.
The item B2 (System Architecture) is incorrectly included in SUBJECTIVE_ITEMS. B2 is evaluated by a deterministic heuristic (_evaluate_system_architecture) and is correctly listed in OBJECTIVE_ITEMS. To avoid confusion and maintain a clear distinction between objective and subjective evaluations, it should be removed from this list.
SUBJECTIVE_ITEMS = [
"A1",
"A2",
"A3",
"A4",
"B3",
"B4",
"C3",
"C5",
"D3",
"D4",
]| Returns: | ||
| New dictionary with adjusted scores (input is not mutated) | ||
| """ | ||
| from app.criteria.bmad_items import get_item |
There was a problem hiding this comment.
This import is performed inside the apply_confidence_adjustment_to_scores function. According to Python best practices (PEP 8), imports should be placed at the top of the file. This improves readability, avoids potential circular import issues, and makes dependencies clear. Please move this import to the top of the module.
| for item_id, scores in item_score_variants.items(): | ||
| if len(scores) > 1: | ||
| score_range = max(scores) - min(scores) | ||
| if score_range > 2.0: |
There was a problem hiding this comment.
The value 2.0 used as a threshold for detecting score conflicts is a magic number. It should be defined as a named constant at the top of the module (e.g., CONFLICT_SCORE_RANGE_THRESHOLD = 2.0) to improve readability and make it easier to configure in the future.
| if score_range > 2.0: | |
| if score_range > 2.0: # TODO: Consider making this a constant |
| if cat_id in category_scores: | ||
| category_scores[cat_id]["score"] += min(score_value, item.max_score) | ||
|
|
||
| coverage = evaluated_count / 17 if evaluated_count > 0 else 0.0 |
There was a problem hiding this comment.
The total number of BMAD items, 17, is used here as a magic number. This makes the code harder to maintain if the number of items changes. You can make this more robust by deriving the count directly from the bmad_items module, which is already imported.
| coverage = evaluated_count / 17 if evaluated_count > 0 else 0.0 | |
| coverage = evaluated_count / len(list_items()) if evaluated_count > 0 else 0.0 |
| from app.database.repositories.api_key import APIKeyRepository | ||
| from app.services.encryption import EncryptionService |
There was a problem hiding this comment.
These imports are performed inside the _get_stored_key function. According to Python best practices (PEP 8), imports should be placed at the top of the file. This improves readability, avoids potential circular import issues, and makes dependencies clear. Please move these imports to the top of the module.
- Add asyncio.to_thread() to avoid blocking event loop in llm_grader.py - Add progress events to BaseCategoryNode matching sommelier pattern - Add input validation for category/hat params in techniques API - Replace hardcoded BMAD item count with dynamic lookup in scoring.py - Ensure immutable state access in deep_synthesis.py - Remove B2 duplicate from SUBJECTIVE_ITEMS (already in OBJECTIVE) - Update tests to reflect new validation behavior
There was a problem hiding this comment.
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/app/services/evaluation_service.py (2)
570-579:⚠️ Potential issue | 🟠 Major
full_techniques모드에서jeanpierre_result접근은 항상 빈 결과를 반환
full_techniques모드는jeanpierre_result대신finalize노드를 통해total_score,quality_gate,normalized_score등을 생성합니다. 그러나 여기서는jeanpierre_result만 참조하므로,full_techniques평가가 성공해도score: 0,rating_tier: ""이 반환됩니다.🐛 제안하는 수정
+ if evaluation_mode == "full_techniques": + return { + "evaluation_id": evaluation_id, + "status": "completed", + "score": result.get("normalized_score", 0), + "quality_gate": result.get("quality_gate", ""), + "rating_tier": result.get("quality_gate", ""), + } + jeanpierre = result.get("jeanpierre_result") or {} return { "evaluation_id": evaluation_id,
289-384:⚠️ Potential issue | 🟠 Major
save_evaluation_results가full_techniques모드를 처리하지 않음
save_evaluation_results는is_grand_tasting조건으로만 분기하고,full_techniques모드는else블록으로 빠집니다. 이 경우 코드는jeanpierre_result를 찾으려 하지만,full_techniques그래프는jeanpierre_result대신total_score,quality_gate,category_scores,strengths,improvements등의 필드를 생성하므로jeanpierre_result가 없어 결과적으로overall_score=0,summary="",sommelier_outputs=[]의 빈 평가가 저장됩니다.
full_techniques모드에 맞는 결과 변환 로직을 추가하여 이 필드들을 올바르게 처리해야 합니다.
🤖 Fix all issues with AI agents
In `@backend/app/agents/code_grader.py`:
- Around line 156-169: The evaluate_all loop calls LLMGrader.grade_item and is
failing because grade_item in llm_grader.py reads response["score"] instead of
the actual parsed value response["parsed"]["score"]; update LLMGrader.grade_item
to extract score, max_score, confidence, rationale, and evidence from
response["parsed"] (e.g., use response["parsed"]["score"] etc.) and ensure
evaluate_all's usage of LLMGrader.grade_item remains the same so subjective
items use the corrected parsed fields rather than causing KeyError and fallback.
- Around line 812-821: The coverage detection currently uses substring matching
which causes false positives; change the logic that sets has_coverage to compare
only the filename portion against the exact set of coverage filenames
(coverage_configs) using a basename extraction (e.g., path.split("/")[-1] or
os.path.basename) and normalize to lower-case before equality checks so entries
like "src/coverage.ini.bak" no longer match; update the code around
coverage_configs and has_coverage to implement this exact filename comparison
using file_tree and the filename extraction.
In `@backend/app/api/routes/techniques.py`:
- Line 22: Replace the hardcoded VALID_HATS set with a dynamic derivation from
the Hat enum to keep a single source of truth: instead of the current VALID_HATS
constant, compute it the same way VALID_CATEGORIES is derived (i.e., build the
set from Hat members' .value), locating the CONSTANT named VALID_HATS and the
Hat enum definition used in hat_mappings so the set reflects any changes to Hat
automatically.
In `@backend/app/criteria/grading_prompts.py`:
- Around line 363-374: The current format_prompt function uses
prompt_template.format(context=context) which raises KeyError/ValueError when
the inserted context contains literal braces; change format_prompt to avoid
str.format parsing by replacing the placeholder explicitly: fetch
prompt_template via get_prompt(item_id)["prompt"] and return
prompt_template.replace("{context}", context) (or replace only the first
occurrence if desired) so literal `{` and `}` in context are preserved and no
format parsing occurs.
In `@backend/app/criteria/llm_grader.py`:
- Around line 56-66: grade_item currently accesses response["score"],
response["confidence"], response["evidence"], and response["rationale"] directly
but _call_llm returns {"parsed": parsed, "usage": usage} and _validate_response
checks response.get("parsed", {}); update grade_item to read these values from
response["parsed"] (e.g., response.get("parsed", {}).get("score") etc.), keep
the existing _update_usage(response.get("usage", {})) call, and ensure returned
keys use the parsed values (score, confidence, evidence, rationale) while
preserving types (float for score/max_score) so that grade_item, _call_llm, and
_validate_response are consistent.
In `@backend/app/criteria/scoring.py`:
- Around line 155-180: The status default in the loop inside the function that
builds adjusted_scores uses ItemStatus.EVALUATED but
calculate_exclusion_normalized_score uses ItemStatus.DATA_MISSING, causing
inconsistent handling when status is absent; update the code in this loop to use
the same default as calculate_exclusion_normalized_score (use
ItemStatus.DATA_MISSING) or require status to be present (e.g., assert/raise if
missing) so both calculate_exclusion_normalized_score and the adjusted_scores
logic treat missing status the same; change the line that sets status =
item_data.get("status", ItemStatus.EVALUATED) to use ItemStatus.DATA_MISSING (or
replace with an explicit validation) and ensure any callers conform.
In `@backend/app/graph/nodes/technique_categories/base_category.py`:
- Around line 112-123: The event currently uses
aggregated.total_token_usage.get("total_tokens", 0) which is always 0 because
TechniqueRouter.aggregate_results only accumulates prompt_tokens and
completion_tokens; update the tokens_used value before emitting the sommelier
event by computing total_tokens =
aggregated.total_token_usage.get("prompt_tokens", 0) +
aggregated.total_token_usage.get("completion_tokens", 0) (or choose existing
keys if named differently) and pass that total to create_sommelier_event; locate
this change in base_category (use symbols: aggregated.total_token_usage,
create_sommelier_event, TechniqueRouter.aggregate_results) so the emitted
tokens_used reflects actual usage.
In `@backend/app/graph/nodes/technique_categories/finalize.py`:
- Around line 15-20: The category_scores dict hardcodes per-category "max"
values; replace those static max numbers by calling get_category_max(category)
for each category key so the max values derive from the BMAD_CATEGORIES source;
update the code that builds category_scores (the variable named category_scores
in finalize.py) to iterate the category keys (or BMAD_CATEGORIES) and set
{"score": 0.0, "max": get_category_max(category)} using the imported
get_category_max function so changes to BMAD_CATEGORIES stay consistent.
- Line 49: Replace the hardcoded denominator 17 with a dynamic lookup using
list_items(): call list_items() to get the BMAD items collection (e.g., items =
list_items(...)), compute total_count = len(items), and then set coverage =
evaluated_count / total_count if total_count > 0 else 0.0; update the expression
that currently uses 17 to use total_count and preserve the existing
evaluated_count and zero-division guard.
In `@backend/tests/test_enhanced_code_grader.py`:
- Around line 264-284: The test TestC2TestingCoverage currently uses
asyncio.run() inside test_c2_detects_test_coverage_configs which prevents pytest
async handling; change the test to be an async pytest coroutine by adding
`@pytest.mark.asyncio` and making test_c2_detects_test_coverage_configs async,
remove the inner asyncio.run/test_all wrapper, and directly await
agent.evaluate_all(context, llm=None) (referencing the TestC2TestingCoverage
class, the test_c2_detects_test_coverage_configs function, and the
CodeGraderAgent.evaluate_all call) so assertions run on the awaited result.
- Around line 238-261: The test uses asyncio.run() inside
TestC1CodeQuality.test_c1_detects_linter_configs which will conflict with
pytest-asyncio; convert the inner async function into an async test by marking
the test with `@pytest.mark.asyncio` (or make a new async test method) and replace
asyncio.run(test_all()) with a direct await of CodeGraderAgent.evaluate_all
(i.e., await agent.evaluate_all(context, llm=None)); update imports to include
pytest if needed and remove asyncio.run usage so
TestC1CodeQuality.test_c1_detects_linter_configs (or a new async sibling) awaits
agent.evaluate_all directly.
In `@frontend/src/components/settings/APIKeyForm.tsx`:
- Around line 78-84: The toggle button that shows/hides the API key (uses
showKey state and setShowKey handler, rendering Eye/EyeOff icons) is missing an
accessible label; add a dynamic aria-label to the button (e.g.,
aria-label={showKey ? "Hide API key" : "Show API key"}) so screen readers can
convey its purpose, and ensure the attribute is applied on the same button
element that currently calls setShowKey.
- Around line 31-33: The setTimeout in APIKeyForm that calls onSuccess after
1500ms can fire after the component unmounts causing memory leaks or stale
state; store the timer ID (e.g., with a useRef) when calling setTimeout inside
the handler that triggers onSuccess, and clearTimeout(timerRef.current) in a
cleanup (useEffect return or component cleanup) or before setting a new timer so
the callback will not run after unmount; update the code paths that call
setTimeout to use this ref and ensure the timer is cleared when the component
unmounts or before scheduling a new timeout.
🧹 Nitpick comments (29)
frontend/src/lib/api.ts (1)
263-267:deleteKey의provider파라미터가 URL 경로에 직접 삽입됨
provider값이 백엔드 응답에서 오므로 현재 위험은 낮지만,encodeURIComponent로 감싸는 것이 방어적 코딩으로 바람직합니다.🛡️ 수정 제안
deleteKey: async (provider: string): Promise<{ deleted: boolean; provider: string }> => { - return fetchWithConfig(`/api/keys/${provider}`, { + return fetchWithConfig(`/api/keys/${encodeURIComponent(provider)}`, { method: 'DELETE', }); },frontend/src/components/settings/APIKeyList.tsx (1)
13-28:confirm/alert대신 커스텀 모달 사용 고려
window.confirm과window.alert는 기능적으로 동작하지만, 앱의 나머지 UI와 시각적으로 일관되지 않습니다. 향후 커스텀 확인 대화상자로 교체하면 UX가 개선됩니다.frontend/src/components/settings/APIKeyManager.tsx (1)
11-20:fetchKeys재호출 시 에러 상태가 사용자에게 표시되지 않음
fetchKeys가 실패하면console.error만 호출되고 사용자에게는 아무런 피드백이 없습니다. 키 목록 로드 실패 시 빈 목록이 표시되어 사용자가 키가 없다고 오해할 수 있습니다.🛡️ 수정 제안: 에러 상태 추가
export const APIKeyManager: React.FC = () => { const [keys, setKeys] = useState<KeyStatusResponse[]>([]); const [isLoading, setIsLoading] = useState(true); + const [error, setError] = useState<string | null>(null); const fetchKeys = async () => { try { + setError(null); const data = await api.getKeyStatus(); setKeys(data); } catch (error) { console.error('Failed to fetch API keys:', error); + setError('Failed to load API keys. Please try again.'); } finally { setIsLoading(false); } };그리고 렌더링 부분에서
error상태를 표시하세요.backend/app/criteria/scoring.py (1)
185-206:is_information_absent에서 단일 키워드 매칭이 오탐(false positive)을 유발할 수 있습니다.
"missing","absent"같은 짧은 키워드가 rationale 텍스트 내에서 다른 문맥으로 사용될 수 있습니다 (예: "the missing link in their approach" → 데이터 부재가 아닌 의미). 현재 단순in매칭으로는 이런 경우를 구분할 수 없습니다.현재 동작이 의도적으로 보수적(false positive 허용)이라면 괜찮지만, 정밀도가 중요해질 경우 문장 경계 또는 더 구체적인 패턴 매칭을 고려해 볼 수 있습니다.
backend/app/graph/nodes/technique_categories/base_category.py (2)
82-110:logger.error대신logger.exception을 사용하면 스택 트레이스가 포함됩니다.Line 86에서
logger.error는 예외 발생 시 트레이스 정보를 기록하지 않습니다.logger.exception으로 변경하면 디버깅에 필요한 전체 스택 트레이스가 자동으로 포함됩니다.♻️ 수정 제안
except Exception as e: - logger.error(f"{self.category.value}: execution failed - {e}") + logger.exception(f"{self.category.value}: execution failed")
39-46:config파라미터가 사용되지 않지만 LangGraph 인터페이스 준수를 위해 필요합니다.Ruff의 ARG002 경고가 발생하지만,
RunnableConfig은 LangGraph 그래프 노드 시그니처의 일부이므로 유지가 맞습니다. 필요하다면_프리픽스를 붙이거나# noqa: ARG002주석으로 경고를 억제할 수 있습니다.backend/app/graph/nodes/technique_categories/__init__.py (1)
11-21:__all__정렬 고려 (Ruff RUF022)정적 분석 도구(Ruff)에서
__all__이 정렬되지 않았다고 경고하고 있습니다. isort 스타일 정렬을 적용하면 lint 경고를 해소할 수 있습니다.♻️ 제안된 수정
__all__ = [ + "AromaCategoryNode", + "BalanceCategoryNode", "BaseCategoryNode", - "AromaCategoryNode", - "PalateCategoryNode", "BodyCategoryNode", + "CellarCategoryNode", "FinishCategoryNode", - "BalanceCategoryNode", - "VintageCategoryNode", + "PalateCategoryNode", "TerroirCategoryNode", - "CellarCategoryNode", + "VintageCategoryNode", ]backend/tests/test_category_nodes.py (1)
164-181: 불필요한@pytest.mark.asyncio데코레이터
test_all_nodes_have_correct_enum_values테스트는await를 사용하지 않는 순수 동기 로직입니다.@pytest.mark.asyncio마크와async def는 불필요하며, 일반 동기 메서드로 변경하는 것이 더 명확합니다.♻️ 제안된 수정
- `@pytest.mark.asyncio` - async def test_all_nodes_have_correct_enum_values(self): + def test_all_nodes_have_correct_enum_values(self):backend/app/main.py (1)
68-69: 다른 라우터 등록과의 일관성을 위해 설명 주석 추가를 권장합니다.다른 라우터들은 모두 설명 주석이 있습니다 (예:
# Admin routes,# API key management routes). techniques 라우터에도 동일하게 주석을 추가하면 일관성이 향상됩니다.♻️ 제안 수정
# API key management routes app.include_router(api_keys.router, prefix=settings.API_V1_STR) +# Technique evaluation routes app.include_router(techniques.router, prefix=settings.API_V1_STR)backend/tests/test_deep_synthesis.py (1)
1-7: 사용하지 않는MagicMockimport.Line 3에서
MagicMock을 import하지만 이 테스트 파일에서 사용되지 않습니다.♻️ 제안 수정
import pytest from datetime import datetime, timezone -from unittest.mock import MagicMock from app.graph.nodes.technique_categories.deep_synthesis import deep_synthesis from app.graph.state import EvaluationState from app.models.graph import ItemScore, TraceEventbackend/tests/test_finalize_node.py (1)
53-72: 테스트에서 반복되는 item_scores 생성 로직이 있습니다 — 헬퍼 함수 추출을 고려하세요.
test_finalize_quality_gate_pass,test_finalize_quality_gate_concerns,test_finalize_quality_gate_fail테스트가 동일한 17개 항목 리스트와 거의 동일한 item_scores 생성 패턴을 공유합니다. 점수 값만 다릅니다. 헬퍼 함수나 fixture로 추출하면 유지보수성이 향상됩니다.♻️ 예시 헬퍼 함수
ALL_BMAD_ITEMS = ["A1","A2","A3","A4","B1","B2","B3","B4", "C1","C2","C3","C4","C5","D1","D2","D3","D4"] def _make_item_scores(score: float, max_score: float = 6) -> dict: return { item_id: ItemScore( item_id=item_id, score=score, max_score=max_score, evaluated_by="test", technique_id="tech1", timestamp=datetime.now(timezone.utc).isoformat(), status="evaluated", ) for item_id in ALL_BMAD_ITEMS }backend/app/graph/nodes/technique_categories/deep_synthesis.py (1)
1-5: 사용하지 않는Listimport.Line 2에서
List를 import하지만 함수 시그니처와 반환 타입에서 사용되지 않습니다.♻️ 제안 수정
from datetime import datetime, timezone -from typing import Any, Dict, List, Optional +from typing import Any, Dict, Optional from langchain_core.runnables import RunnableConfigbackend/app/graph/full_techniques_graph.py (1)
1-4:__all__이 import 문 사이에 위치해 있습니다.
__all__은 관례상 모든 import 이후에 정의하는 것이 일반적입니다. 현재 Line 3에서__all__이 선언된 후 Line 4-21에서 추가 import가 이어집니다.♻️ 제안 수정
from langgraph.graph import StateGraph, END -__all__ = ["create_full_techniques_graph"] from app.core.config import settings from app.graph.state import EvaluationState from app.graph.checkpoint import get_checkpointer @@ ... from app.graph.nodes.code_analysis_enrich import code_analysis_enrich +__all__ = ["create_full_techniques_graph"] +backend/tests/test_byok_integration.py (1)
217-226: 사용하지 않는resolved_key변수 —_로 대체하세요.Ruff(RUF059)에서 지적한 대로,
resolved_key가 언팩되었지만 사용되지 않습니다. Line 260에서도 동일한 문제가 있습니다.♻️ 제안 수정
- repo_context, resolved_key = await _prepare_repo_context( + repo_context, _ = await _prepare_repo_context( repo_url="https://github.com/owner/repo", api_key=None, github_token=None, user_id=None, )Line 260에도 동일하게 적용:
- repo_context, resolved_key = await _prepare_repo_context( + repo_context, _ = await _prepare_repo_context(backend/tests/test_full_techniques_graph.py (2)
72-88:settingsmock 시 다른 속성 접근 문제 가능성
app.graph.full_techniques_graph.settings전체를 패치하면,RAG_ENABLED외에 그래프 생성 중 접근되는 다른 settings 속성들(web_search_enrich,code_analysis_enrich등에서 사용하는 설정)이MagicMock기본값을 반환하게 됩니다. 이는 의도치 않은 동작을 유발할 수 있습니다.
settings객체 전체를 mock하는 대신,RAG_ENABLED속성만 선택적으로 패치하는 것이 더 안전합니다:♻️ 제안하는 수정
- with patch("app.graph.full_techniques_graph.settings") as mock_settings: - mock_settings.RAG_ENABLED = True + with patch("app.graph.full_techniques_graph.settings.RAG_ENABLED", True):
184-212:_graph_builders.clear()호출이 다른 테스트에 영향을 줄 수 있음
_graph_builders는 모듈 레벨 상태이므로,clear()호출 후 복구하지 않으면 테스트 실행 순서에 따라 다른 테스트가 실패할 수 있습니다._register_graphs()가 내부적으로 빈 dict를 감지하여 재등록한다면 괜찮지만, 확실한 격리를 위해 teardown에서 원래 상태를 복구하는 것이 좋습니다.#!/bin/bash # Check how _register_graphs handles re-registration ast-grep --pattern 'def _register_graphs($$$) { $$$ }' rg -n "_register_graphs" --type=py -C5 -g '!**/test*'backend/tests/test_llm_grader.py (2)
318-326: 하드코딩된 단가로 비용 계산을 검증하고 있어 취약함
0.00003과0.00006이라는 단가가LLMGrader내부의 상수와 동일해야 합니다. 이 단가가 변경되면 테스트가 조용히 실패합니다.LLMGrader에서 단가 상수를 노출하여 테스트에서 참조하는 방식이 더 유지보수에 유리합니다.
91-101: 매직 넘버103의 의미가 불분명함
max_chars=100에 말줄임표"..."3글자를 더한 값이지만, 코드만 보면 의도가 명확하지 않습니다. 주석이나 변수를 통해 의도를 명시하면 가독성이 향상됩니다.♻️ 제안
- assert len(context) <= 103 # Allow for "..." truncation + max_chars_limit = 100 + ellipsis_len = len("...") + assert len(context) <= max_chars_limit + ellipsis_lenbackend/tests/test_technique_api.py (1)
264-282:test_technique_id_with_name_stats_does_not_conflict는 중복 테스트이 테스트(Line 275-281)는
test_stats_route_not_treated_as_technique_id(Line 267-273)와 동일한 엔드포인트를 호출하고 동일한 assertion을 수행합니다. 테스트 이름은 "stats라는 이름의 technique이 존재할 경우"를 시사하지만, 실제로는/api/techniques/stats가 stats 엔드포인트로 라우팅되는 것만 확인하므로 의미 있는 차별화가 없습니다.backend/tests/test_enhanced_code_grader.py (1)
229-235: 미사용 루프 변수item_codeRuff(B007)에서 지적된 대로,
item_code가 루프 본문에서 사용되지 않으므로_item_code로 변경하여 의도를 명시해야 합니다.♻️ 제안
- for item_code, score_data in result["item_scores"].items(): + for _item_code, score_data in result["item_scores"].items():backend/tests/test_scoring.py (2)
231-233:rationale가None인 경우의 테스트가 누락되어 있습니다.
is_information_absent함수는rationale: str | None을 받습니다. 빈 문자열("") 테스트는 있지만Nonerationale + 유효한 evidence 조합에 대한 테스트가 없습니다.rationale is not None분기를 명시적으로 검증하면 좋겠습니다.🧪 None rationale 테스트 추가 제안
def test_returns_false_with_evidence_and_empty_rationale(self): """Should return False with evidence even if rationale is empty.""" assert is_information_absent(["evidence"], "") is False + + def test_returns_false_with_evidence_and_none_rationale(self): + """Should return False with evidence even if rationale is None.""" + assert is_information_absent(["evidence"], None) is False
138-168:apply_confidence_adjustment_to_scores테스트가bmad_items모듈의max_score값에 암묵적으로 의존합니다.
apply_confidence_adjustment_to_scores함수는 내부적으로get_item(item_id)를 호출하여bmad_items에서 실제max_score를 가져옵니다. 현재 테스트 데이터의max_score값(A1=7, A2=6, A3=6)이 실제 BMAD_ITEMS의 정의와 일치하고 있지만, 이 값들이 변경되면 특히lowconfidence 계산(neutral_score = max_score * 0.5를 사용)에서 테스트가 깨질 수 있습니다.테스트의 격리성을 높이려면
get_item을 mock하거나, 테스트 docstring에 이 의존성을 명시적으로 문서화하는 것을 권장합니다.backend/app/criteria/llm_grader.py (4)
137-148: JSON 파싱 폴백 로직에서 불완전한 코드 블록 처리 시 예외 발생 가능.
content에```json이 있지만 닫는```이 없는 경우,split("```")[0]이 전체 나머지 문자열을 반환하여json.loads가 실패합니다. 이 경우 예외가grade_item의except블록에서 포착되어 재시도되므로 치명적이지는 않지만, 보다 방어적인 처리를 고려할 수 있습니다.
270-272: 토큰당 비용이 하드코딩되어 있습니다.
prompt_cost = prompt_tokens * 0.00003,completion_cost = completion_tokens * 0.00006은 특정 모델(Gemini)의 가격에 고정되어 있습니다. 모델이 변경되거나 가격이 업데이트되면 이 값들도 수정해야 합니다. 설정이나 상수로 분리하는 것을 권장합니다.♻️ 상수 분리 제안
+# Per-token cost (USD) — adjust when model changes +COST_PER_PROMPT_TOKEN = 0.00003 +COST_PER_COMPLETION_TOKEN = 0.00006 + + class LLMGrader: ... def _update_usage(self, usage: dict) -> None: ... - prompt_cost = prompt_tokens * 0.00003 - completion_cost = completion_tokens * 0.00006 + prompt_cost = prompt_tokens * COST_PER_PROMPT_TOKEN + completion_cost = completion_tokens * COST_PER_COMPLETION_TOKEN self.total_cost_usd += prompt_cost + completion_cost
82-93: 예외 메시지에str(e)대신!s변환 플래그 사용을 권장합니다.Ruff RUF010 힌트에 따라,
f"LLM call failed: {str(e)}"를f"LLM call failed: {e!s}"로 변경하면 더 관용적입니다. 또한Exception을 광범위하게 잡는 것에 대해 — 외부 LLM 클라이언트의 예외 유형을 예측하기 어려우므로 이 컨텍스트에서는 합리적입니다.
104-135:ainvoke/invoke/ callable 분기에서 중복 코드가 있습니다.
content와usage추출 로직이ainvoke와invoke분기에서 동일하게 반복됩니다. 응답에서 content/usage를 추출하는 헬퍼를 분리하면 중복을 줄일 수 있습니다.backend/app/agents/code_grader.py (3)
19-54: 가변 클래스 속성에ClassVar어노테이션 추가를 권장합니다.Ruff RUF012에 따라, 가변 클래스 속성(
EVALUABLE_ITEMS,OBJECTIVE_ITEMS,SUBJECTIVE_ITEMS,ORIGINAL_ITEMS)은typing.ClassVar로 어노테이션하는 것이 Python 모범 사례입니다.♻️ ClassVar 어노테이션 추가
+from typing import Any, ClassVar + class CodeGraderAgent: - EVALUABLE_ITEMS = [ + EVALUABLE_ITEMS: ClassVar[list[str]] = [ "A1", ... ] - OBJECTIVE_ITEMS = ["C1", "C2", "C4", "B1", "B2", "D1", "D2"] + OBJECTIVE_ITEMS: ClassVar[list[str]] = ["C1", "C2", "C4", "B1", "B2", "D1", "D2"] ...
113-182:evaluate_all에서 주관적 항목의 LLM 호출이 순차적으로 실행됩니다.10개의
SUBJECTIVE_ITEMS에 대해llm_grader.grade_item이 순차적으로 호출됩니다. LLM 호출은 I/O 바운드이므로asyncio.gather를 사용하여 병렬화하면 전체 평가 시간을 크게 단축할 수 있습니다. 다만 LLM API의 rate limit을 고려해야 합니다.♻️ 병렬 호출 제안
- llm_grader = LLMGrader() - for item_code in self.SUBJECTIVE_ITEMS: - if llm is not None or item_code not in item_scores: - result = await llm_grader.grade_item(item_code, repo_context, llm) - item_scores[item_code] = { - "item_code": result["item_id"], - "score": result["score"], - "max_score": result["max_score"], - "evidence": result["evidence"], - "rationale": result["rationale"], - "confidence": result["confidence"], - "metrics": {}, - } + llm_grader = LLMGrader() + items_to_grade = [ + code for code in self.SUBJECTIVE_ITEMS + if llm is not None or code not in item_scores + ] + if items_to_grade: + results = await asyncio.gather( + *(llm_grader.grade_item(code, repo_context, llm) for code in items_to_grade) + ) + for result in results: + item_scores[result["item_id"]] = { + "item_code": result["item_id"], + "score": result["score"], + "max_score": result["max_score"], + "evidence": result["evidence"], + "rationale": result["rationale"], + "confidence": result["confidence"], + "metrics": {}, + }
682-780:_evaluate_code_quality와_evaluate_system_architecture사이에 상당한 코드 중복이 있습니다.린터 설정 감지(ruff, flake8, pylint, eslint 등), 타입 체커 설정 감지(mypy, pyright, tsconfig),
filenames세트 구성 로직이 두 메서드에서 거의 동일하게 반복됩니다. 공통 감지 로직을 헬퍼 메서드로 추출하면 유지보수성이 향상됩니다.Also applies to: 531-680
|
|
||
| llm_grader = LLMGrader() | ||
| for item_code in self.SUBJECTIVE_ITEMS: | ||
| if llm is not None or item_code not in item_scores: | ||
| result = await llm_grader.grade_item(item_code, repo_context, llm) | ||
| item_scores[item_code] = { | ||
| "item_code": result["item_id"], | ||
| "score": result["score"], | ||
| "max_score": result["max_score"], | ||
| "evidence": result["evidence"], | ||
| "rationale": result["rationale"], | ||
| "confidence": result["confidence"], | ||
| "metrics": {}, | ||
| } |
There was a problem hiding this comment.
evaluate_all에서 llm_grader.grade_item 호출 시 llm_grader.py의 버그 영향을 받습니다.
llm_grader.py의 grade_item에서 response["score"] 대신 response["parsed"]["score"]를 참조해야 하는 치명적 버그가 있습니다. LLM이 제공되면 이 코드 경로에서 KeyError가 발생하여 결국 재시도 후 fallback 결과(score=0, confidence="low")가 반환됩니다. 즉, LLM을 사용한 주관적 평가가 항상 실패하게 됩니다.
llm_grader.py의 수정이 선행되어야 합니다.
🤖 Prompt for AI Agents
In `@backend/app/agents/code_grader.py` around lines 156 - 169, The evaluate_all
loop calls LLMGrader.grade_item and is failing because grade_item in
llm_grader.py reads response["score"] instead of the actual parsed value
response["parsed"]["score"]; update LLMGrader.grade_item to extract score,
max_score, confidence, rationale, and evidence from response["parsed"] (e.g.,
use response["parsed"]["score"] etc.) and ensure evaluate_all's usage of
LLMGrader.grade_item remains the same so subjective items use the corrected
parsed fields rather than causing KeyError and fallback.
| coverage_configs = [ | ||
| ".coveragerc", | ||
| "coverage.ini", | ||
| ".nycrc", | ||
| "codecov.yml", | ||
| ".codecov.yml", | ||
| ] | ||
| has_coverage = any( | ||
| cfg in path.lower() for path in file_tree for cfg in coverage_configs | ||
| ) |
There was a problem hiding this comment.
커버리지 설정 감지 로직에서 오탐(false positive) 가능성이 있습니다.
any(cfg in path.lower() for path in file_tree for cfg in coverage_configs) 패턴은 경로의 일부 문자열과 매칭합니다. 예를 들어 "src/coverage.ini.bak" 같은 경로도 매칭됩니다. _evaluate_test_existence 등 다른 메서드에서는 파일명만 추출(path.split("/")[-1])하여 비교하는데, 여기서는 전체 경로에서 부분 문자열 매칭을 사용하여 일관성이 떨어집니다.
🔧 파일명 기반 매칭으로 수정
+ filenames = {path.split("/")[-1].lower() for path in file_tree}
coverage_configs = [
".coveragerc",
"coverage.ini",
".nycrc",
"codecov.yml",
".codecov.yml",
]
- has_coverage = any(
- cfg in path.lower() for path in file_tree for cfg in coverage_configs
- )
+ has_coverage = any(cfg in filenames for cfg in coverage_configs)🤖 Prompt for AI Agents
In `@backend/app/agents/code_grader.py` around lines 812 - 821, The coverage
detection currently uses substring matching which causes false positives; change
the logic that sets has_coverage to compare only the filename portion against
the exact set of coverage filenames (coverage_configs) using a basename
extraction (e.g., path.split("/")[-1] or os.path.basename) and normalize to
lower-case before equality checks so entries like "src/coverage.ini.bak" no
longer match; update the code around coverage_configs and has_coverage to
implement this exact filename comparison using file_tree and the filename
extraction.
| router = APIRouter(prefix="/techniques", tags=["techniques"]) | ||
|
|
||
| VALID_CATEGORIES = {cat.value for cat in TechniqueCategory} | ||
| VALID_HATS = {"white", "red", "black", "yellow", "green", "blue"} |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Search for canonical hat definitions and VALID_HATS usage
rg -n --type=py 'valid_hats|VALID_HATS|Hat.*Enum|hat.*enum|applicable_hats' -g '!*test*' | head -50Repository: Two-Weeks-Team/somm.dev
Length of output: 1090
🏁 Script executed:
# Search for hat-related definitions more broadly
rg -n --type=py '"white"|"red"|"black"|"yellow"|"green"|"blue"' -B2 -A2 | head -80Repository: Two-Weeks-Team/somm.dev
Length of output: 5161
🏁 Script executed:
# Check for Hat enum or class definitions
fd --type f -name '*.py' -exec grep -l 'class Hat\|enum.*Hat\|Hat.*=.*{' {} \; | head -20Repository: Two-Weeks-Team/somm.dev
Length of output: 297
🏁 Script executed:
# Look at the techniques.py file more closely for context
wc -l backend/app/api/routes/techniques.pyRepository: Two-Weeks-Team/somm.dev
Length of output: 107
VALID_HATS를 Hat 열거형에서 동적으로 파생시키기
backend/app/criteria/hat_mappings.py에 정의된 Hat 열거형이 canonical 소스입니다. 현재 코드는 line 21의 VALID_CATEGORIES처럼 열거형에서 동적으로 값을 파생시키지 않고, line 22에서 VALID_HATS를 하드코딩하고 있어 유지보수 일관성이 떨어집니다. 같은 파일에서 VALID_CATEGORIES = {cat.value for cat in TechniqueCategory}를 사용하는 것과 달리, 하드코딩된 집합은 Hat 정의가 변경될 때 동기화 오류 위험이 있습니다.
VALID_HATS = {hat.value for hat in Hat}로 변경하여 Hat 열거형의 유일한 소스에서 유도하는 것이 권장됩니다.
🤖 Prompt for AI Agents
In `@backend/app/api/routes/techniques.py` at line 22, Replace the hardcoded
VALID_HATS set with a dynamic derivation from the Hat enum to keep a single
source of truth: instead of the current VALID_HATS constant, compute it the same
way VALID_CATEGORIES is derived (i.e., build the set from Hat members' .value),
locating the CONSTANT named VALID_HATS and the Hat enum definition used in
hat_mappings so the set reflects any changes to Hat automatically.
| def format_prompt(item_id: str, context: str) -> str: | ||
| """Format a grading prompt with the given repository context. | ||
|
|
||
| Args: | ||
| item_id: The BMAD item ID | ||
| context: Repository context string to insert into the prompt | ||
|
|
||
| Returns: | ||
| Formatted prompt string ready for LLM consumption | ||
| """ | ||
| prompt_template = get_prompt(item_id)["prompt"] | ||
| return prompt_template.format(context=context) |
There was a problem hiding this comment.
str.format()은 중괄호가 포함된 context에서 KeyError/ValueError 발생
context에 코드 스니펫이 포함될 수 있고, Python/JavaScript/JSON 코드에는 {, } 문자가 빈번합니다. str.format(context=context) 호출 시 중괄호가 format 변수로 해석되어 예외가 발생합니다.
예: context = "function() { return {}; }" → KeyError: ' return '
🐛 제안하는 수정
def format_prompt(item_id: str, context: str) -> str:
prompt_template = get_prompt(item_id)["prompt"]
- return prompt_template.format(context=context)
+ return prompt_template.replace("{context}", context)또는 프롬프트 템플릿에서 {context} 외의 중괄호를 이중 이스케이프({{, }})하는 방법도 있으나, replace()가 더 안전하고 간단합니다.
🤖 Prompt for AI Agents
In `@backend/app/criteria/grading_prompts.py` around lines 363 - 374, The current
format_prompt function uses prompt_template.format(context=context) which raises
KeyError/ValueError when the inserted context contains literal braces; change
format_prompt to avoid str.format parsing by replacing the placeholder
explicitly: fetch prompt_template via get_prompt(item_id)["prompt"] and return
prompt_template.replace("{context}", context) (or replace only the first
occurrence if desired) so literal `{` and `}` in context are preserved and no
format parsing occurs.
| if self._validate_response(response, item_id, max_score): | ||
| self._update_usage(response.get("usage", {})) | ||
|
|
||
| return { | ||
| "item_id": item_id, | ||
| "score": float(response["score"]), | ||
| "max_score": float(max_score), | ||
| "confidence": response["confidence"], | ||
| "evidence": response["evidence"], | ||
| "rationale": response["rationale"], | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n backend/app/criteria/llm_grader.pyRepository: Two-Weeks-Team/somm.dev
Length of output: 12496
치명적 버그: response["score"] 등이 아닌 response["parsed"]["score"]를 참조해야 합니다.
_call_llm은 라인 149에서 {"parsed": parsed, "usage": usage} 구조를 반환합니다. 하지만 grade_item의 라인 61~65에서 response["score"], response["confidence"], response["evidence"], response["rationale"]를 직접 접근하고 있어 KeyError가 발생합니다. _validate_response는 정확하게 response.get("parsed", {})를 사용하고 있으므로, grade_item도 동일하게 중첩된 parsed 키를 통해 접근해야 합니다.
수정 제안
if self._validate_response(response, item_id, max_score):
self._update_usage(response.get("usage", {}))
+ parsed = response["parsed"]
return {
"item_id": item_id,
- "score": float(response["score"]),
+ "score": float(parsed["score"]),
"max_score": float(max_score),
- "confidence": response["confidence"],
- "evidence": response["evidence"],
- "rationale": response["rationale"],
+ "confidence": parsed["confidence"],
+ "evidence": parsed["evidence"],
+ "rationale": parsed["rationale"],
}🤖 Prompt for AI Agents
In `@backend/app/criteria/llm_grader.py` around lines 56 - 66, grade_item
currently accesses response["score"], response["confidence"],
response["evidence"], and response["rationale"] directly but _call_llm returns
{"parsed": parsed, "usage": usage} and _validate_response checks
response.get("parsed", {}); update grade_item to read these values from
response["parsed"] (e.g., response.get("parsed", {}).get("score") etc.), keep
the existing _update_usage(response.get("usage", {})) call, and ensure returned
keys use the parsed values (score, confidence, evidence, rationale) while
preserving types (float for score/max_score) so that grade_item, _call_llm, and
_validate_response are consistent.
| if cat_id in category_scores: | ||
| category_scores[cat_id]["score"] += min(score_value, item.max_score) | ||
|
|
||
| coverage = evaluated_count / 17 if evaluated_count > 0 else 0.0 |
There was a problem hiding this comment.
하드코딩된 17을 동적 조회로 교체해야 함
커밋 메시지에서 "replace hardcoded BMAD count with dynamic lookup"이라고 명시했으나, 이 코드는 여전히 17이 하드코딩되어 있습니다. BMAD 항목 수가 변경되면 이 로직이 잘못된 coverage를 계산합니다. list_items()가 이미 임포트되어 있으므로 이를 활용해야 합니다.
🐛 제안하는 수정
- coverage = evaluated_count / 17 if evaluated_count > 0 else 0.0
+ total_items = len(list_items())
+ coverage = evaluated_count / total_items if total_items > 0 else 0.0🤖 Prompt for AI Agents
In `@backend/app/graph/nodes/technique_categories/finalize.py` at line 49, Replace
the hardcoded denominator 17 with a dynamic lookup using list_items(): call
list_items() to get the BMAD items collection (e.g., items = list_items(...)),
compute total_count = len(items), and then set coverage = evaluated_count /
total_count if total_count > 0 else 0.0; update the expression that currently
uses 17 to use total_count and preserve the existing evaluated_count and
zero-division guard.
| class TestC1CodeQuality: | ||
| def test_c1_detects_linter_configs(self): | ||
| from app.agents.code_grader import CodeGraderAgent | ||
|
|
||
| agent = CodeGraderAgent() | ||
| context = { | ||
| "repo_context": { | ||
| "readme": "", | ||
| "file_tree": [".eslintrc", "src/main.js"], | ||
| "main_files": [{"path": "main.js", "content": "const x = 1;"}], | ||
| } | ||
| } | ||
| result = agent.evaluate(context) | ||
|
|
||
| assert "C1" not in result["item_scores"] | ||
|
|
||
| import asyncio | ||
|
|
||
| async def test_all(): | ||
| all_result = await agent.evaluate_all(context, llm=None) | ||
| assert "C1" in all_result["item_scores"] | ||
| assert all_result["item_scores"]["C1"]["max_score"] == 7 | ||
|
|
||
| asyncio.run(test_all()) |
There was a problem hiding this comment.
동기 테스트 내에서 asyncio.run() 사용은 이벤트 루프 충돌을 일으킬 수 있음
pytest-asyncio가 이미 이벤트 루프를 관리하고 있는 환경에서 asyncio.run()을 호출하면 RuntimeError: asyncio.run() cannot be called when another event loop is running 에러가 발생할 수 있습니다. @pytest.mark.asyncio를 사용하는 async 테스트 메서드로 변환해야 합니다.
🐛 제안하는 수정 (TestC1CodeQuality)
class TestC1CodeQuality:
- def test_c1_detects_linter_configs(self):
+ `@pytest.mark.asyncio`
+ async def test_c1_detects_linter_configs(self):
from app.agents.code_grader import CodeGraderAgent
agent = CodeGraderAgent()
context = {
"repo_context": {
"readme": "",
"file_tree": [".eslintrc", "src/main.js"],
"main_files": [{"path": "main.js", "content": "const x = 1;"}],
}
}
result = agent.evaluate(context)
assert "C1" not in result["item_scores"]
- import asyncio
-
- async def test_all():
- all_result = await agent.evaluate_all(context, llm=None)
- assert "C1" in all_result["item_scores"]
- assert all_result["item_scores"]["C1"]["max_score"] == 7
-
- asyncio.run(test_all())
+ all_result = await agent.evaluate_all(context, llm=None)
+ assert "C1" in all_result["item_scores"]
+ assert all_result["item_scores"]["C1"]["max_score"] == 7🤖 Prompt for AI Agents
In `@backend/tests/test_enhanced_code_grader.py` around lines 238 - 261, The test
uses asyncio.run() inside TestC1CodeQuality.test_c1_detects_linter_configs which
will conflict with pytest-asyncio; convert the inner async function into an
async test by marking the test with `@pytest.mark.asyncio` (or make a new async
test method) and replace asyncio.run(test_all()) with a direct await of
CodeGraderAgent.evaluate_all (i.e., await agent.evaluate_all(context,
llm=None)); update imports to include pytest if needed and remove asyncio.run
usage so TestC1CodeQuality.test_c1_detects_linter_configs (or a new async
sibling) awaits agent.evaluate_all directly.
| class TestC2TestingCoverage: | ||
| def test_c2_detects_test_coverage_configs(self): | ||
| from app.agents.code_grader import CodeGraderAgent | ||
|
|
||
| agent = CodeGraderAgent() | ||
| context = { | ||
| "repo_context": { | ||
| "readme": "", | ||
| "file_tree": ["tests/test_main.py", ".coveragerc"], | ||
| "main_files": [], | ||
| } | ||
| } | ||
|
|
||
| import asyncio | ||
|
|
||
| async def test_all(): | ||
| all_result = await agent.evaluate_all(context, llm=None) | ||
| assert "C2" in all_result["item_scores"] | ||
| assert all_result["item_scores"]["C2"]["max_score"] == 6 | ||
|
|
||
| asyncio.run(test_all()) |
There was a problem hiding this comment.
동일한 asyncio.run() 문제 — TestC2TestingCoverage에서도 동일 패턴
위 TestC1CodeQuality와 같은 이유로 @pytest.mark.asyncio async 테스트로 변환이 필요합니다.
🐛 제안하는 수정 (TestC2TestingCoverage)
class TestC2TestingCoverage:
- def test_c2_detects_test_coverage_configs(self):
+ `@pytest.mark.asyncio`
+ async def test_c2_detects_test_coverage_configs(self):
from app.agents.code_grader import CodeGraderAgent
agent = CodeGraderAgent()
context = {
"repo_context": {
"readme": "",
"file_tree": ["tests/test_main.py", ".coveragerc"],
"main_files": [],
}
}
- import asyncio
-
- async def test_all():
- all_result = await agent.evaluate_all(context, llm=None)
- assert "C2" in all_result["item_scores"]
- assert all_result["item_scores"]["C2"]["max_score"] == 6
-
- asyncio.run(test_all())
+ all_result = await agent.evaluate_all(context, llm=None)
+ assert "C2" in all_result["item_scores"]
+ assert all_result["item_scores"]["C2"]["max_score"] == 6📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| class TestC2TestingCoverage: | |
| def test_c2_detects_test_coverage_configs(self): | |
| from app.agents.code_grader import CodeGraderAgent | |
| agent = CodeGraderAgent() | |
| context = { | |
| "repo_context": { | |
| "readme": "", | |
| "file_tree": ["tests/test_main.py", ".coveragerc"], | |
| "main_files": [], | |
| } | |
| } | |
| import asyncio | |
| async def test_all(): | |
| all_result = await agent.evaluate_all(context, llm=None) | |
| assert "C2" in all_result["item_scores"] | |
| assert all_result["item_scores"]["C2"]["max_score"] == 6 | |
| asyncio.run(test_all()) | |
| class TestC2TestingCoverage: | |
| `@pytest.mark.asyncio` | |
| async def test_c2_detects_test_coverage_configs(self): | |
| from app.agents.code_grader import CodeGraderAgent | |
| agent = CodeGraderAgent() | |
| context = { | |
| "repo_context": { | |
| "readme": "", | |
| "file_tree": ["tests/test_main.py", ".coveragerc"], | |
| "main_files": [], | |
| } | |
| } | |
| all_result = await agent.evaluate_all(context, llm=None) | |
| assert "C2" in all_result["item_scores"] | |
| assert all_result["item_scores"]["C2"]["max_score"] == 6 |
🤖 Prompt for AI Agents
In `@backend/tests/test_enhanced_code_grader.py` around lines 264 - 284, The test
TestC2TestingCoverage currently uses asyncio.run() inside
test_c2_detects_test_coverage_configs which prevents pytest async handling;
change the test to be an async pytest coroutine by adding `@pytest.mark.asyncio`
and making test_c2_detects_test_coverage_configs async, remove the inner
asyncio.run/test_all wrapper, and directly await agent.evaluate_all(context,
llm=None) (referencing the TestC2TestingCoverage class, the
test_c2_detects_test_coverage_configs function, and the
CodeGraderAgent.evaluate_all call) so assertions run on the awaited result.
| setTimeout(() => { | ||
| onSuccess(); | ||
| }, 1500); |
There was a problem hiding this comment.
setTimeout 정리 누락으로 언마운트 시 메모리 누수 및 상태 업데이트 가능
컴포넌트가 1.5초 이내에 언마운트되면 setTimeout 콜백이 이미 언마운트된 컴포넌트에서 onSuccess를 호출합니다. React 19에서는 경고가 발생하지 않지만, 부모 컴포넌트에서 예기치 않은 fetchKeys 호출이 발생할 수 있습니다.
🛡️ 수정 제안
useRef로 타이머 ID를 저장하고 cleanup하거나, useEffect cleanup 패턴을 사용하세요:
+import React, { useState, useRef, useEffect } from 'react';
-import React, { useState } from 'react';
...
export const APIKeyForm: React.FC<APIKeyFormProps> = ({ onSuccess }) => {
+ const timerRef = useRef<ReturnType<typeof setTimeout>>(null);
const [apiKey, setApiKey] = useState('');
...
+
+ useEffect(() => {
+ return () => {
+ if (timerRef.current) clearTimeout(timerRef.current);
+ };
+ }, []);
+
...
setSuccess(true);
setApiKey('');
- setTimeout(() => {
+ timerRef.current = setTimeout(() => {
onSuccess();
}, 1500);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| setTimeout(() => { | |
| onSuccess(); | |
| }, 1500); | |
| import React, { useState, useRef, useEffect } from 'react'; | |
| export const APIKeyForm: React.FC<APIKeyFormProps> = ({ onSuccess }) => { | |
| const timerRef = useRef<ReturnType<typeof setTimeout>>(null); | |
| const [apiKey, setApiKey] = useState(''); | |
| const [isLoading, setIsLoading] = useState(false); | |
| const [success, setSuccess] = useState(false); | |
| const [showPassword, setShowPassword] = useState(false); | |
| useEffect(() => { | |
| return () => { | |
| if (timerRef.current) clearTimeout(timerRef.current); | |
| }; | |
| }, []); | |
| // ... rest of the component code ... | |
| const handleSubmit = async (e: React.FormEvent) => { | |
| e.preventDefault(); | |
| setError(null); | |
| setIsLoading(true); | |
| try { | |
| // API call logic | |
| setSuccess(true); | |
| setApiKey(''); | |
| timerRef.current = setTimeout(() => { | |
| onSuccess(); | |
| }, 1500); | |
| } finally { | |
| setIsLoading(false); | |
| } | |
| }; | |
| // ... rest of component ... | |
| } |
🤖 Prompt for AI Agents
In `@frontend/src/components/settings/APIKeyForm.tsx` around lines 31 - 33, The
setTimeout in APIKeyForm that calls onSuccess after 1500ms can fire after the
component unmounts causing memory leaks or stale state; store the timer ID
(e.g., with a useRef) when calling setTimeout inside the handler that triggers
onSuccess, and clearTimeout(timerRef.current) in a cleanup (useEffect return or
component cleanup) or before setting a new timer so the callback will not run
after unmount; update the code paths that call setTimeout to use this ref and
ensure the timer is cleared when the component unmounts or before scheduling a
new timeout.
| <button | ||
| type="button" | ||
| onClick={() => setShowKey(!showKey)} | ||
| className="absolute right-3 top-1/2 -translate-y-1/2 p-1 text-gray-400 hover:text-[#722F37] transition-colors" | ||
| > | ||
| {showKey ? <EyeOff className="w-5 h-5" /> : <Eye className="w-5 h-5" />} | ||
| </button> |
There was a problem hiding this comment.
접근성: show/hide 토글 버튼에 aria-label 누락
스크린 리더 사용자가 이 버튼의 용도를 알 수 없습니다.
♿ 수정 제안
<button
type="button"
onClick={() => setShowKey(!showKey)}
+ aria-label={showKey ? 'Hide API key' : 'Show API key'}
className="absolute right-3 top-1/2 ..."
>📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| <button | |
| type="button" | |
| onClick={() => setShowKey(!showKey)} | |
| className="absolute right-3 top-1/2 -translate-y-1/2 p-1 text-gray-400 hover:text-[#722F37] transition-colors" | |
| > | |
| {showKey ? <EyeOff className="w-5 h-5" /> : <Eye className="w-5 h-5" />} | |
| </button> | |
| <button | |
| type="button" | |
| onClick={() => setShowKey(!showKey)} | |
| aria-label={showKey ? 'Hide API key' : 'Show API key'} | |
| className="absolute right-3 top-1/2 -translate-y-1/2 p-1 text-gray-400 hover:text-[`#722F37`] transition-colors" | |
| > | |
| {showKey ? <EyeOff className="w-5 h-5" /> : <Eye className="w-5 h-5" />} | |
| </button> |
🤖 Prompt for AI Agents
In `@frontend/src/components/settings/APIKeyForm.tsx` around lines 78 - 84, The
toggle button that shows/hides the API key (uses showKey state and setShowKey
handler, rendering Eye/EyeOff icons) is missing an accessible label; add a
dynamic aria-label to the button (e.g., aria-label={showKey ? "Hide API key" :
"Show API key"}) so screen readers can convey its purpose, and ensure the
attribute is applied on the same button element that currently calls setShowKey.
Critical fixes: - Fix KeyError in llm_grader.py (access response['parsed'] instead of response directly) - Fix str.format() crash with curly braces in code context (use str.replace) - Handle full_techniques mode result in evaluation_service.py Major fixes: - Replace hardcoded 17 with len(list_items()) in finalize.py - Fix total_tokens calculation in base_category.py - Unify status default value to DATA_MISSING in scoring.py Minor fixes: - Add CONFLICT_SCORE_RANGE_THRESHOLD constant in deep_synthesis.py - Fix coverage config detection to use filename matching in code_grader.py - Add setTimeout cleanup and aria-label in APIKeyForm.tsx Docs: - Update README.md with 6 evaluation modes and Techniques API endpoints
Summary
Wave 4 of the Fairthon 75-Technique Evaluation System migration (Epic #223).
Implements Group C (LangGraph full techniques pipeline), Group D (BYOK integration + UI), and Group E (API, scoring, enhanced grading).
10 issues resolved. 159 new tests. 1003 total tests passing, 0 failures.
Issues Resolved
Group C — LangGraph Full Techniques Pipeline
graph/full_techniques_graph.py— 8 parallel category nodes + deep_synthesis + finalizegraph/nodes/technique_categories/— BaseCategoryNode + 8 concrete nodesdeep_synthesis.py(conflict detection),finalize.py(quality gate, scores)Group D — BYOK
frontend/src/app/settings/api-keys/page.tsx,components/settings/APIKey*.tsxservices/evaluation_service.py— stored key lookup with priority chainGroup E — API + Grading
criteria/scoring.py— confidence adjustment, exclusion normalizationapi/routes/techniques.py— 3 GET endpointsagents/code_grader.py— expanded 5→17 BMAD itemscriteria/llm_grader.py— subjective item grading with retryArchitecture
Full Techniques Graph
BYOK Key Priority
Grading Architecture
New Files (26)
Backend (22)
app/criteria/scoring.py— Confidence adjustment, exclusion normalization, coverageapp/criteria/grading_prompts.py— 11 structured prompt templatesapp/criteria/llm_grader.py— LLMGrader classapp/api/routes/techniques.py— 3 GET endpointsapp/graph/full_techniques_graph.py— StateGraph definitionapp/graph/nodes/technique_categories/— 10 files (base + 8 nodes + init)app/graph/nodes/technique_categories/deep_synthesis.pyapp/graph/nodes/technique_categories/finalize.pyFrontend (4)
src/app/settings/api-keys/page.tsx— Protected settings pagesrc/components/settings/APIKeyManager.tsx— Orchestratorsrc/components/settings/APIKeyForm.tsx— Key registration formsrc/components/settings/APIKeyList.tsx— Key list with deleteModified Files (6)
app/agents/code_grader.py— Expanded to 17 items (backward compatible)app/graph/graph_factory.py— Registered full_techniques mode (3 modes total)app/main.py— Registered techniques routerapp/services/evaluation_service.py— BYOK stored key lookup + full_techniques progressfrontend/src/components/UserMenu.tsx— Added API Keys linkfrontend/src/lib/api.ts— Added BYOK API methodsTesting
What's Next
Closes #234, Closes #235, Closes #236, Closes #237, Closes #239, Closes #240, Closes #241, Closes #242, Closes #243
Summary by CodeRabbit
릴리스 노트