feat(group-a): EvaluationState extension + Hat-BMAD mappings (#228, #229)#257
feat(group-a): EvaluationState extension + Hat-BMAD mappings (#228, #229)#257ComBba wants to merge 2 commits into
Conversation
#226, #227) - #224: Remove input_source field, add fairthon_source for YAML preservation - Remove filter_techniques() and determine_available_inputs() from loader - Add get_all_techniques() and has_readme_content() to registry - #225: Replace 75 Korean YAML technique definitions with English Fairthon templates - Map Fairthon categories to wine-themed folder names - Remove nameKo field from schema and all YAMLs - #226: Unify EvaluationMode enum as single source of truth - Remove duplicate EvaluationMode from models/graph.py - Replace SIX_HATS with SIX_SOMMELIERS everywhere - Add FULL_TECHNIQUES mode - Update frontend TypeScript types - #227: Define BMAD 17-item evaluation canon (A1-D4, 100 points) - Create backend/app/criteria/ module with bmad_items.py - A=25pts, B=25pts, C=30pts, D=20pts - Fix pre-existing auth test failures (7 tests) - Update _check_ownership tests for 3-param signature - Fix Secure cookie assertion for localhost environment Closes #224, Closes #225, Closes #226, Closes #227
) - #228: Extend EvaluationState with full_techniques fields - Add 14 new state fields (evaluation_mode, quality_gate, category_scores, etc.) - Extend ItemScore: relax score constraint (ge=0), add 8 optional fields - Create constants.py with quality gate logic (PASS/CONCERNS/FAIL/INCOMPLETE) - Add merge_hat_contributions reducer - #229: Create hat and technique mapping modules - hat_mappings.py: Hat enum, Hat→BMAD items, reverse ITEM→Hats lookup - technique_mappings.py: 75 technique→BMAD item mappings - Mode filtering: full_techniques(75) > six_sommeliers(P0+P1) > grand_tasting(P0) - All 17 BMAD items covered by at least one hat Closes #228, Closes #229
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @ComBba, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! 이 PR은 평가 시스템의 핵심 데이터 모델과 로직을 확장하여, 'full_techniques' 평가를 위한 기반을 마련합니다. EvaluationState와 ItemScore 모델에 상세 평가 정보를 담을 수 있도록 필드를 추가하고, 6색 모자 및 75개 기법과 BMAD 항목 간의 복잡한 매핑 체계를 구축합니다. 또한, 평가 결과에 대한 품질 게이트를 도입하여 평가의 신뢰성과 유용성을 높이는 것을 목표로 합니다. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
bbc7143 to
1be88eb
Compare
There was a problem hiding this comment.
Code Review
This PR significantly extends EvaluationState and introduces a mapping system for full_techniques evaluation, incorporating new hat_mappings, technique_mappings modules, and quality_gate logic. However, two security concerns were identified: the inclusion of sensitive API keys in the persistent graph state, posing a risk of credential exposure, and a potential crash in the state merging logic when handling structured data from agents. It is recommended to handle secrets using non-persistent configuration and to harden the merging logic against unhashable types. Additionally, specific review comments provide suggestions for improving code readability and maintainability through Pythonic style.
| evaluation_mode: NotRequired[str] | ||
| github_url: NotRequired[Optional[str]] | ||
| github_analysis: NotRequired[Optional[dict]] | ||
| user_api_key: NotRequired[Optional[str]] |
There was a problem hiding this comment.
The user_api_key field is added to the EvaluationState TypedDict, which is used as the persistent state for the LangGraph evaluation pipeline. Storing sensitive API keys in a persistent state store (managed by a checkpointer as seen in backend/app/graph/graph.py) is insecure as these stores are typically not encrypted and the state may be exposed via logs, debugging tools, or API responses. This violates the principle of secure data handling for sensitive credentials.
| merged = dict(existing) | ||
| for k, v in data.items(): | ||
| if isinstance(v, list) and isinstance(merged.get(k), list): | ||
| merged[k] = sorted(set(merged[k]) | set(v)) |
There was a problem hiding this comment.
The use of set() and sorted() on the merged list will cause a TypeError and crash the evaluation pipeline if the list contains unhashable or non-comparable types, such as dictionaries. Since hat_contributions is intended to aggregate data from various agents (sommeliers), it is highly likely to contain structured data (e.g., lists of improvement objects). An attacker could potentially trigger a Denial of Service (DoS) for an evaluation by influencing the LLM output to include such structured data in a merged field.
| merged[k] = sorted(set(merged[k]) | set(v)) | |
| combined = merged[k] + v | |
| unique = [] | |
| for item in combined: | |
| if item not in unique: | |
| unique.append(item) | |
| try: | |
| merged[k] = sorted(unique) | |
| except TypeError: | |
| merged[k] = unique |
| for hat, items in HAT_TO_ITEMS.items(): | ||
| for item in items: | ||
| if item not in ITEM_TO_HATS: | ||
| ITEM_TO_HATS[item] = [] | ||
| ITEM_TO_HATS[item].append(hat) |
| result = [] | ||
| for tech in all_techs: | ||
| if hat in tech.applicable_hats: | ||
| result.append(tech.id) | ||
| return sorted(result) |
There was a problem hiding this comment.
| if mode == "full_techniques": | ||
| return sorted(all_ids) | ||
|
|
||
| if mode == "grand_tasting": | ||
| return sorted( | ||
| [ | ||
| t_id | ||
| for t_id in all_ids | ||
| if TECHNIQUE_PRIORITY.get(t_id, Priority.P2) == Priority.P0 | ||
| ] | ||
| ) | ||
|
|
||
| if mode == "six_sommeliers": | ||
| return sorted( | ||
| [ | ||
| t_id | ||
| for t_id in all_ids | ||
| if TECHNIQUE_PRIORITY.get(t_id, Priority.P2) | ||
| in (Priority.P0, Priority.P1) | ||
| ] | ||
| ) | ||
|
|
||
| return sorted(all_ids) |
There was a problem hiding this comment.
get_techniques_for_mode 함수의 로직이 다소 중복됩니다. full_techniques 모드와 유효하지 않은 모드 모두 모든 기법을 반환하므로, if/elif/else 구조를 사용하여 로직을 더 명확하게 만들고 마지막에 sorted()를 한 번만 호출하도록 리팩토링하는 것이 좋습니다.
if mode == "grand_tasting":
filtered_ids = [
t_id
for t_id in all_ids
if TECHNIQUE_PRIORITY.get(t_id, Priority.P2) == Priority.P0
]
elif mode == "six_sommeliers":
filtered_ids = [
t_id
for t_id in all_ids
if TECHNIQUE_PRIORITY.get(t_id, Priority.P2)
in (Priority.P0, Priority.P1)
]
else: # "full_techniques" or any other invalid mode
filtered_ids = all_ids
return sorted(filtered_ids)| best = None | ||
| best_priority = Priority.P2.value + 1 | ||
| for t_id in tech_ids: | ||
| p = TECHNIQUE_PRIORITY.get(t_id, Priority.P2).value | ||
| if p < best_priority: | ||
| best_priority = p | ||
| best = t_id | ||
| return best |
| current: Optional[Dict[str, object]], incoming: Optional[Dict[str, object]] | ||
| ) -> Dict[str, object]: |
There was a problem hiding this comment.
타입 힌트로 object를 사용하는 것보다 Any를 사용하는 것이 Python 타이핑 컨벤션에 더 부합하며, "어떤 타입이든 올 수 있음"을 더 명확하게 나타냅니다. typing에서 Any를 임포트하여 사용하시는 것을 권장합니다.
| current: Optional[Dict[str, object]], incoming: Optional[Dict[str, object]] | |
| ) -> Dict[str, object]: | |
| current: Optional[Dict[str, Any]], incoming: Optional[Dict[str, Any]] | |
| ) -> Dict[str, Any]: |
| def test_full_techniques_returns_75(self): | ||
| """get_techniques_for_mode('full_techniques') returns 75 techniques.""" | ||
| from app.criteria.technique_mappings import get_techniques_for_mode | ||
|
|
||
| techs = get_techniques_for_mode("full_techniques") | ||
| assert len(techs) == 75, f"Expected 75 techniques, got {len(techs)}" |
There was a problem hiding this comment.
테스트 코드에 기법의 총 개수인 75가 하드코딩되어 있어, 향후 기법이 추가되거나 삭제될 때 테스트가 실패하고 수동으로 수정해야 하는 번거로움이 있습니다. get_registry().count()를 사용하여 기법의 총 개수를 동적으로 가져오면 테스트의 유지보수성과 견고성을 높일 수 있습니다. 이 파일의 다른 테스트 케이스들(test_mode_count_ordering, test_get_techniques_for_mode_invalid_returns_all)에서도 75가 하드코딩되어 있으니 함께 수정하면 좋을 것 같습니다.
| def test_full_techniques_returns_75(self): | |
| """get_techniques_for_mode('full_techniques') returns 75 techniques.""" | |
| from app.criteria.technique_mappings import get_techniques_for_mode | |
| techs = get_techniques_for_mode("full_techniques") | |
| assert len(techs) == 75, f"Expected 75 techniques, got {len(techs)}" | |
| def test_full_techniques_returns_all_techniques(self): | |
| """get_techniques_for_mode('full_techniques') returns all techniques from the registry.""" | |
| from app.criteria.technique_mappings import get_techniques_for_mode | |
| from app.techniques.registry import get_registry | |
| techs = get_techniques_for_mode("full_techniques") | |
| expected_count = get_registry().count() | |
| assert len(techs) == expected_count, f"Expected {expected_count} techniques, got {len(techs)}" |
Summary
Epic #223 — Group A Wave 2. EvaluationState를 full_techniques 평가에 필요한 필드로 확장하고, 6색 모자(Hat)와 BMAD 항목, 75개 기법 간의 매핑 체계를 구축합니다.
Base:
feat/group-a-foundation(PR #256) — Wave 1 머지 후 base를 master로 변경 예정Changes
#228 — EvaluationState 확장 + ItemScore 강화 + Quality Gate
backend/app/models/graph.pyItemScore.score제약 완화:ge=0, le=100→ge=0(BMAD 항목별 최대 4~7점)item_name,max_score,status,hat_used,evidence,rationale,confidence,unevaluated_reasonbackend/app/graph/state.pyevaluation_mode,quality_gate,category_scores,hat_contributions,total_score,strengths,improvements등merge_hat_contributionsreducer 추가backend/app/constants.py(신규)#229 — Hat→BMAD + Technique→Hat/Item/Priority mappings
backend/app/criteria/hat_mappings.py(신규)Hatenum (white, red, black, yellow, green, blue)HAT_TO_ITEMS/ITEM_TO_HATS양방향 매핑validate_coverage()— 17개 BMAD 항목 전체 커버 검증backend/app/criteria/technique_mappings.py(신규)TECHNIQUE_TO_ITEMS)get_techniques_for_mode()— 모드별 필터링full_techniques: 75개 전체six_sommeliers: P0+P1grand_tasting: P0만Testing
Checklist
Closes #228, Closes #229