Commit 77361ee

Merge pull request #363 from raifdmueller/feat/new-anchor-evals
feat: evaluation specs + results for PERT, GRASP, VSA
2 parents 9589f07 + 5ae0803 commit 77361ee

21 files changed

Lines changed: 1606 additions & 44 deletions

evaluations/pilot.py

Lines changed: 6 additions & 2 deletions
@@ -295,9 +295,11 @@ def save_results(all_results, out_file):
 def run_pilot(models, dry_run=False, verbose=False, ollama_model="qwen3:4b", no_think=False,
               ollama_url="http://localhost:11434", openai_model="gpt-4o-mini",
               mistral_model="mistral-large-latest", deepseek_model="deepseek-chat",
-              claude_model="claude-sonnet-4-20250514"):
+              claude_model="claude-sonnet-4-20250514", anchor_filter=None):
     start_time = time.time()
     specs = load_specs()
+    if anchor_filter:
+        specs = [s for s in specs if s["anchor"] in anchor_filter]
     print(f"Loaded {len(specs)} anchor specs")
     print(f"Models: {', '.join(models)}")
     print(f"Temperature: {TEMPERATURE}")
@@ -522,6 +524,8 @@ def append_and_save(r):
                         help="Sampling temperature (default: 0.0). Note: claude-cli/claude-haiku ignore this.")
     parser.add_argument("--no-think", action="store_true",
                         help="Disable reasoning/thinking for Ollama models (faster, fewer tokens)")
+    parser.add_argument("--anchor", nargs="+",
+                        help="Only evaluate specific anchors (e.g., --anchor pert grasp)")
     parser.add_argument("--dry-run", action="store_true",
                         help="Show prompts without sending")
     parser.add_argument("--verbose", action="store_true",
@@ -530,4 +534,4 @@ def append_and_save(r):
     set_temperature(args.temperature)
     run_pilot(args.model, args.dry_run, args.verbose, args.ollama_model, args.no_think,
               args.ollama_url, args.openai_model, args.mistral_model, args.deepseek_model,
-              args.claude_model)
+              args.claude_model, args.anchor)
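
The filtering behaviour added in this hunk can be sketched in isolation. This is a minimal illustration, not the code in pilot.py: `filter_specs` and `build_parser` are hypothetical helper names, and the `specs` list is a stand-in for whatever `load_specs()` returns. Only the `--anchor` argument and the list comprehension are taken from the diff.

```python
import argparse


def filter_specs(specs, anchor_filter=None):
    """Keep only specs whose "anchor" value is listed in anchor_filter.

    A falsy filter (None or an empty list) keeps every spec, which mirrors
    the `if anchor_filter:` guard in the diff.
    """
    if anchor_filter:
        return [s for s in specs if s["anchor"] in anchor_filter]
    return specs


def build_parser():
    # Hypothetical helper: only the --anchor argument is copied from the diff.
    parser = argparse.ArgumentParser()
    parser.add_argument("--anchor", nargs="+",
                        help="Only evaluate specific anchors (e.g., --anchor pert grasp)")
    return parser


# Stand-in data; the real specs carry tier and questions as well.
specs = [{"anchor": "pert"}, {"anchor": "grasp"}, {"anchor": "kiss-principle"}]

args = build_parser().parse_args(["--anchor", "pert", "grasp"])
selected = filter_specs(specs, args.anchor)
print([s["anchor"] for s in selected])

# Without --anchor, argparse leaves args.anchor as None and nothing is filtered.
everything = filter_specs(specs, build_parser().parse_args([]).anchor)
print(len(everything))
```

Two properties follow from the comparison used in the diff: matching is exact and case-sensitive against the `anchor:` field of each spec, and a name that matches nothing yields an empty spec list rather than an error.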

evaluations/report.html

Lines changed: 22 additions & 21 deletions
Large diffs are not rendered by default.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+anchor: explicit-contract-surface
+tier: 3
+
+questions:
+  recognition:
+    question: 'Which of the following best describes "Explicit Contract Surface"?
+
+      '
+    options:
+      A: A legal framework for defining service-level agreements between microservices
+        with automated compliance monitoring and penalty enforcement
+      B: The practice of exposing only well-defined DTOs and contracts at layer
+        boundaries, never domain entities, with explicit mapping at every seam
+      C: A UI design pattern that makes all system constraints and validation rules
+        visible to users through progressive disclosure and inline feedback
+      D: A versioning strategy that maintains backward compatibility by exposing
+        multiple API versions simultaneously through content negotiation
+    correct: B

evaluations/specs/grasp.yaml

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+anchor: grasp
+tier: 3
+questions:
+  recognition:
+    question: 'Which of the following best describes "GRASP"?
+
+      '
+    options:
+      A: Ensure that a class has only one reason to change by separating concerns
+        and maintaining high cohesion within modules while minimizing dependencies
+        between them
+      B: Assign responsibility to the class that has the information needed to fulfill
+        it; assign responsibility for creating an object to the class that aggregates,
+        contains, or closely uses it
+      C: Define a family of algorithms, encapsulate each one, and make them interchangeable
+        so that the algorithm can vary independently from clients that use it
+      D: Depend on abstractions rather than concretions by inverting the direction
+        of dependencies so that high-level modules do not depend on low-level modules
+    correct: B
+  application:
+    scenario: You're designing an e-commerce system where customers can place orders
+      containing multiple items. The system needs to calculate the total cost of an
+      order, including item prices, taxes, and shipping fees. Currently, the OrderController
+      class is handling all calculations, validation, and persistence logic.
+    anchor_prompt: using GRASP
+    paraphrase_prompt: How should you restructure the responsibility assignments to
+      create a more maintainable and cohesive design?
+    options:
+      A: Keep all calculation logic in OrderController but extract tax and shipping
+        calculations into separate static utility classes to reduce the controller's
+        size
+      B: Move cost calculation to the Order class (which contains the items), create
+        a TaxCalculator service for tax logic, and let OrderItem handle its own price
+        calculations
+      C: Create a single OrderProcessor class that handles all order-related operations
+        including calculations, validation, and persistence to centralize order logic
+      D: Move all calculation logic to the database using stored procedures and have
+        the OrderController only handle data retrieval and UI coordination
+    correct: B
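
For readers checking why option B is keyed as correct in the application question, the responsibility split it describes might look roughly like the sketch below. The class names `Order`, `OrderItem`, and `TaxCalculator` come from the option text; every field, method name, the flat tax rate, and the choice to tax the item subtotal but not shipping are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List


@dataclass
class OrderItem:
    unit_price: Decimal
    quantity: int

    def subtotal(self) -> Decimal:
        # Information Expert: the item holds price and quantity, so it prices itself.
        return self.unit_price * self.quantity


class TaxCalculator:
    """Separate service for tax logic (a flat rate is an assumption)."""

    def __init__(self, rate: Decimal):
        self.rate = rate

    def tax_for(self, amount: Decimal) -> Decimal:
        return amount * self.rate


@dataclass
class Order:
    items: List[OrderItem] = field(default_factory=list)
    shipping_fee: Decimal = Decimal("0")

    def items_total(self) -> Decimal:
        # The order contains the items, so it is the expert for their sum.
        return sum((item.subtotal() for item in self.items), Decimal("0"))

    def total(self, tax_calculator: TaxCalculator) -> Decimal:
        subtotal = self.items_total()
        # Assumption: tax applies to the item subtotal only, not to shipping.
        return subtotal + tax_calculator.tax_for(subtotal) + self.shipping_fee


order = Order(
    items=[OrderItem(Decimal("10.00"), 2), OrderItem(Decimal("5.00"), 1)],
    shipping_fee=Decimal("4.00"),
)
order_total = order.total(TaxCalculator(Decimal("0.10")))
print(order_total)
```

In this arrangement a controller would only call `order.total(...)`, which is the contrast with options A and C, where calculation stays in a controller-like class.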
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+anchor: kiss-principle
+tier: 1
+
+questions:
+  recognition:
+    question: 'Which of the following best describes the "KISS Principle"?
+
+      '
+    options:
+      A: A design principle favoring simplicity over complexity, avoiding over-engineering
+        and unnecessary features in software design
+      B: A testing methodology that focuses on verifying the simplest possible test
+        cases before progressing to complex integration scenarios
+      C: A refactoring pattern that reduces code complexity by extracting common logic
+        into reusable utility functions and shared libraries
+      D: A project management approach that prioritizes delivering minimal viable
+        features with the shortest possible development cycles
+    correct: A

evaluations/specs/para-method.yaml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+anchor: para-method
+tier: 3
+
+questions:
+  recognition:
+    question: 'Which of the following best describes the "P.A.R.A. Method"?
+
+      '
+    options:
+      A: A software architecture pattern that separates concerns into Presentation,
+        Application, Repository, and Adapter layers
+      B: A knowledge organization framework that categorizes information into Projects,
+        Areas, Resources, and Archive based on actionability
+      C: A prioritization technique that evaluates items by their Probability,
+        Applicability, Risk, and Achievability scores
+      D: A documentation standard that structures technical content into Purpose,
+        Audience, Requirements, and Appendix sections
+    correct: B

evaluations/specs/pert.yaml

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+anchor: pert
+tier: 2
+questions:
+  recognition:
+    question: 'Which of the following best describes "PERT"?
+
+      '
+    options:
+      A: An iterative requirements prioritization framework that categorizes features
+        into must-have, should-have, could-have, and won't-have buckets
+      B: A stochastic project scheduling method using three-point estimation (optimistic,
+        most likely, pessimistic) to calculate expected duration and variance for
+        activities in a network diagram
+      C: A process improvement methodology that uses statistical control charts to
+        monitor and reduce variation in manufacturing and software delivery processes
+      D: A stakeholder mapping technique that visualizes the impact and influence
+        of different actors to prioritize engagement strategies
+    correct: B
+  application:
+    scenario: Your team is planning a critical software release with several interdependent
+      features. The database migration task has estimates of 2 days (optimistic),
+      5 days (most likely), and 14 days (pessimistic). The API development task estimates
+      are 3 days (optimistic), 7 days (most likely), and 11 days (pessimistic).
+    anchor_prompt: using PERT
+    paraphrase_prompt: What should be the expected duration estimates for these two
+      tasks when accounting for uncertainty in a probabilistic scheduling approach?
+    options:
+      A: 'Database migration: 7 days, API development: 7 days (using simple averages
+        of the three estimates)'
+      B: 'Database migration: 6 days, API development: 7 days (using weighted average
+        formula with most likely estimate weighted 4x)'
+      C: 'Database migration: 5 days, API development: 7 days (using the most likely
+        estimates as the expected values)'
+      D: 'Database migration: 8 days, API development: 8.5 days (using average of
+        optimistic and pessimistic estimates plus buffer)'
+    correct: B
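
The keyed answer in the application question can be checked with the standard PERT three-point formulas: expected duration (O + 4M + P) / 6 and standard deviation (P - O) / 6. The sketch below is a verification aid only; the function names are hypothetical and nothing here is part of pilot.py or the spec loader.

```python
def pert_expected(optimistic, most_likely, pessimistic):
    """PERT expected duration: the most likely estimate is weighted 4x."""
    return (optimistic + 4 * most_likely + pessimistic) / 6


def pert_std_dev(optimistic, pessimistic):
    """PERT standard deviation; the variance is its square."""
    return (pessimistic - optimistic) / 6


# Estimates in days, copied from the scenario text.
db_migration = (2, 5, 14)
api_development = (3, 7, 11)

db_expected = pert_expected(*db_migration)      # (2 + 20 + 14) / 6
api_expected = pert_expected(*api_development)  # (3 + 28 + 11) / 6
print(db_expected, api_expected)  # 6.0 7.0, matching option B

# The distractor in option A uses a plain mean of the three estimates.
db_simple_mean = sum(db_migration) / 3
api_simple_mean = sum(api_development) / 3
print(db_simple_mean, api_simple_mean)  # 7.0 7.0

db_variance = pert_std_dev(2, 14) ** 2
print(db_variance)  # 4.0
```

The API task gives 7 days under the weighted formula, the simple mean, and the most-likely value alike, so only the database migration figure separates options A, B, and C.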
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+anchor: spec-driven-development
+tier: 3
+
+questions:
+  recognition:
+    question: 'Which of the following best describes "Spec-Driven Development"?
+
+      '
+    options:
+      A: An approach where software is defined, constrained, and validated through
+        explicit specifications, with specs as the source of truth rather than code
+      B: A testing methodology that generates test cases automatically from formal
+        mathematical specifications of expected behavior
+      C: A documentation practice that requires detailed technical specifications
+        to be written and approved before any implementation begins
+      D: A deployment strategy that uses infrastructure specifications to provision
+        and configure environments automatically
+    correct: A
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+anchor: vertical-slice-architecture
+tier: 3
+questions:
+  recognition:
+    question: 'Which of the following best describes "Vertical Slice Architecture
+      (VSA)"?
+
+      '
+    options:
+      A: Code is organized into distinct horizontal layers (presentation, business,
+        data) with clear separation of concerns and dependencies flowing downward
+        through each technical layer
+      B: Each feature is organized as an end-to-end slice spanning all layers (request,
+        validation, domain logic, persistence, API); code for a feature lives together
+        regardless of technical layer
+      C: The application core is isolated from external dependencies through ports
+        and adapters, with business logic at the center surrounded by interface adapters
+        that connect to external systems
+      D: Commands and queries are separated into distinct models and handlers, with
+        write operations handled differently from read operations to optimize for
+        different access patterns and scalability requirements
+    correct: B
+  application:
+    scenario: Your team is building an e-commerce platform with features like user
+      registration, product catalog, shopping cart, order processing, and payment
+      handling. Currently, the codebase is organized into traditional layers (Controllers,
+      Services, Repositories) but developers are experiencing frequent merge conflicts
+      and find it difficult to work on features independently. Changes to one feature
+      often require modifications across multiple layers, slowing down development.
+    anchor_prompt: using Vertical Slice Architecture (VSA)
+    paraphrase_prompt: How should you reorganize the codebase to enable independent
+      feature development and reduce cross-feature coupling?
+    options:
+      A: Create separate microservices for each feature (UserService, ProductService,
+        CartService, etc.) with their own databases and API gateways to ensure complete
+        isolation between features.
+      B: Organize code by feature slices where each feature (Registration, Catalog,
+        Cart, Orders, Payments) contains its own request handlers, validation, business
+        logic, and data access in a single vertical slice.
+      C: Implement a strict layered architecture with clear interfaces between Controller,
+        Service, and Repository layers, using dependency injection to manage cross-layer
+        dependencies more effectively.
+      D: Apply Domain-Driven Design by creating bounded contexts for each business
+        domain and organizing code into aggregate roots with separate application
+        services for each domain.
+    correct: B
