This document defines the failsafe testing strategy for the StackAlchemist platform. It covers unit, integration, end-to-end, contract, chaos, and visual regression testing across all three codebases (Web, Engine, Worker) and their external integrations.
1. Testing Philosophy
StackAlchemist's core value proposition — compile-guaranteed code generation — demands an exceptionally robust test suite. An untested generation pipeline is a broken product. Every layer of the system must be independently verifiable and collectively validated.
Principles:
Test the contract, not the implementation. Mock at service boundaries, not inside services.
Golden files over mocks for LLM output. Real (canned) LLM responses catch real parsing bugs.
Compile verification is a first-class test. If the generated code doesn't build, the product is broken.
No flaky tests in CI. Any test that fails intermittently is deleted or fixed immediately.
Fail loudly. Silent failures (swallowed exceptions, missing files, empty outputs) must be caught by assertions.
4.3 Unit Test Targets — ReconstructionService (CRITICAL)
The ReconstructionService is the most critical component in the system. It parses raw LLM text output into discrete files. Every edge case must be covered:
| Test Case | Input | Expected Behavior |
| --- | --- | --- |
| Happy path | Well-formed `[[FILE:path]]...[[END_FILE]]` blocks | Returns dictionary of path → content |
| Multiple files | 5+ file blocks | All files extracted correctly |
| Missing END_FILE | Block without closing delimiter | Throws `MalformedLlmOutputException` with context |
| Missing FILE header | Content before any `[[FILE:]]` | Ignores preamble content |
| Empty file block | `[[FILE:path]][[END_FILE]]` | Returns empty string for that path (valid) |
| Duplicate paths | Two blocks with same path | Last one wins (logged as warning) |
| Unexpected paths | Path outside expected template structure | Logged as warning, still included |
| Truncated response | Response ends mid-block (token limit) | Throws `TruncatedLlmResponseException` |
| Markdown wrapping | `` ```csharp `` fences around content | Strips markdown fences before extraction |
| BOM characters | UTF-8 BOM at start of content | Strips BOM, content is clean |
| Mixed line endings | `\r\n` and `\n` mixed | Normalizes to `\n` |
| Whitespace in path | `[[FILE: src/Controller.cs ]]` | Trims whitespace from path |
| Nested delimiters | `[[FILE:]]` appearing in code comments | Only matches at line start |
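To make these edge cases concrete, here is a minimal TypeScript sketch of a delimiter parser that satisfies most rows of the table. It is an illustration only: the real ReconstructionService is not shown in this document, and the names `reconstruct`, `stripFences`, and the two error classes are hypothetical stand-ins for the exceptions named above. The rule used to tell a malformed block from a truncated one (a new header before the close versus input ending mid-block) is an assumption, and warning logging for duplicate or unexpected paths is omitted.

```typescript
// Hypothetical error types standing in for the exceptions named in the table.
class MalformedLlmOutputError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "MalformedLlmOutputError";
  }
}
class TruncatedLlmResponseError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TruncatedLlmResponseError";
  }
}

// Headers only match at line start, so delimiters inside code comments are ignored.
const FILE_HEADER = /^\[\[FILE:([^\]]+)\]\](.*)$/;
const END_MARKER = "[[END_FILE]]";
const FENCE_LINE = /^```[\w#+-]*\s*$/;

// Drop a wrapping markdown fence (opening and closing line) if one is present.
function stripFences(lines: string[]): string[] {
  const first = lines[0] ?? "";
  const last = lines[lines.length - 1] ?? "";
  const wrapped = lines.length >= 2 && FENCE_LINE.test(first) && last.trim() === "```";
  return wrapped ? lines.slice(1, -1) : lines;
}

function reconstruct(raw: string): Map<string, string> {
  // Strip a leading UTF-8 BOM and normalize all line endings to \n.
  const text = raw.replace(/^\uFEFF/, "").replace(/\r\n?/g, "\n");
  const files = new Map<string, string>(); // Map.set means the last duplicate wins
  let path: string | null = null;
  let body: string[] = [];

  for (const line of text.split("\n")) {
    const header = FILE_HEADER.exec(line);
    if (header) {
      // Assumption: a new header inside an open block means END_FILE is missing.
      if (path !== null) {
        throw new MalformedLlmOutputError(`Block "${path}" has no ${END_MARKER}`);
      }
      const name = (header[1] ?? "").trim(); // trims whitespace in the path
      const rest = (header[2] ?? "").trim();
      if (rest === END_MARKER) {
        files.set(name, ""); // empty block on a single line is valid
      } else {
        path = name;
        body = rest ? [rest] : [];
      }
    } else if (path !== null) {
      if (line.trim() === END_MARKER) {
        files.set(path, stripFences(body).join("\n"));
        path = null;
      } else {
        body.push(line);
      }
    }
    // Lines outside any block (preamble, postscript) are ignored.
  }

  // Assumption: input ending inside an open block is treated as truncation.
  if (path !== null) {
    throw new TruncatedLlmResponseError(`Response ended inside block "${path}"`);
  }
  return files;
}
```

Each row of the table then maps to one small assertion against `reconstruct`, which keeps the edge cases independent of any LLM call.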
4.4 Unit Test Targets — TemplateProvider
| Test Case | Expected Behavior |
| --- | --- |
| Render all variables | `{{ProjectName}}`, `{{DbConnectionString}}`, etc. replaced correctly |
Stripe webhook → transaction created → generation triggered with correct tier
| Test Case | Components | Expected Behavior |
| --- | --- | --- |
| WebSocket streaming | Web + Engine + Supabase Realtime | Generation status updates stream to frontend in real-time |
| BYOK routing | Engine + Mock LLM | Custom API key used when set; platform key used when not |
| Rate limiting | Web + Engine | Excessive requests return 429 before reaching generation logic |
| Schema extraction | Web + Engine + Mock LLM | Natural language prompt → extracted JSON schema → valid React Flow data |
5.3 Mock LLM Server
For integration tests, a lightweight HTTP server returns canned LLM responses based on the prompt content. This avoids API costs and rate limits while testing the full pipeline.
Implementation: A simple Express.js or .NET minimal API that:
Receives the same request shape as the Anthropic API
Pattern-matches on prompt keywords (e.g., "Product entity" → returns the product golden file)
Can be configured to return malformed responses for chaos testing
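The selection logic described above could be sketched as a pure function that an Express.js route handler would call. Everything here is hypothetical: the `rules` table, the `ChaosMode` values, and the inline fixture text (which would normally be loaded from the golden files). The response envelope is a simplified assumption about the Anthropic response shape, not a full reproduction of it.

```typescript
type ChaosMode = "none" | "truncate" | "wrong-delimiters";

interface MockRule {
  keyword: string; // substring to look for in the prompt
  fixture: string; // canned LLM output returned on a match
}

// Hypothetical rule table; real fixtures would be read from the golden file directory.
const rules: MockRule[] = [
  {
    keyword: "Product entity",
    fixture: "[[FILE:src/Product.cs]]\npublic class Product {}\n[[END_FILE]]",
  },
];

// Pick the canned response for a prompt, optionally corrupting it for chaos tests.
function selectResponse(prompt: string, chaos: ChaosMode = "none"): string {
  const rule = rules.find((r) => prompt.includes(r.keyword));
  if (!rule) {
    // Fail loudly instead of returning an empty completion.
    throw new Error("No canned response matches the prompt");
  }
  switch (chaos) {
    case "truncate":
      // Cut the response in half to simulate hitting a token limit.
      return rule.fixture.slice(0, Math.floor(rule.fixture.length / 2));
    case "wrong-delimiters":
      // Swap in a delimiter style the parser should reject.
      return rule.fixture
        .replace(/\[\[FILE:(.*?)\]\]/g, "---FILE:$1---")
        .replace(/\[\[END_FILE\]\]/g, "---END_FILE---");
    default:
      return rule.fixture;
  }
}

// Wrap the text in a simplified, assumed subset of the response envelope.
function buildMessage(text: string) {
  return { type: "message", role: "assistant", content: [{ type: "text", text }] };
}
```

Keeping the matching logic separate from the HTTP layer means the mock itself can be unit-tested without starting a server.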
6. LLM-Specific Testing
6.1 Golden File Tests (Snapshot Testing for LLM Output)
Maintain a library of real (or realistic) LLM responses saved as text fixtures:
src/StackAlchemist.Engine.Tests/Fixtures/LlmResponses/
├── single-entity-valid.txt # 1 entity (Product), clean output
├── multi-entity-valid.txt # 5 entities, no relationships
├── entity-with-relationships.txt # 3 entities with FK relationships
├── complex-schema.txt # 10+ entities, many-to-many
├── malformed-delimiters.txt # Missing [[END_FILE]] tags
├── truncated-response.txt # Cut off mid-file (simulates token limit)
├── extra-markdown-wrapping.txt # ```csharp fences around code
├── duplicate-file-blocks.txt # Same file path appears twice
└── empty-file-block.txt # [[FILE:path]][[END_FILE]] with no content
Usage: Unit tests parse each golden file through the ReconstructionService and assert expected behavior. When prompts are updated, the golden file suite is re-run to detect regressions.
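One way to organize this usage is a table-driven suite that pairs each fixture with its expected outcome. The sketch below is an assumption about structure, not the project's actual test code: `runGoldenSuite`, `GoldenCase`, and the expected values in `exampleCases` are hypothetical, and the loader and parser are passed in so the runner stays independent of the file system and of the real ReconstructionService.

```typescript
// Either the parser returns a set of file paths, or it throws a named error.
type Expectation =
  | { kind: "files"; paths: string[] }
  | { kind: "throws"; errorName: string };

interface GoldenCase {
  fixture: string; // file name under Fixtures/LlmResponses/
  expected: Expectation;
}

// Hypothetical expectations; the real paths depend on each fixture's content.
const exampleCases: GoldenCase[] = [
  { fixture: "single-entity-valid.txt", expected: { kind: "files", paths: ["src/Product.cs"] } },
  { fixture: "truncated-response.txt", expected: { kind: "throws", errorName: "TruncatedLlmResponseError" } },
];

// Run every case and return human-readable failures (empty array means all passed).
function runGoldenSuite(
  cases: GoldenCase[],
  load: (fixture: string) => string,
  parse: (raw: string) => Map<string, string>,
): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    let actual: string;
    try {
      const files = parse(load(c.fixture));
      actual = `files:${Array.from(files.keys()).sort().join(",")}`;
    } catch (err) {
      actual = `throws:${err instanceof Error ? err.name : String(err)}`;
    }
    const want =
      c.expected.kind === "files"
        ? `files:${[...c.expected.paths].sort().join(",")}`
        : `throws:${c.expected.errorName}`;
    if (actual !== want) {
      failures.push(`${c.fixture}: expected ${want}, got ${actual}`);
    }
  }
  return failures;
}
```

Returning a list of failures rather than stopping at the first one gives a full picture of which fixtures regressed after a prompt change.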
6.2 Prompt Regression Testing
All prompt templates are version-controlled in src/StackAlchemist.Engine/Prompts/
CI tracks token count per prompt version (logged in test output)
A prompt change triggers mandatory re-run of the golden file suite
Prompt token count exceeding threshold triggers a CI warning
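The token tracking and threshold warning could look roughly like the sketch below. The function name `checkPromptBudgets` and the report shape are hypothetical, and the default estimator (about four characters per token) is only a rough heuristic assumed for illustration; a CI job would more plausibly inject the tokenizer that matches the target model.

```typescript
interface PromptReport {
  name: string;
  tokens: number;
  overBudget: boolean;
}

// Rough heuristic (an assumption): about 4 characters per token.
// Replace with a real tokenizer for accurate counts.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Produce one report row per prompt template; overBudget rows become CI warnings.
function checkPromptBudgets(
  prompts: Record<string, string>,
  threshold: number,
  count: (text: string) => number = estimateTokens,
): PromptReport[] {
  return Object.entries(prompts).map(([name, text]) => {
    const tokens = count(text);
    return { name, tokens, overBudget: tokens > threshold };
  });
}

// Example: log every prompt's count and warn (without failing) when over budget.
function logReports(reports: PromptReport[]): void {
  for (const r of reports) {
    const line = `${r.name}: ~${r.tokens} tokens`;
    if (r.overBudget) console.warn(`WARNING ${line} exceeds threshold`);
    else console.log(line);
  }
}
```

Logging the count for every prompt, not only the offenders, leaves a history in CI output that makes gradual prompt growth visible.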
6.3 Chaos Testing for LLM Output
Intentionally malformed inputs test system resilience:
| Chaos Scenario | Expected System Behavior |
| --- | --- |
| Truncated at 50% of expected output | `TruncatedLlmResponseException` → retry with "please complete all files" appended |
| Wrong delimiter style (`---FILE:path---`) | `MalformedLlmOutputException` → retry with delimiter format reminder |
| Extra conversational text mixed in | Preamble/postscript ignored; only delimited blocks extracted |
| Valid C# but wrong namespace/class name | `dotnet build` catches it → retry with build error |
| HTML/XSS in generated code comments | Sanitization strips dangerous content before packaging |
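The first two rows describe a retry loop in which the kind of failure decides what is appended to the prompt. A minimal sketch of that idea follows. The names `retryPrompt` and `generateWithRetry`, the attempt limit of three, and the exact wording of the delimiter reminder are assumptions; the error classes are hypothetical stand-ins for the exceptions in the table, and the build-error retry path is left out.

```typescript
class TruncatedLlmResponseError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TruncatedLlmResponseError";
  }
}
class MalformedLlmOutputError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "MalformedLlmOutputError";
  }
}

// Decide the next prompt from the failure type; null means "do not retry".
function retryPrompt(basePrompt: string, err: unknown): string | null {
  if (err instanceof TruncatedLlmResponseError) {
    return `${basePrompt}\n\nPlease complete all files.`;
  }
  if (err instanceof MalformedLlmOutputError) {
    // Hypothetical reminder wording.
    return `${basePrompt}\n\nWrap every file in [[FILE:path]] ... [[END_FILE]].`;
  }
  return null;
}

// Call the LLM, parse the output, and retry recoverable failures with a hint.
async function generateWithRetry(
  prompt: string,
  generate: (prompt: string) => Promise<string>,
  parse: (raw: string) => Map<string, string>,
  maxAttempts = 3,
): Promise<Map<string, string>> {
  let current = prompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await generate(current);
    try {
      return parse(raw);
    } catch (err) {
      const next = retryPrompt(prompt, err);
      // Unknown errors and exhausted attempts propagate to the caller.
      if (next === null || attempt === maxAttempts) throw err;
      current = next;
    }
  }
  throw new Error("maxAttempts must be at least 1");
}
```

Because `generate` and `parse` are injected, a chaos test can feed a truncated response on the first call and a valid one on the second, then assert on the prompts that were sent.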
7. Database Testing
7.1 Migration Testing
Using supabase db test (pgTAP under the hood):
Verify all tables exist with correct columns and types
Verify RLS policies enforce row isolation
Verify foreign key constraints work correctly
Verify indexes exist on frequently queried columns
7.2 RLS Policy Tests
| Test | Assertion |
| --- | --- |
| User A queries generations | Only sees their own rows |
| User A queries transactions | Only sees their own rows |
| Unauthenticated query | Returns zero rows |
| Service role query | Sees all rows (for admin/worker) |
7.3 Integration with Testcontainers
For .NET integration tests that need database access without Supabase:
```yaml
# Triggered on every PR to main
steps:
  # Frontend gates
  - npm run lint                      # ESLint zero errors
  - npm run type-check                # TypeScript strict zero errors
  - npx vitest run                    # Unit + integration tests

  # Backend gates
  - dotnet build                      # Engine + Worker compile
  - dotnet test                       # xUnit unit + integration tests

  # Docker gates
  - docker build --target web .       # Web image builds
  - docker build --target engine .    # Engine image builds
  - docker build --target worker .    # Worker image builds

  # Database gates
  - supabase db test                  # Migration + RLS tests

  # E2E gates (on merge queue or nightly)
  - docker compose -f docker/docker-compose.test.yml up -d
  - npx playwright test               # Full E2E suite
```
8.2 Gate Requirements
| Gate | Blocking? | Rationale |
| --- | --- | --- |
| Lint | ✅ Yes | Code quality baseline |
| Type check | ✅ Yes | Catches type errors at compile time |
| Unit tests | ✅ Yes | Core logic correctness |
| dotnet build | ✅ Yes | Backend must compile |
| dotnet test | ✅ Yes | Backend logic correctness |
| Docker builds | ✅ Yes | Deployability guarantee |
| DB migration tests | ✅ Yes | Schema integrity |
| E2E (Playwright) | ⚠️ Merge queue | Slower, runs before merge |
| Visual regression | ⚠️ Manual review | Screenshot diffs require human approval |
9. Test Data & Fixture Strategy
| Data Type | Local Dev | CI | Staging |
| --- | --- | --- | --- |
| Database | Supabase CLI (`supabase start`) | Testcontainers (PostgreSQL) | Supabase develop branch |
| LLM responses | Golden file fixtures in repo | Same fixtures | Mock LLM server (Docker) |
| Stripe | Stripe test mode + `stripe listen` | Stripe test mode | Stripe test mode |
| R2/S3 storage | MinIO in Docker | MinIO in Docker | Cloudflare R2 (test bucket) |
| Auth tokens | Supabase local auth | Mocked JWT tokens | Supabase develop branch |
| Email (Resend) | Resend test mode (sink) | Mocked | Resend test mode |
10. Test Naming Conventions
Frontend (Vitest)
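```typescript
import { describe, it } from 'vitest';

// Test names read as behavior statements: "should <observable outcome> when <condition>".
describe('SimpleMode', () => {
  it('should render the terminal textarea with placeholder text', () => {});
  it('should show loading spinner on prompt submission', () => {});
  it('should transition to entity canvas when schema is received', () => {});
  it('should display error toast when API call fails', () => {});
});
```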
Note: Coverage targets are guidelines, not hard CI gates. 79% coverage from meaningful tests is better than 95% coverage from trivial assertions.
12. When to Write Tests (Development Workflow)
Before writing a feature — Write the test for the expected behavior first (TDD for critical paths like ReconstructionService, state machine, tier gating).
During feature development — Write tests alongside code for non-critical paths.
After a bug is found — Write a regression test that reproduces the bug before fixing it.
Before merging — All CI gates must pass. No exceptions.