This document defines the verification contract for the research platform. Its purpose is to ensure correctness, comparability, and reproducibility, not merely basic functional behavior.
Related documents:
The test suite must demonstrate that:
- the platform behaves correctly under normal flows
- plug-in components can be swapped without breaking orchestration
- deterministic replay is trustworthy
- schema evolution remains manageable
- failures are surfaced in a controlled and recoverable way
The system must include:
- unit tests
- integration tests
- end-to-end tests
- deterministic replay tests
- regression tests for schemas and exports
The suite should distinguish between:
- pure logic tests with no model dependency
- service tests with mocked generation
- limited end-to-end runs with lightweight fixtures
- explicit real-model smoke tests run separately from the default test suite
Real image generation should not be required for most tests.
Verify:
- prompt encoding returns expected shape
- basis construction returns correct dimensions
- E(z) = E0 + U z applies valid tensor shape rules
- trust-region clipping behaves correctly
- anchor penalties reduce drift where expected
- invalid steering dimensions fail clearly
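The shape and clipping checks above can be sketched as follows. This is a minimal illustration, not the platform's actual code: the function name `steer` and the trust-radius convention (clip the step's Euclidean norm) are assumptions.

```python
import numpy as np

def steer(E0, U, z, trust_radius):
    """Apply a steered embedding E(z) = E0 + U z, clipping the step to a trust region."""
    step = U @ z                        # (d, k) basis times (k,) steering vector -> (d,)
    norm = np.linalg.norm(step)
    if norm > trust_radius:             # trust-region clipping: rescale an oversized step
        step = step * (trust_radius / norm)
    return E0 + step

# Shape check: a (d, k) basis with a k-dim steering vector yields a d-dim embedding.
d, k = 8, 3
rng = np.random.default_rng(0)
E0 = rng.normal(size=d)
U = rng.normal(size=(d, k))
z = np.ones(k)

E = steer(E0, U, z, trust_radius=0.5)
assert E.shape == (d,)
assert np.linalg.norm(E - E0) <= 0.5 + 1e-9   # clipped step respects the radius
```

A mismatched `z` (wrong length `k`) raises a clear shape error from the matrix product, which is the "invalid steering dimensions fail clearly" behavior the unit tests should pin down.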
Verify:
- candidate count is correct
- candidates respect trust radius
- orthogonal exploration reduces alignment with the exploit direction
- deterministic sampling works under fixed RNG state
- diversity filtering removes near duplicates
- role tags are assigned consistently
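The sampling properties above (determinism under a fixed RNG state, orthogonality to the exploit direction, trust-radius compliance) can be asserted against a toy sampler. The function `propose_candidates` is hypothetical; only the tested properties come from this spec.

```python
import numpy as np

def propose_candidates(z_center, exploit_dir, n, radius, seed):
    """Sample exploration candidates orthogonal to a unit exploit direction."""
    rng = np.random.default_rng(seed)   # fixed RNG state -> deterministic output
    candidates = []
    for _ in range(n):
        noise = rng.normal(size=z_center.shape)
        # Remove the component along the (unit) exploit direction: orthogonal exploration.
        noise -= (noise @ exploit_dir) * exploit_dir
        noise *= radius / max(np.linalg.norm(noise), 1e-12)
        candidates.append(z_center + noise)
    return candidates

z = np.zeros(4)
d = np.array([1.0, 0.0, 0.0, 0.0])      # unit exploit direction

a = propose_candidates(z, d, n=5, radius=0.3, seed=42)
b = propose_candidates(z, d, n=5, radius=0.3, seed=42)
assert all(np.allclose(x, y) for x, y in zip(a, b))          # deterministic under fixed seed
assert all(abs((c - z) @ d) < 1e-9 for c in a)               # no alignment with exploit dir
assert all(np.linalg.norm(c - z) <= 0.3 + 1e-9 for c in a)   # trust radius respected
```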
Verify:
- ratings normalize correctly
- rankings derive pairwise preferences correctly
- invalid ranking payloads are rejected
- duplicate selections are rejected where required
- optional critique text is preserved
- skip or uncertain actions normalize correctly
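Deriving pairwise preferences from a ranking, and rejecting duplicate selections, can be unit-tested with a few lines. The helper name `ranking_to_pairs` is illustrative.

```python
def ranking_to_pairs(ranking):
    """Derive pairwise preferences (winner, loser) from an ordered ranking of candidate ids."""
    if len(set(ranking)) != len(ranking):
        raise ValueError("duplicate candidate in ranking")   # duplicate selections rejected
    return [(ranking[i], ranking[j])
            for i in range(len(ranking))
            for j in range(i + 1, len(ranking))]

# A ranking of n candidates yields n*(n-1)/2 ordered preference pairs.
pairs = ranking_to_pairs(["b", "a", "c"])
assert pairs == [("b", "a"), ("b", "c"), ("a", "c")]

try:
    ranking_to_pairs(["a", "a"])
except ValueError:
    pass  # invalid payload fails clearly rather than silently
```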
Verify:
- winner-copy selects the winning candidate exactly
- averaging updater interpolates correctly
- linear updater moves in the expected direction
- pairwise updater handles symmetric cases correctly
- Bayesian updater changes uncertainty as expected
- trust-region clipping constrains updates
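The averaging-updater and clipping properties above can be checked against a toy updater. `averaging_update` and its `alpha` parameter are assumptions for illustration; note that winner-copy falls out as the `alpha = 1` special case.

```python
import numpy as np

def averaging_update(z, z_winner, alpha, trust_radius):
    """Interpolate toward the winning candidate, then clip the step to the trust region."""
    step = alpha * (z_winner - z)
    norm = np.linalg.norm(step)
    if norm > trust_radius:
        step *= trust_radius / norm
    return z + step

z = np.zeros(2)
winner = np.array([1.0, 0.0])

z_new = averaging_update(z, winner, alpha=0.5, trust_radius=1.0)
assert np.allclose(z_new, [0.5, 0.0])                 # plain interpolation inside the region

z_new = averaging_update(z, winner, alpha=0.5, trust_radius=0.1)
assert np.isclose(np.linalg.norm(z_new - z), 0.1)     # clipped when the step is too large

# winner-copy is the alpha = 1 case: the update lands exactly on the winner
assert np.allclose(averaging_update(z, winner, alpha=1.0, trust_radius=10.0), winner)
```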
Verify:
- fixed-per-round seeding reuses the same seed for every candidate in a round
- validation candidates receive alternate seeds when configured
- seed manifests are stored for all candidates
- missing seed metadata is treated as a failure
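A manifest validator covering the last two checks might look like the sketch below; the function name and manifest shape (a candidate-id-to-seed mapping) are assumptions.

```python
def validate_seed_manifest(manifest, candidate_ids, fixed_per_round=True):
    """Check that every candidate has seed metadata; missing entries are a hard failure."""
    missing = [cid for cid in candidate_ids if cid not in manifest]
    if missing:
        raise ValueError(f"missing seed metadata for candidates: {missing}")
    if fixed_per_round and len({manifest[c] for c in candidate_ids}) != 1:
        raise ValueError("fixed-per-round policy requires one shared seed")

validate_seed_manifest({"c1": 7, "c2": 7}, ["c1", "c2"])   # ok: shared seed
try:
    validate_seed_manifest({"c1": 7}, ["c1", "c2"])
except ValueError:
    pass  # missing seed metadata is treated as a failure, not a warning
```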
Verify:
- sessions persist immutable config snapshots
- rounds persist in correct order
- candidate and feedback foreign-key relationships remain valid
- replay exports serialize required fields
Flow:
- create experiment
- create session
- request first round
- submit feedback
- request next round
- verify progression and persistence
Use a lightweight mock or tiny test pipeline when full generation is too expensive.
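The flow above can be exercised against an in-memory stand-in for the platform, so no real generation backend is needed. `MiniPlatform` and its method names are hypothetical; the point is the session → round → feedback → next-round progression, including the guard against premature round requests.

```python
class MiniPlatform:
    """Toy in-memory platform used to test session flow without real generation."""
    def __init__(self):
        self.sessions = {}

    def create_session(self, config):
        self.sessions["s1"] = {"config": dict(config), "rounds": [], "pending": None}
        return "s1"

    def request_round(self, sid):
        s = self.sessions[sid]
        if s["pending"] is not None:
            raise RuntimeError("feedback still pending for previous round")
        s["pending"] = {"candidates": ["c1", "c2"]}   # stub generation
        return s["pending"]

    def submit_feedback(self, sid, winner):
        s = self.sessions[sid]
        s["rounds"].append({**s["pending"], "winner": winner})
        s["pending"] = None

app = MiniPlatform()
sid = app.create_session({"sampler": "orthogonal"})
app.request_round(sid)
app.submit_feedback(sid, "c1")
app.request_round(sid)
app.submit_feedback(sid, "c2")
assert [r["winner"] for r in app.sessions[sid]["rounds"]] == ["c1", "c2"]   # progression persisted
```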
Verify:
- embeddings flow from encoder through steering to generator
- generation failures are captured and surfaced
- successful candidates still persist when one candidate fails
Verify:
- exported replay matches stored rounds and feedback
- images and metadata align correctly
- round order is stable
Verify:
- samplers can be swapped by config
- updaters can be swapped by config
- controller logic does not depend on one concrete strategy implementation
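One common way to make this testable is a registry keyed by config strings, so the controller never names a concrete class. This is a sketch of that pattern, not the platform's actual wiring; all names here are illustrative.

```python
# Registry-based strategy lookup: the controller resolves samplers and updaters
# from config keys and never references a concrete implementation directly.
SAMPLERS = {}
UPDATERS = {}

def register(registry, name):
    def deco(cls):
        registry[name] = cls
        return cls
    return deco

@register(SAMPLERS, "orthogonal")
class OrthogonalSampler:
    def propose(self, z):
        return [z]

@register(UPDATERS, "averaging")
class AveragingUpdater:
    def update(self, z, winner):
        return 0.5 * z + 0.5 * winner

def build_pipeline(config):
    # Swapping strategies requires only a config change, not a code change.
    return SAMPLERS[config["sampler"]](), UPDATERS[config["updater"]]()

sampler, updater = build_pipeline({"sampler": "orthogonal", "updater": "averaging"})
assert isinstance(sampler, OrthogonalSampler)
```

The swap test then builds two pipelines from two configs and asserts that orchestration behaves identically apart from the strategy-specific outputs.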
Verify:
- endpoints accept expected payloads
- structured errors are returned on invalid input
- response schemas remain stable
Using browser automation or HTTP-level testing, verify:
- a user can create an experiment from the UI
- a user can start a session
- a user can provide feedback through at least two feedback modes
- a user can proceed to the next round
- a user can open replay for a completed session
- replay export API returns the expected round and feedback history
- recoverable errors are shown clearly
These tests are critical.
Given:
- fixed prompt
- fixed experiment configuration
- fixed RNG seeds
- mocked or deterministic generation backend
The replay must reproduce:
- the same candidate proposals
- the same candidate order
- the same update steps
- the same persisted metrics
- the same round summaries
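The core replay assertion reduces to: two runs with identical config and seeds must produce bit-identical logs. The toy session below illustrates the shape of that test; `run_session` and its stand-in feedback rule (pick the lowest-norm candidate) are assumptions, since real feedback comes from the user or a fixture.

```python
import numpy as np

def run_session(prompt_seed, n_rounds):
    """Toy deterministic session: all randomness flows from explicit seeds."""
    rng = np.random.default_rng(prompt_seed)
    log = []
    z = np.zeros(3)
    for r in range(n_rounds):
        candidates = [z + rng.normal(size=3) for _ in range(4)]
        # Stand-in for user feedback: deterministically prefer the lowest-norm candidate.
        winner = min(candidates, key=lambda c: float(np.linalg.norm(c)))
        z = 0.5 * (z + winner)
        log.append((r, [c.tolist() for c in candidates], z.tolist()))
    return log

# Replaying with identical config and seeds must reproduce the log exactly.
assert run_session(prompt_seed=123, n_rounds=3) == run_session(prompt_seed=123, n_rounds=3)
```

A real replay test compares the re-executed session against the *persisted* log, so that any hidden nondeterminism (unseeded RNG use, dict-ordering assumptions, wall-clock dependence) surfaces as a diff.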
Regression coverage should include:
- historical export loading
- schema migration behavior
- known edge-case prompts
- known edge-case feedback payloads
- previously fixed replay bugs
The test suite should verify controlled behavior for:
- one-candidate render failure
- duplicate feedback submission
- premature next-round generation while feedback is still pending
- invalid ranking payloads
- export generation failure
- database write interruption
- resume after crash
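As one concrete example, duplicate feedback submission should surface as a controlled, typed error rather than silently overwriting state. The store below is a minimal sketch; `FeedbackStore` and `DuplicateFeedbackError` are illustrative names.

```python
class DuplicateFeedbackError(Exception):
    """Raised when feedback is submitted twice for the same round."""

class FeedbackStore:
    """Minimal store enforcing one feedback submission per (session, round)."""
    def __init__(self):
        self._seen = set()

    def submit(self, session_id, round_id, payload):
        key = (session_id, round_id)
        if key in self._seen:
            raise DuplicateFeedbackError(f"feedback already recorded for {key}")
        self._seen.add(key)
        return payload

store = FeedbackStore()
store.submit("s1", 1, {"winner": "c2"})
try:
    store.submit("s1", 1, {"winner": "c3"})
except DuplicateFeedbackError:
    pass  # duplicate submission is a recoverable, clearly-surfaced error
```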
Required fixtures:
- deterministic prompt embedding fixture
- synthetic candidate set fixture
- fake user feedback fixture
- mock image generator fixture
- small replay log fixture
- schema snapshot fixture
- frontend/backend trace capture fixture where needed
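The first and fourth fixtures above might be sketched as plain factory functions (easily wrapped as pytest fixtures); the names and the digest-based mock output are assumptions.

```python
import numpy as np

def prompt_embedding_fixture():
    """Deterministic prompt embedding fixture: seeded, so every test sees the same vector."""
    return np.random.default_rng(0).normal(size=16)

def mock_generator_fixture():
    """Stands in for the real image generator: returns stable metadata instead of pixels."""
    def generate(embedding):
        # A cheap deterministic digest of the embedding plays the role of an image id.
        return {"image_id": int(abs(embedding.sum() * 1000)) % 997, "ok": True}
    return generate

emb = prompt_embedding_fixture()
gen = mock_generator_fixture()
assert np.allclose(emb, prompt_embedding_fixture())   # fixture is reproducible
assert gen(emb) == gen(emb)                           # mock generation is stable
```

Keeping fixtures deterministic is what lets the replay and regression layers compare runs byte-for-byte.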
The prototype is acceptable when:
- all unit tests pass
- core integration tests pass
- deterministic replay tests pass
- one sampler and one updater can be swapped by configuration only
- the UI supports at least two feedback modes
- exports can be generated and replayed
- browser smoke coverage includes replay export retrieval
- failure-mode behavior is covered for the major recoverable errors
Test reporting should make it easy to identify:
- failing component area
- failing scenario
- whether the failure breaks replay trustworthiness
- whether the failure is isolated or systemic
The test suite is part of the research method, not an implementation afterthought. If replay, schema stability, and strategy interchangeability are not verified, the platform cannot support reliable experimental conclusions.