Phase 1 is complete. Phase 2 implements the first 4 experiments on existing hardware - the "fast path to real signal."
These 4 experiments represent the near-term lane and are designed to:
- Validate the experimental lab infrastructure
- Produce measurable wins on real hardware
- Establish credible baselines for future work
- Prove the framework's controls work correctly
Hypothesis: Layer-adaptive INT4 quantization with BF16 fallback improves ECD by at least 15% on transformer inference without violating the 99% quality floor.
Mutation Scope:
quantization_config: Which layers, which precision levelsfallback_policy: Conditions for falling back to BF16layer_fallback_thresholds: Quality thresholds per layer
Expected Implementation:
# Location: src/kernels/quantization.py
class QuantizationStrategy:
def quantize_layer(self, layer, precision, fallback_threshold):
"""Quantize a layer, fall back to BF16 if quality drops below threshold"""
quantized = quantize_to_precision(layer, precision)
quality_loss = measure_quality_loss(quantized)
if quality_loss > fallback_threshold:
return layer.to_bfloat16() # Fallback
return quantized
def compute_layer_precision_map(self, model, quality_floor):
"""Adaptive precision assignment per layer"""
# Fine binary search for minimum viable precision per layer
# Returns: {layer_name: precision}
pass
class INT4Executor:
"""Trial executor for INT4 experiments"""
def run_trial(self, trial_num, benchmark_id):
# 1. Load model at baseline precision (BF16)
# 2. Apply adaptive quantization config
# 3. Inference on benchmark
# 4. Measure: accuracy, latency, energy, throughput
# 5. Return metrics dict
passMetrics to Track:
accuracy: Must remain >= 99% of baselinelatency_ms_p95: Measure inference p95energy_joules: Per-inference energy (RAPL + GPU)throughput_tok_sec: Tokens per secondecd_estimate: (quality × throughput) / (energy × 1.0 area_proxy)
Trials: 10 (repeated runs with different random seeds)
Quality Floor: 99% of baseline accuracy
Latency Budget: p95 <= 120ms
Data: Transformer model inference on 128-token sequences, batch=1
Hypothesis: Dynamic token pruning reduces average energy per prompt by at least 20% while preserving response quality on long-context tasks.
Mutation Scope:
pruning_threshold: Token importance threshold for keeping a tokenpruning_method: Which importance scoring method (gradient, magnitude, entropy)context_length: Length of context to evaluate pruning on
Expected Implementation:
# Location: src/kernels/sparsity.py
class TokenPruner:
def score_token_importance(self, token, context, method="gradient"):
"""Score how important a token is for output"""
if method == "gradient":
# Compute gradient of loss w.r.t. this token embedding
return compute_gradient_magnitude(token)
elif method == "magnitude":
# Use embedding magnitude as proxy
return token.norm()
elif method == "entropy":
# Attention entropy over this token
return compute_attention_entropy(token, context)
def prune_sequence(self, sequence, threshold, budget=0.9):
"""Remove unimportant tokens from context sequence"""
scores = [self.score_token_importance(t, sequence) for t in sequence]
keep_indices = [i for i, s in enumerate(scores) if s >= threshold]
# Ensure we keep at least budget% of tokens
if len(keep_indices) < len(sequence) * budget:
keep_indices = sorted(range(len(sequence)),
key=lambda i: scores[i],
reverse=True)[:int(len(sequence)*budget)]
return sequence[keep_indices]
class TokenPruningExecutor:
"""Trial executor for sparsity experiments"""
def run_trial(self, trial_num, benchmark_id):
# 1. Load long-context task (e.g., summarization)
# 2. For each prompt:
# a. Score tokens
# b. Prune low-importance tokens
# c. Run inference on pruned context
# 3. Measure accuracy, latency, energy
return metricsMetrics to Track:
accuracy_f1: F1 score on task (e.g., ROUGE for summarization)tokens_pruned_pct: Percentage of tokens removedlatency_ms_p95: End-to-end latencyenergy_joules: Total energy for inferenceecd_estimate: (quality × throughput) / energy
Trials: 10 on diverse long-context datasets
Quality Floor: 95% of baseline accuracy (more lenient since pruning intentionally changes output)
Latency Budget: p95 <= 150ms (context-dependent)
Hypothesis: Activation recomputation optimizes the storage-vs.-recomputation tradeoff, reducing memory bandwidth by at least 25% without exceeding latency budget.
Mutation Scope:
recomputation_layers: Which layers to recompute activations formemory_budget_mb: Maximum memory for storing activationsrecomputation_granularity: At what kernel granularity to recompute
Expected Implementation:
# Location: src/kernels/memory_optimization.py
class ActivationMemoryOptimizer:
def estimate_memory_footprint(self, model, batch_size, seq_len):
"""Estimate activation memory for full storage"""
# Sum activation sizes per layer
total = 0
for layer in model.layers:
act_size = batch_size * seq_len * layer.hidden_dim * 4 # fp32
total += act_size
return total
def select_recomputation_schedule(self, model, memory_budget):
"""Decide which layers to recompute vs. store"""
schedule = {}
cumulative_memory = 0
for layer in model.layers:
recompute_cost = estimate_recomputation_latency(layer)
memory_cost = estimate_activation_footprint(layer)
if cumulative_memory + memory_cost <= memory_budget:
schedule[layer.name] = "store"
cumulative_memory += memory_cost
else:
schedule[layer.name] = "recompute"
return schedule
def optimize_memory_layout(self, model, batch_size, seq_len):
"""Reorganize memory access patterns for cache efficiency"""
# Reorder activation storage for better spatial locality
# Prefetch activations based on access patterns
pass
class MemoryOptimizationExecutor:
"""Trial executor for memory experiments"""
def run_trial(self, trial_num, benchmark_id):
# 1. Profile baseline activation memory
# 2. Apply recomputation schedule
# 3. Measure latency during recomputation
# 4. Measure DRAM bandwidth and cache hits
# 5. Return metrics
return {
"memory_traffic_bytes": total_bytes_moved,
"cache_hit_rate": hits / total_accesses,
"latency_ms": end_to_end_latency,
"energy_joules": energy_consumed,
}Metrics to Track:
memory_traffic_bytes: Total bytes moved between DRAM and cachedram_bandwidth_utilized: Peak BW / system BWlatency_ms_p95: Inference latencyenergy_joules: Total energycache_hit_rate: L3 cache hit percentage
Trials: 10
Quality Floor: N/A (numerical output identical)
Latency Budget: p95 <= 130ms (may increase due to recomputation)
Hypothesis: Kernel fusion and graph-level tiling reduce memory movement and improve p95 latency by at least 15% on dense linear algebra workloads.
Mutation Scope:
fusion_rules: Which kernels to fusetiling_config: Tile sizes for loop tilinginstruction_scheduling: Kernel scheduling strategy
Expected Implementation:
# Location: src/kernels/compiler_optimization.py
class KernelGraphOptimizer:
def identify_fusion_opportunities(self, graph):
"""Find kernels that can be fused"""
# Look for:
# - Elementwise operations (can fuse)
# - Sequential kernels sharing memory (can fuse)
# - Chains of small kernels (definitely fuse)
fusion_groups = []
# Return list of kernel groups to fuse
return fusion_groups
def compute_optimal_tiling(self, kernel, hardware):
"""Compute optimal tile sizes for loop tiling"""
# Tile for cache line size, L1, L2, L3
# Minimize memory traffic while maximizing ILP
tile_config = {
"M": optimal_m,
"N": optimal_n,
"K": optimal_k,
}
return tile_config
def fuse_kernels(self, kernel_list):
"""Fuse multiple kernels into single optimized kernel"""
# Generate fused kernel code
# Inline data between kernels
# Share temporary buffers
fused_kernel = generate_fused_kernel_code(kernel_list)
return fused_kernel
class CompilerOptimizationExecutor:
"""Trial executor for compiler experiments"""
def run_trial(self, trial_num, benchmark_id):
# 1. Analyze computational graph
# 2. Apply fusion rules
# 3. Apply tiling config
# 4. Compile and benchmark
# 5. Measure latency, memory traffic, power
return {
"latency_ms": latency,
"memory_traffic_bytes": bytes_moved,
"power_watts": peak_power,
"energy_joules": energy,
"flops_achieved": actual_flops,
"memory_bandwidth_used": bw_used,
}Metrics to Track:
latency_ms_p95: Kernel execution timememory_traffic_bytes: Total memory movementflops_achieved: Actual FLOP/s achievedmemory_bandwidth_used: Percentage of peak BWenergy_joules: Total energy for operation
Trials: 10 on different batch sizes and matrix dimensions
Quality Floor: N/A (numerical output bitwise identical)
Latency Budget: p95 <= 50ms for dense kernels
All 4 experiments follow this pattern:
class ExperimentExecutor:
def __init__(self, baseline_path, experiment_config):
self.baseline = load_baseline(baseline_path)
self.config = experiment_config
self.metrics_collector = MetricsCollector()
def run_trial(self, trial_num: int, benchmark_id: str) -> Dict[str, float]:
"""
Execute one trial of the experiment.
Returns dictionary of metrics.
"""
try:
# 1. Setup: Load model at baseline
model = load_model_at_precision(self.baseline.representation_type)
# 2. Mutate: Apply experimental change
model = apply_mutation(model, self.config)
# 3. Execute: Run benchmark
results = benchmark.run(model)
# 4. Measure: Collect metrics
metrics = {
"accuracy": results.accuracy,
"latency_ms": results.latency,
"energy_joules": results.energy,
"throughput_tok_sec": results.throughput,
}
# 5. Check: Quality floor
if metrics["accuracy"] < self.baseline.accuracy * 0.99:
raise QualityFloorViolation()
return metrics
except Exception as e:
return {"error": str(e)}
# Usage in experiment runner
runner = ExperimentRunner(baseline_mgr, benchmark_registry, metrics_collector)
executor = Experiment1Executor(baseline, config)
experiment = runner.run_experiment(
experiment_id="exp_001",
trial_executor=executor.run_trial,
)Accept experiment ONLY if:
- ✅ Minimum Effect Size: abs(effect_size) >= 0.10 (10% improvement)
- ✅ Quality Floor: accuracy >= baseline × 0.99 (99% of baseline)
- ✅ Latency Budget: p95_latency <= budget_ms
- ✅ Statistical Significance: p-value < 0.05 (Welch's t-test)
- ✅ Confidence Interval: CI excludes zero
- ✅ Holdout Pass: Replicates on holdout benchmark set
- ✅ Complexity Cost: Orchestration overhead < 5% of savings
For each trial, measure:
- Task quality (accuracy, F1, ROUGE, etc.)
- Latency p50, p95, p99 (milliseconds)
- Throughput (tokens/sec or samples/sec)
- Energy (joules per inference) - use RAPL + GPU telemetry
- Memory traffic (bytes moved between hierarchy levels)
- Cache hit rate (if available)
- Peak power (watts)
- Compute utilization (% of peak FLOPS)
- Baseline hardware profiled (A100, RTX 6000, etc.)
- BF16 baseline model frozen and measured
- Benchmark data prepared (ImageNet, GLUE, translation, etc.)
- Energy measurement tools configured (RAPL, GPU logging)
- Quantization module complete
- Sparsity module complete
- Memory optimizer complete
- Compiler fusion toolkit complete
- Trial executors written and tested
- Config files for all 4 experiments
- Exp 1: INT4 quantization (10 trials)
- Exp 2: Token pruning (10 trials)
- Exp 3: Memory optimization (10 trials)
- Exp 4: Compiler fusion (10 trials)
- All experiments save trial data
- All verdicts generated (accept/reject)
- Report generation working
- Holdout validation passed
- Portfolio dashboard shows all 4
- Individual reports generated
- Comparison report created
- Results stored in ResultsStore
- Next experiments proposed
If framework works correctly, we expect:
- Exp 1 (Quantization): 12-18% ECD improvement, quality floor met
- Exp 2 (Pruning): 18-25% energy reduction, some quality loss acceptable
- Exp 3 (Memory): 20-35% memory traffic reduction, latency tradeoff
- Exp 4 (Fusion): 12-20% latency improvement, no quality impact
These are realistic, achievable targets on existing hardware with proven techniques.
Once all 4 experiments complete and verdicts are stable:
- Move to Phase 3: Mid-term experiments (RNS, LNS, posits, etc.)
- Then Phase 4: Moonshot simulators
- Then Phase 5: Agent-driven hypothesis generation
Only then introduce the hypothesis/build/skeptic agents.
The lab proves itself on known-good experiments first.