Phase 2 Implementation Guide

Overview

Phase 1 is complete. Phase 2 implements the first 4 experiments on existing hardware - the "fast path to real signal."

These 4 experiments represent the near-term lane and are designed to:

Validate the experimental lab infrastructure
Produce measurable wins on real hardware
Establish credible baselines for future work
Prove the framework's controls work correctly

Experiment 1: Adaptive INT4/INT8 Quantization

Hypothesis: Layer-adaptive INT4 quantization with BF16 fallback improves ECD by at least 15% on transformer inference without violating the 99% quality floor.

Mutation Scope:

quantization_config: Which layers, which precision levels
fallback_policy: Conditions for falling back to BF16
layer_fallback_thresholds: Quality thresholds per layer

Expected Implementation:

# Location: src/kernels/quantization.py

class QuantizationStrategy:
    def quantize_layer(self, layer, precision, fallback_threshold):
        """Quantize a layer, fall back to BF16 if quality drops below threshold"""
        quantized = quantize_to_precision(layer, precision)
        quality_loss = measure_quality_loss(quantized)
        
        if quality_loss > fallback_threshold:
            return layer.to_bfloat16()  # Fallback
        return quantized
    
    def compute_layer_precision_map(self, model, quality_floor):
        """Adaptive precision assignment per layer"""
        # Fine binary search for minimum viable precision per layer
        # Returns: {layer_name: precision}
        pass

class INT4Executor:
    """Trial executor for INT4 experiments"""
    def run_trial(self, trial_num, benchmark_id):
        # 1. Load model at baseline precision (BF16)
        # 2. Apply adaptive quantization config
        # 3. Inference on benchmark
        # 4. Measure: accuracy, latency, energy, throughput
        # 5. Return metrics dict
        pass

Metrics to Track:

accuracy: Must remain >= 99% of baseline
latency_ms_p95: Measure inference p95
energy_joules: Per-inference energy (RAPL + GPU)
throughput_tok_sec: Tokens per second
ecd_estimate: (quality × throughput) / (energy × 1.0 area_proxy)

Trials: 10 (repeated runs with different random seeds)

Quality Floor: 99% of baseline accuracy

Latency Budget: p95 <= 120ms

Data: Transformer model inference on 128-token sequences, batch=1

Experiment 2: Dynamic Sparsity / Token Pruning

Hypothesis: Dynamic token pruning reduces average energy per prompt by at least 20% while preserving response quality on long-context tasks.

Mutation Scope:

pruning_threshold: Token importance threshold for keeping a token
pruning_method: Which importance scoring method (gradient, magnitude, entropy)
context_length: Length of context to evaluate pruning on

Expected Implementation:

# Location: src/kernels/sparsity.py

class TokenPruner:
    def score_token_importance(self, token, context, method="gradient"):
        """Score how important a token is for output"""
        if method == "gradient":
            # Compute gradient of loss w.r.t. this token embedding
            return compute_gradient_magnitude(token)
        elif method == "magnitude":
            # Use embedding magnitude as proxy
            return token.norm()
        elif method == "entropy":
            # Attention entropy over this token
            return compute_attention_entropy(token, context)

    def prune_sequence(self, sequence, threshold, budget=0.9):
        """Remove unimportant tokens from context sequence"""
        scores = [self.score_token_importance(t, sequence) for t in sequence]
        keep_indices = [i for i, s in enumerate(scores) if s >= threshold]
        
        # Ensure we keep at least budget% of tokens
        if len(keep_indices) < len(sequence) * budget:
            keep_indices = sorted(range(len(sequence)), 
                                key=lambda i: scores[i], 
                                reverse=True)[:int(len(sequence)*budget)]
        
        return sequence[keep_indices]

class TokenPruningExecutor:
    """Trial executor for sparsity experiments"""
    def run_trial(self, trial_num, benchmark_id):
        # 1. Load long-context task (e.g., summarization)
        # 2. For each prompt:
        #    a. Score tokens
        #    b. Prune low-importance tokens
        #    c. Run inference on pruned context
        # 3. Measure accuracy, latency, energy
        return metrics

Metrics to Track:

accuracy_f1: F1 score on task (e.g., ROUGE for summarization)
tokens_pruned_pct: Percentage of tokens removed
latency_ms_p95: End-to-end latency
energy_joules: Total energy for inference
ecd_estimate: (quality × throughput) / energy

Trials: 10 on diverse long-context datasets

Quality Floor: 95% of baseline accuracy (more lenient since pruning intentionally changes output)

Latency Budget: p95 <= 150ms (context-dependent)

Experiment 3: Memory-Traffic Reduction

Hypothesis: Activation recomputation optimizes the storage-vs.-recomputation tradeoff, reducing memory bandwidth by at least 25% without exceeding latency budget.

Mutation Scope:

recomputation_layers: Which layers to recompute activations for
memory_budget_mb: Maximum memory for storing activations
recomputation_granularity: At what kernel granularity to recompute

Expected Implementation:

# Location: src/kernels/memory_optimization.py

class ActivationMemoryOptimizer:
    def estimate_memory_footprint(self, model, batch_size, seq_len):
        """Estimate activation memory for full storage"""
        # Sum activation sizes per layer
        total = 0
        for layer in model.layers:
            act_size = batch_size * seq_len * layer.hidden_dim * 4  # fp32
            total += act_size
        return total

    def select_recomputation_schedule(self, model, memory_budget):
        """Decide which layers to recompute vs. store"""
        schedule = {}
        cumulative_memory = 0
        
        for layer in model.layers:
            recompute_cost = estimate_recomputation_latency(layer)
            memory_cost = estimate_activation_footprint(layer)
            
            if cumulative_memory + memory_cost <= memory_budget:
                schedule[layer.name] = "store"
                cumulative_memory += memory_cost
            else:
                schedule[layer.name] = "recompute"
        
        return schedule

    def optimize_memory_layout(self, model, batch_size, seq_len):
        """Reorganize memory access patterns for cache efficiency"""
        # Reorder activation storage for better spatial locality
        # Prefetch activations based on access patterns
        pass

class MemoryOptimizationExecutor:
    """Trial executor for memory experiments"""
    def run_trial(self, trial_num, benchmark_id):
        # 1. Profile baseline activation memory
        # 2. Apply recomputation schedule
        # 3. Measure latency during recomputation
        # 4. Measure DRAM bandwidth and cache hits
        # 5. Return metrics
        return {
            "memory_traffic_bytes": total_bytes_moved,
            "cache_hit_rate": hits / total_accesses,
            "latency_ms": end_to_end_latency,
            "energy_joules": energy_consumed,
        }

Metrics to Track:

memory_traffic_bytes: Total bytes moved between DRAM and cache
dram_bandwidth_utilized: Peak BW / system BW
latency_ms_p95: Inference latency
energy_joules: Total energy
cache_hit_rate: L3 cache hit percentage

Trials: 10

Quality Floor: N/A (numerical output identical)

Latency Budget: p95 <= 130ms (may increase due to recomputation)

Experiment 4: Compiler Graph Optimization

Hypothesis: Kernel fusion and graph-level tiling reduce memory movement and improve p95 latency by at least 15% on dense linear algebra workloads.

Mutation Scope:

fusion_rules: Which kernels to fuse
tiling_config: Tile sizes for loop tiling
instruction_scheduling: Kernel scheduling strategy

Expected Implementation:

# Location: src/kernels/compiler_optimization.py

class KernelGraphOptimizer:
    def identify_fusion_opportunities(self, graph):
        """Find kernels that can be fused"""
        # Look for:
        # - Elementwise operations (can fuse)
        # - Sequential kernels sharing memory (can fuse)
        # - Chains of small kernels (definitely fuse)
        fusion_groups = []
        # Return list of kernel groups to fuse
        return fusion_groups

    def compute_optimal_tiling(self, kernel, hardware):
        """Compute optimal tile sizes for loop tiling"""
        # Tile for cache line size, L1, L2, L3
        # Minimize memory traffic while maximizing ILP
        tile_config = {
            "M": optimal_m,
            "N": optimal_n,
            "K": optimal_k,
        }
        return tile_config

    def fuse_kernels(self, kernel_list):
        """Fuse multiple kernels into single optimized kernel"""
        # Generate fused kernel code
        # Inline data between kernels
        # Share temporary buffers
        fused_kernel = generate_fused_kernel_code(kernel_list)
        return fused_kernel

class CompilerOptimizationExecutor:
    """Trial executor for compiler experiments"""
    def run_trial(self, trial_num, benchmark_id):
        # 1. Analyze computational graph
        # 2. Apply fusion rules
        # 3. Apply tiling config
        # 4. Compile and benchmark
        # 5. Measure latency, memory traffic, power
        return {
            "latency_ms": latency,
            "memory_traffic_bytes": bytes_moved,
            "power_watts": peak_power,
            "energy_joules": energy,
            "flops_achieved": actual_flops,
            "memory_bandwidth_used": bw_used,
        }

Metrics to Track:

latency_ms_p95: Kernel execution time
memory_traffic_bytes: Total memory movement
flops_achieved: Actual FLOP/s achieved
memory_bandwidth_used: Percentage of peak BW
energy_joules: Total energy for operation

Trials: 10 on different batch sizes and matrix dimensions

Quality Floor: N/A (numerical output bitwise identical)

Latency Budget: p95 <= 50ms for dense kernels

Common Implementation Pattern

All 4 experiments follow this pattern:

class ExperimentExecutor:
    def __init__(self, baseline_path, experiment_config):
        self.baseline = load_baseline(baseline_path)
        self.config = experiment_config
        self.metrics_collector = MetricsCollector()

    def run_trial(self, trial_num: int, benchmark_id: str) -> Dict[str, float]:
        """
        Execute one trial of the experiment.
        
        Returns dictionary of metrics.
        """
        try:
            # 1. Setup: Load model at baseline
            model = load_model_at_precision(self.baseline.representation_type)
            
            # 2. Mutate: Apply experimental change
            model = apply_mutation(model, self.config)
            
            # 3. Execute: Run benchmark
            results = benchmark.run(model)
            
            # 4. Measure: Collect metrics
            metrics = {
                "accuracy": results.accuracy,
                "latency_ms": results.latency,
                "energy_joules": results.energy,
                "throughput_tok_sec": results.throughput,
            }
            
            # 5. Check: Quality floor
            if metrics["accuracy"] < self.baseline.accuracy * 0.99:
                raise QualityFloorViolation()
            
            return metrics
            
        except Exception as e:
            return {"error": str(e)}

# Usage in experiment runner
runner = ExperimentRunner(baseline_mgr, benchmark_registry, metrics_collector)
executor = Experiment1Executor(baseline, config)
experiment = runner.run_experiment(
    experiment_id="exp_001",
    trial_executor=executor.run_trial,
)

Acceptance Rules (Same for All 4)

Accept experiment ONLY if:

✅ Minimum Effect Size: abs(effect_size) >= 0.10 (10% improvement)
✅ Quality Floor: accuracy >= baseline × 0.99 (99% of baseline)
✅ Latency Budget: p95_latency <= budget_ms
✅ Statistical Significance: p-value < 0.05 (Welch's t-test)
✅ Confidence Interval: CI excludes zero
✅ Holdout Pass: Replicates on holdout benchmark set
✅ Complexity Cost: Orchestration overhead < 5% of savings

Metrics Collection Checklist

For each trial, measure:

Task quality (accuracy, F1, ROUGE, etc.)
Latency p50, p95, p99 (milliseconds)
Throughput (tokens/sec or samples/sec)
Energy (joules per inference) - use RAPL + GPU telemetry
Memory traffic (bytes moved between hierarchy levels)
Cache hit rate (if available)
Peak power (watts)
Compute utilization (% of peak FLOPS)

Quick Checklist for Phase 2 Startup

Prerequisites

Baseline hardware profiled (A100, RTX 6000, etc.)
BF16 baseline model frozen and measured
Benchmark data prepared (ImageNet, GLUE, translation, etc.)
Energy measurement tools configured (RAPL, GPU logging)

Implementation

Quantization module complete
Sparsity module complete
Memory optimizer complete
Compiler fusion toolkit complete
Trial executors written and tested
Config files for all 4 experiments

Execution

Exp 1: INT4 quantization (10 trials)
Exp 2: Token pruning (10 trials)
Exp 3: Memory optimization (10 trials)
Exp 4: Compiler fusion (10 trials)

Validation

All experiments save trial data
All verdicts generated (accept/reject)
Report generation working
Holdout validation passed
Portfolio dashboard shows all 4

Documentation

Individual reports generated
Comparison report created
Results stored in ResultsStore
Next experiments proposed

Expected Outcomes

If framework works correctly, we expect:

Exp 1 (Quantization): 12-18% ECD improvement, quality floor met
Exp 2 (Pruning): 18-25% energy reduction, some quality loss acceptable
Exp 3 (Memory): 20-35% memory traffic reduction, latency tradeoff
Exp 4 (Fusion): 12-20% latency improvement, no quality impact

These are realistic, achievable targets on existing hardware with proven techniques.

After Phase 2

Once all 4 experiments complete and verdicts are stable:

Move to Phase 3: Mid-term experiments (RNS, LNS, posits, etc.)
Then Phase 4: Moonshot simulators
Then Phase 5: Agent-driven hypothesis generation

Only then introduce the hypothesis/build/skeptic agents.

The lab proves itself on known-good experiments first.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 2 Implementation Guide

Overview

Experiment 1: Adaptive INT4/INT8 Quantization

Experiment 2: Dynamic Sparsity / Token Pruning

Experiment 3: Memory-Traffic Reduction

Experiment 4: Compiler Graph Optimization

Common Implementation Pattern

Acceptance Rules (Same for All 4)

Metrics Collection Checklist

Quick Checklist for Phase 2 Startup

Prerequisites

Implementation

Execution

Validation

Documentation

Expected Outcomes

After Phase 2

FilesExpand file tree

PHASE_2_GUIDE.md

Latest commit

History

PHASE_2_GUIDE.md

File metadata and controls

Phase 2 Implementation Guide

Overview

Experiment 1: Adaptive INT4/INT8 Quantization

Experiment 2: Dynamic Sparsity / Token Pruning

Experiment 3: Memory-Traffic Reduction

Experiment 4: Compiler Graph Optimization

Common Implementation Pattern

Acceptance Rules (Same for All 4)

Metrics Collection Checklist

Quick Checklist for Phase 2 Startup

Prerequisites

Implementation

Execution

Validation

Documentation

Expected Outcomes

After Phase 2