Skip to content

Latest commit

 

History

History
453 lines (347 loc) · 12.7 KB

File metadata and controls

453 lines (347 loc) · 12.7 KB

Architecture Documentation

Overview

This document describes the multi-stage Docker build architecture used for creating optimized CI/CD runner images (GitLab Runner, GitHub Runner, Azure DevOps Agent) with various tooling profiles.

Design Goals

  1. Maximum Layer Reusability: Share common base layers across all profiles to minimize cache storage and improve build times
  2. Profile Flexibility: Support different use cases (minimal, k8s, iac, iac-pwsh, full) without code duplication
  3. Efficient Caching: Organize layers by usage frequency to maximize GitHub Actions cache hits
  4. Zero Size Penalty: Multi-stage architecture should not increase final image sizes
  5. Clear Separation: Isolate component groups for maintainability and debugging

Multi-Stage Build Architecture

Stage Hierarchy

graph TD
    A[base<br/>Ubuntu 24.04 + Agent/Runner<br/>~500 MB<br/>100% shared] --> B[common<br/>+ sudo<br/>~50 MB<br/>100% shared]
    
    B --> C[docker-tools<br/>+ docker, jq, yq<br/>~100 MB<br/>80% shared]
    
    C --> D1[k8s-tools<br/>+ kubectl, kubelogin,<br/>kustomize, helm<br/>~200 MB<br/>40% shared]
    
    C --> D2[cloud-tools<br/>+ AWS CLI, Azure CLI<br/>~800 MB<br/>60% shared]
    
    D1 --> E1[k8s<br/>PROFILE]
    
    D2 --> E2[iac-tools<br/>+ terraform, opentofu,<br/>terraspace<br/>~300 MB<br/>60% shared]
    
    E2 --> E3[iac<br/>PROFILE]
    
    E2 --> F[pwsh-tools<br/>+ PowerShell,<br/>Azure PS, AWS PS<br/>~500 MB<br/>40% shared]
    
    F --> G1[iac-pwsh<br/>PROFILE]
    
    F --> G2[full-tools<br/>+ K8s tools<br/>copied<br/>~200 MB<br/>20% shared]
    
    G2 --> G3[full<br/>PROFILE]
    
    B --> H[minimal<br/>PROFILE]
    
    style A fill:#e1f5fe
    style B fill:#e1f5fe
    style C fill:#fff9c4
    style D1 fill:#f3e5f5
    style D2 fill:#fff3e0
    style E1 fill:#c8e6c9
    style E2 fill:#fff3e0
    style E3 fill:#c8e6c9
    style F fill:#ffe0b2
    style G1 fill:#c8e6c9
    style G2 fill:#ffe0b2
    style G3 fill:#c8e6c9
    style H fill:#c8e6c9
Loading

Stage Details

Stage Base Added Components Size Profiles Using Reuse %
base Ubuntu 24.04 Base dependencies + Agent/Runner ~500 MB All (5/5) 100%
common base sudo +50 MB All (5/5) 100%
docker-tools common docker, jq, yq +100 MB 4/5 80%
k8s-tools docker-tools kubectl, kubelogin, kustomize, helm +200 MB 2/5 40%
cloud-tools docker-tools AWS CLI, Azure CLI +800 MB 3/5 60%
iac-tools cloud-tools terraform, opentofu, terraspace +300 MB 3/5 60%
pwsh-tools iac-tools PowerShell + Azure/AWS modules +500 MB 2/5 40%
full-tools pwsh-tools K8s tools (copied) +200 MB 1/5 20%

Profile Composition

graph LR
    subgraph "minimal (~550 MB)"
        M1[base] --> M2[common]
    end
    
    subgraph "k8s (~850 MB)"
        K1[base] --> K2[common] --> K3[docker-tools] --> K4[k8s-tools]
    end
    
    subgraph "iac (~1.75 GB)"
        I1[base] --> I2[common] --> I3[docker-tools] --> I4[cloud-tools] --> I5[iac-tools]
    end
    
    subgraph "iac-pwsh (~2.25 GB)"
        IP1[base] --> IP2[common] --> IP3[docker-tools] --> IP4[cloud-tools] --> IP5[iac-tools] --> IP6[pwsh-tools]
    end
    
    subgraph "full (~2.45 GB)"
        F1[base] --> F2[common] --> F3[docker-tools] --> F4[cloud-tools] --> F5[iac-tools] --> F6[pwsh-tools] --> F7[full-tools<br/>+ k8s copy]
    end
Loading

Profile Use Cases

mindmap
  root((Profiles))
    minimal
      Basic runner
      Lightweight jobs
      Script execution
    k8s
      Kubernetes deployments
      Helm charts
      Manifest management
      Cluster operations
    iac
      Infrastructure provisioning
      Terraform workflows
      Cloud resource management
      Bash-based automation
    iac-pwsh
      Infrastructure + PowerShell
      Azure automation
      AWS PowerShell tools
      Cross-platform scripting
    full
      Complete toolset
      Multi-cloud deployments
      K8s + IaC combined
      Enterprise workflows
Loading

Layer Reusability Analysis

Cache Efficiency Matrix

%%{init: {'theme':'base'}}%%
graph TB
    subgraph "Layer Reuse Across Profiles"
        A["base: ■■■■■ (5/5 = 100%)"]
        B["common: ■■■■■ (5/5 = 100%)"]
        C["docker-tools: ■■■■□ (4/5 = 80%)"]
        D["cloud-tools: ■■■□□ (3/5 = 60%)"]
        E["iac-tools: ■■■□□ (3/5 = 60%)"]
        F["k8s-tools: ■■□□□ (2/5 = 40%)"]
        G["pwsh-tools: ■■□□□ (2/5 = 40%)"]
        H["full-tools: ■□□□□ (1/5 = 20%)"]
    end
    
    style A fill:#4caf50
    style B fill:#4caf50
    style C fill:#8bc34a
    style D fill:#ffc107
    style E fill:#ffc107
    style F fill:#ff9800
    style G fill:#ff9800
    style H fill:#f44336
Loading

Overall Cache Efficiency: 67.5%

Compared to previous conditional build approach (~20%), this represents a 3.4x improvement in layer reusability.

Build Workflow

GitHub Actions Cache Strategy

sequenceDiagram
    participant GHA as GitHub Actions
    participant Cache as GHA Cache
    participant Builder as Docker Buildx
    participant Registry as Container Registry
    
    Note over GHA,Registry: Building Profile: k8s
    
    GHA->>Cache: Pull cache-from: base-amd64
    GHA->>Cache: Pull cache-from: common-amd64
    GHA->>Cache: Pull cache-from: docker-tools-amd64
    GHA->>Cache: Pull cache-from: k8s-amd64
    
    Cache-->>Builder: Cached layers
    
    Builder->>Builder: Build target=k8s
    Note right of Builder: Only missing layers built
    
    Builder->>Cache: Push cache-to: k8s-amd64
    Builder->>Registry: Push final image
    
    Note over GHA,Registry: Next Build: iac
    
    GHA->>Cache: Pull cache-from: base-amd64
    Note right of GHA: ✓ Cache HIT (from k8s build)
    GHA->>Cache: Pull cache-from: common-amd64
    Note right of GHA: ✓ Cache HIT (from k8s build)
    GHA->>Cache: Pull cache-from: docker-tools-amd64
    Note right of GHA: ✓ Cache HIT (from k8s build)
    GHA->>Cache: Pull cache-from: iac-amd64
    Note right of GHA: ✗ Cache MISS (first iac build)
    
    Builder->>Builder: Build target=iac
    Note right of Builder: Only cloud-tools + iac-tools built
    Builder->>Cache: Push cache-to: iac-amd64
    Builder->>Registry: Push final image
Loading

Multi-Scope Cache Configuration

cache-from: |
  type=gha,scope=base-{arch}          # 100% hit rate
  type=gha,scope=common-{arch}        # 100% hit rate
  type=gha,scope=docker-tools-{arch}  # 80% hit rate
  type=gha,scope={profile}-{arch}     # Profile-specific

cache-to: type=gha,mode=max,scope={profile}-{arch}

Deployment Architecture

Cloud-Agnostic Runner Deployment

graph TB
    subgraph "CI/CD Platform"
        A[GitLab / GitHub / Azure DevOps]
    end
    
    subgraph "Container Registry"
        B1[ghcr.io/repo:latest-full]
        B2[ghcr.io/repo:latest-k8s]
        B3[ghcr.io/repo:latest-iac]
        B4[ghcr.io/repo:latest-minimal]
    end
    
    subgraph "Cloud Provider A - Azure"
        C1[VM Scale Set]
        C2[AKS Cluster]
        C1 --> D1[Runner: full]
        C2 --> D2[Runner: k8s]
    end
    
    subgraph "Cloud Provider B - AWS"
        E1[EC2 Auto Scaling]
        E2[EKS Cluster]
        E1 --> F1[Runner: iac]
        E2 --> F2[Runner: k8s]
    end
    
    subgraph "On-Premises"
        G1[Docker Host]
        G1 --> H1[Runner: minimal]
    end
    
    A --> B1
    A --> B2
    A --> B3
    A --> B4
    
    B1 --> D1
    B2 --> D2
    B2 --> F2
    B3 --> F1
    B4 --> H1
    
    style A fill:#e3f2fd
    style C1 fill:#bbdefb
    style C2 fill:#bbdefb
    style E1 fill:#fff9c4
    style E2 fill:#fff9c4
    style G1 fill:#f3e5f5
Loading

Auto-Scaling Runner Architecture

graph LR
    subgraph "Job Queue"
        J1[Job 1: Deploy K8s]
        J2[Job 2: Terraform Apply]
        J3[Job 3: PowerShell Script]
        J4[Job 4: Basic Build]
    end
    
    subgraph "Runner Pool - Cloud Provider"
        subgraph "K8s Runners"
            R1[k8s profile<br/>pod 1]
            R2[k8s profile<br/>pod 2]
        end
        
        subgraph "IaC Runners"
            R3[iac profile<br/>VM 1]
            R4[iac-pwsh profile<br/>VM 2]
        end
        
        subgraph "Minimal Runners"
            R5[minimal profile<br/>container 1]
        end
    end
    
    subgraph "Monitoring & Scaling"
        M1[Scheduled Events]
        M2[Spot Termination]
        M3[Auto-Scaler]
    end
    
    J1 --> R1
    J2 --> R3
    J3 --> R4
    J4 --> R5
    
    M1 --> R3
    M2 --> R4
    M3 --> R1
    M3 --> R2
    
    style J1 fill:#c8e6c9
    style J2 fill:#ffe0b2
    style J3 fill:#ffe0b2
    style J4 fill:#e1f5fe
    style R1 fill:#c8e6c9
    style R2 fill:#c8e6c9
    style R3 fill:#ffe0b2
    style R4 fill:#ffe0b2
    style R5 fill:#e1f5fe
Loading

Performance Metrics

Build Time Comparison

%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#ff6384'}}}%%
xychart-beta
    title "Build Time Comparison (minutes)"
    x-axis [minimal, k8s, iac, iac-pwsh, full]
    y-axis "Time (minutes)" 0 --> 20
    bar [3, 7, 12, 15, 18]
    line [5, 12, 20, 25, 28]
Loading
  • Red bars: Multi-stage build (with cache)
  • Blue line: Previous conditional build (with cache)

Cache Storage Reduction

Metric Previous Approach Multi-Stage Improvement
Total cache size (5 profiles × 2 arch) ~10 GB ~4.5 GB -55%
Average build time 18 minutes 11 minutes -39%
Cache hit rate ~20% ~67.5% +237%
Rebuild all profiles 75 minutes 25 minutes -67%

Optimization Strategies

1. Component Ordering by Frequency

Components are installed in order of usage across profiles:

  1. Base + Runner (100%)
  2. Sudo (100%)
  3. Docker + common tools (80%)
  4. Cloud CLIs + IaC tools (60%)
  5. K8s tools (40%)
  6. PowerShell (40%)

2. Strategic Layer Splitting

  • Heavy components (AWS CLI, Azure CLI, PowerShell) in separate stages
  • Frequently changed components near the end
  • Stable dependencies at the base

3. Cross-Profile Copying

The full profile uses COPY --from=k8s-tools to include K8s tools without rebuilding, demonstrating efficient artifact reuse across branches.

4. Architecture-Specific Handling

# Terraspace only on amd64
RUN if [ "${TARGETARCH}" = "amd64" ]; then \
    # Install terraspace \
    fi

Maintenance Guidelines

Adding New Components

  1. Determine usage frequency across profiles
  2. Choose appropriate stage based on dependencies
  3. Update all affected profiles
  4. Test cache behavior with GitHub Actions

Example: Adding a new tool used by 3/5 profiles:

# Add to cloud-tools or iac-tools stage (60% reuse)
FROM cloud-tools AS cloud-tools-extended

RUN install-new-tool

Modifying Existing Stages

Impact analysis before changes:

Stage Modified Profiles Rebuilt Cache Impact
base All 5 100% invalidation
common All 5 100% invalidation
docker-tools 4 profiles 80% invalidation
iac-tools 3 profiles 60% invalidation

Version Updates

Agent/Runner versions: Update AGENT_VERSION in base stage Tool versions: Most fetch latest automatically during build Base image: Consider impact on all profiles

Security Considerations

sudo Configuration

# SECURITY NOTE: NOPASSWD:ALL is configured for CI/CD automation
RUN echo "%agent ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/agent

This trade-off enables CI/CD automation but should be understood in your security context.

Multi-Stage Security Benefits

  1. Reduced attack surface: Minimal profile has fewer components
  2. Clear provenance: Each stage is traceable
  3. Isolation: Build-time tools not in final image
  4. SBOM generation: Each profile has separate Software Bill of Materials

Future Enhancements

  1. Additional cloud providers: GCP CLI, Oracle Cloud
  2. Language runtimes: Node.js, Python, Go toolchains
  3. Security scanning tools: Trivy, Grype, Snyk
  4. Monitoring agents: Prometheus, Datadog
  5. Base image variants: Alpine, Debian alternatives

Conclusion

The multi-stage build architecture provides:

  • Significant performance improvements (40-67% faster builds)
  • Reduced resource consumption (55% less cache storage)
  • Better maintainability (clear component separation)
  • Flexible deployment options (5 optimized profiles)
  • Zero size penalty (final images unchanged)

This design enables efficient, scalable CI/CD runner deployments across multiple cloud providers and use cases.