Improve model load error handling with structured diagnostics, retry logic, and self-hosting support #96

@Sumanth-806307

Problem

Models fail to load across multiple surfaces (chat.webllm.ai, JSFiddle examples, Chrome extensions) as reported in #85. Current error handling lacks:

  • Structured error classification
  • Automatic retry mechanisms
  • Cache recovery logic
  • User-actionable error messages
  • Self-hosting capabilities

Root Causes Identified

  1. Insufficient Error Diagnostics - Generic error messages without classification codes
  2. No Retry Logic - Transient network/CDN failures cause hard stops
  3. Cache Corruption - No automatic cache clearing and retry
  4. No Self-Hosting Support - Users locked into default CDN with no override option

Proposed Solution

Phase 1: Enhanced Error Diagnostics (High Priority)

  • Add ModelLoadErrorCode enum (manifest_fetch_failed, artifact_fetch_failed, worker_init_failed, webgpu_init_failed, cache_invalid)
  • Implement error classification in webllm.ts
  • Add structured error display with actionable guidance
  • Include "Copy Diagnostics" feature for bug reports

Files: app/client/api.ts, app/client/webllm.ts, app/store/chat.ts
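A minimal sketch of what the Phase 1 classification could look like. Only the five error code values come from this issue. The `LoadStage` type, the `ModelLoadError` shape, the guidance strings, the choice of which codes are retryable, and the `classifyModelLoadError` / `formatDiagnostics` helpers are all hypothetical names and assumptions for illustration, not existing code in `webllm.ts`.

```typescript
// Error codes as listed in Phase 1 of this issue.
enum ModelLoadErrorCode {
  ManifestFetchFailed = "manifest_fetch_failed",
  ArtifactFetchFailed = "artifact_fetch_failed",
  WorkerInitFailed = "worker_init_failed",
  WebGPUInitFailed = "webgpu_init_failed",
  CacheInvalid = "cache_invalid",
}

// Hypothetical: the loader reports which stage it was in when it failed.
type LoadStage = "manifest" | "artifact" | "worker" | "webgpu" | "cache";

interface ModelLoadError {
  code: ModelLoadErrorCode;
  message: string; // original error text, kept for bug reports
  guidance: string; // user-actionable hint shown in the UI
  retryable: boolean;
}

const STAGE_TO_CODE: Record<LoadStage, ModelLoadErrorCode> = {
  manifest: ModelLoadErrorCode.ManifestFetchFailed,
  artifact: ModelLoadErrorCode.ArtifactFetchFailed,
  worker: ModelLoadErrorCode.WorkerInitFailed,
  webgpu: ModelLoadErrorCode.WebGPUInitFailed,
  cache: ModelLoadErrorCode.CacheInvalid,
};

// Assumption: fetch and cache failures are retryable, init failures are not.
const GUIDANCE: Record<ModelLoadErrorCode, { guidance: string; retryable: boolean }> = {
  [ModelLoadErrorCode.ManifestFetchFailed]: {
    guidance: "Check your network connection, then try again.",
    retryable: true,
  },
  [ModelLoadErrorCode.ArtifactFetchFailed]: {
    guidance: "The model files could not be downloaded. Try again or use a custom source.",
    retryable: true,
  },
  [ModelLoadErrorCode.WorkerInitFailed]: {
    guidance: "The worker could not start. Reload the page.",
    retryable: false,
  },
  [ModelLoadErrorCode.WebGPUInitFailed]: {
    guidance: "WebGPU is unavailable. Check browser support and GPU drivers.",
    retryable: false,
  },
  [ModelLoadErrorCode.CacheInvalid]: {
    guidance: "The cached model is corrupted. Clear the cache and reload.",
    retryable: true,
  },
};

// Maps a raw error plus the failing stage to a structured error.
function classifyModelLoadError(err: unknown, stage: LoadStage): ModelLoadError {
  const code = STAGE_TO_CODE[stage];
  const message = err instanceof Error ? err.message : String(err);
  return { code, message, ...GUIDANCE[code] };
}

// Builds the text a "Copy Diagnostics" button could place on the clipboard.
function formatDiagnostics(error: ModelLoadError, modelId: string): string {
  return JSON.stringify(
    { modelId, code: error.code, message: error.message, retryable: error.retryable },
    null,
    2,
  );
}
```

Classifying by load stage rather than by matching on error message text keeps the mapping stable when upstream messages change, at the cost of requiring each load step to report its stage.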

Phase 2: Retry Logic & Self-Recovery

  • Automatic retry with exponential backoff (max 3 attempts, 1s → 2s → 4s)
  • Automatic cache clearing on cache_invalid errors
  • Progress indication during retries
  • Only retry on retryable error types

Files: app/client/webllm.ts
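The retry behaviour above could be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: `loadWithRetry`, `RetryOptions`, and the `onRetry` and `sleep` hooks are hypothetical names. The sketch reads "max 3 attempts, 1s → 2s → 4s" as up to three retries after the initial try, so that all three delays are used; if the intent is three tries in total, `maxRetries` would be 2.

```typescript
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

interface RetryOptions {
  maxRetries?: number; // retries after the first try (assumed default: 3)
  baseDelayMs?: number; // delay before the first retry (assumed default: 1000)
  isRetryable?: (err: unknown) => boolean; // only retry retryable error types
  // Called before each wait; can report progress or clear a corrupted cache.
  onRetry?: (err: unknown, retry: number, delayMs: number) => void | Promise<void>;
  sleep?: Sleep; // injectable so tests do not have to wait
}

// Delay before retry n (1-based): base, 2*base, 4*base, ...
function backoffDelayMs(retry: number, baseDelayMs = 1000): number {
  return baseDelayMs * 2 ** (retry - 1);
}

async function loadWithRetry<T>(load: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const {
    maxRetries = 3,
    baseDelayMs = 1000,
    isRetryable = () => true,
    onRetry = () => {},
    sleep = defaultSleep,
  } = opts;

  for (let retry = 1; ; retry++) {
    try {
      return await load();
    } catch (err) {
      // Give up on non-retryable errors or once the retry budget is spent.
      if (retry > maxRetries || !isRetryable(err)) throw err;
      const delayMs = backoffDelayMs(retry, baseDelayMs);
      await onRetry(err, retry, delayMs);
      await sleep(delayMs);
    }
  }
}
```

A caller could pass an `onRetry` hook that updates a progress message and, when the error is classified as `cache_invalid`, clears the model cache before the next attempt.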

Phase 3: Custom Artifact Source Support

Files: app/store/config.ts, app/components/model-config.tsx, app/client/webllm.ts
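This phase lists no tasks of its own; going by the acceptance criterion "Custom base URL configurable in Settings", one possible shape for the URL handling is sketched below. Everything here is an assumption for illustration: `DEFAULT_ARTIFACT_BASE_URL` is a placeholder rather than the real CDN address, and `normalizeBaseUrl` and `resolveArtifactUrl` are hypothetical helpers.

```typescript
// Placeholder only; the real default CDN URL is not stated in this issue.
const DEFAULT_ARTIFACT_BASE_URL = "https://example.invalid/models/";

// Returns a cleaned base URL ending in "/", or undefined if the input is
// empty, unparsable, or not http(s).
function normalizeBaseUrl(input?: string): string | undefined {
  const trimmed = input?.trim();
  if (!trimmed) return undefined;

  let url: URL;
  try {
    url = new URL(trimmed);
  } catch {
    return undefined;
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return undefined;

  // Drop query and fragment so relative paths resolve predictably.
  url.search = "";
  url.hash = "";
  return url.href.endsWith("/") ? url.href : url.href + "/";
}

// Resolves an artifact path against the user's custom base URL, falling
// back to the default source when no valid override is configured.
function resolveArtifactUrl(relativePath: string, customBaseUrl?: string): string {
  const base = normalizeBaseUrl(customBaseUrl) ?? DEFAULT_ARTIFACT_BASE_URL;
  return new URL(relativePath.replace(/^\/+/, ""), base).toString();
}
```

Falling back silently to the default on an invalid override is a design choice of this sketch; the Settings UI might instead reject the value and show a validation error.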

Phase 4: Documentation

  • Troubleshooting guide with error code explanations
  • Self-hosting setup instructions
  • Updated issue templates with diagnostic fields

Files: docs/TROUBLESHOOTING.md, docs/SELF_HOSTING.md, .github/ISSUE_TEMPLATE/bug_report.md

Acceptance Criteria

  • All model load errors map to defined error codes
  • Retryable errors trigger automatic retry (max 3)
  • Cache corruption triggers automatic clear + retry
  • Custom base URL configurable in Settings
  • Error messages include actionable guidance
  • "Copy Diagnostics" provides complete debug info
  • Documentation covers all error codes and self-hosting

Implementation Details

Full implementation plan available in plan-85.md with:

  • Detailed code examples for each phase
  • Testing strategy (unit, integration, manual)
  • Rollout strategy with risk assessment
  • Success metrics and monitoring approach

Related Issues

Estimated Effort

Time: 3-4 weeks (1 developer)
Priority: High (affects user experience across all surfaces)
Risk: Low-Medium (Phase 1-2), Low (Phase 3-4)
