This document outlines the data architecture of Spacecraft, describing how content flows from the Internet Archive through processing pipelines to the deployment targets.
The Spacecraft data architecture implements a multi-level content flow:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Internet │ │ Processing │ │ Deployment │
│ Archive API │ ──► │ Pipeline │ ──► │ Targets │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Raw Content │ │ Optimized │ │ Target-Specific │
│ Cache │ │ Representations │ │ Formats │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Spacecraft implements a multi-level caching strategy to balance performance, storage, and bandwidth.
The system supports two types of collections:
Static collections:
- Defined through explicit configuration
- Downloaded and processed during build time
- Permanently stored for efficient access
- Example: "prefix": "scifi", "dynamic": false
Dynamic collections:
- Generated on demand based on user searches
- Processed at runtime with temporary caching
- Available for limited time periods
- Example: "prefix": "dyn_query", "dynamic": true
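The two collection types can be modeled with a small configuration record. The following is a minimal sketch, not the project's actual schema: only the `prefix` and `dynamic` keys come from the examples above, while `CollectionConfig`, `parse_collection`, and the `ttl_seconds` field are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CollectionConfig:
    prefix: str
    dynamic: bool
    # Hypothetical field: how long a dynamic collection stays available.
    ttl_seconds: Optional[int] = None

    def is_expired(self, created_at: float, now: float) -> bool:
        """Static collections never expire; dynamic ones expire after ttl_seconds."""
        if not self.dynamic or self.ttl_seconds is None:
            return False
        return now - created_at > self.ttl_seconds


def parse_collection(entry: dict) -> CollectionConfig:
    """Build a config record from a JSON-like dict; 'dynamic' defaults to False."""
    return CollectionConfig(
        prefix=entry["prefix"],
        dynamic=bool(entry.get("dynamic", False)),
        ttl_seconds=entry.get("ttl_seconds"),
    )


scifi = parse_collection({"prefix": "scifi", "dynamic": False})
query = parse_collection({"prefix": "dyn_query", "dynamic": True, "ttl_seconds": 3600})
```

Treating expiry as a property of the configuration keeps the static path free of any time checks, which matches the "permanently stored" behavior described for static collections.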
Content is distributed across multiple storage locations:
- Location: Collections/{prefix}/
- Contains original, unmodified Internet Archive content
- Structured with full directory hierarchy
- Managed via Git LFS for large binary files
- Location: Unity/CraftSpace/Assets/Resources/Collections/{prefix}/
- Contains pre-processed, optimized assets for Unity
- Embedded in Unity WebGL build for immediate availability
- Limited to high-priority collections due to build size constraints
- Location: SvelteKit/BackSpace/static/data/collections/{prefix}/
- Provides server-side assets for web application
- Includes complete metadata and optimized assets
- Deployable to CDN for edge distribution
- Location: SvelteKit/BackSpace/static/data/dynamic/{hash}/
- Temporary storage for runtime-generated collections
- Managed by automated cleanup processes
- TTL (Time To Live) configuration for storage management
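The storage layout and TTL-based cleanup described above can be sketched as follows. The path templates are copied from the list; everything else is an assumption made for illustration: the target keys (`raw`, `unity`, `web`), the function names, and the choice of directory modification time as the age of a dynamic collection.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional

# Path templates taken from the storage locations listed above.
STORAGE_TEMPLATES = {
    "raw": "Collections/{prefix}",
    "unity": "Unity/CraftSpace/Assets/Resources/Collections/{prefix}",
    "web": "SvelteKit/BackSpace/static/data/collections/{prefix}",
}
DYNAMIC_ROOT = "SvelteKit/BackSpace/static/data/dynamic"


def storage_path(repo_root: Path, target: str, prefix: str) -> Path:
    """Resolve the directory for one collection in one storage location."""
    return repo_root / STORAGE_TEMPLATES[target].format(prefix=prefix)


def cleanup_dynamic(repo_root: Path, ttl_seconds: float,
                    now: Optional[float] = None) -> List[str]:
    """Delete dynamic/{hash}/ directories older than the TTL (assumed: by mtime)."""
    now = time.time() if now is None else now
    root = repo_root / DYNAMIC_ROOT
    removed: List[str] = []
    if not root.is_dir():
        return removed
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and now - entry.stat().st_mtime > ttl_seconds:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed
```

A scheduled job could call `cleanup_dynamic(Path("."), ttl_seconds=3600)` periodically; the actual TTL value and scheduling mechanism are not specified in this document.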
Different resolution levels use appropriate caching strategies:
- Ultra-low resolution (1×1, 2×3) embedded directly in JSON
- Always available without additional requests
- Gzipped during transport for efficiency
- Low to medium resolution packed in atlases
- Included in Unity build for priority collections
- Immediately available after application load
- Medium to high resolution served on demand
- Loaded progressively based on distance and visibility
- CDN-distributed for performance
- Google Maps-style tiled approach for high-resolution content
- Multiple zoom levels with appropriate resolution tiles
- Efficient loading of only visible portions at current zoom level
- Essential for detailed maps, manuscripts, and high-resolution images
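To make the tiled approach concrete, here is a minimal sketch of how a client might work out which tiles to request for the visible region at a given zoom level. It assumes a conventional tile pyramid in which zoom 0 covers the whole image with one tile and each further level doubles the resolution; the 256-pixel tile size, the `(zoom, x, y)` addressing, and the function names are assumptions, not details confirmed by this document.

```python
from typing import List, Tuple

TILE_SIZE = 256  # assumed tile edge length in pixels


def max_zoom(width: int, height: int, tile_size: int = TILE_SIZE) -> int:
    """Lowest zoom level whose tile grid covers the image at full resolution."""
    zoom = 0
    while tile_size * (2 ** zoom) < max(width, height):
        zoom += 1
    return zoom


def visible_tiles(viewport: Tuple[int, int, int, int], zoom: int,
                  full_width: int, full_height: int,
                  tile_size: int = TILE_SIZE) -> List[Tuple[int, int, int]]:
    """Return (zoom, x, y) for tiles intersecting the viewport.

    viewport is (left, top, right, bottom) in full-resolution pixel coordinates.
    """
    top_level = max_zoom(full_width, full_height, tile_size)
    # Full-resolution pixels covered by one tile edge at this zoom level.
    span = tile_size * (2 ** (top_level - zoom))
    left, top, right, bottom = viewport
    left, top = max(0, left), max(0, top)
    right, bottom = min(full_width, right), min(full_height, bottom)
    if right <= left or bottom <= top:
        return []  # viewport lies entirely outside the image
    x0, y0 = left // span, top // span
    x1, y1 = (right - 1) // span, (bottom - 1) // span
    return [(zoom, x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
```

For a 4096×4096 image, a viewport covering the top-left 512×256 pixels needs two tiles at the highest zoom level but only a single tile at zoom 0, which is what keeps loading proportional to what is visible rather than to the size of the source image.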
The system supports cache invalidation through:
- Query parameters (?clearcache=true, ?reload=collections)
- Version parameters (?version={hash})
- API endpoints for forced refreshes
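As an illustration of the query-parameter mechanism, the sketch below parses the three parameters named above and decides whether a cached resource should be refetched. The parameter names come from the list; the function names, the returned dictionary shape, and the exact refetch rules are assumptions.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def cache_directives(url: str) -> dict:
    """Extract cache-invalidation directives from a URL's query string."""
    params = parse_qs(urlparse(url).query)
    return {
        # ?clearcache=true drops everything.
        "clear_all": params.get("clearcache", [""])[-1].lower() == "true",
        # ?reload=collections names specific resources to refresh.
        "reload": set(params.get("reload", [])),
        # ?version={hash} pins the expected content version.
        "version": params.get("version", [None])[-1],
    }


def should_refetch(directives: dict, resource: str,
                   cached_version: Optional[str]) -> bool:
    """Refetch if explicitly requested or if the pinned version differs."""
    if directives["clear_all"] or resource in directives["reload"]:
        return True
    version = directives["version"]
    return version is not None and version != cached_version
```

For example, `cache_directives("https://example.org/app?reload=collections")` marks only the `collections` resource for reload and leaves other cached data alone.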
Content versioning is handled through:
- ETag tracking for Internet Archive content
- Timestamp-based change detection
- Hash-based verification for cached content
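The three versioning signals can be combined in a simple check, sketched below under stated assumptions: the ETag is compared first, the timestamp is used as a fallback when no ETag is available, and a SHA-256 digest verifies the cached bytes. The `CacheRecord` type, the function names, the precedence order, and the choice of SHA-256 are illustrative, not taken from the project.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheRecord:
    etag: Optional[str]             # ETag seen when the content was fetched
    last_modified: Optional[float]  # upstream timestamp at fetch time
    sha256: str                     # digest of the cached bytes


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def remote_changed(record: CacheRecord, remote_etag: Optional[str],
                   remote_last_modified: Optional[float]) -> bool:
    """Prefer ETag comparison, fall back to timestamps, else assume changed."""
    if record.etag is not None and remote_etag is not None:
        return record.etag != remote_etag
    if record.last_modified is not None and remote_last_modified is not None:
        return remote_last_modified > record.last_modified
    return True


def cache_intact(record: CacheRecord, cached_bytes: bytes) -> bool:
    """Verify that the cached bytes still match the recorded digest."""
    return sha256_of(cached_bytes) == record.sha256
```

Checking the ETag before the timestamp avoids false positives when upstream metadata is touched without the content changing, while the hash guards against local corruption rather than upstream changes.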
- Embedding generation for items and collections
- Semantic search capabilities using vector similarity
- Clustering and relationships based on content semantics
- Enables natural language queries against collection content
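Vector-similarity search over item embeddings can be sketched in a few lines. This example assumes embeddings are already available as plain lists of floats keyed by item identifier and uses cosine similarity as the ranking measure; the document does not specify the embedding model, the similarity measure, or any index structure, so all names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query: Sequence[float],
                embeddings: Dict[str, Sequence[float]],
                k: int = 3) -> List[Tuple[str, float]]:
    """Rank items by similarity to the query vector and keep the best k."""
    scored = [(item_id, cosine_similarity(query, vec))
              for item_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A brute-force scan like this is adequate for small collections; larger ones would typically need an approximate nearest-neighbor index, which is outside the scope of this sketch.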
- Automatic categorization of items based on content
- Visual similarity detection across collections
- Content recommendation systems
- Enhanced metadata extraction from raw content
- Special LOD systems for VR environments
- Gaze-directed progressive loading
- Performance-optimized data structures for XR rendering
- Spatial audio integration with content
- Multi-user annotation and organization capabilities
- Shared collections and curation tools
- Real-time synchronization of user-generated content
- Integration with Internet Archive's community features
The data architecture is supported by a processing pipeline with five stages:
- Collection Registration: Define collections via Internet Archive queries
- Content Acquisition: Download raw content from Internet Archive
- Multi-Resolution Processing: Generate optimized representations
- Atlas Generation: Pack items into efficient texture atlases
- Deployment: Distribute to appropriate targets (Unity, Web, CDN)
This pipeline can be run incrementally (updating only changed content) or as a complete rebuild.
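The difference between an incremental run and a complete rebuild can be illustrated with a content-fingerprint manifest. This is a sketch under assumptions: the manifest format, the use of SHA-256 fingerprints, and the names `plan_run` and `run_pipeline` are hypothetical, and `process` stands in for the acquisition, multi-resolution, and atlas stages listed above.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def plan_run(items: Dict[str, bytes], manifest: Dict[str, str],
             full_rebuild: bool = False) -> List[str]:
    """Return the ids to process: everything, or only new and changed items."""
    if full_rebuild:
        return sorted(items)
    return sorted(item_id for item_id, data in items.items()
                  if manifest.get(item_id) != fingerprint(data))


def run_pipeline(items: Dict[str, bytes], manifest: Dict[str, str],
                 process: Callable[[str, bytes], None],
                 full_rebuild: bool = False) -> List[str]:
    """Process the planned items and record their fingerprints in the manifest."""
    todo = plan_run(items, manifest, full_rebuild)
    for item_id in todo:
        process(item_id, items[item_id])
        manifest[item_id] = fingerprint(items[item_id])
    return todo
```

With an empty manifest the first run processes every item; later runs touch only items whose content changed, and passing `full_rebuild=True` ignores the manifest entirely.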
The data architecture of Spacecraft is designed to balance immediate visual response with efficient bandwidth usage, so users can explore massive digital collections with minimal waiting time while preserving the richness of the Internet Archive's content.