Design and implement the Drawer (formerly Librarian) service for organized repository and analysis file storage in S3.
The Drawer maintains a consistent three-directory structure for each repository analysis:
repos/{owner}/{repo}/{commit_sha}/
├── workingcopy/ # Clean source code at specific commit
│ ├── src/
│ ├── README.md
│ └── ... (all source files)
├── repohistory/ # Full git repository with history
│ ├── .git/ # Complete git metadata
│ ├── src/
│ └── ... (full repository)
└── analysis/ # Generated analysis files (created by Analyst)
  ├── report.md
  ├── metrics.json
  ├── summary.txt
  └── ... (analysis outputs)
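The layout above maps directly onto S3 key prefixes. A minimal sketch of a prefix builder follows; the names `RepoRef` and `directory_prefix` are illustrative assumptions, not part of the existing Drawer code.

```python
from dataclasses import dataclass

# The three directories defined by the Drawer layout.
DIRECTORIES = ("workingcopy", "repohistory", "analysis")


@dataclass(frozen=True)
class RepoRef:
    """Identifies one repository analysis (hypothetical helper type)."""
    owner: str
    repo: str
    commit_sha: str


def directory_prefix(ref: RepoRef, directory: str) -> str:
    """Return the S3 key prefix for one of the three Drawer directories."""
    if directory not in DIRECTORIES:
        raise ValueError(f"unknown directory: {directory!r}")
    return f"repos/{ref.owner}/{ref.repo}/{ref.commit_sha}/{directory}/"
```

Rejecting unknown directory names keeps every caller inside the three-directory structure.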
- Purpose: Clean source code snapshot for analysis
- Content: Source files without .git metadata
- Creation: Generated via `git archive` at specific commit
- Usage: Primary input for Strands analysis
- Benefits: No git metadata clutter, exact commit state
- Purpose: Complete git repository with full history
- Content: Full git clone including .git directory
- Creation: Standard `git clone` with all history
- Usage: Git log analysis, historical insights, blame information
- Benefits: Complete project history available for analysis
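The two creation steps described so far (a full clone, then a clean snapshot via `git archive`) could be scripted roughly as below. This is a sketch that assumes a `git` binary is available in the runtime; `clone_url`, `work_dir`, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def clone_command(clone_url: str, dest: Path) -> list[str]:
    """Full clone, including .git metadata and all history."""
    return ["git", "clone", clone_url, str(dest)]


def archive_command(commit_sha: str, output_zip: Path) -> list[str]:
    """Snapshot of the tree at commit_sha, without .git metadata."""
    return ["git", "archive", "--format=zip", f"--output={output_zip}", commit_sha]


def prepare_repository(clone_url: str, commit_sha: str, work_dir: Path) -> tuple[Path, Path]:
    """Clone the repository, then archive the requested commit next to it."""
    work_dir = work_dir.resolve()  # absolute paths, since git archive runs in repo_dir
    repo_dir = work_dir / "repohistory"
    snapshot_zip = work_dir / "workingcopy.zip"
    subprocess.run(clone_command(clone_url, repo_dir), check=True)
    subprocess.run(archive_command(commit_sha, snapshot_zip), cwd=repo_dir, check=True)
    return repo_dir, snapshot_zip
```

Because `git archive` can emit ZIP directly, the working copy needs no separate compression step in this sketch.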
- Purpose: Store all generated analysis outputs
- Content: Reports, metrics, summaries, visualizations
- Creation: Populated by Analyst Lambda after processing
- Usage: Source for pull request content, historical tracking
- Benefits: Organized analysis artifacts, easy retrieval
- Bucket Name: `coderipple-drawer`
- Region: us-east-1 (consistent with other resources)
- Access: Private bucket with IAM role-based access
- Versioning: Disabled (commit SHA provides versioning)
- Lifecycle: Planned transition to a lower-cost storage class after 30 days
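A possible way to apply the bucket settings above with boto3 is sketched here. The target storage class (`STANDARD_IA`) and the rule ID are assumptions, since the text only says "cheaper storage"; `s3_client` is expected to be a `boto3.client("s3")` instance. Versioning needs no call because new buckets are unversioned by default.

```python
def lifecycle_rules(days: int = 30, storage_class: str = "STANDARD_IA") -> dict:
    """Build a lifecycle configuration that transitions objects under repos/."""
    return {
        "Rules": [
            {
                "ID": "transition-repos",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": "repos/"},
                "Transitions": [{"Days": days, "StorageClass": storage_class}],
            }
        ]
    }


def configure_bucket(s3_client, bucket: str = "coderipple-drawer") -> None:
    """Block all public access and install the lifecycle rule."""
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=lifecycle_rules(),
    )
```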
- Compression: ZIP format for efficient storage and transfer
- Atomic uploads: Each directory is uploaded as a single operation
- Metadata: Include commit SHA, timestamp, repository info
- Error handling: Retry logic for failed uploads
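The upload behaviour listed above might look like the following sketch: one ZIP per directory, uploaded as a single object with metadata, retried with exponential backoff. Function names, the retry count, and the backoff schedule are assumptions. The broad `except Exception` is a simplification; a real implementation would likely catch boto3's upload error types specifically.

```python
import time
import zipfile
from pathlib import Path


def zip_directory(src_dir: Path, zip_path: Path) -> Path:
    """Compress every file under src_dir (dotfiles included) into one ZIP."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive unpacks cleanly.
                archive.write(path, path.relative_to(src_dir).as_posix())
    return zip_path


def upload_archive(s3_client, bucket: str, key: str, zip_path: Path,
                   metadata: dict[str, str], attempts: int = 3,
                   delay: float = 1.0) -> None:
    """Upload one archive as a single object, retrying on failure."""
    for attempt in range(1, attempts + 1):
        try:
            s3_client.upload_file(
                str(zip_path), bucket, key,
                ExtraArgs={"Metadata": metadata},  # e.g. commit SHA, timestamp
            )
            return
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            time.sleep(delay * 2 ** (attempt - 1))  # exponential backoff
```

Since each directory becomes one object, a reader sees either the complete archive or nothing, which is one way to get the atomicity the list calls for.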
- Selective download: Fetch only needed directory (workingcopy vs repohistory)
- Streaming: Support large repository downloads
- Caching: Lambda /tmp caching for repeated access
- Cleanup: Automatic /tmp cleanup after processing
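The download behaviour above (fetch one directory, cache it under `/tmp`, clean up afterwards) can be sketched as follows. It assumes each directory is stored as a single `.zip` object; the cache root `/tmp/drawer` and the function names are hypothetical. boto3's `download_file` writes to disk in chunks, so the archive is not held in memory.

```python
import shutil
import zipfile
from pathlib import Path

CACHE_ROOT = Path("/tmp/drawer")  # hypothetical cache location inside Lambda /tmp


def fetch_directory(s3_client, bucket: str, key: str,
                    cache_root: Path = CACHE_ROOT) -> Path:
    """Download and unpack one directory archive, reusing a cached copy if present."""
    if not key.endswith(".zip"):
        raise ValueError("expected a .zip object key")
    target = cache_root / key.removesuffix(".zip")
    if target.is_dir():
        return target  # cache hit: skip the download entirely

    target.parent.mkdir(parents=True, exist_ok=True)
    zip_path = cache_root / key
    s3_client.download_file(bucket, key, str(zip_path))

    # Unpack into a staging directory first so a failed extraction
    # is never mistaken for a valid cache entry.
    staging = target.parent / (target.name + ".partial")
    shutil.rmtree(staging, ignore_errors=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(staging)
    staging.rename(target)
    zip_path.unlink()
    return target


def cleanup(cache_root: Path = CACHE_ROOT) -> None:
    """Remove all cached content once processing has finished."""
    shutil.rmtree(cache_root, ignore_errors=True)
```

Requesting only the `workingcopy` key, for example, leaves the larger `repohistory` archive untouched.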
- Upload workingcopy and repohistory after git operations
- Provide S3 locations in "repo_ready" event
- Download workingcopy for analysis
- Upload analysis results to analysis directory
- Update "analysis_complete" event with file locations
- Download analysis files for pull request creation
- Access repository metadata for PR context
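The hand-offs above pass S3 locations between components through the "repo_ready" and "analysis_complete" events. The payload shape below is purely illustrative: every field name is an assumption, since the event schema is not specified here.

```python
def repo_ready_detail(owner: str, repo: str, commit_sha: str,
                      bucket: str = "coderipple-drawer") -> dict:
    """Build a hypothetical "repo_ready" payload carrying S3 locations."""
    base = f"s3://{bucket}/repos/{owner}/{repo}/{commit_sha}"
    return {
        "event": "repo_ready",
        "repository": {"owner": owner, "name": repo, "commit_sha": commit_sha},
        "locations": {
            "workingcopy": f"{base}/workingcopy/",
            "repohistory": f"{base}/repohistory/",
        },
    }


def analysis_complete_detail(repo_ready: dict, analysis_files: list[str],
                             bucket: str = "coderipple-drawer") -> dict:
    """Derive a hypothetical "analysis_complete" payload from "repo_ready"."""
    info = repo_ready["repository"]
    base = (f"s3://{bucket}/repos/{info['owner']}/{info['name']}/"
            f"{info['commit_sha']}/analysis")
    return {
        "event": "analysis_complete",
        "repository": info,
        "locations": {"analysis": [f"{base}/{name}" for name in analysis_files]},
    }
```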
- Parallel uploads: Upload directories concurrently
- Compression: Reduce transfer time and storage costs
- Selective operations: Only transfer needed directories
- Connection pooling: Reuse S3 connections
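Concurrent uploads can be sketched with a thread pool, as below. The `upload` callable is a placeholder for whatever function performs a single upload; passing one shared S3 client into it would also let the uploads reuse that client's connection pool. The function name and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def upload_all(upload: Callable[[str, str], None], jobs: dict[str, str],
               max_workers: int = 2) -> None:
    """Run upload(local_path, key) for every job concurrently.

    jobs maps S3 key -> local archive path.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(upload, path, key) for key, path in jobs.items()]
        for future in futures:
            future.result()  # re-raises the first upload error, if any
```

Two workers match the two directories (workingcopy and repohistory) uploaded after the git operations.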
- Private bucket: No public access to repository content
- IAM roles: Least privilege access per Lambda function
- Encryption: Server-side encryption enabled
- Access logging: Track all bucket operations
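As an illustration of least-privilege access, the sketch below builds an IAM policy document for the Analyst Lambda: read-only on workingcopy objects, write-only on analysis objects. The statement IDs and resource patterns are assumptions about how keys are named. `SSE_ARGS` shows one way to request server-side encryption on an upload, using S3-managed keys as an assumed choice.

```python
# ExtraArgs for boto3 upload_file requesting S3-managed server-side encryption.
SSE_ARGS = {"ServerSideEncryption": "AES256"}


def analyst_policy(bucket: str = "coderipple-drawer") -> dict:
    """Build a hypothetical least-privilege policy for the Analyst Lambda."""
    objects = f"arn:aws:s3:::{bucket}/repos/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadWorkingCopy",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{objects}/workingcopy*"],
            },
            {
                "Sid": "WriteAnalysis",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"{objects}/analysis*"],
            },
        ],
    }
```

The Analyst gets no access to `repohistory` in this sketch because the listed integration only has it download the working copy.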
- S3 bucket with proper IAM policies
- boto3 for S3 operations
- zipfile for compression operations
Ready for S3 bucket creation and Drawer service implementation.