Merge pull request #71 from bfansports/findings/ai-audit-2026-02-17

mhugot · web-flow · commit ce0e591387ef · 2026-04-24T16:15:24.000+02:00
AI Audit: Findings and Recommendations
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -8,11 +8,13 @@ CloudTranscode is bFAN's distributed media transcoding pipeline. It's a set of P
 
 - **Language**: PHP 7+ (legacy codebase, but clean)
 - **Container**: Docker (ECS deployment)
-- **FFmpeg**: 4.2 (video/image processing)
+- **FFmpeg**: 4.2 (video/image processing) — **EOL, from 2019. See Security Findings.**
 - **ImageMagick**: convert commands for image transcoding
 - **AWS Services**: Step Functions (SFN), S3, ECS, EC2, IAM
+- **Orchestration**: AWS Step Functions state machines (see `state_machines/`)
 - **SDK**: CloudProcessingEngine-SDK (bFAN fork) for activity polling and lifecycle
 - **Dependencies**: AWS SDK for PHP 3.x, JSON Schema validation
+- **Monitoring**: None configured — no CloudWatch alarms, dashboards, or custom metrics. See Security Findings.
 
 ## Quick Start
 
@@ -81,11 +83,12 @@ Pass custom client class to activity workers via `-C <client class path>` option
 
 ## Key Patterns
 
+- **Step Functions orchestration**: Workflow is defined in `state_machines/` JSON. SFN distributes tasks to activity workers, handles retries and failure routing. This is the control plane — workers are the data plane.
 - **Activity polling**: Workers use long-polling to fetch tasks from AWS SFN
-- **Sequential output processing**: One TranscodeAssetActivity worker processes all outputs in the `output_assets` array sequentially, not in parallel. To parallelize, split the workflow.
+- **Sequential output processing**: One TranscodeAssetActivity worker processes all outputs in the `output_assets` array sequentially, not in parallel. This is a **performance bottleneck** — a 10-output job ties up one worker for the full duration. To parallelize, split the workflow into separate SFN executions or use Map states.
 - **Stateless workers**: Workers are horizontally scalable Docker containers. State lives in S3 and SFN.
 - **Preset-based transcoding**: FFmpeg commands can be templated using presets (e.g., `360p-4.3-generic`)
-- **Custom FFmpeg commands**: JSON input supports raw FFmpeg command strings for advanced use cases
+- **Custom FFmpeg commands**: JSON input supports raw FFmpeg command strings for advanced use cases. **WARNING: command injection risk — see Security Findings.**
 - **Watermarking**: Overlay images on video with custom position, opacity, size
 - **HTTP input**: Workers can pull source files from HTTP/S URLs instead of S3
 
@@ -107,9 +110,10 @@ Pass custom client class to activity workers via `-C <client class path>` option
 ## Deployment
 
 **Current setup:**
-- Docker image built from `Dockerfile` and pushed to ECR: `501431420968.dkr.ecr.eu-west-1.amazonaws.com/sportarc/cloudtranscode:4.2`
+- Docker image built from `Dockerfile` and pushed to ECR (eu-west-1)
 - ECS cluster runs workers as tasks
 - Each worker polls a specific SFN activity ARN
+- **Note**: AWS account ID `501431420968` is hardcoded in the Dockerfile/configs. Use an environment variable or SSM parameter instead.
 
 **Deployment steps:**
 1. Build Docker image: `docker build -t <ecr-repo>:tag .`
@@ -129,16 +133,59 @@ Pass custom client class to activity workers via `-C <client class path>` option
 - Check S3 output buckets for transcoded files
 - Review CloudWatch Logs for worker output
 
+## Security Findings
+
+> AI audit — 2026-02-17. These findings should be tracked as issues and resolved.
+
+### CRITICAL — Command Injection via FFmpeg/ImageMagick
+
+The transcoder code passes user-supplied JSON parameters (codec, size, preset names, custom command strings) into FFmpeg and ImageMagick shell commands **without escaping or sanitization**. A crafted `output_assets` payload could inject arbitrary shell commands.
+
+**Affected files**: `src/activities/transcoders/` — anywhere parameters are interpolated into shell commands.
+
+**Fix**: Use `escapeshellarg()` on every user-supplied parameter before interpolation. Better: build argument arrays and use `proc_open()` instead of `exec()`/`shell_exec()` with string concatenation. Validate inputs against an allowlist of known codecs, presets, and sizes.
+
+### HIGH — No Rate Limiting
+
+There is no throttling on Step Functions task submission. A misconfigured client or runaway automation can flood the pipeline with jobs, exhausting ECS capacity and S3 write throughput.
+
+**Fix**: Add SFN execution concurrency limits, or use an SQS queue with a controlled consumer rate in front of the pipeline.
+
+### MEDIUM — Hardcoded AWS Account ID
+
+AWS account ID `501431420968` appears in ECR URIs and potentially in SFN ARNs throughout the codebase. This leaks infrastructure details and makes multi-account deployment impossible.
+
+**Fix**: Replace with environment variables, SSM parameters, or CDK/CloudFormation references.
+
+### MEDIUM — FFmpeg 4.2 (2019) — End of Life
+
+FFmpeg 4.2 is from August 2019 and no longer receives security patches. Known CVEs in older FFmpeg versions include heap overflows in demuxers and decoders that can be triggered by malformed input media.
+
+**Fix**: Upgrade the `sportarc/ffmpeg` and `sportarc/cloudtranscode-base` Docker images to FFmpeg 6.x or 7.x. Test transcoding presets for compatibility.
+
+### MEDIUM — No Temp File Encryption
+
+Transcoding temp files (downloaded source, intermediate outputs) are stored on local ECS disk unencrypted. If the disk is an EBS volume, data at rest is exposed unless the volume itself has encryption enabled.
+
+**Fix**: Ensure ECS instances use encrypted EBS volumes. For sensitive media, consider encrypting temp files at the application level or using instance store with dm-crypt.
+
+### LOW — No CloudWatch Monitoring
+
+No CloudWatch alarms, custom metrics, or dashboards are configured. Worker failures, SFN execution errors, and S3 throughput issues are invisible without manual console checks.
+
+**Fix**: Add CloudWatch alarms for SFN execution failures, ECS task stopped events, and worker heartbeat gaps. Publish custom metrics for transcode duration, queue depth, and error rates.
+
 ## Gotchas
 
-- **Sequential processing**: TranscodeAssetActivity processes all outputs sequentially. For parallel transcoding of multiple outputs, you must split the workflow or run multiple workers with separate SFN tasks.
+- **Sequential processing bottleneck**: TranscodeAssetActivity processes all outputs sequentially. A single job with many outputs blocks the worker. Split into parallel SFN branches or use Map states.
 - **Docker base image dependency**: This repo depends on two SportArchive Docker images (`sportarc/ffmpeg`, `sportarc/cloudtranscode-base`). If those images are updated, rebuild this image.
-- **FFmpeg version**: Locked to 4.2. Upgrading FFmpeg requires updating the base image.
+- **FFmpeg version**: Locked to 4.2 (2019, EOL). Upgrading FFmpeg requires updating the base image and retesting all presets.
 - **Client interface requirement**: For production use, you MUST implement a custom client interface class and extend the Dockerfile to include it. Without it, workers run but don't notify client apps of progress/completion.
 - **AWS SFN long polling**: Workers block on GetActivityTask calls (long polling). If AWS SFN is unavailable, workers will hang until timeout.
-- **Temp disk space**: Transcoding uses local disk for temporary files. Ensure ECS instances or Docker volumes have sufficient space for large video files.
+- **Temp disk space**: Transcoding uses local disk for temporary files. Ensure ECS instances or Docker volumes have sufficient space for large video files. Temp files are **not encrypted** at the application level.
 - **Presets location**: The `presets/` directory in this repo may be deprecated. Check if CloudTranscode-FFMpeg-presets is the canonical source.
+- **Hardcoded account ID**: `501431420968` is baked into ECR URIs and possibly SFN ARNs. Must be parameterized for multi-account use.
 
 <!-- Ask: What happens if a worker crashes mid-transcode? Does SFN retry, or is the task lost? Are there heartbeat intervals configured? -->
 <!-- Ask: How are FFmpeg presets loaded — from this repo's presets/ dir, or from CloudTranscode-FFMpeg-presets? -->
-<!-- Ask: What's the relationship between this repo and CloudTranscode-Lambda? When is Lambda used vs ECS workers? -->
+<!-- Ask: What's the relationship between this repo and CloudTranscode-Lambda? When is Lambda used vs ECS workers? -->
diff --git a/FINDINGS.md b/FINDINGS.md
@@ -0,0 +1,274 @@
+# Findings and Recommendations
+
+**Audit Date:** 2026-02-17
+**Auditor:** AI Agent (DevOps Engineer persona)
+
+## Critical Issues
+
+### 1. Command Injection Vulnerabilities in Custom Commands
+**File:** `src/activities/transcoders/VideoTranscoder.php:201`, `src/activities/transcoders/ImageTranscoder.php:202`
+**Issue:** Custom FFmpeg/ImageMagick commands allow direct string replacement without proper validation. While `escapeshellarg()` is used for input file paths, output paths and other parameters are inserted directly via regex replacement, allowing potential command injection.
+**Risk:** Remote code execution if attacker controls JSON input parameters.
+**Recommendation:**
+- Validate all custom commands against a whitelist of allowed patterns
+- Escape ALL parameters passed to shell commands, not just input paths
+- Consider using array-based command construction instead of string concatenation
+- Implement strict JSON schema validation for all input parameters
+
+### 2. Missing Input Validation in craft_convert_cmd
+**File:** `src/activities/transcoders/ImageTranscoder.php:163-189`
+**Issue:** ImageMagick convert command parameters (`quality`, `resize`, `thumbnail`, `crop`) are passed directly to shell without escaping or validation.
+**Risk:** Command injection through malicious JSON input parameters.
+**Recommendation:** Use `escapeshellarg()` for ALL parameters, implement numeric validation for quality, validate resize/crop patterns against regex.
+
+### 3. Hardcoded AWS Account ID in Docker Base Image
+**File:** `Dockerfile:1`, throughout README
+**Issue:** Hardcoded ECR registry `501431420968.dkr.ecr.eu-west-1.amazonaws.com` exposes AWS account ID.
+**Risk:** Information disclosure, potential security targeting.
+**Recommendation:** Use build arguments or environment variables for registry URLs, document as configuration requirement.
+
+### 4. No Rate Limiting or Resource Controls
+**Issue:** No mechanisms to prevent resource exhaustion attacks through large/malicious files.
+**Risk:** DoS through memory/CPU exhaustion, runaway costs from excessive processing.
+**Recommendation:**
+- Implement file size limits before processing
+- Add CPU/memory limits in ECS task definitions
+- Implement request rate limiting at Step Functions level
+- Add CloudWatch alarms for abnormal resource usage
+
+## High Priority
+
+### 5. Insecure Composer Download Over HTTP
+**File:** `Makefile:9`
+**Issue:** Downloading Composer installer via curl piped to PHP without signature verification.
+**Risk:** Supply chain attack if getcomposer.org is compromised.
+**Recommendation:** Download Composer installer with signature verification as per official documentation.
+
+### 6. Missing IAM Permission Boundaries
+**File:** `CLAUDE.md:99-101`
+**Issue:** IAM permissions documented are broad, no mention of least privilege or permission boundaries.
+**Risk:** Excessive permissions, potential for privilege escalation.
+**Recommendation:**
+- Document minimum required permissions
+- Implement IAM permission boundaries
+- Use separate roles for different activities
+- Add bucket-specific resource ARNs instead of wildcards
+
+### 7. No Encryption for Temporary Files
+**File:** `src/activities/BasicActivity.php:52`
+**Issue:** Temporary files stored in `/tmp/CloudTranscode/` without encryption.
+**Risk:** Sensitive media content exposed on disk, persists after container crashes.
+**Recommendation:**
+- Use encrypted EBS volumes for ECS instances
+- Implement secure deletion of temporary files
+- Consider using AWS EFS with encryption for shared storage
+- Add cleanup handlers for abnormal termination
+
+### 8. Insufficient Error Handling and Logging
+**Files:** Throughout codebase
+**Issue:** Errors logged but no structured logging, no CloudWatch metrics, limited monitoring capabilities.
+**Risk:** Difficult incident response, no alerting on failures, poor observability.
+**Recommendation:**
+- Implement structured JSON logging
+- Add CloudWatch custom metrics for job status, processing time, errors
+- Create CloudWatch dashboards and alarms
+- Add distributed tracing with X-Ray
+
+### 9. No Health Checks or Circuit Breakers
+**Issue:** No health check endpoints, no circuit breakers for S3/SFN failures.
+**Risk:** Cascading failures, difficult auto-recovery, poor resilience.
+**Recommendation:**
+- Add health check endpoints for ECS
+- Implement exponential backoff for AWS API calls
+- Add circuit breakers for external dependencies
+- Implement graceful degradation
+
+### 10. Outdated Dependencies
+**File:** `composer.json:3-5`
+**Issue:** Using AWS SDK v3.* (any version), json-schema ~1.3 (very old), FFmpeg 4.2 (from 2019).
+**Risk:** Missing security patches, compatibility issues, performance problems.
+**Recommendation:**
+- Pin specific versions of dependencies
+- Update AWS SDK to latest v3 version
+- Update json-schema to latest version
+- Consider updating FFmpeg to 6.x series
+
+## Medium Priority
+
+### 11. No Input File Type Validation
+**File:** `src/activities/ValidateAssetActivity.php`
+**Issue:** While mime type detection exists, no validation against allowed file types before processing.
+**Risk:** Processing malicious files, potential security exploits in FFmpeg/ImageMagick.
+**Recommendation:** Implement strict whitelist of allowed mime types and file extensions.
+
+### 12. Missing Disaster Recovery Documentation
+**Issue:** No documented DR procedures, backup strategies, or RTO/RPO targets.
+**Risk:** Extended downtime during incidents, data loss.
+**Recommendation:**
+- Document DR procedures
+- Implement automated backups of state machines
+- Add multi-region failover capability
+- Document RTO/RPO requirements
+
+### 13. No Cost Controls or Budget Alerts
+**Issue:** No mechanisms to prevent runaway costs from excessive transcoding.
+**Risk:** Unexpected AWS bills, budget overruns.
+**Recommendation:**
+- Implement AWS Budgets with alerts
+- Add job quotas per customer/time period
+- Monitor and alert on abnormal usage patterns
+- Implement cost allocation tags
+
+### 14. Sequential Processing Bottleneck
+**File:** `CLAUDE.md:134`, `src/activities/TranscodeAssetActivity.php:60-91`
+**Issue:** All outputs processed sequentially by single worker, no parallelization.
+**Risk:** Poor performance, increased costs from longer running instances.
+**Recommendation:**
+- Implement parallel processing using Step Functions Map state
+- Split large jobs into smaller parallel tasks
+- Add job priority queues
+
+### 15. No Retry Logic for Transient Failures
+**Issue:** No documented retry strategy for S3/network failures.
+**Risk:** Job failures from transient issues, manual intervention required.
+**Recommendation:**
+- Implement exponential backoff retry logic
+- Add Step Functions retry configuration
+- Distinguish between transient and permanent failures
+
+## Low Priority
+
+### 16. GitHub Secrets in Plain Workflow
+**File:** `.github/workflows/github-backup.yml:15-16`
+**Issue:** Using legacy secret names, no OIDC authentication.
+**Risk:** Long-lived credentials, potential exposure.
+**Recommendation:** Migrate to GitHub OIDC provider for AWS authentication.
+
+### 17. Missing Container Security Scanning
+**Issue:** No container vulnerability scanning in CI/CD pipeline.
+**Risk:** Deploying containers with known vulnerabilities.
+**Recommendation:**
+- Add Trivy or similar scanner to CI/CD
+- Implement ECR image scanning
+- Add security gates to prevent vulnerable deployments
+
+### 18. No API Versioning Strategy
+**Issue:** No versioning for Step Functions state machines or input/output schemas.
+**Risk:** Breaking changes affecting clients, difficult rollbacks.
+**Recommendation:**
+- Implement semantic versioning for state machines
+- Version input/output JSON schemas
+- Document breaking changes
+
+### 19. Incomplete Documentation
+**File:** `CLAUDE.md` - multiple `<!-- Ask: ... -->` sections
+**Issue:** Missing critical documentation about configuration, testing, operations.
+**Risk:** Operational difficulties, onboarding challenges.
+**Recommendation:** Complete all documentation gaps identified in CLAUDE.md.
+
+### 20. No Performance Benchmarks or SLAs
+**Issue:** Old benchmarks from 2016, no current performance metrics or SLAs.
+**Risk:** Unknown performance characteristics, no performance regression detection.
+**Recommendation:**
+- Update benchmark suite for current instance types
+- Define and monitor SLAs
+- Implement performance testing in CI/CD
+
+## Agent Skill Improvements
+
+### 1. Add Security Scanning Capabilities
+The DevOps agent should include automated security scanning as part of infrastructure audits:
+- Container vulnerability scanning
+- Dependency vulnerability checking
+- IAM permission analysis
+- Secret detection in code
+
+### 2. Cost Optimization Analysis
+Enhance the agent to analyze and recommend cost optimizations:
+- Identify over-provisioned resources
+- Recommend reserved capacity or savings plans
+- Analyze usage patterns for right-sizing
+
+### 3. Disaster Recovery Planning
+Add DR assessment capabilities:
+- Validate backup strategies
+- Test recovery procedures
+- Document RTO/RPO requirements
+- Create runbooks for common scenarios
+
+### 4. Performance Profiling
+Include performance analysis tools:
+- Identify bottlenecks
+- Recommend optimization strategies
+- Benchmark against best practices
+
+### 5. Compliance Checking
+Add compliance validation:
+- Check against AWS Well-Architected Framework
+- Validate security best practices
+- Ensure logging/monitoring compliance
+
+## Positive Observations
+
+### 1. Clean Architecture
+- Well-structured object-oriented PHP code
+- Good separation of concerns between activities and transcoders
+- Modular design allowing easy extension
+
+### 2. Docker Containerization
+- Proper use of Docker for deployment
+- Base image strategy for reusability
+- Clear entrypoint configuration
+
+### 3. Flexible Input Options
+- Support for both S3 and HTTP input sources
+- Custom command support for advanced use cases
+- Preset system for common configurations
+
+### 4. Activity Pattern Implementation
+- Proper use of AWS Step Functions activity pattern
+- Heartbeat implementation for long-running tasks
+- Clean integration with CloudProcessingEngine-SDK
+
+### 5. Progress Tracking
+- Real-time progress reporting during transcoding
+- Heartbeat mechanism to prevent timeouts
+- Callback system for status updates
+
+### 6. Error Reporting
+- Comprehensive error messages
+- Proper exception handling in most cases
+- JSON-formatted responses for automation
+
+### 7. S3 Integration
+- Proper use of AWS SDK
+- Support for encryption and storage classes
+- Efficient file transfer handling
+
+### 8. Documentation Foundation
+- CLAUDE.md provides good overview
+- README includes setup instructions
+- Code comments explain key functionality
+
+## Summary
+
+**Total Findings by Priority:**
+- Critical: 4
+- High: 6
+- Medium: 5
+- Low: 5
+- Total Issues: 20
+
+**Immediate Actions Required:**
+1. Fix command injection vulnerabilities in custom command handling
+2. Implement proper input validation and escaping
+3. Add rate limiting and resource controls
+4. Update dependencies and base images
+
+**Strategic Improvements:**
+1. Implement comprehensive monitoring and alerting
+2. Design and document disaster recovery procedures
+3. Optimize for parallel processing
+4. Enhance security posture with scanning and least privilege
+
+This distributed transcoding system has a solid foundation but requires immediate security hardening and operational improvements before production use. The critical command injection vulnerabilities must be addressed immediately, followed by implementing proper monitoring, resource controls, and disaster recovery capabilities.