Skip to content

Commit ce0e591

Browse files
authored
Merge pull request #71 from bfansports/findings/ai-audit-2026-02-17
AI Audit: Findings and Recommendations
2 parents 0154306 + daf03c4 commit ce0e591

2 files changed

Lines changed: 329 additions & 8 deletions

File tree

CLAUDE.md

Lines changed: 55 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -8,11 +8,13 @@ CloudTranscode is bFAN's distributed media transcoding pipeline. It's a set of P
88

99
- **Language**: PHP 7+ (legacy codebase, but clean)
1010
- **Container**: Docker (ECS deployment)
11-
- **FFmpeg**: 4.2 (video/image processing)
11+
- **FFmpeg**: 4.2 (video/image processing)**EOL, from 2019. See Security Findings.**
1212
- **ImageMagick**: convert commands for image transcoding
1313
- **AWS Services**: Step Functions (SFN), S3, ECS, EC2, IAM
14+
- **Orchestration**: AWS Step Functions state machines (see `state_machines/`)
1415
- **SDK**: CloudProcessingEngine-SDK (bFAN fork) for activity polling and lifecycle
1516
- **Dependencies**: AWS SDK for PHP 3.x, JSON Schema validation
17+
- **Monitoring**: None configured — no CloudWatch alarms, dashboards, or custom metrics. See Security Findings.
1618

1719
## Quick Start
1820

@@ -81,11 +83,12 @@ Pass custom client class to activity workers via `-C <client class path>` option
8183

8284
## Key Patterns
8385

86+
- **Step Functions orchestration**: Workflow is defined in `state_machines/` JSON. SFN distributes tasks to activity workers, handles retries and failure routing. This is the control plane — workers are the data plane.
8487
- **Activity polling**: Workers use long-polling to fetch tasks from AWS SFN
85-
- **Sequential output processing**: One TranscodeAssetActivity worker processes all outputs in the `output_assets` array sequentially, not in parallel. To parallelize, split the workflow.
88+
- **Sequential output processing**: One TranscodeAssetActivity worker processes all outputs in the `output_assets` array sequentially, not in parallel. This is a **performance bottleneck** — a 10-output job ties up one worker for the full duration. To parallelize, split the workflow into separate SFN executions or use Map states.
8689
- **Stateless workers**: Workers are horizontally scalable Docker containers. State lives in S3 and SFN.
8790
- **Preset-based transcoding**: FFmpeg commands can be templated using presets (e.g., `360p-4.3-generic`)
88-
- **Custom FFmpeg commands**: JSON input supports raw FFmpeg command strings for advanced use cases
91+
- **Custom FFmpeg commands**: JSON input supports raw FFmpeg command strings for advanced use cases. **WARNING: command injection risk — see Security Findings.**
8992
- **Watermarking**: Overlay images on video with custom position, opacity, size
9093
- **HTTP input**: Workers can pull source files from HTTP/S URLs instead of S3
9194

@@ -107,9 +110,10 @@ Pass custom client class to activity workers via `-C <client class path>` option
107110
## Deployment
108111

109112
**Current setup:**
110-
- Docker image built from `Dockerfile` and pushed to ECR: `501431420968.dkr.ecr.eu-west-1.amazonaws.com/sportarc/cloudtranscode:4.2`
113+
- Docker image built from `Dockerfile` and pushed to ECR (eu-west-1)
111114
- ECS cluster runs workers as tasks
112115
- Each worker polls a specific SFN activity ARN
116+
- **Note**: AWS account ID `501431420968` is hardcoded in the Dockerfile/configs. Use an environment variable or SSM parameter instead.
113117

114118
**Deployment steps:**
115119
1. Build Docker image: `docker build -t <ecr-repo>:tag .`
@@ -129,16 +133,59 @@ Pass custom client class to activity workers via `-C <client class path>` option
129133
- Check S3 output buckets for transcoded files
130134
- Review CloudWatch Logs for worker output
131135

136+
## Security Findings
137+
138+
> AI audit — 2026-02-17. These findings should be tracked as issues and resolved.
139+
140+
### CRITICAL — Command Injection via FFmpeg/ImageMagick
141+
142+
The transcoder code passes user-supplied JSON parameters (codec, size, preset names, custom command strings) into FFmpeg and ImageMagick shell commands **without escaping or sanitization**. A crafted `output_assets` payload could inject arbitrary shell commands.
143+
144+
**Affected files**: `src/activities/transcoders/` — anywhere parameters are interpolated into shell commands.
145+
146+
**Fix**: Use `escapeshellarg()` on every user-supplied parameter before interpolation. Better: build argument arrays and use `proc_open()` instead of `exec()`/`shell_exec()` with string concatenation. Validate inputs against an allowlist of known codecs, presets, and sizes.
147+
148+
### HIGH — No Rate Limiting
149+
150+
There is no throttling on Step Functions task submission. A misconfigured client or runaway automation can flood the pipeline with jobs, exhausting ECS capacity and S3 write throughput.
151+
152+
**Fix**: Add SFN execution concurrency limits, or use an SQS queue with a controlled consumer rate in front of the pipeline.
153+
154+
### MEDIUM — Hardcoded AWS Account ID
155+
156+
AWS account ID `501431420968` appears in ECR URIs and potentially in SFN ARNs throughout the codebase. This leaks infrastructure details and makes multi-account deployment impossible.
157+
158+
**Fix**: Replace with environment variables, SSM parameters, or CDK/CloudFormation references.
159+
160+
### MEDIUM — FFmpeg 4.2 (2019) — End of Life
161+
162+
FFmpeg 4.2 is from August 2019 and no longer receives security patches. Known CVEs in older FFmpeg versions include heap overflows in demuxers and decoders that can be triggered by malformed input media.
163+
164+
**Fix**: Upgrade the `sportarc/ffmpeg` and `sportarc/cloudtranscode-base` Docker images to FFmpeg 6.x or 7.x. Test transcoding presets for compatibility.
165+
166+
### MEDIUM — No Temp File Encryption
167+
168+
Transcoding temp files (downloaded source, intermediate outputs) are stored on local ECS disk unencrypted. If the disk is an EBS volume, data at rest is exposed unless the volume itself has encryption enabled.
169+
170+
**Fix**: Ensure ECS instances use encrypted EBS volumes. For sensitive media, consider encrypting temp files at the application level or using instance store with dm-crypt.
171+
172+
### LOW — No CloudWatch Monitoring
173+
174+
No CloudWatch alarms, custom metrics, or dashboards are configured. Worker failures, SFN execution errors, and S3 throughput issues are invisible without manual console checks.
175+
176+
**Fix**: Add CloudWatch alarms for SFN execution failures, ECS task stopped events, and worker heartbeat gaps. Publish custom metrics for transcode duration, queue depth, and error rates.
177+
132178
## Gotchas
133179

134-
- **Sequential processing**: TranscodeAssetActivity processes all outputs sequentially. For parallel transcoding of multiple outputs, you must split the workflow or run multiple workers with separate SFN tasks.
180+
- **Sequential processing bottleneck**: TranscodeAssetActivity processes all outputs sequentially. A single job with many outputs blocks the worker. Split into parallel SFN branches or use Map states.
135181
- **Docker base image dependency**: This repo depends on two SportArchive Docker images (`sportarc/ffmpeg`, `sportarc/cloudtranscode-base`). If those images are updated, rebuild this image.
136-
- **FFmpeg version**: Locked to 4.2. Upgrading FFmpeg requires updating the base image.
182+
- **FFmpeg version**: Locked to 4.2 (2019, EOL). Upgrading FFmpeg requires updating the base image and retesting all presets.
137183
- **Client interface requirement**: For production use, you MUST implement a custom client interface class and extend the Dockerfile to include it. Without it, workers run but don't notify client apps of progress/completion.
138184
- **AWS SFN long polling**: Workers block on GetActivityTask calls (long polling). If AWS SFN is unavailable, workers will hang until timeout.
139-
- **Temp disk space**: Transcoding uses local disk for temporary files. Ensure ECS instances or Docker volumes have sufficient space for large video files.
185+
- **Temp disk space**: Transcoding uses local disk for temporary files. Ensure ECS instances or Docker volumes have sufficient space for large video files. Temp files are **not encrypted** at the application level.
140186
- **Presets location**: The `presets/` directory in this repo may be deprecated. Check if CloudTranscode-FFMpeg-presets is the canonical source.
187+
- **Hardcoded account ID**: `501431420968` is baked into ECR URIs and possibly SFN ARNs. Must be parameterized for multi-account use.
141188

142189
<!-- Ask: What happens if a worker crashes mid-transcode? Does SFN retry, or is the task lost? Are there heartbeat intervals configured? -->
143190
<!-- Ask: How are FFmpeg presets loaded — from this repo's presets/ dir, or from CloudTranscode-FFMpeg-presets? -->
144-
<!-- Ask: What's the relationship between this repo and CloudTranscode-Lambda? When is Lambda used vs ECS workers? -->
191+
<!-- Ask: What's the relationship between this repo and CloudTranscode-Lambda? When is Lambda used vs ECS workers? -->

FINDINGS.md

Lines changed: 274 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,274 @@
1+
# Findings and Recommendations
2+
3+
**Audit Date:** 2026-02-17
4+
**Auditor:** AI Agent (DevOps Engineer persona)
5+
6+
## Critical Issues
7+
8+
### 1. Command Injection Vulnerabilities in Custom Commands
9+
**File:** `src/activities/transcoders/VideoTranscoder.php:201`, `src/activities/transcoders/ImageTranscoder.php:202`
10+
**Issue:** Custom FFmpeg/ImageMagick commands allow direct string replacement without proper validation. While `escapeshellarg()` is used for input file paths, output paths and other parameters are inserted directly via regex replacement, allowing potential command injection.
11+
**Risk:** Remote code execution if attacker controls JSON input parameters.
12+
**Recommendation:**
13+
- Validate all custom commands against a whitelist of allowed patterns
14+
- Escape ALL parameters passed to shell commands, not just input paths
15+
- Consider using array-based command construction instead of string concatenation
16+
- Implement strict JSON schema validation for all input parameters
17+
18+
### 2. Missing Input Validation in craft_convert_cmd
19+
**File:** `src/activities/transcoders/ImageTranscoder.php:163-189`
20+
**Issue:** ImageMagick convert command parameters (`quality`, `resize`, `thumbnail`, `crop`) are passed directly to shell without escaping or validation.
21+
**Risk:** Command injection through malicious JSON input parameters.
22+
**Recommendation:** Use `escapeshellarg()` for ALL parameters, implement numeric validation for quality, validate resize/crop patterns against regex.
23+
24+
### 3. Hardcoded AWS Account ID in Docker Base Image
25+
**File:** `Dockerfile:1`, throughout README
26+
**Issue:** Hardcoded ECR registry `501431420968.dkr.ecr.eu-west-1.amazonaws.com` exposes AWS account ID.
27+
**Risk:** Information disclosure, potential security targeting.
28+
**Recommendation:** Use build arguments or environment variables for registry URLs, document as configuration requirement.
29+
30+
### 4. No Rate Limiting or Resource Controls
31+
**Issue:** No mechanisms to prevent resource exhaustion attacks through large/malicious files.
32+
**Risk:** DoS through memory/CPU exhaustion, runaway costs from excessive processing.
33+
**Recommendation:**
34+
- Implement file size limits before processing
35+
- Add CPU/memory limits in ECS task definitions
36+
- Implement request rate limiting at Step Functions level
37+
- Add CloudWatch alarms for abnormal resource usage
38+
39+
## High Priority
40+
41+
### 5. Insecure Composer Download Over HTTP
42+
**File:** `Makefile:9`
43+
**Issue:** Downloading Composer installer via curl piped to PHP without signature verification.
44+
**Risk:** Supply chain attack if getcomposer.org is compromised.
45+
**Recommendation:** Download Composer installer with signature verification as per official documentation.
46+
47+
### 6. Missing IAM Permission Boundaries
48+
**File:** `CLAUDE.md:99-101`
49+
**Issue:** IAM permissions documented are broad, no mention of least privilege or permission boundaries.
50+
**Risk:** Excessive permissions, potential for privilege escalation.
51+
**Recommendation:**
52+
- Document minimum required permissions
53+
- Implement IAM permission boundaries
54+
- Use separate roles for different activities
55+
- Add bucket-specific resource ARNs instead of wildcards
56+
57+
### 7. No Encryption for Temporary Files
58+
**File:** `src/activities/BasicActivity.php:52`
59+
**Issue:** Temporary files stored in `/tmp/CloudTranscode/` without encryption.
60+
**Risk:** Sensitive media content exposed on disk, persists after container crashes.
61+
**Recommendation:**
62+
- Use encrypted EBS volumes for ECS instances
63+
- Implement secure deletion of temporary files
64+
- Consider using AWS EFS with encryption for shared storage
65+
- Add cleanup handlers for abnormal termination
66+
67+
### 8. Insufficient Error Handling and Logging
68+
**Files:** Throughout codebase
69+
**Issue:** Errors logged but no structured logging, no CloudWatch metrics, limited monitoring capabilities.
70+
**Risk:** Difficult incident response, no alerting on failures, poor observability.
71+
**Recommendation:**
72+
- Implement structured JSON logging
73+
- Add CloudWatch custom metrics for job status, processing time, errors
74+
- Create CloudWatch dashboards and alarms
75+
- Add distributed tracing with X-Ray
76+
77+
### 9. No Health Checks or Circuit Breakers
78+
**Issue:** No health check endpoints, no circuit breakers for S3/SFN failures.
79+
**Risk:** Cascading failures, difficult auto-recovery, poor resilience.
80+
**Recommendation:**
81+
- Add health check endpoints for ECS
82+
- Implement exponential backoff for AWS API calls
83+
- Add circuit breakers for external dependencies
84+
- Implement graceful degradation
85+
86+
### 10. Outdated Dependencies
87+
**File:** `composer.json:3-5`
88+
**Issue:** Using AWS SDK v3.* (any version), json-schema ~1.3 (very old), FFmpeg 4.2 (from 2019).
89+
**Risk:** Missing security patches, compatibility issues, performance problems.
90+
**Recommendation:**
91+
- Pin specific versions of dependencies
92+
- Update AWS SDK to latest v3 version
93+
- Update json-schema to latest version
94+
- Consider updating FFmpeg to 6.x series
95+
96+
## Medium Priority
97+
98+
### 11. No Input File Type Validation
99+
**File:** `src/activities/ValidateAssetActivity.php`
100+
**Issue:** While mime type detection exists, no validation against allowed file types before processing.
101+
**Risk:** Processing malicious files, potential security exploits in FFmpeg/ImageMagick.
102+
**Recommendation:** Implement strict whitelist of allowed mime types and file extensions.
103+
104+
### 12. Missing Disaster Recovery Documentation
105+
**Issue:** No documented DR procedures, backup strategies, or RTO/RPO targets.
106+
**Risk:** Extended downtime during incidents, data loss.
107+
**Recommendation:**
108+
- Document DR procedures
109+
- Implement automated backups of state machines
110+
- Add multi-region failover capability
111+
- Document RTO/RPO requirements
112+
113+
### 13. No Cost Controls or Budget Alerts
114+
**Issue:** No mechanisms to prevent runaway costs from excessive transcoding.
115+
**Risk:** Unexpected AWS bills, budget overruns.
116+
**Recommendation:**
117+
- Implement AWS Budgets with alerts
118+
- Add job quotas per customer/time period
119+
- Monitor and alert on abnormal usage patterns
120+
- Implement cost allocation tags
121+
122+
### 14. Sequential Processing Bottleneck
123+
**File:** `CLAUDE.md:134`, `src/activities/TranscodeAssetActivity.php:60-91`
124+
**Issue:** All outputs processed sequentially by single worker, no parallelization.
125+
**Risk:** Poor performance, increased costs from longer running instances.
126+
**Recommendation:**
127+
- Implement parallel processing using Step Functions Map state
128+
- Split large jobs into smaller parallel tasks
129+
- Add job priority queues
130+
131+
### 15. No Retry Logic for Transient Failures
132+
**Issue:** No documented retry strategy for S3/network failures.
133+
**Risk:** Job failures from transient issues, manual intervention required.
134+
**Recommendation:**
135+
- Implement exponential backoff retry logic
136+
- Add Step Functions retry configuration
137+
- Distinguish between transient and permanent failures
138+
139+
## Low Priority
140+
141+
### 16. GitHub Secrets in Plain Workflow
142+
**File:** `.github/workflows/github-backup.yml:15-16`
143+
**Issue:** Using legacy secret names, no OIDC authentication.
144+
**Risk:** Long-lived credentials, potential exposure.
145+
**Recommendation:** Migrate to GitHub OIDC provider for AWS authentication.
146+
147+
### 17. Missing Container Security Scanning
148+
**Issue:** No container vulnerability scanning in CI/CD pipeline.
149+
**Risk:** Deploying containers with known vulnerabilities.
150+
**Recommendation:**
151+
- Add Trivy or similar scanner to CI/CD
152+
- Implement ECR image scanning
153+
- Add security gates to prevent vulnerable deployments
154+
155+
### 18. No API Versioning Strategy
156+
**Issue:** No versioning for Step Functions state machines or input/output schemas.
157+
**Risk:** Breaking changes affecting clients, difficult rollbacks.
158+
**Recommendation:**
159+
- Implement semantic versioning for state machines
160+
- Version input/output JSON schemas
161+
- Document breaking changes
162+
163+
### 19. Incomplete Documentation
164+
**File:** `CLAUDE.md` - multiple `<!-- Ask: ... -->` sections
165+
**Issue:** Missing critical documentation about configuration, testing, operations.
166+
**Risk:** Operational difficulties, onboarding challenges.
167+
**Recommendation:** Complete all documentation gaps identified in CLAUDE.md.
168+
169+
### 20. No Performance Benchmarks or SLAs
170+
**Issue:** Old benchmarks from 2016, no current performance metrics or SLAs.
171+
**Risk:** Unknown performance characteristics, no performance regression detection.
172+
**Recommendation:**
173+
- Update benchmark suite for current instance types
174+
- Define and monitor SLAs
175+
- Implement performance testing in CI/CD
176+
177+
## Agent Skill Improvements
178+
179+
### 1. Add Security Scanning Capabilities
180+
The DevOps agent should include automated security scanning as part of infrastructure audits:
181+
- Container vulnerability scanning
182+
- Dependency vulnerability checking
183+
- IAM permission analysis
184+
- Secret detection in code
185+
186+
### 2. Cost Optimization Analysis
187+
Enhance the agent to analyze and recommend cost optimizations:
188+
- Identify over-provisioned resources
189+
- Recommend reserved capacity or savings plans
190+
- Analyze usage patterns for right-sizing
191+
192+
### 3. Disaster Recovery Planning
193+
Add DR assessment capabilities:
194+
- Validate backup strategies
195+
- Test recovery procedures
196+
- Document RTO/RPO requirements
197+
- Create runbooks for common scenarios
198+
199+
### 4. Performance Profiling
200+
Include performance analysis tools:
201+
- Identify bottlenecks
202+
- Recommend optimization strategies
203+
- Benchmark against best practices
204+
205+
### 5. Compliance Checking
206+
Add compliance validation:
207+
- Check against AWS Well-Architected Framework
208+
- Validate security best practices
209+
- Ensure logging/monitoring compliance
210+
211+
## Positive Observations
212+
213+
### 1. Clean Architecture
214+
- Well-structured object-oriented PHP code
215+
- Good separation of concerns between activities and transcoders
216+
- Modular design allowing easy extension
217+
218+
### 2. Docker Containerization
219+
- Proper use of Docker for deployment
220+
- Base image strategy for reusability
221+
- Clear entrypoint configuration
222+
223+
### 3. Flexible Input Options
224+
- Support for both S3 and HTTP input sources
225+
- Custom command support for advanced use cases
226+
- Preset system for common configurations
227+
228+
### 4. Activity Pattern Implementation
229+
- Proper use of AWS Step Functions activity pattern
230+
- Heartbeat implementation for long-running tasks
231+
- Clean integration with CloudProcessingEngine-SDK
232+
233+
### 5. Progress Tracking
234+
- Real-time progress reporting during transcoding
235+
- Heartbeat mechanism to prevent timeouts
236+
- Callback system for status updates
237+
238+
### 6. Error Reporting
239+
- Comprehensive error messages
240+
- Proper exception handling in most cases
241+
- JSON-formatted responses for automation
242+
243+
### 7. S3 Integration
244+
- Proper use of AWS SDK
245+
- Support for encryption and storage classes
246+
- Efficient file transfer handling
247+
248+
### 8. Documentation Foundation
249+
- CLAUDE.md provides good overview
250+
- README includes setup instructions
251+
- Code comments explain key functionality
252+
253+
## Summary
254+
255+
**Total Findings by Priority:**
256+
- Critical: 4
257+
- High: 6
258+
- Medium: 5
259+
- Low: 5
260+
- Total Issues: 20
261+
262+
**Immediate Actions Required:**
263+
1. Fix command injection vulnerabilities in custom command handling
264+
2. Implement proper input validation and escaping
265+
3. Add rate limiting and resource controls
266+
4. Update dependencies and base images
267+
268+
**Strategic Improvements:**
269+
1. Implement comprehensive monitoring and alerting
270+
2. Design and document disaster recovery procedures
271+
3. Optimize for parallel processing
272+
4. Enhance security posture with scanning and least privilege
273+
274+
This distributed transcoding system has a solid foundation but requires immediate security hardening and operational improvements before production use. The critical command injection vulnerabilities must be addressed immediately, followed by implementing proper monitoring, resource controls, and disaster recovery capabilities.

0 commit comments

Comments
 (0)