Skip to content

Commit ed6c082

Browse files
committed
Merge #344: fix: [#342] resolve Docker BuildKit 'image already exists' error in CI
ff71199 docs: [#342] add skill for creating ADRs (Jose Celano) 971e628 docs: [#342] add ADR for concurrent Docker image builds in tests (Jose Celano) 5582ea0 fix: [#342] treat 'already exists' build error as success in concurrent tests (Jose Celano) d3c42aa fix: remove docker rmi that caused race conditions in parallel tests (Jose Celano) 2dbd31c fix: improve Docker build error reporting (Jose Celano) c719c92 fix: remove both short and fully-qualified Docker image names before building (Jose Celano) 3b506dd fix: apply Docker BuildKit fix to E2E image builder (Jose Celano) 74f2151 fix: [#342] resolve Docker BuildKit 'image already exists' error in CI (Jose Celano) Pull request description: ## Description Fixes #342 Resolves Docker BuildKit "image already exists" errors in GitHub Actions CI caused by a race condition during parallel test execution. ## Problem When `cargo test` runs multiple integration tests in parallel, each test calls `build_if_missing()` to ensure the Docker image exists. The race condition: 1. **Test A and Test B** both call `build_if_missing()` simultaneously 2. Both call `image_exists()` → both get `false` (no image yet) 3. Both start `docker build` in parallel (~60s each) 4. **Test A finishes first**, tags `dependency-installer-test:ubuntu-24.04` → success 5. **Test B finishes**, all Docker steps complete but the final export/tagging step fails: ``` #8 ERROR: image "docker.io/library/dependency-installer-test:ubuntu-24.04": already exists ``` ## Solution When a Docker build fails with "already exists", treat it as **success** — it means another concurrent test already built and tagged the exact same image, which is now available for use. ### Why this is correct - The "already exists" error only occurs at the export/tagging step, **after** all build steps complete successfully - It means the identical image was already built by a concurrent process - The image is immediately available for container creation - No data loss or corruption possible — Docker tags are atomic pointers ### Approaches tried and why they failed | Attempt | Approach | Result | |---------|----------|--------| | 1 | `docker rmi -f` before building | ❌ Worse race conditions (removing image while another test uses it) | | 2 | Extended `docker rmi` for fully-qualified names | ❌ Same problem | | 3 | Remove `docker rmi`, trust BuildKit atomicity | ❌ BuildKit does **not** handle this silently | | **4 (final)** | **Treat "already exists" as success** | ✅ **CI passes** | ## Changes - **`packages/dependency-installer/tests/containers/image_builder.rs`**: - `build_if_missing()` now detects "already exists" in build output and returns `Ok(())` instead of `Err` - Enhanced error reporting with `tracing::{error, info}` for structured logging - Added `--force-rm` flag for intermediate container cleanup - Updated documentation to reflect the concurrent build handling - **`src/testing/e2e/containers/image_builder.rs`**: - Same "already exists" detection in `build()` method - Added `--force-rm` flag - Updated comments about BuildKit concurrency behavior ## Testing - ✅ All CI checks passing (Container, Coverage, E2E, Linting) - ✅ All local pre-commit checks pass - ✅ `cargo clippy`, `cargo machete`, `cargo fmt` clean ACKs for top commit: josecelano: ACK ff71199 Tree-SHA512: 052a538cb635af82d9dcf64eb5111c4199168166065aff88d580fabf2e55e24c92088a097942ddb3d15aaf07fdb336561b333b488a73f31ac95550ba17af0ea5
2 parents bb95810 + ff71199 commit ed6c082

6 files changed

Lines changed: 420 additions & 3 deletions

File tree

.github/skills/create-adr/skill.md

Lines changed: 160 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,160 @@
1+
---
2+
name: create-adr
3+
description: Guide for creating Architectural Decision Records (ADRs) in the torrust-tracker-deployer project. Covers the ADR template, file naming, index registration, and commit workflow. Use when documenting architectural decisions, recording design choices, or adding decision records. Triggers on "create ADR", "add ADR", "new decision record", "architectural decision", "document decision", or "add decision".
4+
metadata:
5+
author: torrust
6+
version: "1.0"
7+
---
8+
9+
# Creating Architectural Decision Records
10+
11+
This skill guides you through creating ADRs for the Torrust Tracker Deployer project.
12+
13+
## Quick Reference
14+
15+
```bash
16+
# 1. Create the ADR file
17+
cp docs/decisions/TEMPLATE.md docs/decisions/{kebab-case-title}.md
18+
19+
# 2. Add entry to the index table in docs/decisions/README.md
20+
21+
# 3. Run pre-commit checks
22+
./scripts/pre-commit.sh
23+
24+
# 4. Commit
25+
git commit -m "docs: [#{issue}] add ADR for {short description}"
26+
```
27+
28+
## When to Create an ADR
29+
30+
Create an ADR when making a decision that:
31+
32+
- Affects the project's architecture or design patterns
33+
- Chooses one approach over alternatives that were considered
34+
- Has consequences (positive or negative) worth documenting
35+
- Would benefit future contributors who ask "why was this done this way?"
36+
37+
Do **not** create an ADR for trivial implementation choices or style preferences already covered by linting rules.
38+
39+
## ADR Template
40+
41+
Every ADR uses the structure from `docs/decisions/TEMPLATE.md`:
42+
43+
```markdown
44+
# Decision: [Title]
45+
46+
## Status
47+
48+
[Proposed | Accepted | Rejected | Superseded]
49+
50+
## Date
51+
52+
YYYY-MM-DD
53+
54+
## Context
55+
56+
What is the issue motivating this decision?
57+
58+
## Decision
59+
60+
What change are we implementing?
61+
62+
## Consequences
63+
64+
What becomes easier or more difficult? What risks are introduced?
65+
66+
## Alternatives Considered
67+
68+
What other options were evaluated and why were they rejected?
69+
70+
## Related Decisions
71+
72+
Links to other relevant ADRs.
73+
74+
## References
75+
76+
Links to external resources, issues, or PRs.
77+
```
78+
79+
## Step-by-Step Process
80+
81+
### Step 1: Choose a Filename
82+
83+
Use `kebab-case` matching the decision topic:
84+
85+
```text
86+
docs/decisions/{kebab-case-title}.md
87+
```
88+
89+
Examples: `concurrent-docker-image-builds-in-tests.md`, `caddy-for-tls-termination.md`
90+
91+
### Step 2: Write the ADR
92+
93+
Fill in every section of the template:
94+
95+
- **Status**: Use `✅ Accepted` for decisions being implemented now. Use `Proposed` if pending review.
96+
- **Date**: Use today's date in `YYYY-MM-DD` format
97+
- **Context**: Explain the problem thoroughly — include enough background for future readers who have no prior context. Include links to issues or PRs if applicable.
98+
- **Decision**: State clearly what was decided and why. Include code examples if the decision involves specific patterns.
99+
- **Consequences**: Document **both** positive and negative consequences. Be honest about trade-offs.
100+
- **Alternatives Considered**: List each alternative with a clear explanation of why it was rejected. This is one of the most valuable sections — it prevents future contributors from re-exploring dead ends.
101+
- **Related Decisions**: Link to other ADRs in the same directory
102+
- **References**: Link to GitHub issues, PRs, external documentation
103+
104+
### Step 3: Add to the Decision Index
105+
106+
Add a new row to the table in `docs/decisions/README.md`, sorted by date (newest first):
107+
108+
```markdown
109+
| ✅ Accepted | YYYY-MM-DD | [Title](./filename.md) | One-line summary (max ~85 chars) |
110+
```
111+
112+
The table columns are: Status, Date, Decision (link), Summary.
113+
114+
### Step 4: Validate and Commit
115+
116+
```bash
117+
# Lint the new ADR and the updated index
118+
npx markdownlint-cli docs/decisions/{filename}.md
119+
npx markdownlint-cli docs/decisions/README.md
120+
npx cspell lint docs/decisions/{filename}.md
121+
122+
# Run full pre-commit checks
123+
./scripts/pre-commit.sh
124+
125+
# Commit with conventional format
126+
git add docs/decisions/{filename}.md docs/decisions/README.md
127+
git commit -m "docs: [#{issue}] add ADR for {short description}"
128+
```
129+
130+
## Guidelines
131+
132+
From `docs/decisions/README.md`:
133+
134+
- **One decision per file**: Each ADR focuses on a single architectural decision
135+
- **Immutable**: Once accepted, ADRs should not be modified. Create new ADRs to supersede old ones
136+
- **Context-rich**: Include enough background for future readers
137+
- **Consequence-aware**: Document both positive and negative consequences
138+
- **Linked**: Reference related decisions and external resources
139+
140+
## Status Definitions
141+
142+
| Status | Meaning |
143+
| -------------- | ------------------------------------------ |
144+
| **Proposed** | Decision is under discussion |
145+
| **Accepted** | Decision has been approved and implemented |
146+
| **Rejected** | Decision was considered but not approved |
147+
| **Superseded** | Decision has been replaced by a newer ADR |
148+
149+
## Common Mistakes
150+
151+
- **Missing alternatives**: Always document what was considered and rejected — this is the most valuable part for future contributors
152+
- **Vague consequences**: Be specific about trade-offs, not just "this is simpler"
153+
- **Forgetting the index**: Every ADR must be added to the table in `docs/decisions/README.md`
154+
- **Wrong sort order**: Index entries are sorted newest-first by date
155+
156+
## References
157+
158+
- ADR index and guidelines: `docs/decisions/README.md`
159+
- ADR template: `docs/decisions/TEMPLATE.md`
160+
- AGENTS.md rule 12: "Before making engineering decisions, document as ADRs"

AGENTS.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -220,6 +220,7 @@ Available skills:
220220
| Adding commands | `.github/skills/add-new-command/skill.md` |
221221
| Committing changes | `.github/skills/commit-changes/skill.md` |
222222
| Completing refactor plans | `.github/skills/complete-refactor-plan/skill.md` |
223+
| Creating ADRs | `.github/skills/create-adr/skill.md` |
223224
| Creating issues | `.github/skills/create-issue/skill.md` |
224225
| Creating new skills | `.github/skills/add-new-skill/skill.md` |
225226
| Creating refactor plans | `.github/skills/create-refactor-plan/skill.md` |

docs/decisions/README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ This directory contains architectural decision records for the Torrust Tracker D
66

77
| Status | Date | Decision | Summary |
88
| ------------- | ---------- | --------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
9+
| ✅ Accepted | 2026-02-13 | [Concurrent Docker Image Builds in Tests](./concurrent-docker-image-builds-in-tests.md) | Treat "already exists" tagging error as success when parallel tests build same image |
910
| ✅ Accepted | 2026-02-06 | [Agent Skills Content Strategy](./skill-content-strategy-duplication-vs-linking.md) | Three-tier content strategy: self-contained workflows, progressive disclosure, linked docs |
1011
| ✅ Accepted | 2026-01-27 | [Atomic Ansible Playbooks](./atomic-ansible-playbooks.md) | Require one-responsibility playbooks with Rust-side gating and registered static templates |
1112
| ✅ Accepted | 2026-01-24 | [Bind Mount Standardization](./bind-mount-standardization.md) | Use bind mounts exclusively for all Docker Compose volumes for observability and backup |
Lines changed: 202 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,202 @@
1+
# Decision: Handle Concurrent Docker Image Builds in Tests Gracefully
2+
3+
## Status
4+
5+
✅ Accepted
6+
7+
## Date
8+
9+
2026-02-13
10+
11+
## Context
12+
13+
Several integration tests require Docker images that are built on-demand
14+
before the test runs. The image builders (`ImageBuilder::build_if_missing()`
15+
and `ContainerImageBuilder::build()`) follow a check-then-build pattern:
16+
17+
1. Check if the image exists (`docker image inspect`)
18+
2. If it does not exist, build it (`docker build -t <name>:<tag> ...`)
19+
20+
Since `cargo test` runs tests in parallel by default, multiple tests that
21+
depend on the same Docker image can trigger the build simultaneously. This
22+
creates a race condition:
23+
24+
1. **Test A** and **Test B** both call `build_if_missing()` concurrently
25+
2. Both call `image_exists()` → both get `false` (no image yet)
26+
3. Both start `docker build` in parallel (~60 seconds each)
27+
4. **Test A** finishes first and tags the image → success
28+
5. **Test B** finishes all build steps successfully, but the final
29+
export/tagging step fails because the tag already exists:
30+
31+
```text
32+
#8 exporting to image
33+
#8 exporting layers done
34+
#8 naming to docker.io/library/dependency-installer-test:ubuntu-24.04 done
35+
#8 ERROR: image "docker.io/library/dependency-installer-test:ubuntu-24.04": already exists
36+
```
37+
38+
This error is misleading: all Docker build steps completed successfully.
39+
The failure occurs only at the final naming/tagging step because another
40+
concurrent build already claimed the tag. The image is valid and available
41+
for use.
42+
43+
This manifested as flaky CI failures in GitHub Actions where
44+
`dependency-installer` integration tests would intermittently fail with
45+
`BuildFailed` errors despite the Docker image being correctly built. See
46+
[issue 342](https://github.com/torrust/torrust-tracker-deployer/issues/342)
47+
for the full debugging history.
48+
49+
### Additional caveat: image staleness in development
50+
51+
All tests that need the same Docker image share a single tagged image. This
52+
works well in CI, where every workflow run starts with a clean Docker state
53+
and builds a fresh image. However, during local development, if a developer
54+
modifies the Dockerfile or its build context, the existing cached image
55+
will not be rebuilt because `build_if_missing()` skips the build when the
56+
tag already exists. Developers must manually delete the old image
57+
(`docker rmi <name>:<tag>`) before running tests to pick up their changes.
58+
59+
## Decision
60+
61+
When a Docker build fails and the error output contains the string
62+
`"already exists"`, treat it as **success** instead of propagating the
63+
error. The image was built by a concurrent process and is available for
64+
use.
65+
66+
### Implementation
67+
68+
In both image builders, after detecting a non-zero exit code from
69+
`docker build`, check the error output before returning an error:
70+
71+
```rust
72+
if !output.status.success() {
73+
let stderr = String::from_utf8_lossy(&output.stderr);
74+
75+
// Concurrent build race: another test already tagged this image.
76+
// The image is available for use — this is not a real failure.
77+
if stderr.contains("already exists") {
78+
info!(
79+
image = full_image_name,
80+
"Docker image was built by a concurrent process, treating as success"
81+
);
82+
return Ok(());
83+
}
84+
85+
// ... propagate real build errors as before
86+
}
87+
```
88+
89+
### Why this is correct
90+
91+
- The `"already exists"` error only occurs at the export/tagging step,
92+
**after** all build steps complete successfully
93+
- It means the exact same image (same Dockerfile, same tag) was already
94+
built and tagged by a concurrent process
95+
- The image is immediately available for container creation
96+
- Docker tags are atomic pointers — no data loss or corruption is possible
97+
98+
### Affected files
99+
100+
- `packages/dependency-installer/tests/containers/image_builder.rs`
101+
`build_if_missing()` method
102+
- `src/testing/e2e/containers/image_builder.rs`
103+
`build()` method
104+
105+
## Consequences
106+
107+
### Positive
108+
109+
- **CI stability**: Parallel tests no longer fail due to concurrent image
110+
builds — the flaky CI failure is eliminated
111+
- **Simplicity**: Minimal code change (string check on error output) with
112+
no new dependencies or synchronization primitives
113+
- **No performance impact**: No locks, no serialization of builds, no
114+
additional Docker commands
115+
- **Preserves parallelism**: Tests continue to run in parallel without
116+
any coordination overhead
117+
118+
### Negative
119+
120+
- **String-based error detection**: The fix relies on matching the string
121+
`"already exists"` in Docker's error output. If Docker changes this
122+
message in a future version, the detection would stop working and the
123+
original error would resurface (but would not silently break — tests
124+
would fail visibly)
125+
- **Redundant builds**: When the race occurs, both tests perform the full
126+
build (~60 seconds each), wasting CI time. Only the tagging of the
127+
second build is skipped. This is acceptable because the race is
128+
infrequent and the alternative solutions add complexity
129+
- **Image staleness in development**: Developers who modify Dockerfiles
130+
or build contexts must manually remove old images before running tests.
131+
The `build_if_missing()` pattern does not detect changes to the build
132+
inputs
133+
134+
## Alternatives Considered
135+
136+
### 1. Tag images uniquely per test
137+
138+
Give each test a unique image tag (e.g., `test-image:test-<uuid>`) so
139+
concurrent builds never collide on the same tag.
140+
141+
**Rejected because:**
142+
143+
- Pollutes the Docker tag namespace with many test-specific tags
144+
- Requires cleanup logic to remove stale test images
145+
- Each test would build its own image from scratch, significantly
146+
increasing total CI time (no sharing of built images between tests)
147+
- Adds complexity to the test infrastructure for a problem that occurs
148+
only during the tagging step
149+
150+
### 2. Use a file-based lock to serialize builds
151+
152+
Use a lock file or `flock` to ensure only one test builds the image at a
153+
time. Other tests wait for the lock, then find the image already exists.
154+
155+
**Rejected because:**
156+
157+
- Introduces a new synchronization primitive into the test infrastructure
158+
- Cross-process file locks have portability concerns across operating
159+
systems
160+
- Tests may have execution timeouts that could be exceeded while waiting
161+
for the lock, especially if the build takes a long time
162+
- Adds complexity for a problem that has a simpler solution
163+
164+
### 3. Pre-build images before running tests
165+
166+
Add a build step (e.g., in CI workflow or a setup script) that builds all
167+
required Docker images before `cargo test` runs.
168+
169+
**Rejected because:**
170+
171+
- Adds a mandatory setup step that developers must remember to run
172+
- Couples the test execution to an external build step, making it harder
173+
to run tests in isolation
174+
- Does not fully eliminate the race if a developer runs `cargo test`
175+
without the pre-build step
176+
- The current on-demand build pattern is more ergonomic for development
177+
178+
### 4. Remove existing image before building (`docker rmi -f`)
179+
180+
Force-remove the tagged image before starting the build to ensure a
181+
clean slate.
182+
183+
**Tried and rejected because:**
184+
185+
- Creates **worse** race conditions: Test A removes the image while
186+
Test B is using a container based on it
187+
- Does not solve the fundamental concurrency problem — just shifts the
188+
race window
189+
- Was the first approach attempted during the debugging of issue 342
190+
and was proven to make CI failures more frequent
191+
192+
## Related Decisions
193+
194+
- [Single Docker Image for Sequential E2E Command Testing](./single-docker-image-sequential-testing.md)
195+
related decision about Docker image strategy for E2E tests
196+
- [Docker Testing Evolution](./docker-testing-evolution.md)
197+
broader context on Docker usage in the test infrastructure
198+
199+
## References
200+
201+
- [Issue 342: Fix Docker BuildKit "image already exists" error](https://github.com/torrust/torrust-tracker-deployer/issues/342)
202+
- [PR 344: fix: resolve Docker BuildKit "image already exists" error in CI](https://github.com/torrust/torrust-tracker-deployer/pull/344)

0 commit comments

Comments
 (0)