Merge #344: fix: [#342] resolve Docker BuildKit 'image already exists' error in CI

josecelano · josecelano · commit ed6c08258bcb · 2026-02-13T08:57:54.000Z
ff71199 docs: [#342] add skill for creating ADRs (Jose Celano)
971e628 docs: [#342] add ADR for concurrent Docker image builds in tests (Jose Celano)
5582ea0 fix: [#342] treat 'already exists' build error as success in concurrent tests (Jose Celano)
d3c42aa fix: remove docker rmi that caused race conditions in parallel tests (Jose Celano)
2dbd31c fix: improve Docker build error reporting (Jose Celano)
c719c92 fix: remove both short and fully-qualified Docker image names before building (Jose Celano)
3b506dd fix: apply Docker BuildKit fix to E2E image builder (Jose Celano)
74f2151 fix: [#342] resolve Docker BuildKit 'image already exists' error in CI (Jose Celano)

Pull request description:

## Description

Fixes #342

Resolves Docker BuildKit "image already exists" errors in GitHub Actions CI caused by a race condition during parallel test execution.

## Problem

When `cargo test` runs multiple integration tests in parallel, each test calls `build_if_missing()` to ensure the Docker image exists. The race condition:

1. **Test A and Test B** both call `build_if_missing()` simultaneously
2. Both call `image_exists()` → both get `false` (no image yet)
3. Both start `docker build` in parallel (~60s each)
4. **Test A finishes first**, tags `dependency-installer-test:ubuntu-24.04` → success
5. **Test B finishes**, all Docker steps complete but the final export/tagging step fails:

   ```
   #8 ERROR: image "docker.io/library/dependency-installer-test:ubuntu-24.04": already exists
   ```

## Solution

When a Docker build fails with "already exists", treat it as **success**: it means another concurrent test already built and tagged the exact same image, which is now available for use.

### Why this is correct

- The "already exists" error only occurs at the export/tagging step, **after** all build steps complete successfully
- It means the identical image was already built by a concurrent process
- The image is immediately available for container creation
- No data loss or corruption possible, because Docker tags are atomic pointers

### Approaches tried and why they failed

| Attempt | Approach | Result |
|---------|----------|--------|
| 1 | `docker rmi -f` before building | ❌ Worse race conditions (removing image while another test uses it) |
| 2 | Extended `docker rmi` for fully-qualified names | ❌ Same problem |
| 3 | Remove `docker rmi`, trust BuildKit atomicity | ❌ BuildKit does **not** handle this silently |
| **4 (final)** | **Treat "already exists" as success** | ✅ **CI passes** |

## Changes

- **`packages/dependency-installer/tests/containers/image_builder.rs`**:
  - `build_if_missing()` now detects "already exists" in build output and returns `Ok(())` instead of `Err`
  - Enhanced error reporting with `tracing::{error, info}` for structured logging
  - Added `--force-rm` flag for intermediate container cleanup
  - Updated documentation to reflect the concurrent build handling
- **`src/testing/e2e/containers/image_builder.rs`**:
  - Same "already exists" detection in `build()` method
  - Added `--force-rm` flag
  - Updated comments about BuildKit concurrency behavior

## Testing

- ✅ All CI checks passing (Container, Coverage, E2E, Linting)
- ✅ All local pre-commit checks pass
- ✅ `cargo clippy`, `cargo machete`, `cargo fmt` clean

ACKs for top commit:
  josecelano: ACK ff71199

Tree-SHA512: 052a538cb635af82d9dcf64eb5111c4199168166065aff88d580fabf2e55e24c92088a097942ddb3d15aaf07fdb336561b333b488a73f31ac95550ba17af0ea5
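
The approach described in the Solution section can be sketched as a small pure function plus a thin wrapper around `docker build`. This is an illustrative sketch only, not the code from this PR: `BuildOutcome`, `classify_build`, and `build_image` are hypothetical names, and the error type is simplified to `String`. The only details taken from the description are the `"already exists"` substring check and the `--force-rm` flag.

```rust
use std::process::Command;

/// Outcome of a `docker build` invocation (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum BuildOutcome {
    /// This process built and tagged the image.
    Built,
    /// A concurrent process tagged the same image first; the image is usable.
    BuiltConcurrently,
}

/// Classifies a finished `docker build` from its exit status and stderr.
/// A failure whose stderr contains "already exists" is treated as success.
pub fn classify_build(exit_ok: bool, stderr: &str) -> Result<BuildOutcome, String> {
    if exit_ok {
        return Ok(BuildOutcome::Built);
    }
    if stderr.contains("already exists") {
        return Ok(BuildOutcome::BuiltConcurrently);
    }
    // Any other failure is a real build error and is propagated.
    Err(format!("docker build failed: {}", stderr.trim()))
}

/// Runs `docker build` and classifies the result (hypothetical wrapper).
pub fn build_image(tag: &str, context_dir: &str) -> Result<BuildOutcome, String> {
    let output = Command::new("docker")
        .args(["build", "--force-rm", "-t", tag, context_dir])
        .output()
        .map_err(|e| format!("failed to run docker: {e}"))?;
    let stderr = String::from_utf8_lossy(&output.stderr);
    classify_build(output.status.success(), &stderr)
}
```

Keeping the classification separate from the process call means the race-handling rule can be unit-tested without a Docker daemon; only `build_image` needs Docker installed.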
diff --git a/.github/skills/create-adr/skill.md b/.github/skills/create-adr/skill.md
@@ -0,0 +1,160 @@
+---
+name: create-adr
+description: Guide for creating Architectural Decision Records (ADRs) in the torrust-tracker-deployer project. Covers the ADR template, file naming, index registration, and commit workflow. Use when documenting architectural decisions, recording design choices, or adding decision records. Triggers on "create ADR", "add ADR", "new decision record", "architectural decision", "document decision", or "add decision".
+metadata:
+  author: torrust
+  version: "1.0"
+---
+
+# Creating Architectural Decision Records
+
+This skill guides you through creating ADRs for the Torrust Tracker Deployer project.
+
+## Quick Reference
+
+```bash
+# 1. Create the ADR file
+cp docs/decisions/TEMPLATE.md docs/decisions/{kebab-case-title}.md
+
+# 2. Add entry to the index table in docs/decisions/README.md
+
+# 3. Run pre-commit checks
+./scripts/pre-commit.sh
+
+# 4. Commit
+git commit -m "docs: [#{issue}] add ADR for {short description}"
+```
+
+## When to Create an ADR
+
+Create an ADR when making a decision that:
+
+- Affects the project's architecture or design patterns
+- Chooses one approach over alternatives that were considered
+- Has consequences (positive or negative) worth documenting
+- Would benefit future contributors who ask "why was this done this way?"
+
+Do **not** create an ADR for trivial implementation choices or style preferences already covered by linting rules.
+
+## ADR Template
+
+Every ADR uses the structure from `docs/decisions/TEMPLATE.md`:
+
+```markdown
+# Decision: [Title]
+
+## Status
+
+[Proposed | Accepted | Rejected | Superseded]
+
+## Date
+
+YYYY-MM-DD
+
+## Context
+
+What is the issue motivating this decision?
+
+## Decision
+
+What change are we implementing?
+
+## Consequences
+
+What becomes easier or more difficult? What risks are introduced?
+
+## Alternatives Considered
+
+What other options were evaluated and why were they rejected?
+
+## Related Decisions
+
+Links to other relevant ADRs.
+
+## References
+
+Links to external resources, issues, or PRs.
+```
+
+## Step-by-Step Process
+
+### Step 1: Choose a Filename
+
+Use `kebab-case` matching the decision topic:
+
+```text
+docs/decisions/{kebab-case-title}.md
+```
+
+Examples: `concurrent-docker-image-builds-in-tests.md`, `caddy-for-tls-termination.md`
+
+### Step 2: Write the ADR
+
+Fill in every section of the template:
+
+- **Status**: Use `✅ Accepted` for decisions being implemented now. Use `Proposed` if pending review.
+- **Date**: Use today's date in `YYYY-MM-DD` format
+- **Context**: Explain the problem thoroughly — include enough background for future readers who have no prior context. Include links to issues or PRs if applicable.
+- **Decision**: State clearly what was decided and why. Include code examples if the decision involves specific patterns.
+- **Consequences**: Document **both** positive and negative consequences. Be honest about trade-offs.
+- **Alternatives Considered**: List each alternative with a clear explanation of why it was rejected. This is one of the most valuable sections — it prevents future contributors from re-exploring dead ends.
+- **Related Decisions**: Link to other ADRs in the same directory
+- **References**: Link to GitHub issues, PRs, external documentation
+
+### Step 3: Add to the Decision Index
+
+Add a new row to the table in `docs/decisions/README.md`, sorted by date (newest first):
+
+```markdown
+| ✅ Accepted | YYYY-MM-DD | [Title](./filename.md) | One-line summary (max ~85 chars) |
+```
+
+The table columns are: Status, Date, Decision (link), Summary.
+
+### Step 4: Validate and Commit
+
+```bash
+# Lint the new ADR and the updated index
+npx markdownlint-cli docs/decisions/{filename}.md
+npx markdownlint-cli docs/decisions/README.md
+npx cspell lint docs/decisions/{filename}.md
+
+# Run full pre-commit checks
+./scripts/pre-commit.sh
+
+# Commit with conventional format
+git add docs/decisions/{filename}.md docs/decisions/README.md
+git commit -m "docs: [#{issue}] add ADR for {short description}"
+```
+
+## Guidelines
+
+From `docs/decisions/README.md`:
+
+- **One decision per file**: Each ADR focuses on a single architectural decision
+- **Immutable**: Once accepted, ADRs should not be modified. Create new ADRs to supersede old ones
+- **Context-rich**: Include enough background for future readers
+- **Consequence-aware**: Document both positive and negative consequences
+- **Linked**: Reference related decisions and external resources
+
+## Status Definitions
+
+| Status         | Meaning                                    |
+| -------------- | ------------------------------------------ |
+| **Proposed**   | Decision is under discussion               |
+| **Accepted**   | Decision has been approved and implemented |
+| **Rejected**   | Decision was considered but not approved   |
+| **Superseded** | Decision has been replaced by a newer ADR  |
+
+## Common Mistakes
+
+- **Missing alternatives**: Always document what was considered and rejected; this is one of the most valuable parts for future contributors
+- **Vague consequences**: Be specific about trade-offs, not just "this is simpler"
+- **Forgetting the index**: Every ADR must be added to the table in `docs/decisions/README.md`
+- **Wrong sort order**: Index entries are sorted newest-first by date
+
+## References
+
+- ADR index and guidelines: `docs/decisions/README.md`
+- ADR template: `docs/decisions/TEMPLATE.md`
+- AGENTS.md rule 12: "Before making engineering decisions, document as ADRs"
diff --git a/AGENTS.md b/AGENTS.md
@@ -220,6 +220,7 @@ Available skills:
 | Adding commands             | `.github/skills/add-new-command/skill.md`          |
 | Committing changes          | `.github/skills/commit-changes/skill.md`           |
 | Completing refactor plans   | `.github/skills/complete-refactor-plan/skill.md`   |
+| Creating ADRs               | `.github/skills/create-adr/skill.md`               |
 | Creating issues             | `.github/skills/create-issue/skill.md`             |
 | Creating new skills         | `.github/skills/add-new-skill/skill.md`            |
 | Creating refactor plans     | `.github/skills/create-refactor-plan/skill.md`     |
diff --git a/docs/decisions/README.md b/docs/decisions/README.md
@@ -6,6 +6,7 @@ This directory contains architectural decision records for the Torrust Tracker D
 
 | Status        | Date       | Decision                                                                                                  | Summary                                                                                    |
 | ------------- | ---------- | --------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
+| ✅ Accepted   | 2026-02-13 | [Concurrent Docker Image Builds in Tests](./concurrent-docker-image-builds-in-tests.md)                   | Treat "already exists" tagging error as success when parallel tests build same image       |
 | ✅ Accepted   | 2026-02-06 | [Agent Skills Content Strategy](./skill-content-strategy-duplication-vs-linking.md)                       | Three-tier content strategy: self-contained workflows, progressive disclosure, linked docs |
 | ✅ Accepted   | 2026-01-27 | [Atomic Ansible Playbooks](./atomic-ansible-playbooks.md)                                                 | Require one-responsibility playbooks with Rust-side gating and registered static templates |
 | ✅ Accepted   | 2026-01-24 | [Bind Mount Standardization](./bind-mount-standardization.md)                                             | Use bind mounts exclusively for all Docker Compose volumes for observability and backup    |
diff --git a/docs/decisions/concurrent-docker-image-builds-in-tests.md b/docs/decisions/concurrent-docker-image-builds-in-tests.md
@@ -0,0 +1,202 @@
+# Decision: Handle Concurrent Docker Image Builds in Tests Gracefully
+
+## Status
+
+✅ Accepted
+
+## Date
+
+2026-02-13
+
+## Context
+
+Several integration tests require Docker images that are built on-demand
+before the test runs. The image builders (`ImageBuilder::build_if_missing()`
+and `ContainerImageBuilder::build()`) follow a check-then-build pattern:
+
+1. Check if the image exists (`docker image inspect`)
+2. If it does not exist, build it (`docker build -t <name>:<tag> ...`)
+
+Since `cargo test` runs tests in parallel by default, multiple tests that
+depend on the same Docker image can trigger the build simultaneously. This
+creates a race condition:
+
+1. **Test A** and **Test B** both call `build_if_missing()` concurrently
+2. Both call `image_exists()` → both get `false` (no image yet)
+3. Both start `docker build` in parallel (~60 seconds each)
+4. **Test A** finishes first and tags the image → success
+5. **Test B** finishes all build steps successfully, but the final
+   export/tagging step fails because the tag already exists:
+
+```text
+#8 exporting to image
+#8 exporting layers done
+#8 naming to docker.io/library/dependency-installer-test:ubuntu-24.04 done
+#8 ERROR: image "docker.io/library/dependency-installer-test:ubuntu-24.04": already exists
+```
+
+This error is misleading: all Docker build steps completed successfully.
+The failure occurs only at the final naming/tagging step because another
+concurrent build already claimed the tag. The image is valid and available
+for use.
+
+This manifested as flaky CI failures in GitHub Actions where
+`dependency-installer` integration tests would intermittently fail with
+`BuildFailed` errors despite the Docker image being correctly built. See
+[issue 342](https://github.com/torrust/torrust-tracker-deployer/issues/342)
+for the full debugging history.
+
+### Additional caveat: image staleness in development
+
+All tests that need the same Docker image share a single tagged image. This
+works well in CI, where every workflow run starts with a clean Docker state
+and builds a fresh image. However, during local development, if a developer
+modifies the Dockerfile or its build context, the existing cached image
+will not be rebuilt because `build_if_missing()` skips the build when the
+tag already exists. Developers must manually delete the old image
+(`docker rmi <name>:<tag>`) before running tests to pick up their changes.
+
+## Decision
+
+When a Docker build fails and the error output contains the string
+`"already exists"`, treat it as **success** instead of propagating the
+error. The image was built by a concurrent process and is available for
+use.
+
+### Implementation
+
+In both image builders, after detecting a non-zero exit code from
+`docker build`, check the error output before returning an error:
+
+```rust
+if !output.status.success() {
+    let stderr = String::from_utf8_lossy(&output.stderr);
+
+    // Concurrent build race: another test already tagged this image.
+    // The image is available for use — this is not a real failure.
+    if stderr.contains("already exists") {
+        info!(
+            image = full_image_name,
+            "Docker image was built by a concurrent process, treating as success"
+        );
+        return Ok(());
+    }
+
+    // ... propagate real build errors as before
+}
+```
+
+### Why this is correct
+
+- The `"already exists"` error only occurs at the export/tagging step,
+  **after** all build steps complete successfully
+- It means the exact same image (same Dockerfile, same tag) was already
+  built and tagged by a concurrent process
+- The image is immediately available for container creation
+- Docker tags are atomic pointers — no data loss or corruption is possible
+
+### Affected files
+
+- `packages/dependency-installer/tests/containers/image_builder.rs` —
+  `build_if_missing()` method
+- `src/testing/e2e/containers/image_builder.rs` —
+  `build()` method
+
+## Consequences
+
+### Positive
+
+- **CI stability**: Parallel tests no longer fail due to concurrent image
+  builds — the flaky CI failure is eliminated
+- **Simplicity**: Minimal code change (string check on error output) with
+  no new dependencies or synchronization primitives
+- **No performance impact**: No locks, no serialization of builds, no
+  additional Docker commands
+- **Preserves parallelism**: Tests continue to run in parallel without
+  any coordination overhead
+
+### Negative
+
+- **String-based error detection**: The fix relies on matching the string
+  `"already exists"` in Docker's error output. If Docker changes this
+  message in a future version, the detection would stop working and the
+  original error would resurface (but would not silently break — tests
+  would fail visibly)
+- **Redundant builds**: When the race occurs, both tests perform the full
+  build (~60 seconds each), wasting CI time. The second build fails only
+  at the final tagging step, and that failure is ignored. This is
+  acceptable because the race is infrequent and the alternatives add complexity
+- **Image staleness in development**: Developers who modify Dockerfiles
+  or build contexts must manually remove old images before running tests.
+  The `build_if_missing()` pattern does not detect changes to the build
+  inputs
+
+## Alternatives Considered
+
+### 1. Tag images uniquely per test
+
+Give each test a unique image tag (e.g., `test-image:test-<uuid>`) so
+concurrent builds never collide on the same tag.
+
+**Rejected because:**
+
+- Pollutes the Docker tag namespace with many test-specific tags
+- Requires cleanup logic to remove stale test images
+- Each test would build its own image from scratch, significantly
+  increasing total CI time (no sharing of built images between tests)
+- Adds complexity to the test infrastructure for a problem that occurs
+  only during the tagging step
+
+### 2. Use a file-based lock to serialize builds
+
+Use a lock file or `flock` to ensure only one test builds the image at a
+time. Other tests wait for the lock, then find the image already exists.
+
+**Rejected because:**
+
+- Introduces a new synchronization primitive into the test infrastructure
+- Cross-process file locks have portability concerns across operating
+  systems
+- Tests may have execution timeouts that could be exceeded while waiting
+  for the lock, especially if the build takes a long time
+- Adds complexity for a problem that has a simpler solution
+
+### 3. Pre-build images before running tests
+
+Add a build step (e.g., in CI workflow or a setup script) that builds all
+required Docker images before `cargo test` runs.
+
+**Rejected because:**
+
+- Adds a mandatory setup step that developers must remember to run
+- Couples the test execution to an external build step, making it harder
+  to run tests in isolation
+- Does not fully eliminate the race if a developer runs `cargo test`
+  without the pre-build step
+- The current on-demand build pattern is more ergonomic for development
+
+### 4. Remove existing image before building (`docker rmi -f`)
+
+Force-remove the tagged image before starting the build to ensure a
+clean slate.
+
+**Tried and rejected because:**
+
+- Creates **worse** race conditions: Test A removes the image while
+  Test B is using a container based on it
+- Does not solve the fundamental concurrency problem — just shifts the
+  race window
+- Was the first approach attempted during the debugging of issue 342
+  and was proven to make CI failures more frequent
+
+## Related Decisions
+
+- [Single Docker Image for Sequential E2E Command Testing](./single-docker-image-sequential-testing.md) —
+  related decision about Docker image strategy for E2E tests
+- [Docker Testing Evolution](./docker-testing-evolution.md) —
+  broader context on Docker usage in the test infrastructure
+
+## References
+
+- [Issue 342: Fix Docker BuildKit "image already exists" error](https://github.com/torrust/torrust-tracker-deployer/issues/342)
+- [PR 344: fix: resolve Docker BuildKit "image already exists" error in CI](https://github.com/torrust/torrust-tracker-deployer/pull/344)
diff --git a/packages/dependency-installer/tests/containers/image_builder.rs b/packages/dependency-installer/tests/containers/image_builder.rs
diff --git a/src/testing/e2e/containers/image_builder.rs b/src/testing/e2e/containers/image_builder.rs