Skip to content

Docker: Retain recordings for failed sessions only from session capabilities#3111

Merged
VietND96 merged 2 commits intotrunkfrom
retain-on-failure
Apr 12, 2026
Merged

Docker: Retain recordings for failed sessions only from session capabilities#3111
VietND96 merged 2 commits intotrunkfrom
retain-on-failure

Conversation

@VietND96
Copy link
Copy Markdown
Member

@VietND96 VietND96 commented Apr 10, 2026

Thanks for contributing to the Docker-Selenium project!
A PR well described will help maintainers to quickly review and merge it

Before submitting your PR, please check our contributing guidelines, applied for this repository.
Avoid large PRs, help reviewers by making them as simple and short as possible.

Description

Retain recordings for failed sessions only

In event-driven mode (SE_VIDEO_EVENT_DRIVEN=true, the default), the video service subscribes to the Grid's ZeroMQ event bus and reacts to session lifecycle events in real time. This enables a retain-on-failure strategy: record every session, but automatically discard the video when the session passes and only keep (and upload) recordings from sessions that fail.

Enable it globally with the environment variable:

SE_RETAIN_ON_FAILURE=true

A session is treated as failed when either of the following is true:

  1. The test code fires a session event whose eventType contains a substring from SE_FAILURE_SESSION_EVENTS (default: :failed,:failure,:error,:aborted).
  2. The session closes with an abnormal reason — TIMEOUT, NODE_REMOVED, or NODE_RESTARTED — instead of the normal QUIT_COMMAND.
Environment variable Default Description
SE_RETAIN_ON_FAILURE false Discard recordings of sessions that pass. Only recordings from failed sessions are retained on disk and queued for upload.
SE_FAILURE_SESSION_EVENTS :failed,:failure,:error,:aborted Comma-separated substrings. Any session event whose eventType contains one of these (case-insensitive) marks the session as failed.

The se:retainOnFailure session capability overrides the global container env var for a specific session. For example, to retain the recording of a single session regardless of the global setting:

options.set_capability('se:retainOnFailure', True)
se:retainOnFailure cap SE_RETAIN_ON_FAILURE env Effective behaviour
true false (default) Retain on failure for this session
false true Always retain for this session
absent true Retain on failure (global default)
absent false (default) Always retain (global default)

Firing session events from test code

The Session Event API lets test code push named events directly to the Grid. The video service listens for these events on the ZeroMQ bus and uses them to determine session failure.

Call driver.fire_session_event(eventType, payload) from your test. Any eventType that contains a configured failure substring (e.g. "test:failed" contains ":failed") marks the session as failed.

from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium import webdriver

options = ChromeOptions()
options.set_capability('se:name', 'checkout_flow')
options.set_capability('se:retainOnFailure', True)   # discard video if this session passes

driver = webdriver.Remote(options=options, command_executor="http://localhost:4444")

try:
    driver.get("https://selenium.dev")
    # ... test steps ...
except Exception as exc:
    # "test:failed" contains ":failed" — matches the default SE_FAILURE_SESSION_EVENTS
    driver.fire_session_event("test:failed", {"error": str(exc)})
    raise
finally:
    driver.quit()

Note: If the test catches an exception and still calls driver.quit() normally, the session close reason is QUIT_COMMAND (not abnormal). In that case, firing a failure event before quit() is the only way to mark the session as failed and prevent the recording from being discarded.

So, you can control the retain-on-failure strategy fully from test code via session capabilities and fire session event.

Motivation and Context

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@qodo-code-review
Copy link
Copy Markdown
Contributor

Review Summary by Qodo

Implement retain-on-failure strategy with session capability override

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Implement retain-on-failure strategy to discard passing session recordings
• Refactor upload logic to support artifact-agnostic queuing and processing
• Add session capability se:retainOnFailure for per-session override control
• Expand default failure event detection to include :error and :aborted
• Update environment variable naming for clarity: SE_UPLOAD_FAILURE_SESSION_ONLYSE_RETAIN_ON_FAILURE
Diagram
flowchart LR
  SessionCreated["Session Created<br/>with capability"] -->|Check se:retainOnFailure| CapCheck{"Capability<br/>set?"}
  CapCheck -->|Yes| UseCap["Use capability<br/>value"]
  CapCheck -->|No| UseEnv["Use SE_RETAIN_ON_FAILURE<br/>env var"]
  UseCap --> SessionState["Store in<br/>SessionState"]
  UseEnv --> SessionState
  SessionState -->|Recording stops| FailureCheck{"Session<br/>failed?"}
  FailureCheck -->|Yes| Queue["Queue for upload"]
  FailureCheck -->|No| Discard["Discard video<br/>immediately"]
  Queue --> Upload["Upload artifact"]
  Discard --> Cleanup["Cleanup session"]
Loading

Grey Divider

File Changes

1. Video/video_service.py ✨ Enhancement +88/-62

Refactor upload logic and implement retain-on-failure strategy

Video/video_service.py


2. Video/video_ready.py ✨ Enhancement +18/-6

Update video readiness check for event-driven mode

Video/video_ready.py


3. tests/SeleniumTests/__init__.py 🧪 Tests +12/-0

Add test capability and event firing for retain-on-failure

tests/SeleniumTests/init.py


View more (8)
4. .github/workflows/docker-test.yml ⚙️ Configuration changes +15/-0

Add retain-on-failure test matrix configurations

.github/workflows/docker-test.yml


5. ENV_VARIABLES.md 📝 Documentation +4/-4

Document retain-on-failure and updated environment variables

ENV_VARIABLES.md


6. Video/Dockerfile ⚙️ Configuration changes +2/-2

Update default environment variables for retain-on-failure

Video/Dockerfile


7. scripts/generate_list_env_vars/description.yaml 📝 Documentation +13/-11

Update environment variable descriptions in metadata

scripts/generate_list_env_vars/description.yaml


8. scripts/generate_list_env_vars/value.yaml 📝 Documentation +5/-5

Update environment variable default values in metadata

scripts/generate_list_env_vars/value.yaml


9. Makefile ⚙️ Configuration changes +2/-0

Add TEST_RETAIN_ON_FAILURE to test targets

Makefile


10. README.md 📝 Documentation +65/-1

Add comprehensive retain-on-failure usage documentation

README.md


11. tests/docker-compose-v3-event-driven-arm64.yml ⚙️ Configuration changes +2/-4

Update compose file to use new retain-on-failure variables

tests/docker-compose-v3-event-driven-arm64.yml


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown
Contributor

qodo-code-review bot commented Apr 10, 2026

Code Review by Qodo

🐞 Bugs (2)   📘 Rule violations (2)   📎 Requirement gaps (0)   🎨 UX Issues (0)
🐞\ ≡ Correctness (1) ⚙ Maintainability (1)
📘\ ☼ Reliability (2)

Grey Divider


Action required

1. Unquoted $GITHUB_ENV redirection 📘
Description
The new workflow line appends to $GITHUB_ENV without quoting the path, which is brittle and can
break if the path contains spaces or special characters. This violates the requirement to quote
variables/paths in shell scripts to avoid word-splitting/globbing issues.
Code

.github/workflows/docker-test.yml[R259-260]

+          if [ -n "${TEST_RETAIN_ON_FAILURE}" ]; then
+              echo "TEST_RETAIN_ON_FAILURE=${TEST_RETAIN_ON_FAILURE}" >> $GITHUB_ENV
Evidence
PR Compliance ID 2 requires robust shell practices including quoting variables/paths. The added line
writes to $GITHUB_ENV via >> $GITHUB_ENV without quotes, which is explicitly called out as a
failure mode (unquoted paths causing word-splitting).

.github/workflows/docker-test.yml[259-260]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The workflow appends to `GITHUB_ENV` using an unquoted path (`>> $GITHUB_ENV`), which is brittle under shell word-splitting rules.

## Issue Context
GitHub Actions recommends treating `GITHUB_ENV` as a file path; quoting prevents failures if the path ever contains spaces/special characters.

## Fix Focus Areas
- .github/workflows/docker-test.yml[259-261]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Failure event always fired 🐞
Description
SeleniumGenericTests.tearDown() fires a "test:failed" session event whenever
TEST_RETAIN_ON_FAILURE is enabled, even for passing tests, so sessions are always marked failed
and recordings will never be discarded. This makes CI and local runs unable to validate
retain-on-failure behavior and can cause unnecessary retention/uploads.
Code

tests/SeleniumTests/init.py[R143-148]

+            if TEST_RETAIN_ON_FAILURE:
+                self.driver.fire_session_event(
+                    "test:failed",
+                    {"test": f"{self._testMethodName} ({self.__class__.__name__})"},
+                )
            self.driver.quit()
Evidence
The test suite unconditionally emits a failure event in tearDown when TEST_RETAIN_ON_FAILURE is
true, and the video service considers any eventType containing :failed a failure by default,
setting has_failure_event=True which makes SessionState.is_failed return true.

tests/SeleniumTests/init.py[136-148]
Video/video_service.py[166-173]
Video/video_service.py[297-301]
Video/video_service.py[108-115]
Video/video_service.py[812-823]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`tests/SeleniumTests/__init__.py` fires `driver.fire_session_event("test:failed", ...)` in `tearDown()` whenever `TEST_RETAIN_ON_FAILURE` is enabled, regardless of whether the test actually failed. This forces the video service to mark all sessions as failed and prevents retain-on-failure from ever discarding videos for passing sessions.

### Issue Context
The video service treats event types containing `:failed` as failures by default, so emitting that event for successful tests invalidates the new retain-on-failure behavior and produces misleading CI artifacts.

### Fix Focus Areas
- tests/SeleniumTests/__init__.py[136-148]

### Suggested change
Update `tearDown()` to emit the failure session event **only when the test failed/error’d**. In `unittest`, you can inspect `self._outcome` / `self._outcome.result` (failures + errors) to determine whether the current test had an exception, and only then call `fire_session_event("test:failed", ...)` before `quit()`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. TEST_RETAIN_ON_FAILURE unquoted 📘
Description
The workflow passes TEST_RETAIN_ON_FAILURE into a multi-line shell command without quoting, which
can cause unintended word-splitting if the value is ever not a simple token. This is fragile shell
practice under the compliance requirement for robust quoting.
Code

.github/workflows/docker-test.yml[R283-286]

            USE_RANDOM_USER_ID=${{ matrix.use-random-user }} VERSION=${BRANCH} BUILD_DATE=${BUILD_DATE} \
            TEST_FIREFOX_INSTALL_LANG_PACKAGE=${TEST_FIREFOX_INSTALL_LANG_PACKAGE} SELENIUM_ENABLE_MANAGED_DOWNLOADS=${SELENIUM_ENABLE_MANAGED_DOWNLOADS} \
+            TEST_RETAIN_ON_FAILURE=${TEST_RETAIN_ON_FAILURE} \
            make ${{ matrix.test-strategy }}
Evidence
PR Compliance ID 2 requires quoting variables appropriately in shell contexts. The added line
injects ${TEST_RETAIN_ON_FAILURE} into a shell command without quotes, which is a common source of
brittle behavior.

.github/workflows/docker-test.yml[283-286]
Best Practice: Learned patterns

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`TEST_RETAIN_ON_FAILURE` is appended to a shell command line without quotes, risking word-splitting/globbing if its value changes.

## Issue Context
Even if currently set to boolean-like values, quoting is required for robust shell practices and future-proofing.

## Fix Focus Areas
- .github/workflows/docker-test.yml[283-286]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. Misleading capability example text 🐞
Description
README says setting se:retainOnFailure=true will "retain the recording ... regardless of the
global setting", but the surrounding table and later example indicate true means
discard-on-pass/retain-on-failure. This can cause users to set the capability expecting the opposite
behavior.
Code

README.md[R869-873]

+The `se:retainOnFailure` session **capability** overrides the global container env var for a specific session. For example, to retain the recording of a single session regardless of the global setting:
+
+```python
+options.set_capability('se:retainOnFailure', True)
+```
Evidence
The sentence claims the capability retains recordings regardless of global configuration, but the
immediately following table defines se:retainOnFailure=true as retain-on-failure (discard on
pass), and the later code comment explicitly says it discards if the session passes.

README.md[869-880]
README.md[892-895]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The README text introducing `se:retainOnFailure` describes the example as retaining the recording regardless of global setting, but `se:retainOnFailure=true` actually enables retain-on-failure (discard on pass).

### Issue Context
Immediately below, the behavior table and a later code comment describe the correct semantics, so the intro sentence is the inconsistent part.

### Fix Focus Areas
- README.md[869-873]

### Suggested change
Rewrite the sentence to match the semantics, e.g.:
- "...to enable retain-on-failure for a single session regardless of the global setting" (for `true`),
OR change the example to `se:retainOnFailure=false` if the intent is to demonstrate "always retain" overriding a globally-enabled retain-on-failure.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment on lines +259 to +260
if [ -n "${TEST_RETAIN_ON_FAILURE}" ]; then
echo "TEST_RETAIN_ON_FAILURE=${TEST_RETAIN_ON_FAILURE}" >> $GITHUB_ENV
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Unquoted $github_env redirection 📘 Rule violation ☼ Reliability

The new workflow line appends to $GITHUB_ENV without quoting the path, which is brittle and can
break if the path contains spaces or special characters. This violates the requirement to quote
variables/paths in shell scripts to avoid word-splitting/globbing issues.
Agent Prompt
## Issue description
The workflow appends to `GITHUB_ENV` using an unquoted path (`>> $GITHUB_ENV`), which is brittle under shell word-splitting rules.

## Issue Context
GitHub Actions recommends treating `GITHUB_ENV` as a file path; quoting prevents failures if the path ever contains spaces/special characters.

## Fix Focus Areas
- .github/workflows/docker-test.yml[259-261]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment on lines +143 to 148
if TEST_RETAIN_ON_FAILURE:
self.driver.fire_session_event(
"test:failed",
{"test": f"{self._testMethodName} ({self.__class__.__name__})"},
)
self.driver.quit()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

2. Failure event always fired 🐞 Bug ≡ Correctness

SeleniumGenericTests.tearDown() fires a "test:failed" session event whenever
TEST_RETAIN_ON_FAILURE is enabled, even for passing tests, so sessions are always marked failed
and recordings will never be discarded. This makes CI and local runs unable to validate
retain-on-failure behavior and can cause unnecessary retention/uploads.
Agent Prompt
### Issue description
`tests/SeleniumTests/__init__.py` fires `driver.fire_session_event("test:failed", ...)` in `tearDown()` whenever `TEST_RETAIN_ON_FAILURE` is enabled, regardless of whether the test actually failed. This forces the video service to mark all sessions as failed and prevents retain-on-failure from ever discarding videos for passing sessions.

### Issue Context
The video service treats event types containing `:failed` as failures by default, so emitting that event for successful tests invalidates the new retain-on-failure behavior and produces misleading CI artifacts.

### Fix Focus Areas
- tests/SeleniumTests/__init__.py[136-148]

### Suggested change
Update `tearDown()` to emit the failure session event **only when the test failed/error’d**. In `unittest`, you can inspect `self._outcome` / `self._outcome.result` (failures + errors) to determine whether the current test had an exception, and only then call `fire_session_event("test:failed", ...)` before `quit()`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@VietND96 VietND96 force-pushed the retain-on-failure branch 2 times, most recently from 639ff8d to f606d8c Compare April 11, 2026 04:28
…ilities

Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
@VietND96 VietND96 force-pushed the retain-on-failure branch from f606d8c to a5fee57 Compare April 11, 2026 18:03
@VietND96 VietND96 merged commit 03f439f into trunk Apr 12, 2026
56 of 57 checks passed
@VietND96 VietND96 deleted the retain-on-failure branch April 12, 2026 04:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant