Merged
40 changes: 28 additions & 12 deletions content/en/tests/flaky_management/_index.md
@@ -19,24 +19,24 @@ further_reading:

## Overview

The [Flaky Tests Management][1] page provides a centralized view to track, triage, and remediate flaky tests across your organization. You can view every test's status along with key impact metrics like number of pipeline failures, CI time wasted, and failure rate.
The [Flaky Tests Management][1] page provides a centralized view to track, triage, and remediate flaky tests across your organization. You can view every test's state along with key impact metrics like number of pipeline failures, CI time wasted, and failure rate.

From this UI, you can act on flaky tests to mitigate their impact. Quarantine or disable problematic tests to keep known flakes from breaking builds, and create cases and Jira issues to track work toward fixes.

{{< img src="tests/flaky_management-2.png" alt="Overview of the Flaky Tests Management UI" style="width:100%;" >}}

## Change a flaky test's status
## Change a flaky test's state

Use the status drop-down to change how a flaky test is handled in your CI pipeline. This can help reduce CI noise while retaining traceability and control. Available statuses are:
Use the state drop-down to change how a flaky test is handled in your CI pipeline. This can help reduce CI noise while retaining traceability and control. Available states are:

| Status | Description |
| State | Description |
| ----------- | ----------- |
| **Active** | The test is known to be flaky and is running in CI. |
| **Quarantined** | Keep the test running in the background, but failures don't affect CI status or break pipelines. This is useful for isolating flaky tests without blocking merges. Datadog tags test run events with `@test.test_management.is_quarantined:true` when quarantined. |
| **Disabled** | Skip the test entirely in CI. Use this when a test is no longer relevant or needs to be temporarily removed from the pipeline. Datadog tags test run events with `@test.test_management.is_disabled:true` when disabled. |
| **Fixed** | The test has passed consistently and is no longer flaky. If supported, use the [remediation flow](#confirm-fixes-for-flaky-tests) to confirm the fix and automatically apply this status after it is merged into the default branch. |
| **Fixed** | The test has passed consistently and is no longer flaky. If supported, use the [remediation flow](#confirm-fixes-for-flaky-tests) to confirm the fix and automatically apply this state after it is merged into the default branch. |

<div class="alert alert-info">Status actions have minimum version requirements for each programming language's instrumentation library. See <a href="#compatibility">Compatibility</a> for details.</div>
<div class="alert alert-info">State actions have minimum version requirements for each programming language's instrumentation library. See <a href="#compatibility">Compatibility</a> for details.</div>
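
The state table above can be read as a small decision model. The following is a minimal sketch, not Datadog's implementation: the `FlakyState` enum and the three helper functions are hypothetical names introduced for illustration, and treating `Fixed` like a normally running test is an assumption. Only the state names and the two event tags come from the table.

```python
from enum import Enum


class FlakyState(Enum):
    """The four states listed in the table (hypothetical enum)."""
    ACTIVE = "Active"
    QUARANTINED = "Quarantined"
    DISABLED = "Disabled"
    FIXED = "Fixed"


def should_run(state: FlakyState) -> bool:
    """Disabled tests are skipped entirely; every other state still runs."""
    return state is not FlakyState.DISABLED


def failure_breaks_pipeline(state: FlakyState) -> bool:
    """Quarantined tests run, but their failures don't affect CI status.

    Assumption: a Fixed test behaves like any normally running test.
    """
    return should_run(state) and state is not FlakyState.QUARANTINED


def event_tags(state: FlakyState) -> dict[str, str]:
    """Tags the table says are added to test run events for each state."""
    if state is FlakyState.QUARANTINED:
        return {"@test.test_management.is_quarantined": "true"}
    if state is FlakyState.DISABLED:
        return {"@test.test_management.is_disabled": "true"}
    return {}
```

For example, `failure_breaks_pipeline(FlakyState.QUARANTINED)` is `False` while `should_run(FlakyState.QUARANTINED)` is `True`, which is the "runs in the background without blocking merges" behavior described for quarantined tests.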

## Configure policies to automate the flaky test lifecycle

@@ -61,7 +61,7 @@ Configure automated Flaky Test Policies to govern how flaky tests are handled in
<p>Toggle to allow flaky tests to be quarantined for this repository.</p>
<p>Customize automation rules based on:</p>
<ul>
<li><strong>Time</strong>: Quarantine a test if its status is <code>Active</code> for a specified number of days. The rule is triggered every day at 12:15 UTC.</li>
<li><strong>Time</strong>: Quarantine a test if its state is <code>Active</code> for a specified number of days. The rule is triggered every day at 12:15 UTC.</li>
<li><strong>Branch</strong>: Quarantine an <code>Active</code> test if it flakes in one or more specified branches.</li>
<li><strong>Failure rate</strong>: Quarantine an <code>Active</code> test if its failure rate over the last 7 days is greater than or equal to the specified threshold. The rule is triggered every 15 minutes.</li>
</ul>
@@ -73,7 +73,7 @@ Configure automated Flaky Test Policies to govern how flaky tests are handled in
<p>Toggle to allow flaky tests to be disabled for this repository. You may want to do this after quarantining or to protect specific branches from flakiness.</p>
<p>Customize automation rules based on:</p>
<ul>
<li><strong>Status and time</strong>: Disable a test if it has a specified status for a specified number of days. The rule is triggered every day at 12:30 UTC.</li>
<li><strong>State and time</strong>: Disable a test if it has a specified state for a specified number of days. The rule is triggered every day at 12:30 UTC.</li>
<li><strong>Branch</strong>: Disable an <code>Active</code> or <code>Quarantined</code> test if it flakes in one or more specified branches.</li>
<li><strong>Failure rate</strong>: Disable an <code>Active</code> or <code>Quarantined</code> test if its failure rate over the last 7 days is greater than or equal to the specified threshold. The rule is triggered every 15 minutes.</li>
</ul>
@@ -85,7 +85,7 @@ Configure automated Flaky Test Policies to govern how flaky tests are handled in
</tr>
<tr>
<td><strong>Fixed</strong></td>
<td>If a flaky test no longer flakes for 30 days, it is automatically moved to Fixed status. This automation is default behavior and can't be customized.</td>
<td>If a flaky test no longer flakes for 30 days, it is automatically moved to the Fixed state. This automation is default behavior and can't be customized.</td>
</tr>
</tbody>
</table>
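
The quarantine rules in the table can be sketched as a single predicate. This is an illustrative sketch under stated assumptions, not how Datadog evaluates policies: `FlakyTestSnapshot`, `QuarantinePolicy`, and `should_quarantine` are hypothetical names, the rules are assumed to combine with OR, and "for a specified number of days" is assumed to mean "at least that many days". The `Active` precondition and the "greater than or equal to" comparison on the 7-day failure rate come from the rule descriptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlakyTestSnapshot:
    """Hypothetical view of a test at the moment a rule is evaluated."""
    state: str                      # "Active", "Quarantined", "Disabled", or "Fixed"
    days_in_state: int              # whole days spent in the current state
    failure_rate_7d: float          # failure rate over the last 7 days, 0.0 to 1.0
    flaked_on_branches: frozenset[str]


@dataclass
class QuarantinePolicy:
    """Hypothetical container for the three customizable rules."""
    max_active_days: Optional[int] = None           # Time rule
    branches: frozenset[str] = frozenset()          # Branch rule
    failure_rate_threshold: Optional[float] = None  # Failure rate rule


def should_quarantine(test: FlakyTestSnapshot, policy: QuarantinePolicy) -> bool:
    """Return True if any configured rule matches an Active test."""
    if test.state != "Active":
        return False
    if policy.max_active_days is not None and test.days_in_state >= policy.max_active_days:
        return True
    if policy.branches & test.flaked_on_branches:
        return True
    if (
        policy.failure_rate_threshold is not None
        and test.failure_rate_7d >= policy.failure_rate_threshold
    ):
        return True
    return False
```

A disable policy would follow the same shape, with the precondition widened to `Active` or `Quarantined` for the branch and failure rate rules.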
@@ -128,7 +128,7 @@ When you fix a flaky test, Test Optimization's remediation flow can confirm the
- If all retries pass, marks the fix as **in progress** in the Flaky Tests Management UI, associates it with the branch used for the fix, and waits for that branch to be merged.
- Tags the last test retry with `@test.test_management.attempt_to_fix_passed:true` in test run events.
- Starts a 14-day [grace period](#grace-period-mechanism) to give time for the fix to propagate everywhere in the repository.
- If any retry fails, keeps the test's current status (`Active`, `Quarantined`, or `Disabled`).
- If any retry fails, keeps the test's current state (`Active`, `Quarantined`, or `Disabled`).
- Tags the last test retry with `@test.test_management.attempt_to_fix_passed:false` in test run events.
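
The pass/fail branching above can be summarized in a few lines. This is a minimal sketch, not the remediation flow's actual code: `FixAttemptOutcome` and `evaluate_fix_attempt` are hypothetical names, and leaving the state unchanged while a fix is in progress is an assumption based on the `Fixed` state being applied only after the merge. The tag name and the 14-day grace period come from the list.

```python
from dataclasses import dataclass

ATTEMPT_TAG = "@test.test_management.attempt_to_fix_passed"
GRACE_PERIOD_DAYS = 14


@dataclass
class FixAttemptOutcome:
    """Hypothetical summary of one remediation attempt."""
    fix_in_progress: bool    # True only if every retry passed
    state: str               # the test's state after the attempt
    last_retry_tag: str      # tag applied to the last retry's test run event
    grace_period_days: int   # 0 when no grace period is started


def evaluate_fix_attempt(current_state: str, retry_results: list[bool]) -> FixAttemptOutcome:
    """Apply the all-retries-pass rule to a list of retry results."""
    if not retry_results:
        raise ValueError("at least one retry result is required")
    passed = all(retry_results)
    return FixAttemptOutcome(
        fix_in_progress=passed,
        # Assumption: the state stays as-is in both branches; Fixed is
        # applied later, once the fix branch is merged.
        state=current_state,
        last_retry_tag=f"{ATTEMPT_TAG}:{str(passed).lower()}",
        grace_period_days=GRACE_PERIOD_DAYS if passed else 0,
    )
```

A single failing retry is enough to take the second branch: `evaluate_fix_attempt("Quarantined", [True, False, True])` reports no fix in progress and tags the last retry with `attempt_to_fix_passed:false`.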

### Track fixes that are in progress
@@ -201,9 +201,25 @@ Flaky Tests Management uses AI to automatically assign a root cause category to

## Receive notifications

Set up notifications to track changes to your flaky tests. Whenever a user or a policy changes the state of a flaky test, a message is sent to your selected recipients. You can send notifications to email addresses or Slack channels (see the [Datadog Slack integration][5]), and route messages based on test code owners. If no code owners are specified, all selected recipients are notified of all flaky test changes in the repository. Configure notification for each repository from the [**Flaky Test Policies**][13] page in Software Delivery settings.
Set up notifications to track changes to your flaky tests. Notifications are sent when:
- A new flaky test is detected on the default branch of the repository.
- A user or policy changes the state of a flaky test.
- The remediation flow for a flaky test succeeds or fails.

Notifications are not sent immediately; they are batched every few minutes to reduce noise.
You can send notifications to email addresses or Slack channels (see the [Datadog Slack integration][5]), and route messages based on test code owners. When multiple code owners are specified, a flaky test must be owned by all specified code owners for the notification rule to match. If no code owners are specified, all selected recipients are notified of all flaky test changes in the repository. Configure notifications for each repository from the [**Flaky Test Policies**][13] page in Software Delivery settings.
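
The code owner matching described above amounts to a subset check. The following sketch assumes a hypothetical rule shape (a set of code owners plus a list of recipients); `rule_matches` and `recipients_for` are illustrative names, not part of any Datadog API.

```python
def rule_matches(rule_owners: set[str], test_owners: set[str]) -> bool:
    """Return True if a notification rule applies to a test.

    With no code owners on the rule, every flaky test change matches.
    Otherwise the test must be owned by all of the rule's code owners.
    """
    if not rule_owners:
        return True
    return rule_owners <= test_owners


def recipients_for(test_owners: set[str], rules: list[tuple[set[str], list[str]]]) -> list[str]:
    """Collect recipients from every matching (owners, recipients) rule, without duplicates."""
    seen: list[str] = []
    for rule_owners, recipients in rules:
        if rule_matches(rule_owners, test_owners):
            for recipient in recipients:
                if recipient not in seen:
                    seen.append(recipient)
    return seen
```

Under this reading, a rule listing `@team-a` and `@team-b` does not match a test owned only by `@team-a`, because the test is not owned by every code owner on the rule.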

### Notification types

| Notification type | Description |
|---|---|
| **New flaky test detected** | A new flaky test is detected on the default branch of the repository. |
| **Test quarantined** | A test is quarantined by an automated policy rule (time-based, branch-based, or failure-rate-based). |
| **Test disabled** | A test is disabled by an automated policy rule (time-based, branch-based, or failure-rate-based). |
| **Fix successful** | A test passes all retries in the remediation flow and is marked as "fix in progress". |
| **Fix failed** | A test fails during the remediation flow. |
| **Manual state change** | A user manually changes the state of a flaky test. |

{{< img src="tests/flaky_management_notifications_settings-2.png" alt="Notifications settings UI" style="width:100%;" >}}

2 changes: 1 addition & 1 deletion content/en/tests/guides/setup_new_flaky_pr_gate.md
@@ -139,4 +139,4 @@ For more information, see the GitHub documentation for [status checks][11].
[9]: /tests/flaky_management
[10]: /tests/setup/junit_xml/
[11]: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/collaborating-on-repositories-with-code-quality-features/about-status-checks
[12]: /tests/flaky_management/#change-a-flaky-tests-status
[12]: /tests/flaky_management/#change-a-flaky-tests-state