checkly · serverless-mom · May 23, 2025 · May 17, 2025 · May 23, 2025 · May 23, 2025
@@ -203,7 +203,7 @@ This solution is fairly situational, but it’s worth considering if the modal y
 
 For example, if you’re checking modal details, but the next step is to close the modal and move elsewhere on the page, consider having the test enter a [degraded state](/docs/browser-checks/degraded-state/) rather than failing.
 
-A full walkthrough of the code changes is [on our documentation site](/docs/browser-checks/degraded-state/), but suffice to say that with soft assertions you can have a check enter a ‘yellow’ state on the Checkly dashboard without triggering the same alerts as a failing check. This is perfect for performance issues that are intermittent, and would otherwise cause constant downtime alerts.
+A full walkthrough of the code changes is [on our documentation site](/docs/browser-checks/degraded-state/), but suffice to say that with soft assertions you can have a check enter a ‘yellow’ state on the Checkly dashboard without triggering the same alerts as a failing check. This is perfect for performance issues that are intermittent, and would otherwise cause constant [downtime alerts](https://www.checklyhq.com/learn/incidents/detection-best-practices/).
 
 To see a demo of how this state works in Checkly, take a look at Stefan’s tutorial video:
 

@@ -0,0 +1,152 @@
+---
+title: Why did my check fail? A guide to diagnosing failures
+displayTitle: Why did my Check Fail? A guide to diagnosing failures
+description: >-
+  When you get a notification of a failed check, you can often determine the cause right in the Checkly dashboard. This guide covers some the most common failure modes and where to find answers.
+author: Nočnica Mellifera
+avatar: 'images/avatars/nica-mellifera.png'
+tags:
+  - FAQ
+---
+
+Checkly warns you of failures before they’re reported by users. With well-timed alerts you can find out minutes or hours before you get your first support ticket that something is broken. But once we’ve gotten a notification of a failure, how do we do the first step of root cause analysis: diagnosing the cause of the failure?
+
+Browser checks, especially, involve multiple systems that have to ‘work’ to pass a check, how do we find out if network issues, script errors, or a garbled database row, were possible causes? This article shows you how to go from an alert to the first part of repair: finding the error.
+
+## Diagnosing a problem based on the pattern: interpreting a check dashboard
+
+Whenever you’re working a technical incident, it’s tempting to go to the most detailed information first: checking traces from individual checks is a bit like starting incident investigation in application logs: the answer is in there, but there’s a lot of information to sift through.
+
+How can we look at high-level dashboards showing multiple check runs, and start to diagnose possible causes?
+
+To emphasize: none of the patterns mentioned here are proof positive, and you should always take the second step of looking into the check’s traces. That detailed analysis is covered in the second half of this guide.
+
+### Identifying a growing problem
+
+When you’re looking at the pattern of failures, one question you’d like to answer is: ‘did [x] code release cause this failure?’ To answer this quickly, look for a pattern like the one below, where a small problem grows with time:
+
+![a Checkly dashboard](/guides/images/reading-traces-01.png)
+
+
+*In this case it looks like a problem has been getting worse for a while, with slow performance causes ‘degraded’ results before checks stop passing entirely*
+
+In this case we can see that degraded results started on or about April 24th. Even though the alerts started a few days later, it’s likely the problem has been growing since late April. Again, this isn’t proof positive since all those results were only ‘degraded,’ a good next step would be to check the resource use of the services. This might be something like a memory leak that’s getting worse with time!
+
+### The comb: failures with set intervals
+
+In the chart below, this check is currently passing, but an oddly regular pattern of failures has been happening for some time. This indicates a common failure mode:
+
+
+![a Checkly dashboard](/guides/images/reading-traces-02.png)
+
+*This regular failure that never affects all checks look like it might be a failure in one region.*
+
+What would cause some, but not all, checks to fail with some regularity? Checking the list of failure locations, we see that all the checks are coming from one place.
+
+
+![a Checkly dashboard](/guides/images/reading-traces-03.png)
+
+*Sure enough one geographic region is failing.*
+
+In this case geo-fencing, localization or some other geographic specific setting is causing only one check location to fail. A few settings to detect this issue more reliably:
+
+If you’re often getting failures in one region that’s critical to your business, consider increasing check frequency or setting your checks to ‘parallel runs’ meaning that ever selected geographic region every time 
+
+
+![a Checkly dashboard](/guides/images/reading-traces-04.png)
+
+*Parallel vs. round-robin checks, each set with a five-minute interval.*
+
+Second, if you’ve selected round-robin checks but still want to detect single-region failures quickly, set your retry strategy to always retry from the same region.
+
+
+![a Checkly dashboard](/guides/images/reading-traces-05.png)
+
+*Now a failure in Cape Town will run all its check attempts from Cape Town*
+
+To read further about retries, regions, and timing, check out how to [optimize your MTTD with the right check frequency](https://www.checklyhq.com/blog/check-frequency/).
+
+## Diving in and finding the problem: beyond the video
+
+Let’s start with a simple example of a failing test. In this case my web shop’s category view is supposed to look like this:
+
+
+![a Checkly dashboard](/guides/images/reading-traces-06.png)
+
+*A user expects to be able to find top sellers and hidden gems in the category view.*
+
+But now, my check on this page is failing, and I’d like to figure out why. The pattern is sudden and complete, with all checks failing, starting a few hours ago.
+
+### We all start by checking the video
+
+Reader, I’ll level with you: the purpose of this article is to get you to do more than checking the video on a failed check. But before I offer more routes to check I wish to acknowledge *we all start by checking the video.* A video has the great benefit of showing both the failure and the ‘normal’ behavior of the check up until that point, it reminds us what page we were loading and there’s a good chance that the failure is visual enough to show up right away. 
+
+
+![a Checkly dashboard](/guides/images/reading-traces-07.png)
+
+*In this example, the gap between the video and what we expect in the screenshot above is significant, the ‘bottom sellers’ section of the page is missing entirely.*
+
+So, of course, we’re all going to check the video capture early in the troubleshooting process. But while the video is nice for finding glaring issues but doesn’t offer much detail beyond that. To know why this page doesn’t look right, we need to go a few steps beyond.
+
+### What loaded and what didn’t: the network tab
+
+The network tab is often a good next step when the video didn’t have our answer: without loading the full trace it’s a quick way to see if anything failed to load or loaded very slowly.
+
+
+![a Checkly dashboard](/guides/images/reading-traces-08.png)
+
+*The network tab shows the same details you’d get from a browser’s dev tools*
+
+In this case, the only failed kit was a fontawesome library, which seems unlikely to break a page component, so it’s time to move on!
+
+### Check each retry - viewing past attempts
+
+If you’re investigating a check with extremely strange or stochastic results, you might wonder ‘this check retried twice before it reported failure, how did those other check runs go?’ We know since we got an alert the other check runs also failed, but to go deeper just use the drop-down on the top left of the check run viewer.
+
+
+![a Checkly dashboard](/guides/images/reading-traces-09.png)
+
+You can select each attempt in the drop-down.
+
+### Try the new AI explainer
+
+When trying to interpret a failure, especially during a late night incident response process, it’s helpful to get a summary of how this failure affects users, written in plain English. Translating technical error messages into a more readable format is something that language models excel at, so you may want to try out our AI Impact Analysis tool. For this failure, the analysis looks pretty spot on: users can’t see a section labelled ‘bottom sellers’
+
+
+![a Checkly dashboard](/guides/images/reading-traces-09a.png)
+
+AI Impact analysis is a beta feature, which you can switch on in your account settings.
+
+
+![a Checkly dashboard](/guides/images/reading-traces-10.png)
+
+While the AI explainer is helpful in figuring out the scope of the problem, to go further we really need the big guns: the full trace of this test session.
+
+### Pro tools: the full trace
+
+How many times has a user reported a failure somewhere on your site, and you’ve wished they could send a full trace from their dev tools? In some support teams, asking for this browser trace is a common first step in the diagnosis process. For problems detected early with Checkly, the same detailed information is available as soon as you get an alert. Click the trace button to get the whole story.
+
+
+![a Checkly dashboard](/guides/images/reading-traces-11.png)
+
+*The magic happens under the ‘trace’ button.*
+
+The full trace will load, let’s step through some of the information on the trace view:
+
+![a Checkly dashboard](/guides/images/reading-traces-12.png)
+
+1. The test is broken down into components, and you can scroll through and see what had loaded as each test was happening.
+2. Screenshots of the page as loads and formats. You can check a screenshot for each assertion.
+3. The network tab that was so useful previously is available here.
+4. The console, errors, and log messages from the browser
+5. The Call tab which shows exactly how the current playwright call was interpreted. This is so interesting it deserves its own screenshot:
+
+![a Checkly dashboard](/guides/images/reading-traces-13.png)
+
+If you notice that the page took a second or two to fully load, and you’re worried it caused the `expect.toBeVisible` call to fail: probably not! Remember that [Playwright’s auto-waiting](https://www.checklyhq.com/learn/playwright/waits-and-timeouts/) ensures that the element will be found as soon as it loads, and your test won’t fail just because of a flash of unformatted content.
+
+### **Conclusions**
+
+Effective incident diagnosis begins with understanding the patterns in your check failures before diving into detailed traces. Starting with the video replay is natural, as it provides immediate visual context, but the real answers often lie in the network logs, console errors, and full traces. 
+
+Ultimately, Checkly equips you with the data needed to move from alert to resolution efficiently. By combining pattern recognition with detailed trace analysis, you can diagnose issues faster, reduce downtime, and maintain a seamless user experience. For further optimization, explore [fine-tuning check frequencies](https://www.checklyhq.com/blog/check-frequency/) and retry strategies to minimize detection time—because the sooner you find the problem, the sooner you can fix it.
@@ -23,7 +23,9 @@ In short, we’re all aware that the software development life cycle (SDLC) ofte
 Ideally development should only be getting requirements from planning and design, with any feedback (issues that need fixing) coming in the QA stages before deployment to production. In reality a huge list of failures can result in developers having to fix failures and re-engineer code to keep production running. There are two significant concerns that a more unified testing and monitoring approach can help us address:
 
 1. When bugs are found post deployment, feedback is often given without full context, with the first reports going to someone other than the developer.
-2. Any fixes have to be delivered out-of-band of the normal development cycle, either coded in haste during incident response, or written and deployed during time that was supposed to be spent on new features.
+
+2. Any fixes have to be delivered out-of-band of the normal development cycle, either coded in haste during [incident response](https://www.checklyhq.com/learn/incidents/what-is-incident-response/), or written and deployed during time that was supposed ot be spent on new features.
+
 
 By unifying around Checkly, **feedback is delivered directly to developers**, with developers working on their own tests, which then offer direct feedback with all the context that product teams need to diagnose root causes. The many unforeseen failures caused by the interaction of application code and the production environment should be monitored with **the same quality of tests that scruitinized our code pre-deployment.**
 

@@ -40,7 +40,7 @@ As cloud architectures expand into greater complexity, broadening our search for
 
 ### Key Improvements in Modern RCA:
 
-1. **Automated Correlation** – With distributed tracing and context propagation, we shouldn’t be trying to line up time codes to correlate observability data from multiple services
+1. **Automated Correlation** – With distributed tracing and [context propagation](https://www.checklyhq.com/learn/opentelemetry/context-propagation/), we shouldn’t be trying to line up time codes to correlate observability data from multiple services
 2. **Context-Aware Alerts** – Alerts prioritize issues impacting business metrics (like checkout errors vs. low-severity logs). Sophisticated routing and status updates on tools like Rootly mean only the right team members get woken up during incidents.
 3. **Proactive Detection** – Once solely used for end-to-end testing, synthetic user testing is now available to development and operations teams. Any activity frequently performed by users can be performed in automatic test sessions with deep assertions to make sure the responses look correct.
 

@@ -55,7 +55,7 @@ Here’s how to set up centralized logging in Kubernetes with OpenTelemetry:
 
 ### Advantages of OpenTelemetry Logging in Kubernetes
 
-By adopting OpenTelemetry for centralized logging, Kubernetes administrators gain several benefits:
+By adopting [OpenTelemetry for centralized logging](https://www.checklyhq.com/learn/opentelemetry/logs/), Kubernetes administrators gain several benefits:
 
 - **Unified Telemetry Stack**: OpenTelemetry consolidates logging, tracing, and metrics collection, eliminating the need for multiple agents and configuration files. This simplifies management and reduces potential points of failure.
 - **Flexibility in Log Destination**: Logs can be sent to multiple backends, allowing organizations to maintain control over their data while still meeting specific requirements, such as long-term storage or compliance.

@@ -159,7 +159,7 @@ Entry-level monitoring tools often focus on surface-level insights, such as erro
 
 **Ensuring Full-Stack Visibility**
 
-Frontend monitoring alone cannot provide a complete view of an application’s health. Many issues arise at the intersection of the frontend and backend, requiring integrated observability across the entire stack. Without this integration, teams risk spending excessive time proving whether an issue is frontend- or backend-related. Implementing context propagation and unified tracing, such as through OpenTelemetry, helps connect frontend events to backend processes, enabling a more holistic understanding of system behavior and streamlining troubleshooting efforts. For more detail on connecting backend OpenTelemetry traces with frontend performance see how [Checkly Traces connects data from across your stack](https://www.checklyhq.com/docs/traces-open-telemetry/how-it-works/).
+Frontend monitoring alone cannot provide a complete view of an application’s health. Many issues arise at the intersection of the frontend and backend, requiring integrated observability across the entire stack. Without this integration, teams risk spending excessive time proving whether an issue is frontend- or backend-related. Implementing [context propagation](https://www.checklyhq.com/learn/opentelemetry/context-propagation/) and unified tracing, such as through OpenTelemetry, helps connect frontend events to backend processes, enabling a more holistic understanding of system behavior and streamlining troubleshooting efforts. For more detail on connecting backend [OpenTelemetry traces](https://www.checklyhq.com/learn/opentelemetry/traces/) with frontend performance see how [Checkly Traces connects data from across your stack](https://www.checklyhq.com/docs/traces-open-telemetry/how-it-works/).
 
 ## Top frontend monitoring tools
 

@@ -73,7 +73,7 @@ The best practices for infrastructure management could fill a whole book, but at
 
 ## Error Rate
 
-Error rate is a critical APM metric that quantifies the frequency of failed requests or transactions within an application, expressed as a percentage of total requests. It captures HTTP errors (e.g., 4xx client errors, 5xx server errors), timeouts, exceptions, and can cover business logic failures (e.g., payment declines, API call failures). This metric is often segmented by endpoint, service, or dependency to pinpoint failure hotspots. A high or rising error rate may indicate code defects, infrastructure instability (e.g., database overload), or third-party service degradation. Engineering teams use error rate alongside logs and distributed tracing to diagnose root causes. In SLOs (Service Level Objectives), error rate thresholds (e.g., <0.1%) help enforce reliability targets. Proactive monitoring and alerting on error rate spikes enable faster incident response, reducing MTTR (Mean Time to Repair) and minimizing user impact.
+Error rate is a critical APM metric that quantifies the frequency of failed requests or transactions within an application, expressed as a percentage of total requests. It captures HTTP errors (e.g., 4xx client errors, 5xx server errors), timeouts, exceptions, and can cover business logic failures (e.g., payment declines, API call failures). This metric is often segmented by endpoint, service, or dependency to pinpoint failure hotspots. A high or rising error rate may indicate code defects, infrastructure instability (e.g., database overload), or third-party service degradation. Engineering teams use error rate alongside logs and distributed tracing to diagnose root causes. In SLOs (Service Level Objectives), error rate thresholds (e.g., <0.1%) help enforce reliability targets. Proactive monitoring and alerting on error rate spikes enable faster [incident response](https://www.checklyhq.com/learn/incidents/what-is-incident-response/), reducing MTTR (Mean Time to Repair) and minimizing user impact.
 
 One of the things that makes a metric a key metric is that without observing it, you can have your application fail silently. For example if our frontend is rendering our site beautifully, but the site content is an error message, the response time may look fine, throughput may be fine (or even up since users are hitting reload repeatedly) and our infrastructure metrics may not show a problem. But clearly, this error message should be spiking our application’s error rate.