
Adaptive Sampling Support #576

Merged
majanjua-amzn merged 2 commits into main from adaptive-sampling
Jan 27, 2026

Conversation

@majanjua-amzn
Contributor

@majanjua-amzn majanjua-amzn commented Jan 13, 2026

Background

AWS X-Ray sampling rules now support adaptive sampling configurations. As part of that effort, the ADOT SDKs must be updated to ingest the new fields and to make the functional changes needed to [1] boost the sampling rate based on detected anomalies and [2] detect and capture anomalies based on a configuration local to the SDK.

The same changes have been made for the ADOT Java SDK here in the upstream OTel Contrib repo: open-telemetry/opentelemetry-java-contrib#2147

Overall, the goal of this PR is to meet the same needs as the one in ADOT Java with some additional improvements:

  • Appropriately generate anomaly statistics, send them through GetSamplingTargets, and adjust sampling behaviour according to the response (sampling boost)
  • Read configuration local to the SDK to allow users to set the definition of anomalies in their system (status code, operation, and/or latency)
  • Capture anomalies based on the local configuration
  • [NEW/REQUIRED] General sampling statistics improvements: Do not call GetSamplingTargets if there are no sampling or anomaly statistics
  • [NEW/REQUIRED] Added a new attribute aws.xray.adaptive_sampling_configured to identify spans that were generated from an SDK with a local adaptive sampling configuration
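
The "do not call GetSamplingTargets when there is nothing to report" rule can be illustrated with a minimal sketch. This is not the code from this PR: `RuleStatistics`, `build_targets_request`, and the dictionary keys are hypothetical names chosen for illustration, and the real request documents may be shaped differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RuleStatistics:
    """Hypothetical per-rule counters collected since the last poll."""

    request_count: int = 0
    sample_count: int = 0
    anomaly_count: int = 0

    def is_empty(self) -> bool:
        # A rule with no matched requests and no anomalies has nothing to report.
        return self.request_count == 0 and self.anomaly_count == 0


def build_targets_request(stats_by_rule: Dict[str, RuleStatistics]) -> Optional[List[dict]]:
    """Return the documents to send, or None when the API call can be skipped."""
    documents = [
        {
            "rule_name": name,
            "request_count": stats.request_count,
            "sample_count": stats.sample_count,
            "anomaly_count": stats.anomaly_count,
        }
        for name, stats in stats_by_rule.items()
        if not stats.is_empty()
    ]
    # An empty list means no rule saw traffic or anomalies: skip the call.
    return documents or None
```

A caller would check the return value for `None` and only then invoke the targets API, so idle services do not poll with empty payloads.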

Changes

  • Linked sampler, processor, and exporter such that when a span is ended, the processor forwards it to adaptive sampling code and the adaptive sampling logic determines when to capture anomalies, doing so using the exporter
  • Implemented anomaly detection logic (same as in ADOT Java)
    • Implemented local SDK configuration parsing logic (YAML)
  • Implemented the usage of a cache for keeping trace IDs so we don't recount the same traces for anomaly statistics
  • Fixed sampling statistics to count only the root span of a trace, completely eliminating the need to report statistics for downstream services
  • Implemented GetSamplingTargets skipping logic when there are no sampling statistics or sampling boost statistics
  • Unit tests and related files
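
The trace ID cache mentioned above can be sketched as a bounded set that reports whether a trace has been seen before. This is an illustration only: `SeenTraceCache` and `AnomalyCounter` are hypothetical names, and the size-based eviction is an assumption (the actual implementation may instead expire entries by time).

```python
from collections import OrderedDict


class SeenTraceCache:
    """Bounded, insertion-ordered set of trace IDs (hypothetical helper)."""

    def __init__(self, max_size: int = 1000) -> None:
        self._max_size = max_size
        self._seen: "OrderedDict[int, None]" = OrderedDict()

    def add_if_new(self, trace_id: int) -> bool:
        """Return True the first time a trace ID is seen, False afterwards."""
        if trace_id in self._seen:
            return False
        self._seen[trace_id] = None
        if len(self._seen) > self._max_size:
            # Assumption: evict the oldest entry once the cache is full.
            self._seen.popitem(last=False)
        return True


class AnomalyCounter:
    """Counts each anomalous trace once, however many of its spans are anomalous."""

    def __init__(self, cache: SeenTraceCache) -> None:
        self._cache = cache
        self.anomaly_count = 0

    def record_anomaly(self, trace_id: int) -> None:
        if self._cache.add_if_new(trace_id):
            self.anomaly_count += 1
```

With this shape, several anomalous spans that share one trace ID contribute a single anomaly to the statistics.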

Testing

  • Unit tests for each component (maintaining and increasing the code coverage)
  • Rigorous manual E2E tests using 3 services, A (root) -> B -> C (generates anomalies):
    • Tested basic anomaly detection without any local configuration, where service C generates a 500 response: Appropriately detects and captures anomalies + responds to boost sent by server
    • Tested with local configuration with errorCodeRegex: "^500|501$" where service C generates a 500 response: Appropriately detects and captures anomalies + responds to boost sent by server
    • Tested with local configuration with errorCodeRegex: "^500|501$", operations: ["GET /status"], where service C generates a 500 response from /status/c/500: Appropriately detects and captures anomalies + responds to boost sent by server
    • Tested with local configuration with errorCodeRegex: "^500|501$", highLatencyMs: 2000, where service C generates a 3 second span with 200 or 500 response: Appropriately not treated as an anomaly when 200, and as an anomaly when 500
    • Tested local configuration with highLatencyMs: 2000, where service C generates a 3 second span with 200 response: Appropriately treated as an anomaly
    • Tested that the anomaly counts, request and total counts, etc, are all correct relative to the number of API invocations done in the last 10 seconds
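
The behaviour observed in these manual tests is consistent with treating every configured criterion as required (a logical AND). The sketch below encodes that reading and is not the SDK's implementation: `AnomalyCondition` and `is_anomaly` are hypothetical names, the operation is assumed to be an already-derived name such as "GET /status", and the use of `re.search` is an assumption about how the regex is applied.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnomalyCondition:
    """Hypothetical in-memory form of one local anomaly condition."""

    error_code_regex: Optional[str] = None
    operations: List[str] = field(default_factory=list)
    high_latency_ms: Optional[int] = None


def is_anomaly(cond: AnomalyCondition, status_code: int, operation: str, duration_ms: float) -> bool:
    """Return True only if the span satisfies every criterion the condition sets."""
    # A condition with neither an error regex nor a latency threshold matches nothing.
    if cond.error_code_regex is None and cond.high_latency_ms is None:
        return False
    # Assumption: operations are compared by exact name.
    if cond.operations and operation not in cond.operations:
        return False
    # Assumption: the regex is applied with re.search to the status code as text.
    if cond.error_code_regex is not None and not re.search(cond.error_code_regex, str(status_code)):
        return False
    if cond.high_latency_ms is not None and duration_ms < cond.high_latency_ms:
        return False
    return True
```

Under this reading, a 3-second span with a 200 response is an anomaly when only `high_latency_ms` is set, but not when an error-code regex that excludes 200 is also configured.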

Links

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@majanjua-amzn majanjua-amzn self-assigned this Jan 13, 2026
@majanjua-amzn majanjua-amzn requested a review from a team as a code owner January 13, 2026 01:18
@majanjua-amzn majanjua-amzn added the bug, enhancement, and python labels Jan 13, 2026
@majanjua-amzn majanjua-amzn force-pushed the adaptive-sampling branch 8 times, most recently from d9a38a2 to a39c57b Compare January 14, 2026 01:22
wangzlei pushed a commit that referenced this pull request Jan 14, 2026
### Background
Recently, a new field was added to the X-Ray GetSamplingRules API that
was not accounted for in the AWS X-Ray Remote Sampler implementation
done in ADOT Python. As a result, enabling this new field would cause a
failure and cease the parsing of any other rules in a given API
response.

Example: Received 10 rules from the API, third of which has the
SamplingRateBoost field. The SDK will successfully parse the first two,
fail on the third, then stop there. As such, the SDK will only have 2/10
of the sampling rules and will not be able to effectively make sampling
decisions based on the sampling rules set by the user. Any unmatched
spans will use the _FallbackSampler.

### Changes
- Add usage of `kwargs` in X-Ray sampling API related objects, e.g.
SamplingRule, SamplingTarget, etc.
- Add unit tests proving additional fields do not cause errors.
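
As a rough illustration of why accepting `kwargs` helps, the sketch below parses ten rule dictionaries, the third of which carries an extra `SamplingRateBoost` field. `SamplingRuleSketch` and `parse_rules` are hypothetical, simplified stand-ins and not the SDK's classes; the boost payload is a placeholder.

```python
from typing import Any, Dict, List


class SamplingRuleSketch:
    """Simplified stand-in for a sampling rule object (hypothetical)."""

    def __init__(self, RuleName: str = "", Priority: int = 10000, FixedRate: float = 0.0, **kwargs: Any) -> None:
        self.RuleName = RuleName
        self.Priority = Priority
        self.FixedRate = FixedRate
        # Fields this class does not know about arrive in kwargs and are ignored,
        # instead of raising TypeError and aborting the rest of the parse.


def parse_rules(raw_rules: List[Dict[str, Any]]) -> List[SamplingRuleSketch]:
    """Build one rule object per dictionary in the response."""
    return [SamplingRuleSketch(**raw) for raw in raw_rules]


raw = [{"RuleName": f"rule-{i}", "Priority": i, "FixedRate": 0.05} for i in range(10)]
raw[2]["SamplingRateBoost"] = {"placeholder": True}  # placeholder payload for the new field
rules = parse_rules(raw)
```

Without `**kwargs` in the constructor, the third dictionary would raise a `TypeError` for the unexpected keyword argument.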

### Testing
- Unit tests
- Tested in depth as part of
#576,
which this change was a part of but is now separated out to get it in
more quickly


By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.
ezhang6811 pushed a commit to ezhang6811/aws-otel-python-instrumentation that referenced this pull request Jan 15, 2026
@majanjua-amzn majanjua-amzn marked this pull request as draft January 19, 2026 22:45
@majanjua-amzn majanjua-amzn force-pushed the adaptive-sampling branch 4 times, most recently from 9d2eba7 to 930a1ec Compare January 20, 2026 00:49
@majanjua-amzn majanjua-amzn marked this pull request as ready for review January 20, 2026 18:13
@majanjua-amzn majanjua-amzn removed the bug label Jan 21, 2026
@majanjua-amzn majanjua-amzn force-pushed the adaptive-sampling branch 2 times, most recently from 6bb1e54 to cc9c0b8 Compare January 24, 2026 01:28
Contributor

@wangzlei wangzlei left a comment

LGTM

@majanjua-amzn majanjua-amzn enabled auto-merge (squash) January 27, 2026 23:15
@majanjua-amzn majanjua-amzn merged commit 2b4d0ac into main Jan 27, 2026
25 of 27 checks passed
@majanjua-amzn majanjua-amzn deleted the adaptive-sampling branch January 27, 2026 23:25

Labels

enhancement, python

2 participants