There are at least three issues with the mechanism for selecting traces for collecting "snapshots":
- For distributed traces, the selection made by the head service is inconsistent with the selection made by the other services. In a test where service A calls service B, with a selection probability of 10%, I got the following results: out of 4781 traces, service A was selected 480 times and service B was selected 448 times, but both services were selected in only 39 traces, far below either individual count.
- Traces whose originating span is of a kind other than SERVER or CONSUMER (for example, spans resulting from POJO instrumentation) are never selected, although their downstream calls may still be selected.
- The selection algorithm for downstream services is the same as the one used by the TraceIdRatioBased sampler, which can lead to metrics skew if that sampler is actually used for sampling.
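
To put numbers on the first and third points, here is a rough Python sketch. It is not the agent's actual code. `ratio_decision` is a hypothetical, simplified stand-in for a TraceIdRatioBased-style decision (it compares 64 random bits of the trace ID against a threshold), and the 50% sampling ratio in the second helper is an assumed value chosen purely for illustration.

```python
import random


def expected_overlap_if_independent(total: int, selected_a: int, selected_b: int) -> float:
    """Expected number of traces selected by both services if the two
    services made their decisions independently of each other."""
    return selected_a * selected_b / total


def ratio_decision(trace_id_low: int, ratio: float) -> bool:
    """Hypothetical, simplified ratio-based decision: keep the trace when
    64 bits of its ID fall below ratio * 2**64."""
    return trace_id_low < int(ratio * (1 << 64))


def selected_share_among_sampled(n: int, sampler_ratio: float,
                                 selection_ratio: float, seed: int = 42) -> float:
    """Fraction of *sampled* traces that are also selected when the sampler
    and the selection both apply ratio_decision to the same trace ID bits."""
    rng = random.Random(seed)
    trace_ids = [rng.getrandbits(64) for _ in range(n)]
    sampled = [t for t in trace_ids if ratio_decision(t, sampler_ratio)]
    selected = [t for t in sampled if ratio_decision(t, selection_ratio)]
    return len(selected) / len(sampled)


if __name__ == "__main__":
    # Numbers from the test above: 4781 traces, A selected 480, B selected 448.
    print(expected_overlap_if_independent(4781, 480, 448))
    # Assumed setup: sampler at 50%, selection at 10%, same decision function.
    print(selected_share_among_sampled(100_000, 0.5, 0.1))
```

With the figures from the test, the independence estimate comes out at roughly 45 traces, which is the same order as the observed 39. That is what one would expect if A and B were deciding independently rather than consistently per trace. The second helper shows one way a skew could arise: if sampling and selection share the same trace-ID-based decision, then under the assumed 50% sampler about 20% of the sampled traces end up selected instead of the intended 10%.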
I believe the selection mechanism needs to be redesigned.