[ENG-168] Sample telemetry events #37
Conversation
Summary of Changes
Hello @safaricd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request implements a comprehensive event sampling strategy for telemetry data. The core objective is to reduce the volume of telemetry events sent to the backend while ensuring critical `PingEvent`s are always captured and important outlier events (e.g., long-duration operations) are not missed. The sampling parameters are configurable remotely, providing flexibility to adjust data collection without code changes. This change helps optimize resource usage and data processing for telemetry.
Code Review
This pull request introduces event sampling for telemetry to reduce the volume of data sent. The core logic is in a new _pass_through method which filters events based on type, user install date, event duration, and a random sampling rate configured remotely. The changes also include some refactoring, like moving the uuid4 function to a shared utility module and making some helper functions public.
My review has identified a critical logic error in the new sampling implementation that causes it to drop events that should be sent and vice-versa. I've also found a potential KeyError due to unsafe dictionary access and a bug in a test case that prevents it from running. I've provided suggestions to fix these issues.
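For orientation, here is a minimal, hypothetical sketch of the filtering order that review describes. The signature, the `FitEvent` name, and the config keys are assumptions pieced together from the diff hunks below, not the PR's exact API:

```python
import random
from datetime import datetime, timedelta, timezone

def pass_through(event_type, install_date, duration_in_ms, config):
    """Return True if the event should be sent to the backend.

    Hypothetical sketch: parameter names and the "FitEvent" example
    are assumptions for illustration, not the PR's exact signature.
    """
    utc_now = datetime.now(timezone.utc)
    # Critical ping events are always sent.
    if event_type == "PingEvent":
        return True
    # Allow all events for recently installed users (threshold from remote config).
    if utc_now - install_date <= timedelta(days=config.get("max_install_days", 5)):
        return True
    # Always capture long-duration outliers.
    if duration_in_ms > 10_000:
        return True
    # Otherwise, sample uniformly at the remotely configured rate.
    return random.random() < config.get("sampling_rate", 0.3)

old_install = datetime.now(timezone.utc) - timedelta(days=30)
assert pass_through("PingEvent", old_install, 0, {}) is True
assert pass_through("FitEvent", old_install, 20_000, {}) is True
```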
brendan-priorlabs
left a comment
Thanks for the PR, @safaricd. Left a couple comments for you.
```python
set_property("install_date", install_date)
return True
```

```python
# Allow all events if user started using tabpfn <= 5 days
```
nit: Comes from config, not always 5 days
```python
# Fallback to sampling by key
sdk_version = get_sdk_version()
day = utc_now.strftime("%Y-%m-%d")
key = f"ver:{sdk_version}+install:{install_id}+day:{day}"

h = hashlib.sha256(key.encode()).digest()
n = int.from_bytes(h[:8], "big")
interval = n / 2**64

sampling_rate = config.get("sampling_rate", 0.3)
return interval < sampling_rate
```
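For readers outside the PR, the hunk above can be exercised as a standalone sketch; here `get_sdk_version()` is replaced by a plain argument, and the version/install-id/day values are made up for illustration:

```python
import hashlib

def sample_by_key(sdk_version, install_id, day, sampling_rate=0.3):
    # Mirrors the key layout in the diff above; arguments stand in for
    # get_sdk_version() and the stored install id.
    key = f"ver:{sdk_version}+install:{install_id}+day:{day}"
    h = hashlib.sha256(key.encode()).digest()
    n = int.from_bytes(h[:8], "big")  # first 8 bytes as a 64-bit integer
    interval = n / 2**64              # maps the hash to a fraction in [0, 1)
    return interval < sampling_rate

# Deterministic: the same install id gets the same decision all day long,
# and the decision only reshuffles when the day or SDK version changes.
assert sample_by_key("2.0.1", "abc-123", "2024-05-01") == \
       sample_by_key("2.0.1", "abc-123", "2024-05-01")
```

Deriving the decision from a hash keeps it stable for a user's whole day without storing extra state, which is also exactly why it samples whole users rather than individual events.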
Suggested change:

```diff
-# Fallback to sampling by key
-sdk_version = get_sdk_version()
-day = utc_now.strftime("%Y-%m-%d")
-key = f"ver:{sdk_version}+install:{install_id}+day:{day}"
-h = hashlib.sha256(key.encode()).digest()
-n = int.from_bytes(h[:8], "big")
-interval = n / 2**64
-sampling_rate = config.get("sampling_rate", 0.3)
-return interval < sampling_rate
+sampling_rate = config.get("sampling_rate", 0.3)
+return random.random() < sampling_rate
```
This is simpler and ensures we sample uniformly across events. The other way picks 0.3 of the users each day and samples all their events.
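A quick seeded simulation illustrates the point (the install ids and event counts below are made up): per-user-per-day hashing keeps or drops all of a user's events for the day, while per-event sampling keeps roughly 30% from every user.

```python
import hashlib
import random

SAMPLING_RATE = 0.3
random.seed(42)

def user_in_sample(install_id, day="2024-05-01"):
    # Per-user-per-day decision, as in the original hunk.
    h = hashlib.sha256(f"install:{install_id}+day:{day}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < SAMPLING_RATE

users = [str(i) for i in range(1000)]
events_per_user = 200

# Per-user sampling: a user's events are either all kept or all dropped.
per_user = {u: events_per_user if user_in_sample(u) else 0 for u in users}
assert set(per_user.values()) == {0, events_per_user}

# Per-event sampling: every user keeps roughly 30% of their events.
per_event = {
    u: sum(random.random() < SAMPLING_RATE for _ in range(events_per_user))
    for u in users
}
assert all(0 < kept < events_per_user for kept in per_event.values())
```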
Sampling uniformly over events seems to give a more representative distribution to me too.
The problem is that if the distribution we send to PostHog is non-uniform, then we can't really recover a uniform distribution from it.
@safaricd followed up directly with a clarification. He in fact means dropping everything after `# Fallback to sampling by key` and having PostHog do that sampling. Sounds great! Thanks for correcting my understanding!
```python
utc_now = datetime.now(timezone.utc)
delta = timedelta(days=config.get("max_install_days", 5))
if utc_now - install_date <= delta:
    return True
```
It wasn't entirely clear to me where we landed on the uniform vs. non-uniform sampling. @noahho, any thoughts here? With this change we'll bias the distribution towards new users and results slower than 10s. Just wanted to confirm that's expected.
Hmm, I understood we ended up only subsampling events. With a 30% subsampling rate we would still record most events, so upsampling recent events is less important. Napkin math showed a subsampling rate of 10-30% would keep our cost appropriate.
Our last conversation was based around:
- Capturing all events for new users up to a specific point - e.g. up to 5 days since install (first usage).
- Capturing outliers, e.g. `duration_in_ms > 10_000`
Here's the discussion: https://linear.app/priorlabs/issue/ENG-168/reduce-telemetry-events
Given the total volume of events we gather, I think we can easily get along with uniform sampling across all users.
With that being said, if @noahho can confirm this direction, happy to make the change.
I recall our last conversation was centered around napkin math (in the issue you sent) and that based on that subsampling alone is enough. Let's take that direction
noahho
left a comment
Very nice to be able to change these parameters based on our server config! Please just take a look at the comments.
Closing this PR as, from now on, we set up random sampling via PostHog.

Change Description
- Always send `PingEvent`s.
- Sample other events using sample rates from the `telemetry.json` configuration file.

This implementation was tested manually using 1,000 random users (install IDs).
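For reference, a hypothetical `telemetry.json` consistent with the `config.get(...)` calls in the diff might look like this; the values shown are just the code's defaults, and the real remote config may differ:

```json
{
  "sampling_rate": 0.3,
  "max_install_days": 5
}
```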