[ENG-168] Sample telemetry events #37
Conversation
Summary of Changes
Hello @safaricd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request implements a comprehensive event sampling strategy for telemetry data. The core objective is to reduce the volume of telemetry events sent to the backend while ensuring critical `PingEvent`s are always captured and important outlier events (e.g., long-duration operations) are not missed. The sampling parameters are configurable remotely, providing flexibility to adjust data collection without code changes. This change helps optimize resource usage and data processing for telemetry.
Code Review
This pull request introduces event sampling for telemetry to reduce the volume of data sent. The core logic is in a new _pass_through method which filters events based on type, user install date, event duration, and a random sampling rate configured remotely. The changes also include some refactoring, like moving the uuid4 function to a shared utility module and making some helper functions public.
My review has identified a critical logic error in the new sampling implementation that causes it to drop events that should be sent and vice-versa. I've also found a potential KeyError due to unsafe dictionary access and a bug in a test case that prevents it from running. I've provided suggestions to fix these issues.
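For orientation, here is a minimal, hypothetical sketch of the filtering order that review describes. The signature, the `FitEvent` name, and the config keys are assumptions pieced together from the diff hunks below, not the PR's exact API:

```python
import random
from datetime import datetime, timedelta, timezone

def pass_through(event_type, install_date, duration_in_ms, config):
    """Return True if the event should be sent to the backend.

    Hypothetical sketch: parameter names and the "FitEvent" example
    are assumptions for illustration, not the PR's exact signature.
    """
    utc_now = datetime.now(timezone.utc)
    # Critical ping events are always sent.
    if event_type == "PingEvent":
        return True
    # Allow all events for recently installed users (threshold from remote config).
    if utc_now - install_date <= timedelta(days=config.get("max_install_days", 5)):
        return True
    # Always capture long-duration outliers.
    if duration_in_ms > 10_000:
        return True
    # Otherwise, sample uniformly at the remotely configured rate.
    return random.random() < config.get("sampling_rate", 0.3)

old_install = datetime.now(timezone.utc) - timedelta(days=30)
assert pass_through("PingEvent", old_install, 0, {}) is True
assert pass_through("FitEvent", old_install, 20_000, {}) is True
```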
brendan-priorlabs
left a comment
Thanks for the PR, @safaricd. Left a couple comments for you.
```python
set_property("install_date", install_date)
return True
```

```python
# Allow all events if user started using tabpfn <= 5 days
```
nit: Comes from config, not always 5 days
```python
# Fallback to sampling by key
sdk_version = get_sdk_version()
day = utc_now.strftime("%Y-%m-%d")
key = f"ver:{sdk_version}+install:{install_id}+day:{day}"

h = hashlib.sha256(key.encode()).digest()
n = int.from_bytes(h[:8], "big")
interval = n / 2**64

sampling_rate = config.get("sampling_rate", 0.3)
return interval < sampling_rate
```
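For readers outside the PR, the hunk above can be exercised as a standalone sketch; here `get_sdk_version()` is replaced by a plain argument, and the version/install-id/day values are made up for illustration:

```python
import hashlib

def sample_by_key(sdk_version, install_id, day, sampling_rate=0.3):
    # Mirrors the key layout in the diff above; arguments stand in for
    # get_sdk_version() and the stored install id.
    key = f"ver:{sdk_version}+install:{install_id}+day:{day}"
    h = hashlib.sha256(key.encode()).digest()
    n = int.from_bytes(h[:8], "big")  # first 8 bytes as a 64-bit integer
    interval = n / 2**64              # maps the hash to a fraction in [0, 1)
    return interval < sampling_rate

# Deterministic: the same install id gets the same decision all day long,
# and the decision only reshuffles when the day or SDK version changes.
assert sample_by_key("2.0.1", "abc-123", "2024-05-01") == \
       sample_by_key("2.0.1", "abc-123", "2024-05-01")
```

Deriving the decision from a hash keeps it stable for a user's whole day without storing extra state, which is also exactly why it samples whole users rather than individual events.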
Suggested change:

```diff
-# Fallback to sampling by key
-sdk_version = get_sdk_version()
-day = utc_now.strftime("%Y-%m-%d")
-key = f"ver:{sdk_version}+install:{install_id}+day:{day}"
-h = hashlib.sha256(key.encode()).digest()
-n = int.from_bytes(h[:8], "big")
-interval = n / 2**64
-sampling_rate = config.get("sampling_rate", 0.3)
-return interval < sampling_rate
+sampling_rate = config.get("sampling_rate", 0.3)
+return random.random() < sampling_rate
```
This is simpler and ensures we sample uniformly across events. The other way picks 0.3 of the users each day and samples all their events.
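A quick seeded simulation illustrates the point (the install ids and event counts below are made up): per-user-per-day hashing keeps or drops all of a user's events for the day, while per-event sampling keeps roughly 30% from every user.

```python
import hashlib
import random

SAMPLING_RATE = 0.3
random.seed(42)

def user_in_sample(install_id, day="2024-05-01"):
    # Per-user-per-day decision, as in the original hunk.
    h = hashlib.sha256(f"install:{install_id}+day:{day}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < SAMPLING_RATE

users = [str(i) for i in range(1000)]
events_per_user = 200

# Per-user sampling: a user's events are either all kept or all dropped.
per_user = {u: events_per_user if user_in_sample(u) else 0 for u in users}
assert set(per_user.values()) == {0, events_per_user}

# Per-event sampling: every user keeps roughly 30% of their events.
per_event = {
    u: sum(random.random() < SAMPLING_RATE for _ in range(events_per_user))
    for u in users
}
assert all(0 < kept < events_per_user for kept in per_event.values())
```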
Sampling uniformly over events seems to give a more representative distribution to me too.
The problem is that if the distribution we send to PostHog is non-uniform, then we can't really recover a uniform distribution from it.
@safaricd followed up directly with a clarification. He in fact means dropping everything after `# Fallback to sampling by key` and having PostHog do that sampling. Sounds great! Thanks for correcting my understanding!
```python
utc_now = datetime.now(timezone.utc)
delta = timedelta(days=config.get("max_install_days", 5))
if utc_now - install_date <= delta:
    return True
```
It wasn't entirely clear to me where we landed on the uniform vs. non-uniform sampling. @noahho, any thoughts here? With this change we'll bias the distribution towards new users and results slower than 10s. Just wanted to confirm that's expected.
Hmm, I understood we ended up only subsampling events. With a 30% subsampling rate we would still record most events, so upsampling recent events is less important. Napkin math showed a subsampling rate of 10-30% would keep our cost appropriate.
Our last conversation was based around:
- Capturing all events for new users up to a specific point - e.g. up to 5 days since install (first usage).
- Capturing outliers, e.g. `duration_in_ms > 10_000`
Here's the discussion: https://linear.app/priorlabs/issue/ENG-168/reduce-telemetry-events
Given the total volume of events we gather, I think we can easily get along with uniform sampling across all users.
With that being said, if @noahho can confirm this direction, happy to make the change.
I recall our last conversation was centered around napkin math (in the issue you sent) and that based on that subsampling alone is enough. Let's take that direction
noahho
left a comment
Very nice to be able to change these parameters based on our server config! Please just take a look at the comments.
Closing this PR as, from now on, we set up random sampling via PostHog.

Change Description
- Always send `PingEvent`s.
- Sample other events using sample rates from the `telemetry.json` configuration file.

This implementation was tested manually using 1,000 random users (install IDs).
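For reference, a hypothetical `telemetry.json` consistent with the `config.get(...)` calls in the diff might look like this; the values shown are just the code's defaults, and the real remote config may differ:

```json
{
  "sampling_rate": 0.3,
  "max_install_days": 5
}
```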