Skip to content

[ENG-168] Sample telemetry events#37

Closed
safaricd wants to merge 2 commits into
mainfrom
sample-telemetry-events
Closed

[ENG-168] Sample telemetry events#37
safaricd wants to merge 2 commits into
mainfrom
sample-telemetry-events

Conversation

@safaricd
Copy link
Copy Markdown
Collaborator

Change Description

  • Sample all events except of PingEvent using sample rates
  • Sample rate specified remotely using the telemetry.json configuration file.

This implementation was tested manually using 1,000 random users (install IDs).

allow, deny
(303, 697)

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @safaricd, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a comprehensive event sampling strategy for telemetry data. The core objective is to reduce the volume of telemetry events sent to the backend while ensuring critical PingEvent's are always captured and important outlier events (e.g., long-duration operations) are not missed. The sampling parameters are configurable remotely, providing flexibility to adjust data collection without code changes. This change helps optimize resource usage and data processing for telemetry.

Highlights

  • Event Sampling Implementation: Introduced a mechanism to sample telemetry events, excluding PingEvent's, to manage data volume.
  • Remote Configuration: Sampling rates and related parameters are now dynamically controlled via a telemetry.json configuration file downloaded from a remote server.
  • Sampling Logic: Events are sampled based on several criteria, including install age (allowing all events for new users), event duration (allowing outliers), and a hash-based sampling rate for general events.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces event sampling for telemetry to reduce the volume of data sent. The core logic is in a new _pass_through method which filters events based on type, user install date, event duration, and a random sampling rate configured remotely. The changes also include some refactoring, like moving the uuid4 function to a shared utility module and making some helper functions public.

My review has identified a critical logic error in the new sampling implementation that causes it to drop events that should be sent and vice-versa. I've also found a potential KeyError due to unsafe dictionary access and a bug in a test case that prevents it from running. I've provided suggestions to fix these issues.

Comment thread src/tabpfn_common_utils/telemetry/core/service.py Outdated
Comment thread src/tabpfn_common_utils/telemetry/core/service.py Outdated
Comment thread tests/telemetry/core/test_events.py
Copy link
Copy Markdown
Contributor

@brendan-priorlabs brendan-priorlabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @safaricd. Left a couple comments for you.

set_property("install_date", install_date)
return True

# Allow all events if user started using tabpfn <= 5 days
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Comes from config, not always 5 days

Comment on lines +155 to +165
# Fallback to sampling by key
sdk_version = get_sdk_version()
day = utc_now.strftime("%Y-%m-%d")
key = f"ver:{sdk_version}+install:{install_id}+day:{day}"

h = hashlib.sha256(key.encode()).digest()
n = int.from_bytes(h[:8], "big")
interval = n / 2**64

sampling_rate = config.get("sampling_rate", 0.3)
return interval < sampling_rate
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Fallback to sampling by key
sdk_version = get_sdk_version()
day = utc_now.strftime("%Y-%m-%d")
key = f"ver:{sdk_version}+install:{install_id}+day:{day}"
h = hashlib.sha256(key.encode()).digest()
n = int.from_bytes(h[:8], "big")
interval = n / 2**64
sampling_rate = config.get("sampling_rate", 0.3)
return interval < sampling_rate
sampling_rate = config.get("sampling_rate", 0.3)
return random.random() < sampling_rate

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is simpler and ensures we sample uniformly across events. The other way picks 0.3 of the users each day and samples all their events.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sampling uniform over events seems to give a more representative distribution to me too

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case we want to sample events uniformly across all versions, we can even easily set this up via PostHog alone. If that's the case, my recommendation would be to control this only from their web interface where we can easily tune these settings on the fly.

image

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that if the distribution we send to PostHog is non-uniform, then we can't really recover a uniform distribution from it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@safaricd, followed up directly with a clarification. He in fact means dropping everything after # Fallback to sampling by key and having PostHog do that sampling. Sounds great! Thanks for correcting my understanding!

utc_now = datetime.now(timezone.utc)
delta = timedelta(days=config.get("max_install_days", 5))
if utc_now - install_date <= delta:
return True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't entirely clear to me where we landed on the uniform vs. non-uniform sampling. @noahho, any thoughts here? With this change we'll bias the distribution towards new users and results slower than 10s. Just wanted to confirm that's expected.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm i understood we ended up only subsampling events. With a 30% subsampling rate we would still record most events and so upsampling recent events is less important. Napkin math showed a subsampling rate of 10-30% would keep our cost appropriate

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our last conversation was based around:

  • Capturing all events for new users up to a specific point - e.g. up to 5 days since install (first usage).
  • Capturing outliers, e.g. duration_in_ms > 10_000

Here's the discussion: https://linear.app/priorlabs/issue/ENG-168/reduce-telemetry-events

Given the total volume of events we gather, I think we can easily get along with uniform sampling across all users.

With that being said, if @noahho can confirm this direction, happy to make the change.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall our last conversation was centered around napkin math (in the issue you sent) and that based on that subsampling alone is enough. Let's take that direction

Copy link
Copy Markdown
Collaborator

@noahho noahho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice to be able to change these parameters based on our server config! Please just take a look at the comments.

@safaricd
Copy link
Copy Markdown
Collaborator Author

safaricd commented Oct 1, 2025

Closing this PR as for now on, we set up random sampling via PostHog.

@safaricd safaricd closed this Oct 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants