[telemetry] Detect Python package manager(s) at project setup#1918

Open
rugpanov wants to merge 1 commit into main from telemetry-package-manager-detection

@rugpanov

Changes

Measurement-only telemetry to learn which Python package manager(s) our users' projects actually use (pip / conda / uv / poetry), so the VPEX setup-flow investment can be prioritized from first-party data instead of public-survey estimates. No setup behavior changes — this is detection only.

The work splits cleanly into three layers so each is independently testable and the dependency direction stays correct (high-level → low-level):

  • Pure classifier (packageManagerDetection.ts): given a set of already-collected signals, reports every applicable manager, a best-guess primary (priority uv > poetry > conda > pip), the firing signals, hasLockfile, and interpreter source. Side-effect free and total.
  • Emit (telemetry/packageManagerExtensions.ts): adds recordPackageManagerDetection to the existing Telemetry class via the same declare module pattern as commandExtensions.ts. Keeps disk/Python-extension dependencies out of the telemetry client.
  • Collection (PackageManagerTelemetry.ts): a best-effort, non-blocking collector that reads disk and already-resolved interpreter metadata, runs the pure classifier, and calls the emit method. Deduplicated per session on (trigger, projectRoot); any failure degrades to unknown and is swallowed so it never disrupts setup.
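
A minimal sketch of what the pure classifier layer might look like, assuming a much smaller signal set than the PR actually collects. The `Signals` field names and the `classify` function are hypothetical; only the priority order (uv > poetry > conda > pip) and the `unknown` fallback come from the description above.

```typescript
// Illustrative sketch only; the signal names below are hypothetical.
type PackageManager = "uv" | "poetry" | "conda" | "pip";
type PrimaryManager = PackageManager | "unknown";

interface Signals {
    hasUvLock?: boolean;
    hasPoetryLock?: boolean;
    interpreterUnderCondaPrefix?: boolean;
    hasRequirementsTxt?: boolean;
}

interface Detection {
    managers: PackageManager[];
    primary: PrimaryManager;
    hasLockfile: boolean;
}

// Priority order stated in the PR description: uv > poetry > conda > pip.
const PRIORITY: readonly PackageManager[] = ["uv", "poetry", "conda", "pip"];

function classify(signals: Signals): Detection {
    const fired = new Set<PackageManager>();
    if (signals.hasUvLock) fired.add("uv");
    if (signals.hasPoetryLock) fired.add("poetry");
    if (signals.interpreterUnderCondaPrefix) fired.add("conda");
    if (signals.hasRequirementsTxt) fired.add("pip");

    return {
        // Every applicable manager, ordered by priority.
        managers: PRIORITY.filter((m) => fired.has(m)),
        // Highest-priority manager that fired, or "unknown" if none did.
        primary: PRIORITY.find((m) => fired.has(m)) ?? "unknown",
        hasLockfile: Boolean(signals.hasUvLock || signals.hasPoetryLock),
    };
}
```

Because the function takes already-collected signals and touches neither disk nor the Python extension, it can be unit-tested with plain object literals.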

Emission is wired into three setup touchpoints: project-open environment check (auto_open), the set-up-environment command (explicit_command), and first Run/Debug with Databricks Connect (run/debug).

A new Events.PYTHON_ENV_SETUP_DETECTED event carries a typed, documented schema (reuses the existing telemetry transport; opt-out honored; categorical data only — no paths, package names, or cluster names). A handoff note for the analytics/dashboard owner is included at src/telemetry/PACKAGE_MANAGER_DETECTION.md.

Detection correctness (the parts most worth reviewing):

  • interpreterSource is derived from the active interpreter alone, never from project files. A uv.lock project running a conda/venv/system interpreter reports that interpreter's real source, keeping the "uv project, interpreter not uv-managed yet" setup-flow gap visible. A genuinely uv-provisioned venv is identified by the uv = marker in pyvenv.cfg, not by uv.lock.
  • conda is attributed only when the active interpreter resides under CONDA_PREFIX (path-boundary checked), not on the bare env var — which is session-global in the extension host (launching VS Code from an activated conda shell) and would otherwise over-count conda for uv/poetry/pip projects.
  • pyproject [tool.uv]/[tool.poetry] detection uses a bounded table-header scan, not substring matching: ignores comments and in-value mentions, rejects prefix collisions (e.g. tool.uvicorn), and matches subtable and array-of-table headers ([tool.uv.sources], [[tool.poetry.source]]).
  • No external executable is run for telemetry: the uv-on-PATH probe was removed (it spawned a PATH-resolved uv for a weak, non-attributing signal). Detection reads only disk and already-resolved interpreter metadata.
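
To illustrate the table-header scan, here is a simplified sketch under stated assumptions: the signature and the `maxLines` bound are guesses rather than the PR's actual `pyprojectHasToolSection`, and quoted keys or header-like lines inside multi-line strings are not handled.

```typescript
// Sketch: does pyproject.toml declare [tool.<name>], a subtable of it,
// or an array-of-tables under it? Signature and bound are assumptions.
function pyprojectHasToolSection(
    content: string,
    tool: string,
    maxLines = 2000
): boolean {
    // Escape regex metacharacters in the tool name.
    const escaped = tool.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    // A header line: "[" or "[[", then tool.<name>, then either the
    // closing bracket(s) or a ".subtable" suffix. Anything else after
    // the name (e.g. "icorn" in tool.uvicorn) fails the match.
    const header = new RegExp(
        `^\\s*\\[\\[?\\s*tool\\s*\\.\\s*${escaped}\\s*(\\.[^\\]]*)?\\]\\]?\\s*(#.*)?$`
    );
    // Only the first maxLines lines are inspected, which bounds the scan.
    return content.split(/\r?\n/, maxLines).some((line) => header.test(line));
}
```

Comment lines and in-value mentions never match because the pattern is anchored to the start of the line and requires the opening bracket to be the first non-blank character.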

Scope / privacy: measurement only — no changes to setup behavior (the VPEX flows are a separate effort). Only enum/categorical data and a closed set of signal identifiers are emitted; the existing telemetry opt-out (telemetry.telemetryLevel) is respected by the transport.

Tests

  • yarn run test:unit: 202 passing, 0 failing — includes the pure classifier (each manager, interpreter sources, overlaps like uv+pip / conda+pip / poetry+uv, weak signals, none) and pure helpers (pyprojectHasToolSection, pyvenvCfgMarksUv, interpreterUnderCondaPrefix), covering the conda-prefix boundary and shell-global false-positive cases.
  • yarn run build (typecheck) passes.
  • eslint clean; prettier formatted.

Reviewer can validate with:

cd packages/databricks-vscode
yarn run build
yarn run test:unit
npx eslint src --ext ts && npx prettier . -c

Why:
We need first-party data on which Python package manager(s) our users'
projects actually use (pip/conda/uv/poetry) to prioritize VPEX setup-flow
investment, replacing public-survey estimates. Measurement only -- no setup
behavior changes.

What:
- Add packageManagerDetection.ts: a pure, signal-based classifier that
  reports all applicable managers plus a best-guess primary (uv > poetry >
  conda > pip), the firing signals, hasLockfile, and interpreter source.
  Treats bare uv/poetry on PATH as weak signals.
- Add Events.PYTHON_ENV_SETUP_DETECTED with a typed, documented schema in
  telemetry/constants.ts (reuses existing Telemetry client; opt-out honored;
  categorical data only, no paths/package/cluster names).
- Add telemetry/packageManagerExtensions.ts: the emit half, layered onto the
  Telemetry class via the commandExtensions declare-module pattern
  (recordPackageManagerDetection). Keeps disk/Python-extension deps out of the
  Telemetry client.
- Add PackageManagerTelemetry.ts: the collection half -- a best-effort,
  non-blocking collector (disk + already-resolved interpreter metadata) that
  gathers signals, runs the pure classifier, and calls the emit method.
  Deduplicated per session on (trigger, projectRoot); failures degrade to
  unknown and are swallowed.
- Wire emission into three touchpoints: project-open env check (auto_open),
  the set-up-environment command (explicit_command), and first Run/Debug
  with Databricks Connect (run/debug).
- Add unit tests for the detector and pure helpers, and a dashboard-owner
  handoff note.
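
The per-session deduplication could be sketched roughly as follows. `SessionDeduper` is a hypothetical helper, not the PR's class; the trigger names are the ones listed above.

```typescript
// Trigger names as listed in the PR description.
type SetupTrigger = "auto_open" | "explicit_command" | "run/debug";

// Hypothetical helper: remembers which (trigger, projectRoot) pairs have
// already emitted during this session.
class SessionDeduper {
    private readonly seen = new Set<string>();

    /** Returns true the first time a pair is seen, false afterwards. */
    shouldEmit(trigger: SetupTrigger, projectRoot: string): boolean {
        // JSON-encode the pair so distinct pairs cannot collide.
        const key = JSON.stringify([trigger, projectRoot]);
        if (this.seen.has(key)) {
            return false;
        }
        this.seen.add(key);
        return true;
    }
}
```

The project root is used only as an in-memory key here; it is never part of the emitted event.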

Detection correctness:
- interpreterSource is derived from the active interpreter alone, never from
  project files: a uv.lock project on a conda/venv/system interpreter reports
  that interpreter's real source, keeping the setup-flow gap visible. A
  genuinely uv-provisioned venv is identified by the `uv =` marker in
  pyvenv.cfg (pure pyvenvCfgMarksUv), not by uv.lock.
- conda is attributed only when the active interpreter resides under
  CONDA_PREFIX (pure interpreterUnderCondaPrefix, with a path-boundary check),
  not on the bare env var, which is session-global in the extension host
  (launching from an activated conda shell) and would otherwise over-count
  conda for uv/poetry/pip projects.
- pyproject [tool.uv]/[tool.poetry] detection uses a pure, bounded table-header
  scan (pyprojectHasToolSection) instead of substring matching: ignores
  comments and in-value mentions, rejects prefix collisions (e.g. tool.uvicorn),
  and matches subtable and array-of-table headers (e.g. [tool.uv.sources],
  [[tool.poetry.source]]) that the substring check missed.
- No external executable is run for telemetry: the uv-on-PATH probe was
  removed (it spawned a PATH-resolved `uv` for a weak, non-attributing signal);
  detection now only reads disk and already-resolved interpreter metadata.
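
As a rough illustration of the `uv =` marker check, assuming `pyvenvCfgMarksUv` takes the file contents as a string (the signature is a guess):

```typescript
// Sketch: true when pyvenv.cfg contains a "uv = ..." key line.
// A path that merely contains "uv" (e.g. in "home = ...") does not count.
function pyvenvCfgMarksUv(content: string): boolean {
    return content
        .split(/\r?\n/)
        .some((line) => /^\s*uv\s*=/.test(line));
}
```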

Verification:
- yarn run build (typecheck) passes.
- eslint clean; prettier formatted.
- yarn run test:unit: 228 passing, 0 failing (includes detector + helper tests).

Co-authored-by: Isaac
rugpanov force-pushed the telemetry-package-manager-detection branch from c236f29 to be9f174 on June 19, 2026 09:26
rugpanov temporarily deployed to test-trigger-is on June 19, 2026 09:26 with GitHub Actions (Inactive)
@github-actions

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/vscode

Inputs:

  • PR number: 1918
  • Commit SHA: be9f1746227ce20e8261ed183a86775c0b99da9e

Checks will be approved automatically on success.

rugpanov requested review from anton-107 and misha-db on June 22, 2026 13:19
Comment on lines +47 to +52
/** A Python package/environment manager detected for a project. */
export type PackageManagerName = "uv" | "poetry" | "pip" | "conda";
/** Best-guess primary manager, or "unknown" when no signal fires. */
export type PrimaryManagerName = PackageManagerName | "unknown";
/** How the active interpreter was provisioned. */
export type InterpreterSource = "uv" | "conda" | "system" | "venv" | "unknown";

These look duplicated from packageManagerDetection.ts; can we reuse the same types?

* free-form content (see {@link detectPackageManagers}). Telemetry opt-out is
* honoured by the underlying {@link Telemetry} client.
*/
export class PackageManagerTelemetry {

No tests for PackageManagerTelemetry?

Comment on lines +284 to +285
prefix.startsWith(base + "/") ||
prefix.startsWith(base + "\\")

This assumes the filesystem is case-sensitive. On Windows, C:\Some and C:\some denote the same folder, but this check returns false for them.

try {
    return fs
        .readdirSync(projectRoot)
        .some((name) => /^requirements.*\.txt$/.test(name));

This regex matches "requirementswhatever.txt" as well; was that intentional?
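
If the loose match was not intended, one possible tightening is to require a separator before any suffix. The pattern and the `isRequirementsFile` name below are hypothetical, and the set of separators is an assumption.

```typescript
// Hypothetical stricter pattern: "requirements", optionally followed by a
// separator (-, _ or .) and a suffix, then ".txt".
const REQUIREMENTS_FILE = /^requirements(?:[-_.][\w.-]+)?\.txt$/;

function isRequirementsFile(name: string): boolean {
    return REQUIREMENTS_FILE.test(name);
}
```

This accepts names such as requirements.txt and requirements-dev.txt but rejects requirementswhatever.txt.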

@anton-107 anton-107 left a comment

A few follow-up comments from a closer pass (correctness/maintainability + one telemetry-gating suggestion). Nice structure overall — the pure/emit/collect split and the privacy posture of the payload are good. None of these are blockers; the casts (1) and the opt-out gating (5) are the two I'd most want addressed.

    context: PackageManagerDetectionContext
) {
    this.recordEvent(Events.PYTHON_ENV_SETUP_DETECTED, {
        managersDetected: detection.managers as PackageManagerName[],

Unsafe casts hide schema divergence. detection.managers as PackageManagerName[] (and interpreterSource as InterpreterSource on L46) bridge two parallel union definitions: PackageManager/InterpreterSource/PrimaryManager in packageManagerDetection.ts vs PackageManagerName/InterpreterSource/PrimaryManagerName in constants.ts. These as casts defeat the one check that matters here — if someone later adds a manager to one union but not the other, it compiles silently and ships a mismatched event schema. Suggest a single source of truth: import the detection types into constants.ts (or re-export from one place) and drop the casts so the compiler enforces alignment.
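
A sketch of the single-source-of-truth idea, assuming both the detector result and the event schema can refer to one set of type declarations. `Detection`, `PythonEnvSetupDetectedEvent`, and `toEventPayload` are hypothetical names used only for illustration.

```typescript
// Declared once; in the real code these would live in (or be re-exported
// from) a single module that both files import.
type PackageManager = "uv" | "poetry" | "pip" | "conda";
type InterpreterSource = "uv" | "conda" | "system" | "venv" | "unknown";

// What the (hypothetical) detector returns.
interface Detection {
    managers: PackageManager[];
    interpreterSource: InterpreterSource;
}

// The event schema uses the same types, so no `as` cast is needed.
interface PythonEnvSetupDetectedEvent {
    managersDetected: PackageManager[];
    interpreterSource: InterpreterSource;
}

function toEventPayload(detection: Detection): PythonEnvSetupDetectedEvent {
    return {
        managersDetected: detection.managers,
        interpreterSource: detection.interpreterSource,
    };
}
```

With a shared declaration there is only one union to extend, so the detector and the event schema cannot drift apart silently.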

override async check() {
    // First environment check on project open: emit package-manager
    // detection telemetry (deduplicated per session, never throws).
    void this.packageManagerTelemetry.emitDetection("auto_open");

targetCompute is nondeterministic for auto_open. This fires before await this.connectionManager.waitForConnect() on the next line. emitDetection awaits resolveEnvironment() and then reads connectionManager.cluster/.serverless via getComputeType(), so whether the connection has resolved by then is a race against the parallel waitForConnect(). Result: targetCompute for auto_open will usually be none but sometimes the real value — dirty data on the trigger that fires most. Either move the emit after waitForConnect() to make it deterministic, or accept+document that auto_open always reports none for compute.

* A `pyproject.toml` exists but declares neither `[tool.uv]` nor
* `[tool.poetry]` (i.e. a plain PEP 621 / pip-installable project).
*/
hasPyprojectPipOnly?: boolean;

hasPyprojectPipOnly can misattribute uv projects as pip. uv works with a bare [project] table and no [tool.uv] section, so a uv project without a committed uv.lock classifies as pip. In the common case (uv.lock present) uv still fires and wins primary, so impact is limited to lockfile-less uv projects — but since the whole point is sizing uv adoption accurately, worth a one-line caveat here (or in the handoff doc) so the analytics owner knows pyproject.pipOnly slightly over-counts pip / under-counts uv.

* Pure over its inputs; returns false if either is missing. Accepts both `/`
* and `\\` separators so it is platform-agnostic and deterministic in tests.
*/
export function interpreterUnderCondaPrefix(

Case-sensitivity vs the "platform-agnostic" claim. The boundary check handles / vs \ separators but compares case-sensitively, so on Windows C:\Conda\envs\ml vs c:\conda\... would not match even though Windows paths are case-insensitive. Low impact (both values usually come from the same source), but the doc comment says "platform-agnostic" which overstates it. Either lowercase both sides when on Windows, or soften the comment.
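
One way the first option could look, as a sketch only: the `caseInsensitive` parameter is an assumption (a caller might pass `process.platform === "win32"`), and the body is not the PR's implementation.

```typescript
// Sketch: is the interpreter located under the conda prefix?
// `caseInsensitive` is a hypothetical extra parameter.
function interpreterUnderCondaPrefix(
    interpreterPath: string | undefined,
    condaPrefix: string | undefined,
    caseInsensitive = false
): boolean {
    if (!interpreterPath || !condaPrefix) {
        return false;
    }
    const normalize = (p: string): string => {
        // Unify separators and drop trailing slashes.
        const unified = p.replace(/\\/g, "/").replace(/\/+$/, "");
        return caseInsensitive ? unified.toLowerCase() : unified;
    };
    const base = normalize(condaPrefix);
    const target = normalize(interpreterPath);
    // Appending "/" keeps the path-boundary check: /opt/conda2 is not
    // under /opt/conda.
    return base.length > 0 && target.startsWith(base + "/");
}
```

Keeping the flag as a parameter, rather than reading the platform inside the function, leaves the helper pure and deterministic in tests.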

* Deduplicated per `(project root, trigger)` within the session. Never
* throws.
*/
async emitDetection(trigger: SetupTrigger): Promise<void> {

Suggestion: short-circuit when the telemetry level isn't all. The event payload is correctly suppressed by @vscode/extension-telemetry when the user's telemetry.telemetryLevel isn't all (regular events are only sent at all), so nothing leaks. But recordEvent only gates on !this.reporter, and the reporter is constructed regardless of the user's setting, so for an opted-out user collectSignals still runs fs.readdirSync(projectRoot) + fs.readFileSync on pyproject.toml/pyvenv.cfg on every auto_open/run/setup; only the send is dropped. That's wasted work, and the optics are poor for a privacy-conscious user auditing the extension (a telemetry code path reading project files despite telemetry being off). Recommend bailing out of emitDetection early when telemetry is disabled (e.g. check vscode.env.isTelemetryEnabled / the reporter level before collecting) so opted-out users get zero telemetry-driven disk access. Separately, the handoff doc says "no paths/names" but doesn't mention that every event, including this one, inherits the ambient user.hashedUserName / user.host / workspaceId envelope; worth stating so reviewers know this links toolchain choices to a stable identity.
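
A simplified, synchronous sketch of the early bail-out, with the dependencies injected so it runs outside the extension host. The `EmitDeps` shape is hypothetical; in the extension, `isTelemetryEnabled` would presumably wrap `vscode.env.isTelemetryEnabled`, and the real `emitDetection` is async.

```typescript
// Hypothetical dependency bundle, injected for testability.
interface EmitDeps {
    isTelemetryEnabled: () => boolean;
    collectSignals: () => string[]; // the step that reads project files
    send: (signals: string[]) => void;
}

/** Returns true if an event was sent, false otherwise. Never throws. */
function emitDetection(deps: EmitDeps): boolean {
    // Bail out before any disk access when the user has opted out.
    if (!deps.isTelemetryEnabled()) {
        return false;
    }
    try {
        deps.send(deps.collectSignals());
        return true;
    } catch {
        // Best effort: a failure must never disrupt setup.
        return false;
    }
}
```

The check sits before `collectSignals`, so an opted-out user incurs no telemetry-driven file reads at all.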
