Commit c236f29

Instrument package-manager detection telemetry at project setup
Why: We need first-party data on which Python package manager(s) our users' projects actually use (pip/conda/uv/poetry) to prioritize VPEX setup-flow investment, replacing public-survey estimates. Measurement only -- no setup behavior changes.

What:
- Add packageManagerDetection.ts: a pure, signal-based classifier that reports all applicable managers plus a best-guess primary (uv > poetry > conda > pip), the firing signals, hasLockfile, and interpreter source. Treats bare uv/poetry on PATH as weak signals.
- Add Events.PYTHON_ENV_SETUP_DETECTED with a typed, documented schema in telemetry/constants.ts (reuses existing Telemetry client; opt-out honored; categorical data only, no paths/package/cluster names).
- Add telemetry/packageManagerExtensions.ts: the emit half, layered onto the Telemetry class via the commandExtensions declare-module pattern (recordPackageManagerDetection). Keeps disk/Python-extension deps out of the Telemetry client.
- Add PackageManagerTelemetry.ts: the collection half -- a best-effort, non-blocking collector (disk + already-resolved interpreter metadata) that gathers signals, runs the pure classifier, and calls the emit method. Deduplicated per session on (trigger, projectRoot); failures degrade to unknown and are swallowed.
- Wire emission into three touchpoints: project-open env check (auto_open), the set-up-environment command (explicit_command), and first Run/Debug with Databricks Connect (run/debug).
- Add unit tests for the detector and pure helpers, and a dashboard-owner handoff note.

Detection correctness:
- interpreterSource is derived from the active interpreter alone, never from project files: a uv.lock project on a conda/venv/system interpreter reports that interpreter's real source, keeping the setup-flow gap visible. A genuinely uv-provisioned venv is identified by the `uv =` marker in pyvenv.cfg (pure pyvenvCfgMarksUv), not by uv.lock.
- conda is attributed only when the active interpreter resides under CONDA_PREFIX (pure interpreterUnderCondaPrefix, with a path-boundary check), not on the bare env var, which is session-global in the extension host (launching from an activated conda shell) and would otherwise over-count conda for uv/poetry/pip projects.
- pyproject [tool.uv]/[tool.poetry] detection uses a pure, bounded table-header scan (pyprojectHasToolSection) instead of substring matching: ignores comments and in-value mentions, rejects prefix collisions (e.g. tool.uvicorn), and matches subtable and array-of-table headers (e.g. [tool.uv.sources], [[tool.poetry.source]]) that the substring check missed.
- No external executable is run for telemetry: the uv-on-PATH probe was removed (it spawned a PATH-resolved `uv` for a weak, non-attributing signal); detection now only reads disk and already-resolved interpreter metadata.

Verification:
- yarn run build (typecheck) passes.
- eslint clean; prettier formatted.
- yarn run test:unit: 202 passing, 0 failing (includes detector + helper tests).

Co-authored-by: Isaac
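The body of pyprojectHasToolSection is not part of the files shown on this page, so the following is only a minimal sketch of what a table-header scan with the properties listed above might look like. The signature, the `MAX_SCANNED_LINES` bound, and the header regex are assumptions for illustration, not the committed implementation.

```typescript
// Hypothetical sketch; the real helper lives in packageManagerDetection.ts,
// which is not shown here. The line bound below is an assumed value.
const MAX_SCANNED_LINES = 5000;

function pyprojectHasToolSection(
    content: string | undefined,
    tool: string,
    maxLines: number = MAX_SCANNED_LINES
): boolean {
    if (content === undefined) {
        return false;
    }
    const wanted = `tool.${tool}`;
    // split() with a limit keeps the scan bounded on very large files.
    for (const raw of content.split(/\r?\n/, maxLines)) {
        const line = raw.trim();
        // Only a line that begins with "[" can be a table header. This skips
        // comments ("# [tool.uv]") and mentions inside string values.
        if (!line.startsWith("[")) {
            continue;
        }
        // Accept [table] and [[array-of-tables]], with an optional trailing comment.
        const match = /^\[{1,2}\s*([^\]]+?)\s*\]{1,2}\s*(#.*)?$/.exec(line);
        if (match === null) {
            continue;
        }
        // Tolerate whitespace around the dots of a dotted key.
        const name = match[1]
            .split(".")
            .map((part) => part.trim())
            .join(".");
        // Exact table or a subtable; "tool.uvicorn" must not match "tool.uv".
        if (name === wanted || name.startsWith(`${wanted}.`)) {
            return true;
        }
    }
    return false;
}
```

A line-based scan like this is not a TOML parser: a multi-line string whose content happens to contain a header-shaped line would still be counted. For categorical telemetry that trade-off may be acceptable, but it is a limitation of the sketch.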
1 parent 79ef3eb commit c236f29

10 files changed

Lines changed: 1168 additions & 6 deletions

packages/databricks-vscode/src/extension.ts

Lines changed: 24 additions & 3 deletions
@@ -75,6 +75,7 @@ import {BundleVariableTreeDataProvider} from "./ui/bundle-variables/BundleVariab
 import {ConfigurationTreeViewManager} from "./ui/configuration-view/ConfigurationTreeViewManager";
 import {getCLIDependenciesEnvVars} from "./utils/envVarGenerators";
 import {EnvironmentCommands} from "./language/EnvironmentCommands";
+import {PackageManagerTelemetry} from "./language/PackageManagerTelemetry";
 import {WorkspaceFolderManager} from "./vscode-objs/WorkspaceFolderManager";
 import {SyncCommands} from "./sync/SyncCommands";
 import {CodeSynchronizer} from "./sync";
@@ -335,6 +336,23 @@ export async function activate(
         customWhenContext,
         telemetry
     );
+    const packageManagerTelemetry = new PackageManagerTelemetry(
+        telemetry,
+        pythonExtensionWrapper,
+        () => {
+            try {
+                return workspaceFolderManager.activeProjectUri.fsPath;
+            } catch (e) {
+                return undefined;
+            }
+        },
+        () => {
+            if (connectionManager.serverless) {
+                return "serverless";
+            }
+            return connectionManager.cluster ? "cluster" : "none";
+        }
+    );
     context.subscriptions.push(
         bundleFileWatcher,
         bundleValidateModel,
@@ -598,13 +616,15 @@ export async function activate(
             connectionManager,
             pythonExtensionWrapper,
             environmentDependenciesInstaller,
-            configureAutocomplete
+            configureAutocomplete,
+            packageManagerTelemetry
         )
     );
     const environmentCommands = new EnvironmentCommands(
         featureManager,
         pythonExtensionWrapper,
-        environmentDependenciesInstaller
+        environmentDependenciesInstaller,
+        packageManagerTelemetry
     );
     context.subscriptions.push(
         telemetry.registerCommand(
@@ -982,7 +1002,8 @@ export async function activate(
         featureManager,
         context,
         customWhenContext,
-        telemetry
+        telemetry,
+        packageManagerTelemetry
     );
     const debugFactory = new DatabricksDebugAdapterFactory(
         connectionManager,

packages/databricks-vscode/src/language/EnvironmentCommands.ts

Lines changed: 4 additions & 1 deletion
@@ -5,16 +5,19 @@ import {Cluster} from "../sdk-extensions";
 import {EnvironmentDependenciesInstaller} from "./EnvironmentDependenciesInstaller";
 import {Environment} from "./MsPythonExtensionApi";
 import {environmentName} from "../utils/environmentUtils";
+import {PackageManagerTelemetry} from "./PackageManagerTelemetry";

 export class EnvironmentCommands {
     constructor(
         private featureManager: FeatureManager,
         private pythonExtension: MsPythonExtensionWrapper,
-        private installer: EnvironmentDependenciesInstaller
+        private installer: EnvironmentDependenciesInstaller,
+        private packageManagerTelemetry: PackageManagerTelemetry
     ) {}

     async setup(stepId?: string) {
         commands.executeCommand("configurationView.focus");
+        void this.packageManagerTelemetry.emitDetection("explicit_command");
         await window.withProgress(
             {location: {viewId: "configurationView"}},
             () => this._setup(stepId)

packages/databricks-vscode/src/language/EnvironmentDependenciesVerifier.ts

Lines changed: 6 additions & 1 deletion
@@ -10,6 +10,7 @@ import {ResolvedEnvironment} from "./MsPythonExtensionApi";
 import {NamedLogger} from "@databricks/sdk-experimental/dist/logging";
 import {ConfigureAutocomplete} from "./ConfigureAutocomplete";
 import {workspaceConfigs} from "../vscode-objs/WorkspaceConfigs";
+import {PackageManagerTelemetry} from "./PackageManagerTelemetry";

 export class EnvironmentDependenciesVerifier extends MultiStepAccessVerifier {
     private readonly logger = NamedLogger.getOrCreate(Loggers.Extension);
@@ -18,7 +19,8 @@ export class EnvironmentDependenciesVerifier extends MultiStepAccessVerifier {
         private readonly connectionManager: ConnectionManager,
         private readonly pythonExtension: MsPythonExtensionWrapper,
         private readonly installer: EnvironmentDependenciesInstaller,
-        private readonly configureAutocomplete: ConfigureAutocomplete
+        private readonly configureAutocomplete: ConfigureAutocomplete,
+        private readonly packageManagerTelemetry: PackageManagerTelemetry
     ) {
         super([
             "checkCluster",
@@ -403,6 +405,9 @@ export class EnvironmentDependenciesVerifier extends MultiStepAccessVerifier {
     }

     override async check() {
+        // First environment check on project open: emit package-manager
+        // detection telemetry (deduplicated per session, never throws).
+        void this.packageManagerTelemetry.emitDetection("auto_open");
         await this.connectionManager.waitForConnect();
         await Promise.all([
             this.checkCluster(this.connectionManager.cluster),
packages/databricks-vscode/src/language/PackageManagerTelemetry.ts

Lines changed: 251 additions & 0 deletions
@@ -0,0 +1,251 @@
+import fs from "node:fs";
+import path from "node:path";
+import {NamedLogger} from "@databricks/sdk-experimental/dist/logging";
+import {Loggers} from "../logger";
+import {Telemetry} from "../telemetry";
+import {ComputeType, SetupTrigger} from "../telemetry/constants";
+import "../telemetry/packageManagerExtensions";
+import {MsPythonExtensionWrapper} from "./MsPythonExtensionWrapper";
+import {ResolvedEnvironment} from "./MsPythonExtensionApi";
+import {
+    detectPackageManagers,
+    interpreterUnderCondaPrefix,
+    InterpreterSource,
+    PackageManagerSignals,
+    pyprojectHasToolSection,
+    pyvenvCfgMarksUv,
+} from "./packageManagerDetection";
+
+export type {SetupTrigger};
+
+/**
+ * Collects package-manager signals at project-setup touchpoints and emits the
+ * {@link Events.PYTHON_ENV_SETUP_DETECTED} telemetry event.
+ *
+ * All probing is best-effort and non-blocking: any failure degrades to
+ * `unknown` and is swallowed, never thrown into the user's setup/run flow.
+ * Only categorical/enum data is emitted — no paths, package names, or other
+ * free-form content (see {@link detectPackageManagers}). Telemetry opt-out is
+ * honoured by the underlying {@link Telemetry} client.
+ */
+export class PackageManagerTelemetry {
+    private readonly logger = NamedLogger.getOrCreate(Loggers.Extension);
+
+    /**
+     * Triggers already emitted for the current `(project, trigger)` pair, to
+     * deduplicate within a session so one project open doesn't inflate counts.
+     */
+    private readonly emitted = new Set<string>();
+
+    constructor(
+        private readonly telemetry: Telemetry,
+        private readonly pythonExtension: MsPythonExtensionWrapper,
+        private readonly getProjectRoot: () => string | undefined,
+        private readonly getComputeType: () => ComputeType | "none"
+    ) {}
+
+    /**
+     * Detect the package manager(s) for the active project and emit telemetry.
+     * Deduplicated per `(project root, trigger)` within the session. Never
+     * throws.
+     */
+    async emitDetection(trigger: SetupTrigger): Promise<void> {
+        try {
+            const projectRoot = this.getProjectRoot();
+            if (projectRoot === undefined) {
+                return;
+            }
+            const dedupeKey = `${trigger}:${projectRoot}`;
+            if (this.emitted.has(dedupeKey)) {
+                return;
+            }
+            this.emitted.add(dedupeKey);
+
+            const env = await this.resolveEnvironment();
+            const signals = this.collectSignals(projectRoot, env);
+            const detection = detectPackageManagers(signals);
+
+            this.telemetry.recordPackageManagerDetection(detection, {
+                pythonVersion: this.getPythonMinorVersion(env),
+                targetCompute: this.getComputeType(),
+                trigger,
+            });
+        } catch (e) {
+            // Detection is measurement-only and must never disrupt setup.
+            this.logger.debug("Package manager detection failed", e);
+        }
+    }
+
+    private async resolveEnvironment(): Promise<
+        ResolvedEnvironment | undefined
+    > {
+        try {
+            return await this.pythonExtension.pythonEnvironment;
+        } catch (e) {
+            this.logger.debug("Failed to resolve python environment", e);
+            return undefined;
+        }
+    }
+
+    /** Detected interpreter minor version (e.g. "3.11"), if available. */
+    private getPythonMinorVersion(
+        env: ResolvedEnvironment | undefined
+    ): string | undefined {
+        const version = env?.version;
+        if (version?.major === undefined || version.minor === undefined) {
+            return undefined;
+        }
+        return `${version.major}.${version.minor}`;
+    }
+
+    /**
+     * Classify the *active interpreter's* provenance from the resolved
+     * environment alone. This is deliberately independent of project files: a
+     * project carrying `uv.lock` but running a conda/venv/system interpreter
+     * must report that interpreter's real source, so the setup-flow gap ("uv
+     * project, interpreter not uv-managed yet") stays visible. `uv.lock` is
+     * still captured as a strong *project* signal via `hasUvLock`.
+     */
+    private getInterpreterSource(
+        env: ResolvedEnvironment | undefined
+    ): InterpreterSource {
+        if (env?.environment === undefined) {
+            // No managed environment: a global/system interpreter.
+            return env ? "system" : "unknown";
+        }
+
+        const tools = env.tools ?? [];
+        if (env.environment.type === "Conda" || tools.includes("Conda")) {
+            return "conda";
+        }
+        if (
+            tools.includes("Venv") ||
+            tools.includes("VirtualEnv") ||
+            tools.includes("Poetry") ||
+            tools.includes("Pipenv") ||
+            env.environment.type === "VirtualEnvironment"
+        ) {
+            // The MS Python extension reports uv-created venvs as plain virtual
+            // environments. Distinguish a genuinely uv-provisioned interpreter
+            // by the `uv = <version>` line uv writes into pyvenv.cfg -- this is
+            // interpreter provenance, not the mere presence of uv.lock.
+            return this.isUvCreatedVenv(env) ? "uv" : "venv";
+        }
+        return "unknown";
+    }
+
+    /**
+     * True if the active venv's pyvenv.cfg marks it as uv-created. Thin fs
+     * wrapper around the pure {@link pyvenvCfgMarksUv}.
+     */
+    private isUvCreatedVenv(env: ResolvedEnvironment): boolean {
+        try {
+            const sysPrefix = env.executable.sysPrefix;
+            if (!sysPrefix) {
+                return false;
+            }
+            const cfg = path.join(sysPrefix, "pyvenv.cfg");
+            if (!fs.existsSync(cfg)) {
+                return false;
+            }
+            return pyvenvCfgMarksUv(fs.readFileSync(cfg, "utf-8"));
+        } catch (e) {
+            this.logger.debug("Failed to read pyvenv.cfg", e);
+            return false;
+        }
+    }
+
+    /**
+     * Gather raw signals from disk and the environment. Each probe is guarded
+     * so a single failure degrades that signal to absent rather than aborting.
+     */
+    private collectSignals(
+        projectRoot: string,
+        env: ResolvedEnvironment | undefined
+    ): PackageManagerSignals {
+        const exists = (file: string) => this.fileExists(projectRoot, file);
+        const interpreterSource = this.getInterpreterSource(env);
+        const pyproject = this.readPyproject(projectRoot);
+
+        const hasPyprojectToolUv = pyprojectHasToolSection(pyproject, "uv");
+        const hasPyprojectToolPoetry = pyprojectHasToolSection(
+            pyproject,
+            "poetry"
+        );
+        const hasPyprojectPipOnly =
+            pyproject !== undefined &&
+            !hasPyprojectToolUv &&
+            !hasPyprojectToolPoetry;
+
+        return {
+            hasUvLock: exists("uv.lock"),
+            hasPyprojectToolUv,
+            // uvOnPath is intentionally left unset: it is a weak signal that
+            // never attributes a project to uv, and probing it would mean
+            // executing a PATH-resolved `uv` binary purely for telemetry.
+            hasPoetryLock: exists("poetry.lock"),
+            hasPyprojectToolPoetry,
+            poetryOnPath: undefined,
+            hasRequirementsTxt: this.hasRequirementsTxt(projectRoot),
+            hasConstraintsTxt: exists("constraints.txt"),
+            hasPyprojectPipOnly,
+            hasCondaEnvFile:
+                exists("environment.yml") || exists("environment.yaml"),
+            hasCondaPrefix: this.hasActiveCondaInterpreter(env),
+            interpreterSource,
+        };
+    }
+
+    /**
+     * Whether the *active interpreter* lives under `CONDA_PREFIX`.
+     *
+     * We deliberately do NOT fire on the bare presence of `CONDA_PREFIX` /
+     * `CONDA_DEFAULT_ENV`: those are session-global in the extension host (set
+     * for every project when VS Code is launched from an activated conda
+     * shell), so using them directly would over-count conda for uv/poetry/pip
+     * projects. Requiring the active interpreter to reside under the prefix
+     * keeps this a project-scoped signal.
+     */
+    private hasActiveCondaInterpreter(
+        env: ResolvedEnvironment | undefined
+    ): boolean {
+        return interpreterUnderCondaPrefix(
+            env?.executable.sysPrefix,
+            process.env["CONDA_PREFIX"]
+        );
+    }
+
+    private fileExists(projectRoot: string, file: string): boolean {
+        try {
+            return fs.existsSync(path.join(projectRoot, file));
+        } catch (e) {
+            this.logger.debug(`Failed to stat ${file}`, e);
+            return false;
+        }
+    }
+
+    /** True if any `requirements*.txt` file exists in the project root. */
+    private hasRequirementsTxt(projectRoot: string): boolean {
+        try {
+            return fs
+                .readdirSync(projectRoot)
+                .some((name) => /^requirements.*\.txt$/.test(name));
+        } catch (e) {
+            this.logger.debug("Failed to list project root", e);
+            return false;
+        }
+    }
+
+    private readPyproject(projectRoot: string): string | undefined {
+        try {
+            const file = path.join(projectRoot, "pyproject.toml");
+            if (!fs.existsSync(file)) {
+                return undefined;
+            }
+            return fs.readFileSync(file, "utf-8");
+        } catch (e) {
+            this.logger.debug("Failed to read pyproject.toml", e);
+            return undefined;
+        }
+    }
+}
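The collector above delegates to two pure helpers, interpreterUnderCondaPrefix and pyvenvCfgMarksUv, whose bodies are in packageManagerDetection.ts and are not shown on this page. The sketch below is one plausible shape for them, inferred only from the call sites and the commit description. The string-based normalisation in the hypothetical `normalizeDir` helper is an assumption, and the committed code may differ (for example, by using node:path).

```typescript
// Hypothetical sketches; neither body appears in the diff shown here.

/** Use "/" separators and drop trailing slashes so prefixes compare cleanly. */
function normalizeDir(p: string): string {
    return p.replace(/\\/g, "/").replace(/\/+$/, "");
}

/**
 * True if sysPrefix is the conda prefix itself or a directory below it.
 * The trailing "/" in the startsWith check is the path-boundary guard:
 * ".../envs/ml-old" must not count as being under ".../envs/ml".
 */
function interpreterUnderCondaPrefix(
    sysPrefix: string | undefined,
    condaPrefix: string | undefined
): boolean {
    if (!sysPrefix || !condaPrefix) {
        return false;
    }
    const child = normalizeDir(sysPrefix);
    const parent = normalizeDir(condaPrefix);
    if (parent === "") {
        // A prefix of "/" would match every path; treat it as no signal.
        return false;
    }
    return child === parent || child.startsWith(`${parent}/`);
}

/** True if any line of pyvenv.cfg is a `uv = <version>` entry. */
function pyvenvCfgMarksUv(cfg: string): boolean {
    return cfg.split(/\r?\n/).some((line) => /^\s*uv\s*=/.test(line));
}
```

This sketch compares paths case-sensitively and does not resolve symlinks, so on case-insensitive file systems it can under-report conda. For a measurement-only signal that errs toward not attributing conda, which matches the over-counting concern in the commit message.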
