Commit cb16853

feat: make telemetry off by default (#4281)
## Summary

Closes #3940

## Changes

- **Behavior:** `scarf_analytics()` sends the ping only when `UNSTRUCTURED_TELEMETRY_ENABLED=true` (or `1`). Opt-out env vars `DO_NOT_TRACK` and `SCARF_NO_ANALYTICS` are still respected and take precedence.
- **Docs:** README Analytics section and logger comment updated to describe the new default and opt-in/opt-out.
- **Tests:** New `DescribeScarfAnalytics` tests for default off, opt-in (`true`/`1`), and opt-out overriding opt-in.
- **Changelog:** Entry under 0.21.13.

---------

Co-authored-by: Lawrence Elitzer (LoLo) <lawrence@unstructured.io>
1 parent 5585e98 commit cb16853

9 files changed

Lines changed: 331 additions & 47 deletions

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -1,3 +1,12 @@
+## 0.22.0
+
+### Breaking changes
+- **Opt-out env semantics**: `DO_NOT_TRACK` and `SCARF_NO_ANALYTICS` now treat any non-empty value (after strip) as opt-out. Previously only the exact string `"true"` opted out. Values like `false`, `0`, or `no` now also disable telemetry. To avoid opting out, unset the variable or leave it empty.
+
+### Enhancements
+- **Telemetry off by default**: The library-load analytics ping is disabled by default. Set `UNSTRUCTURED_TELEMETRY_ENABLED=true` before importing to restore the previous behavior. Opt-out via `DO_NOT_TRACK` or `SCARF_NO_ANALYTICS` (any non-empty value) takes precedence.
+- Telemetry ping uses `requests.get(..., params=...)` for correct URL encoding and a single dev/non-dev code path.
+
 ## 0.21.12
 - **Add Check for complex documents**: Adds a check for complex documents to avoid pdfminer with a high ratio of vector objects
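The changelog entry about `params=` is easier to appreciate with a small illustration of what query-string encoding does. The sketch below uses only the standard library's `urllib.parse.urlencode` as a stand-in; `requests` applies comparable percent-encoding when it is given `params`. The helper name `build_ping_url` and the sample values (including the deliberately awkward `arch` value) are hypothetical and are not taken from the library.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint as asserted in the tests of this commit.
BASE_URL = "https://packages.unstructured.io/python-telemetry"


def build_ping_url(params: dict) -> str:
    """Hypothetical helper: attach percent-encoded query parameters to the endpoint."""
    return f"{BASE_URL}?{urlencode(params)}"


# Sample values are made up; "arch" contains a space and "&" on purpose,
# which would corrupt a URL assembled by plain string formatting.
params = {
    "version": "1.2.3.dev0",
    "platform": "Linux",
    "python": "3.11",
    "arch": "x86 64&test",
    "gpu": "False",
    "dev": "true",
}

url = build_ping_url(params)

# Decoding the query string recovers the original values exactly.
decoded = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
```

Because every value is encoded the same way, a single code path can serve both dev and release versions, with `dev` passed as just another parameter.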

README.md

Lines changed: 1 addition & 1 deletion
@@ -268,4 +268,4 @@ Encountered a bug? Please create a new [GitHub issue](https://github.com/Unstruc
 
 ## :chart_with_upwards_trend: Analytics
 
-This library includes a very lightweight analytics "ping" when the library is loaded, however you can opt out of this data collection by setting the environment variable `DO_NOT_TRACK=true` before executing any `unstructured` code. To learn more about how we collect and use this data, please read our [Privacy Policy](https://unstructured.io/privacy-policy).
+Telemetry is **off by default**. To opt in, set `UNSTRUCTURED_TELEMETRY_ENABLED=true` (or `=1`) before importing `unstructured`. To opt out, set `DO_NOT_TRACK` or `SCARF_NO_ANALYTICS` to any non-empty value (e.g. `true`, `1`, `yes`, `false`, `0`—any non-empty string opts out); opt-out takes precedence. Unset the variable or leave it empty if you do not want to opt out. See our [Privacy Policy](https://unstructured.io/privacy-policy).
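The opt-in/opt-out precedence described in the README text can be sketched as follows. This is a minimal sketch rather than the library's implementation, since the diff does not show `unstructured/utils.py`. The function names `telemetry_opt_out`, `telemetry_opt_in`, and `telemetry_enabled` are hypothetical, as is the `env` parameter (added so the logic can be exercised without touching the real environment). Stripping whitespace from the opt-in value is also an assumption; the source only states that opt-out values are stripped and that `true`, `True`, `TRUE`, and `1` opt in.

```python
import os
from typing import Mapping, Optional

OPT_OUT_VARS = ("DO_NOT_TRACK", "SCARF_NO_ANALYTICS")
OPT_IN_VAR = "UNSTRUCTURED_TELEMETRY_ENABLED"


def telemetry_opt_out(env: Optional[Mapping[str, str]] = None) -> bool:
    """True if either opt-out variable is non-empty after stripping whitespace."""
    env = os.environ if env is None else env
    return any(env.get(name, "").strip() for name in OPT_OUT_VARS)


def telemetry_opt_in(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only for 'true' or '1' (case-insensitive)."""
    env = os.environ if env is None else env
    # Assumption: the opt-in value is stripped before comparison.
    return env.get(OPT_IN_VAR, "").strip().lower() in ("true", "1")


def telemetry_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Opt-out always wins; otherwise an explicit opt-in is required."""
    return not telemetry_opt_out(env) and telemetry_opt_in(env)
```

Note the asymmetry this encodes: `DO_NOT_TRACK=false` still opts out, because any non-empty value counts, while `UNSTRUCTURED_TELEMETRY_ENABLED=yes` does not opt in.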

scripts/image/test-outbound-connectivity.sh

Lines changed: 8 additions & 3 deletions
@@ -48,7 +48,7 @@ fi
 
 SCENARIO="${1:-}"
 if [[ -z "$SCENARIO" ]]; then
-  echo "Usage: $0 [--cleanup] {baseline|missing-models|offline|offline-and-missing-models}" >&2
+  echo "Usage: $0 [--cleanup] {baseline|missing-models|analytics-online-only|offline|offline-and-missing-models}" >&2
   exit 1
 fi
 
@@ -61,12 +61,16 @@ fi
 
 # ---------- scenario‑specific settings --------------------------------
 DO_NOT_TRACK=""
+UNSTRUCTURED_TELEMETRY_ENABLED=""
 HF_HUB_OFFLINE=""
 REMOVE_CACHE=0
 case "$SCENARIO" in
   baseline) ;;
   missing-models) REMOVE_CACHE=1 ;;
-  analytics-online-only) HF_HUB_OFFLINE=1 ;;
+  analytics-online-only)
+    UNSTRUCTURED_TELEMETRY_ENABLED=1
+    HF_HUB_OFFLINE=1
+    ;;
   offline)
     DO_NOT_TRACK=true
     HF_HUB_OFFLINE=1
@@ -89,6 +93,7 @@ CID=$(docker run -d --rm --name "sut_${SCENARIO}" \
   --network "$NET" \
   --cap-add NET_RAW --cap-add NET_ADMIN \
   -e DO_NOT_TRACK="$DO_NOT_TRACK" \
+  -e UNSTRUCTURED_TELEMETRY_ENABLED="$UNSTRUCTURED_TELEMETRY_ENABLED" \
   -e HF_HUB_OFFLINE="$HF_HUB_OFFLINE" \
   --entrypoint /bin/sh "$IMAGE" -c "sleep infinity")
 echo "Container: $CID (scenario $SCENARIO)"
@@ -127,8 +132,8 @@ fi
 
 docker exec -i -e PYTHONUNBUFFERED=1 "$CID" python - <<PY |& tee "${PY_LOG_DIR}/${SCENARIO}.log"
 import logging
+# Telemetry runs at package init when UNSTRUCTURED_TELEMETRY_ENABLED is set (see analytics-online-only scenario).
 from unstructured.partition.auto import partition
-from unstructured.logger import logger  # force analytics ping if not DO_NOT_TRACK
 import urllib.request, time, os, sys
 
 # Configure detailed logging
Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
+"""Hermetic telemetry tests: env is set to opt-out before importing unstructured.
+
+This module must set DO_NOT_TRACK (or equivalent) before any import of unstructured
+so that init_telemetry() runs with opt-out at import time and no real network/subprocess
+occurs. Tests then use monkeypatch and mocks to assert behavior.
+"""
+
+from __future__ import annotations
+
+# Set opt-out before any unstructured import so package init does not run telemetry.
+import os
+
+os.environ["DO_NOT_TRACK"] = "1"
+os.environ.pop("UNSTRUCTURED_TELEMETRY_ENABLED", None)
+os.environ.pop("SCARF_NO_ANALYTICS", None)
+
+import platform
+import subprocess
+import sys
+from pathlib import Path
+from unittest.mock import Mock
+
+import pytest
+import requests
+
+from unstructured import utils
+
+
+@pytest.fixture
+def telemetry_mocks(monkeypatch):
+    """Clear telemetry env and patch requests.get + subprocess.check_output.
+
+    Returns (mock_get, mock_subprocess). Use for both send and no-send tests so
+    we can assert network and subprocess side effects.
+    """
+    monkeypatch.delenv("UNSTRUCTURED_TELEMETRY_ENABLED", raising=False)
+    monkeypatch.delenv("SCARF_NO_ANALYTICS", raising=False)
+    monkeypatch.delenv("DO_NOT_TRACK", raising=False)
+    mock_get = Mock()
+    mock_subprocess = Mock()
+    monkeypatch.setattr("unstructured.utils.requests.get", mock_get)
+    monkeypatch.setattr("unstructured.utils.subprocess.check_output", mock_subprocess)
+    return mock_get, mock_subprocess
+
+
+def _apply_telemetry_env(monkeypatch, env_overrides):
+    """Set env vars from dict; keys are env var names, values are str or None (delenv)."""
+    for key, value in env_overrides.items():
+        if value is None:
+            monkeypatch.delenv(key, raising=False)
+        else:
+            monkeypatch.setenv(key, value)
+
+
+class DescribeScarfAnalytics:
+    """Tests for scarf_analytics (telemetry off by default, opt-in only)."""
+
+    def it_telemetry_opt_out_any_non_empty_for_both_vars(self, monkeypatch):
+        """Contract: DO_NOT_TRACK and SCARF_NO_ANALYTICS both opt out on any non-empty value."""
+        monkeypatch.delenv("DO_NOT_TRACK", raising=False)
+        monkeypatch.delenv("SCARF_NO_ANALYTICS", raising=False)
+        assert utils._telemetry_opt_out() is False
+        monkeypatch.setenv("DO_NOT_TRACK", "yes")
+        assert utils._telemetry_opt_out() is True
+        monkeypatch.delenv("DO_NOT_TRACK", raising=False)
+        monkeypatch.setenv("SCARF_NO_ANALYTICS", "on")
+        assert utils._telemetry_opt_out() is True
+
+    def it_telemetry_opt_in_only_true_or_1(self, monkeypatch):
+        """Contract: only UNSTRUCTURED_TELEMETRY_ENABLED in ('true','1') opts in."""
+        monkeypatch.delenv("UNSTRUCTURED_TELEMETRY_ENABLED", raising=False)
+        assert utils._telemetry_opt_in() is False
+        for val in ("true", "1", "True", "TRUE"):
+            monkeypatch.setenv("UNSTRUCTURED_TELEMETRY_ENABLED", val)
+            assert utils._telemetry_opt_in() is True
+        for val in ("false", "0", "yes", ""):
+            monkeypatch.setenv("UNSTRUCTURED_TELEMETRY_ENABLED", val)
+            assert utils._telemetry_opt_in() is False
+
+    @pytest.mark.parametrize(
+        "env_overrides",
+        [
+            {},
+            {"DO_NOT_TRACK": "true"},
+            {"DO_NOT_TRACK": "1"},
+            {"DO_NOT_TRACK": "TRUE"},
+            {"DO_NOT_TRACK": "false"},
+            {"DO_NOT_TRACK": "0"},
+            {"SCARF_NO_ANALYTICS": "true"},
+            {"SCARF_NO_ANALYTICS": "yes"},
+            {"SCARF_NO_ANALYTICS": "on"},
+            {"SCARF_NO_ANALYTICS": "1"},
+            {"SCARF_NO_ANALYTICS": "TRUE"},
+            {"SCARF_NO_ANALYTICS": "false"},
+            {"SCARF_NO_ANALYTICS": "0"},
+            {"SCARF_NO_ANALYTICS": " true "},
+            {"UNSTRUCTURED_TELEMETRY_ENABLED": "false"},
+            {"UNSTRUCTURED_TELEMETRY_ENABLED": "0"},
+            {"UNSTRUCTURED_TELEMETRY_ENABLED": "yes"},
+            {"UNSTRUCTURED_TELEMETRY_ENABLED": "FALSE"},
+            {"UNSTRUCTURED_TELEMETRY_ENABLED": "true", "DO_NOT_TRACK": "true"},
+            {"UNSTRUCTURED_TELEMETRY_ENABLED": "true", "SCARF_NO_ANALYTICS": "on"},
+        ],
+        ids=[
+            "default_no_opt_in",
+            "DO_NOT_TRACK=true",
+            "DO_NOT_TRACK=1",
+            "DO_NOT_TRACK=TRUE",
+            "DO_NOT_TRACK=false",
+            "DO_NOT_TRACK=0",
+            "SCARF_NO_ANALYTICS=true",
+            "SCARF_NO_ANALYTICS=yes",
+            "SCARF_NO_ANALYTICS=on",
+            "SCARF_NO_ANALYTICS=1",
+            "SCARF_NO_ANALYTICS=TRUE",
+            "SCARF_NO_ANALYTICS=false",
+            "SCARF_NO_ANALYTICS=0",
+            "SCARF_NO_ANALYTICS=whitespace",
+            "opt_in=false",
+            "opt_in=0",
+            "opt_in=yes",
+            "opt_in=FALSE",
+            "opt_in_true_but_DO_NOT_TRACK",
+            "opt_in_true_but_SCARF_NO_ANALYTICS",
+        ],
+    )
+    def it_does_not_send_telemetry_when_disabled_or_opted_out(
+        self, monkeypatch, telemetry_mocks, env_overrides
+    ):
+        """No network or subprocess when telemetry disabled or opt-out set."""
+        mock_get, mock_subprocess = telemetry_mocks
+        _apply_telemetry_env(monkeypatch, env_overrides)
+        utils.scarf_analytics()
+        mock_get.assert_not_called()
+        mock_subprocess.assert_not_called()
+
+    @pytest.mark.parametrize("opt_in_value", ["true", "True", "TRUE", "1"])
+    def it_sends_telemetry_when_opt_in_is_set(self, monkeypatch, telemetry_mocks, opt_in_value):
+        mock_get, mock_subprocess = telemetry_mocks
+        _apply_telemetry_env(monkeypatch, {"UNSTRUCTURED_TELEMETRY_ENABLED": opt_in_value})
+        utils.scarf_analytics()
+        mock_get.assert_called_once()
+        mock_subprocess.assert_called_once_with(["nvidia-smi"], stderr=subprocess.DEVNULL)
+        call_args = mock_get.call_args
+        assert call_args[0][0] == "https://packages.unstructured.io/python-telemetry"
+        params = call_args[1]["params"]
+        assert set(params.keys()) == {"version", "platform", "python", "arch", "gpu", "dev"}
+        assert call_args[1]["timeout"] == 10
+
+    @pytest.mark.parametrize(
+        ("version_val", "expected_dev"),
+        [("1.2.3.dev0", "true"), ("1.2.3", "false")],
+        ids=["dev_version", "release_version"],
+    )
+    def it_sends_telemetry_with_correct_dev_param(
+        self, monkeypatch, telemetry_mocks, version_val, expected_dev
+    ):
+        mock_get, mock_subprocess = telemetry_mocks
+        _apply_telemetry_env(monkeypatch, {"UNSTRUCTURED_TELEMETRY_ENABLED": "true"})
+        monkeypatch.setattr("unstructured.utils.__version__", version_val)
+        utils.scarf_analytics()
+        mock_get.assert_called_once()
+        mock_subprocess.assert_called_once()
+        params = mock_get.call_args[1]["params"]
+        assert params["dev"] == expected_dev
+        assert params["version"] == version_val
+        assert params["platform"] == platform.system()
+        assert params["arch"] == platform.machine()
+        assert mock_get.call_args[1]["timeout"] == 10
+
+    def it_handles_requests_exception_gracefully(self, monkeypatch, telemetry_mocks):
+        mock_get, mock_subprocess = telemetry_mocks
+        mock_get.side_effect = requests.RequestException("network error")
+        _apply_telemetry_env(monkeypatch, {"UNSTRUCTURED_TELEMETRY_ENABLED": "true"})
+        utils.scarf_analytics()  # does not raise
+        mock_get.assert_called_once()
+        mock_subprocess.assert_called_once()
+        assert mock_get.call_args[0][0] == "https://packages.unstructured.io/python-telemetry"
+        assert "version" in mock_get.call_args[1]["params"]
+
+    @pytest.mark.parametrize(
+        "exc",
+        [
+            OSError(),
+            PermissionError("nvidia-smi denied"),
+            subprocess.CalledProcessError(returncode=1, cmd=["nvidia-smi"]),
+        ],
+        ids=["OSError", "PermissionError", "CalledProcessError"],
+    )
+    def it_handles_nvidia_smi_failure_gracefully(self, monkeypatch, telemetry_mocks, exc):
+        """nvidia-smi probe failures must not propagate; telemetry still sends with gpu=False."""
+        mock_get, mock_subprocess = telemetry_mocks
+        mock_subprocess.side_effect = exc
+        _apply_telemetry_env(monkeypatch, {"UNSTRUCTURED_TELEMETRY_ENABLED": "true"})
+        utils.scarf_analytics()  # does not raise
+        mock_get.assert_called_once()
+        assert mock_get.call_args[1]["params"]["gpu"] == "False"
+        mock_subprocess.assert_called_once_with(["nvidia-smi"], stderr=subprocess.DEVNULL)
+
+    def it_import_unstructured_succeeds_with_opt_out(self):
+        """Import path with opt-out env does not crash (integration-style)."""
+        project_root = Path(__file__).resolve().parent.parent
+        env = {k: v for k, v in os.environ.items() if k != "UNSTRUCTURED_TELEMETRY_ENABLED"}
+        env.update(
+            {
+                "DO_NOT_TRACK": "1",
+                "SCARF_NO_ANALYTICS": "1",
+                "UNSTRUCTURED_TELEMETRY_ENABLED": "",
+                "PYTHONPATH": str(project_root),
+            }
+        )
+        result = subprocess.run(
+            [sys.executable, "-c", "import unstructured; print('ok')"],
+            env=env,
+            cwd=project_root,
+            capture_output=True,
+            text=True,
+            timeout=30,
+        )
+        assert result.returncode == 0, result.stderr or result.stdout
+        assert "ok" in result.stdout
+
+    def it_import_unstructured_runs_telemetry_once_when_opt_in(self):
+        """Import path with opt-in runs init_telemetry exactly once (patch then import)."""
+        project_root = Path(__file__).resolve().parent.parent
+        env = {
+            k: v
+            for k, v in os.environ.items()
+            if k not in ("DO_NOT_TRACK", "SCARF_NO_ANALYTICS", "UNSTRUCTURED_TELEMETRY_ENABLED")
+        }
+        env.update(
+            {
+                "UNSTRUCTURED_TELEMETRY_ENABLED": "true",
+                "PYTHONPATH": str(project_root),
+            }
+        )
+        script = """
+from unittest.mock import Mock, patch
+m_get = Mock()
+m_subprocess = Mock()
+with patch('requests.get', m_get), patch('subprocess.check_output', m_subprocess):
+    import unstructured
+exit(0 if (m_get.call_count == 1 and m_subprocess.call_count == 1) else 1)
+"""
+        result = subprocess.run(
+            [sys.executable, "-c", script],
+            env=env,
+            cwd=project_root,
+            capture_output=True,
+            text=True,
+            timeout=30,
+        )
+        assert result.returncode == 0, (
+            "Import with opt-in should run telemetry exactly once (requests.get and "
+            "subprocess.check_output each called once). "
+            f"stderr={result.stderr!r} stdout={result.stdout!r}"
+        )
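The last two tests in this file rely on one pattern: because telemetry runs as an import-time side effect, the only reliable way to observe it is a fresh interpreter with a controlled environment. A minimal sketch of that pattern, independent of `unstructured`, might look like the following. The helper name `run_isolated` and its parameters are hypothetical, chosen for illustration only.

```python
import os
import subprocess
import sys
from typing import Iterable, Mapping


def run_isolated(
    code: str,
    set_vars: Mapping[str, str],
    drop_vars: Iterable[str] = (),
) -> subprocess.CompletedProcess:
    """Run `code` in a fresh interpreter with a filtered copy of the environment."""
    dropped = set(drop_vars)
    # Copy the parent environment, minus variables that must not leak into the child.
    env = {key: value for key, value in os.environ.items() if key not in dropped}
    env.update(set_vars)
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        timeout=30,
    )


# Example: the child sees the opt-out variable but not the opt-in one.
result = run_isolated(
    "import os; print(os.environ.get('DO_NOT_TRACK'), "
    "os.environ.get('UNSTRUCTURED_TELEMETRY_ENABLED'))",
    set_vars={"DO_NOT_TRACK": "1"},
    drop_vars=("UNSTRUCTURED_TELEMETRY_ENABLED",),
)
```

Filtering the parent environment rather than starting from an empty one keeps variables such as `PATH` available, which the child interpreter may need to start at all.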

unstructured/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -1,4 +1,8 @@
 from .partition.utils.config import env_config
+from .telemetry import init_telemetry
 
 # init env_config
 env_config
+
+# Explicit startup boundary for telemetry (opt-in, best-effort)
+init_telemetry()

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.21.12"  # pragma: no cover
+__version__ = "0.22.0"  # pragma: no cover

unstructured/logger.py

Lines changed: 0 additions & 6 deletions
@@ -1,7 +1,5 @@
 import logging
 
-from unstructured.utils import scarf_analytics
-
 logger = logging.getLogger("unstructured")
 trace_logger = logging.getLogger("unstructured.trace")
 
@@ -16,9 +14,5 @@ def detail(self, message, *args, **kws):
         self._log(DETAIL, message, args, **kws)
 
 
-# Note(Trevor,Crag): to opt out of scarf analytics, set the environment variable:
-# SCARF_NO_ANALYTICS=true. See the README for more info.
-scarf_analytics()
-
 # Add the custom log method to the logging.Logger class
 logging.Logger.detail = detail  # type: ignore

unstructured/telemetry.py

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+"""Telemetry initializer. Called once at package startup from unstructured/__init__.py."""
+
+from unstructured.utils import scarf_analytics
+
+
+def init_telemetry() -> None:
+    """Run the analytics ping if enabled by env. Best-effort and non-fatal."""
+    scarf_analytics()
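`init_telemetry()` is documented as best-effort and non-fatal, and the tests in this commit pin down what that means in practice: an `nvidia-smi` probe failure yields `gpu="False"`, a failed HTTP request is swallowed, and the request carries `timeout=10`. The sketch below shows one way to satisfy those properties. It is not the code in `unstructured/utils.py`, which this diff does not show. The names `detect_gpu` and `send_ping`, the injectable `http_get` and `probe` parameters, the format of the `python` parameter, and the rule for deriving `dev` from the version string are all assumptions. The real code presumably catches `requests.RequestException`; a broad `except Exception` is used here only to keep the sketch free of third-party imports.

```python
import platform
import subprocess
from typing import Any, Callable

TELEMETRY_URL = "https://packages.unstructured.io/python-telemetry"


def detect_gpu(probe: Callable[..., Any] = subprocess.check_output) -> bool:
    """Return True if `nvidia-smi` runs; treat any probe failure as 'no GPU'."""
    try:
        probe(["nvidia-smi"], stderr=subprocess.DEVNULL)
        return True
    except (OSError, subprocess.CalledProcessError):
        # PermissionError and FileNotFoundError are subclasses of OSError.
        return False


def send_ping(
    http_get: Callable[..., Any],
    version: str,
    probe: Callable[..., Any] = subprocess.check_output,
) -> None:
    """Send one ping; never raise, whatever the probe or the network does."""
    params = {
        "version": version,
        "platform": platform.system(),
        "python": platform.python_version(),  # assumption: exact format unknown
        "arch": platform.machine(),
        "gpu": str(detect_gpu(probe)),
        "dev": "true" if "dev" in version else "false",  # assumption
    }
    try:
        http_get(TELEMETRY_URL, params=params, timeout=10)
    except Exception:
        # Best-effort: a failed ping must not break `import unstructured`.
        pass
```

Passing `requests.get` as `http_get` would give the production behavior, while a stub makes the function testable offline, mirroring what the commit's tests achieve with `monkeypatch`.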
