Skip to content

Commit ac5e944

Browse files
Avi-Robustaclaude
andauthored
[ROB-3981] Don't force Azure Prometheus auth from partial env vars (HolmesGPT#2101)
## Summary - `AzurePrometheusConfig.is_azure_config()` previously selected the Azure variant whenever `AZURE_CLIENT_ID` or `AZURE_TENANT_ID` was set on the pod — even when the user had a self-hosted `prometheus_url` configured and `AZURE_CLIENT_SECRET` was missing. This is common when Azure AI Foundry is the LLM provider: the `AZURE_*` env vars exist for an unrelated subsystem, and Holmes failed the Prometheus toolset with a confusing missing-credentials error. - `is_azure_config` now mirrors the same completeness rule that `AzurePrometheusConfig.__init__` already enforces: Azure mode is selected only when `azure_use_managed_id` is enabled OR all three of `azure_client_id` / `azure_tenant_id` / `azure_client_secret` are resolvable (config field or upper-cased `AZURE_*` env var). - A *partial* Azure signal (some creds present, others missing) now logs a `WARNING` naming what was found vs. missing and falls back to plain Prometheus, so the user immediately sees why Azure auth was skipped. - New `get_config_field(config, field)` module-level helper centralizes the "config dict, else upper-cased env var" lookup pattern. - The credential field list is sourced from `AzurePrometheusConfig._ui_required_fields` so the auto-detect check and the UI form requirement stay in lockstep. - The explicit `subtype: azure-managed-prometheus` opt-in path is untouched — it still selects `AzurePrometheusConfig` directly and raises the existing `ValueError` if credentials are incomplete. ## Test plan - [x] `poetry run pytest tests/plugins/toolsets/test_prometheus_unit.py --no-cov` — 28 passed (14 existing + 7 new `TestIsAzureConfig` + 7 new `TestGetConfigField`). - [x] Regression test asserts the user's exact scenario: with `AZURE_CLIENT_ID` + `AZURE_TENANT_ID` exported (no secret) and a plain `prometheus_url` in config, `determine_prometheus_class` returns `PrometheusConfig` (not `AzurePrometheusConfig`) and emits a "Partial Azure" warning. - [x] Positive cases still pass: all three env vars set → `AzurePrometheusConfig`; `azure_use_managed_id: true` in config or env → `AzurePrometheusConfig`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved Azure authentication validation to detect incomplete credential configurations and log informative warnings instead of silently failing. * Enhanced configuration field resolution with environment variable fallback support. * **Tests** * Expanded test coverage for configuration field handling and Azure authentication detection scenarios. <!-- review_stack_entry_start --> [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/HolmesGPT/holmesgpt/pull/2101?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: avi@robusta.dev <avi@robusta.dev> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d25c43a commit ac5e944

2 files changed

Lines changed: 180 additions & 10 deletions

File tree

holmes/plugins/toolsets/prometheus/prometheus.py

Lines changed: 43 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -77,6 +77,14 @@ class PrometheusSubtype(str, Enum):
7777
AZURE_MANAGED_PROMETHEUS = "azure-managed-prometheus"
7878

7979

80+
def get_config_field(config: dict[str, Any], field: str) -> Any:
81+
"""Return the value of ``field`` from ``config``, falling back to the
82+
matching upper-cased environment variable. Empty/None config values fall
83+
through to the env var; missing or empty env vars return ``None``.
84+
"""
85+
return config.get(field) or os.environ.get(field.upper()) or None
86+
87+
8088
def format_ssl_error_message(prometheus_url: str, error: SSLError) -> str:
8189
"""Format a clear SSL error message with remediation steps."""
8290
return (
@@ -536,19 +544,44 @@ def __init__(self, **data):
536544

537545
@staticmethod
538546
def is_azure_config(config: dict[str, Any]) -> bool:
539-
"""Check if config dict or environment variables indicate Azure Prometheus config."""
540-
# Check for explicit Azure fields in config
541-
if (
542-
"azure_client_id" in config
543-
or "azure_tenant_id" in config
544-
or "azure_use_managed_id" in config
545-
):
546-
return True
547+
"""Return True if the config (dict + AZURE_* env vars) is complete enough
548+
to actually use Azure Managed Prometheus auth.
549+
550+
Mirrors the completeness rules enforced in ``__init__``: either
551+
``azure_use_managed_id`` is enabled, or all three of
552+
``azure_client_id`` / ``azure_tenant_id`` / ``azure_client_secret`` are
553+
resolvable. A partial Azure signal (e.g. ``AZURE_CLIENT_ID`` set on the
554+
pod for an unrelated subsystem like Azure AI Foundry) logs a warning
555+
explaining why Azure auth was skipped, then returns ``False`` so the
556+
toolset falls back to plain Prometheus instead of failing on a missing
557+
client secret.
558+
"""
559+
creds = {
560+
field: get_config_field(config, field)
561+
for field in AzurePrometheusConfig._ui_required_fields
562+
}
563+
use_managed_id = str(
564+
get_config_field(config, "azure_use_managed_id") or ""
565+
).lower() in ("true", "1")
547566

548-
# Check for Azure environment variables
549-
if os.environ.get("AZURE_CLIENT_ID") or os.environ.get("AZURE_TENANT_ID"):
567+
if use_managed_id or all(creds.values()):
550568
return True
551569

570+
found = [f"{f}/{f.upper()}" for f, v in creds.items() if v]
571+
missing = [f"{f}/{f.upper()}" for f, v in creds.items() if not v]
572+
573+
if found:
574+
logging.warning(
575+
"Partial Azure Managed Prometheus credentials detected "
576+
"(found: %s; missing: %s). Not using Azure auth — falling back "
577+
"to plain Prometheus. To enable Azure, provide all of "
578+
"azure_client_id / azure_tenant_id / azure_client_secret "
579+
"(in the toolset config or as AZURE_* env vars), or set "
580+
"azure_use_managed_id: true for managed identity.",
581+
", ".join(found),
582+
", ".join(missing),
583+
)
584+
552585
return False
553586

554587
def is_amp(self) -> bool:

tests/plugins/toolsets/test_prometheus_unit.py

Lines changed: 137 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,13 @@
1+
import logging
2+
13
import pytest
24

35
from holmes.plugins.toolsets.prometheus.prometheus import (
6+
AzurePrometheusConfig,
7+
PrometheusConfig,
8+
PrometheusToolset,
49
adjust_step_for_max_points,
10+
get_config_field,
511
)
612

713

@@ -192,3 +198,134 @@ def test_override_with_explicit_step_adjusts_if_needed(self, monkeypatch):
192198
max_points_override=1000,
193199
)
194200
assert result == 3.6
201+
202+
203+
AZURE_ENV_VARS = (
204+
"AZURE_CLIENT_ID",
205+
"AZURE_TENANT_ID",
206+
"AZURE_CLIENT_SECRET",
207+
"AZURE_USE_MANAGED_ID",
208+
)
209+
210+
211+
@pytest.fixture
212+
def clean_azure_env(monkeypatch):
213+
for name in AZURE_ENV_VARS:
214+
monkeypatch.delenv(name, raising=False)
215+
return monkeypatch
216+
217+
218+
class TestGetConfigField:
219+
"""get_config_field reads ``field`` from the config dict, falling back to
220+
the upper-cased env var (e.g. ``AZURE_CLIENT_ID``)."""
221+
222+
def test_returns_config_value_when_present(self, clean_azure_env):
223+
assert (
224+
get_config_field({"azure_client_id": "from-config"}, "azure_client_id")
225+
== "from-config"
226+
)
227+
228+
def test_falls_back_to_env_var(self, clean_azure_env):
229+
clean_azure_env.setenv("AZURE_CLIENT_ID", "from-env")
230+
assert get_config_field({}, "azure_client_id") == "from-env"
231+
232+
def test_config_takes_precedence_over_env_var(self, clean_azure_env):
233+
clean_azure_env.setenv("AZURE_CLIENT_ID", "from-env")
234+
assert (
235+
get_config_field({"azure_client_id": "from-config"}, "azure_client_id")
236+
== "from-config"
237+
)
238+
239+
def test_empty_config_value_falls_through_to_env(self, clean_azure_env):
240+
"""Empty string in YAML (e.g. ``azure_client_id: ""``) should not
241+
suppress the env-var fallback — the empty value is no value."""
242+
clean_azure_env.setenv("AZURE_CLIENT_ID", "from-env")
243+
assert (
244+
get_config_field({"azure_client_id": ""}, "azure_client_id") == "from-env"
245+
)
246+
247+
def test_missing_returns_none(self, clean_azure_env):
248+
assert get_config_field({}, "azure_client_id") is None
249+
250+
def test_empty_env_var_returns_none(self, clean_azure_env):
251+
clean_azure_env.setenv("AZURE_CLIENT_ID", "")
252+
assert get_config_field({}, "azure_client_id") is None
253+
254+
def test_non_string_config_value_passes_through(self, clean_azure_env):
255+
"""Bool/int config values (e.g. ``azure_use_managed_id: true``) must
256+
not be coerced — callers handle the type-specific normalization."""
257+
assert (
258+
get_config_field({"azure_use_managed_id": True}, "azure_use_managed_id")
259+
is True
260+
)
261+
262+
263+
class TestIsAzureConfig:
264+
"""is_azure_config must match the completeness check inside
265+
AzurePrometheusConfig.__init__ — a partial signal (e.g. AZURE_CLIENT_ID
266+
alone, common when the LLM uses Azure AI Foundry) should NOT select the
267+
Azure variant, and the user should be warned why."""
268+
269+
def test_no_signals_returns_false(self, clean_azure_env, caplog):
270+
with caplog.at_level(logging.WARNING):
271+
assert AzurePrometheusConfig.is_azure_config({}) is False
272+
assert "Azure" not in caplog.text
273+
274+
def test_partial_env_vars_warns_and_returns_false(self, clean_azure_env, caplog):
275+
"""Regression: user had AZURE_CLIENT_ID + AZURE_TENANT_ID from their
276+
Azure AI Foundry LLM setup, and an internal prometheus_url. Holmes
277+
must NOT hijack into Azure auth."""
278+
clean_azure_env.setenv("AZURE_CLIENT_ID", "from-llm")
279+
clean_azure_env.setenv("AZURE_TENANT_ID", "from-llm")
280+
with caplog.at_level(logging.WARNING):
281+
assert AzurePrometheusConfig.is_azure_config({}) is False
282+
assert "Partial Azure" in caplog.text
283+
assert "azure_client_secret" in caplog.text
284+
285+
def test_partial_config_field_warns_and_returns_false(
286+
self, clean_azure_env, caplog
287+
):
288+
with caplog.at_level(logging.WARNING):
289+
assert (
290+
AzurePrometheusConfig.is_azure_config({"azure_client_id": "x"})
291+
is False
292+
)
293+
assert "Partial Azure" in caplog.text
294+
295+
def test_full_env_vars_returns_true_no_warning(self, clean_azure_env, caplog):
296+
clean_azure_env.setenv("AZURE_CLIENT_ID", "x")
297+
clean_azure_env.setenv("AZURE_TENANT_ID", "y")
298+
clean_azure_env.setenv("AZURE_CLIENT_SECRET", "z")
299+
with caplog.at_level(logging.WARNING):
300+
assert AzurePrometheusConfig.is_azure_config({}) is True
301+
assert "Partial Azure" not in caplog.text
302+
303+
def test_managed_id_in_config_returns_true(self, clean_azure_env, caplog):
304+
with caplog.at_level(logging.WARNING):
305+
assert (
306+
AzurePrometheusConfig.is_azure_config({"azure_use_managed_id": True})
307+
is True
308+
)
309+
assert "Partial Azure" not in caplog.text
310+
311+
def test_managed_id_env_var_returns_true(self, clean_azure_env, caplog):
312+
clean_azure_env.setenv("AZURE_USE_MANAGED_ID", "true")
313+
with caplog.at_level(logging.WARNING):
314+
assert AzurePrometheusConfig.is_azure_config({}) is True
315+
assert "Partial Azure" not in caplog.text
316+
317+
def test_determine_prometheus_class_with_partial_azure_env(
318+
self, clean_azure_env, caplog
319+
):
320+
"""End-to-end: the user's exact scenario — partial Azure env vars on
321+
the pod plus a plain prometheus_url config — must resolve to the
322+
generic PrometheusConfig, not AzurePrometheusConfig."""
323+
clean_azure_env.setenv("AZURE_CLIENT_ID", "from-llm")
324+
clean_azure_env.setenv("AZURE_TENANT_ID", "from-llm")
325+
toolset = PrometheusToolset()
326+
with caplog.at_level(logging.WARNING):
327+
cls = toolset.determine_prometheus_class(
328+
{"prometheus_url": "https://internal.example:8030", "verify_ssl": True}
329+
)
330+
assert cls is PrometheusConfig
331+
assert "Partial Azure" in caplog.text

0 commit comments

Comments
 (0)