perf: reduce tracker cold-start and concurrent measurement overhead by davidberenstein1957 · Pull Request #1246 · mlco2/codecarbon

davidberenstein1957 · 2026-06-17T21:20:42Z

Summary

This PR reduces CodeCarbon measurement launch latency and improves concurrent-run throughput while preserving existing behavior. Changes focus on deferring work until it is needed, caching hardware detection within a process, and slimming import paths.

Performance results — cold launch (offline Mac ARM)

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Tracker `__init__` | ~15.7 s | ~94 ms | ~99% faster |
| `start()` (cold) | ~1.0 s | ~194 ms | ~81% faster |
| First sample (cold) | ~18.2 s | ~288 ms | ~98% faster (~63×) |
| Warm lifecycle (init+start+stop, same process) | ~62 ms | ~6 ms | ~10× |
| CLI monitor subprocess overhead (`codecarbon monitor … sleep 2`) | ~1.5 s | ~900–1000 ms | ~35–40% faster |

Cold-path numbers are the first tracker in a fresh process; warm-path numbers reuse cached hardware within the same process.
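The improvement column can be re-derived from the before/after timings. A minimal sketch of that arithmetic, using the rounded figures from the cold-launch table; the helper names `percent_faster` and `speedup` are illustrative and not part of CodeCarbon:

```python
def percent_faster(before_ms: float, after_ms: float) -> float:
    """Share of the original latency that was removed, in percent."""
    return (before_ms - after_ms) / before_ms * 100


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times shorter the new latency is."""
    return before_ms / after_ms


# Rounded figures from the cold-launch table, converted to milliseconds.
rows = {
    "Tracker __init__": (15_700, 94),
    "start() (cold)": (1_000, 194),
    "First sample (cold)": (18_200, 288),
}

for name, (before, after) in rows.items():
    print(
        f"{name}: {percent_faster(before, after):.1f}% faster, "
        f"{speedup(before, after):.1f}x"
    )
```

Because the inputs are themselves rounded, the results only match the table to within rounding.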

Performance results — run throughput (offline, warm, same process)

Repeated OfflineEmissionsTracker(output_methods=[]) lifecycles (init → start → stop) in one Python process:

| Mode | Before | After | Improvement |
| --- | --- | --- | --- |
| Sequential runs / min | ~926 | ~2,200 | ~2.4× |
| Parallel runs / min (8 threads) | ~7,268 | ~12,300 | ~1.7× |
| Warm run latency (p50) | ~62 ms | ~6 ms | ~10× |

Before = master baseline (2026-06-17); after = hardware cache + warm lifecycle optimizations. Parallel benchmark: 8 worker threads, 15 s sustained load.
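The benchmark scripts were dropped from this PR, so the following is only a sketch of how a runs-per-minute figure of this kind could be measured. `runs_per_minute` and `fake_lifecycle` are hypothetical names; the stand-in lifecycle sleeps instead of tracking anything:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def runs_per_minute(
    lifecycle: Callable[[], None], duration_s: float, workers: int = 1
) -> float:
    """Call `lifecycle` repeatedly for about `duration_s` seconds; return runs/min."""
    deadline = time.perf_counter() + duration_s

    def worker() -> int:
        count = 0
        while time.perf_counter() < deadline:
            lifecycle()
            count += 1
        return count

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        total = sum(f.result() for f in futures)
    elapsed = time.perf_counter() - start
    return total / elapsed * 60


def fake_lifecycle() -> None:
    # Stand-in for init -> start -> stop; sleeps ~1 ms instead of measuring.
    time.sleep(0.001)


if __name__ == "__main__":
    print(f"sequential: {runs_per_minute(fake_lifecycle, 1.0):.0f} runs/min")
    print(f"8 threads:  {runs_per_minute(fake_lifecycle, 1.0, workers=8):.0f} runs/min")
```

To reproduce the setup described above, the stand-in would be replaced by a function that builds an `OfflineEmissionsTracker(output_methods=[])` and calls `start()` then `stop()`, with `duration_s=15` and `workers=8` for the parallel case.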

What changed

Tracker lifecycle

Lazy-import heavy modules; defer hardware probing, geo validation, and emissions engine until first use
Skip 1 Hz power monitor when output_methods=[]; skip redundant measurement on stop() when a sample was just taken
Prime cpu_percent globally, once per process
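The deferral pattern behind these lifecycle changes can be sketched as follows. This is not the tracker's actual code: `LazyTracker`, its attributes, and the use of `decimal` as a stand-in for a heavy dependency are all assumptions made for illustration.

```python
import importlib
from functools import cached_property


class LazyTracker:
    """Hypothetical tracker: the constructor only stores configuration."""

    def __init__(self, heavy_module: str = "decimal") -> None:
        # No heavy import or hardware probing happens here.
        self._heavy_module = heavy_module
        self.probe_count = 0

    @cached_property
    def engine(self):
        # Imported on first access; later accesses reuse the cached module.
        return importlib.import_module(self._heavy_module)

    @cached_property
    def hardware(self) -> dict:
        # Stand-in for an expensive hardware probe, run at most once per instance.
        self.probe_count += 1
        return {"cpu_count": 8}  # placeholder value

    def start(self) -> None:
        # First use triggers the deferred work.
        _ = self.engine
        _ = self.hardware


tracker = LazyTracker()
assert tracker.probe_count == 0  # construction stayed cheap
tracker.start()
tracker.start()
assert tracker.probe_count == 1  # the probe ran once despite two starts
```

The trade-off is that the cost moves from construction to the first call that needs the deferred value, so the first `start()` is slower than later ones.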

Hardware detection

Process-level hardware setup cache (hardware_cache.py) — CPU/GPU/RAM detection reused across instances
Platform-aware CPU backend order; cached GPU/CPU/PowerMetrics probes via @lru_cache

API write path

Lazy POST /runs — deferred until first emission upload (create_run_automatically=False + _ensure_api_run)
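The idea of deferring run creation until the first upload can be sketched like this. `LazyRunUploader`, `_ensure_run`, and `fake_create_run` are hypothetical stand-ins and do not reflect the real ApiClient interface:

```python
from typing import Callable, Optional


class LazyRunUploader:
    """Hypothetical uploader that creates the remote run on first upload."""

    def __init__(self, create_run: Callable[[], str]) -> None:
        self._create_run = create_run  # e.g. a function that issues POST /runs
        self._run_id: Optional[str] = None

    def _ensure_run(self) -> str:
        # Create the run only once, and only when something needs it.
        if self._run_id is None:
            self._run_id = self._create_run()
        return self._run_id

    def upload(self, emissions_kg: float) -> dict:
        return {"run_id": self._ensure_run(), "emissions_kg": emissions_kg}


created = []


def fake_create_run() -> str:
    # Stand-in for the network call; records that it happened.
    created.append("run-1")
    return "run-1"


uploader = LazyRunUploader(fake_create_run)
assert created == []  # nothing was posted at construction time
uploader.upload(0.001)
uploader.upload(0.002)
assert created == ["run-1"]  # the run was created exactly once
```

This sketch is not thread-safe; an uploader shared across threads would need a lock around the check in `_ensure_run`.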

CLI

Lazy imports in cli/main.py for faster codecarbon monitor startup
Single entry point: codecarbon monitor (with optional --log-level)

Docs

Fix --log-level default in CLI reference (ERROR, not INFO)
FAQ note explaining warm hardware reuse within one Python process

Intentionally not included

These were explored during the perf work but removed to keep the diff focused on high-impact changes:

Process-level output handler cache, config cache, HTTP session pooling, and ApiClient pooling
Separate codecarbon-monitor console script
Benchmark scripts and carbonserver Docker startup shortcuts

Test plan

CODECARBON_ALLOW_MULTIPLE_RUNS=True pytest --ignore=tests/test_viz_data.py -m 'not integ_test' tests/ (541 passed locally)
tests/test_hardware_cache.py — cache hit/miss, clear_cache, round-trip reuse
CLI monitor tests — offline validation, --log-level, wrapped-command delegation
Manual: codecarbon monitor --offline --country-iso-code FRA -- sleep 1
Manual: codecarbon monitor --offline --country-iso-code FRA --log-level debug -- python train.py

Notes

Throughput numbers captured on offline Mac ARM (2026-06-18). The first tracker in a process is slower than subsequent ones because hardware detection runs once and is reused — see the FAQ.

Defer heavy imports and hardware probing until first use, cache hardware setup per process, and add a lightweight codecarbon-monitor CLI entry point so measurement launch and parallel runs stay fast without changing behavior. Co-authored-by: Cursor <cursoragent@cursor.com>

Skip the slow powermetrics sudo probe on Apple Silicon when cpu_load setup succeeds, strip leaked subcommand tokens from monitor ctx.args, and update tests for lazy tracker imports in run_and_monitor. Co-authored-by: Cursor <cursoragent@cursor.com>

Use class-name hardware cache serialization to survive module reloads in tests, lazy-import get_datetime_with_timezone in config CLI, add probe cache clear helpers, and update tests for lazy imports and get_cached_tdp. Co-authored-by: Cursor <cursoragent@cursor.com>

Provide harnesses to measure cold-start, throughput, and API latency during optimization so regressions can be caught and logged consistently. Co-authored-by: Cursor <cursoragent@cursor.com>

Remove local-only harnesses used during optimization; the library perf changes and their tests are sufficient for review without dev tooling. Co-authored-by: Cursor <cursoragent@cursor.com>

codecov · 2026-06-17T21:45:31Z

Codecov Report

❌ Patch coverage is 97.08455% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.35%. Comparing base (58acafa) to head (e4a1d7d).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| codecarbon/core/resource_tracker.py | 86.00% | 7 Missing ⚠️ |
| codecarbon/core/hardware_cache.py | 97.63% | 3 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1246      +/-   ##
==========================================
+ Coverage   89.17%   89.35%   +0.17%     
==========================================
  Files          47       48       +1     
  Lines        4510     4771     +261     
==========================================
+ Hits         4022     4263     +241     
- Misses        488      508      +20

Apply formatter/linter fixes, extract platform CPU backend selection to satisfy flake8 complexity, stabilize the force_cpu_power load test with a mocked cpu_percent, and add hardware_cache/monitor_main coverage tests. Co-authored-by: Cursor <cursoragent@cursor.com>

Avoid isinstance checks across module reload boundaries and mock AppleSiliconChip rebuild so powermetrics is not required on non-macOS runners. Co-authored-by: Cursor <cursoragent@cursor.com>

Add targeted tests for HTTP session reuse, hardware cache round-trips, platform CPU backend selection, and other newly introduced code paths so codecov patch checks pass on the PR. Co-authored-by: Cursor <cursoragent@cursor.com>

Reuse output handlers, ApiClient instances, config reads, and Logfire setup across repeated tracker lifecycles so CSV/API/Logfire paths stay fast on warm runs. Add benchmark scripts for lifecycle and per-output throughput measurement. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

Remove output_cache since micro-benchmarks showed no meaningful full-lifecycle gain; retain config caching, ApiClient pooling, and Logfire configure-once. Co-authored-by: Cursor <cursoragent@cursor.com>

Drop session, config, logfire, and file-header caches that added complexity without clear wins, revert carbonserver bootstrap shortcuts, and align tests with direct ApiClient usage. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace hand-rolled globals for GPU/CPU/PowerMetrics probes with functools.lru_cache, use direct imports in hardware_cache.clear_cache(), and dedupe CodeCarbonAPIOutput emit paths. Co-authored-by: Cursor <cursoragent@cursor.com>

Restore lazy sys.modules clearing so conftest teardown does not load gpu_nvidia before FakeGPUEnv tests install mock pynvml. Co-authored-by: Cursor <cursoragent@cursor.com>

Drop codecarbon-monitor in favor of codecarbon monitor, add --log-level there, and document warm hardware reuse plus the correct log-level default. Co-authored-by: Cursor <cursoragent@cursor.com>

Capture cpu counts, canonical GPU ids, and RAPL settings in cached plans, sync tracker state on apply, and pass tracking_mode through all CPU backends. Co-authored-by: Cursor <cursoragent@cursor.com>

Align test_set_cpu_tracking_skips_tdp_when_rapl_available with the resource tracker change that passes tracking_mode to CPU.from_utils. Co-authored-by: Cursor <cursoragent@cursor.com>

inimaz

Nice, @davidberenstein1957, thanks a lot for taking a look at this. There are many improvements done at once here. Could you split it into smaller PRs? That way it will be easier to review. I have added some comments already.

inimaz · 2026-06-18T07:43:18Z

        self.model, self.tdp = self._main()

    @staticmethod
-    def _get_cpu_constant_power(match: str, cpu_power_df: pd.DataFrame) -> int:


Do not delete this. If we are using pandas only for typing, we can do:

    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        import pandas as pd
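One detail worth spelling out for the `TYPE_CHECKING` suggestion: because `pd` is not bound at runtime, the annotation has to be quoted (or the module has to use `from __future__ import annotations`). A minimal sketch, with `count_rows` as a hypothetical helper that is not in the codebase:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by type checkers only; never executed at runtime.
    import pandas as pd


def count_rows(cpu_power_df: "pd.DataFrame") -> int:
    """Hypothetical helper; the quoted annotation is not evaluated at runtime."""
    return len(cpu_power_df)


# Works with any sized object at runtime, without importing pandas.
print(count_rows([1, 2, 3]))
```

This keeps the DataFrame annotations for type checkers while leaving pandas off the import path of the module.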

inimaz · 2026-06-18T07:43:51Z

        return None

-    def _get_matching_cpu(
-        self, model_raw: str, cpu_df: pd.DataFrame, greedy=False


inimaz · 2026-06-18T07:44:15Z

 from typing import Dict, Optional

-import pandas as pd
-


Same in this file

inimaz · 2026-06-18T07:45:18Z

+def _hardware_kind(hw) -> str:
+    """Classify hardware without isinstance (safe if modules were reloaded)."""
+    name = type(hw).__name__
+    if name == "RAM":


Let's make an Enum out of these strings.
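A sketch of what an enum-based classifier could look like. Only the `HardwareKind` name and the `"RAM"` class-name check come from this PR; the other members, the `UNKNOWN` fallback, and the stand-in `RAM` class are assumptions:

```python
from enum import Enum


class HardwareKind(str, Enum):
    # Members other than RAM are assumptions based on the PR's CPU/GPU/RAM wording.
    CPU = "CPU"
    GPU = "GPU"
    RAM = "RAM"
    UNKNOWN = "UNKNOWN"


def hardware_kind(hw: object) -> HardwareKind:
    """Classify by class name, avoiding isinstance across module reloads."""
    name = type(hw).__name__
    try:
        return HardwareKind(name)
    except ValueError:
        return HardwareKind.UNKNOWN


class RAM:  # stand-in class for the demonstration
    pass


assert hardware_kind(RAM()) is HardwareKind.RAM
assert hardware_kind(object()) is HardwareKind.UNKNOWN
```

Matching on the class name rather than the class object is what keeps the check valid after a module reload, since a reloaded module produces a new class object with the same name.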

Restore pandas DataFrame annotations via TYPE_CHECKING and replace hardware kind strings with a HardwareKind enum. Co-authored-by: Cursor <cursoragent@cursor.com>

davidberenstein1957 · 2026-06-21T14:07:15Z

Closing in favor of a 4-PR stacked review split per @inimaz's feedback:

perf: defer tracker initialization and slim import path (1/4) #1251 — lazy initialization & import slimming
perf: cache hardware detection and optimize warm-path reuse (2/4) #1252 — hardware detection cache & warm-path reuse
perf: defer API run creation until first emission upload (3/4) #1253 — lazy API run creation
perf: speed up CLI monitor startup and fix wrapped commands (4/4) #1254 — CLI monitor startup

Merging #1251 → #1252 → #1253 → #1254 restores the full change set and benchmarks from this PR.

davidberenstein1957 · 2026-06-21T14:17:11Z

@inimaz — split complete. Please review the stacked PRs instead:

perf: defer tracker initialization and slim import path (1/4) #1251 — lazy init (tracker __init__ 190ms → 1.9ms)
perf: cache hardware detection and optimize warm-path reuse (2/4) #1252 — hardware cache (cold lifecycle 1.7s → 408ms, warm 1.5s → 4.8ms)
perf: defer API run creation until first emission upload (3/4) #1253 — lazy API run creation
perf: speed up CLI monitor startup and fix wrapped commands (4/4) #1254 — CLI monitor (850ms → 643ms)

Each PR has measured benchmarks, test plan, and passing pre-commit/tests.

inimaz · 2026-06-21T15:17:07Z

Nice @davidberenstein1957, I already reviewed the first one!

davidberenstein1957 requested a review from a team as a code owner June 17, 2026 21:20

davidberenstein1957 requested review from benoit-cty and inimaz June 17, 2026 21:20

davidberenstein1957 assigned benoit-cty and inimaz Jun 17, 2026

davidberenstein1957 and others added 4 commits June 17, 2026 23:24

chore: add benchmark and profiling scripts for tracker perf work

fa53a02

chore: drop benchmark and profiling scripts from PR

ab9bcec

davidberenstein1957 and others added 2 commits June 17, 2026 23:50

fix: make hardware_cache tests portable on Linux CI

fe75a1e

davidberenstein1957 self-assigned this Jun 18, 2026

davidberenstein1957 and others added 10 commits June 18, 2026 05:10

test: raise patch coverage for perf optimization changes

cf356ed

fix: satisfy pre-commit for benchmark scripts

c961b35

refactor: drop handler singleton cache, keep targeted perf wins

b06dec3

refactor: simplify probe caches with stdlib lru_cache

c0ca28e

fix: avoid eager GPU imports in hardware_cache.clear_cache

ae79098

refactor: use single monitor CLI entry point and update docs

8334302

fix: make hardware cache plans fully self-contained

ab80373

test: expect tracking_mode in RAPL CPU setup assertion

8ccb30e

inimaz reviewed Jun 20, 2026

refactor: address PR review feedback on typing and hardware cache

e4a1d7d

davidberenstein1957 closed this Jun 21, 2026

davidberenstein1957 wants to merge 18 commits into master from davidberenstein1957/codecarbon-api-speed-test

3 participants
