perf: reduce tracker cold-start and concurrent measurement overhead #1246
davidberenstein1957 wants to merge 18 commits into
Defer heavy imports and hardware probing until first use, cache hardware setup per process, and add a lightweight codecarbon-monitor CLI entry point so measurement launch and parallel runs stay fast without changing behavior. Co-authored-by: Cursor <cursoragent@cursor.com>
Skip the slow powermetrics sudo probe on Apple Silicon when cpu_load setup succeeds, strip leaked subcommand tokens from monitor ctx.args, and update tests for lazy tracker imports in run_and_monitor. Co-authored-by: Cursor <cursoragent@cursor.com>
Use class-name hardware cache serialization to survive module reloads in tests, lazy-import get_datetime_with_timezone in config CLI, add probe cache clear helpers, and update tests for lazy imports and get_cached_tdp. Co-authored-by: Cursor <cursoragent@cursor.com>
Provide harnesses to measure cold-start, throughput, and API latency during optimization so regressions can be caught and logged consistently. Co-authored-by: Cursor <cursoragent@cursor.com>
Remove local-only harnesses used during optimization; the library perf changes and their tests are sufficient for review without dev tooling. Co-authored-by: Cursor <cursoragent@cursor.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1246      +/-   ##
==========================================
+ Coverage   89.17%   89.35%   +0.17%
==========================================
  Files          47       48       +1
  Lines        4510     4771     +261
==========================================
+ Hits         4022     4263     +241
- Misses        488      508      +20
Apply formatter/linter fixes, extract platform CPU backend selection to satisfy flake8 complexity, stabilize the force_cpu_power load test with a mocked cpu_percent, and add hardware_cache/monitor_main coverage tests. Co-authored-by: Cursor <cursoragent@cursor.com>
Avoid isinstance checks across module reload boundaries and mock AppleSiliconChip rebuild so powermetrics is not required on non-macOS runners. Co-authored-by: Cursor <cursoragent@cursor.com>
Add targeted tests for HTTP session reuse, hardware cache round-trips, platform CPU backend selection, and other newly introduced code paths so codecov patch checks pass on the PR. Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse output handlers, ApiClient instances, config reads, and Logfire setup across repeated tracker lifecycles so CSV/API/Logfire paths stay fast on warm runs. Add benchmark scripts for lifecycle and per-output throughput measurement. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Remove output_cache since micro-benchmarks showed no meaningful full-lifecycle gain; retain config caching, ApiClient pooling, and Logfire configure-once. Co-authored-by: Cursor <cursoragent@cursor.com>
Drop session, config, logfire, and file-header caches that added complexity without clear wins, revert carbonserver bootstrap shortcuts, and align tests with direct ApiClient usage. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace hand-rolled globals for GPU/CPU/PowerMetrics probes with functools.lru_cache, use direct imports in hardware_cache.clear_cache(), and dedupe CodeCarbonAPIOutput emit paths. Co-authored-by: Cursor <cursoragent@cursor.com>
Restore lazy sys.modules clearing so conftest teardown does not load gpu_nvidia before FakeGPUEnv tests install mock pynvml. Co-authored-by: Cursor <cursoragent@cursor.com>
Drop codecarbon-monitor in favor of codecarbon monitor, add --log-level there, and document warm hardware reuse plus the correct log-level default. Co-authored-by: Cursor <cursoragent@cursor.com>
Capture cpu counts, canonical GPU ids, and RAPL settings in cached plans, sync tracker state on apply, and pass tracking_mode through all CPU backends. Co-authored-by: Cursor <cursoragent@cursor.com>
Align test_set_cpu_tracking_skips_tdp_when_rapl_available with the resource tracker change that passes tracking_mode to CPU.from_utils. Co-authored-by: Cursor <cursoragent@cursor.com>
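The switch from hand-rolled globals to functools.lru_cache mentioned in the commits above can be illustrated roughly as follows. This is a minimal sketch, not the PR's code: `detect_cpu_model`, `clear_probe_caches`, and the `PROBE_CALLS` counter are hypothetical names, and the probe body is simulated.

```python
from functools import lru_cache

# Counts how often the (simulated) probe really runs; for demonstration only.
PROBE_CALLS = {"cpu": 0}


@lru_cache(maxsize=1)
def detect_cpu_model() -> str:
    """Hypothetical probe; imagine it shells out or reads system files."""
    PROBE_CALLS["cpu"] += 1
    return "Simulated CPU"


def clear_probe_caches() -> None:
    """Reset cached probe results, for example between tests."""
    detect_cpu_model.cache_clear()
```

Compared with a module-level global plus a sentinel check, the decorator keeps the caching policy next to the function and provides `cache_clear()` for test teardown without extra bookkeeping.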
inimaz left a comment
Nice @davidberenstein1957, thanks a lot for taking a look at this. There are many improvements made at once here. Could you split it into smaller PRs? That would make it easier to review. I have already added some comments.
        self.model, self.tdp = self._main()

    @staticmethod
    def _get_cpu_constant_power(match: str, cpu_power_df: pd.DataFrame) -> int:
Do not delete this. If we are using pandas only for typing, we can do:
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import pandas as pd

        return None

    def _get_matching_cpu(
        self, model_raw: str, cpu_df: pd.DataFrame, greedy=False
from typing import Dict, Optional

import pandas as pd


def _hardware_kind(hw) -> str:
    """Classify hardware without isinstance (safe if modules were reloaded)."""
    name = type(hw).__name__
    if name == "RAM":
Let's do an Enum out of these strings
Restore pandas DataFrame annotations via TYPE_CHECKING and replace hardware kind strings with a HardwareKind enum. Co-authored-by: Cursor <cursoragent@cursor.com>
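To make the two review suggestions concrete, here is a minimal sketch under stated assumptions. The member list of `HardwareKind` and the helpers `hardware_kind` and `count_rows` are hypothetical; the real enum and functions in the PR may differ. The pandas import sits behind TYPE_CHECKING, so the module runs even when pandas is not installed.

```python
from enum import Enum
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; pandas is never imported at runtime here.
    import pandas as pd


class HardwareKind(str, Enum):
    """Hypothetical members; the enum in the PR may define others."""

    CPU = "CPU"
    GPU = "GPU"
    RAM = "RAM"
    UNKNOWN = "UNKNOWN"


def hardware_kind(hw: object) -> HardwareKind:
    """Classify by class name instead of isinstance, so reloaded modules still match."""
    try:
        return HardwareKind(type(hw).__name__)
    except ValueError:
        return HardwareKind.UNKNOWN


def count_rows(cpu_df: "pd.DataFrame") -> int:
    """The quoted annotation keeps the DataFrame type hint without a runtime import."""
    return len(cpu_df)
```

Looking the member up by value keeps the class-name comparison from the original helper while giving callers a closed set of values to switch on.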
Closing in favor of a 4-PR stacked review split per @inimaz's feedback:
Merging #1251 → #1252 → #1253 → #1254 restores the full change set and benchmarks from this PR.

@inimaz, the split is complete. Please review the stacked PRs instead:
Each PR has measured benchmarks, test plan, and passing pre-commit/tests.

Nice @davidberenstein1957, I already reviewed the first one!
Summary
This PR reduces CodeCarbon measurement launch latency and improves concurrent-run throughput while preserving existing behavior. Changes focus on deferring work until it is needed, caching hardware detection within a process, and slimming import paths.
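The "defer work until it is needed" idea can be illustrated with a function-level import. This is a generic sketch rather than code from the PR: `build_report` and `main` are hypothetical, and the stdlib `statistics` module stands in for a heavy dependency.

```python
def build_report(samples):
    # The dependency is imported here, so its cost is paid on first use
    # instead of at program start.
    import statistics

    return {"mean": statistics.mean(samples), "max": max(samples)}


def main(argv):
    if not argv:
        # Fast path: returns without ever importing the heavy module.
        return "usage: report <numbers...>"
    return build_report([float(arg) for arg in argv])
```

Python caches imported modules in `sys.modules`, so repeated calls to `build_report` pay only a dictionary lookup after the first import.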
Performance results — cold launch (offline Mac ARM)
[Table: cold-launch timings; surviving row labels: __init__, start(), (cold), codecarbon monitor … sleep 2)]
Cold-path numbers are for the first tracker in a fresh process; warm-path numbers reuse cached hardware within the same process.
Performance results — run throughput (offline, warm, same process)
Repeated OfflineEmissionsTracker(output_methods=[]) lifecycles (init → start → stop) in one Python process:
Before = master baseline (2026-06-17); after = hardware cache + warm lifecycle optimizations. Parallel benchmark: 8 worker threads, 15 s sustained load.
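A harness for this kind of cold versus warm comparison might look like the sketch below. It is not the benchmark script used for the PR: `FakeTracker` is a hypothetical stand-in whose first instance sleeps to simulate hardware probing, and `time_lifecycle` and `benchmark` are illustrative names.

```python
import time
from statistics import median


class FakeTracker:
    """Hypothetical stand-in; the first instance pays a one-off setup cost."""

    _hardware = None  # process-level cache shared by all instances

    def __init__(self):
        if FakeTracker._hardware is None:
            time.sleep(0.05)  # simulate slow hardware probing on the cold path
            FakeTracker._hardware = {"cpu": "simulated"}
        self.hardware = FakeTracker._hardware

    def start(self):
        self._t0 = time.perf_counter()

    def stop(self):
        return time.perf_counter() - self._t0


def time_lifecycle(factory):
    """Wall-clock seconds for one init -> start -> stop cycle."""
    t0 = time.perf_counter()
    tracker = factory()
    tracker.start()
    tracker.stop()
    return time.perf_counter() - t0


def benchmark(factory, warm_runs=20):
    cold = time_lifecycle(factory)  # first lifecycle in this process
    warm = [time_lifecycle(factory) for _ in range(warm_runs)]
    return {
        "cold_s": cold,
        "warm_median_s": median(warm),
        "runs_per_s": warm_runs / sum(warm),
    }
```

Swapping `FakeTracker` for a real tracker factory would give comparable cold and warm figures, provided the benchmark runs in a fresh process so the first lifecycle is genuinely cold.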
What changed
Tracker lifecycle
- output_methods=[]; skip redundant measurement on stop() when a sample was just taken
- cpu_percent prime once per process
Hardware detection
- … (hardware_cache.py): CPU/GPU/RAM detection reused across instances
- … @lru_cache
API write path
- POST /runs: deferred until first emission upload (create_run_automatically=False + _ensure_api_run)
CLI
- … cli/main.py for faster codecarbon monitor startup
- codecarbon monitor (with optional --log-level)
Docs
- --log-level default in CLI reference (ERROR, not INFO)
Intentionally not included
These were explored during the perf work but removed to keep the diff focused on high-impact changes:
- codecarbon-monitor console script
Test plan
- CODECARBON_ALLOW_MULTIPLE_RUNS=True pytest --ignore=tests/test_viz_data.py -m 'not integ_test' tests/ (541 passed locally)
- tests/test_hardware_cache.py: cache hit/miss, clear_cache, round-trip reuse
- … --log-level, wrapped-command delegation
- codecarbon monitor --offline --country-iso-code FRA -- sleep 1
- codecarbon monitor --offline --country-iso-code FRA --log-level debug -- python train.py
Notes
Throughput numbers captured on offline Mac ARM (2026-06-18). The first tracker in a process is slower than subsequent ones because hardware detection runs once and is reused — see the FAQ.
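The deferred run creation listed under "API write path" follows a common lazy-initialization pattern, sketched below under stated assumptions. Only the names `create_run_automatically` and `_ensure_api_run` come from the description above; the `LazyRunClient` class, its `create_run` callable (standing in for the POST /runs request), and `add_emission` are hypothetical and do not reflect the real ApiClient.

```python
from typing import Callable, List, Optional


class LazyRunClient:
    """Hypothetical client that creates its remote run only when first needed."""

    def __init__(
        self,
        create_run: Callable[[], str],
        create_run_automatically: bool = False,
    ):
        self._create_run = create_run  # stand-in for the network call
        self.run_id: Optional[str] = None
        self.sent: List[dict] = []
        if create_run_automatically:
            self._ensure_api_run()

    def _ensure_api_run(self) -> str:
        # Create the run once; later calls reuse the stored id.
        if self.run_id is None:
            self.run_id = self._create_run()
        return self.run_id

    def add_emission(self, payload: dict) -> None:
        run_id = self._ensure_api_run()
        self.sent.append({"run_id": run_id, **payload})
```

With this shape, constructing the client costs no network round trip, and a tracker that never uploads an emission never creates a run.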