perf: reduce tracker cold-start and concurrent measurement overhead #1246
Open
davidberenstein1957 wants to merge 17 commits into
Defer heavy imports and hardware probing until first use, cache hardware setup per process, and add a lightweight codecarbon-monitor CLI entry point so measurement launch and parallel runs stay fast without changing behavior. Co-authored-by: Cursor <cursoragent@cursor.com>
Skip the slow powermetrics sudo probe on Apple Silicon when cpu_load setup succeeds, strip leaked subcommand tokens from monitor ctx.args, and update tests for lazy tracker imports in run_and_monitor. Co-authored-by: Cursor <cursoragent@cursor.com>
Use class-name hardware cache serialization to survive module reloads in tests, lazy-import get_datetime_with_timezone in config CLI, add probe cache clear helpers, and update tests for lazy imports and get_cached_tdp. Co-authored-by: Cursor <cursoragent@cursor.com>
Provide harnesses to measure cold-start, throughput, and API latency during optimization so regressions can be caught and logged consistently. Co-authored-by: Cursor <cursoragent@cursor.com>
Remove local-only harnesses used during optimization; the library perf changes and their tests are sufficient for review without dev tooling. Co-authored-by: Cursor <cursoragent@cursor.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files

    @@            Coverage Diff             @@
    ##           master    #1246      +/-   ##
    ==========================================
    + Coverage   89.17%   89.35%   +0.17%
    ==========================================
      Files          47       48       +1
      Lines        4510     4762     +252
    ==========================================
    + Hits         4022     4255     +233
    - Misses        488      507      +19
Apply formatter/linter fixes, extract platform CPU backend selection to satisfy flake8 complexity, stabilize the force_cpu_power load test with a mocked cpu_percent, and add hardware_cache/monitor_main coverage tests. Co-authored-by: Cursor <cursoragent@cursor.com>
Avoid isinstance checks across module reload boundaries and mock AppleSiliconChip rebuild so powermetrics is not required on non-macOS runners. Co-authored-by: Cursor <cursoragent@cursor.com>
Add targeted tests for HTTP session reuse, hardware cache round-trips, platform CPU backend selection, and other newly introduced code paths so codecov patch checks pass on the PR. Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse output handlers, ApiClient instances, config reads, and Logfire setup across repeated tracker lifecycles so CSV/API/Logfire paths stay fast on warm runs. Add benchmark scripts for lifecycle and per-output throughput measurement. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Remove output_cache since micro-benchmarks showed no meaningful full-lifecycle gain; retain config caching, ApiClient pooling, and Logfire configure-once. Co-authored-by: Cursor <cursoragent@cursor.com>
Drop session, config, logfire, and file-header caches that added complexity without clear wins, revert carbonserver bootstrap shortcuts, and align tests with direct ApiClient usage. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace hand-rolled globals for GPU/CPU/PowerMetrics probes with functools.lru_cache, use direct imports in hardware_cache.clear_cache(), and dedupe CodeCarbonAPIOutput emit paths. Co-authored-by: Cursor <cursoragent@cursor.com>
Restore lazy sys.modules clearing so conftest teardown does not load gpu_nvidia before FakeGPUEnv tests install mock pynvml. Co-authored-by: Cursor <cursoragent@cursor.com>
Drop codecarbon-monitor in favor of codecarbon monitor, add --log-level there, and document warm hardware reuse plus the correct log-level default. Co-authored-by: Cursor <cursoragent@cursor.com>
Capture cpu counts, canonical GPU ids, and RAPL settings in cached plans, sync tracker state on apply, and pass tracking_mode through all CPU backends. Co-authored-by: Cursor <cursoragent@cursor.com>
Align test_set_cpu_tracking_skips_tdp_when_rapl_available with the resource tracker change that passes tracking_mode to CPU.from_utils. Co-authored-by: Cursor <cursoragent@cursor.com>
inimaz (Collaborator) reviewed on Jun 20, 2026 and left a comment:
Nice, @davidberenstein1957, thanks a lot for taking a look at this. Many improvements are bundled together here. Could you split them into smaller PRs? That would make this easier to review. I have already added some comments.
            self.model, self.tdp = self._main()

        @staticmethod
        def _get_cpu_constant_power(match: str, cpu_power_df: pd.DataFrame) -> int:
Collaborator
Do not delete this. If we are using pandas only for typing, we can do:

    from typing import TYPE_CHECKING

    if TYPE_CHECKING:
        import pandas as pd

[…]

            return None

        def _get_matching_cpu(
            self, model_raw: str, cpu_df: pd.DataFrame, greedy=False
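To make the `TYPE_CHECKING` suggestion concrete, here is a minimal sketch. The helper `count_rows` is hypothetical and not part of the PR; the point is only that the annotation must be quoted (or annotations otherwise deferred), because `pd` does not exist at runtime when the import sits under `TYPE_CHECKING`.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by type checkers only; pandas is NOT imported at runtime.
    import pandas as pd


def count_rows(cpu_df: "pd.DataFrame") -> int:
    """Hypothetical helper: the quoted annotation is stored as a plain string."""
    return len(cpu_df)


# The annotation stays a string, so defining the function never touches pandas.
annotation = count_rows.__annotations__["cpu_df"]
# Any sized object works at runtime because the hint is not enforced.
row_count = count_rows([1, 2, 3])
print(annotation, row_count)
```

With an unquoted `pd.DataFrame` in the signature, Python versions that evaluate annotations eagerly would raise `NameError` when the function is defined, so the quoting (or `from __future__ import annotations`) is the part that keeps the type hints while dropping the runtime import.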
[…]

    from typing import Dict, Optional

    import pandas as pd

[…]

    def _hardware_kind(hw) -> str:
        """Classify hardware without isinstance (safe if modules were reloaded)."""
        name = type(hw).__name__
        if name == "RAM":
Collaborator
Let's do an Enum out of these strings
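A rough sketch of what the Enum could look like. Only the `"RAM"` string appears in the hunk above; the other members, the `HardwareKind` name, and the stand-in `RAM` class are assumptions for illustration.

```python
from enum import Enum


class HardwareKind(Enum):
    """Hypothetical enum; only the RAM value is confirmed by the diff."""

    RAM = "RAM"
    CPU = "CPU"  # assumed member
    GPU = "GPU"  # assumed member
    APPLE_SILICON_CHIP = "AppleSiliconChip"  # assumed member


def hardware_kind(hw) -> HardwareKind:
    """Classify by class name, so it still works if modules were reloaded."""
    # Enum lookup by value raises ValueError for an unknown class name.
    return HardwareKind(type(hw).__name__)


class RAM:
    """Stand-in class used only for this demo."""


kind = hardware_kind(RAM())
print(kind is HardwareKind.RAM)
```

Looking the member up by value keeps the class-name comparison (no `isinstance`), while an unexpected class name fails with a `ValueError` instead of falling through a chain of string comparisons.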
Summary
This PR reduces CodeCarbon measurement launch latency and improves concurrent-run throughput while preserving existing behavior. Changes focus on deferring work until it is needed, caching hardware detection within a process, and slimming import paths.
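As an illustration of the "defer work until it is needed" idea, one common pattern is to move a heavy import into the function that uses it. This is a minimal sketch, not code from the PR: `parse_power_table` is a hypothetical helper, and the standard-library `csv` module stands in for a heavy dependency.

```python
def parse_power_table(text: str) -> list:
    """Hypothetical helper: the import cost is paid on first call, not at module import."""
    # Deferred imports; `csv` stands in for a heavy dependency here.
    import csv
    import io

    return list(csv.DictReader(io.StringIO(text)))


# Importing this module stays cheap; the first call triggers the import.
rows = parse_power_table("Name,TDP\nExampleCPU,65\n")
print(rows[0]["TDP"])
```

Python caches imported modules in `sys.modules`, so later calls only pay for a dictionary lookup.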
Performance results — cold launch (offline Mac ARM)
[Table rows: `__init__`, `start()` (cold), `codecarbon monitor … sleep 2`; values not recoverable]
Cold-path numbers are the first tracker in a fresh process; warm-path numbers reuse cached hardware within the same process.
Performance results — run throughput (offline, warm, same process)
Repeated `OfflineEmissionsTracker(output_methods=[])` lifecycles (init → start → stop) in one Python process:
[Table not recovered]
Before = master baseline (2026-06-17); after = hardware cache + warm lifecycle optimizations. Parallel benchmark: 8 worker threads, 15 s sustained load.
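For readers who want to reproduce this kind of measurement, a throughput harness could look roughly like the sketch below. It is not the PR's benchmark script (those are not shown here); `FakeTracker` and `lifecycles_per_second` are hypothetical names, and the stand-in tracker only mimics the init → start → stop shape.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeTracker:
    """Stand-in for a tracker; only the start/stop shape matters here."""

    def start(self) -> None:
        self._t0 = time.perf_counter()

    def stop(self) -> float:
        return time.perf_counter() - self._t0


def lifecycles_per_second(factory, duration_s: float, workers: int = 1) -> float:
    """Loop init -> start -> stop on each worker thread for duration_s seconds."""

    def worker() -> int:
        count = 0
        deadline = time.perf_counter() + duration_s
        while time.perf_counter() < deadline:
            tracker = factory()
            tracker.start()
            tracker.stop()
            count += 1
        return count

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        total = sum(future.result() for future in futures)
    elapsed = time.perf_counter() - started
    return total / elapsed


sequential = lifecycles_per_second(FakeTracker, duration_s=0.2)
parallel = lifecycles_per_second(FakeTracker, duration_s=0.2, workers=8)
print(f"sequential: {sequential:.0f}/s, parallel (8 threads): {parallel:.0f}/s")
```

To approximate the setup described above, one would pass a factory that builds the real tracker and raise `duration_s` to 15; the numbers from the stand-in say nothing about CodeCarbon itself.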
What changed
Tracker lifecycle
- […] `output_methods=[]`; skip redundant measurement on `stop()` when a sample was just taken
- `cpu_percent` prime once per process

Hardware detection
- […] (`hardware_cache.py`): CPU/GPU/RAM detection reused across instances
- […] `@lru_cache`

API write path
- `POST /runs`: deferred until first emission upload (`create_run_automatically=False` + `_ensure_api_run`)

CLI
- […] `cli/main.py` for faster `codecarbon monitor` startup
- `codecarbon monitor` (with optional `--log-level`)

Docs
- `--log-level` default in CLI reference (ERROR, not INFO)

Intentionally not included
These were explored during the perf work but removed to keep the diff focused on high-impact changes:
- `codecarbon-monitor` console script

Test plan
- `CODECARBON_ALLOW_MULTIPLE_RUNS=True pytest --ignore=tests/test_viz_data.py -m 'not integ_test' tests/` (541 passed locally)
- `tests/test_hardware_cache.py`: cache hit/miss, clear_cache, round-trip reuse
- […] `--log-level`, wrapped-command delegation
- `codecarbon monitor --offline --country-iso-code FRA -- sleep 1`
- `codecarbon monitor --offline --country-iso-code FRA --log-level debug -- python train.py`

Notes
Throughput numbers captured on offline Mac ARM (2026-06-18). The first tracker in a process is slower than subsequent ones because hardware detection runs once and is reused — see the FAQ.
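The cold-versus-warm behavior described in this note can be illustrated with `functools.lru_cache`. This is a sketch under assumptions: `detect_hardware` is a hypothetical probe, and the `sleep` merely stands in for slow hardware detection; it is not the PR's `hardware_cache.py`.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def detect_hardware() -> tuple:
    """Hypothetical probe; the sleep stands in for slow hardware detection."""
    time.sleep(0.05)
    return ("cpu", "ram")


def timed_call() -> float:
    """Return how long one detect_hardware() call takes."""
    start = time.perf_counter()
    detect_hardware()
    return time.perf_counter() - start


cold = timed_call()  # first call in the process runs the probe
warm = timed_call()  # second call is served from the cache
info = detect_hardware.cache_info()
print(f"cold={cold:.3f}s warm={warm:.6f}s hits={info.hits} misses={info.misses}")

# Tests that swap in mock hardware can reset the cache between cases.
detect_hardware.cache_clear()
```

Because the cache lives in the process, every new process pays the cold cost once, which matches the first-tracker slowdown mentioned above.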