fix(job-orchestration): Clamp invalid archive timestamps during compression #2228
goynam wants to merge 1 commit into y-scope:main
Conversation
…ession

Archives with bogus begin_timestamp/end_timestamp values (e.g., from upstream services emitting non-epoch values in the timestamp field) create extremely wide time spans that match virtually every search query, severely degrading query scheduler performance.

Add _clamp_archive_timestamps() validation in update_archive_metadata() that:
- Clamps begin_timestamp before 2020-01-01 to end_timestamp
- Clamps end_timestamp more than 1 year in the future to begin_timestamp
- Sets both to 0 if both are invalid

This prevents new archives from being written with poisoned time ranges while preserving backward compatibility (logger parameter is optional).
Walkthrough
Adds timestamp validation and clamping logic to archive metadata handling in the compression task. A new helper function validates and, where needed, clamps the archive's begin/end timestamps.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
✅ Passed checks (4 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 Ruff (0.15.12) on components/job-orchestration/job_orchestration/executor/compress/compression_task.py: Unexpected Ruff issue shape at index 39
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `components/job-orchestration/job_orchestration/executor/compress/compression_task.py`:
- Around line 103-109: The current module-level computation of
max_valid_timestamp_ms uses time.time() directly which makes tests flaky;
introduce an injectable "now" in milliseconds by replacing the module-level
max_valid_timestamp_ms with a small helper like
get_max_valid_timestamp_ms(now_ms=None) that returns int((now_ms or
int(time.time()*1000)) + 365*86400*1000), and update any callers (e.g.,
timestamp validation logic in this module) to accept an optional now_ms
parameter and call get_max_valid_timestamp_ms(now_ms) so tests can pass a fixed
now_ms for deterministic boundary checks while defaulting to the runtime clock.
- Line 151: The timestamp clamping is currently skipped when the optional logger
parameter is None; change the call sites in compression_task.py so the
timestamp-clamping helper (e.g., clamp_timestamps or the "timestamp clamping"
call) is invoked unconditionally regardless of logger presence, and instead move
the logger presence check into the helper so it only emits warnings when a
logger is provided; update the other similar call sites (the block covering
lines ~174-177) to follow the same pattern so clamping always runs but warning
emission is guarded inside the helper that accepts an optional logger.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1cf08ef6-235c-44d5-83ab-b7b25103d012
📒 Files selected for processing (1)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
```python
import time


# 2020-01-01 00:00:00 UTC in milliseconds
MIN_VALID_TIMESTAMP_MS = 1577836800000
# 1 year from now in milliseconds
max_valid_timestamp_ms = int((time.time() + 365 * 86400) * 1000)
```
🧹 Nitpick | 🔵 Trivial
Make the “current time” input injectable for deterministic boundary tests.
Using time.time() directly makes edge-case tests around the 1-year threshold flaky. Consider an optional now_ms parameter defaulting to runtime clock.
```python
    table_prefix: str,
    dataset: str | None,
    archive_stats: dict[str, Any],
    logger=None,
```
Always apply timestamp clamping regardless of logger presence.
Right now, omitting logger skips clamping entirely, which re-opens the poisoned time-range path. Clamp unconditionally, and only guard warning emission inside the helper.
Proposed fix

```diff
 def _clamp_archive_timestamps(
     archive_stats: dict[str, Any],
-    logger,
+    logger=None,
 ) -> None:
@@
-    if bad_begin and bad_end:
-        logger.warning(
+    if bad_begin and bad_end:
+        if logger is not None:
+            logger.warning(
                 "Archive %s has both invalid begin_timestamp=%s and end_timestamp=%s, "
                 "setting both to 0.",
                 archive_stats.get("id", "unknown"),
                 begin_ts,
                 end_ts,
-        )
+            )
         archive_stats["begin_timestamp"] = 0
         archive_stats["end_timestamp"] = 0
     elif bad_begin:
-        logger.warning(
+        if logger is not None:
+            logger.warning(
                 "Archive %s has invalid begin_timestamp=%s (before 2020), clamping to "
                 "end_timestamp=%s.",
                 archive_stats.get("id", "unknown"),
                 begin_ts,
                 end_ts,
-        )
+            )
         archive_stats["begin_timestamp"] = end_ts
     elif bad_end:
-        logger.warning(
+        if logger is not None:
+            logger.warning(
                 "Archive %s has invalid end_timestamp=%s (>1 year in future), clamping to "
                 "begin_timestamp=%s.",
                 archive_stats.get("id", "unknown"),
                 end_ts,
                 begin_ts,
-        )
+            )
         archive_stats["end_timestamp"] = begin_ts
@@
-    if logger is not None:
-        _clamp_archive_timestamps(stats_to_update, logger)
+    _clamp_archive_timestamps(stats_to_update, logger)
```

Also applies to: 174-177
Description
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
Release Notes