[feature][MSD-318] implement store annotated data framework#3444
[feature][MSD-318] implement store annotated data framework#3444K4rishma wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Implements an annotated data collection framework with asynchronous sample serialization/upload to S3, plus a CLI utility to fetch collected ZIP samples and GUI consent controls.
Changes:
- Add
DataCollectorcore (config persistence, serialization into ZIPs, background worker with retry/backoff, S3 upload backend). - Add S3 download helpers + CLI (
odemis-dc-fetch) to filter/list/download samples from S3. - Integrate user consent flow into the GUI (startup prompt + Help menu toggle) and add unit/integration tests.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
src/odemis/util/datacollector.py |
New data collection core: config, ZIP serialization, background worker, S3 upload backend. |
src/odemis/util/dc_fetch.py |
New retrieval helpers + CLI entrypoint for downloading samples from S3. |
src/odemis/util/test/datacollector_test.py |
Tests for config, serialization, queue limiting, retry behavior, plus optional real-S3 integration tests. |
src/odemis/util/test/dc_fetch_test.py |
Unit tests for fetch-side parsing/filtering/pagination and client builder overrides. |
src/odemis/gui/win/consent.py |
New wx dialog for consent prompt (opt-in/opt-out/remind later). |
src/odemis/gui/main.py |
Show consent dialog on startup; create and pass DataCollector to menu controller. |
src/odemis/gui/cont/menu.py |
Add Help menu checkbox to toggle data sharing consent. |
scripts/odemis-dc-fetch.py |
Script wrapper for the dc_fetch CLI entrypoint. |
debian/control |
Add python3-boto3 dependency to support S3 operations. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis pull request adds a DataCollector subsystem, a consent UI, and a CLI fetch tool. It introduces a thread-safe DataCollector that serializes events to ZIP artifacts and uploads them to an S3-compatible backend, persistent consent/config handling (file under ~/.config/odemis), a startup consent dialog and a Help/About menu checkbox to toggle sharing, an Sequence Diagram(s)sequenceDiagram
participant User
participant OdemisApp as Odemis App
participant DataCollector
participant ConsentConfig as Consent Config
participant ConsentDialog as Consent Dialog
User->>OdemisApp: Launch application
OdemisApp->>DataCollector: Initialize DataCollector()
OdemisApp->>ConsentConfig: Load consent state
ConsentConfig-->>OdemisApp: consent status (true/false/none)
alt Should prompt for consent?
OdemisApp->>OdemisApp: Schedule _maybe_show_consent_dialog()
OdemisApp->>ConsentDialog: Show modal dialog
User->>ConsentDialog: Click decision button
alt User clicks "Opt in"
ConsentDialog-->>OdemisApp: RESULT_OPT_IN
OdemisApp->>ConsentConfig: set_consent(true)
else User clicks "Opt out"
ConsentDialog-->>OdemisApp: RESULT_OPT_OUT
OdemisApp->>ConsentConfig: set_consent(false)
else User clicks "Remind me later"
ConsentDialog-->>OdemisApp: RESULT_REMIND_LATER
OdemisApp->>ConsentConfig: postpone_consent()
end
ConsentConfig->>ConsentConfig: Save to config file
else Already has consent decision
OdemisApp->>OdemisApp: Skip dialog
end
OdemisApp->>OdemisApp: Continue startup
sequenceDiagram
participant App as Odemis App
participant DataCollector
participant Serializer as _serialize
participant Queue as Work Queue
participant BackgroundWorker as _BackgroundWorker
participant S3 as S3 Backend
participant Storage as Local Storage
App->>DataCollector: record(event_name, payload, ...)
DataCollector->>DataCollector: Validate inputs
DataCollector->>DataCollector: Check consent state
alt Consent is true
DataCollector->>Serializer: Serialize event to ZIP (metadata + sidecars)
Serializer->>Storage: Create ZIP in staging dir
Serializer-->>DataCollector: ZIP path
DataCollector->>Queue: Enqueue work item
DataCollector-->>App: Return (non-blocking)
BackgroundWorker->>Queue: Dequeue work item
BackgroundWorker->>S3: Upload ZIP to S3 (using S3UploadBackend)
alt Upload succeeds
S3-->>BackgroundWorker: Success
BackgroundWorker->>Storage: Delete local ZIP
else Upload fails
S3-->>BackgroundWorker: Error
BackgroundWorker->>BackgroundWorker: Schedule retry with exponential backoff
BackgroundWorker->>Storage: Keep ZIP for retry
end
else Consent is false or unset
DataCollector-->>App: Return (skip enqueue)
end
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 10
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@debian/control`:
- Around line 78-79: Update the Python version metadata in debian/control to
require Python 3.9+: change the package dependency declaration "python3-all-dev
(>= 3.6.1)" and the "X-Python3-Version: >= 3.6" entry to at least ">= 3.9" so
that the new builtin generic annotations (e.g., list[str], tuple[Any, str]) are
supported; locate these fields in debian/control and bump their minimum version
to 3.9.
In `@src/odemis/gui/cont/menu.py`:
- Around line 46-54: The __init__ signature was extended with data_collector but
lacks a return type and updated docstring; update MainMenu.__init__ to include
"-> None" and expand its docstring to document the new data_collector parameter
(describe it in reStructuredText style without type annotations). Add explicit
type annotations and "-> None" to the helper methods
_append_data_sharing_menu_item and _on_toggle_data_sharing, and add
reStructuredText docstrings for each (brief description, parameters and returns
sections without type info) so all three functions (__init__,
_append_data_sharing_menu_item, _on_toggle_data_sharing) comply with the
project's typing and docstring guidelines.
In `@src/odemis/gui/win/consent.py`:
- Around line 16-59: Add reStructuredText docstrings to ConsentDialog's methods:
__init__, _on_opt_in, _on_opt_out, _on_remind_later, and _on_close; each
docstring should be a short one- or two-sentence description (no type info, no
backticks or inline markup) placed immediately under the def line, describing
the method's purpose and observable side effects (for example, that __init__
builds the dialog and sets up buttons, and the _on_*/_on_close handlers call
EndModal with the appropriate RESULT_*), and should not repeat type annotations.
In `@src/odemis/util/datacollector.py`:
- Around line 311-313: The code currently truncates item.event_name into
safe_event and uses payload keys directly for filenames (e.g., safe_event,
zip_name and the code that writes sidecar files for payload keys at the blocks
that reference payload keys around lines 326/335/352), but it never strips
path-traversal or invalid characters; update these sites to run event names and
payload keys through a filename sanitizer that: removes or replaces path
separators/backrefs ("/", "\", ".."), strips control chars, collapses sequences
of unsafe chars to a single safe char (e.g., "-"), enforces a post-sanitization
max length (e.g., 64 chars) and falls back to a safe default (like "event" or
"key") if empty; apply this sanitizer before constructing zip_name and before
using any payload key as a filename so all file creation uses sanitized,
truncated names.
- Line 483: The loop can skip calling queue.get() when pending ZIPs exist,
letting self._queue (queue.Queue) grow unbounded; to fix, make the queue bounded
(create self._queue = queue.Queue(maxsize=...) or use a size guard) and
restructure the consumer loop so queue.get() is always called (or a non-blocking
drain is performed) before any continue for pending ZIP handling; specifically,
ensure the code around the pending-ZIP check uses queue.get()/task_done() every
iteration (or blocks on put when full) so new records are either back-pressured
or consumed instead of accumulating.
- Around line 87-127: Add a reStructuredText-style docstring to the private
helper _ensure_section explaining its purpose, parameters, and return value (no
inline backticks or type markup), and update the _get_client function signature
to include an explicit return type annotation (use an appropriate type such as
boto3.client type or typing.Any/typing.Mapping if the concrete client type isn’t
importable) and add a reST docstring for it as well; ensure you remove
backticks/inline formatting from existing docstrings in this file (including
class DataCollectorConfig) and follow the repository rule to provide parameter
and return descriptions in plain reST text.
- Around line 153-161: The _write() method currently writes "consent = none"
when consent is unset which later causes the consent getter (the consent
property using getboolean("general", "consent")) to raise ValueError; update
_write() so that when self.consent is None it does not emit the "consent" option
(i.e., skip writing consent_line) or removes the option from the persisted
config instead of writing the literal "none", keeping the existing true/false
outputs for consent=True/False and ensuring readers using getboolean() won't
fail.
In `@src/odemis/util/dc_fetch.py`:
- Around line 249-258: The current loop writes directly to destination =
output_dir / Path(key).name which can cause basename collisions and leave
truncated files on failed downloads; change the logic in the loop around
s3_client.download_file so that you first create a unique temporary path next to
the intended destination (e.g., using tempfile.NamedTemporaryFile or mkstemp in
output_dir with a predictable suffix), call s3_client.download_file to write to
that temp path, on success atomically rename/move the temp file to destination
and increment downloaded, and on exception remove the temp file and increment
failed (also keep the existing skipped_existing check before attempting
download). Ensure you reference and update the same variables used here
(destination, output_dir, key, s3_client.download_file, skipped_existing,
downloaded, failed) so collisions and partial-file leftovers are avoided.
In `@src/odemis/util/test/datacollector_test.py`:
- Line 52: Remove the module-level call to logging.basicConfig() (currently
invoked as logging.basicConfig(level=logging.DEBUG)) so importing
datacollector_test.py doesn't mutate global logging; either move that call into
an if __name__ == "__main__": block for local debugging or configure logging in
a test-specific helper/fixture (or use pytest's caplog) to set logging level for
tests that need extra diagnostics. Ensure no module-level logging.basicConfig
call remains.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: eaf1a595-4717-4bfc-a8b4-d42f1b7784eb
📒 Files selected for processing (9)
debian/controlscripts/odemis-dc-fetch.pysrc/odemis/gui/cont/menu.pysrc/odemis/gui/main.pysrc/odemis/gui/win/consent.pysrc/odemis/util/datacollector.pysrc/odemis/util/dc_fetch.pysrc/odemis/util/test/datacollector_test.pysrc/odemis/util/test/dc_fetch_test.py
| # Needed for communicating with AWS S3 | ||
| python3-boto3, |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf 'Declared Python compatibility in debian/control:\n'
sed -n '18,20p' debian/control
printf '\nBuiltin generic annotations introduced by this PR:\n'
rg -n '\b(list|dict|tuple|set)\[[^]]+\]' \
src/odemis/util/dc_fetch.py \
src/odemis/util/test/dc_fetch_test.py \
src/odemis/util/test/datacollector_test.pyRepository: delmic/odemis
Length of output: 600
🏁 Script executed:
#!/bin/bash
# Check for other Python 3.9+ features (e.g., type hints, match statements, etc.)
echo "=== Checking for union type hints (|) - Python 3.10+ syntax ==="
rg -n '\|\s*None' src/odemis/util/dc_fetch.py src/odemis/util/test/datacollector_test.py
echo "=== Checking for zoneinfo - Python 3.9+ ==="
rg -n 'from zoneinfo|import zoneinfo' src/odemis/util/dc_fetch.py src/odemis/util/test/datacollector_test.py
echo "=== All annotations added to dc_fetch.py and related tests ==="
grep -n 'def \|: ' src/odemis/util/dc_fetch.py | head -30Repository: delmic/odemis
Length of output: 2102
Update debian/control to reflect Python 3.9+ requirement.
This PR introduces builtin generic annotations (list[str], tuple[Any, str]) which require Python 3.9+, but debian/control still declares X-Python3-Version: >= 3.6. The package will fail to import on Python 3.6–3.8. Update line 18 from python3-all-dev (>= 3.6.1) to at least >= 3.9.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@debian/control` around lines 78 - 79, Update the Python version metadata in
debian/control to require Python 3.9+: change the package dependency declaration
"python3-all-dev (>= 3.6.1)" and the "X-Python3-Version: >= 3.6" entry to at
least ">= 3.9" so that the new builtin generic annotations (e.g., list[str],
tuple[Any, str]) are supported; locate these fields in debian/control and bump
their minimum version to 3.9.
There was a problem hiding this comment.
Not sure how to handle it
There was a problem hiding this comment.
🧩 Analysis chain
Let me check the codebase for the minimum Python version already in use and the exact annotations in the new files.
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== X-Python3-Version and Build-Depends in debian/control ==="
grep -n 'X-Python3-Version\|python3-all-dev' debian/control
echo ""
echo "=== Builtin generic annotations (requires Python 3.9+) in new files ==="
rg -n '\b(list|dict|tuple|set)\[' src/odemis/util/dc_fetch.py src/odemis/util/datacollector.py 2>/dev/null || echo "(none found)"
echo ""
echo "=== Existing from __future__ import annotations in codebase ==="
rg -rn 'from __future__ import annotations' src/odemis/ | head -10
echo ""
echo "=== Python 3.9+ syntax already in use elsewhere in src/odemis (sample) ==="
rg -rn '\b(list|dict|tuple|set)\[' src/odemis/ --include='*.py' -g '!test' | grep -v 'dc_fetch\|datacollector' | head -20
echo ""
echo "=== Check top of dc_fetch.py for existing __future__ import ==="
head -20 src/odemis/util/dc_fetch.py
Length of output: 544
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Top of dc_fetch.py (imports section) ==="
head -30 src/odemis/util/dc_fetch.py
echo ""
echo "=== Python 3.9+ builtin generics elsewhere in src/odemis (non-test, non-new files) ==="
rg -rn '\b(list|dict|tuple|set)\[' src/odemis/ --include='*.py' \
--ignore-file=<(printf 'dc_fetch.py\ndatacollector.py\n') \
-g '!*/test/*' | head -20
echo ""
echo "=== Python version declared in setup.py or pyproject.toml ==="
fd -e py setup.py | xargs grep -n 'python_requires' 2>/dev/null || echo "(not found in setup.py)"
fd pyproject.toml | xargs grep -n 'python_requires\|requires-python' 2>/dev/null || echo "(not found in pyproject.toml)"
echo ""
echo "=== Python 3.10+ union type hints (X | Y) used anywhere in new files ==="
rg -n '\w\s*\|\s*\w' src/odemis/util/dc_fetch.py src/odemis/util/datacollector.py 2>/dev/null | grep -v '#' | head -10 || echo "(none)"
echo ""
echo "=== What Python version does the target OS ship? (check /etc/os-release) ==="
cat /etc/os-release 2>/dev/null || echo "(not available in sandbox)"
python3 --version 2>/dev/null || echo "(python3 not available)"
Length of output: 1480
@K4rishma, no worries — here's a clear explanation of the issue and the easiest fix.
What's the problem?
Your new files use lowercase builtin generics like list[Path] and tuple[Any, str] directly in type annotations. This syntax (PEP 585) was only introduced in Python 3.9+. However, debian/control declares the package requires Python ≥ 3.6, so it would fail to import on older Python versions.
The easiest fix (minimal change)
dc_fetch.py already imports List, Tuple, etc. from typing at the top — you just need to use them consistently in the two function signatures where lowercase was slipped in:
In src/odemis/util/dc_fetch.py (lines 169 and 193), change:
# Before (Python 3.9+ only)
def create_s3_client_from_config(config: DataCollectorConfig) -> tuple[Any, str]:
...
def build_s3_client_from_config(...) -> tuple[Any, str]:# After (Python 3.6+ compatible)
def create_s3_client_from_config(config: DataCollectorConfig) -> Tuple[Any, str]:
...
def build_s3_client_from_config(...) -> Tuple[Any, str]:(Tuple is already imported from typing in that file — no new imports needed.)
In src/odemis/util/datacollector.py (line 527), change:
# Before (Python 3.9+ only)
def _pending_zip_paths(self, queue_dir: Path) -> list[Path]:# After (Python 3.6+ compatible)
def _pending_zip_paths(self, queue_dir: Path) -> List[Path]:And add List to the typing import in that file:
from typing import ..., List, ...Alternative: bump the minimum Python version
If the project is already targeting a modern OS (e.g., Ubuntu 22.04 which ships Python 3.10, or Debian Bookworm which ships Python 3.11), you could instead bump the declared minimum in debian/control:
-X-Python3-Version: >= 3.6
+X-Python3-Version: >= 3.9
Build-Depends: ...
- python3-all-dev (>= 3.6.1)
+ python3-all-dev (>= 3.9)But this is a wider project-level decision — I'd recommend going with the typing-based fix above as it's the smallest and safest change.
There was a problem hiding this comment.
@K4rishma the code should stay Python 3.8 compatible. You should change the type annotation to be List instead of list and Dict instead of dict.
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
7094526 to
48eba1b
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (7)
src/odemis/util/test/datacollector_test.py (2)
536-542:⚠️ Potential issue | 🟠 MajorThis 10 ms assertion is still too tight for CI.
That threshold mostly measures runner load and thread scheduling jitter, not whether
record()stays non-blocking. It is likely to flap even when the implementation is fine.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/test/datacollector_test.py` around lines 536 - 542, The 10 ms CI assertion in test_record_returns_fast is too tight and flakes; update the test to use a much looser threshold (e.g. 50–200 ms) or assert that record() returns "quickly" by averaging multiple runs or checking it returns under a higher bound (suggestion: 100 ms) to avoid thread-scheduling jitter; modify the test_record_returns_fast testcase that calls self._collector.record("perf_test", "1.0", {"image": arr}) to use the relaxed threshold or an averaged timing strategy so CI stability improves.
52-52:⚠️ Potential issue | 🟠 MajorAvoid configuring global logging at import time.
Importing this test module still mutates process-wide logging, which makes suite behavior order-dependent and can leak verbosity/handlers into unrelated tests.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/test/datacollector_test.py` at line 52, Remove the module-level call to logging.basicConfig(level=logging.DEBUG) in datacollector_test.py and avoid mutating global logging at import time; instead, configure logging locally within the specific test(s) that need debug output (e.g., using the pytest caplog fixture or by calling logging.getLogger(__name__).setLevel(logging.DEBUG) inside the test function or setup/teardown), or use monkeypatch to adjust handlers for isolation so other tests are not affected.src/odemis/util/datacollector.py (4)
153-161:⚠️ Potential issue | 🔴 CriticalDon't persist undecided consent as
none.
_write()still emitsconsent = none, butconsentstill reads viaConfigParser.getboolean(). After a restart, that raisesValueError, so the consent flow can break as soon as the config is reloaded.💡 Proposed fix
- if consent_val is True: - consent_line = "consent = true" - elif consent_val is False: - consent_line = "consent = false" - else: - consent_line = "consent = none" + if consent_val is True: + consent_line = "consent = true" + elif consent_val is False: + consent_line = "consent = false" + else: + consent_line = "# consent = " @@ def consent(self) -> Optional[bool]: - """Return the consent state, or ``None`` if not yet set.""" + """Return the consent state, or None if not yet set.""" try: - return self._cp.getboolean("general", "consent") - except (configparser.NoSectionError, configparser.NoOptionError): + return self._cp.getboolean("general", "consent") + except (configparser.NoSectionError, configparser.NoOptionError, ValueError): return NoneAlso applies to: 186-190
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 153 - 161, The _write() routine currently emits "consent = none" which later breaks ConfigParser.getboolean(); change _write() so it only emits a consent line when self.consent is explicitly True or False (set consent_line for True -> "consent = true" and False -> "consent = false") and skip writing the consent key entirely when consent is None/undecided; apply the same conditional approach to the remind_date handling (use remind_val only when set, otherwise omit the remind_date line) to avoid persisting invalid values that ConfigParser cannot parse.
86-92:⚠️ Potential issue | 🟡 MinorThe new module still misses the repo's Python docstring/type-hint rules.
_ensure_section()still has no docstring,_get_client()still has no return annotation, and several added docstrings still use inline backticks.As per coding guidelines "
**/*.py: Always use type hints for function parameters and return types in Python code" and "**/*.py: Include docstrings for all functions and classes, following the reStructuredText style guide, without type information and without using inline formatting markers or backticks".Also applies to: 105-127, 181-183, 445-446
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 86 - 92, Add missing reStructuredText-style docstrings and Python type hints: provide a proper docstring for _ensure_section() (no inline backticks, follow rst conventions), add a return type annotation for _get_client() and annotate all function parameters/returns in the module (including the functions around the referenced regions: the block containing lines 105-127, the functions at 181-183, and 445-446). Also update any docstrings that currently use inline backticks to plain rst text (e.g., replace ``access_key``/``secret_key`` formatting with :param: or simple names) so all functions/classes follow the repo rules for type hints and rst docstrings.
311-313:⚠️ Potential issue | 🟠 MajorSanitize names before using them as archive paths.
event_nameis only truncated, and payload keys are still used verbatim in sidecar filenames. Inputs containing/,\, or..can escapetmp_diror produce unsafe ZIP entry names.Also applies to: 326-336, 352-354, 382-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 311 - 313, The code truncates event_name to create zip_name but does not sanitize it or payload-derived filenames, allowing `/`, `\`, `..`, or other unsafe characters to create directory traversal or unsafe ZIP entries; update the logic that builds safe_event and all sidecar filenames to (1) strip any path components (use the basename), (2) normalize and whitelist characters (e.g., allow only alphanumerics, dots, dashes and underscores, replacing everything else with underscore), (3) collapse consecutive dots and trim length (keep the existing 64-char cap), and (4) ensure resulting names never start with a dot or slash; apply this sanitization where safe_event and zip_name are built and to any payload key→filename conversions used when writing sidecar files so that tmp_dir and ZIP entries cannot be escaped.
533-563:⚠️ Potential issue | 🟠 MajorPending uploads still block queue draining.
When
_process_pending_zips()returnsTrue,_run()immediatelycontinues and skipsqueue.get(). During a long outage/backoff window,record()keeps enqueueing into the unbounded in-memory queue while nothing is consumed.Also applies to: 571-584
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 533 - 563, _pending_zip_paths are causing _process_pending_zips to return True during backoff or init-failure, which makes _run continue and skip queue.get() so the in-memory queue (fed by record()) grows unbounded; change _process_pending_zips (and the similar block at 571-584) so that when no upload work can proceed due to backoff (now < self._next_retry_at) or backend init failure you return False (after any sleep/log/schedule_retry) rather than True, allowing _run to call queue.get() and drain/flush records to disk; keep returning True only when you actually processed/removed pending zips (i.e., after successful _upload and unlink) so the caller can skip immediate queue.get().src/odemis/util/dc_fetch.py (1)
249-258:⚠️ Potential issue | 🟠 MajorDownload to a temp path and preserve the key path.
Using
Path(key).namecollapses host prefixes, so two different S3 keys can collide locally. Writing straight to the final destination also means a failed download can leave a truncated ZIP behind, and the next run will count it asskipped_existinginstead of retrying it.💡 Safer download flow
- destination = output_dir / Path(key).name + destination = output_dir / Path(key) + destination.parent.mkdir(parents=True, exist_ok=True) + tmp_destination = destination.with_suffix(destination.suffix + ".part") if destination.exists(): skipped_existing += 1 continue try: - s3_client.download_file(bucket, key, str(destination)) + s3_client.download_file(bucket, key, str(tmp_destination)) + tmp_destination.replace(destination) downloaded += 1 except Exception: + tmp_destination.unlink(missing_ok=True) failed += 1 logging.exception("Failed to download key %s", key)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/dc_fetch.py` around lines 249 - 258, The code uses Path(key).name which flattens S3 keys and writes directly to the final path (destination), causing filename collisions and leaving partially downloaded files; change the flow in dc_fetch.py so that for each key you compute the full relative path under output_dir (preserve the key's path, not just its name), create its parent directories, download to a temporary file in the same directory (e.g., destination.with_suffix(".part") or a tmp name), call s3_client.download_file to that temp path, on success atomically rename/move the temp file to the final destination and increment downloaded, on failure remove the temp file, increment failed and log the exception (keep skipped_existing only when the final destination already exists), and ensure you reference the same variables in the existing block (destination, s3_client.download_file, output_dir, skipped_existing, downloaded, failed) so behavior is atomic and preserves key paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/odemis/util/datacollector.py`:
- Around line 620-653: DataCollector is missing get_consent_remind_days used by
the GUI; add a method def get_consent_remind_days(self) -> Optional[int]: that
mirrors the pattern of get_consent/should_prompt_for_consent: call
self._lazy_init(), return None or False if not self._init_ok, otherwise return
the value from the backing config (e.g. return
self._config.get_consent_remind_days() or self._config.consent_remind_days) and
include appropriate type hints and the same # type: ignore[union-attr] if
needed; reference DataCollector, its _lazy_init, _init_ok and _config attributes
to locate where to add it.
In `@src/odemis/util/dc_fetch.py`:
- Around line 35-38: The docstrings in this module (starting with
parse_since_utc) use inline backticks and inline type info which violates the
repo's Python docstring rules; update parse_since_utc and the other functions in
this file to use reStructuredText-style docstrings: remove inline backticks,
remove type annotations from the descriptive text, and document
parameters/returns using reST field markers (e.g., :param name: ..., :return:
...) with plain text descriptions; ensure each function/class has a
reST-compliant docstring without inline formatting markers.
- Around line 35-47: The parse_since_utc function fails on ISO strings with a
trailing "Z" in Python 3.10; update parse_since_utc to normalize UTC Z-suffixes
by replacing a trailing "Z" or "z" with "+00:00" before calling
datetime.fromisoformat, then continue the existing logic (keep the date-only
branch using datetime.strptime and the tzinfo handling/astimezone fallback) so
strings like "2026-03-01T12:30:00Z" parse correctly.
In `@src/odemis/util/test/datacollector_test.py`:
- Around line 352-357: The test doubles for the backend have the wrong upload
signature: update the test classes named _Backend used in datacollector_test so
their upload method matches production usage backend.upload(local_path,
remote_key) (accept two parameters, ignore/unused remote_key), keeping the
existing behavior of appending local_path.name to uploaded_names and raising
ConnectionError on first call; change both occurrences (the one exercising
retry/failures and the other at lines ~381-383) so _process_pending_zips() ->
_upload() calls will succeed without a TypeError.
---
Duplicate comments:
In `@src/odemis/util/datacollector.py`:
- Around line 153-161: The _write() routine currently emits "consent = none"
which later breaks ConfigParser.getboolean(); change _write() so it only emits a
consent line when self.consent is explicitly True or False (set consent_line for
True -> "consent = true" and False -> "consent = false") and skip writing the
consent key entirely when consent is None/undecided; apply the same conditional
approach to the remind_date handling (use remind_val only when set, otherwise
omit the remind_date line) to avoid persisting invalid values that ConfigParser
cannot parse.
- Around line 86-92: Add missing reStructuredText-style docstrings and Python
type hints: provide a proper docstring for _ensure_section() (no inline
backticks, follow rst conventions), add a return type annotation for
_get_client() and annotate all function parameters/returns in the module
(including the functions around the referenced regions: the block containing
lines 105-127, the functions at 181-183, and 445-446). Also update any
docstrings that currently use inline backticks to plain rst text (e.g., replace
``access_key``/``secret_key`` formatting with :param: or simple names) so all
functions/classes follow the repo rules for type hints and rst docstrings.
- Around line 311-313: The code truncates event_name to create zip_name but does
not sanitize it or payload-derived filenames, allowing `/`, `\`, `..`, or other
unsafe characters to create directory traversal or unsafe ZIP entries; update
the logic that builds safe_event and all sidecar filenames to (1) strip any path
components (use the basename), (2) normalize and whitelist characters (e.g.,
allow only alphanumerics, dots, dashes and underscores, replacing everything
else with underscore), (3) collapse consecutive dots and trim length (keep the
existing 64-char cap), and (4) ensure resulting names never start with a dot or
slash; apply this sanitization where safe_event and zip_name are built and to
any payload key→filename conversions used when writing sidecar files so that
tmp_dir and ZIP entries cannot be escaped.
- Around line 533-563: _pending_zip_paths are causing _process_pending_zips to
return True during backoff or init-failure, which makes _run continue and skip
queue.get() so the in-memory queue (fed by record()) grows unbounded; change
_process_pending_zips (and the similar block at 571-584) so that when no upload
work can proceed due to backoff (now < self._next_retry_at) or backend init
failure you return False (after any sleep/log/schedule_retry) rather than True,
allowing _run to call queue.get() and drain/flush records to disk; keep
returning True only when you actually processed/removed pending zips (i.e.,
after successful _upload and unlink) so the caller can skip immediate
queue.get().
In `@src/odemis/util/dc_fetch.py`:
- Around line 249-258: The code uses Path(key).name which flattens S3 keys and
writes directly to the final path (destination), causing filename collisions and
leaving partially downloaded files; change the flow in dc_fetch.py so that for
each key you compute the full relative path under output_dir (preserve the key's
path, not just its name), create its parent directories, download to a temporary
file in the same directory (e.g., destination.with_suffix(".part") or a tmp
name), call s3_client.download_file to that temp path, on success atomically
rename/move the temp file to the final destination and increment downloaded, on
failure remove the temp file, increment failed and log the exception (keep
skipped_existing only when the final destination already exists), and ensure you
reference the same variables in the existing block (destination,
s3_client.download_file, output_dir, skipped_existing, downloaded, failed) so
behavior is atomic and preserves key paths.
In `@src/odemis/util/test/datacollector_test.py`:
- Around line 536-542: The 10 ms CI assertion in test_record_returns_fast is too
tight and flakes; update the test to use a much looser threshold (e.g. 50–200
ms) or assert that record() returns "quickly" by averaging multiple runs or
checking it returns under a higher bound (suggestion: 100 ms) to avoid
thread-scheduling jitter; modify the test_record_returns_fast testcase that
calls self._collector.record("perf_test", "1.0", {"image": arr}) to use the
relaxed threshold or an averaged timing strategy so CI stability improves.
- Line 52: Remove the module-level call to
logging.basicConfig(level=logging.DEBUG) in datacollector_test.py and avoid
mutating global logging at import time; instead, configure logging locally
within the specific test(s) that need debug output (e.g., using the pytest
caplog fixture or by calling logging.getLogger(__name__).setLevel(logging.DEBUG)
inside the test function or setup/teardown), or use monkeypatch to adjust
handlers for isolation so other tests are not affected.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4707cfbb-e587-4426-82ff-a92c77005543
📒 Files selected for processing (9)
debian/controlscripts/odemis-dc-fetch.pysrc/odemis/gui/cont/menu.pysrc/odemis/gui/main.pysrc/odemis/gui/win/consent.pysrc/odemis/util/datacollector.pysrc/odemis/util/dc_fetch.pysrc/odemis/util/test/datacollector_test.pysrc/odemis/util/test/dc_fetch_test.py
✅ Files skipped from review due to trivial changes (3)
- debian/control
- scripts/odemis-dc-fetch.py
- src/odemis/gui/win/consent.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/odemis/gui/cont/menu.py
48eba1b to
06b4736
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (9)
src/odemis/util/test/datacollector_test.py (3)
52-52:⚠️ Potential issue | 🟠 MajorAvoid configuring global logging when the test module is imported.
logging.basicConfig()mutates process-wide logging for unrelated tests and can make suite behavior order-dependent. Keep this underif __name__ == "__main__":or use per-test logging controls.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/test/datacollector_test.py` at line 52, Remove the module-level call to logging.basicConfig() (the logging.basicConfig line) so importing datacollector_test doesn't mutate global logging; either move that call into an if __name__ == "__main__": block for local debugging or remove it entirely and rely on per-test logging control (pytest's caplog fixture or a test-level logger configuration) within the test functions that need debug output.
352-357:⚠️ Potential issue | 🟠 MajorMatch the backend double to the production upload signature.
_process_pending_zips()calls_upload(), which invokesbackend.upload(local_path, remote_key). These doubles currently accept onlylocal_path, so the tests fail before exercising retry/order behavior.Fix the backend doubles
class _Backend: - def upload(self, local_path: Path) -> None: + def upload(self, local_path: Path, remote_key: str) -> None: + del remote_key uploaded_names.append(local_path.name) if failures["count"] == 0: failures["count"] += 1 raise ConnectionError("temporary network issue") @@ class _Backend: - def upload(self, local_path: Path) -> None: + def upload(self, local_path: Path, remote_key: str) -> None: + del remote_key uploaded.append(local_path.name)Also applies to: 381-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/test/datacollector_test.py` around lines 352 - 357, The test double `_Backend`'s upload method must match the production signature used by `_process_pending_zips()` → `_upload()`, so change the `_Backend.upload` definitions in the tests (the ones around the current `_Backend` class and the second double at the later block) to accept two parameters `(local_path, remote_key)` and ignore or use `remote_key` as needed; keep the existing behavior of recording `local_path.name` and raising a `ConnectionError` on the first call so retry/order behavior is still exercised.
536-542:⚠️ Potential issue | 🟠 MajorLoosen this non-blocking assertion.
A 10 ms wall-clock bound will flap under CI load. Prefer asserting that
record()enqueues work without doing serialization/upload inline, or use a much wider timing budget.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/test/datacollector_test.py` around lines 536 - 542, The test test_record_returns_fast currently asserts a strict 10 ms wall-clock bound on self._collector.record which is flaky; instead change the assertion to verify record enqueues work rather than completes serialization/upload synchronously—e.g., call self._collector.record("perf_test","1.0",{"image":arr}) and then assert the collector's work queue increased (inspect _collector._queue.qsize() or a public enqueue counter) or mock the serializer/uploader on the collector and assert its serialize/upload method was not called synchronously; alternatively, if you must keep a timing check, widen the threshold to a much larger value (e.g., 100–200 ms) and document why.src/odemis/util/dc_fetch.py (3)
35-47:⚠️ Potential issue | 🟠 MajorNormalize UTC
Zsuffixes before parsing.
datetime.fromisoformat()is not safe for--since ...Zon Python 3.10, so valid ISO-8601 UTC input can fail. NormalizeZ/zto+00:00before parsing.Python 3.10 compatibility fix
def parse_since_utc(value: str) -> datetime: """Parse a date/datetime string to UTC-aware datetime. @@ """ text = value.strip() + if text.endswith(("Z", "z")): + text = f"{text[:-1]}+00:00" if len(text) == 10: parsed = datetime.strptime(text, "%Y-%m-%d") return parsed.replace(tzinfo=timezone.utc)In Python 3.10, does datetime.datetime.fromisoformat("2026-03-01T12:30:00Z") raise ValueError, and is replacing a trailing Z with +00:00 the recommended compatibility workaround?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/dc_fetch.py` around lines 35 - 47, The parse_since_utc function fails on ISO strings ending with "Z"/"z" under Python 3.10 because datetime.fromisoformat doesn't accept "Z"; update parse_since_utc to normalize the input before calling fromisoformat by replacing a trailing "Z" or "z" with "+00:00" after trimming whitespace (keep existing logic for 10-char date-only strings), then proceed with fromisoformat and the existing tzinfo handling/astimezone behavior.
249-258:⚠️ Potential issue | 🟠 MajorDownload through a temp file and avoid basename collisions.
Path(key).namedrops the host prefix, so different S3 keys can overwrite/skip each other locally. Writing directly to the final path can also leave a partial ZIP that later appears complete.Safer download pattern
- destination = output_dir / Path(key).name + destination = output_dir / Path(key) + destination.parent.mkdir(parents=True, exist_ok=True) + tmp_destination = destination.with_suffix(destination.suffix + ".part") if destination.exists(): skipped_existing += 1 continue try: - s3_client.download_file(bucket, key, str(destination)) + s3_client.download_file(bucket, key, str(tmp_destination)) + tmp_destination.replace(destination) downloaded += 1 except Exception: + tmp_destination.unlink(missing_ok=True) failed += 1 logging.exception("Failed to download key %s", key)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/dc_fetch.py` around lines 249 - 258, The current download uses Path(key).name which causes basename collisions and writes directly to the final path; change the logic in the download loop (where destination, output_dir, key, s3_client.download_file are referenced) to (1) compute a unique safe local path for each S3 key (e.g., recreate the key’s relative path under output_dir or append a short hash to the basename) and ensure parent directories exist, (2) download to a temporary file in the same directory (e.g., destination.with_suffix(destination.suffix + ".part" or a tmp filename) using s3_client.download_file, (3) on successful download atomically rename/move the temp file to destination, and (4) only update downloaded/skipped/failed counters after the rename; ensure exceptions remove temp files and log the key on failure.
36-39:⚠️ Potential issue | 🟡 MinorRemove inline formatting markers from docstrings.
Several docstrings still use backticks around examples, command names, or return tuples. Convert these to plain reStructuredText field-style descriptions. As per coding guidelines,
**/*.py: Include docstrings for all functions and classes, following the reStructuredText style guide, without type information and without using inline formatting markers or backticks.Also applies to: 51-51, 68-68, 80-80, 138-138, 170-170, 194-194, 270-270
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/dc_fetch.py` around lines 36 - 39, The docstring for the date/datetime parser currently contains inline backticks around examples and should be converted to plain reStructuredText: remove backticks and rewrite examples as normal text or field-style descriptions (e.g., "Accepts ISO-8601 date (YYYY-MM-DD) and datetime (YYYY-MM-DDTHH:MM:SS with optional timezone).") in the docstring for the parsing function (the function whose docstring begins "Parse a date/datetime string to UTC-aware datetime."); apply the same change to the other listed docstrings (those at the same file positions noted in the review) so no backticks or inline formatting markers remain and the text follows reStructuredText/plain style.src/odemis/util/datacollector.py (3)
297-388:⚠️ Potential issue | 🟠 MajorSanitize event names and payload keys before using them as filenames.
event_nameis only truncated, and payload keys are used directly for sidecar paths. Values containing/,.., or path separators can escape the queue/temp directory or make serialization fail.Sanitize file-name components before constructing ZIP and sidecars
+import re + +_SAFE_NAME_RE = re.compile(r"[^A-Za-z0-9._-]+") + + +def _safe_name(value: object, default: str, max_len: int = 64) -> str: + """Return a filesystem-safe name component.""" + cleaned = _SAFE_NAME_RE.sub("_", str(value)).strip("._-") + if not cleaned or cleaned in {".", ".."}: + cleaned = default + return cleaned[:max_len] + @@ - safe_event = item.event_name[:64] if item.event_name else "event" + safe_event = _safe_name(item.event_name, default="event") @@ for key, value in item.payload.items(): + safe_key = _safe_name(key, default="payload") if value is None or isinstance(value, (str, int, float, bool)): payload_meta[key] = value elif isinstance(value, numpy.ndarray): if item.image_format.upper() == "HDF5": - arc_name = f"{key}.h5" + arc_name = f"{safe_key}.h5" @@ else: - arc_name = f"{key}.ome.tiff" + arc_name = f"{safe_key}.ome.tiff" @@ elif isinstance(value, (dict, list)): - arc_name = f"extra_{key}.json" + arc_name = f"extra_{safe_key}.json"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 297 - 388, The _serialize function uses raw event_name and payload keys to build filenames (zip_name, arc_name, extra_files entries) which allows path traversal or invalid filenames; sanitize these components by normalizing and whitelisting characters (e.g. replace or remove path separators, dots, control chars), reject or replace dangerous sequences like ".." and leading slashes, and enforce a max length; ensure arc_name is always a simple basename (no directories) before creating abs_path in tmp_dir and before adding entries to payload_meta/extra_files; apply the same sanitisation to zip_name (derived from safe_event) so produced tmp_zip/final_zip and ZIP arc names cannot escape tmp_dir or queue_dir and never create absolute paths.
480-576:⚠️ Potential issue | 🟠 MajorDo not let pending uploads starve in-memory queue consumption.
When pending ZIPs exist or backoff is active,
_run()continues beforequeue.get(). Newrecord()calls then accumulate in the unbounded queue during outages.Keep the queue bounded and consume it even when pending ZIPs exist
- self._queue: queue.Queue = queue.Queue() + self._queue: queue.Queue = queue.Queue(maxsize=1000) @@ self._ensure_thread() - self._queue.put_nowait(item) + try: + self._queue.put_nowait(item) + except queue.Full: + logging.warning("DataCollector queue is full; dropping event '%s'", item.event_name) @@ while True: try: - if self._process_pending_zips(self._queue_dir): - continue + self._process_pending_zips(self._queue_dir) except Exception: logging.exception("DataCollector error while processing pending uploads") self._schedule_retry() - continue try: item = self._queue.get(timeout=1.0) except queue.Empty: continue🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 480 - 576, The _run loop currently prioritizes pending ZIP uploads and backoff, allowing the in-memory queue to grow unbounded during outages; change _run so it always tries to consume any queued _WorkItem before (or while) sleeping for pending uploads: attempt a non-blocking or timed get from self._queue (use Queue.get_nowait() or get(timeout=...)) inside the loop and call _process_work_item(item) for each retrieved item, falling back to calling _process_pending_zips(self._queue_dir) only when the queue is empty; ensure the lock/daemon behavior is unchanged and preserve existing retry/backoff logic in _process_pending_zips and _schedule_retry/_reset_retry.
144-191:⚠️ Potential issue | 🔴 CriticalDo not persist unset consent as
none.
_write()emitsconsent = none, butconsentlater reads the option withgetboolean(), which raises for"none"after a restart. Comment out/remove the option for the undecided state, and tolerate legacy"none"when reading.Persist undecided consent safely
- else: - consent_line = "consent = none" + else: + consent_line = "# consent = " @@ `@property` def consent(self) -> Optional[bool]: """Return the consent state, or ``None`` if not yet set.""" try: - return self._cp.getboolean("general", "consent") + raw_value = self._cp.get("general", "consent").strip().lower() except (configparser.NoSectionError, configparser.NoOptionError): return None + if raw_value in ("", "none"): + return None + try: + return self._cp.getboolean("general", "consent") + except ValueError: + logging.warning("Invalid data-collection consent value: %r", raw_value) + return None#!/bin/bash # Verify ConfigParser boolean parsing for the persisted undecided value. python3 - <<'PY' import configparser cp = configparser.ConfigParser(interpolation=None) cp.read_string("[general]\nconsent = none\n") try: print(cp.getboolean("general", "consent")) except Exception as exc: print(type(exc).__name__, str(exc)) PY🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 144 - 191, The _write method currently persists undecided consent as the literal "consent = none" which breaks ConfigParser.getboolean in the consent property; modify _write to NOT write an explicit "consent = none" (either omit the consent line or write it commented) when self.consent is None, and keep writing explicit "consent = true/false" for booleans; update the consent property to tolerate legacy "none" by catching ValueError (or reading the raw string via _cp.get and returning None when the value equals "none") so it returns None for undecided state instead of propagating an exception; adjust references to _write and consent in tests/usage as needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/odemis/util/datacollector.py`:
- Around line 697-699: The current validation calls image_format.upper() which
raises AttributeError for non-strings; add an explicit check that image_format
is an instance of str before calling .upper() and if not raise a ValueError with
the same informative message (referencing image_format and _VALID_IMAGE_FORMATS)
so callers get ValueError not AttributeError; modify the validation around the
image_format check (the code that uses image_format.upper() and
_VALID_IMAGE_FORMATS) to perform the isinstance(image_format, str) check first,
then proceed to compare image_format.upper() against _VALID_IMAGE_FORMATS.
In `@src/odemis/util/dc_fetch.py`:
- Around line 137-150: The iter_s3_objects paginator can spin indefinitely if
response["IsTruncated"] is True but there is no "NextContinuationToken"; update
iter_s3_objects to validate that when response.get("IsTruncated") is truthy you
also have a non-empty NextContinuationToken before assigning token and loop
continuing — if NextContinuationToken is missing or falsy, raise an exception
(or log and break) to fail fast and avoid repeating the same request; reference
the iter_s3_objects function and the token/response.get("NextContinuationToken")
logic when making the change.
---
Duplicate comments:
In `@src/odemis/util/datacollector.py`:
- Around line 297-388: The _serialize function uses raw event_name and payload
keys to build filenames (zip_name, arc_name, extra_files entries) which allows
path traversal or invalid filenames; sanitize these components by normalizing
and whitelisting characters (e.g. replace or remove path separators, dots,
control chars), reject or replace dangerous sequences like ".." and leading
slashes, and enforce a max length; ensure arc_name is always a simple basename
(no directories) before creating abs_path in tmp_dir and before adding entries
to payload_meta/extra_files; apply the same sanitisation to zip_name (derived
from safe_event) so produced tmp_zip/final_zip and ZIP arc names cannot escape
tmp_dir or queue_dir and never create absolute paths.
- Around line 480-576: The _run loop currently prioritizes pending ZIP uploads
and backoff, allowing the in-memory queue to grow unbounded during outages;
change _run so it always tries to consume any queued _WorkItem before (or while)
sleeping for pending uploads: attempt a non-blocking or timed get from
self._queue (use Queue.get_nowait() or get(timeout=...)) inside the loop and
call _process_work_item(item) for each retrieved item, falling back to calling
_process_pending_zips(self._queue_dir) only when the queue is empty; ensure the
lock/daemon behavior is unchanged and preserve existing retry/backoff logic in
_process_pending_zips and _schedule_retry/_reset_retry.
- Around line 144-191: The _write method currently persists undecided consent as
the literal "consent = none" which breaks ConfigParser.getboolean in the consent
property; modify _write to NOT write an explicit "consent = none" (either omit
the consent line or write it commented) when self.consent is None, and keep
writing explicit "consent = true/false" for booleans; update the consent
property to tolerate legacy "none" by catching ValueError (or reading the raw
string via _cp.get and returning None when the value equals "none") so it
returns None for undecided state instead of propagating an exception; adjust
references to _write and consent in tests/usage as needed.
In `@src/odemis/util/dc_fetch.py`:
- Around line 35-47: The parse_since_utc function fails on ISO strings ending
with "Z"/"z" under Python 3.10 because datetime.fromisoformat doesn't accept
"Z"; update parse_since_utc to normalize the input before calling fromisoformat
by replacing a trailing "Z" or "z" with "+00:00" after trimming whitespace (keep
existing logic for 10-char date-only strings), then proceed with fromisoformat
and the existing tzinfo handling/astimezone behavior.
- Around line 249-258: The current download uses Path(key).name which causes
basename collisions and writes directly to the final path; change the logic in
the download loop (where destination, output_dir, key, s3_client.download_file
are referenced) to (1) compute a unique safe local path for each S3 key (e.g.,
recreate the key’s relative path under output_dir or append a short hash to the
basename) and ensure parent directories exist, (2) download to a temporary file
in the same directory (e.g., destination.with_suffix(destination.suffix +
".part" or a tmp filename) using s3_client.download_file, (3) on successful
download atomically rename/move the temp file to destination, and (4) only
update downloaded/skipped/failed counters after the rename; ensure exceptions
remove temp files and log the key on failure.
- Around line 36-39: The docstring for the date/datetime parser currently
contains inline backticks around examples and should be converted to plain
reStructuredText: remove backticks and rewrite examples as normal text or
field-style descriptions (e.g., "Accepts ISO-8601 date (YYYY-MM-DD) and datetime
(YYYY-MM-DDTHH:MM:SS with optional timezone).") in the docstring for the parsing
function (the function whose docstring begins "Parse a date/datetime string to
UTC-aware datetime."); apply the same change to the other listed docstrings
(those at the same file positions noted in the review) so no backticks or inline
formatting markers remain and the text follows reStructuredText/plain style.
In `@src/odemis/util/test/datacollector_test.py`:
- Line 52: Remove the module-level call to logging.basicConfig() (the
logging.basicConfig line) so importing datacollector_test doesn't mutate global
logging; either move that call into an if __name__ == "__main__": block for
local debugging or remove it entirely and rely on per-test logging control
(pytest's caplog fixture or a test-level logger configuration) within the test
functions that need debug output.
- Around line 352-357: The test double `_Backend`'s upload method must match the
production signature used by `_process_pending_zips()` → `_upload()`, so change
the `_Backend.upload` definitions in the tests (the ones around the current
`_Backend` class and the second double at the later block) to accept two
parameters `(local_path, remote_key)` and ignore or use `remote_key` as needed;
keep the existing behavior of recording `local_path.name` and raising a
`ConnectionError` on the first call so retry/order behavior is still exercised.
- Around line 536-542: The test test_record_returns_fast currently asserts a
strict 10 ms wall-clock bound on self._collector.record which is flaky; instead
change the assertion to verify record enqueues work rather than completes
serialization/upload synchronously—e.g., call
self._collector.record("perf_test","1.0",{"image":arr}) and then assert the
collector's work queue increased (inspect _collector._queue.qsize() or a public
enqueue counter) or mock the serializer/uploader on the collector and assert its
serialize/upload method was not called synchronously; alternatively, if you must
keep a timing check, widen the threshold to a much larger value (e.g., 100–200
ms) and document why.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 34ecf67e-7de2-4810-b041-ae18054f6102
📒 Files selected for processing (9)
debian/controlscripts/odemis-dc-fetch.pysrc/odemis/gui/cont/menu.pysrc/odemis/gui/main.pysrc/odemis/gui/win/consent.pysrc/odemis/util/datacollector.pysrc/odemis/util/dc_fetch.pysrc/odemis/util/test/datacollector_test.pysrc/odemis/util/test/dc_fetch_test.py
✅ Files skipped from review due to trivial changes (2)
- debian/control
- scripts/odemis-dc-fetch.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/odemis/gui/cont/menu.py
Provides a thread-safe, non-blocking ``DataCollector.record()`` call that any Odemis module can invoke to capture a labelled data sample. CLI interface is implemented to download the data from the cloud storage.
06b4736 to
1f2c680
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (11)
src/odemis/util/datacollector.py (5)
697-699:⚠️ Potential issue | 🟡 MinorValidate
image_formatis a string before calling.upper().Passing
Noneor another non-string currently raisesAttributeError, not the documentedValueError.Preserve the documented ValueError contract
- if image_format.upper() not in _VALID_IMAGE_FORMATS: + if not isinstance(image_format, str) or image_format.upper() not in _VALID_IMAGE_FORMATS: raise ValueError( f"image_format must be one of {_VALID_IMAGE_FORMATS}, got {image_format!r}" )#!/bin/bash # Verify why the current validation path raises the wrong exception type. python3 - <<'PY' try: None.upper() except Exception as exc: print(type(exc).__name__, str(exc)) PY🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 697 - 699, The current validation calls image_format.upper() and can raise AttributeError for non-string inputs; update the check in the datacollector image format validation (the block that compares image_format.upper() against _VALID_IMAGE_FORMATS) to first ensure image_format is an instance of str (e.g., using isinstance(image_format, str)); if not, raise the same ValueError with the existing message; then call image_format.upper() and compare to _VALID_IMAGE_FORMATS as before so the function consistently raises ValueError for invalid types or values.
153-161:⚠️ Potential issue | 🔴 CriticalHandle persisted
consent = noneon config reload.
_write()emitsconsent = none, butconsentlater callsgetboolean(), which raisesValueErrorfor"none". After restart, config reads can fail instead of returning undecided consent.Make the getter tolerate undecided consent
`@property` def consent(self) -> Optional[bool]: """Return the consent state, or ``None`` if not yet set.""" try: - return self._cp.getboolean("general", "consent") + raw = self._cp.get("general", "consent").strip().lower() except (configparser.NoSectionError, configparser.NoOptionError): return None + if raw in ("", "none"): + return None + try: + return self._cp.getboolean("general", "consent") + except ValueError: + logging.warning("Invalid data-collection consent value in config: %r", raw) + return None#!/bin/bash # Verify ConfigParser boolean parsing for the persisted "none" value. python3 - <<'PY' import configparser cp = configparser.ConfigParser(interpolation=None) cp.read_string("[general]\nconsent = none\n") try: cp.getboolean("general", "consent") except Exception as exc: print(type(exc).__name__, str(exc)) PYAlso applies to: 185-190
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 153 - 161, The persisted config writes "consent = none" which later causes ConfigParser.getboolean() to raise ValueError on reload; update the consent getter (the property/method that reads "consent" and any similar readers around the locations referenced, e.g., lines handling consent at 185-190) to tolerate an undecided value by reading the raw string (ConfigParser.get or get with fallback), normalizing it (strip/lower), and returning None when it equals "none" (or empty) instead of calling getboolean directly; alternatively catch ValueError from getboolean and map it to None so undecided consent is handled safely on config reload while keeping _write() unchanged.
105-127:⚠️ Potential issue | 🟡 MinorBring the new docstrings/type hints in line with the repo rule.
This module still has inline docstring markers,
_ensure_section()has no docstring, and_get_client()lacks a return type annotation. As per coding guidelines,**/*.py: Always use type hints for function parameters and return types in Python code; Include docstrings for all functions and classes, following the reStructuredText style guide, without type information and without using inline formatting markers or backticks.Also applies to: 181-183, 445-446, 470-473, 536-537, 676-688
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 105 - 127, Update docstrings and add missing type hints per repo rules: convert the DataCollectorConfig class docstring and any other docstrings referenced (lines ~181-183, 445-446, 470-473, 536-537, 676-688) to reStructuredText style without inline formatting markers or backticks, and add a proper docstring for the helper function _ensure_section() describing purpose, args, and return in reST. Add full type annotations to all function signatures that lack them, specifically annotate _get_client() with its return type and annotate parameters/return types for any functions missing hints in this module; ensure symbol names match exactly (DataCollectorConfig, _ensure_section, _get_client) so the changes are applied to the correct locations.
311-356:⚠️ Potential issue | 🔴 CriticalSanitize event names and payload keys before using them as paths.
event_nameis only truncated, and payload keys are interpolated directly into sidecar paths. A key like/tmp/xor../xcan escapetmp_dirbefore the ZIP is built and can also create unsafe archive entries.Sanitize filename components before file creation
import queue +import re import shutil @@ _DEFAULT_CONSENT_REMIND_DELTA = timedelta(days=30) REMINDER_DATE_KEY = "reminder_date" +_SAFE_FILENAME_RE = re.compile(r"[^A-Za-z0-9._-]+") + + +def _safe_filename_part(value: object, default: str, max_len: int = 64) -> str: + """Return a filesystem-safe filename component.""" + try: + text = str(value) + except Exception: + text = default + cleaned = _SAFE_FILENAME_RE.sub("_", text).strip("._-") + return (cleaned or default)[:max_len] @@ - # Limit event_name length so the filename stays within filesystem limits. - safe_event = item.event_name[:64] if item.event_name else "event" + safe_event = _safe_filename_part(item.event_name, default="event", max_len=64) @@ - for key, value in item.payload.items(): + for key, value in item.payload.items(): + safe_key = _safe_filename_part(key, default="payload", max_len=64) if value is None or isinstance(value, (str, int, float, bool)): payload_meta[key] = value elif isinstance(value, numpy.ndarray): if item.image_format.upper() == "HDF5": - arc_name = f"{key}.h5" + arc_name = f"{safe_key}.h5" @@ - arc_name = f"{key}.ome.tiff" + arc_name = f"{safe_key}.ome.tiff" @@ elif isinstance(value, (dict, list)): - arc_name = f"extra_{key}.json" + arc_name = f"extra_{safe_key}.json"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 311 - 356, The code uses raw event_name and payload keys to build filenames/paths (safe_event, zip_name, arc_name) which can include path traversal or separators; sanitize event_name and each payload key before using them as filenames (e.g., remove/replace path separators, strip leading dots, collapse .., enforce a safe charset and length, fallback to a safe token) and use the sanitized value when building zip_name and arc_name (references: safe_event, zip_name, arc_name, tmp_dir, extra_files, payload_meta). After creating abs_path, ensure abs_path.resolve() is inside tmp_dir.resolve() (use Path.resolve() and is_relative_to or equivalent) and reject or remap any keys that would escape tmp_dir; also ensure the arcname stored in payload_meta is the sanitized basename that will be used as the zip entry name to prevent unsafe archive entries.
480-496:⚠️ Potential issue | 🟠 MajorDo not let pending uploads starve queue consumption.
When pending ZIPs exist,
_run()continues beforequeue.get(). During an outage/backoff, new records can accumulate indefinitely in the unbounded in-memory queue.Drain/limit the in-memory queue while pending uploads retry
_VALID_IMAGE_FORMATS = ("TIFF", "HDF5") +_MAX_IN_MEMORY_QUEUE_ITEMS = 1000 _INITIAL_RETRY_DELAY_SECONDS = 30.0 @@ - self._queue: queue.Queue = queue.Queue() + self._queue: queue.Queue = queue.Queue(maxsize=_MAX_IN_MEMORY_QUEUE_ITEMS) @@ """ self._ensure_thread() - self._queue.put_nowait(item) + try: + self._queue.put_nowait(item) + except queue.Full: + logging.warning("DataCollector queue full; dropping event '%s'", item.event_name) @@ def _run(self) -> None: """Main loop: process items until the thread is stopped.""" while True: try: - if self._process_pending_zips(self._queue_dir): - continue + self._process_pending_zips(self._queue_dir) except Exception: logging.exception("DataCollector error while processing pending uploads") self._schedule_retry() - continue try: item = self._queue.get(timeout=1.0) except queue.Empty: continueAlso applies to: 571-584
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 480 - 496, The _run loop currently skips queue.get() while there are pending uploads and it's backing off, allowing the unbounded in-memory self._queue to grow; change _run so that even during backoff it repeatedly calls self._queue.get(timeout=remaining_backoff) (or otherwise polls the queue) and processes/drains items up to a small limit rather than sleeping unconditionally, ensuring new records are consumed while waiting to retry uploads; reference the _run method, self._queue, self._next_retry_at and the existing enqueue() behavior and implement a bounded-drain or timed-get loop so pending upload retries no longer starve queue consumption.src/odemis/util/dc_fetch.py (4)
137-150:⚠️ Potential issue | 🟠 MajorGuard against truncated listings without a continuation token.
If an S3-compatible endpoint returns
IsTruncated=TruewithoutNextContinuationToken, this loop repeats the same request forever. Fail fast instead.Prevent infinite pagination
if not response.get("IsTruncated"): break token = response.get("NextContinuationToken") + if not token: + raise RuntimeError("S3 listing was truncated without a continuation token")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/dc_fetch.py` around lines 137 - 150, In iter_s3_objects, guard against infinite loops when S3 returns IsTruncated=True but no NextContinuationToken: after calling s3_client.list_objects_v2 and before repeating, check response.get("IsTruncated") and if True and response.get("NextContinuationToken") is falsy, raise an exception (e.g., RuntimeError) with a clear message including bucket and prefix; otherwise assign token = response["NextContinuationToken"] as currently done. This ensures iter_s3_objects (token / response) fails fast instead of looping forever.
35-47:⚠️ Potential issue | 🟠 MajorNormalize trailing
Zbeforefromisoformat()for Python 3.10.
--since 2026-03-01T12:30:00Zis a common UTC ISO-8601 input, but Python 3.10 rejects the trailingZ. Normalize it before parsing.Python 3.10-compatible parsing
""" text = value.strip() + if text.endswith(("Z", "z")): + text = f"{text[:-1]}+00:00" if len(text) == 10: parsed = datetime.strptime(text, "%Y-%m-%d") return parsed.replace(tzinfo=timezone.utc)In Python 3.10, does datetime.datetime.fromisoformat support ISO 8601 timestamps ending with Z, such as 2026-03-01T12:30:00Z?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/dc_fetch.py` around lines 35 - 47, The parse_since_utc function currently calls datetime.fromisoformat(text) which fails on ISO strings that end with 'Z' (e.g., "2026-03-01T12:30:00Z") on Python 3.10; before calling datetime.fromisoformat in parse_since_utc, normalize a trailing 'Z' to an explicit UTC offset (e.g., replace a single trailing 'Z' with '+00:00') so fromisoformat accepts it, preserving the existing date-only branch and the subsequent tzinfo handling/astimezone logic.
50-80:⚠️ Potential issue | 🟡 MinorClean up docstrings to match the Python guideline.
Several new docstrings still use inline formatting markers such as backticks. Please convert these to plain reStructuredText-style docstrings without inline markers. As per coding guidelines,
**/*.py: Include docstrings for all functions and classes, following the reStructuredText style guide, without type information and without using inline formatting markers or backticks.Also applies to: 137-138, 169-170, 180-181, 188-194, 214-223, 269-270
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/dc_fetch.py` around lines 50 - 80, Update the docstrings in parse_key_timestamp_utc, parse_key_event_name, build_argument_parser and the other affected functions (around the mentioned ranges) to follow plain reStructuredText style: remove backticks and inline formatting (e.g. change "`<event>-<YYYYMMDDTHHmmss>-<uuid>.zip`" to simply "<event>-<YYYYMMDDTHHmmss>-<uuid>.zip" or a short sentence), keep them brief descriptive sentences without type annotations or inline code markup, and ensure each function/class has a clear one- or two-line description in rST form consistent with the project's docstring guidelines.
249-258:⚠️ Potential issue | 🟠 MajorAvoid basename collisions and partial ZIP leftovers.
Path(key).namedrops the host prefix, so different S3 keys can overwrite/collide locally. Writing directly to the final path also leaves truncated files after failed downloads, which the next run will treat as existing.Safer download pattern
import argparse import logging +import tempfile from datetime import datetime, timezone @@ - destination = output_dir / Path(key).name + safe_parts = [part for part in key.split("/") if part not in ("", ".", "..")] + if not safe_parts: + failed += 1 + logging.warning("Skipping unsafe S3 key %s", key) + continue + + destination = output_dir.joinpath(*safe_parts) if destination.exists(): skipped_existing += 1 continue + destination.parent.mkdir(parents=True, exist_ok=True) + tmp_file = tempfile.NamedTemporaryFile( + dir=str(destination.parent), + prefix=f".{destination.name}.", + suffix=".part", + delete=False, + ) + tmp_destination = Path(tmp_file.name) + tmp_file.close() try: - s3_client.download_file(bucket, key, str(destination)) + s3_client.download_file(bucket, key, str(tmp_destination)) + tmp_destination.replace(destination) downloaded += 1 except Exception: + tmp_destination.unlink(missing_ok=True) failed += 1 logging.exception("Failed to download key %s", key)If you preserve host subdirectories as above, update
src/odemis/util/test/dc_fetch_test.pyto expect the downloaded file underoutput_dir / "host" / ....🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/dc_fetch.py` around lines 249 - 258, The current download uses Path(key).name which flattens S3 keys and can cause collisions; also downloading directly to the final path leaves truncated files on failure. Change the destination logic to preserve the S3 key's path under the output_dir (e.g., destination = output_dir / Path(key)) and ensure destination.parent.mkdir(parents=True, exist_ok=True). Download to a temporary file in the same directory (e.g., destination.with_suffix('.tmp') or a temp file in destination.parent), call s3_client.download_file into that temp file, then atomically rename/move to destination only on success; keep the same counters (downloaded/failed/skipped) and logging (logging.exception) behavior. Finally update src/odemis/util/test/dc_fetch_test.py to expect the downloaded file under output_dir / "host" / ... (the preserved subdirectory path).src/odemis/util/test/datacollector_test.py (2)
52-52:⚠️ Potential issue | 🟠 MajorAvoid configuring global logging at test import time.
logging.basicConfig()mutates process-wide logging as soon as this test module is imported, which can make unrelated tests order-dependent. Move this underif __name__ == "__main__":if it is only for local diagnostics.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/test/datacollector_test.py` at line 52, The top-level call logging.basicConfig(level=logging.DEBUG) in the datacollector_test module mutates global logging at import time; remove that top-level call and instead configure logging only for local/manual runs by placing it under an if __name__ == "__main__": guard, or configure logging per-test using pytest fixtures/caplog inside the test functions; update/remove the logging.basicConfig invocation so tests no longer change process-wide logging on import.
536-542:⚠️ Potential issue | 🟠 MajorThis 10 ms wall-clock assertion is likely to flap.
A threshold this tight measures CI runner jitter more than the non-blocking contract. Prefer asserting that
record()enqueues/delegates work without serializing/uploading on the caller path, or use a much looser bound.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/test/datacollector_test.py` around lines 536 - 542, The wall-clock <10ms> assertion in test_record_returns_fast is flaky; instead verify record() delegates work non-blockingly by asserting it enqueues a task rather than completes the background work. Modify test_record_returns_fast to call self._collector.record("perf_test","1.0",{"image": arr}) and then assert that the collector's enqueue/queue mechanism received the task (for example check a public/internal queue size or mock/spy the enqueue method on self._collector) or mock the uploader to assert record returned immediately while background upload was invoked asynchronously; keep references to self._collector.record and the test_record_returns_fast name so the change is easy to locate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/odemis/gui/main.py`:
- Around line 436-441: When handling ConsentDialog results, after calling
_data_collector.set_consent(True/False) (and possibly after
_data_collector.postpone_consent()), update the MenuController checkbox state so
the Help → “Share data with Delmic” reflects the new consent immediately; locate
where ConsentDialog.RESULT_OPT_IN / RESULT_OPT_OUT are handled in main.py and
either call an existing MenuController refresh method or add and call a small
MenuController.refresh_share_data_checkbox() (or similar) that reads the data
collector’s consent and sets the checkbox state accordingly.
In `@src/odemis/util/test/dc_fetch_test.py`:
- Around line 175-180: Replace the hardcoded secret_key="secret" in the
S3UploadBackend instantiation inside dc_fetch_test.py with a clearly named test
constant (e.g., DUMMY_S3_SECRET) or add an inline Ruff suppression for S106 on
that expression; specifically update the S3UploadBackend(...) call (the
secret_key parameter) so it either uses a top-level DUMMY_S3_SECRET constant or
appends a trailing noqa: S106 comment to the secret_key literal to prevent Ruff
from treating it as a real credential.
---
Duplicate comments:
In `@src/odemis/util/datacollector.py`:
- Around line 697-699: The current validation calls image_format.upper() and can
raise AttributeError for non-string inputs; update the check in the
datacollector image format validation (the block that compares
image_format.upper() against _VALID_IMAGE_FORMATS) to first ensure image_format
is an instance of str (e.g., using isinstance(image_format, str)); if not, raise
the same ValueError with the existing message; then call image_format.upper()
and compare to _VALID_IMAGE_FORMATS as before so the function consistently
raises ValueError for invalid types or values.
- Around line 153-161: The persisted config writes "consent = none" which later
causes ConfigParser.getboolean() to raise ValueError on reload; update the
consent getter (the property/method that reads "consent" and any similar readers
around the locations referenced, e.g., lines handling consent at 185-190) to
tolerate an undecided value by reading the raw string (ConfigParser.get or get
with fallback), normalizing it (strip/lower), and returning None when it equals
"none" (or empty) instead of calling getboolean directly; alternatively catch
ValueError from getboolean and map it to None so undecided consent is handled
safely on config reload while keeping _write() unchanged.
- Around line 105-127: Update docstrings and add missing type hints per repo
rules: convert the DataCollectorConfig class docstring and any other docstrings
referenced (lines ~181-183, 445-446, 470-473, 536-537, 676-688) to
reStructuredText style without inline formatting markers or backticks, and add a
proper docstring for the helper function _ensure_section() describing purpose,
args, and return in reST. Add full type annotations to all function signatures
that lack them, specifically annotate _get_client() with its return type and
annotate parameters/return types for any functions missing hints in this module;
ensure symbol names match exactly (DataCollectorConfig, _ensure_section,
_get_client) so the changes are applied to the correct locations.
- Around line 311-356: The code uses raw event_name and payload keys to build
filenames/paths (safe_event, zip_name, arc_name) which can include path
traversal or separators; sanitize event_name and each payload key before using
them as filenames (e.g., remove/replace path separators, strip leading dots,
collapse .., enforce a safe charset and length, fallback to a safe token) and
use the sanitized value when building zip_name and arc_name (references:
safe_event, zip_name, arc_name, tmp_dir, extra_files, payload_meta). After
creating abs_path, ensure abs_path.resolve() is inside tmp_dir.resolve() (use
Path.resolve() and is_relative_to or equivalent) and reject or remap any keys
that would escape tmp_dir; also ensure the arcname stored in payload_meta is the
sanitized basename that will be used as the zip entry name to prevent unsafe
archive entries.
- Around line 480-496: The _run loop currently skips queue.get() while there are
pending uploads and it's backing off, allowing the unbounded in-memory
self._queue to grow; change _run so that even during backoff it repeatedly calls
self._queue.get(timeout=remaining_backoff) (or otherwise polls the queue) and
processes/drains items up to a small limit rather than sleeping unconditionally,
ensuring new records are consumed while waiting to retry uploads; reference the
_run method, self._queue, self._next_retry_at and the existing enqueue()
behavior and implement a bounded-drain or timed-get loop so pending upload
retries no longer starve queue consumption.
In `@src/odemis/util/dc_fetch.py`:
- Around line 137-150: In iter_s3_objects, guard against infinite loops when S3
returns IsTruncated=True but no NextContinuationToken: after calling
s3_client.list_objects_v2 and before repeating, check
response.get("IsTruncated") and if True and
response.get("NextContinuationToken") is falsy, raise an exception (e.g.,
RuntimeError) with a clear message including bucket and prefix; otherwise assign
token = response["NextContinuationToken"] as currently done. This ensures
iter_s3_objects (token / response) fails fast instead of looping forever.
- Around line 35-47: The parse_since_utc function currently calls
datetime.fromisoformat(text) which fails on ISO strings that end with 'Z' (e.g.,
"2026-03-01T12:30:00Z") on Python 3.10; before calling datetime.fromisoformat in
parse_since_utc, normalize a trailing 'Z' to an explicit UTC offset (e.g.,
replace a single trailing 'Z' with '+00:00') so fromisoformat accepts it,
preserving the existing date-only branch and the subsequent tzinfo
handling/astimezone logic.
- Around line 50-80: Update the docstrings in parse_key_timestamp_utc,
parse_key_event_name, build_argument_parser and the other affected functions
(around the mentioned ranges) to follow plain reStructuredText style: remove
backticks and inline formatting (e.g. change
"`<event>-<YYYYMMDDTHHmmss>-<uuid>.zip`" to simply
"<event>-<YYYYMMDDTHHmmss>-<uuid>.zip" or a short sentence), keep them brief
descriptive sentences without type annotations or inline code markup, and ensure
each function/class has a clear one- or two-line description in rST form
consistent with the project's docstring guidelines.
- Around line 249-258: The current download uses Path(key).name which flattens
S3 keys and can cause collisions; also downloading directly to the final path
leaves truncated files on failure. Change the destination logic to preserve the
S3 key's path under the output_dir (e.g., destination = output_dir / Path(key))
and ensure destination.parent.mkdir(parents=True, exist_ok=True). Download to a
temporary file in the same directory (e.g., destination.with_suffix('.tmp') or a
temp file in destination.parent), call s3_client.download_file into that temp
file, then atomically rename/move to destination only on success; keep the same
counters (downloaded/failed/skipped) and logging (logging.exception) behavior.
Finally update src/odemis/util/test/dc_fetch_test.py to expect the downloaded
file under output_dir / "host" / ... (the preserved subdirectory path).
In `@src/odemis/util/test/datacollector_test.py`:
- Line 52: The top-level call logging.basicConfig(level=logging.DEBUG) in the
datacollector_test module mutates global logging at import time; remove that
top-level call and instead configure logging only for local/manual runs by
placing it under an if __name__ == "__main__": guard, or configure logging
per-test using pytest fixtures/caplog inside the test functions; update/remove
the logging.basicConfig invocation so tests no longer change process-wide
logging on import.
- Around line 536-542: The wall-clock <10ms> assertion in
test_record_returns_fast is flaky; instead verify record() delegates work
non-blockingly by asserting it enqueues a task rather than completes the
background work. Modify test_record_returns_fast to call
self._collector.record("perf_test","1.0",{"image": arr}) and then assert that
the collector's enqueue/queue mechanism received the task (for example check a
public/internal queue size or mock/spy the enqueue method on self._collector) or
mock the uploader to assert record returned immediately while background upload
was invoked asynchronously; keep references to self._collector.record and the
test_record_returns_fast name so the change is easy to locate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ea331f3a-e979-4542-a096-c0d33f53d373
📒 Files selected for processing (9)
debian/controlscripts/odemis-dc-fetch.pysrc/odemis/gui/cont/menu.pysrc/odemis/gui/main.pysrc/odemis/gui/win/consent.pysrc/odemis/util/datacollector.pysrc/odemis/util/dc_fetch.pysrc/odemis/util/test/datacollector_test.pysrc/odemis/util/test/dc_fetch_test.py
✅ Files skipped from review due to trivial changes (3)
- debian/control
- scripts/odemis-dc-fetch.py
- src/odemis/gui/win/consent.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/odemis/gui/cont/menu.py
There was a problem hiding this comment.
🧹 Nitpick comments (5)
src/odemis/util/test/datacollector_test.py (1)
412-414: 💤 Low valueUse
raise ... from excfor proper exception chaining.When re-raising as
SkipTest, chain the original exception to preserve the traceback context.Suggested fix
try: import boto3 except ImportError as exc: - raise unittest.SkipTest(f"boto3 is required for real S3 integration tests: {exc}") + raise unittest.SkipTest(f"boto3 is required for real S3 integration tests: {exc}") from exc🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/test/datacollector_test.py` around lines 412 - 414, In the ImportError except block that currently does "except ImportError as exc: raise unittest.SkipTest(f\"boto3 is required for real S3 integration tests: {exc}\")", change the re-raise to use exception chaining by raising the SkipTest from the caught exception (i.e., "raise unittest.SkipTest(...) from exc") so the original ImportError traceback is preserved; update the except ImportError as exc handler around the boto3 import accordingly.src/odemis/util/datacollector.py (2)
473-483: 💤 Low valueReturn type annotation should use a proper type instead of string literal.
The return type
"boto3.client"is not a valid type annotation. UseAnyor a more specific type from boto3's type stubs.Suggested fix
- def _get_client(self) -> "boto3.client": + def _get_client(self) -> Any: """Return a cached boto3 S3 client."""Add
Anyto the existing typing imports at line 46:-from typing import Optional +from typing import Any, Optional🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 473 - 483, The return annotation for _get_client currently uses the invalid string literal "boto3.client"; update the annotation to a real typing (e.g., Any or a boto3 client stub type) and import it: add Any to the typing imports and change def _get_client(self) -> "boto3.client": to def _get_client(self) -> Any: (or to the specific boto3 type if available), referencing the existing _get_client method and self._client/boto3.client call so the function signature and imports are consistent.
381-385: 💤 Low valueAddress or remove the TODO comment.
The TODO at line 384 indicates missing logging for serialization failures. Either add the logging or remove the comment if the current behavior is intentional.
Would you like me to suggest appropriate logging for this fallback path?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/datacollector.py` around lines 381 - 385, Replace the TODO by logging the serialization failure and then falling back to "<unserializable>": inside the try/except around payload_meta[key] = str(value) log the exception including the key and a short representation of value (e.g. logger.exception or logger.debug with exc_info=True) so you capture the error details, and if no module logger exists add a standard logging.getLogger(__name__) at the top; keep the fallback assignment payload_meta[key] = "<unserializable>" and remove the TODO comment.src/odemis/util/dc_fetch.py (1)
52-53: 💤 Low valueDocstrings contain backticks which violate coding guidelines.
The docstrings use backtick formatting (e.g.,
`<event>-<YYYYMMDDTHHmmss>-<uuid>.zip`) which is not permitted per coding guidelines.Suggested fix
def parse_key_timestamp_utc(key: str) -> Optional[datetime]: - """Parse `<event>-<YYYYMMDDTHHmmss>-<uuid>.zip` timestamp from key basename.""" + """Parse event-timestamp-uuid.zip timestamp from key basename.""" ... def parse_key_event_name(key: str) -> Optional[str]: - """Parse event name from `<event>-<YYYYMMDDTHHmmss>-<uuid>.zip` key basename.""" + """Parse event name from event-timestamp-uuid.zip key basename."""As per coding guidelines "
**/*.py: Include docstrings for all functions and classes, following the reStructuredText style guide, without type information and without using inline formatting markers or backticks".Also applies to: 68-69
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/dc_fetch.py` around lines 52 - 53, The docstrings for parse_key_timestamp_utc (and the other function around lines 68-69) use backticks/inline formatting which violates the reStructuredText guideline; update these docstrings to plain reST text without backticks or inline markers and describe the pattern in normal text (e.g., state the filename pattern as <event>-YYYYMMDDTHHmmss-<uuid>.zip or "event-YYYYMMDDTHHmmss-uuid.zip") so they follow the project's docstring rules while keeping the same descriptive content.src/odemis/util/test/dc_fetch_test.py (1)
177-182: ⚡ Quick winSuppress Ruff S106 warning for dummy test credential.
The static analysis flags
secret_key="secret"as a potential hardcoded password. Add a suppression comment since this is clearly a test dummy value.Suggested fix
backend = S3UploadBackend( access_key="key", - secret_key="secret", + secret_key="secret", # noqa: S106 - dummy test credential region=S3_REGION, bucket="test-bucket", )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/odemis/util/test/dc_fetch_test.py` around lines 177 - 182, The test uses a dummy hardcoded credential in the S3UploadBackend instantiation (secret_key="secret") which triggers Ruff S106; add an inline suppression comment to that argument to silence the warning (e.g., append a "# noqa: S106" or " # ruff: noqa: S106" to the secret_key="secret" line). Locate the S3UploadBackend(...) block in the test (the secret_key parameter) and add the noqa ruff suppression to that line only so the dummy test credential is ignored by static analysis.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/odemis/util/datacollector.py`:
- Around line 473-483: The return annotation for _get_client currently uses the
invalid string literal "boto3.client"; update the annotation to a real typing
(e.g., Any or a boto3 client stub type) and import it: add Any to the typing
imports and change def _get_client(self) -> "boto3.client": to def
_get_client(self) -> Any: (or to the specific boto3 type if available),
referencing the existing _get_client method and self._client/boto3.client call
so the function signature and imports are consistent.
- Around line 381-385: Replace the TODO by logging the serialization failure and
then falling back to "<unserializable>": inside the try/except around
payload_meta[key] = str(value) log the exception including the key and a short
representation of value (e.g. logger.exception or logger.debug with
exc_info=True) so you capture the error details, and if no module logger exists
add a standard logging.getLogger(__name__) at the top; keep the fallback
assignment payload_meta[key] = "<unserializable>" and remove the TODO comment.
In `@src/odemis/util/dc_fetch.py`:
- Around line 52-53: The docstrings for parse_key_timestamp_utc (and the other
function around lines 68-69) use backticks/inline formatting which violates the
reStructuredText guideline; update these docstrings to plain reST text without
backticks or inline markers and describe the pattern in normal text (e.g., state
the filename pattern as <event>-YYYYMMDDTHHmmss-<uuid>.zip or
"event-YYYYMMDDTHHmmss-uuid.zip") so they follow the project's docstring rules
while keeping the same descriptive content.
In `@src/odemis/util/test/datacollector_test.py`:
- Around line 412-414: In the ImportError except block that currently does
"except ImportError as exc: raise unittest.SkipTest(f\"boto3 is required for
real S3 integration tests: {exc}\")", change the re-raise to use exception
chaining by raising the SkipTest from the caught exception (i.e., "raise
unittest.SkipTest(...) from exc") so the original ImportError traceback is
preserved; update the except ImportError as exc handler around the boto3 import
accordingly.
In `@src/odemis/util/test/dc_fetch_test.py`:
- Around line 177-182: The test uses a dummy hardcoded credential in the
S3UploadBackend instantiation (secret_key="secret") which triggers Ruff S106;
add an inline suppression comment to that argument to silence the warning (e.g.,
append a "# noqa: S106" or " # ruff: noqa: S106" to the secret_key="secret"
line). Locate the S3UploadBackend(...) block in the test (the secret_key
parameter) and add the noqa ruff suppression to that line only so the dummy test
credential is ignored by static analysis.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fd235b4d-d275-4adc-ae28-7efa9efc7b43
📒 Files selected for processing (7)
src/odemis/gui/cont/menu.pysrc/odemis/gui/main.pysrc/odemis/gui/win/consent.pysrc/odemis/util/datacollector.pysrc/odemis/util/dc_fetch.pysrc/odemis/util/test/datacollector_test.pysrc/odemis/util/test/dc_fetch_test.py
✅ Files skipped from review due to trivial changes (1)
- src/odemis/gui/win/consent.py
2e89fec to
84ee1ff
Compare
|
I think for such a big PR, a more elaborate description could help me review it a bit better. |
| class TestRealS3Integration(unittest.TestCase): | ||
| """Real S3 integration tests for Phase 2 upload workflow. | ||
|
|
||
| Tests use the credentials from ``_CREDENTIALS_PATH`` and upload to |
There was a problem hiding this comment.
No md in docstrings. I guess the rule is not picked up nicely @pieleric .
Style
For Python code, always use type hints for function parameters and return types.
Include docstrings for all functions and classes, following the reStructuredText style guide,
but without the type information and without using inline formatting markers or backticks.
| shutil.rmtree(str(tmp_dir), ignore_errors=True) | ||
|
|
||
|
|
||
| def _enforce_queue_limit(queue_dir: Path) -> None: |
There was a problem hiding this comment.
Feels like the 10% should be a parameter
|
|
||
| def _run(self) -> None: | ||
| """Main loop: process items until the thread is stopped.""" | ||
| while True: |
There was a problem hiding this comment.
Does this loop need to run without timeout, or is that built in some of the queue calls?
|
|
||
| # Default paths | ||
| _CONF_DIR = os.path.join(os.path.expanduser("~"), ".config", "odemis") | ||
| _DEFAULT_QUEUE_DIR = Path("/var/log/odemis/dc_queue") |
There was a problem hiding this comment.
That directory cannot be created by the user. So it won't work if the data collection framework runs from the GUI (which is the most typical case). I'd suggest to use ~/.local/share/odemis/dc_queue
| # Needed for communicating with AWS S3 | ||
| python3-boto3, |
There was a problem hiding this comment.
@K4rishma the code should stay Python 3.8 compatible. You should change the type annotation to be List instead of list and Dict instead of dict.
| " odemis-dc-fetch --since 2026-03-01T12:30:00 --event z_stack_acquired --output ./dc_samples" | ||
| ) | ||
| parser = argparse.ArgumentParser( | ||
| description="Fetch data-collection ZIP samples from S3.", |
There was a problem hiding this comment.
Indeed, this script is to be used only by the Delmic "data analysts". They have different access rights compared to the standard users. Standard users get a key (via the Odemis package, common to all users) that only give rights to upload data. They cannot list or read the data.
The data analysts will have a different key (one per analyst) that gives full access to the data, and with which this script will work.
@K4rishma , I'd suggest to add documentation. I think you can add a chapter to the developer's documentation. It should describe the basic model of the data collection access rights, and explain how to use the odemis-dc-fetch tool. In particular, it should explain where the keys are stored, and in which format.
| def build_argument_parser() -> argparse.ArgumentParser: | ||
| """Build CLI argument parser for `odemis-dc-fetch`.""" |
There was a problem hiding this comment.
I think the separation between code and script is odd. It does make sense to keep some of the low level functions in a separate file (which makes it easier to test). However, all the argument parsing and main function should be in the odemis-dc-fetch script.
There was a problem hiding this comment.
Don't place it in scripts/ , which is a folder that contains example of how to write python scripts using Odemis. Instead, it should be in util/ .
| else: | ||
| remind_line = "# reminder_date = " | ||
|
|
||
| content = ( |
There was a problem hiding this comment.
Indeed. Just use the ConfigParser.
| S3_REGION = "eu-west-1" | ||
|
|
||
| # Path to the S3 credentials key file (JSON with access_key / secret_key). | ||
| _CREDENTIALS_PATH = "/usr/share/odemis/datacollector.key" |
There was a problem hiding this comment.
The idea is to provide the same key for all systems, part of the Odemis package. That's a special key, which can only upload data. So it's relatively safe. It's similar to what we already do with the support ticket API (special key which only allows to create a new bug report).
| _MAX_RETRY_DELAY_SECONDS = 3600.0 | ||
| # Default delay before re-prompting for consent after postpone. | ||
| _DEFAULT_CONSENT_REMIND_DELTA = timedelta(days=30) | ||
| REMINDER_DATE_KEY = "reminder_date" |
There was a problem hiding this comment.
No need to have a constant for this. That's the same at the other entries of the INI file.
|
|
||
|
|
||
| # S3 bucket name — shared production bucket, created once by the dev team. | ||
| S3_BUCKET = "delmic-odemis-collect" |
There was a problem hiding this comment.
That's a decision I've made. I prefer having a simple code that uses a single platform for now. The code relying on it is just in two files, and very short (see UploadBackend). If we decide to switch platform, we'll update that code. It's simpler than writing a generic code that support multiple platforms right now... while we don't even know how useful the entire "data collection" feature will be.
| "failed": failed, | ||
| } | ||
|
|
||
| def main(argv: Optional[List[str]] = None) -> int: |
957af76 to
93e54ac
Compare
93e54ac to
478bf9a
Compare
Provides a thread-safe, non-blocking
DataCollector.record()call that any Odemis module can invoke to capture a labelled data sample. CLI interface is implemented to download the data from the cloud storage.