Skip to content

[feature][MSD-318] implement store annotated data framework#3444

Open
K4rishma wants to merge 6 commits into
delmic:masterfrom
K4rishma:store_annoted_data_p1
Open

[feature][MSD-318] implement store annotated data framework#3444
K4rishma wants to merge 6 commits into
delmic:masterfrom
K4rishma:store_annoted_data_p1

Conversation

@K4rishma
Copy link
Copy Markdown
Contributor

Provides a thread-safe, non-blocking DataCollector.record() call that any Odemis module can invoke to capture a labelled data sample. CLI interface is implemented to download the data from the cloud storage.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements an annotated data collection framework with asynchronous sample serialization/upload to S3, plus a CLI utility to fetch collected ZIP samples and GUI consent controls.

Changes:

  • Add DataCollector core (config persistence, serialization into ZIPs, background worker with retry/backoff, S3 upload backend).
  • Add S3 download helpers + CLI (odemis-dc-fetch) to filter/list/download samples from S3.
  • Integrate user consent flow into the GUI (startup prompt + Help menu toggle) and add unit/integration tests.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/odemis/util/datacollector.py New data collection core: config, ZIP serialization, background worker, S3 upload backend.
src/odemis/util/dc_fetch.py New retrieval helpers + CLI entrypoint for downloading samples from S3.
src/odemis/util/test/datacollector_test.py Tests for config, serialization, queue limiting, retry behavior, plus optional real-S3 integration tests.
src/odemis/util/test/dc_fetch_test.py Unit tests for fetch-side parsing/filtering/pagination and client builder overrides.
src/odemis/gui/win/consent.py New wx dialog for consent prompt (opt-in/opt-out/remind later).
src/odemis/gui/main.py Show consent dialog on startup; create and pass DataCollector to menu controller.
src/odemis/gui/cont/menu.py Add Help menu checkbox to toggle data sharing consent.
scripts/odemis-dc-fetch.py Script wrapper for the dc_fetch CLI entrypoint.
debian/control Add python3-boto3 dependency to support S3 operations.

Comment thread src/odemis/util/datacollector.py Outdated
Comment thread src/odemis/gui/main.py
Comment thread src/odemis/util/test/datacollector_test.py
Comment thread src/odemis/util/test/datacollector_test.py
Comment thread src/odemis/util/dc_fetch.py
Comment thread src/odemis/util/dc_fetch.py
Comment thread src/odemis/util/dc_fetch.py Outdated
Comment thread src/odemis/util/datacollector.py
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request adds a DataCollector subsystem, a consent UI, and a CLI fetch tool. It introduces a thread-safe DataCollector that serializes events to ZIP artifacts and uploads them to an S3-compatible backend, persistent consent/config handling (file under ~/.config/odemis), a startup consent dialog and a Help/About menu checkbox to toggle sharing, an odemis-dc-fetch-style CLI to download stored ZIPs from S3, a new executable entrypoint script, and packaging changes to depend on python3-boto3. Numerous unit and integration tests for serialization, queueing, retries, and fetching were added.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant OdemisApp as Odemis App
    participant DataCollector
    participant ConsentConfig as Consent Config
    participant ConsentDialog as Consent Dialog

    User->>OdemisApp: Launch application
    OdemisApp->>DataCollector: Initialize DataCollector()
    OdemisApp->>ConsentConfig: Load consent state
    ConsentConfig-->>OdemisApp: consent status (true/false/none)
    
    alt Should prompt for consent?
        OdemisApp->>OdemisApp: Schedule _maybe_show_consent_dialog()
        OdemisApp->>ConsentDialog: Show modal dialog
        User->>ConsentDialog: Click decision button
        
        alt User clicks "Opt in"
            ConsentDialog-->>OdemisApp: RESULT_OPT_IN
            OdemisApp->>ConsentConfig: set_consent(true)
        else User clicks "Opt out"
            ConsentDialog-->>OdemisApp: RESULT_OPT_OUT
            OdemisApp->>ConsentConfig: set_consent(false)
        else User clicks "Remind me later"
            ConsentDialog-->>OdemisApp: RESULT_REMIND_LATER
            OdemisApp->>ConsentConfig: postpone_consent()
        end
        
        ConsentConfig->>ConsentConfig: Save to config file
    else Already has consent decision
        OdemisApp->>OdemisApp: Skip dialog
    end
    
    OdemisApp->>OdemisApp: Continue startup
Loading
sequenceDiagram
    participant App as Odemis App
    participant DataCollector
    participant Serializer as _serialize
    participant Queue as Work Queue
    participant BackgroundWorker as _BackgroundWorker
    participant S3 as S3 Backend
    participant Storage as Local Storage

    App->>DataCollector: record(event_name, payload, ...)
    DataCollector->>DataCollector: Validate inputs
    DataCollector->>DataCollector: Check consent state
    
    alt Consent is true
        DataCollector->>Serializer: Serialize event to ZIP (metadata + sidecars)
        Serializer->>Storage: Create ZIP in staging dir
        Serializer-->>DataCollector: ZIP path
        DataCollector->>Queue: Enqueue work item
        DataCollector-->>App: Return (non-blocking)
        
        BackgroundWorker->>Queue: Dequeue work item
        BackgroundWorker->>S3: Upload ZIP to S3 (using S3UploadBackend)
        
        alt Upload succeeds
            S3-->>BackgroundWorker: Success
            BackgroundWorker->>Storage: Delete local ZIP
        else Upload fails
            S3-->>BackgroundWorker: Error
            BackgroundWorker->>BackgroundWorker: Schedule retry with exponential backoff
            BackgroundWorker->>Storage: Keep ZIP for retry
        end
    else Consent is false or unset
        DataCollector-->>App: Return (skip enqueue)
    end
Loading
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title clearly and specifically summarizes the main feature: implementing a data storage framework for annotated data, which aligns with the comprehensive changeset.
Description check ✅ Passed The description accurately describes the core changes: a thread-safe DataCollector.record() method and a CLI for downloading stored data from cloud storage, matching the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 80.29% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@debian/control`:
- Around line 78-79: Update the Python version metadata in debian/control to
require Python 3.9+: change the package dependency declaration "python3-all-dev
(>= 3.6.1)" and the "X-Python3-Version: >= 3.6" entry to at least ">= 3.9" so
that the new builtin generic annotations (e.g., list[str], tuple[Any, str]) are
supported; locate these fields in debian/control and bump their minimum version
to 3.9.

In `@src/odemis/gui/cont/menu.py`:
- Around line 46-54: The __init__ signature was extended with data_collector but
lacks a return type and updated docstring; update MainMenu.__init__ to include
"-> None" and expand its docstring to document the new data_collector parameter
(describe it in reStructuredText style without type annotations). Add explicit
type annotations and "-> None" to the helper methods
_append_data_sharing_menu_item and _on_toggle_data_sharing, and add
reStructuredText docstrings for each (brief description, parameters and returns
sections without type info) so all three functions (__init__,
_append_data_sharing_menu_item, _on_toggle_data_sharing) comply with the
project's typing and docstring guidelines.

In `@src/odemis/gui/win/consent.py`:
- Around line 16-59: Add reStructuredText docstrings to ConsentDialog's methods:
__init__, _on_opt_in, _on_opt_out, _on_remind_later, and _on_close; each
docstring should be a short one- or two-sentence description (no type info, no
backticks or inline markup) placed immediately under the def line, describing
the method's purpose and observable side effects (for example, that __init__
builds the dialog and sets up buttons, and the _on_*/_on_close handlers call
EndModal with the appropriate RESULT_*), and should not repeat type annotations.

In `@src/odemis/util/datacollector.py`:
- Around line 311-313: The code currently truncates item.event_name into
safe_event and uses payload keys directly for filenames (e.g., safe_event,
zip_name and the code that writes sidecar files for payload keys at the blocks
that reference payload keys around lines 326/335/352), but it never strips
path-traversal or invalid characters; update these sites to run event names and
payload keys through a filename sanitizer that: removes or replaces path
separators/backrefs ("/", "\", ".."), strips control chars, collapses sequences
of unsafe chars to a single safe char (e.g., "-"), enforces a post-sanitization
max length (e.g., 64 chars) and falls back to a safe default (like "event" or
"key") if empty; apply this sanitizer before constructing zip_name and before
using any payload key as a filename so all file creation uses sanitized,
truncated names.
- Line 483: The loop can skip calling queue.get() when pending ZIPs exist,
letting self._queue (queue.Queue) grow unbounded; to fix, make the queue bounded
(create self._queue = queue.Queue(maxsize=...) or use a size guard) and
restructure the consumer loop so queue.get() is always called (or a non-blocking
drain is performed) before any continue for pending ZIP handling; specifically,
ensure the code around the pending-ZIP check uses queue.get()/task_done() every
iteration (or blocks on put when full) so new records are either back-pressured
or consumed instead of accumulating.
- Around line 87-127: Add a reStructuredText-style docstring to the private
helper _ensure_section explaining its purpose, parameters, and return value (no
inline backticks or type markup), and update the _get_client function signature
to include an explicit return type annotation (use an appropriate type such as
boto3.client type or typing.Any/typing.Mapping if the concrete client type isn’t
importable) and add a reST docstring for it as well; ensure you remove
backticks/inline formatting from existing docstrings in this file (including
class DataCollectorConfig) and follow the repository rule to provide parameter
and return descriptions in plain reST text.
- Around line 153-161: The _write() method currently writes "consent = none"
when consent is unset which later causes the consent getter (the consent
property using getboolean("general", "consent")) to raise ValueError; update
_write() so that when self.consent is None it does not emit the "consent" option
(i.e., skip writing consent_line) or removes the option from the persisted
config instead of writing the literal "none", keeping the existing true/false
outputs for consent=True/False and ensuring readers using getboolean() won't
fail.

In `@src/odemis/util/dc_fetch.py`:
- Around line 249-258: The current loop writes directly to destination =
output_dir / Path(key).name which can cause basename collisions and leave
truncated files on failed downloads; change the logic in the loop around
s3_client.download_file so that you first create a unique temporary path next to
the intended destination (e.g., using tempfile.NamedTemporaryFile or mkstemp in
output_dir with a predictable suffix), call s3_client.download_file to write to
that temp path, on success atomically rename/move the temp file to destination
and increment downloaded, and on exception remove the temp file and increment
failed (also keep the existing skipped_existing check before attempting
download). Ensure you reference and update the same variables used here
(destination, output_dir, key, s3_client.download_file, skipped_existing,
downloaded, failed) so collisions and partial-file leftovers are avoided.

In `@src/odemis/util/test/datacollector_test.py`:
- Line 52: Remove the module-level call to logging.basicConfig() (currently
invoked as logging.basicConfig(level=logging.DEBUG)) so importing
datacollector_test.py doesn't mutate global logging; either move that call into
an if __name__ == "__main__": block for local debugging or configure logging in
a test-specific helper/fixture (or use pytest's caplog) to set logging level for
tests that need extra diagnostics. Ensure no module-level logging.basicConfig
call remains.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eaf1a595-4717-4bfc-a8b4-d42f1b7784eb

📥 Commits

Reviewing files that changed from the base of the PR and between 78e204b and 7094526.

📒 Files selected for processing (9)
  • debian/control
  • scripts/odemis-dc-fetch.py
  • src/odemis/gui/cont/menu.py
  • src/odemis/gui/main.py
  • src/odemis/gui/win/consent.py
  • src/odemis/util/datacollector.py
  • src/odemis/util/dc_fetch.py
  • src/odemis/util/test/datacollector_test.py
  • src/odemis/util/test/dc_fetch_test.py

Comment thread debian/control
Comment on lines +78 to +79
# Needed for communicating with AWS S3
python3-boto3,
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf 'Declared Python compatibility in debian/control:\n'
sed -n '18,20p' debian/control

printf '\nBuiltin generic annotations introduced by this PR:\n'
rg -n '\b(list|dict|tuple|set)\[[^]]+\]' \
  src/odemis/util/dc_fetch.py \
  src/odemis/util/test/dc_fetch_test.py \
  src/odemis/util/test/datacollector_test.py

Repository: delmic/odemis

Length of output: 600


🏁 Script executed:

#!/bin/bash

# Check for other Python 3.9+ features (e.g., type hints, match statements, etc.)
echo "=== Checking for union type hints (|) - Python 3.10+ syntax ==="
rg -n '\|\s*None' src/odemis/util/dc_fetch.py src/odemis/util/test/datacollector_test.py

echo "=== Checking for zoneinfo - Python 3.9+ ==="
rg -n 'from zoneinfo|import zoneinfo' src/odemis/util/dc_fetch.py src/odemis/util/test/datacollector_test.py

echo "=== All annotations added to dc_fetch.py and related tests ==="
grep -n 'def \|: ' src/odemis/util/dc_fetch.py | head -30

Repository: delmic/odemis

Length of output: 2102


Update debian/control to reflect Python 3.9+ requirement.

This PR introduces builtin generic annotations (list[str], tuple[Any, str]) which require Python 3.9+, but debian/control still declares X-Python3-Version: >= 3.6. The package will fail to import on Python 3.6–3.8. Update line 18 from python3-all-dev (>= 3.6.1) to at least >= 3.9.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@debian/control` around lines 78 - 79, Update the Python version metadata in
debian/control to require Python 3.9+: change the package dependency declaration
"python3-all-dev (>= 3.6.1)" and the "X-Python3-Version: >= 3.6" entry to at
least ">= 3.9" so that the new builtin generic annotations (e.g., list[str],
tuple[Any, str]) are supported; locate these fields in debian/control and bump
their minimum version to 3.9.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how to handle it

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

Let me check the codebase for the minimum Python version already in use and the exact annotations in the new files.


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== X-Python3-Version and Build-Depends in debian/control ==="
grep -n 'X-Python3-Version\|python3-all-dev' debian/control

echo ""
echo "=== Builtin generic annotations (requires Python 3.9+) in new files ==="
rg -n '\b(list|dict|tuple|set)\[' src/odemis/util/dc_fetch.py src/odemis/util/datacollector.py 2>/dev/null || echo "(none found)"

echo ""
echo "=== Existing from __future__ import annotations in codebase ==="
rg -rn 'from __future__ import annotations' src/odemis/ | head -10

echo ""
echo "=== Python 3.9+ syntax already in use elsewhere in src/odemis (sample) ==="
rg -rn '\b(list|dict|tuple|set)\[' src/odemis/ --include='*.py' -g '!test' | grep -v 'dc_fetch\|datacollector' | head -20

echo ""
echo "=== Check top of dc_fetch.py for existing __future__ import ==="
head -20 src/odemis/util/dc_fetch.py

Length of output: 544


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Top of dc_fetch.py (imports section) ==="
head -30 src/odemis/util/dc_fetch.py

echo ""
echo "=== Python 3.9+ builtin generics elsewhere in src/odemis (non-test, non-new files) ==="
rg -rn '\b(list|dict|tuple|set)\[' src/odemis/ --include='*.py' \
  --ignore-file=<(printf 'dc_fetch.py\ndatacollector.py\n') \
  -g '!*/test/*' | head -20

echo ""
echo "=== Python version declared in setup.py or pyproject.toml ==="
fd -e py setup.py | xargs grep -n 'python_requires' 2>/dev/null || echo "(not found in setup.py)"
fd pyproject.toml | xargs grep -n 'python_requires\|requires-python' 2>/dev/null || echo "(not found in pyproject.toml)"

echo ""
echo "=== Python 3.10+ union type hints (X | Y) used anywhere in new files ==="
rg -n '\w\s*\|\s*\w' src/odemis/util/dc_fetch.py src/odemis/util/datacollector.py 2>/dev/null | grep -v '#' | head -10 || echo "(none)"

echo ""
echo "=== What Python version does the target OS ship? (check /etc/os-release) ==="
cat /etc/os-release 2>/dev/null || echo "(not available in sandbox)"
python3 --version 2>/dev/null || echo "(python3 not available)"

Length of output: 1480


@K4rishma, no worries — here's a clear explanation of the issue and the easiest fix.


What's the problem?

Your new files use lowercase builtin generics like list[Path] and tuple[Any, str] directly in type annotations. This syntax (PEP 585) was only introduced in Python 3.9+. However, debian/control declares the package requires Python ≥ 3.6, so it would fail to import on older Python versions.


The easiest fix (minimal change)

dc_fetch.py already imports List, Tuple, etc. from typing at the top — you just need to use them consistently in the two function signatures where lowercase was slipped in:

In src/odemis/util/dc_fetch.py (lines 169 and 193), change:

# Before (Python 3.9+ only)
def create_s3_client_from_config(config: DataCollectorConfig) -> tuple[Any, str]:
    ...
def build_s3_client_from_config(...) -> tuple[Any, str]:
# After (Python 3.6+ compatible)
def create_s3_client_from_config(config: DataCollectorConfig) -> Tuple[Any, str]:
    ...
def build_s3_client_from_config(...) -> Tuple[Any, str]:

(Tuple is already imported from typing in that file — no new imports needed.)

In src/odemis/util/datacollector.py (line 527), change:

# Before (Python 3.9+ only)
def _pending_zip_paths(self, queue_dir: Path) -> list[Path]:
# After (Python 3.6+ compatible)
def _pending_zip_paths(self, queue_dir: Path) -> List[Path]:

And add List to the typing import in that file:

from typing import ..., List, ...

Alternative: bump the minimum Python version

If the project is already targeting a modern OS (e.g., Ubuntu 22.04 which ships Python 3.10, or Debian Bookworm which ships Python 3.11), you could instead bump the declared minimum in debian/control:

-X-Python3-Version: >= 3.6
+X-Python3-Version: >= 3.9

 Build-Depends: ...
-               python3-all-dev (>= 3.6.1)
+               python3-all-dev (>= 3.9)

But this is a wider project-level decision — I'd recommend going with the typing-based fix above as it's the smallest and safest change.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@K4rishma the code should stay Python 3.8 compatible. You should change the type annotation to be List instead of list and Dict instead of dict.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment thread src/odemis/gui/cont/menu.py Outdated
Comment thread src/odemis/gui/win/consent.py
Comment thread src/odemis/util/datacollector.py
Comment thread src/odemis/util/datacollector.py Outdated
Comment thread src/odemis/util/datacollector.py Outdated
Comment thread src/odemis/util/datacollector.py
Comment thread src/odemis/util/dc_fetch.py Outdated
Comment thread src/odemis/util/test/datacollector_test.py Outdated
Comment thread src/odemis/util/test/datacollector_test.py Outdated
@K4rishma K4rishma force-pushed the store_annoted_data_p1 branch from 7094526 to 48eba1b Compare April 13, 2026 11:47
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (7)
src/odemis/util/test/datacollector_test.py (2)

536-542: ⚠️ Potential issue | 🟠 Major

This 10 ms assertion is still too tight for CI.

That threshold mostly measures runner load and thread scheduling jitter, not whether record() stays non-blocking. It is likely to flap even when the implementation is fine.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/test/datacollector_test.py` around lines 536 - 542, The 10 ms
CI assertion in test_record_returns_fast is too tight and flakes; update the
test to use a much looser threshold (e.g. 50–200 ms) or assert that record()
returns "quickly" by averaging multiple runs or checking it returns under a
higher bound (suggestion: 100 ms) to avoid thread-scheduling jitter; modify the
test_record_returns_fast testcase that calls self._collector.record("perf_test",
"1.0", {"image": arr}) to use the relaxed threshold or an averaged timing
strategy so CI stability improves.

52-52: ⚠️ Potential issue | 🟠 Major

Avoid configuring global logging at import time.

Importing this test module still mutates process-wide logging, which makes suite behavior order-dependent and can leak verbosity/handlers into unrelated tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/test/datacollector_test.py` at line 52, Remove the
module-level call to logging.basicConfig(level=logging.DEBUG) in
datacollector_test.py and avoid mutating global logging at import time; instead,
configure logging locally within the specific test(s) that need debug output
(e.g., using the pytest caplog fixture or by calling
logging.getLogger(__name__).setLevel(logging.DEBUG) inside the test function or
setup/teardown), or use monkeypatch to adjust handlers for isolation so other
tests are not affected.
src/odemis/util/datacollector.py (4)

153-161: ⚠️ Potential issue | 🔴 Critical

Don't persist undecided consent as none.

_write() still emits consent = none, but consent still reads via ConfigParser.getboolean(). After a restart, that raises ValueError, so the consent flow can break as soon as the config is reloaded.

💡 Proposed fix
-        if consent_val is True:
-            consent_line = "consent = true"
-        elif consent_val is False:
-            consent_line = "consent = false"
-        else:
-            consent_line = "consent = none"
+        if consent_val is True:
+            consent_line = "consent = true"
+        elif consent_val is False:
+            consent_line = "consent = false"
+        else:
+            consent_line = "# consent = "
@@
     def consent(self) -> Optional[bool]:
-        """Return the consent state, or ``None`` if not yet set."""
+        """Return the consent state, or None if not yet set."""
         try:
-            return self._cp.getboolean("general", "consent")
-        except (configparser.NoSectionError, configparser.NoOptionError):
+            return self._cp.getboolean("general", "consent")
+        except (configparser.NoSectionError, configparser.NoOptionError, ValueError):
             return None

Also applies to: 186-190

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 153 - 161, The _write()
routine currently emits "consent = none" which later breaks
ConfigParser.getboolean(); change _write() so it only emits a consent line when
self.consent is explicitly True or False (set consent_line for True -> "consent
= true" and False -> "consent = false") and skip writing the consent key
entirely when consent is None/undecided; apply the same conditional approach to
the remind_date handling (use remind_val only when set, otherwise omit the
remind_date line) to avoid persisting invalid values that ConfigParser cannot
parse.

86-92: ⚠️ Potential issue | 🟡 Minor

The new module still misses the repo's Python docstring/type-hint rules.

_ensure_section() still has no docstring, _get_client() still has no return annotation, and several added docstrings still use inline backticks.

As per coding guidelines "**/*.py: Always use type hints for function parameters and return types in Python code" and "**/*.py: Include docstrings for all functions and classes, following the reStructuredText style guide, without type information and without using inline formatting markers or backticks".

Also applies to: 105-127, 181-183, 445-446

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 86 - 92, Add missing
reStructuredText-style docstrings and Python type hints: provide a proper
docstring for _ensure_section() (no inline backticks, follow rst conventions),
add a return type annotation for _get_client() and annotate all function
parameters/returns in the module (including the functions around the referenced
regions: the block containing lines 105-127, the functions at 181-183, and
445-446). Also update any docstrings that currently use inline backticks to
plain rst text (e.g., replace ``access_key``/``secret_key`` formatting with
:param: or simple names) so all functions/classes follow the repo rules for type
hints and rst docstrings.

311-313: ⚠️ Potential issue | 🟠 Major

Sanitize names before using them as archive paths.

event_name is only truncated, and payload keys are still used verbatim in sidecar filenames. Inputs containing /, \, or .. can escape tmp_dir or produce unsafe ZIP entry names.

Also applies to: 326-336, 352-354, 382-383

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 311 - 313, The code truncates
event_name to create zip_name but does not sanitize it or payload-derived
filenames, allowing `/`, `\`, `..`, or other unsafe characters to create
directory traversal or unsafe ZIP entries; update the logic that builds
safe_event and all sidecar filenames to (1) strip any path components (use the
basename), (2) normalize and whitelist characters (e.g., allow only
alphanumerics, dots, dashes and underscores, replacing everything else with
underscore), (3) collapse consecutive dots and trim length (keep the existing
64-char cap), and (4) ensure resulting names never start with a dot or slash;
apply this sanitization where safe_event and zip_name are built and to any
payload key→filename conversions used when writing sidecar files so that tmp_dir
and ZIP entries cannot be escaped.

533-563: ⚠️ Potential issue | 🟠 Major

Pending uploads still block queue draining.

When _process_pending_zips() returns True, _run() immediately continues and skips queue.get(). During a long outage/backoff window, record() keeps enqueueing into the unbounded in-memory queue while nothing is consumed.

Also applies to: 571-584

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 533 - 563, _pending_zip_paths
are causing _process_pending_zips to return True during backoff or init-failure,
which makes _run continue and skip queue.get() so the in-memory queue (fed by
record()) grows unbounded; change _process_pending_zips (and the similar block
at 571-584) so that when no upload work can proceed due to backoff (now <
self._next_retry_at) or backend init failure you return False (after any
sleep/log/schedule_retry) rather than True, allowing _run to call queue.get()
and drain/flush records to disk; keep returning True only when you actually
processed/removed pending zips (i.e., after successful _upload and unlink) so
the caller can skip immediate queue.get().
src/odemis/util/dc_fetch.py (1)

249-258: ⚠️ Potential issue | 🟠 Major

Download to a temp path and preserve the key path.

Using Path(key).name collapses host prefixes, so two different S3 keys can collide locally. Writing straight to the final destination also means a failed download can leave a truncated ZIP behind, and the next run will count it as skipped_existing instead of retrying it.

💡 Safer download flow
-            destination = output_dir / Path(key).name
+            destination = output_dir / Path(key)
+            destination.parent.mkdir(parents=True, exist_ok=True)
+            tmp_destination = destination.with_suffix(destination.suffix + ".part")
             if destination.exists():
                 skipped_existing += 1
                 continue
             try:
-                s3_client.download_file(bucket, key, str(destination))
+                s3_client.download_file(bucket, key, str(tmp_destination))
+                tmp_destination.replace(destination)
                 downloaded += 1
             except Exception:
+                tmp_destination.unlink(missing_ok=True)
                 failed += 1
                 logging.exception("Failed to download key %s", key)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/dc_fetch.py` around lines 249 - 258, The code uses
Path(key).name which flattens S3 keys and writes directly to the final path
(destination), causing filename collisions and leaving partially downloaded
files; change the flow in dc_fetch.py so that for each key you compute the full
relative path under output_dir (preserve the key's path, not just its name),
create its parent directories, download to a temporary file in the same
directory (e.g., destination.with_suffix(".part") or a tmp name), call
s3_client.download_file to that temp path, on success atomically rename/move the
temp file to the final destination and increment downloaded, on failure remove
the temp file, increment failed and log the exception (keep skipped_existing
only when the final destination already exists), and ensure you reference the
same variables in the existing block (destination, s3_client.download_file,
output_dir, skipped_existing, downloaded, failed) so behavior is atomic and
preserves key paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/odemis/util/datacollector.py`:
- Around line 620-653: DataCollector is missing get_consent_remind_days used by
the GUI; add a method def get_consent_remind_days(self) -> Optional[int]: that
mirrors the pattern of get_consent/should_prompt_for_consent: call
self._lazy_init(), return None or False if not self._init_ok, otherwise return
the value from the backing config (e.g. return
self._config.get_consent_remind_days() or self._config.consent_remind_days) and
include appropriate type hints and the same # type: ignore[union-attr] if
needed; reference DataCollector, its _lazy_init, _init_ok and _config attributes
to locate where to add it.

In `@src/odemis/util/dc_fetch.py`:
- Around line 35-38: The docstrings in this module (starting with
parse_since_utc) use inline backticks and inline type info which violates the
repo's Python docstring rules; update parse_since_utc and the other functions in
this file to use reStructuredText-style docstrings: remove inline backticks,
remove type annotations from the descriptive text, and document
parameters/returns using reST field markers (e.g., :param name: ..., :return:
...) with plain text descriptions; ensure each function/class has a
reST-compliant docstring without inline formatting markers.
- Around line 35-47: The parse_since_utc function fails on ISO strings with a
trailing "Z" in Python 3.10; update parse_since_utc to normalize UTC Z-suffixes
by replacing a trailing "Z" or "z" with "+00:00" before calling
datetime.fromisoformat, then continue the existing logic (keep the date-only
branch using datetime.strptime and the tzinfo handling/astimezone fallback) so
strings like "2026-03-01T12:30:00Z" parse correctly.

In `@src/odemis/util/test/datacollector_test.py`:
- Around line 352-357: The test doubles for the backend have the wrong upload
signature: update the test classes named _Backend used in datacollector_test so
their upload method matches production usage backend.upload(local_path,
remote_key) (accept two parameters, ignore/unused remote_key), keeping the
existing behavior of appending local_path.name to uploaded_names and raising
ConnectionError on first call; change both occurrences (the one exercising
retry/failures and the other at lines ~381-383) so _process_pending_zips() ->
_upload() calls will succeed without a TypeError.

---

Duplicate comments:
In `@src/odemis/util/datacollector.py`:
- Around line 153-161: The _write() routine currently emits "consent = none"
which later breaks ConfigParser.getboolean(); change _write() so it only emits a
consent line when self.consent is explicitly True or False (set consent_line for
True -> "consent = true" and False -> "consent = false") and skip writing the
consent key entirely when consent is None/undecided; apply the same conditional
approach to the remind_date handling (use remind_val only when set, otherwise
omit the remind_date line) to avoid persisting invalid values that ConfigParser
cannot parse.
- Around line 86-92: Add missing reStructuredText-style docstrings and Python
type hints: provide a proper docstring for _ensure_section() (no inline
backticks, follow rst conventions), add a return type annotation for
_get_client() and annotate all function parameters/returns in the module
(including the functions around the referenced regions: the block containing
lines 105-127, the functions at 181-183, and 445-446). Also update any
docstrings that currently use inline backticks to plain rst text (e.g., replace
``access_key``/``secret_key`` formatting with :param: or simple names) so all
functions/classes follow the repo rules for type hints and rst docstrings.
- Around line 311-313: The code truncates event_name to create zip_name but does
not sanitize it or payload-derived filenames, allowing `/`, `\`, `..`, or other
unsafe characters to create directory traversal or unsafe ZIP entries; update
the logic that builds safe_event and all sidecar filenames to (1) strip any path
components (use the basename), (2) normalize and whitelist characters (e.g.,
allow only alphanumerics, dots, dashes and underscores, replacing everything
else with underscore), (3) collapse consecutive dots and trim length (keep the
existing 64-char cap), and (4) ensure resulting names never start with a dot or
slash; apply this sanitization where safe_event and zip_name are built and to
any payload key→filename conversions used when writing sidecar files so that
tmp_dir and ZIP entries cannot be escaped.
- Around line 533-563: _pending_zip_paths are causing _process_pending_zips to
return True during backoff or init-failure, which makes _run continue and skip
queue.get() so the in-memory queue (fed by record()) grows unbounded; change
_process_pending_zips (and the similar block at 571-584) so that when no upload
work can proceed due to backoff (now < self._next_retry_at) or backend init
failure you return False (after any sleep/log/schedule_retry) rather than True,
allowing _run to call queue.get() and drain/flush records to disk; keep
returning True only when you actually processed/removed pending zips (i.e.,
after successful _upload and unlink) so the caller can skip immediate
queue.get().

In `@src/odemis/util/dc_fetch.py`:
- Around line 249-258: The code uses Path(key).name which flattens S3 keys and
writes directly to the final path (destination), causing filename collisions and
leaving partially downloaded files; change the flow in dc_fetch.py so that for
each key you compute the full relative path under output_dir (preserve the key's
path, not just its name), create its parent directories, download to a temporary
file in the same directory (e.g., destination.with_suffix(".part") or a tmp
name), call s3_client.download_file to that temp path, on success atomically
rename/move the temp file to the final destination and increment downloaded, on
failure remove the temp file, increment failed and log the exception (keep
skipped_existing only when the final destination already exists), and ensure you
reference the same variables in the existing block (destination,
s3_client.download_file, output_dir, skipped_existing, downloaded, failed) so
behavior is atomic and preserves key paths.

In `@src/odemis/util/test/datacollector_test.py`:
- Around line 536-542: The 10 ms CI assertion in test_record_returns_fast is too
tight and flakes; update the test to use a much looser threshold (e.g. 50–200
ms) or assert that record() returns "quickly" by averaging multiple runs or
checking it returns under a higher bound (suggestion: 100 ms) to avoid
thread-scheduling jitter; modify the test_record_returns_fast testcase that
calls self._collector.record("perf_test", "1.0", {"image": arr}) to use the
relaxed threshold or an averaged timing strategy so CI stability improves.
- Line 52: Remove the module-level call to
logging.basicConfig(level=logging.DEBUG) in datacollector_test.py and avoid
mutating global logging at import time; instead, configure logging locally
within the specific test(s) that need debug output (e.g., using the pytest
caplog fixture or by calling logging.getLogger(__name__).setLevel(logging.DEBUG)
inside the test function or setup/teardown), or use monkeypatch to adjust
handlers for isolation so other tests are not affected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4707cfbb-e587-4426-82ff-a92c77005543

📥 Commits

Reviewing files that changed from the base of the PR and between 7094526 and 48eba1b.

📒 Files selected for processing (9)
  • debian/control
  • scripts/odemis-dc-fetch.py
  • src/odemis/gui/cont/menu.py
  • src/odemis/gui/main.py
  • src/odemis/gui/win/consent.py
  • src/odemis/util/datacollector.py
  • src/odemis/util/dc_fetch.py
  • src/odemis/util/test/datacollector_test.py
  • src/odemis/util/test/dc_fetch_test.py
✅ Files skipped from review due to trivial changes (3)
  • debian/control
  • scripts/odemis-dc-fetch.py
  • src/odemis/gui/win/consent.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/odemis/gui/cont/menu.py

Comment thread src/odemis/util/datacollector.py
Comment thread src/odemis/util/dc_fetch.py Outdated
Comment thread src/odemis/util/dc_fetch.py
Comment thread src/odemis/util/test/datacollector_test.py
@K4rishma K4rishma force-pushed the store_annoted_data_p1 branch from 48eba1b to 06b4736 Compare April 20, 2026 12:12
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (9)
src/odemis/util/test/datacollector_test.py (3)

52-52: ⚠️ Potential issue | 🟠 Major

Avoid configuring global logging when the test module is imported.

logging.basicConfig() mutates process-wide logging for unrelated tests and can make suite behavior order-dependent. Keep this under if __name__ == "__main__": or use per-test logging controls.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/test/datacollector_test.py` at line 52, Remove the
module-level call to logging.basicConfig() (the logging.basicConfig line) so
importing datacollector_test doesn't mutate global logging; either move that
call into an if __name__ == "__main__": block for local debugging or remove it
entirely and rely on per-test logging control (pytest's caplog fixture or a
test-level logger configuration) within the test functions that need debug
output.

352-357: ⚠️ Potential issue | 🟠 Major

Match the backend double to the production upload signature.

_process_pending_zips() calls _upload(), which invokes backend.upload(local_path, remote_key). These doubles currently accept only local_path, so the tests fail before exercising retry/order behavior.

Fix the backend doubles
 class _Backend:
-    def upload(self, local_path: Path) -> None:
+    def upload(self, local_path: Path, remote_key: str) -> None:
+        del remote_key
         uploaded_names.append(local_path.name)
         if failures["count"] == 0:
             failures["count"] += 1
             raise ConnectionError("temporary network issue")
@@
 class _Backend:
-    def upload(self, local_path: Path) -> None:
+    def upload(self, local_path: Path, remote_key: str) -> None:
+        del remote_key
         uploaded.append(local_path.name)

Also applies to: 381-383

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/test/datacollector_test.py` around lines 352 - 357, The test
double `_Backend`'s upload method must match the production signature used by
`_process_pending_zips()` → `_upload()`, so change the `_Backend.upload`
definitions in the tests (the ones around the current `_Backend` class and the
second double at the later block) to accept two parameters `(local_path,
remote_key)` and ignore or use `remote_key` as needed; keep the existing
behavior of recording `local_path.name` and raising a `ConnectionError` on the
first call so retry/order behavior is still exercised.

536-542: ⚠️ Potential issue | 🟠 Major

Loosen this non-blocking assertion.

A 10 ms wall-clock bound will flap under CI load. Prefer asserting that record() enqueues work without doing serialization/upload inline, or use a much wider timing budget.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/test/datacollector_test.py` around lines 536 - 542, The test
test_record_returns_fast currently asserts a strict 10 ms wall-clock bound on
self._collector.record which is flaky; instead change the assertion to verify
record enqueues work rather than completes serialization/upload
synchronously—e.g., call self._collector.record("perf_test","1.0",{"image":arr})
and then assert the collector's work queue increased (inspect
_collector._queue.qsize() or a public enqueue counter) or mock the
serializer/uploader on the collector and assert its serialize/upload method was
not called synchronously; alternatively, if you must keep a timing check, widen
the threshold to a much larger value (e.g., 100–200 ms) and document why.
src/odemis/util/dc_fetch.py (3)

35-47: ⚠️ Potential issue | 🟠 Major

Normalize UTC Z suffixes before parsing.

datetime.fromisoformat() is not safe for --since ...Z on Python 3.10, so valid ISO-8601 UTC input can fail. Normalize Z/z to +00:00 before parsing.

Python 3.10 compatibility fix
 def parse_since_utc(value: str) -> datetime:
     """Parse a date/datetime string to UTC-aware datetime.
@@
     """
     text = value.strip()
+    if text.endswith(("Z", "z")):
+        text = f"{text[:-1]}+00:00"
     if len(text) == 10:
         parsed = datetime.strptime(text, "%Y-%m-%d")
         return parsed.replace(tzinfo=timezone.utc)
In Python 3.10, does datetime.datetime.fromisoformat("2026-03-01T12:30:00Z") raise ValueError, and is replacing a trailing Z with +00:00 the recommended compatibility workaround?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/dc_fetch.py` around lines 35 - 47, The parse_since_utc
function fails on ISO strings ending with "Z"/"z" under Python 3.10 because
datetime.fromisoformat doesn't accept "Z"; update parse_since_utc to normalize
the input before calling fromisoformat by replacing a trailing "Z" or "z" with
"+00:00" after trimming whitespace (keep existing logic for 10-char date-only
strings), then proceed with fromisoformat and the existing tzinfo
handling/astimezone behavior.

249-258: ⚠️ Potential issue | 🟠 Major

Download through a temp file and avoid basename collisions.

Path(key).name drops the host prefix, so different S3 keys can overwrite/skip each other locally. Writing directly to the final path can also leave a partial ZIP that later appears complete.

Safer download pattern
-            destination = output_dir / Path(key).name
+            destination = output_dir / Path(key)
+            destination.parent.mkdir(parents=True, exist_ok=True)
+            tmp_destination = destination.with_suffix(destination.suffix + ".part")
             if destination.exists():
                 skipped_existing += 1
                 continue
             try:
-                s3_client.download_file(bucket, key, str(destination))
+                s3_client.download_file(bucket, key, str(tmp_destination))
+                tmp_destination.replace(destination)
                 downloaded += 1
             except Exception:
+                tmp_destination.unlink(missing_ok=True)
                 failed += 1
                 logging.exception("Failed to download key %s", key)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/dc_fetch.py` around lines 249 - 258, The current download
uses Path(key).name which causes basename collisions and writes directly to the
final path; change the logic in the download loop (where destination,
output_dir, key, s3_client.download_file are referenced) to (1) compute a unique
safe local path for each S3 key (e.g., recreate the key’s relative path under
output_dir or append a short hash to the basename) and ensure parent directories
exist, (2) download to a temporary file in the same directory (e.g.,
destination.with_suffix(destination.suffix + ".part" or a tmp filename) using
s3_client.download_file, (3) on successful download atomically rename/move the
temp file to destination, and (4) only update downloaded/skipped/failed counters
after the rename; ensure exceptions remove temp files and log the key on
failure.

36-39: ⚠️ Potential issue | 🟡 Minor

Remove inline formatting markers from docstrings.

Several docstrings still use backticks around examples, command names, or return tuples. Convert these to plain reStructuredText field-style descriptions. As per coding guidelines, **/*.py: Include docstrings for all functions and classes, following the reStructuredText style guide, without type information and without using inline formatting markers or backticks.

Also applies to: 51-51, 68-68, 80-80, 138-138, 170-170, 194-194, 270-270

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/dc_fetch.py` around lines 36 - 39, The docstring for the
date/datetime parser currently contains inline backticks around examples and
should be converted to plain reStructuredText: remove backticks and rewrite
examples as normal text or field-style descriptions (e.g., "Accepts ISO-8601
date (YYYY-MM-DD) and datetime (YYYY-MM-DDTHH:MM:SS with optional timezone).")
in the docstring for the parsing function (the function whose docstring begins
"Parse a date/datetime string to UTC-aware datetime."); apply the same change to
the other listed docstrings (those at the same file positions noted in the
review) so no backticks or inline formatting markers remain and the text follows
reStructuredText/plain style.
src/odemis/util/datacollector.py (3)

297-388: ⚠️ Potential issue | 🟠 Major

Sanitize event names and payload keys before using them as filenames.

event_name is only truncated, and payload keys are used directly for sidecar paths. Values containing /, .., or path separators can escape the queue/temp directory or make serialization fail.

Sanitize file-name components before constructing ZIP and sidecars
+import re
+
+_SAFE_NAME_RE = re.compile(r"[^A-Za-z0-9._-]+")
+
+
+def _safe_name(value: object, default: str, max_len: int = 64) -> str:
+    """Return a filesystem-safe name component."""
+    cleaned = _SAFE_NAME_RE.sub("_", str(value)).strip("._-")
+    if not cleaned or cleaned in {".", ".."}:
+        cleaned = default
+    return cleaned[:max_len]
+
@@
-    safe_event = item.event_name[:64] if item.event_name else "event"
+    safe_event = _safe_name(item.event_name, default="event")
@@
         for key, value in item.payload.items():
+            safe_key = _safe_name(key, default="payload")
             if value is None or isinstance(value, (str, int, float, bool)):
                 payload_meta[key] = value
 
             elif isinstance(value, numpy.ndarray):
                 if item.image_format.upper() == "HDF5":
-                    arc_name = f"{key}.h5"
+                    arc_name = f"{safe_key}.h5"
@@
                 else:
-                    arc_name = f"{key}.ome.tiff"
+                    arc_name = f"{safe_key}.ome.tiff"
@@
             elif isinstance(value, (dict, list)):
-                arc_name = f"extra_{key}.json"
+                arc_name = f"extra_{safe_key}.json"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 297 - 388, The _serialize
function uses raw event_name and payload keys to build filenames (zip_name,
arc_name, extra_files entries) which allows path traversal or invalid filenames;
sanitize these components by normalizing and whitelisting characters (e.g.
replace or remove path separators, dots, control chars), reject or replace
dangerous sequences like ".." and leading slashes, and enforce a max length;
ensure arc_name is always a simple basename (no directories) before creating
abs_path in tmp_dir and before adding entries to payload_meta/extra_files; apply
the same sanitisation to zip_name (derived from safe_event) so produced
tmp_zip/final_zip and ZIP arc names cannot escape tmp_dir or queue_dir and never
create absolute paths.

480-576: ⚠️ Potential issue | 🟠 Major

Do not let pending uploads starve in-memory queue consumption.

When pending ZIPs exist or backoff is active, _run() continues before queue.get(). New record() calls then accumulate in the unbounded queue during outages.

Keep the queue bounded and consume it even when pending ZIPs exist
-        self._queue: queue.Queue = queue.Queue()
+        self._queue: queue.Queue = queue.Queue(maxsize=1000)
@@
         self._ensure_thread()
-        self._queue.put_nowait(item)
+        try:
+            self._queue.put_nowait(item)
+        except queue.Full:
+            logging.warning("DataCollector queue is full; dropping event '%s'", item.event_name)
@@
         while True:
             try:
-                if self._process_pending_zips(self._queue_dir):
-                    continue
+                self._process_pending_zips(self._queue_dir)
             except Exception:
                 logging.exception("DataCollector error while processing pending uploads")
                 self._schedule_retry()
-                continue
             try:
                 item = self._queue.get(timeout=1.0)
             except queue.Empty:
                 continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 480 - 576, The _run loop
currently prioritizes pending ZIP uploads and backoff, allowing the in-memory
queue to grow unbounded during outages; change _run so it always tries to
consume any queued _WorkItem before (or while) sleeping for pending uploads:
attempt a non-blocking or timed get from self._queue (use Queue.get_nowait() or
get(timeout=...)) inside the loop and call _process_work_item(item) for each
retrieved item, falling back to calling _process_pending_zips(self._queue_dir)
only when the queue is empty; ensure the lock/daemon behavior is unchanged and
preserve existing retry/backoff logic in _process_pending_zips and
_schedule_retry/_reset_retry.

144-191: ⚠️ Potential issue | 🔴 Critical

Do not persist unset consent as none.

_write() emits consent = none, but consent later reads the option with getboolean(), which raises for "none" after a restart. Comment out/remove the option for the undecided state, and tolerate legacy "none" when reading.

Persist undecided consent safely
-        else:
-            consent_line = "consent = none"
+        else:
+            consent_line = "# consent = "
@@
     `@property`
     def consent(self) -> Optional[bool]:
         """Return the consent state, or ``None`` if not yet set."""
         try:
-            return self._cp.getboolean("general", "consent")
+            raw_value = self._cp.get("general", "consent").strip().lower()
         except (configparser.NoSectionError, configparser.NoOptionError):
             return None
+        if raw_value in ("", "none"):
+            return None
+        try:
+            return self._cp.getboolean("general", "consent")
+        except ValueError:
+            logging.warning("Invalid data-collection consent value: %r", raw_value)
+            return None
#!/bin/bash
# Verify ConfigParser boolean parsing for the persisted undecided value.
python3 - <<'PY'
import configparser
cp = configparser.ConfigParser(interpolation=None)
cp.read_string("[general]\nconsent = none\n")
try:
    print(cp.getboolean("general", "consent"))
except Exception as exc:
    print(type(exc).__name__, str(exc))
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 144 - 191, The _write method
currently persists undecided consent as the literal "consent = none" which
breaks ConfigParser.getboolean in the consent property; modify _write to NOT
write an explicit "consent = none" (either omit the consent line or write it
commented) when self.consent is None, and keep writing explicit "consent =
true/false" for booleans; update the consent property to tolerate legacy "none"
by catching ValueError (or reading the raw string via _cp.get and returning None
when the value equals "none") so it returns None for undecided state instead of
propagating an exception; adjust references to _write and consent in tests/usage
as needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/odemis/util/datacollector.py`:
- Around line 697-699: The current validation calls image_format.upper() which
raises AttributeError for non-strings; add an explicit check that image_format
is an instance of str before calling .upper() and if not raise a ValueError with
the same informative message (referencing image_format and _VALID_IMAGE_FORMATS)
so callers get ValueError not AttributeError; modify the validation around the
image_format check (the code that uses image_format.upper() and
_VALID_IMAGE_FORMATS) to perform the isinstance(image_format, str) check first,
then proceed to compare image_format.upper() against _VALID_IMAGE_FORMATS.

In `@src/odemis/util/dc_fetch.py`:
- Around line 137-150: The iter_s3_objects paginator can spin indefinitely if
response["IsTruncated"] is True but there is no "NextContinuationToken"; update
iter_s3_objects to validate that when response.get("IsTruncated") is truthy you
also have a non-empty NextContinuationToken before assigning token and loop
continuing — if NextContinuationToken is missing or falsy, raise an exception
(or log and break) to fail fast and avoid repeating the same request; reference
the iter_s3_objects function and the token/response.get("NextContinuationToken")
logic when making the change.

---

Duplicate comments:
In `@src/odemis/util/datacollector.py`:
- Around line 297-388: The _serialize function uses raw event_name and payload
keys to build filenames (zip_name, arc_name, extra_files entries) which allows
path traversal or invalid filenames; sanitize these components by normalizing
and whitelisting characters (e.g. replace or remove path separators, dots,
control chars), reject or replace dangerous sequences like ".." and leading
slashes, and enforce a max length; ensure arc_name is always a simple basename
(no directories) before creating abs_path in tmp_dir and before adding entries
to payload_meta/extra_files; apply the same sanitisation to zip_name (derived
from safe_event) so produced tmp_zip/final_zip and ZIP arc names cannot escape
tmp_dir or queue_dir and never create absolute paths.
- Around line 480-576: The _run loop currently prioritizes pending ZIP uploads
and backoff, allowing the in-memory queue to grow unbounded during outages;
change _run so it always tries to consume any queued _WorkItem before (or while)
sleeping for pending uploads: attempt a non-blocking or timed get from
self._queue (use Queue.get_nowait() or get(timeout=...)) inside the loop and
call _process_work_item(item) for each retrieved item, falling back to calling
_process_pending_zips(self._queue_dir) only when the queue is empty; ensure the
lock/daemon behavior is unchanged and preserve existing retry/backoff logic in
_process_pending_zips and _schedule_retry/_reset_retry.
- Around line 144-191: The _write method currently persists undecided consent as
the literal "consent = none" which breaks ConfigParser.getboolean in the consent
property; modify _write to NOT write an explicit "consent = none" (either omit
the consent line or write it commented) when self.consent is None, and keep
writing explicit "consent = true/false" for booleans; update the consent
property to tolerate legacy "none" by catching ValueError (or reading the raw
string via _cp.get and returning None when the value equals "none") so it
returns None for undecided state instead of propagating an exception; adjust
references to _write and consent in tests/usage as needed.

In `@src/odemis/util/dc_fetch.py`:
- Around line 35-47: The parse_since_utc function fails on ISO strings ending
with "Z"/"z" under Python 3.10 because datetime.fromisoformat doesn't accept
"Z"; update parse_since_utc to normalize the input before calling fromisoformat
by replacing a trailing "Z" or "z" with "+00:00" after trimming whitespace (keep
existing logic for 10-char date-only strings), then proceed with fromisoformat
and the existing tzinfo handling/astimezone behavior.
- Around line 249-258: The current download uses Path(key).name which causes
basename collisions and writes directly to the final path; change the logic in
the download loop (where destination, output_dir, key, s3_client.download_file
are referenced) to (1) compute a unique safe local path for each S3 key (e.g.,
recreate the key’s relative path under output_dir or append a short hash to the
basename) and ensure parent directories exist, (2) download to a temporary file
in the same directory (e.g., destination.with_suffix(destination.suffix +
".part" or a tmp filename) using s3_client.download_file, (3) on successful
download atomically rename/move the temp file to destination, and (4) only
update downloaded/skipped/failed counters after the rename; ensure exceptions
remove temp files and log the key on failure.
- Around line 36-39: The docstring for the date/datetime parser currently
contains inline backticks around examples and should be converted to plain
reStructuredText: remove backticks and rewrite examples as normal text or
field-style descriptions (e.g., "Accepts ISO-8601 date (YYYY-MM-DD) and datetime
(YYYY-MM-DDTHH:MM:SS with optional timezone).") in the docstring for the parsing
function (the function whose docstring begins "Parse a date/datetime string to
UTC-aware datetime."); apply the same change to the other listed docstrings
(those at the same file positions noted in the review) so no backticks or inline
formatting markers remain and the text follows reStructuredText/plain style.

In `@src/odemis/util/test/datacollector_test.py`:
- Line 52: Remove the module-level call to logging.basicConfig() (the
logging.basicConfig line) so importing datacollector_test doesn't mutate global
logging; either move that call into an if __name__ == "__main__": block for
local debugging or remove it entirely and rely on per-test logging control
(pytest's caplog fixture or a test-level logger configuration) within the test
functions that need debug output.
- Around line 352-357: The test double `_Backend`'s upload method must match the
production signature used by `_process_pending_zips()` → `_upload()`, so change
the `_Backend.upload` definitions in the tests (the ones around the current
`_Backend` class and the second double at the later block) to accept two
parameters `(local_path, remote_key)` and ignore or use `remote_key` as needed;
keep the existing behavior of recording `local_path.name` and raising a
`ConnectionError` on the first call so retry/order behavior is still exercised.
- Around line 536-542: The test test_record_returns_fast currently asserts a
strict 10 ms wall-clock bound on self._collector.record which is flaky; instead
change the assertion to verify record enqueues work rather than completes
serialization/upload synchronously—e.g., call
self._collector.record("perf_test","1.0",{"image":arr}) and then assert the
collector's work queue increased (inspect _collector._queue.qsize() or a public
enqueue counter) or mock the serializer/uploader on the collector and assert its
serialize/upload method was not called synchronously; alternatively, if you must
keep a timing check, widen the threshold to a much larger value (e.g., 100–200
ms) and document why.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 34ecf67e-7de2-4810-b041-ae18054f6102

📥 Commits

Reviewing files that changed from the base of the PR and between 48eba1b and 06b4736.

📒 Files selected for processing (9)
  • debian/control
  • scripts/odemis-dc-fetch.py
  • src/odemis/gui/cont/menu.py
  • src/odemis/gui/main.py
  • src/odemis/gui/win/consent.py
  • src/odemis/util/datacollector.py
  • src/odemis/util/dc_fetch.py
  • src/odemis/util/test/datacollector_test.py
  • src/odemis/util/test/dc_fetch_test.py
✅ Files skipped from review due to trivial changes (2)
  • debian/control
  • scripts/odemis-dc-fetch.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/odemis/gui/cont/menu.py

Comment thread src/odemis/util/datacollector.py Outdated
Comment thread src/odemis/util/dc_fetch.py Outdated
Provides a thread-safe, non-blocking ``DataCollector.record()`` call that any Odemis module can invoke to capture a labelled data sample. CLI interface is implemented to download the data from the cloud storage.
@K4rishma K4rishma force-pushed the store_annoted_data_p1 branch from 06b4736 to 1f2c680 Compare April 21, 2026 08:38
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (11)
src/odemis/util/datacollector.py (5)

697-699: ⚠️ Potential issue | 🟡 Minor

Validate image_format is a string before calling .upper().

Passing None or another non-string currently raises AttributeError, not the documented ValueError.

Preserve the documented ValueError contract
-        if image_format.upper() not in _VALID_IMAGE_FORMATS:
+        if not isinstance(image_format, str) or image_format.upper() not in _VALID_IMAGE_FORMATS:
             raise ValueError(
                 f"image_format must be one of {_VALID_IMAGE_FORMATS}, got {image_format!r}"
             )
#!/bin/bash
# Verify why the current validation path raises the wrong exception type.
python3 - <<'PY'
try:
    None.upper()
except Exception as exc:
    print(type(exc).__name__, str(exc))
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 697 - 699, The current
validation calls image_format.upper() and can raise AttributeError for
non-string inputs; update the check in the datacollector image format validation
(the block that compares image_format.upper() against _VALID_IMAGE_FORMATS) to
first ensure image_format is an instance of str (e.g., using
isinstance(image_format, str)); if not, raise the same ValueError with the
existing message; then call image_format.upper() and compare to
_VALID_IMAGE_FORMATS as before so the function consistently raises ValueError
for invalid types or values.

153-161: ⚠️ Potential issue | 🔴 Critical

Handle persisted consent = none on config reload.

_write() emits consent = none, but consent later calls getboolean(), which raises ValueError for "none". After restart, config reads can fail instead of returning undecided consent.

Make the getter tolerate undecided consent
     `@property`
     def consent(self) -> Optional[bool]:
         """Return the consent state, or ``None`` if not yet set."""
         try:
-            return self._cp.getboolean("general", "consent")
+            raw = self._cp.get("general", "consent").strip().lower()
         except (configparser.NoSectionError, configparser.NoOptionError):
             return None
+        if raw in ("", "none"):
+            return None
+        try:
+            return self._cp.getboolean("general", "consent")
+        except ValueError:
+            logging.warning("Invalid data-collection consent value in config: %r", raw)
+            return None
#!/bin/bash
# Verify ConfigParser boolean parsing for the persisted "none" value.
python3 - <<'PY'
import configparser
cp = configparser.ConfigParser(interpolation=None)
cp.read_string("[general]\nconsent = none\n")
try:
    cp.getboolean("general", "consent")
except Exception as exc:
    print(type(exc).__name__, str(exc))
PY

Also applies to: 185-190

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 153 - 161, The persisted
config writes "consent = none" which later causes ConfigParser.getboolean() to
raise ValueError on reload; update the consent getter (the property/method that
reads "consent" and any similar readers around the locations referenced, e.g.,
lines handling consent at 185-190) to tolerate an undecided value by reading the
raw string (ConfigParser.get or get with fallback), normalizing it
(strip/lower), and returning None when it equals "none" (or empty) instead of
calling getboolean directly; alternatively catch ValueError from getboolean and
map it to None so undecided consent is handled safely on config reload while
keeping _write() unchanged.

105-127: ⚠️ Potential issue | 🟡 Minor

Bring the new docstrings/type hints in line with the repo rule.

This module still has inline docstring markers, _ensure_section() has no docstring, and _get_client() lacks a return type annotation. As per coding guidelines, **/*.py: Always use type hints for function parameters and return types in Python code; Include docstrings for all functions and classes, following the reStructuredText style guide, without type information and without using inline formatting markers or backticks.

Also applies to: 181-183, 445-446, 470-473, 536-537, 676-688

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 105 - 127, Update docstrings
and add missing type hints per repo rules: convert the DataCollectorConfig class
docstring and any other docstrings referenced (lines ~181-183, 445-446, 470-473,
536-537, 676-688) to reStructuredText style without inline formatting markers or
backticks, and add a proper docstring for the helper function _ensure_section()
describing purpose, args, and return in reST. Add full type annotations to all
function signatures that lack them, specifically annotate _get_client() with its
return type and annotate parameters/return types for any functions missing hints
in this module; ensure symbol names match exactly (DataCollectorConfig,
_ensure_section, _get_client) so the changes are applied to the correct
locations.

311-356: ⚠️ Potential issue | 🔴 Critical

Sanitize event names and payload keys before using them as paths.

event_name is only truncated, and payload keys are interpolated directly into sidecar paths. A key like /tmp/x or ../x can escape tmp_dir before the ZIP is built and can also create unsafe archive entries.

Sanitize filename components before file creation
 import queue
+import re
 import shutil
@@
 _DEFAULT_CONSENT_REMIND_DELTA = timedelta(days=30)
 REMINDER_DATE_KEY = "reminder_date"
+_SAFE_FILENAME_RE = re.compile(r"[^A-Za-z0-9._-]+")
+
+
+def _safe_filename_part(value: object, default: str, max_len: int = 64) -> str:
+    """Return a filesystem-safe filename component."""
+    try:
+        text = str(value)
+    except Exception:
+        text = default
+    cleaned = _SAFE_FILENAME_RE.sub("_", text).strip("._-")
+    return (cleaned or default)[:max_len]
@@
-    # Limit event_name length so the filename stays within filesystem limits.
-    safe_event = item.event_name[:64] if item.event_name else "event"
+    safe_event = _safe_filename_part(item.event_name, default="event", max_len=64)
@@
-        for key, value in item.payload.items():
+        for key, value in item.payload.items():
+            safe_key = _safe_filename_part(key, default="payload", max_len=64)
             if value is None or isinstance(value, (str, int, float, bool)):
                 payload_meta[key] = value
 
             elif isinstance(value, numpy.ndarray):
                 if item.image_format.upper() == "HDF5":
-                    arc_name = f"{key}.h5"
+                    arc_name = f"{safe_key}.h5"
@@
-                    arc_name = f"{key}.ome.tiff"
+                    arc_name = f"{safe_key}.ome.tiff"
@@
             elif isinstance(value, (dict, list)):
-                arc_name = f"extra_{key}.json"
+                arc_name = f"extra_{safe_key}.json"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 311 - 356, The code uses raw
event_name and payload keys to build filenames/paths (safe_event, zip_name,
arc_name) which can include path traversal or separators; sanitize event_name
and each payload key before using them as filenames (e.g., remove/replace path
separators, strip leading dots, collapse .., enforce a safe charset and length,
fallback to a safe token) and use the sanitized value when building zip_name and
arc_name (references: safe_event, zip_name, arc_name, tmp_dir, extra_files,
payload_meta). After creating abs_path, ensure abs_path.resolve() is inside
tmp_dir.resolve() (use Path.resolve() and is_relative_to or equivalent) and
reject or remap any keys that would escape tmp_dir; also ensure the arcname
stored in payload_meta is the sanitized basename that will be used as the zip
entry name to prevent unsafe archive entries.

480-496: ⚠️ Potential issue | 🟠 Major

Do not let pending uploads starve queue consumption.

When pending ZIPs exist, _run() continues before queue.get(). During an outage/backoff, new records can accumulate indefinitely in the unbounded in-memory queue.

Drain/limit the in-memory queue while pending uploads retry
 _VALID_IMAGE_FORMATS = ("TIFF", "HDF5")
+_MAX_IN_MEMORY_QUEUE_ITEMS = 1000
 _INITIAL_RETRY_DELAY_SECONDS = 30.0
@@
-        self._queue: queue.Queue = queue.Queue()
+        self._queue: queue.Queue = queue.Queue(maxsize=_MAX_IN_MEMORY_QUEUE_ITEMS)
@@
         """
         self._ensure_thread()
-        self._queue.put_nowait(item)
+        try:
+            self._queue.put_nowait(item)
+        except queue.Full:
+            logging.warning("DataCollector queue full; dropping event '%s'", item.event_name)
@@
     def _run(self) -> None:
         """Main loop: process items until the thread is stopped."""
         while True:
             try:
-                if self._process_pending_zips(self._queue_dir):
-                    continue
+                self._process_pending_zips(self._queue_dir)
             except Exception:
                 logging.exception("DataCollector error while processing pending uploads")
                 self._schedule_retry()
-                continue
             try:
                 item = self._queue.get(timeout=1.0)
             except queue.Empty:
                 continue

Also applies to: 571-584

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 480 - 496, The _run loop
currently skips queue.get() while there are pending uploads and it's backing
off, allowing the unbounded in-memory self._queue to grow; change _run so that
even during backoff it repeatedly calls
self._queue.get(timeout=remaining_backoff) (or otherwise polls the queue) and
processes/drains items up to a small limit rather than sleeping unconditionally,
ensuring new records are consumed while waiting to retry uploads; reference the
_run method, self._queue, self._next_retry_at and the existing enqueue()
behavior and implement a bounded-drain or timed-get loop so pending upload
retries no longer starve queue consumption.
src/odemis/util/dc_fetch.py (4)

137-150: ⚠️ Potential issue | 🟠 Major

Guard against truncated listings without a continuation token.

If an S3-compatible endpoint returns IsTruncated=True without NextContinuationToken, this loop repeats the same request forever. Fail fast instead.

Prevent infinite pagination
         if not response.get("IsTruncated"):
             break
         token = response.get("NextContinuationToken")
+        if not token:
+            raise RuntimeError("S3 listing was truncated without a continuation token")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/dc_fetch.py` around lines 137 - 150, In iter_s3_objects,
guard against infinite loops when S3 returns IsTruncated=True but no
NextContinuationToken: after calling s3_client.list_objects_v2 and before
repeating, check response.get("IsTruncated") and if True and
response.get("NextContinuationToken") is falsy, raise an exception (e.g.,
RuntimeError) with a clear message including bucket and prefix; otherwise assign
token = response["NextContinuationToken"] as currently done. This ensures
iter_s3_objects (token / response) fails fast instead of looping forever.

35-47: ⚠️ Potential issue | 🟠 Major

Normalize trailing Z before fromisoformat() for Python 3.10.

--since 2026-03-01T12:30:00Z is a common UTC ISO-8601 input, but Python 3.10 rejects the trailing Z. Normalize it before parsing.

Python 3.10-compatible parsing
     """
     text = value.strip()
+    if text.endswith(("Z", "z")):
+        text = f"{text[:-1]}+00:00"
     if len(text) == 10:
         parsed = datetime.strptime(text, "%Y-%m-%d")
         return parsed.replace(tzinfo=timezone.utc)
In Python 3.10, does datetime.datetime.fromisoformat support ISO 8601 timestamps ending with Z, such as 2026-03-01T12:30:00Z?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/dc_fetch.py` around lines 35 - 47, The parse_since_utc
function currently calls datetime.fromisoformat(text) which fails on ISO strings
that end with 'Z' (e.g., "2026-03-01T12:30:00Z") on Python 3.10; before calling
datetime.fromisoformat in parse_since_utc, normalize a trailing 'Z' to an
explicit UTC offset (e.g., replace a single trailing 'Z' with '+00:00') so
fromisoformat accepts it, preserving the existing date-only branch and the
subsequent tzinfo handling/astimezone logic.

50-80: ⚠️ Potential issue | 🟡 Minor

Clean up docstrings to match the Python guideline.

Several new docstrings still use inline formatting markers such as backticks. Please convert these to plain reStructuredText-style docstrings without inline markers. As per coding guidelines, **/*.py: Include docstrings for all functions and classes, following the reStructuredText style guide, without type information and without using inline formatting markers or backticks.

Also applies to: 137-138, 169-170, 180-181, 188-194, 214-223, 269-270

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/dc_fetch.py` around lines 50 - 80, Update the docstrings in
parse_key_timestamp_utc, parse_key_event_name, build_argument_parser and the
other affected functions (around the mentioned ranges) to follow plain
reStructuredText style: remove backticks and inline formatting (e.g. change
"`<event>-<YYYYMMDDTHHmmss>-<uuid>.zip`" to simply
"<event>-<YYYYMMDDTHHmmss>-<uuid>.zip" or a short sentence), keep them brief
descriptive sentences without type annotations or inline code markup, and ensure
each function/class has a clear one- or two-line description in rST form
consistent with the project's docstring guidelines.

249-258: ⚠️ Potential issue | 🟠 Major

Avoid basename collisions and partial ZIP leftovers.

Path(key).name drops the host prefix, so different S3 keys can overwrite/collide locally. Writing directly to the final path also leaves truncated files after failed downloads, which the next run will treat as existing.

Safer download pattern
 import argparse
 import logging
+import tempfile
 from datetime import datetime, timezone
@@
-            destination = output_dir / Path(key).name
+            safe_parts = [part for part in key.split("/") if part not in ("", ".", "..")]
+            if not safe_parts:
+                failed += 1
+                logging.warning("Skipping unsafe S3 key %s", key)
+                continue
+
+            destination = output_dir.joinpath(*safe_parts)
             if destination.exists():
                 skipped_existing += 1
                 continue
+            destination.parent.mkdir(parents=True, exist_ok=True)
+            tmp_file = tempfile.NamedTemporaryFile(
+                dir=str(destination.parent),
+                prefix=f".{destination.name}.",
+                suffix=".part",
+                delete=False,
+            )
+            tmp_destination = Path(tmp_file.name)
+            tmp_file.close()
             try:
-                s3_client.download_file(bucket, key, str(destination))
+                s3_client.download_file(bucket, key, str(tmp_destination))
+                tmp_destination.replace(destination)
                 downloaded += 1
             except Exception:
+                tmp_destination.unlink(missing_ok=True)
                 failed += 1
                 logging.exception("Failed to download key %s", key)

If you preserve host subdirectories as above, update src/odemis/util/test/dc_fetch_test.py to expect the downloaded file under output_dir / "host" / ....

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/dc_fetch.py` around lines 249 - 258, The current download
uses Path(key).name which flattens S3 keys and can cause collisions; also
downloading directly to the final path leaves truncated files on failure. Change
the destination logic to preserve the S3 key's path under the output_dir (e.g.,
destination = output_dir / Path(key)) and ensure
destination.parent.mkdir(parents=True, exist_ok=True). Download to a temporary
file in the same directory (e.g., destination.with_suffix('.tmp') or a temp file
in destination.parent), call s3_client.download_file into that temp file, then
atomically rename/move to destination only on success; keep the same counters
(downloaded/failed/skipped) and logging (logging.exception) behavior. Finally
update src/odemis/util/test/dc_fetch_test.py to expect the downloaded file under
output_dir / "host" / ... (the preserved subdirectory path).
src/odemis/util/test/datacollector_test.py (2)

52-52: ⚠️ Potential issue | 🟠 Major

Avoid configuring global logging at test import time.

logging.basicConfig() mutates process-wide logging as soon as this test module is imported, which can make unrelated tests order-dependent. Move this under if __name__ == "__main__": if it is only for local diagnostics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/test/datacollector_test.py` at line 52, The top-level call
logging.basicConfig(level=logging.DEBUG) in the datacollector_test module
mutates global logging at import time; remove that top-level call and instead
configure logging only for local/manual runs by placing it under an if __name__
== "__main__": guard, or configure logging per-test using pytest fixtures/caplog
inside the test functions; update/remove the logging.basicConfig invocation so
tests no longer change process-wide logging on import.

536-542: ⚠️ Potential issue | 🟠 Major

This 10 ms wall-clock assertion is likely to flap.

A threshold this tight measures CI runner jitter more than the non-blocking contract. Prefer asserting that record() enqueues/delegates work without serializing/uploading on the caller path, or use a much looser bound.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/test/datacollector_test.py` around lines 536 - 542, The
wall-clock <10ms> assertion in test_record_returns_fast is flaky; instead verify
record() delegates work non-blockingly by asserting it enqueues a task rather
than completes the background work. Modify test_record_returns_fast to call
self._collector.record("perf_test","1.0",{"image": arr}) and then assert that
the collector's enqueue/queue mechanism received the task (for example check a
public/internal queue size or mock/spy the enqueue method on self._collector) or
mock the uploader to assert record returned immediately while background upload
was invoked asynchronously; keep references to self._collector.record and the
test_record_returns_fast name so the change is easy to locate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/odemis/gui/main.py`:
- Around line 436-441: When handling ConsentDialog results, after calling
_data_collector.set_consent(True/False) (and possibly after
_data_collector.postpone_consent()), update the MenuController checkbox state so
the Help → “Share data with Delmic” reflects the new consent immediately; locate
where ConsentDialog.RESULT_OPT_IN / RESULT_OPT_OUT are handled in main.py and
either call an existing MenuController refresh method or add and call a small
MenuController.refresh_share_data_checkbox() (or similar) that reads the data
collector’s consent and sets the checkbox state accordingly.

In `@src/odemis/util/test/dc_fetch_test.py`:
- Around line 175-180: Replace the hardcoded secret_key="secret" in the
S3UploadBackend instantiation inside dc_fetch_test.py with a clearly named test
constant (e.g., DUMMY_S3_SECRET) or add an inline Ruff suppression for S106 on
that expression; specifically update the S3UploadBackend(...) call (the
secret_key parameter) so it either uses a top-level DUMMY_S3_SECRET constant or
appends a trailing noqa: S106 comment to the secret_key literal to prevent Ruff
from treating it as a real credential.

---

Duplicate comments:
In `@src/odemis/util/datacollector.py`:
- Around line 697-699: The current validation calls image_format.upper() and can
raise AttributeError for non-string inputs; update the check in the
datacollector image format validation (the block that compares
image_format.upper() against _VALID_IMAGE_FORMATS) to first ensure image_format
is an instance of str (e.g., using isinstance(image_format, str)); if not, raise
the same ValueError with the existing message; then call image_format.upper()
and compare to _VALID_IMAGE_FORMATS as before so the function consistently
raises ValueError for invalid types or values.
- Around line 153-161: The persisted config writes "consent = none" which later
causes ConfigParser.getboolean() to raise ValueError on reload; update the
consent getter (the property/method that reads "consent" and any similar readers
around the locations referenced, e.g., lines handling consent at 185-190) to
tolerate an undecided value by reading the raw string (ConfigParser.get or get
with fallback), normalizing it (strip/lower), and returning None when it equals
"none" (or empty) instead of calling getboolean directly; alternatively catch
ValueError from getboolean and map it to None so undecided consent is handled
safely on config reload while keeping _write() unchanged.
- Around line 105-127: Update docstrings and add missing type hints per repo
rules: convert the DataCollectorConfig class docstring and any other docstrings
referenced (lines ~181-183, 445-446, 470-473, 536-537, 676-688) to
reStructuredText style without inline formatting markers or backticks, and add a
proper docstring for the helper function _ensure_section() describing purpose,
args, and return in reST. Add full type annotations to all function signatures
that lack them, specifically annotate _get_client() with its return type and
annotate parameters/return types for any functions missing hints in this module;
ensure symbol names match exactly (DataCollectorConfig, _ensure_section,
_get_client) so the changes are applied to the correct locations.
- Around line 311-356: The code uses raw event_name and payload keys to build
filenames/paths (safe_event, zip_name, arc_name) which can include path
traversal or separators; sanitize event_name and each payload key before using
them as filenames (e.g., remove/replace path separators, strip leading dots,
collapse .., enforce a safe charset and length, fallback to a safe token) and
use the sanitized value when building zip_name and arc_name (references:
safe_event, zip_name, arc_name, tmp_dir, extra_files, payload_meta). After
creating abs_path, ensure abs_path.resolve() is inside tmp_dir.resolve() (use
Path.resolve() and is_relative_to or equivalent) and reject or remap any keys
that would escape tmp_dir; also ensure the arcname stored in payload_meta is the
sanitized basename that will be used as the zip entry name to prevent unsafe
archive entries.
- Around line 480-496: The _run loop currently skips queue.get() while there are
pending uploads and it's backing off, allowing the unbounded in-memory
self._queue to grow; change _run so that even during backoff it repeatedly calls
self._queue.get(timeout=remaining_backoff) (or otherwise polls the queue) and
processes/drains items up to a small limit rather than sleeping unconditionally,
ensuring new records are consumed while waiting to retry uploads; reference the
_run method, self._queue, self._next_retry_at and the existing enqueue()
behavior and implement a bounded-drain or timed-get loop so pending upload
retries no longer starve queue consumption.

In `@src/odemis/util/dc_fetch.py`:
- Around line 137-150: In iter_s3_objects, guard against infinite loops when S3
returns IsTruncated=True but no NextContinuationToken: after calling
s3_client.list_objects_v2 and before repeating, check
response.get("IsTruncated") and if True and
response.get("NextContinuationToken") is falsy, raise an exception (e.g.,
RuntimeError) with a clear message including bucket and prefix; otherwise assign
token = response["NextContinuationToken"] as currently done. This ensures
iter_s3_objects (token / response) fails fast instead of looping forever.
- Around line 35-47: The parse_since_utc function currently calls
datetime.fromisoformat(text) which fails on ISO strings that end with 'Z' (e.g.,
"2026-03-01T12:30:00Z") on Python 3.10; before calling datetime.fromisoformat in
parse_since_utc, normalize a trailing 'Z' to an explicit UTC offset (e.g.,
replace a single trailing 'Z' with '+00:00') so fromisoformat accepts it,
preserving the existing date-only branch and the subsequent tzinfo
handling/astimezone logic.
- Around line 50-80: Update the docstrings in parse_key_timestamp_utc,
parse_key_event_name, build_argument_parser and the other affected functions
(around the mentioned ranges) to follow plain reStructuredText style: remove
backticks and inline formatting (e.g. change
"`<event>-<YYYYMMDDTHHmmss>-<uuid>.zip`" to simply
"<event>-<YYYYMMDDTHHmmss>-<uuid>.zip" or a short sentence), keep them brief
descriptive sentences without type annotations or inline code markup, and ensure
each function/class has a clear one- or two-line description in rST form
consistent with the project's docstring guidelines.
- Around line 249-258: The current download uses Path(key).name which flattens
S3 keys and can cause collisions; also downloading directly to the final path
leaves truncated files on failure. Change the destination logic to preserve the
S3 key's path under the output_dir (e.g., destination = output_dir / Path(key))
and ensure destination.parent.mkdir(parents=True, exist_ok=True). Download to a
temporary file in the same directory (e.g., destination.with_suffix('.tmp') or a
temp file in destination.parent), call s3_client.download_file into that temp
file, then atomically rename/move to destination only on success; keep the same
counters (downloaded/failed/skipped) and logging (logging.exception) behavior.
Finally update src/odemis/util/test/dc_fetch_test.py to expect the downloaded
file under output_dir / "host" / ... (the preserved subdirectory path).

In `@src/odemis/util/test/datacollector_test.py`:
- Line 52: The top-level call logging.basicConfig(level=logging.DEBUG) in the
datacollector_test module mutates global logging at import time; remove that
top-level call and instead configure logging only for local/manual runs by
placing it under an if __name__ == "__main__": guard, or configure logging
per-test using pytest fixtures/caplog inside the test functions; update/remove
the logging.basicConfig invocation so tests no longer change process-wide
logging on import.
- Around line 536-542: The wall-clock <10ms> assertion in
test_record_returns_fast is flaky; instead verify record() delegates work
non-blockingly by asserting it enqueues a task rather than completes the
background work. Modify test_record_returns_fast to call
self._collector.record("perf_test","1.0",{"image": arr}) and then assert that
the collector's enqueue/queue mechanism received the task (for example check a
public/internal queue size or mock/spy the enqueue method on self._collector) or
mock the uploader to assert record returned immediately while background upload
was invoked asynchronously; keep references to self._collector.record and the
test_record_returns_fast name so the change is easy to locate.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ea331f3a-e979-4542-a096-c0d33f53d373

📥 Commits

Reviewing files that changed from the base of the PR and between 06b4736 and 1f2c680.

📒 Files selected for processing (9)
  • debian/control
  • scripts/odemis-dc-fetch.py
  • src/odemis/gui/cont/menu.py
  • src/odemis/gui/main.py
  • src/odemis/gui/win/consent.py
  • src/odemis/util/datacollector.py
  • src/odemis/util/dc_fetch.py
  • src/odemis/util/test/datacollector_test.py
  • src/odemis/util/test/dc_fetch_test.py
✅ Files skipped from review due to trivial changes (3)
  • debian/control
  • scripts/odemis-dc-fetch.py
  • src/odemis/gui/win/consent.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/odemis/gui/cont/menu.py

Comment thread src/odemis/gui/main.py
Comment thread src/odemis/util/test/dc_fetch_test.py
@K4rishma K4rishma changed the title [feature] implement store annotated data framework [feature][MSD-318] implement store annotated data framework Apr 22, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (5)
src/odemis/util/test/datacollector_test.py (1)

412-414: 💤 Low value

Use raise ... from exc for proper exception chaining.

When re-raising as SkipTest, chain the original exception to preserve the traceback context.

Suggested fix
         try:
             import boto3
         except ImportError as exc:
-            raise unittest.SkipTest(f"boto3 is required for real S3 integration tests: {exc}")
+            raise unittest.SkipTest(f"boto3 is required for real S3 integration tests: {exc}") from exc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/test/datacollector_test.py` around lines 412 - 414, In the
ImportError except block that currently does "except ImportError as exc: raise
unittest.SkipTest(f\"boto3 is required for real S3 integration tests: {exc}\")",
change the re-raise to use exception chaining by raising the SkipTest from the
caught exception (i.e., "raise unittest.SkipTest(...) from exc") so the original
ImportError traceback is preserved; update the except ImportError as exc handler
around the boto3 import accordingly.
src/odemis/util/datacollector.py (2)

473-483: 💤 Low value

Return type annotation should use a proper type instead of string literal.

The return type "boto3.client" is not a valid type annotation. Use Any or a more specific type from boto3's type stubs.

Suggested fix
-    def _get_client(self) -> "boto3.client":
+    def _get_client(self) -> Any:
         """Return a cached boto3 S3 client."""

Add Any to the existing typing imports at line 46:

-from typing import Optional
+from typing import Any, Optional
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 473 - 483, The return
annotation for _get_client currently uses the invalid string literal
"boto3.client"; update the annotation to a real typing (e.g., Any or a boto3
client stub type) and import it: add Any to the typing imports and change def
_get_client(self) -> "boto3.client": to def _get_client(self) -> Any: (or to the
specific boto3 type if available), referencing the existing _get_client method
and self._client/boto3.client call so the function signature and imports are
consistent.

381-385: 💤 Low value

Address or remove the TODO comment.

The TODO at line 384 indicates missing logging for serialization failures. Either add the logging or remove the comment if the current behavior is intentional.

Would you like me to suggest appropriate logging for this fallback path?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/datacollector.py` around lines 381 - 385, Replace the TODO by
logging the serialization failure and then falling back to "<unserializable>":
inside the try/except around payload_meta[key] = str(value) log the exception
including the key and a short representation of value (e.g. logger.exception or
logger.debug with exc_info=True) so you capture the error details, and if no
module logger exists add a standard logging.getLogger(__name__) at the top; keep
the fallback assignment payload_meta[key] = "<unserializable>" and remove the
TODO comment.
src/odemis/util/dc_fetch.py (1)

52-53: 💤 Low value

Docstrings contain backticks which violate coding guidelines.

The docstrings use backtick formatting (e.g., `<event>-<YYYYMMDDTHHmmss>-<uuid>.zip`) which is not permitted per coding guidelines.

Suggested fix
 def parse_key_timestamp_utc(key: str) -> Optional[datetime]:
-    """Parse `<event>-<YYYYMMDDTHHmmss>-<uuid>.zip` timestamp from key basename."""
+    """Parse event-timestamp-uuid.zip timestamp from key basename."""
...
 def parse_key_event_name(key: str) -> Optional[str]:
-    """Parse event name from `<event>-<YYYYMMDDTHHmmss>-<uuid>.zip` key basename."""
+    """Parse event name from event-timestamp-uuid.zip key basename."""

As per coding guidelines "**/*.py: Include docstrings for all functions and classes, following the reStructuredText style guide, without type information and without using inline formatting markers or backticks".

Also applies to: 68-69

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/dc_fetch.py` around lines 52 - 53, The docstrings for
parse_key_timestamp_utc (and the other function around lines 68-69) use
backticks/inline formatting which violates the reStructuredText guideline;
update these docstrings to plain reST text without backticks or inline markers
and describe the pattern in normal text (e.g., state the filename pattern as
<event>-YYYYMMDDTHHmmss-<uuid>.zip or "event-YYYYMMDDTHHmmss-uuid.zip") so they
follow the project's docstring rules while keeping the same descriptive content.
src/odemis/util/test/dc_fetch_test.py (1)

177-182: ⚡ Quick win

Suppress Ruff S106 warning for dummy test credential.

The static analysis flags secret_key="secret" as a potential hardcoded password. Add a suppression comment since this is clearly a test dummy value.

Suggested fix
         backend = S3UploadBackend(
             access_key="key",
-            secret_key="secret",
+            secret_key="secret",  # noqa: S106 - dummy test credential
             region=S3_REGION,
             bucket="test-bucket",
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/odemis/util/test/dc_fetch_test.py` around lines 177 - 182, The test uses
a dummy hardcoded credential in the S3UploadBackend instantiation
(secret_key="secret") which triggers Ruff S106; add an inline suppression
comment to that argument to silence the warning (e.g., append a "# noqa: S106"
or " # ruff: noqa: S106" to the secret_key="secret" line). Locate the
S3UploadBackend(...) block in the test (the secret_key parameter) and add the
noqa ruff suppression to that line only so the dummy test credential is ignored
by static analysis.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/odemis/util/datacollector.py`:
- Around line 473-483: The return annotation for _get_client currently uses the
invalid string literal "boto3.client"; update the annotation to a real typing
(e.g., Any or a boto3 client stub type) and import it: add Any to the typing
imports and change def _get_client(self) -> "boto3.client": to def
_get_client(self) -> Any: (or to the specific boto3 type if available),
referencing the existing _get_client method and self._client/boto3.client call
so the function signature and imports are consistent.
- Around line 381-385: Replace the TODO by logging the serialization failure and
then falling back to "<unserializable>": inside the try/except around
payload_meta[key] = str(value) log the exception including the key and a short
representation of value (e.g. logger.exception or logger.debug with
exc_info=True) so you capture the error details, and if no module logger exists
add a standard logging.getLogger(__name__) at the top; keep the fallback
assignment payload_meta[key] = "<unserializable>" and remove the TODO comment.

In `@src/odemis/util/dc_fetch.py`:
- Around line 52-53: The docstrings for parse_key_timestamp_utc (and the other
function around lines 68-69) use backticks/inline formatting which violates the
reStructuredText guideline; update these docstrings to plain reST text without
backticks or inline markers and describe the pattern in normal text (e.g., state
the filename pattern as <event>-YYYYMMDDTHHmmss-<uuid>.zip or
"event-YYYYMMDDTHHmmss-uuid.zip") so they follow the project's docstring rules
while keeping the same descriptive content.

In `@src/odemis/util/test/datacollector_test.py`:
- Around line 412-414: In the ImportError except block that currently does
"except ImportError as exc: raise unittest.SkipTest(f\"boto3 is required for
real S3 integration tests: {exc}\")", change the re-raise to use exception
chaining by raising the SkipTest from the caught exception (i.e., "raise
unittest.SkipTest(...) from exc") so the original ImportError traceback is
preserved; update the except ImportError as exc handler around the boto3 import
accordingly.

In `@src/odemis/util/test/dc_fetch_test.py`:
- Around line 177-182: The test uses a dummy hardcoded credential in the
S3UploadBackend instantiation (secret_key="secret") which triggers Ruff S106;
add an inline suppression comment to that argument to silence the warning (e.g.,
append a "# noqa: S106" or " # ruff: noqa: S106" to the secret_key="secret"
line). Locate the S3UploadBackend(...) block in the test (the secret_key
parameter) and add the noqa ruff suppression to that line only so the dummy test
credential is ignored by static analysis.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd235b4d-d275-4adc-ae28-7efa9efc7b43

📥 Commits

Reviewing files that changed from the base of the PR and between 1f2c680 and 2e89fec.

📒 Files selected for processing (7)
  • src/odemis/gui/cont/menu.py
  • src/odemis/gui/main.py
  • src/odemis/gui/win/consent.py
  • src/odemis/util/datacollector.py
  • src/odemis/util/dc_fetch.py
  • src/odemis/util/test/datacollector_test.py
  • src/odemis/util/test/dc_fetch_test.py
✅ Files skipped from review due to trivial changes (1)
  • src/odemis/gui/win/consent.py

@K4rishma K4rishma force-pushed the store_annoted_data_p1 branch from 2e89fec to 84ee1ff Compare May 4, 2026 12:00
@tmoerkerken
Copy link
Copy Markdown
Contributor

I think for such a big PR, a more elaborate description could help me review it a bit better.

class TestRealS3Integration(unittest.TestCase):
"""Real S3 integration tests for Phase 2 upload workflow.

Tests use the credentials from ``_CREDENTIALS_PATH`` and upload to
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No md in docstrings. I guess the rule is not picked up nicely @pieleric .

Style

For Python code, always use type hints for function parameters and return types.
Include docstrings for all functions and classes, following the reStructuredText style guide,
but without the type information and without using inline formatting markers or backticks.

Comment thread src/odemis/util/datacollector.py Outdated
Comment thread src/odemis/util/datacollector.py Outdated
Comment thread src/odemis/util/datacollector.py Outdated
Comment thread src/odemis/util/dc_fetch.py Outdated
shutil.rmtree(str(tmp_dir), ignore_errors=True)


def _enforce_queue_limit(queue_dir: Path) -> None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like the 10% should be a parameter

Comment thread src/odemis/util/datacollector.py
Comment thread src/odemis/util/datacollector.py

def _run(self) -> None:
"""Main loop: process items until the thread is stopped."""
while True:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this loop need to run without timeout, or is that built in some of the queue calls?

Comment thread src/odemis/util/datacollector.py Outdated

# Default paths
_CONF_DIR = os.path.join(os.path.expanduser("~"), ".config", "odemis")
_DEFAULT_QUEUE_DIR = Path("/var/log/odemis/dc_queue")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That directory cannot be created by the user. So it won't work if the data collection framework runs from the GUI (which is the most typical case). I'd suggest to use ~/.local/share/odemis/dc_queue

Comment thread debian/control
Comment on lines +78 to +79
# Needed for communicating with AWS S3
python3-boto3,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@K4rishma the code should stay Python 3.8 compatible. You should change the type annotation to be List instead of list and Dict instead of dict.

" odemis-dc-fetch --since 2026-03-01T12:30:00 --event z_stack_acquired --output ./dc_samples"
)
parser = argparse.ArgumentParser(
description="Fetch data-collection ZIP samples from S3.",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, this script is to be used only by the Delmic "data analysts". They have different access rights compared to the standard users. Standard users get a key (via the Odemis package, common to all users) that only give rights to upload data. They cannot list or read the data.
The data analysts will have a different key (one per analyst) that gives full access to the data, and with which this script will work.

@K4rishma , I'd suggest to add documentation. I think you can add a chapter to the developer's documentation. It should describe the basic model of the data collection access rights, and explain how to use the odemis-dc-fetch tool. In particular, it should explain where the keys are stored, and in which format.

Comment thread src/odemis/util/dc_fetch.py Outdated
Comment on lines +79 to +80
def build_argument_parser() -> argparse.ArgumentParser:
"""Build CLI argument parser for `odemis-dc-fetch`."""
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the separation between code and script is odd. It does make sense to keep some of the low level functions in a separate file (which makes it easier to test). However, all the argument parsing and main function should be in the odemis-dc-fetch script.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't place it in scripts/ , which is a folder that contains example of how to write python scripts using Odemis. Instead, it should be in util/ .

else:
remind_line = "# reminder_date = "

content = (
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Just use the ConfigParser.

S3_REGION = "eu-west-1"

# Path to the S3 credentials key file (JSON with access_key / secret_key).
_CREDENTIALS_PATH = "/usr/share/odemis/datacollector.key"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to provide the same key for all systems, part of the Odemis package. That's a special key, which can only upload data. So it's relatively safe. It's similar to what we already do with the support ticket API (special key which only allows to create a new bug report).

_MAX_RETRY_DELAY_SECONDS = 3600.0
# Default delay before re-prompting for consent after postpone.
_DEFAULT_CONSENT_REMIND_DELTA = timedelta(days=30)
REMINDER_DATE_KEY = "reminder_date"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to have a constant for this. That's the same at the other entries of the INI file.



# S3 bucket name — shared production bucket, created once by the dev team.
S3_BUCKET = "delmic-odemis-collect"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a decision I've made. I prefer having a simple code that uses a single platform for now. The code relying on it is just in two files, and very short (see UploadBackend). If we decide to switch platform, we'll update that code. It's simpler than writing a generic code that support multiple platforms right now... while we don't even know how useful the entire "data collection" feature will be.

"failed": failed,
}

def main(argv: Optional[List[str]] = None) -> int:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to the script.

@K4rishma K4rishma force-pushed the store_annoted_data_p1 branch 2 times, most recently from 957af76 to 93e54ac Compare May 28, 2026 08:49
@K4rishma K4rishma force-pushed the store_annoted_data_p1 branch from 93e54ac to 478bf9a Compare May 28, 2026 09:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants