
Write on-disk read-only mirror of virtual datasets #9512

Open
fm3 wants to merge 16 commits into master from mirror-virtual

Conversation

@fm3 (Member) commented Apr 23, 2026

Steps to test:

  • Enable the feature in the config (datastore.writeVirtualDatasetsMirror=true); see the sketch after this list
  • Upload a dataset (this creates a virtual dataset)
  • A mirror should be created on disk
  • As superuser, send POST /api/datasets/writeAllMirrors; this should create mirrors for all virtual datasets
  • The readme in the mirror directory should adequately describe the situation
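
A minimal sketch of the setup, assuming the flag nests under the existing datastore block of conf/application.conf (the exact nesting and the request invocation are assumptions):

# conf/application.conf
datastore {
  writeVirtualDatasetsMirror = true
}

# as superuser, trigger mirrors for all existing virtual datasets
# (authentication omitted; <webknossos-host> is a placeholder)
curl -X POST "https://<webknossos-host>/api/datasets/writeAllMirrors"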

TODOs:

  • Make this opt-in
  • Trigger this similarly to the realpath scan, see Scan realpaths for virtual datasets #9518
    • For existing datasets: use a ticker? Or a superuser-only route?
  • Handle lacking write access
  • Add a readme note that this is read-only

Issues:


  • Added changelog entry (create a $PR_NUMBER.md file in unreleased_changes or use ./tools/create-changelog-entry.py)
  • Removed dev-only changes like prints and application.conf edits
  • Considered common edge cases
  • Needs datastore update after deployment

@fm3 fm3 self-assigned this Apr 23, 2026
@coderabbitai bot commented Apr 23, 2026

Warning

Rate limit exceeded

@fm3 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 16 minutes and 24 seconds before requesting another review.

To continue reviewing without waiting, purchase usage credits in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a01e29ea-fdea-4b08-8602-a5c86782ad43

📥 Commits

Reviewing files that changed from the base of the PR and between 89dac10 and 290f7ca.

📒 Files selected for processing (1)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceMirrorService.scala
📝 Walkthrough

This PR introduces functionality to create on-disk mirrors of virtual, database-backed datasets. A new mirror-writing service is added to the datastore, integrated with dataset update and upload workflows, and exposed via new controller endpoints. Configuration flags control the feature across both applications.

Changes

Virtual Dataset Mirroring

Each entry lists the layer, the affected file(s), and a summary of the change:

  • Configuration & Infrastructure
    Files: conf/application.conf, webknossos-datastore/conf/standalone-datastore.conf, webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreConfig.scala
    Summary: New writeVirtualDatasetsMirror boolean flag added to the datastore configuration in both the main application and the standalone datastore configs.
  • Core Mirror Service
    Files: webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceMirrorService.scala
    Summary: New service creates and maintains on-disk filesystem mirrors for data sources. It writes per-layer subdirectories, symbolic links for mag (magnification) data and attachments, a relativized properties JSON, and readme metadata (a minimal sketch follows this list).
  • Datastore Endpoint & Routing
    Files: webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala, webknossos-datastore/conf/datastore.latest.routes
    Summary: DataSourceController gains a writeMirror(failOnError: Boolean) action that accepts a sequence of dataset IDs. New POST /datasets/writeMirror route wired to the controller.
  • Client-side Integration
    Files: app/models/dataset/WKRemoteDataStoreClient.scala, app/models/dataset/DatasetService.scala
    Summary: WKRemoteDataStoreClient adds a writeMirror() method to invoke the remote datastore endpoint. DatasetService adds writeMirrorForVirtual() to conditionally trigger mirroring for virtual and usable datasets.
  • Main Controller Wiring
    Files: app/controllers/DatasetController.scala, app/controllers/WKRemoteDataStoreController.scala
    Summary: DatasetController constructor extended with an rpc dependency. Mirror-writing calls integrated into the updatePartial(), finishMagUploadToPath(), and finishUploadToPaths() workflows. WKRemoteDataStoreController adds mirror writing to the non-conversion dataset upload path.
  • Public Endpoints & Routing
    Files: conf/webknossos.latest.routes
    Summary: Two new POST routes added: /datasets/:datasetId/writeMirror and /datasets/writeAllMirrors, mapped to DatasetController actions.
  • Documentation
    Files: unreleased_changes/9512.md
    Summary: Release notes document the new datastore.writeVirtualDatasetsMirror configuration option.
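
To make the mirror layout concrete, here is a minimal, hypothetical sketch of such a writer. All names and the directory layout are illustrative assumptions, not the actual DataSourceMirrorService, which additionally relativizes paths inside the properties JSON, handles attachments, and writes into a temporary directory first:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical, simplified mirror writer: one subdirectory per layer, a symlink per mag
// pointing back at the original data, a properties JSON, and a readme that marks the
// directory as an auto-generated read-only mirror.
object DataSourceMirrorSketch {
  def writeMirror(mirrorDir: Path, layers: Map[String, Seq[Path]], propertiesJson: String): Unit = {
    Files.createDirectories(mirrorDir)
    layers.foreach { case (layerName, magPaths) =>
      val layerDir = Files.createDirectories(mirrorDir.resolve(layerName))
      magPaths.foreach { magPath =>
        val link = layerDir.resolve(magPath.getFileName)
        Files.deleteIfExists(link)
        Files.createSymbolicLink(link, magPath)
      }
    }
    Files.write(mirrorDir.resolve("datasource-properties.json"), propertiesJson.getBytes(StandardCharsets.UTF_8))
    Files.write(mirrorDir.resolve("README.txt"),
      "Auto-generated read-only mirror of a virtual dataset. Do not edit.".getBytes(StandardCharsets.UTF_8))
  }
}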

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • scalableminds/webknossos#9518: Modifies the same dataset update/upload flows in DatasetController to add post-update operations, creating a potential conflict or sequencing dependency.

Suggested labels

enhancement

Suggested reviewers

  • MichaelBuessemeyer

Poem

🐰 A rabbit hops through disk and tree,
Mirroring datasets wild and free,
With symlinks soft and paths made right,
Virtual worlds now shine so bright!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description check ✅ Passed: The description provides clear testing steps, documents completed TODOs, references linked issues, and includes deployment notes related to the changeset.
  • Linked Issues check ✅ Passed: The PR implements the core requirement from #9449: creating on-disk mirrors of virtual datasets with automatic triggering and explicit endpoints for management.
  • Out of Scope Changes check ✅ Passed: All changes are directly related to implementing virtual dataset mirroring: configuration options, controllers, services, routes, and documentation are all in scope.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping the docstring coverage check.
  • Title check ✅ Passed: The title accurately describes the main change: introducing a feature to write on-disk mirrors for virtual datasets, which is the primary objective reflected across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch mirror-virtual

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@fm3 fm3 marked this pull request as ready for review May 6, 2026 13:14
@fm3 fm3 changed the title from "WIP Write on-disk read-only mirror of virtual datasets" to "Write on-disk read-only mirror of virtual datasets" May 6, 2026
@coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/controllers/DatasetController.scala`:
- Around line 484-485: The call to datasetService.writeMirrorForVirtual inside
the request flow (followed by
analyticsService.track(ChangeDatasetSettingsEvent(...))) must not make the whole
request fail if mirroring errors; change these paths (the writeMirrorForVirtual
invocations used after the primary commit) to run mirror generation
asynchronously or to guard them with error handling so failures are logged and the request continues.
Specifically, wrap or dispatch
datasetService.writeMirrorForVirtual(updated)(GlobalAccessContext) so any thrown
exceptions are caught or converted to a non-fatal Future recovery and logged
(use the existing logger/analytics for context) instead of propagating, allowing
the primary mutation and analyticsService.track(ChangeDatasetSettingsEvent(...))
to complete; apply the same change to the other occurrences of
writeMirrorForVirtual in this file.
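
As an illustration of the suggested log-and-continue shape (plain Futures stand in for the project's Fox type here; the helper name is made up):

import scala.concurrent.{ExecutionContext, Future}

// Run the mirror write, but never let its failure abort the surrounding request flow:
// errors are logged and swallowed so the primary mutation and analytics tracking proceed.
def writeMirrorNonFatal(write: => Future[Unit], datasetId: String)(implicit ec: ExecutionContext): Future[Unit] =
  Future(write).flatten.recover { case e =>
    println(s"Mirror write for dataset $datasetId failed (non-fatal): ${e.getMessage}")
  }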

In `@app/models/dataset/DatasetService.scala`:
- Around line 705-711: The helper writeMirrorForVirtual currently calls
client.writeMirror(Seq(dataset._id), failOnError = true) which causes
mirror-write failures to bubble up and abort upload-report flows; change this
call to use failOnError = false so mirror write errors are advisory and don't
fail reportDatasetUpload, and ensure any errors are logged locally inside
writeMirrorForVirtual (do not change the superuser writeAllMirrors flow where
hard failures should remain). Locate writeMirrorForVirtual and the
client.writeMirror invocation to make the adjustment and add a local log on
error instead of propagating the exception.
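
Sketched with the same Future stand-in (the writeMirror signature is taken from the comment above; the stub types exist only to keep the snippet self-contained):

import scala.concurrent.{ExecutionContext, Future}

// Stubs so the sketch compiles on its own; the real types live in the webknossos codebase.
case class DatasetStub(_id: String, isVirtual: Boolean, isUsable: Boolean)
trait DataStoreClientStub { def writeMirror(datasetIds: Seq[String], failOnError: Boolean): Future[Unit] }

// Advisory mirror write: failOnError = false, and errors are logged locally instead of
// propagating, so the upload-report flow is never aborted by mirror issues.
def writeMirrorForVirtual(client: DataStoreClientStub, dataset: DatasetStub)(implicit ec: ExecutionContext): Future[Unit] =
  if (dataset.isVirtual && dataset.isUsable)
    client.writeMirror(Seq(dataset._id), failOnError = false).recover { case e =>
      println(s"Mirror write for dataset ${dataset._id} failed: ${e.getMessage}")
    }
  else Future.unit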

In
`@webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala`:
- Around line 756-764: writeMirrorForDataset currently collapses
datasetCache.getById(datasetId) failures by calling dataSourceBox.toOption and
Fox.runOptional, so lookup errors are silently ignored; instead surface the Box
as a Fox (do not call toOption) so failures propagate or are counted: replace
the Fox.runOptional(dataSourceBox.toOption) branch with code that converts the
Box into a Fox (e.g. use the library helper to lift a Box into a Fox or
pattern-match on dataSourceBox and turn Failure/Empty into a failing Fox) and
then call dataSourceMirrorService.writeMirror(dataSource, datasetId) ?~> s"Error
writing datasource mirror for $datasetId" so real lookup failures from
datasetCache.getById and errors from writeMirror are not swallowed.
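
One possible shape for surfacing the Box rather than collapsing it (Box from net.liftweb.common; a plain Future stands in for Fox as the failable result type):

import net.liftweb.common.{Box, Empty, Failure, Full}
import scala.concurrent.Future

// Lift a Box into a failable async value so Failure/Empty propagate (and can be counted)
// instead of being silently dropped via toOption.
def boxToAsync[A](box: Box[A], context: String): Future[A] = box match {
  case Full(value)        => Future.successful(value)
  case Failure(msg, _, _) => Future.failed(new Exception(s"$context: $msg"))
  case Empty              => Future.failed(new Exception(s"$context: empty result"))
}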

In
`@webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceMirrorService.scala`:
- Around line 64-72: The current implementation in writeMirrorMag (and the
analogous writeMirrorAttachment around lines 98-107) creates symlinks to the
original files which allows writes through the mirror if the caller has
permission; replace the symlink approach with a true read-only mirror: create an
actual copy of the target into layerDir (e.g. use Files.copy from
explicitPath.toLocalPathUnsafe to defaultMagPath), then explicitly set the
copied file's permissions to be read-only for all principals (and where
available set an immutable flag via platform-specific APIs), and update the
returned MagLocator path to point to the copied read-only file; ensure the same
logic is applied in the writeMirrorAttachment routine so mirrored attachments
cannot be modified via the mirror.
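
A small sketch of the copy-and-drop-write-permissions alternative (java.nio only; the POSIX permissions assume a Unix-like filesystem, and a true immutable flag would need platform-specific tooling such as chattr):

import java.nio.file.attribute.PosixFilePermissions
import java.nio.file.{Files, Path, StandardCopyOption}

// Copy the source file into the mirror and strip all write bits, so the mirrored file
// cannot be used to modify data even by callers with write access to the original.
def copyReadOnly(source: Path, target: Path): Path = {
  Files.createDirectories(target.getParent)
  Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING)
  Files.setPosixFilePermissions(target, PosixFilePermissions.fromString("r--r--r--"))
  target
}
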
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cd7adbc8-fc0f-4fbe-8f9d-e7e419e5f645

📥 Commits

Reviewing files that changed from the base of the PR and between 409f236 and 89dac10.

📒 Files selected for processing (12)
  • app/controllers/DatasetController.scala
  • app/controllers/WKRemoteDataStoreController.scala
  • app/models/dataset/DatasetService.scala
  • app/models/dataset/WKRemoteDataStoreClient.scala
  • conf/application.conf
  • conf/webknossos.latest.routes
  • unreleased_changes/9512.md
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreConfig.scala
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DataSourceMirrorService.scala
  • webknossos-datastore/conf/datastore.latest.routes
  • webknossos-datastore/conf/standalone-datastore.conf

Comment threads: app/controllers/DatasetController.scala, app/models/dataset/DatasetService.scala
@fm3 fm3 requested a review from MichaelBuessemeyer May 6, 2026 14:14
@MichaelBuessemeyer (Contributor) left a comment


Neat, thanks for implementing this 🎉
I mainly found messages not present in the conf/messages file.
Testing is still an open TODO.

request.body.attachmentRenamings.getOrElse(Seq.empty)))
updated <- datasetDAO.findOne(datasetId)
_ <- datasetService.scanRealpathsIfVirtual(updated)
_ <- datasetService.writeMirrorForVirtual(updated)(GlobalAccessContext)

I'd say to also include this in the update function below

_ <- Fox.assertTrue(datasetService.isEditableBy(dataset, Some(request.identity))) ?~> "notAllowed" ~> FORBIDDEN
_ <- Fox.fromBool(dataset.isVirtual) ?~> "dataset.writeMirror.onlyForVirtual"
_ <- Fox.fromBool(dataset.isUsable) ?~> "dataset.writeMirror.onlyForUsable"
_ <- datasetService.writeMirrorForVirtual(dataset) ?~> "dataset.writeMirror.failed"

The messages are missing in conf/messages.
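
For example, conf/messages would need entries along these lines (only the keys are taken from the snippet above; the wording is a suggestion):

dataset.writeMirror.onlyForVirtual=Mirrors can only be written for virtual datasets
dataset.writeMirror.onlyForUsable=Mirrors can only be written for usable datasets
dataset.writeMirror.failed=Failed to write the on-disk mirror for this dataset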

DELETE /datasets/deletePaths @com.scalableminds.webknossos.datastore.controllers.DataSourceController.deletePaths()
POST /datasets/exploreRemote @com.scalableminds.webknossos.datastore.controllers.DataSourceController.exploreRemoteDataset()
POST /datasets/validatePaths @com.scalableminds.webknossos.datastore.controllers.DataSourceController.validatePaths()
POST /datasets/writeMirror @com.scalableminds.webknossos.datastore.controllers.DataSourceController.writeMirror(failOnError: Boolean)

Maybe name the route and the handler function writeMirrors instead, as it expects a list of dataset IDs 🤔

_ <- ensureMirrorParent(mirrorDir)
_ <- Fox.runIf(Files.exists(tempMirrorDir)) {
tryo(FileUtils.deleteDirectory(tempMirrorDir.toFile)).toFox
} ?~> "dataset.writeMirror.deleteStaleTempMirrorFailed"

The messages in this file also need to be added to conf/messages.
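
For instance, the key from the snippet above would need an entry like this in conf/messages (wording is a suggestion):

dataset.writeMirror.deleteStaleTempMirrorFailed=Failed to delete a stale temporary mirror directory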

@MichaelBuessemeyer (Contributor) left a comment

Testing session results:

I am honestly confused: why is this feature needed? When I upload a dataset, it is automatically added to the organization's binaryData folder, not to a .virtual folder or the like. Only remote datasets seem to get no on-disk representation, but they also do not get a mirror (which makes sense). So is this feature "just" for datasets that are composed of other datasets which are all available on the local file system (i.e. not remote)?

Anyway: testing went well for these datasets, although datasets which are already in the orga's binaryData folder also get a mirror, which is OK I guess, but a bit of a duplicate 🙈



Development

Successfully merging this pull request may close these issues.

Create on-disk mirror of virtual/db-based datasets where possible

2 participants