Fix dataset import via url by knollengewaechs · Pull Request #9516 · scalableminds/webknossos

knollengewaechs · 2026-04-27T07:52:20Z

URL of deployed dev instance (used for testing):

https://___.webknossos.xyz

Steps to test:

Setup a local or remote wk.
Open: http://localhost:9000/import?url=https://s3.janelia.org/lavis-lab/Dye_Bioavailability_images/Converted_OME-NGFF/JF608_Lung_20251024_Slide_23_of_2_Region_002_channels.zarr/ (on local)
Should open the remote dataset
Open the same url multiple times -> should always show the same dataset
Check in dashboard that only one remote dataset was added

TODOs:

frontend

if ds not found, import
fix existing issue where name is not set
open existing ds if already imported
-> tested by manually inserting metadata [{"key": "importURL", "type": "string", "value": }] -> this opens the right DS without importing a new one!
store import url in metadata -> backend? I think it makes sense to do it similarly to the folder, so pass it on to DatasetService.createDataset and store it in the metadata there

Issues:

fixes Import URL is broken #9374

(Please delete unneeded items, merge only when none are left open)

Added changelog entry (create a $PR_NUMBER.md file in unreleased_changes or use ./tools/create-changelog-entry.py)

… requested import url in its metadata

coderabbitai · 2026-04-27T07:52:27Z

📝 Walkthrough

Walkthrough

The PR implements import URL deduplication for remote datasets. It adds backend persistence of import URLs in dataset metadata, provides a lookup endpoint, and updates the frontend to check for existing datasets before importing. This fixes the broken import feature by replacing name-based deduplication with URL-based deduplication.

Changes

Import URL Deduplication for Remote Datasets

Layer / File(s)	Summary
Backend metadata schema and lookup `app/controllers/DatasetController.scala`, `app/models/dataset/Dataset.scala`, `app/models/dataset/DatasetService.scala`	`DatasetUpdateParameters` adds default `None` values for optional fields and a `newEmpty` factory; `DataSourceRegistrationInfo` adds optional `importUrl` field; `DatasetDAO` and `DatasetService` add `findOneByImportURL` methods to query datasets by import URL stored in metadata.
Dataset creation flow updates `app/controllers/DatasetController.scala`, `app/controllers/WKRemoteDataStoreController.scala`, `app/models/dataset/ComposeService.scala`	All `createAndSetUpDataset` call sites now explicitly pass `importURLOpt`: from virtual dataset request body, from explore (as `None`), from uploads (as `None`), and from compose (as `None`).
Backend API endpoint and route `app/controllers/DatasetController.scala`, `conf/webknossos.latest.routes`	New secured action `findByImportURL` queries datasets by import URL within the caller's organization and returns the dataset's public representation or `null`; GET route `/datasets/findByImportURL` maps to this action.
Frontend REST API client `frontend/javascripts/admin/rest_api.ts`	Added `findDatasetByImportUrl` helper to fetch datasets by import URL; updated `storeRemoteDataset` to optionally accept and include `importUrl` in the request payload.
Frontend import existence check and navigation `frontend/javascripts/admin/dataset/dataset_url_import.tsx`	Adds `useFetch` call to check for existing datasets matching the import URL; if found, derives view URL and navigates directly; if pending, shows loading spinner; if not found, proceeds to import UI.
Frontend remote dataset view refactoring `frontend/javascripts/admin/dataset/dataset_add_remote_view.tsx`	Removes name-based duplicate-detection redirect logic and `maybeOpenExistingDataset` helper; switches query param handling to `useEffectOnlyOnce`; updates `storeRemoteDataset` call to pass import URL for backend-level deduplication.
Release notes `unreleased_changes/9516.md`	Documents that remote datasets added via `/import` route are automatically deduplicated and existing datasets are opened when already present in the organization.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

scalableminds/webknossos#9557: Modifies the same remote dataset import UI flow in dataset_add_remote_view.tsx, particularly the existing-dataset/opening behavior and navigation logic.
scalableminds/webknossos#9225: Also updates DatasetService.createAndSetUpDataset call sites with additional dataset-creation metadata via named arguments.

Suggested labels

bug

Suggested reviewers

fm3
philippotto

🐰 Import URLs now find their datasets true,
No duplicate dance, just one view to show,
Webknossos buttons on sites—hooray!
Dedup by URL, the metadata way, ✨
Remote datasets home to stay.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title is concise but overly broad. 'Fix dataset import via url' describes the general area but lacks specificity about the core improvement (deduplication via metadata import URL storage).	Consider a more specific title like 'Add import URL deduplication for remote datasets' to better reflect the main technical change.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description provides testing steps, completed frontend TODOs, and references the linked issue `#9374`, clearly relating to the changeset's core objectives.
Linked Issues check	✅ Passed	The pull request successfully addresses all objectives from `#9374`: stores import URLs in metadata, deduplicates by URL instead of name, adds a backend route to check for existing datasets by import URL, and enables reliable 'Annotate with Webknossos' links.
Out of Scope Changes check	✅ Passed	All changes are directly related to the import URL deduplication feature. No unrelated code modifications were detected outside the scope of issue `#9374`.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix-import-via-url

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

knollengewaechs · 2026-04-27T09:45:51Z

+  def findByImportURL(importURL: String): Action[AnyContent] =
+    sil.SecuredAction.async { implicit request =>
+      for {
+        dataset <- datasetService.findOneByImportURL(importURL, request.identity._organization) ?~> "dataset.notFound" ~> NOT_FOUND


Maybe it makes sense to omit the error here, as it will always be shown when the DS was not yet imported. But that scenario is not really an error. So maybe it would make sense to return null or something.

yeah, good idea. Did that 👍

…ound

…-import-via-url

…e found

…to import route component

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/controllers/DatasetController.scala`:
- Around line 464-467: The match on datasetBox currently conflates all non-Full
cases into a None JSON, hiding Failure details; update the pattern match on
datasetBox to handle Full(dataset) -> datasetService.publicWrites(dataset,
Some(request.identity)), Failure(msg, _, _) -> log the failure (including msg)
and return an error result (e.g., propagate a 5xx/unexpected-error via the same
response wrapper rather than Json.toJson(None)), and Empty (or ParamFailure) ->
treat as not-found/None as appropriate; reference datasetBox,
datasetService.publicWrites and request.identity when making the targeted
changes so you don’t lose real backend/DB failures.

In `@app/models/dataset/Dataset.scala`:
- Around line 443-451: The SQL in findOneByImportURL uses LIMIT 1 without ORDER
BY, making selection arbitrary when multiple rows share the same importURL;
update the query (the q"""...""".as[DatasetsRow] block) to include a
deterministic ORDER BY (for example ORDER BY _id ASC or ORDER BY created_at DESC
depending on desired semantics) before LIMIT 1 so the same row is returned
consistently when duplicates exist.

In `@app/models/dataset/DatasetService.scala`:
- Around line 142-149: The code in the Fox.runOptional(importURLOpt) branch
currently records the raw import URL into metadata via DatasetUpdateParameters
-> metadata when calling datasetDAO.updatePartial; instead, sanitize or
canonicalize the URL (remove credentials, query tokens, signed params) or
replace it with a deterministic hash (e.g., SHA-256 of the full URL) before
storing to avoid leaking sensitive tokens. Update the
Fox.runOptional(importURLOpt) handler to compute and store either a sanitized
URL or its hash (and optionally store a field indicating the canonicalization
method) rather than the raw importURL string in the metadata field used by
DatasetUpdateParameters.

In `@frontend/javascripts/admin/dataset/dataset_url_import.tsx`:
- Around line 45-49: The current render calls navigate(url) directly when
maybeExistingDS?.dataSource is truthy (using getViewDatasetURL and navigate),
causing a side effect during render; move this redirect into a useEffect that
depends on maybeExistingDS (and navigate) so navigation only runs as an effect,
and change the render path to return a temporary fallback (e.g., null or a
loading/redirecting message) while the effect performs the redirect. Ensure you
compute the URL with getViewDatasetURL inside the effect or derive it
beforehand, and remove the direct navigate(url) call from the component body.

In `@unreleased_changes/9516.md`:
- Line 2: Reword the release-note sentence that currently reads "Fastly added
remote dataset via /import route are now automatically deduplicated and the
route opens an existing dataset in case it was already added to the users
organization." to correct grammar and improve clarity; update the line (the
sentence starting with "Fastly added remote dataset via /import route") to
something like: "Fastly remote datasets added via the `/import` route are now
automatically deduplicated, and the route opens an existing dataset if it was
already added to the user’s organization." ensuring proper punctuation,
possessive "user’s", and consistent pluralization.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 727c7128-1d36-41b2-9bcc-f1b792f69c90

📥 Commits

Reviewing files that changed from the base of the PR and between e46d7b7 and 6ea919d.

📒 Files selected for processing (10)

app/controllers/DatasetController.scala
app/controllers/WKRemoteDataStoreController.scala
app/models/dataset/ComposeService.scala
app/models/dataset/Dataset.scala
app/models/dataset/DatasetService.scala
conf/webknossos.latest.routes
frontend/javascripts/admin/dataset/dataset_add_remote_view.tsx
frontend/javascripts/admin/dataset/dataset_url_import.tsx
frontend/javascripts/admin/rest_api.ts
unreleased_changes/9516.md

coderabbitai · 2026-05-14T19:02:36Z

+        js <- datasetBox match {
+          case Full(dataset) => datasetService.publicWrites(dataset, Some(request.identity))
+          case _             => Fox.successful(Json.toJson(None))
+        }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not treat backend failures as “not found” in import lookup.

Line 466 currently returns null for every non-Full case, including real Failures. That masks backend/DB errors and can incorrectly trigger duplicate-import flows instead of surfacing the failure.

Suggested fix

js <- datasetBox match { case Full(dataset) => datasetService.publicWrites(dataset, Some(request.identity)) - case _ => Fox.successful(Json.toJson(None)) + case Empty => Fox.successful(Json.toJson(None)) + case failure: Failure => + Fox.failure("dataset.findByImportURL.failed", failure) }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/controllers/DatasetController.scala` around lines 464 - 467, The match on datasetBox currently conflates all non-Full cases into a None JSON, hiding Failure details; update the pattern match on datasetBox to handle Full(dataset) -> datasetService.publicWrites(dataset, Some(request.identity)), Failure(msg, _, _) -> log the failure (including msg) and return an error result (e.g., propagate a 5xx/unexpected-error via the same response wrapper rather than Json.toJson(None)), and Empty (or ParamFailure) -> treat as not-found/None as appropriate; reference datasetBox, datasetService.publicWrites and request.identity when making the targeted changes so you don’t lose real backend/DB failures.

coderabbitai · 2026-05-14T19:02:36Z

+  def findOneByImportURL(importURL: String, organizationId: String)(implicit ctx: DBAccessContext): Fox[Dataset] =
+    for {
+      accessQuery <- readAccessQuery
+      r <- run(q"""SELECT $columns
+                   FROM $existingCollectionName
+                   WHERE _organization = $organizationId
+                   AND metadata @> jsonb_build_array(jsonb_build_object('type', 'string', 'key', 'importURL', 'value', $importURL))
+                   AND $accessQuery
+                   LIMIT 1""".as[DatasetsRow])


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make import-URL lookup deterministic when duplicates exist.

Line 451 uses LIMIT 1 without ORDER BY. If multiple rows match the same importURL, the selected dataset is arbitrary and may vary between calls.

Suggested fix

r <- run(q"""SELECT $columns FROM $existingCollectionName WHERE _organization = $organizationId AND metadata @> jsonb_build_array(jsonb_build_object('type', 'string', 'key', 'importURL', 'value', $importURL)) AND $accessQuery + ORDER BY created ASC, _id ASC LIMIT 1""".as[DatasetsRow])

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/models/dataset/Dataset.scala` around lines 443 - 451, The SQL in findOneByImportURL uses LIMIT 1 without ORDER BY, making selection arbitrary when multiple rows share the same importURL; update the query (the q"""...""".as[DatasetsRow] block) to include a deterministic ORDER BY (for example ORDER BY _id ASC or ORDER BY created_at DESC depending on desired semantics) before LIMIT 1 so the same row is returned consistently when duplicates exist.

coderabbitai · 2026-05-14T19:02:36Z

+      _ <- Fox.runOptional(importURLOpt) { importURL =>
+        datasetDAO.updatePartial(
+          datasetId,
+          DatasetUpdateParameters(
+            description = None,
+            name = None,
+            metadata = Some(JsArray(Seq(Json.obj("key" -> "importURL", "type" -> "string", "value" -> importURL)))))
+        )(GlobalAccessContext)


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid persisting raw import URLs in metadata.

Line 148 stores the full import URL as metadata. If URLs contain credentials, signed params, or tokens, this creates a sensitive-data leakage path via dataset metadata exposure and retention.

Prefer storing a canonicalized/sanitized value (or hash) for deduplication, not the raw URL.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/models/dataset/DatasetService.scala` around lines 142 - 149, The code in the Fox.runOptional(importURLOpt) branch currently records the raw import URL into metadata via DatasetUpdateParameters -> metadata when calling datasetDAO.updatePartial; instead, sanitize or canonicalize the URL (remove credentials, query tokens, signed params) or replace it with a deterministic hash (e.g., SHA-256 of the full URL) before storing to avoid leaking sensitive tokens. Update the Fox.runOptional(importURLOpt) handler to compute and store either a sanitized URL or its hash (and optionally store a field indicating the canonicalization method) rather than the raw importURL string in the metadata field used by DatasetUpdateParameters.

coderabbitai · 2026-05-14T19:02:36Z

+  if (maybeExistingDS?.dataSource) {
+    const url = getViewDatasetURL(maybeExistingDS);
+    navigate(url);
+    return;
+  }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify no render-path navigate call remains in DatasetURLImport after applying the fix. rg -n 'navigate\(' frontend/javascripts/admin/dataset/dataset_url_import.tsx rg -n 'useEffect\(' frontend/javascripts/admin/dataset/dataset_url_import.tsx

Repository: scalableminds/webknossos

Length of output: 143

🏁 Script executed:

cat -n frontend/javascripts/admin/dataset/dataset_url_import.tsx

Repository: scalableminds/webknossos

Length of output: 2684

Move redirect navigation out of render.

Line 47 triggers navigate(url) during render, which is a side effect in the component's render path. This causes navigate() to run on every render; in React Strict Mode it runs twice and can produce duplicate history entries or warnings. Move this into a useEffect tied to maybeExistingDS and render a temporary fallback.

🔧 Proposed fix

+import { useEffect } from "react"; import DatasetAddRemoteView from "admin/dataset/dataset_add_remote_view"; ... if (datasetUri !== null && !checkedForExistence) { return <BrainSpinner />; } - if (maybeExistingDS?.dataSource) { - const url = getViewDatasetURL(maybeExistingDS); - navigate(url); - return; - } + useEffect(() => { + if (maybeExistingDS?.dataSource) { + navigate(getViewDatasetURL(maybeExistingDS), { replace: true }); + } + }, [maybeExistingDS, navigate]); + + if (maybeExistingDS?.dataSource) { + return <BrainSpinner />; + }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/javascripts/admin/dataset/dataset_url_import.tsx` around lines 45 - 49, The current render calls navigate(url) directly when maybeExistingDS?.dataSource is truthy (using getViewDatasetURL and navigate), causing a side effect during render; move this redirect into a useEffect that depends on maybeExistingDS (and navigate) so navigation only runs as an effect, and change the render path to return a temporary fallback (e.g., null or a loading/redirecting message) while the effect performs the redirect. Ensure you compute the URL with getViewDatasetURL inside the effect or derive it beforehand, and remove the direct navigate(url) call from the component body.

coderabbitai · 2026-05-14T19:02:36Z

@@ -0,0 +1,2 @@
+### Changed
+- Fastly added remote dataset via /import route are now automatically deduplicated and the route opens an existing dataset in case it was already added to the users organization. 


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Improve release-note grammar and clarity.

Line 2 has wording/grammar issues that make the changelog harder to read. Please rephrase for correctness and clarity (e.g., “Fastly remote datasets added via the /import route are now automatically deduplicated, and the route opens an existing dataset if it was already added to the user’s organization.”).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@unreleased_changes/9516.md` at line 2, Reword the release-note sentence that currently reads "Fastly added remote dataset via /import route are now automatically deduplicated and the route opens an existing dataset in case it was already added to the users organization." to correct grammar and improve clarity; update the line (the sentence starting with "Fastly added remote dataset via /import route") to something like: "Fastly remote datasets added via the `/import` route are now automatically deduplicated, and the route opens an existing dataset if it was already added to the user’s organization." ensuring proper punctuation, possessive "user’s", and consistent pluralization.

MichaelBuessemeyer and others added 4 commits April 22, 2026 17:53

add backend route to check for existing dataset which already has the…

0bf229a

… requested import url in its metadata

add comment about locally broken behaviour

909d834

fix dataset import by setting the name correctly

b60e692

WIP: fetch maybe existing dataset by url

37e255a

knollengewaechs assigned MichaelBuessemeyer and knollengewaechs Apr 27, 2026

knollengewaechs added backend frontend labels Apr 27, 2026

knollengewaechs commented Apr 27, 2026

View reviewed changes

knollengewaechs and others added 8 commits April 27, 2026 12:28

fix small naming error again; dont import dataset if import url was f…

495151b

…ound

Merge branch 'master' into fix-import-via-url

3c8b3c5

Merge branch 'master' of github.com:scalableminds/webknossos into fix…

da0ed95

…-import-via-url

return null instead of error when dataset with import url could not b…

92c2074

…e found

add import url to metadata of added remote datasets

93f8d82

move checking whether a dataset with the import url already exists in…

dfcd7ff

…to import route component

add changelog entry

5d3f9df

format backend

6ea919d

MichaelBuessemeyer requested review from fm3 and philippotto and removed request for fm3 May 14, 2026 18:56

MichaelBuessemeyer marked this pull request as ready for review May 14, 2026 18:57

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix dataset import via url#9516

Fix dataset import via url#9516
knollengewaechs wants to merge 12 commits into
masterfrom
fix-import-via-url

knollengewaechs commented Apr 27, 2026 •

edited by MichaelBuessemeyer

Loading

Uh oh!

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

knollengewaechs Apr 27, 2026

Uh oh!

MichaelBuessemeyer May 14, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 14, 2026

Uh oh!

coderabbitai Bot May 14, 2026

Uh oh!

coderabbitai Bot May 14, 2026

Uh oh!

coderabbitai Bot May 14, 2026

Uh oh!

coderabbitai Bot May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -0,0 +1,2 @@
		### Changed
		- Fastly added remote dataset via /import route are now automatically deduplicated and the route opens an existing dataset in case it was already added to the users organization.

Conversation

knollengewaechs commented Apr 27, 2026 • edited by MichaelBuessemeyer Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

URL of deployed dev instance (used for testing):

Steps to test:

TODOs:

Issues:

Uh oh!

coderabbitai Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

knollengewaechs Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

MichaelBuessemeyer May 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

knollengewaechs commented Apr 27, 2026 •

edited by MichaelBuessemeyer

Loading

coderabbitai Bot commented Apr 27, 2026 •

edited

Loading