
Fix dataset import via url #9516

Open
knollengewaechs wants to merge 12 commits into master from fix-import-via-url

Conversation

knollengewaechs (Contributor) commented Apr 27, 2026

URL of deployed dev instance (used for testing):

  • https://___.webknossos.xyz

Steps to test:

TODOs:

frontend

  • if ds not found, import
  • fix existing issue where name is not set
  • open existing ds if already imported
    -> tested by manually inserting metadata [{"key": "importURL", "type": "string", "value": }] -> this opens the right DS without importing a new one!
  • store import url in metadata -> backend? I think it makes sense to do it similarly to the folder, so pass it on to DatasetService.createDataset and store it in the metadata there

Issues:

  • Fixes #9374

(Please delete unneeded items, merge only when none are left open)

  • Added changelog entry (create a $PR_NUMBER.md file in unreleased_changes or use ./tools/create-changelog-entry.py)

coderabbitai Bot (Contributor) commented Apr 27, 2026


📝 Walkthrough

The PR implements import URL deduplication for remote datasets. It adds backend persistence of import URLs in dataset metadata, provides a lookup endpoint, and updates the frontend to check for existing datasets before importing. This fixes the broken import feature by replacing name-based deduplication with URL-based deduplication.

Changes

Import URL Deduplication for Remote Datasets

  • Backend metadata schema and lookup (app/controllers/DatasetController.scala, app/models/dataset/Dataset.scala, app/models/dataset/DatasetService.scala):
    DatasetUpdateParameters adds default None values for optional fields and a newEmpty factory; DataSourceRegistrationInfo adds an optional importUrl field; DatasetDAO and DatasetService add findOneByImportURL methods to query datasets by the import URL stored in metadata.
  • Dataset creation flow updates (app/controllers/DatasetController.scala, app/controllers/WKRemoteDataStoreController.scala, app/models/dataset/ComposeService.scala):
    All createAndSetUpDataset call sites now explicitly pass importURLOpt: from the virtual dataset request body, from explore (as None), from uploads (as None), and from compose (as None).
  • Backend API endpoint and route (app/controllers/DatasetController.scala, conf/webknossos.latest.routes):
    A new secured action findByImportURL queries datasets by import URL within the caller's organization and returns the dataset's public representation or null; the GET route /datasets/findByImportURL maps to this action.
  • Frontend REST API client (frontend/javascripts/admin/rest_api.ts):
    Adds a findDatasetByImportUrl helper to fetch datasets by import URL; updates storeRemoteDataset to optionally accept and include importUrl in the request payload.
  • Frontend import existence check and navigation (frontend/javascripts/admin/dataset/dataset_url_import.tsx):
    Adds a useFetch call to check for an existing dataset matching the import URL; if one is found, derives the view URL and navigates directly; while the check is pending, shows a loading spinner; if none is found, proceeds to the import UI.
  • Frontend remote dataset view refactoring (frontend/javascripts/admin/dataset/dataset_add_remote_view.tsx):
    Removes the name-based duplicate-detection redirect logic and the maybeOpenExistingDataset helper; switches query-param handling to useEffectOnlyOnce; updates the storeRemoteDataset call to pass the import URL for backend-level deduplication.
  • Release notes (unreleased_changes/9516.md):
    Documents that remote datasets added via the /import route are automatically deduplicated and that an existing dataset is opened when it is already present in the organization.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • scalableminds/webknossos#9557: Modifies the same remote dataset import UI flow in dataset_add_remote_view.tsx, particularly the existing-dataset/opening behavior and navigation logic.

  • scalableminds/webknossos#9225: Also updates DatasetService.createAndSetUpDataset call sites with additional dataset-creation metadata via named arguments.

Suggested labels

bug

Suggested reviewers

  • fm3
  • philippotto

🐰 Import URLs now find their datasets true,
No duplicate dance, just one view to show,
Webknossos buttons on sites—hooray!
Dedup by URL, the metadata way,
Remote datasets home to stay.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check: ❓ Inconclusive. The title is concise but overly broad. 'Fix dataset import via url' describes the general area but lacks specificity about the core improvement (deduplication via metadata import URL storage). Consider a more specific title like 'Add import URL deduplication for remote datasets' to better reflect the main technical change.
✅ Passed checks (3 passed)
  • Description check: ✅ Passed. The description provides testing steps, completed frontend TODOs, and references the linked issue #9374, clearly relating to the changeset's core objectives.
  • Linked Issues check: ✅ Passed. The pull request addresses all objectives from #9374: it stores import URLs in metadata, deduplicates by URL instead of name, adds a backend route to check for existing datasets by import URL, and enables reliable 'Annotate with Webknossos' links.
  • Out of Scope Changes check: ✅ Passed. All changes are directly related to the import URL deduplication feature; no unrelated code modifications were detected outside the scope of issue #9374.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



Comment thread: app/controllers/DatasetController.scala (Outdated)

  def findByImportURL(importURL: String): Action[AnyContent] =
    sil.SecuredAction.async { implicit request =>
      for {
        dataset <- datasetService.findOneByImportURL(importURL, request.identity._organization) ?~> "dataset.notFound" ~> NOT_FOUND
Contributor Author


Maybe it makes sense to omit the error here, as it will always be shown when the dataset has not yet been imported. That scenario is not really an error, so it might make more sense to return null or something similar.

Contributor


yeah, good idea. Did that 👍

MichaelBuessemeyer requested review from fm3 and philippotto and removed the review request for fm3 on May 14, 2026 at 18:56
MichaelBuessemeyer marked this pull request as ready for review on May 14, 2026 at 18:57
coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 5


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 727c7128-1d36-41b2-9bcc-f1b792f69c90

📥 Commits

Reviewing files that changed from the base of the PR and between e46d7b7 and 6ea919d.

📒 Files selected for processing (10)
  • app/controllers/DatasetController.scala
  • app/controllers/WKRemoteDataStoreController.scala
  • app/models/dataset/ComposeService.scala
  • app/models/dataset/Dataset.scala
  • app/models/dataset/DatasetService.scala
  • conf/webknossos.latest.routes
  • frontend/javascripts/admin/dataset/dataset_add_remote_view.tsx
  • frontend/javascripts/admin/dataset/dataset_url_import.tsx
  • frontend/javascripts/admin/rest_api.ts
  • unreleased_changes/9516.md

Comment on lines +464 to +467
        js <- datasetBox match {
          case Full(dataset) => datasetService.publicWrites(dataset, Some(request.identity))
          case _             => Fox.successful(Json.toJson(None))
        }
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not treat backend failures as “not found” in import lookup.

Line 466 currently returns null for every non-Full case, including real Failures. That masks backend/DB errors and can incorrectly trigger duplicate-import flows instead of surfacing the failure.

Suggested fix
         js <- datasetBox match {
           case Full(dataset) => datasetService.publicWrites(dataset, Some(request.identity))
-          case _             => Fox.successful(Json.toJson(None))
+          case Empty         => Fox.successful(Json.toJson(None))
+          case failure: Failure =>
+            Fox.failure("dataset.findByImportURL.failed", failure)
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/controllers/DatasetController.scala` around lines 464 - 467, The match on
datasetBox currently conflates all non-Full cases into a None JSON, hiding
Failure details; update the pattern match on datasetBox to handle Full(dataset)
-> datasetService.publicWrites(dataset, Some(request.identity)), Failure(msg, _,
_) -> log the failure (including msg) and return an error result (e.g.,
propagate a 5xx/unexpected-error via the same response wrapper rather than
Json.toJson(None)), and Empty (or ParamFailure) -> treat as not-found/None as
appropriate; reference datasetBox, datasetService.publicWrites and
request.identity when making the targeted changes so you don’t lose real
backend/DB failures.

Comment on lines +443 to +451
  def findOneByImportURL(importURL: String, organizationId: String)(implicit ctx: DBAccessContext): Fox[Dataset] =
    for {
      accessQuery <- readAccessQuery
      r <- run(q"""SELECT $columns
                   FROM $existingCollectionName
                   WHERE _organization = $organizationId
                   AND metadata @> jsonb_build_array(jsonb_build_object('type', 'string', 'key', 'importURL', 'value', $importURL))
                   AND $accessQuery
                   LIMIT 1""".as[DatasetsRow])
Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make import-URL lookup deterministic when duplicates exist.

Line 451 uses LIMIT 1 without ORDER BY. If multiple rows match the same importURL, the selected dataset is arbitrary and may vary between calls.

Suggested fix
       r <- run(q"""SELECT $columns
                    FROM $existingCollectionName
                    WHERE _organization = $organizationId
                    AND metadata @> jsonb_build_array(jsonb_build_object('type', 'string', 'key', 'importURL', 'value', $importURL))
                    AND $accessQuery
+                   ORDER BY created ASC, _id ASC
                    LIMIT 1""".as[DatasetsRow])
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/models/dataset/Dataset.scala` around lines 443 - 451, The SQL in
findOneByImportURL uses LIMIT 1 without ORDER BY, making selection arbitrary
when multiple rows share the same importURL; update the query (the
q"""...""".as[DatasetsRow] block) to include a deterministic ORDER BY (for
example ORDER BY _id ASC or ORDER BY created_at DESC depending on desired
semantics) before LIMIT 1 so the same row is returned consistently when
duplicates exist.

Comment on lines +142 to +149
      _ <- Fox.runOptional(importURLOpt) { importURL =>
        datasetDAO.updatePartial(
          datasetId,
          DatasetUpdateParameters(
            description = None,
            name = None,
            metadata = Some(JsArray(Seq(Json.obj("key" -> "importURL", "type" -> "string", "value" -> importURL)))))
        )(GlobalAccessContext)
Contributor


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid persisting raw import URLs in metadata.

Line 148 stores the full import URL as metadata. If URLs contain credentials, signed params, or tokens, this creates a sensitive-data leakage path via dataset metadata exposure and retention.

Prefer storing a canonicalized/sanitized value (or hash) for deduplication, not the raw URL.
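
A minimal sketch of what such sanitization might look like, using only the JDK (java.net.URI and java.security.MessageDigest). The object and method names (ImportUrlSanitizer, canonicalizeImportURL, importURLHash) are illustrative assumptions, not part of this PR:

import java.net.URI
import java.security.MessageDigest

// Illustrative helper (not from this PR): canonicalize an import URL before persisting it.
object ImportUrlSanitizer {

  // Keep only scheme, host, port and path; drop userinfo (credentials), query string and fragment.
  def canonicalizeImportURL(rawUrl: String): String = {
    val uri = new URI(rawUrl)
    new URI(uri.getScheme, null, uri.getHost, uri.getPort, uri.getPath, null, null).toString
  }

  // Deterministic SHA-256 hex digest of the canonical URL, usable as the metadata value
  // instead of the raw URL.
  def importURLHash(rawUrl: String): String =
    MessageDigest
      .getInstance("SHA-256")
      .digest(canonicalizeImportURL(rawUrl).getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString
}

Whichever variant is chosen, the write path in DatasetService and the lookup in findOneByImportURL must apply the same canonicalization, otherwise deduplication silently stops matching.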

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/models/dataset/DatasetService.scala` around lines 142 - 149, The code in
the Fox.runOptional(importURLOpt) branch currently records the raw import URL
into metadata via DatasetUpdateParameters -> metadata when calling
datasetDAO.updatePartial; instead, sanitize or canonicalize the URL (remove
credentials, query tokens, signed params) or replace it with a deterministic
hash (e.g., SHA-256 of the full URL) before storing to avoid leaking sensitive
tokens. Update the Fox.runOptional(importURLOpt) handler to compute and store
either a sanitized URL or its hash (and optionally store a field indicating the
canonicalization method) rather than the raw importURL string in the metadata
field used by DatasetUpdateParameters.

Comment on lines +45 to +49
  if (maybeExistingDS?.dataSource) {
    const url = getViewDatasetURL(maybeExistingDS);
    navigate(url);
    return;
  }
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify no render-path navigate call remains in DatasetURLImport after applying the fix.
rg -n 'navigate\(' frontend/javascripts/admin/dataset/dataset_url_import.tsx
rg -n 'useEffect\(' frontend/javascripts/admin/dataset/dataset_url_import.tsx

Repository: scalableminds/webknossos

Length of output: 143


🏁 Script executed:

cat -n frontend/javascripts/admin/dataset/dataset_url_import.tsx

Repository: scalableminds/webknossos

Length of output: 2684


Move redirect navigation out of render.

Line 47 triggers navigate(url) during render, which is a side effect in the component's render path. This causes navigate() to run on every render; in React Strict Mode it runs twice and can produce duplicate history entries or warnings. Move this into a useEffect tied to maybeExistingDS and render a temporary fallback.

🔧 Proposed fix
+import { useEffect } from "react";
 import DatasetAddRemoteView from "admin/dataset/dataset_add_remote_view";
 ...
   if (datasetUri !== null && !checkedForExistence) {
     return <BrainSpinner />;
   }
-  if (maybeExistingDS?.dataSource) {
-    const url = getViewDatasetURL(maybeExistingDS);
-    navigate(url);
-    return;
-  }
+  useEffect(() => {
+    if (maybeExistingDS?.dataSource) {
+      navigate(getViewDatasetURL(maybeExistingDS), { replace: true });
+    }
+  }, [maybeExistingDS, navigate]);
+
+  if (maybeExistingDS?.dataSource) {
+    return <BrainSpinner />;
+  }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/javascripts/admin/dataset/dataset_url_import.tsx` around lines 45 -
49, The current render calls navigate(url) directly when
maybeExistingDS?.dataSource is truthy (using getViewDatasetURL and navigate),
causing a side effect during render; move this redirect into a useEffect that
depends on maybeExistingDS (and navigate) so navigation only runs as an effect,
and change the render path to return a temporary fallback (e.g., null or a
loading/redirecting message) while the effect performs the redirect. Ensure you
compute the URL with getViewDatasetURL inside the effect or derive it
beforehand, and remove the direct navigate(url) call from the component body.

@@ -0,0 +1,2 @@
### Changed
- Fastly added remote dataset via /import route are now automatically deduplicated and the route opens an existing dataset in case it was already added to the users organization.
Contributor


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Improve release-note grammar and clarity.

Line 2 has wording/grammar issues that make the changelog harder to read. Please rephrase for correctness and clarity (e.g., “Fastly remote datasets added via the /import route are now automatically deduplicated, and the route opens an existing dataset if it was already added to the user’s organization.”).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@unreleased_changes/9516.md` at line 2, Reword the release-note sentence that
currently reads "Fastly added remote dataset via /import route are now
automatically deduplicated and the route opens an existing dataset in case it
was already added to the users organization." to correct grammar and improve
clarity; update the line (the sentence starting with "Fastly added remote
dataset via /import route") to something like: "Fastly remote datasets added via
the `/import` route are now automatically deduplicated, and the route opens an
existing dataset if it was already added to the user’s organization." ensuring
proper punctuation, possessive "user’s", and consistent pluralization.


Projects

None yet

Development

Successfully merging this pull request may close these issues.

Import URL is broken

2 participants