Fix dataset import via url #9516
Conversation
📝 Walkthrough

The PR implements import URL deduplication for remote datasets. It adds backend persistence of import URLs in dataset metadata, provides a lookup endpoint, and updates the frontend to check for existing datasets before importing. This fixes the broken import feature by replacing name-based deduplication with URL-based deduplication.

Changes: Import URL Deduplication for Remote Datasets
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
def findByImportURL(importURL: String): Action[AnyContent] =
  sil.SecuredAction.async { implicit request =>
    for {
      dataset <- datasetService.findOneByImportURL(importURL, request.identity._organization) ?~> "dataset.notFound" ~> NOT_FOUND
Maybe it makes sense to omit the error here, as it will always be shown when the DS was not yet imported. But that scenario is not really an error. So maybe it would make sense to return null or something.
yeah, good idea. Did that 👍
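The "return null instead of an error" idea discussed above could look roughly like this on the consuming side. This is a hypothetical sketch, not the actual `rest_api.ts` code: the function name `findDatasetByImportURL`, the route path, and the minimal `APIDataset` shape are all assumptions for illustration.

```typescript
// Minimal dataset shape assumed for this sketch (the real APIDataset is larger).
type APIDataset = { id: string; name: string };

// Hypothetical frontend lookup: a null response body means "not yet
// imported", which is a normal outcome here, not an error to catch.
async function findDatasetByImportURL(
  importURL: string,
  fetchJson: (url: string) => Promise<unknown>, // injected so the sketch is self-contained
): Promise<APIDataset | null> {
  const body = await fetchJson(
    `/api/datasets/findByImportURL?importURL=${encodeURIComponent(importURL)}`,
  );
  return body == null ? null : (body as APIDataset);
}
```

With this shape, the import route can branch on `null` ("proceed with the import") versus a dataset object ("redirect to the existing one") without treating the missing dataset as a failure.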
…to import route component
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/controllers/DatasetController.scala`:
- Around line 464-467: The match on datasetBox currently conflates all non-Full
cases into a None JSON, hiding Failure details; update the pattern match on
datasetBox to handle Full(dataset) -> datasetService.publicWrites(dataset,
Some(request.identity)), Failure(msg, _, _) -> log the failure (including msg)
and return an error result (e.g., propagate a 5xx/unexpected-error via the same
response wrapper rather than Json.toJson(None)), and Empty (or ParamFailure) ->
treat as not-found/None as appropriate; reference datasetBox,
datasetService.publicWrites and request.identity when making the targeted
changes so you don’t lose real backend/DB failures.
In `@app/models/dataset/Dataset.scala`:
- Around line 443-451: The SQL in findOneByImportURL uses LIMIT 1 without ORDER
BY, making selection arbitrary when multiple rows share the same importURL;
update the query (the q"""...""".as[DatasetsRow] block) to include a
deterministic ORDER BY (for example ORDER BY _id ASC or ORDER BY created_at DESC
depending on desired semantics) before LIMIT 1 so the same row is returned
consistently when duplicates exist.
In `@app/models/dataset/DatasetService.scala`:
- Around line 142-149: The code in the Fox.runOptional(importURLOpt) branch
currently records the raw import URL into metadata via DatasetUpdateParameters
-> metadata when calling datasetDAO.updatePartial; instead, sanitize or
canonicalize the URL (remove credentials, query tokens, signed params) or
replace it with a deterministic hash (e.g., SHA-256 of the full URL) before
storing to avoid leaking sensitive tokens. Update the
Fox.runOptional(importURLOpt) handler to compute and store either a sanitized
URL or its hash (and optionally store a field indicating the canonicalization
method) rather than the raw importURL string in the metadata field used by
DatasetUpdateParameters.
In `@frontend/javascripts/admin/dataset/dataset_url_import.tsx`:
- Around line 45-49: The current render calls navigate(url) directly when
maybeExistingDS?.dataSource is truthy (using getViewDatasetURL and navigate),
causing a side effect during render; move this redirect into a useEffect that
depends on maybeExistingDS (and navigate) so navigation only runs as an effect,
and change the render path to return a temporary fallback (e.g., null or a
loading/redirecting message) while the effect performs the redirect. Ensure you
compute the URL with getViewDatasetURL inside the effect or derive it
beforehand, and remove the direct navigate(url) call from the component body.
In `@unreleased_changes/9516.md`:
- Line 2: Reword the release-note sentence that currently reads "Fastly added
remote dataset via /import route are now automatically deduplicated and the
route opens an existing dataset in case it was already added to the users
organization." to correct grammar and improve clarity; update the line (the
sentence starting with "Fastly added remote dataset via /import route") to
something like: "Fastly remote datasets added via the `/import` route are now
automatically deduplicated, and the route opens an existing dataset if it was
already added to the user’s organization." ensuring proper punctuation,
possessive "user’s", and consistent pluralization.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 727c7128-1d36-41b2-9bcc-f1b792f69c90
📒 Files selected for processing (10)
- app/controllers/DatasetController.scala
- app/controllers/WKRemoteDataStoreController.scala
- app/models/dataset/ComposeService.scala
- app/models/dataset/Dataset.scala
- app/models/dataset/DatasetService.scala
- conf/webknossos.latest.routes
- frontend/javascripts/admin/dataset/dataset_add_remote_view.tsx
- frontend/javascripts/admin/dataset/dataset_url_import.tsx
- frontend/javascripts/admin/rest_api.ts
- unreleased_changes/9516.md
js <- datasetBox match {
  case Full(dataset) => datasetService.publicWrites(dataset, Some(request.identity))
  case _ => Fox.successful(Json.toJson(None))
}
Do not treat backend failures as “not found” in import lookup.
Line 466 currently returns null for every non-Full case, including real Failures. That masks backend/DB errors and can incorrectly trigger duplicate-import flows instead of surfacing the failure.
Suggested fix
js <- datasetBox match {
case Full(dataset) => datasetService.publicWrites(dataset, Some(request.identity))
- case _ => Fox.successful(Json.toJson(None))
+ case Empty => Fox.successful(Json.toJson(None))
+ case failure: Failure =>
+ Fox.failure("dataset.findByImportURL.failed", failure)
}
def findOneByImportURL(importURL: String, organizationId: String)(implicit ctx: DBAccessContext): Fox[Dataset] =
  for {
    accessQuery <- readAccessQuery
    r <- run(q"""SELECT $columns
                 FROM $existingCollectionName
                 WHERE _organization = $organizationId
                 AND metadata @> jsonb_build_array(jsonb_build_object('type', 'string', 'key', 'importURL', 'value', $importURL))
                 AND $accessQuery
                 LIMIT 1""".as[DatasetsRow])
Make import-URL lookup deterministic when duplicates exist.
Line 451 uses LIMIT 1 without ORDER BY. If multiple rows match the same importURL, the selected dataset is arbitrary and may vary between calls.
Suggested fix
r <- run(q"""SELECT $columns
FROM $existingCollectionName
WHERE _organization = $organizationId
AND metadata @> jsonb_build_array(jsonb_build_object('type', 'string', 'key', 'importURL', 'value', $importURL))
AND $accessQuery
+ ORDER BY created ASC, _id ASC
                 LIMIT 1""".as[DatasetsRow])
_ <- Fox.runOptional(importURLOpt) { importURL =>
  datasetDAO.updatePartial(
    datasetId,
    DatasetUpdateParameters(
      description = None,
      name = None,
      metadata = Some(JsArray(Seq(Json.obj("key" -> "importURL", "type" -> "string", "value" -> importURL)))))
  )(GlobalAccessContext)
Avoid persisting raw import URLs in metadata.
Line 148 stores the full import URL as metadata. If URLs contain credentials, signed params, or tokens, this creates a sensitive-data leakage path via dataset metadata exposure and retention.
Prefer storing a canonicalized/sanitized value (or hash) for deduplication, not the raw URL.
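One way the sanitize-or-hash suggestion could look, sketched in TypeScript rather than the Scala of `DatasetService` since this is only an illustration of the technique: strip credentials and query/fragment parts before persisting, and derive a deterministic SHA-256 key for deduplication. Both function names are assumptions, not existing code.

```typescript
import { createHash } from "node:crypto";

// Strip the parts of an import URL that can carry secrets (basic-auth
// credentials, signed query params, fragments) before it is persisted.
function sanitizeImportURL(raw: string): string {
  const url = new URL(raw);
  url.username = "";
  url.password = "";
  url.search = ""; // drop tokens / signed params
  url.hash = "";
  return url.toString();
}

// Deterministic dedup key over the full original URL: equal imports
// collide, but the stored value reveals nothing sensitive.
function importURLDedupKey(raw: string): string {
  return createHash("sha256").update(raw).digest("hex");
}
```

Storing the hash keeps exact-match lookup (`metadata @> ...`) working unchanged, at the cost of no longer being able to display the original URL; the sanitized-URL variant keeps readability but may merge imports that differ only in query params.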
if (maybeExistingDS?.dataSource) {
  const url = getViewDatasetURL(maybeExistingDS);
  navigate(url);
  return;
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify no render-path navigate call remains in DatasetURLImport after applying the fix.
rg -n 'navigate\(' frontend/javascripts/admin/dataset/dataset_url_import.tsx
rg -n 'useEffect\(' frontend/javascripts/admin/dataset/dataset_url_import.tsx
Repository: scalableminds/webknossos
Length of output: 143

🏁 Script executed:
cat -n frontend/javascripts/admin/dataset/dataset_url_import.tsx
Length of output: 2684
Move redirect navigation out of render.
Line 47 triggers navigate(url) during render, which is a side effect in the component's render path. This causes navigate() to run on every render; in React Strict Mode it runs twice and can produce duplicate history entries or warnings. Move this into a useEffect tied to maybeExistingDS and render a temporary fallback.
🔧 Proposed fix (note: the useEffect call must come before any early return, since React requires hooks to run unconditionally on every render):

+import { useEffect } from "react";
 import DatasetAddRemoteView from "admin/dataset/dataset_add_remote_view";
 ...
+  useEffect(() => {
+    if (maybeExistingDS?.dataSource) {
+      navigate(getViewDatasetURL(maybeExistingDS), { replace: true });
+    }
+  }, [maybeExistingDS, navigate]);
+
   if (datasetUri !== null && !checkedForExistence) {
     return <BrainSpinner />;
   }
-  if (maybeExistingDS?.dataSource) {
-    const url = getViewDatasetURL(maybeExistingDS);
-    navigate(url);
-    return;
-  }
+  if (maybeExistingDS?.dataSource) {
+    return <BrainSpinner />;
+  }
@@ -0,0 +1,2 @@
### Changed
- Fastly added remote dataset via /import route are now automatically deduplicated and the route opens an existing dataset in case it was already added to the users organization.
Improve release-note grammar and clarity.
Line 2 has wording/grammar issues that make the changelog harder to read. Please rephrase for correctness and clarity (e.g., “Fastly remote datasets added via the /import route are now automatically deduplicated, and the route opens an existing dataset if it was already added to the user’s organization.”).
URL of deployed dev instance (used for testing):
Steps to test:
TODOs:
frontend
-> tested by manually inserting metadata [{"key": "importURL", "type": "string", "value": }] -> this opens the right DS without importing a new one!
Issues:
(Please delete unneeded items, merge only when none are left open)
$PR_NUMBER.md file in unreleased_changes or use ./tools/create-changelog-entry.py)