feat(python-notebook-migration): add notebook-migration-service microservice in backend#5258
feat(python-notebook-migration): add notebook-migration-service microservice in backend#5258zyratlo wants to merge 38 commits into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5258 +/- ##
============================================
+ Coverage 54.39% 54.46% +0.06%
- Complexity 2860 2867 +7
============================================
Files 1107 1111 +4
Lines 42767 42937 +170
Branches 4599 4620 +21
============================================
+ Hits 23264 23385 +121
- Misses 18152 18180 +28
- Partials 1351 1372 +21
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 430 | 0.263 | 23,335/33,579/33,579 us | 🔴 +25.4% / ⚪ within ±5% |
| 🔴 | bs=100 sw=10 sl=64 | 952 | 0.581 | 103,743/135,561/135,561 us | 🔴 +5.4% / 🟢 -7.6% |
| ⚪ | bs=1000 sw=10 sl=64 | 1,108 | 0.676 | 905,364/981,928/981,928 us | ⚪ within ±5% / 🟢 -6.9% |
Baseline details
Latest main e270f83 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 430 tuples/sec | 491 tuples/sec | 410.82 tuples/sec | -12.4% | +4.7% |
| bs=10 sw=10 sl=64 | MB/s | 0.263 MB/s | 0.3 MB/s | 0.251 MB/s | -12.3% | +4.9% |
| bs=10 sw=10 sl=64 | p50 | 23,335 us | 18,615 us | 23,785 us | +25.4% | -1.9% |
| bs=10 sw=10 sl=64 | p95 | 33,579 us | 28,239 us | 34,980 us | +18.9% | -4.0% |
| bs=10 sw=10 sl=64 | p99 | 33,579 us | 28,239 us | 34,980 us | +18.9% | -4.0% |
| bs=100 sw=10 sl=64 | throughput | 952 tuples/sec | 982 tuples/sec | 891.94 tuples/sec | -3.1% | +6.7% |
| bs=100 sw=10 sl=64 | MB/s | 0.581 MB/s | 0.599 MB/s | 0.544 MB/s | -3.0% | +6.7% |
| bs=100 sw=10 sl=64 | p50 | 103,743 us | 101,333 us | 112,277 us | +2.4% | -7.6% |
| bs=100 sw=10 sl=64 | p95 | 135,561 us | 128,607 us | 139,802 us | +5.4% | -3.0% |
| bs=100 sw=10 sl=64 | p99 | 135,561 us | 128,607 us | 139,802 us | +5.4% | -3.0% |
| bs=1000 sw=10 sl=64 | throughput | 1,108 tuples/sec | 1,112 tuples/sec | 1,041 tuples/sec | -0.4% | +6.4% |
| bs=1000 sw=10 sl=64 | MB/s | 0.676 MB/s | 0.679 MB/s | 0.635 MB/s | -0.4% | +6.4% |
| bs=1000 sw=10 sl=64 | p50 | 905,364 us | 892,233 us | 972,714 us | +1.5% | -6.9% |
| bs=1000 sw=10 sl=64 | p95 | 981,928 us | 961,858 us | 1,023,057 us | +2.1% | -4.0% |
| bs=1000 sw=10 sl=64 | p99 | 981,928 us | 961,858 us | 1,023,057 us | +2.1% | -4.0% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,464.63,200,128000,430,0.263,23334.92,33579.00,33579.00
1,100,10,64,20,2101.17,2000,1280000,952,0.581,103743.07,135561.41,135561.41
2,1000,10,64,20,18045.39,20000,12800000,1108,0.676,905363.76,981927.65,981927.65|
/request-review @mengw15 |
mengw15
left a comment
There was a problem hiding this comment.
Left two more comments on the service entry point
| override def run( | ||
| configuration: NotebookMigrationServiceConfiguration, | ||
| environment: Environment | ||
| ): Unit = { | ||
| // Serve backend at /api | ||
| environment.jersey.setUrlPattern("/api/*") | ||
|
|
||
| environment.jersey.register(classOf[NotebookMigrationResource]) | ||
|
|
||
| // Route request logs through SLF4J, controlled by TEXERA_SERVICE_LOG_LEVEL | ||
| RequestLoggingFilter.register(environment.getApplicationContext) | ||
| } |
There was a problem hiding this comment.
All 5 existing Dropwizard services in this repo register the same auth stack in run():
| Service | Registers JWT auth at |
|---|---|
AccessControlService |
line 80 (AuthDynamicFeature(classOf[JwtAuthFilter])) |
ConfigService |
line 71 |
FileService |
line 92 |
WorkflowCompilingService |
line 101 |
ComputingUnitManagingService |
line 86 |
Same 5 components every time: AuthDynamicFeature(classOf[JwtAuthFilter]) + RolesAllowedDynamicFeature + AuthValueFactoryProvider.Binder(classOf[SessionUser]) + UnauthorizedExceptionMapper + corresponding @RolesAllowed / @Auth user: SessionUser on the resources.
This service is the only one (1 of 6) without any of that. Every endpoint is fully unauthenticated. POST /store-notebook-and-mapping takes wid from the request body and writes the user-uploaded notebook to that workflow, with no check that the caller owns workflow wid. Same for fetch and set-notebook. Any client with network access to :9098 can read/write any user's workflow's notebook.
The NotebookMigrationResource companion-object comment mentions a "per-user pod" k8s deployment model. But (a) the other 5 services run in k8s too and still do auth at the service layer, (b) it relies on gateway routing being airtight (no internal-cluster client ever talking directly to a pod), (c) even with per-user pods, wid ownership isn't enforced inside the pod either — so a misrouted request could clobber another user's notebook.
Is the intent to add the auth stack in a later PR in the series, or is the design genuinely "gateway is the only auth boundary"? Worth either wiring auth here to match the other 5 services, or documenting the gateway-trusts model in a top-level service comment so the next contributor doesn't assume it's an oversight.
There was a problem hiding this comment.
I added auth and workflow write access checks in 8bed0f9 and new tests that verify their behavior. Please let me know if I should change anything in regards to convention and/or formatting.
| override def run( | ||
| configuration: NotebookMigrationServiceConfiguration, | ||
| environment: Environment | ||
| ): Unit = { | ||
| // Serve backend at /api | ||
| environment.jersey.setUrlPattern("/api/*") | ||
|
|
||
| environment.jersey.register(classOf[NotebookMigrationResource]) | ||
|
|
||
| // Route request logs through SLF4J, controlled by TEXERA_SERVICE_LOG_LEVEL | ||
| RequestLoggingFilter.register(environment.getApplicationContext) | ||
| } |
There was a problem hiding this comment.
All 5 existing Dropwizard services in this repo register HealthCheckResource in their run():
| Service | Line |
|---|---|
AccessControlService |
74 |
ConfigService |
69 |
FileService |
88 |
WorkflowCompilingService |
59 |
ComputingUnitManagingService |
64 |
This service is the only one (1 of 6) without it. Combined with adminConnectors: [] in notebook-migration-service-web-config.yaml (which disables Dropwizard's built-in admin connector), the service has no external liveness/readiness endpoint — the future k8s Deployment manifest (later PR in the series) won't have a probe target without back-fill. Easy to add now (environment.jersey.register(classOf[HealthCheckResource])) to match the convention.
mengw15
left a comment
There was a problem hiding this comment.
Left a comment on input validation
| val notebookName = json.get("notebookName").asText() | ||
| val notebookData = json.get("notebookData") | ||
|
|
||
| // Construct Jupyter API URL | ||
| val apiUrl = s"$jupyterUrl/api/contents/work/$notebookName" |
There was a problem hiding this comment.
notebookName is taken from the request body and interpolated into two unsafe contexts, both without any validation:
-
URL path (line 153):
s"$jupyterUrl/api/contents/work/$notebookName". AnotebookName=../../etc/notebook.ipynbafter HTTP path normalization escapeswork/and hits a different Jupyter contents endpoint. Jupyter's contents API has scope checks but defense in depth would catch this before the request even goes out. -
JSON response injection (line 193 → line 94):
setNotebookwritesnotebookNameinto the process-globaljupyterIframeURL. SubsequentgetJupyterIframeURL()calls then do raw JSON string concat at line 94:"url": "$jupyterIframeURL". IfnotebookNameever contains a"character, the response is malformed JSON — frontendJSON.parsethrows on iframe reload.
The error path already uses errorJson for safe JSON building (fixed in f0cf26172); the success paths here still use raw s"..." interpolation. A simple regex check on notebookName at the top of setNotebook (e.g., reject anything outside [A-Za-z0-9._-]+\.ipynb) closes the URL-path angle. The success-path JSON could either use errorJson-style ObjectMapper construction, or just sanitize the URL before storing it in jupyterIframeURL.
In the per-user-pod deployment model described elsewhere in this file, the real-world impact of the URL-path angle is bounded (a user can only attack their own pod's Jupyter, which they already control via the iframe). But the fix is cheap (a few lines), survives a future shared-deployment change, and is the kind of pattern static-analysis / security audits will flag.
|
also please add more test cases |
Automated Reviewer SuggestionsBased on the
|
Added more test cases, summary of leftover lines: NotebookMigrationResource.scala — 100% line coverage. Every reachable branch is tested, including the Jupyter HTTP paths (via a stub HttpServer on the configured port), the write-access gate, and the malformed-input and unreachable-server failure paths. The only branches Codecov still flags as partial are unreachable by design:
NotebookMigrationService.scala — run() and registerAuthFeatures are tested (mocked Dropwizard Environment, mirroring AccessControlServiceRunSpec). initialize() and main() are left untested: they stand up the real application and open the live JDBC connection, so they aren't unit-testable — consistent with the other Dropwizard services, which test run() only. StorageConfig.scala / NotebookMigrationServiceConfiguration.scala — the flagged lines are config field declarations; their coverage is attributed to other modules' flags and can't be exercised from this service's tests. |
There was a problem hiding this comment.
Pull request overview
Adds a new backend microservice (notebook-migration-service) to mediate between Texera and a JupyterLab stack for the Python Notebook Migration tool, including DB persistence for notebook + workflow mapping and wiring into the build/CI/dev-proxy.
Changes:
- Introduces
notebook-migration-serviceDropwizard app with notebook/Jupyter REST endpoints plus workflow write-access enforcement. - Adds Scala tests covering DB transaction behavior, auth gating, and stubbed Jupyter success/failure paths.
- Wires the service into root sbt aggregation, CI platform matrix, shared
StorageConfig, and the frontend dev proxy.
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
notebook-migration-service/src/main/scala/org/apache/texera/service/NotebookMigrationService.scala |
Dropwizard application entrypoint; registers Jersey pattern, auth stack, resources, and request logging. |
notebook-migration-service/src/main/scala/org/apache/texera/service/NotebookMigrationServiceConfiguration.scala |
Dropwizard configuration class. |
notebook-migration-service/src/main/scala/org/apache/texera/service/resource/NotebookMigrationResource.scala |
Implements Jupyter connectivity + notebook upload, and DB store/fetch endpoints. |
notebook-migration-service/src/main/scala/org/apache/texera/service/resource/WorkflowAccessResource.scala |
Helper for checking workflow WRITE privilege. |
notebook-migration-service/src/main/scala/org/apache/texera/service/resource/HealthCheckResource.scala |
Adds a /healthcheck endpoint. |
notebook-migration-service/src/test/scala/org/apache/texera/service/resource/NotebookMigrationResourceSpec.scala |
Unit tests for DB persistence/rollback, access checks, and Jupyter stub interactions. |
notebook-migration-service/src/test/scala/org/apache/texera/service/NotebookMigrationServiceRunSpec.scala |
Verifies service run registers resources and auth features. |
notebook-migration-service/src/main/resources/notebook-migration-service-web-config.yaml |
Service port/logging config for Dropwizard. |
notebook-migration-service/build.sbt |
Service-specific build and dependencies. |
notebook-migration-service/project/build.properties |
Pins sbt version for the module build. |
notebook-migration-service/LICENSE-binary |
Binary distribution license manifest for dependency licensing checks. |
notebook-migration-service/NOTICE-binary |
Generated NOTICE aggregation for the service distribution. |
common/config/src/main/resources/storage.conf |
Adds storage.jupyter config (URL + token env overrides). |
common/config/src/main/scala/org/apache/texera/common/config/StorageConfig.scala |
Exposes Jupyter URL/token accessors. |
frontend/proxy.config.json |
Dev proxy route for /api/notebook-migration to localhost:9098. |
build.sbt |
Registers NotebookMigrationService as an sbt subproject and aggregates it. |
bin/licensing/check_binary_deps.py |
Includes the new service’s LICENSE-binary in the licensing check list. |
.github/workflows/build.yml |
Adds the new service to the platform CI matrix (dist + tests + licensing). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Response | ||
| .ok( | ||
| s""" | ||
| { | ||
| "success": true, | ||
| "url": "$jupyterIframeURL" | ||
| } | ||
| """ | ||
| ) | ||
| .build() | ||
| } |
There was a problem hiding this comment.
This value is already regex verified to be valid so there wouldn't be a case of invalid JSON. However, to future-proof for possible refactoring of this function it is OK to make this change as it doesn't affect anything now.
There was a problem hiding this comment.
This value is already regex verified to be valid so there wouldn't be a case of invalid JSON. However, to future-proof for possible refactoring of this function it is OK to make this change as it doesn't affect anything now. Added in 319f0a3
| Response | ||
| .ok( | ||
| s""" | ||
| { | ||
| "success": true, | ||
| "url": "$jupyterUrl" | ||
| } | ||
| """ | ||
| ) | ||
| .build() | ||
| } |
There was a problem hiding this comment.
This value is already regex verified to be valid so there wouldn't be a case of invalid JSON. However, to future-proof for possible refactoring of this function it is OK to make this change as it doesn't affect anything now. Added in 319f0a3
| } catch { | ||
| case NonFatal(e) => | ||
| logger.error("Error storing mapping and workflow", e) | ||
| Response | ||
| .status(Response.Status.INTERNAL_SERVER_ERROR) | ||
| .entity(errorJson(e.getMessage)) | ||
| .build() | ||
| } |
| // Jupyter | ||
| val jupyterURL: String = conf.getString("storage.jupyter.url") | ||
| val jupyterToken: String = conf.getString("storage.jupyter.token") |
There was a problem hiding this comment.
I'm going to leave this as-is. StorageConfig's ENV_* constants aren't actually referenced anywhere — the env-var registry that's consumed at runtime is common/config/.../EnvironmentalVariable.scala (used by ComputingUnitManagingResource to pass settings into spawned pods). So adding ENV_JUPYTER_* constants here would only grow an unused, duplicative block rather than improve the config surface.
If/when the Jupyter settings need to propagate to spawned pods, the right place to add them is EnvironmentalVariable — but that wiring doesn't exist yet, so I'd add the constant when it's actually needed rather than speculatively. The env var names themselves are already documented where they're consumed: storage.conf (STORAGE_JUPYTER_URL, JUPYTER_TOKEN).
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Ryan Zhang <97552093+zyratlo@users.noreply.github.com>
…migration-service' into migration-tool-backend-notebook-migration-service
What changes were proposed in this PR?
Introduces the microservice that mediates between Texera and the JupyterLab docker stack landed in
migration-tool-jupyter-docker. Adds a new SBT subprojectnotebook-migration-serviceplus shared config and a frontend dev-proxy route.New SBT subproject
notebook-migration-service/:build.sbtandproject/build.properties— module SBT setup; module depends on the existingAuth,Config, andDAOprojectssrc/main/scala/.../NotebookMigrationService.scala— DropwizardApplicationentry point; sets Jersey URL pattern to/api/*, registers the resource class, initializes the shared SQL connection viaSqlServer.initConnection(StorageConfig.jdbcUrl, …), and wires inRequestLoggingFilter.src/main/scala/.../NotebookMigrationServiceConfiguration.scala— DropwizardConfigurationsubclass.src/main/scala/.../resource/NotebookMigrationResource.scala— five REST endpoints under/notebook-migration:GET /get-jupyter-url— health-checks the Jupyter container and returns its base URL.GET /get-jupyter-iframe-url— returns the iframe-ready URL fornotebook.ipynb.POST /set-notebook— receives a notebook JSON, PUTs it into JupyterLab via its/api/contents/work/{name}API.POST /store-notebook-and-mapping— persists a notebook + workflow-notebook mapping into Postgres in a single transaction (writes to thenotebookandworkflow_notebook_mappingtables added bymigration-tool-database-tables).POST /fetch-notebook-and-mapping— returns the most recent notebook + mapping for a given (wid, vid).src/main/resources/logback.xml— logging config.src/main/resources/notebook-migration-service-web-config.yaml— Dropwizard server config (HTTP port9098, DB connection refs).Root build wiring:
build.sbt— declares the newNotebookMigrationServiceSBT subproject and adds it to theTexeraProjectaggregation.Shared config:
common/config/src/main/resources/storage.conf— newjupyter { url = "http://localhost:9100" }block, overridable viaSTORAGE_JUPYTER_URL.common/config/src/main/scala/.../StorageConfig.scala— adds thejupyterURLaccessor.Frontend dev proxy:
frontend/proxy.config.json— routes/api/notebook-migration/*tohttp://localhost:9098.Any related issues, documentation, discussions?
Closes #5257
Parent issue #4301
migration-tool-database-tablesfeat(python-notebook-migration): add database tables for Notebook Migration tool #5055 — the resource imports jOOQ-generatedNotebook/WorkflowNotebookMappingclasses that only exist once the schema PR is merged.migration-tool-jupyter-dockeris whatStorageConfig.jupyterURLpoints to. Without it running, the Jupyter-related endpoints return a 500 with"Cannot connect to Jupyter server". Service still starts and the DB-persistence endpoints work in isolation.How was this PR tested?
Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)