Skip to content

feat(python-notebook-migration): add LLM client for notebook-to-workflow conversion#5260

Open
zyratlo wants to merge 18 commits into
apache:mainfrom
zyratlo:migration-tool-llm-client
Open

feat(python-notebook-migration): add LLM client for notebook-to-workflow conversion#5260
zyratlo wants to merge 18 commits into
apache:mainfrom
zyratlo:migration-tool-llm-client

Conversation

@zyratlo

@zyratlo zyratlo commented May 28, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

Introduces the frontend LLM session class that converts a Jupyter notebook into a Texera workflow JSON plus a bidirectional cell to operator mapping, along with the prompt library it uses. Two files under frontend/src/app/workspace/service/notebook-migration/, totalling ~700 lines (~410 of which is prompt text).

migration-llm.ts — defines NotebookMigrationLLM, an @Injectable class wrapping a Vercel AI SDK chat session against the LiteLLM proxy already exposed on main at /api/chat/completion.

  • initialize(modelType, apiKey) — builds an OpenAI-compatible chat client via createOpenAI({ baseURL: AppSettings.getApiEndpoint() }), seeds the message history with Texera documentation as system messages.
  • verifyConnection() — does a 10-token ping call to validate that the API key works against the configured model.
  • convertNotebookToWorkflow(notebook) — extracts code cells (each tagged with a UUID in metadata.uuid), sends WORKFLOW_PROMPT + the notebook to get a JSON of UDF operators / edges, then sends MAPPING_PROMPT to get the cell↔operator mapping. Assembles a complete Texera workflow JSON (PythonUDFV2 operators with stub input/output ports, links derived from the LLM's edge list, default settings) plus a bidirectional operator_to_cell / cell_to_operator mapping. Returns both as a JSON string.
  • close() — clears the message history and the model reference.

migration-prompts.ts — string constants used by migration-llm.ts: TEXERA_OVERVIEW, TUPLE_DOCUMENTATION, TABLE_DOCUMENTATION, OPERATOR_DOCUMENTATION, UDF_INPUT_PORT_DOCUMENTATION, EXAMPLE_OF_GOOD_CONVERSION, VISUALIZER_DOCUMENTATION, EXAMPLE_OF_MULTIPLE_UDF_CONVERSION, WORKFLOW_PROMPT, MAPPING_PROMPT.

Any related issues, documentation, discussions?

Closes #5259
Parent issue #4301

How was this PR tested?

No unit tests were included for these reasons:

  • A large portion of the changes are prompt text, which are not testable, only readable. However the prompt text can be changed to improve the performance of the LLM.
  • Testing would require mocking a significant amount of logic that will be introduced in later PRs, since the logic in migration-llm.ts is parsing a response.

However I am open to writing tests based on review feedback.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

@github-actions github-actions Bot added the frontend Changes related to the frontend GUI label May 28, 2026
@codecov-commenter

codecov-commenter commented May 28, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 79.56989% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.15%. Comparing base (a5d8602) to head (84e87f1).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
...kspace/service/notebook-migration/migration-llm.ts 76.82% 14 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5260      +/-   ##
============================================
+ Coverage     54.08%   54.15%   +0.07%     
  Complexity     2809     2809              
============================================
  Files          1101     1106       +5     
  Lines         42597    42777     +180     
  Branches       4584     4599      +15     
============================================
+ Hits          23039    23168     +129     
- Misses        18219    18265      +46     
- Partials       1339     1344       +5     
Flag Coverage Δ *Carryforward flag
access-control-service 70.44% <ø> (ø) Carriedforward from 4dc3b1b
agent-service 34.36% <ø> (ø) Carriedforward from 4dc3b1b
amber 55.56% <ø> (ø) Carriedforward from 4dc3b1b
computing-unit-managing-service 1.65% <ø> (ø) Carriedforward from 4dc3b1b
config-service 57.35% <ø> (ø) Carriedforward from 4dc3b1b
file-service 57.06% <ø> (ø) Carriedforward from 4dc3b1b
frontend 48.41% <79.56%> (+0.25%) ⬆️
pyamber 90.20% <ø> (ø) Carriedforward from 4dc3b1b
python 90.76% <ø> (ø) Carriedforward from 4dc3b1b
workflow-compiling-service 58.69% <ø> (ø) Carriedforward from 4dc3b1b

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Yicong-Huang Yicong-Huang changed the title feat(python-notebook-migration, frontend): add LLM client for notebook-to-workflow conversion feat(python-notebook-migration): add LLM client for notebook-to-workflow conversion May 28, 2026
@zyratlo

zyratlo commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

/request-review @mengw15

@github-actions github-actions Bot requested a review from mengw15 June 16, 2026 23:54

@mengw15 mengw15 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments

Comment thread frontend/src/app/workspace/service/notebook-migration/migration-llm.ts Outdated
@github-actions

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • No candidates found from git blame history.

@github-actions github-actions Bot added the dependencies Pull requests that update a dependency file label Jun 22, 2026
@zyratlo zyratlo requested a review from mengw15 June 22, 2026 16:28

@mengw15 mengw15 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of cleanups on migration-prompts.ts

print("Minimum values:\n", data.min())
print("\nMaximum values:\n", data.max())
print("\nMean values:\n", data.mean())
# END CELL 4

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small things while we're in here:

1. The 4 // https://github.com/Texera/texera/... reference URLs (lines 22, 95, 122, 153) are stale on two axes:

  • Org rename: Texera/texeraapache/texera. The plain wiki link at line 22 still redirects, but the three pinned-commit blob/... URLs (lines 95, 122, 153) likely 404 because those commits don't exist in the apache fork's git history.
  • Path drift: the URLs say core/amber/src/main/python/core/models/X.py, but the actual path on apache/texera is amber/src/main/python/core/models/X.py (no core/ prefix). Verified via git ls-tree.

Suggest swapping to the current apache/texera-rooted paths and either dropping the commit pin (link to main/HEAD) or refreshing it.

2. Line 257: # END CELL 4 (with a space) — every other one of the 18 cell markers in EXAMPLE_OF_MULTIPLE_UDF_CONVERSION is the no-space form # END CELLN. This is the canonical pattern the LLM should learn from (and that custom.js parses on the iframe side too). A stray space in the example can teach the LLM that the marker format is loose, producing inconsistent cell IDs when it generates the workflow + mapping. One-character fix: drop the space.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4dc3b1b

@mengw15 mengw15 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more comments on migration-llm.ts

Comment on lines +194 to +209
public async convertNotebookToWorkflow(notebook: Notebook): Promise<string> {
this.assertEnabled();
if (!this.initialized) {
throw new Error("LLM session not initialized");
}

const codeCells = notebook.cells.filter(cell => cell.cell_type === "code");
const notebookString = codeCells
.map(cell => {
const uuid = String(cell.metadata.uuid);
return `# START ${uuid}\n${cell.source}\n# END ${uuid}`;
})
.join("\n\n");

const workflow = await this.sendPrompt(`${WORKFLOW_PROMPT}\n${notebookString}`);
const mapping = await this.sendPrompt(MAPPING_PROMPT);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.messages accumulates user/assistant pairs across every sendPrompt call. A single convertNotebookToWorkflow adds 4 messages (workflow prompt + response + mapping prompt + response) on top of the 8 doc system messages from initialize. If a consumer calls convertNotebookToWorkflow twice without re-initialize-ing or calling close() between (e.g., user uploads notebook A, then B, in the same session), the second call sends the entire first conversation back to the LLM — tokens balloon, context window can blow, and the prior workflow/mapping output can pollute the second response.

The class is built for a one-shot lifecycle (initializeverifyConnectionconvertNotebookToWorkflowclose) but doesn't enforce it. Three options:

  1. Reset at the top: at the start of convertNotebookToWorkflow, reset this.messages to just the documentation prelude (same shape as what initialize builds). This makes each conversion stateless w.r.t. prior conversions.
  2. Throw on re-use: track whether this session has already converted once; on second call, throw an error directing callers to close() + initialize() first.
  3. Document the one-shot lifecycle clearly in the class doc-comment so the next consumer (and the spec) doesn't trip on it.

Option 1 is probably most forgiving for callers; option 3 is the cheapest. The current spec uses a fresh makeLLM() per test so it doesn't catch the multi-call case, which would be a useful test to add either way.

@zyratlo zyratlo Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented Option 1 in 84e87f1 and added test cases to verify previous conversation history is discarded.

Comment on lines +226 to +233
Object.entries(udfLLMResponse.code).forEach(([udfId, udfCode], i) => {
let udfOutputColumns: { attributeName: string; attributeType: string }[] = [];
if (udfLLMResponse.outputs && udfLLMResponse.outputs[udfId]) {
udfOutputColumns = udfLLMResponse.outputs[udfId].map((attr: string) => ({
attributeName: attr,
attributeType: "binary",
}));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up to C2 (binary attributeType): thanks for explaining that binary makes sense for UDF-to-UDF pickled-object data flow. That covers internal flow, but I'd want to also flag the user-facing implication:

  • Result panel viewing: the final UDF in any LLM-generated chain emits binary columns, so Texera's result panel renders them as opaque blobs (<binary data> / base64) rather than the dataframe / values the user expects to see after running their migrated notebook. The user has to manually edit every output column's attributeType before they can view results.
  • Non-UDF downstream operators: if the user (post-generation) adds a Visualize / Filter / Aggregate / SQL operator after the chain, those operators can't interpret binary columns either.

For an MVP this is a reasonable simplification (LLM prompt stays simple), but the friction shows up every time the user runs a generated workflow. Two cheap mitigations short of "make the LLM infer types" (which I agree is bigger work):

  1. Terminal-UDF default: treat the operator(s) with no outgoing edge as the "result-facing" one and default its outputs to string (or some non-binary type) instead of binary — at least the final result renders.
  2. Documented limitation in the class doc-comment + a frontend toast/modal after generation noting "Output columns default to binary; edit per-operator to view typed results."

Either way is fine; even just (2) would close the "user confused why results don't render" UX gap. Up to you whether to scope this here or treat it as a follow-up issue.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a frontend notebook-migration LLM client that calls Texera’s LiteLLM proxy to convert Jupyter notebooks into Texera workflow JSON, alongside a prompt library and unit tests.

Changes:

  • Introduces NotebookMigrationLLM Angular service to run a 2-step LLM flow (workflow JSON + cell↔operator mapping) and assemble a Texera workflow payload.
  • Adds a prompt-constant library used to seed the LLM session with Texera/UDF documentation and conversion instructions.
  • Adds Vitest unit tests for JSON parsing and deterministic workflow/mapping construction, and wires in @ai-sdk/openai.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
frontend/package.json Adds @ai-sdk/openai dependency used by the new migration client.
frontend/yarn.lock Locks the added @ai-sdk/openai dependency.
frontend/src/app/workspace/service/notebook-migration/migration-prompts.ts Adds the prompt/documentation constants for notebook-to-workflow conversion.
frontend/src/app/workspace/service/notebook-migration/migration-llm.ts Implements the notebook migration LLM session, prompting, parsing, and workflow/mapping assembly.
frontend/src/app/workspace/service/notebook-migration/migration-llm.spec.ts Adds unit tests for parsing and conversion behavior with mocked LLM transport.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +20 to +27
import { Injectable } from "@angular/core";
import { GuiConfigService } from "../../../common/service/gui-config.service";
import { createOpenAI } from "@ai-sdk/openai";
import { generateText, type ModelMessage } from "ai";
import { AppSettings } from "../../../common/app-setting";
import { v4 as uuidv4 } from "uuid";
import { WorkflowUtilService } from "../workflow-graph/util/workflow-util.service";
import { OperatorPredicate } from "../../types/workflow-common.interface";
Comment on lines +115 to +123
public initialize(modelType: string = "gpt-5-mini", apiKey: string = "dummy"): void {
this.assertEnabled();
this.model = createOpenAI({
baseURL: new URL(`${AppSettings.getApiEndpoint()}`, document.baseURI).toString(),
// apiKey is required by the library for creating the OpenAI compatible client;
// For security reason, we store the apiKey at the backend, thus the value is dummy here.
apiKey: apiKey,
}).chat(modelType);

Comment on lines +56 to +59
settings: {
dataTransferBatchSize: number;
};
}
Comment on lines +214 to +222
const workflowJSON: WorkflowJSON = {
operators: [],
operatorPositions: {},
links: [],
commentBoxes: [],
settings: {
dataTransferBatchSize: 400,
},
};
Comment on lines +41 to +45
interface Cell {
cell_type: string;
metadata: { [key: string]: any };
source: string;
}
Comment on lines +98 to +110
private parseJsonResponse(raw: string, context: string): any {
// Trim first, then strip optional markdown code fences (```json ... ``` or ``` ... ```)
const cleaned = raw
.trim()
.replace(/^```[a-zA-Z]*\s*/, "")
.replace(/\s*```$/, "")
.trim();
try {
return JSON.parse(cleaned);
} catch (err) {
throw new Error(`Failed to parse LLM ${context} response as JSON: ${(err as Error).message}`);
}
}
Comment on lines +62 to +64
const llm = new NotebookMigrationLLM(stubConfig, stubUtil);
llm.initialize();
return llm;
Comment on lines +157 to +169
it("produces a link with an undefined endpoint when an edge references an unknown UDF id", async () => {
const notebook: Notebook = { cells: [codeCell("CELL1", "a")] };
mockResponses(
JSON.stringify({ code: { UDF1: "c1" }, edges: [["UDF1", "UDFX"]], outputs: {} }),
JSON.stringify({ UDF1: ["CELL1"] })
);

const { workflowJSON } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));

// Documents current behavior: udfMappingToUUID["UDFX"] is undefined.
expect(workflowJSON.links[0].source.operatorID).toBe("PythonUDFV2-0");
expect(workflowJSON.links[0].target.operatorID).toBeUndefined();
});
Comment on lines +183 to +199
it("emits the 'undefined' cell marker in the prompt when a code cell lacks metadata.uuid", async () => {
const notebook: Notebook = { cells: [codeCell(undefined, "print(1)")] };
mockResponses(
JSON.stringify({ code: { UDF1: "c1" }, edges: [], outputs: {} }),
JSON.stringify({ UDF1: ["CELL1"] })
);

await makeLLM().convertNotebookToWorkflow(notebook);

// The notebook string (embedded in the workflow prompt) is sent to generateText.
// messages is a shared, mutated array, so search every message content rather than
// assuming a fixed index.
const allPromptContent = mockGenerateText.mock.calls
.flatMap(call => call[0].messages.map((m: any) => m.content))
.join("\n");
expect(allPromptContent).toContain("# START undefined");
});
Comment on lines +226 to +230
it("does not strip a fence preceded by prose (documents current limitation)", () => {
expect(() => parse('Here is the JSON: ```json\n{"a":1}\n```\nThanks!')).toThrow(
"Failed to parse LLM workflow response as JSON"
);
});

@mengw15 mengw15 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a follow-up note on Copilot's apiKey/JWT comment.

Comment on lines +115 to +122
public initialize(modelType: string = "gpt-5-mini", apiKey: string = "dummy"): void {
this.assertEnabled();
this.model = createOpenAI({
baseURL: new URL(`${AppSettings.getApiEndpoint()}`, document.baseURI).toString(),
// apiKey is required by the library for creating the OpenAI compatible client;
// For security reason, we store the apiKey at the backend, thus the value is dummy here.
apiKey: apiKey,
}).chat(modelType);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strengthening Copilot's apiKey/JWT note on this initialize() block: this is now a hard blocker rather than a forward-looking suggestion, because of #5421.

Before #5421, AccessControlResource.scala:251 (the /chat/* LiteLLM proxy in access-control-service) was @PermitAll. The JwtAuthFilter skipped it, and the resource body explicitly stripped the incoming Authorization header and substituted the LiteLLM master key — so apiKey: "dummy" worked because the value was discarded server-side. That's where the "for security reason, we store the apiKey at the backend, thus the value is dummy here" comment in this file came from.

After #5421 (fix(auth): require REGULAR/ADMIN role on LiteLLM proxy endpoints, merged 2026-06-07), the same resource is @RolesAllowed(Array("REGULAR", "ADMIN")). The JwtAuthFilter registered at AUTHENTICATION priority in AccessControlService.scala:79 now runs on every /api/chat/* request, parses Authorization: Bearer <…> as a JWT, and rejects malformed tokens with 401 before the request reaches the resource body.

Net effect once this PR rebases on current main:

  • convertNotebookToWorkflow → 401
  • verifyConnection → 401
  • The whole migration feature is non-functional until the apiKey wiring changes.

Vercel AI SDK puts the apiKey string into the Authorization header verbatim as a Bearer token, which is the exact shape JwtAuthFilter parses, so passing the current user's JWT here should be a small wire-up (e.g. pulling the token from UserService at the call site and threading it into initialize()). The resource body still substitutes the master key before forwarding to LiteLLM, so the JWT is used only for the Texera-side auth check.

The "// apiKey is required by the library… the value is dummy here" code comment should be updated alongside, otherwise it will mislead the next reader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file frontend Changes related to the frontend GUI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Notebook Migration] Add LLM client for notebook-to-workflow conversion

4 participants