Add Gemma 4 RTX 4090 backend helpers #152

Open
adybag14-cyber wants to merge 4 commits into Luce-Org:main from adybag14-cyber:main

Conversation

@adybag14-cyber

[screenshots attached]

@adybag14-cyber
Author

RTX 4090: Gemma 4 31B-it (abliterated) running at 60 tk/s


@cubic-dev-ai (bot) left a comment


3 issues found across 4 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="scripts/Start-LuceboxGemma4090.ps1">

<violation number="1" location="scripts/Start-LuceboxGemma4090.ps1:6">
P2: Hard-coding the default repo path to one workstation makes this helper fail by default on other machines.</violation>
</file>

<file name="scripts/verify_gemma4_4090.py">

<violation number="1" location="scripts/verify_gemma4_4090.py:117">
P2: `--runs` is unchecked, so 0/negative values can make aggregation crash on an empty result set.</violation>
</file>

<file name="scripts/lucebox-gemma4-4090.sh">

<violation number="1" location="scripts/lucebox-gemma4-4090.sh:62">
P2: Readiness timeout can be bypassed because the health probe has no curl timeout, so a single stalled request can block `wait_ready()` indefinitely.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

[string] $Command = 'Start',

[string] $Distro = '',
[string] $RepoPath = '/mnt/c/Users/adyba/src/lucebox-hub',

P2: Hard-coding the default repo path to one workstation makes this helper fail by default on other machines.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/Start-LuceboxGemma4090.ps1, line 6:

<comment>Hard-coding the default repo path to one workstation makes this helper fail by default on other machines.</comment>

<file context>
@@ -0,0 +1,66 @@
+    [string] $Command = 'Start',
+
+    [string] $Distro = '',
+    [string] $RepoPath = '/mnt/c/Users/adyba/src/lucebox-hub',
+    [int] $WaitSeconds = 300
+)
</file context>
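One way to address this violation is to derive the default from the environment rather than baking in one workstation's absolute path. The real fix belongs in the PowerShell param block; the sketch below illustrates the pattern in Python, with `LUCEBOX_REPO` as a hypothetical environment variable, falling back to the current directory:

```python
import argparse
import os

# Prefer an explicit env var, then fall back to the current directory,
# instead of hard-coding one machine's path as the default.
DEFAULT_REPO = os.environ.get("LUCEBOX_REPO", os.getcwd())

parser = argparse.ArgumentParser()
parser.add_argument("--repo-path", default=DEFAULT_REPO)

args = parser.parse_args([])
print(args.repo_path)
```

The same shape in PowerShell would be an expression default on the `$RepoPath` parameter rather than a literal string.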

parser = argparse.ArgumentParser()
parser.add_argument("--base-url", default="http://127.0.0.1:18191")
parser.add_argument("--threshold", type=float, default=60.0)
parser.add_argument("--runs", type=int, default=3)

P2: --runs is unchecked, so 0/negative values can make aggregation crash on an empty result set.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/verify_gemma4_4090.py, line 117:

<comment>`--runs` is unchecked, so 0/negative values can make aggregation crash on an empty result set.</comment>

<file context>
@@ -0,0 +1,162 @@
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--base-url", default="http://127.0.0.1:18191")
+    parser.add_argument("--threshold", type=float, default=60.0)
+    parser.add_argument("--runs", type=int, default=3)
+    parser.add_argument("--n-predict", type=int, default=256)
+    parser.add_argument("--wait", type=float, default=300.0)
</file context>
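A minimal guard for this issue, sketched against the argparse snippet above: an argparse `type` hook that rejects non-positive values at parse time, so the aggregation step never sees an empty result set.

```python
import argparse

def positive_int(value: str) -> int:
    # Reject 0 and negatives before any benchmark runs, so downstream
    # min()/avg() never operate on an empty result list.
    n = int(value)
    if n < 1:
        raise argparse.ArgumentTypeError(f"must be >= 1, got {n}")
    return n

parser = argparse.ArgumentParser()
parser.add_argument("--runs", type=positive_int, default=3)

print(parser.parse_args(["--runs", "5"]).runs)
```

With this in place, `--runs 0` fails with a clear usage error instead of crashing later in aggregation.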

}

health() {
curl -fsS "$(url)/health"

P2: Readiness timeout can be bypassed because the health probe has no curl timeout, so a single stalled request can block wait_ready() indefinitely.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/lucebox-gemma4-4090.sh, line 62:

<comment>Readiness timeout can be bypassed because the health probe has no curl timeout, so a single stalled request can block `wait_ready()` indefinitely.</comment>

<file context>
@@ -0,0 +1,194 @@
+}
+
+health() {
+    curl -fsS "$(url)/health"
+}
+
</file context>
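In the shell script the fix would be a per-request curl timeout (e.g. `--max-time`) inside the probe. The same principle, sketched in Python with a hypothetical `wait_ready` helper: every individual probe carries its own timeout, so the overall readiness deadline actually holds.

```python
import time
import urllib.request
import urllib.error

def wait_ready(url: str, deadline_s: float = 300.0,
               probe_timeout_s: float = 5.0) -> bool:
    # Each probe has its own timeout, so one stalled request can no
    # longer consume the entire readiness window.
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=probe_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, TimeoutError):
            pass  # not ready yet; retry until the deadline
        time.sleep(1.0)
    return False
```

Without the per-probe timeout, a connection that accepts but never responds would block the loop indefinitely regardless of `deadline_s`.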

@davide221
Contributor

davide221 commented May 11, 2026

@adybag14-cyber thanks for your contribution! @dusterbloom can you take a look at this?

@dusterbloom
Contributor

dusterbloom commented May 11, 2026

Thanks for the contribution and the time you put into the scripts.

We can't take this one as-is. A few specific reasons:

The Gemma 4 path in this repo is dflash/, not libllama. The PR description says "DFlash runtime in this repository is still the Qwen/Laguna research path; Gemma 4 uses libllama" — that isn't accurate. dflash/src/gemma4_target_graph.cpp, dflash/src/gemma4_mtp_graph.cpp, dflash/src/gemma4_dflash_graph.cpp, and dflash/src/gemma4_target_loader.cpp are the active Gemma 4 implementation. They power our current Gemma 4 26B-A4B benches up to 1M context with MTP γ=2 — see .sisyphus/notes/gemma4-baseline/mtp-gamma/. The scripts in this PR set up a parallel path through llama-server and bypass that work rather than building on it.

The submodule pin is intentional and non-negotiable. dflash/deps/llama.cpp is pinned to our feature/tq3-kv-cache-clean branch because it carries TQ3_0 KV quantization, the graph-level FWHT contract, sparse FA, and the chunked attention path that the rest of the codebase depends on. CI builds against that pin; benchmarks reference it. We can't replace it with an arbitrary third-party llama.cpp build per script.

The scripts hardcode paths from a workstation we don't have access to: /mnt/c/Users/adyba/... and /home/tdamre/src/llama.cpp-mtp-pr22673/build-mtp-cuda124-speed-faall/bin/llama-server. We can't ship those, and we can't CI them. The same goes for the --spec-draft-n-max 4 claim presented as "the measured stable MTP window for this 31B target on the RTX 4090": no log is cited, and our own γ-sweep on this target (γ ∈ {1, 2, 4, 8} at 4K/16K/64K, RTX 3090) shows γ=2 as the winner past short contexts; γ=4 regresses. Different GPU, different KV layout, possibly different result, but it would need a log to ground the claim.

We're moving toward a declarative config layout: a configs/backends/ registry for alternate llama.cpp builds and configs/profiles/ for per-machine deployment configs with measurement provenance. That's the shape we'd take a contribution like this in; work in progress at #155. If you're still interested once that lands, a re-submission as a small profile plus backend descriptor would be the natural way back in.


@cubic-dev-ai (bot) left a comment


1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="scripts/probe_gemma4_context.py">

<violation number="1" location="scripts/probe_gemma4_context.py:155">
P2: Threshold validation can falsely pass when every run lacks a numeric `predicted_per_second` metric.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

"cache_type_v": args.cache_type_v,
"threshold": args.threshold,
"all_ok": all(r["ok"] for r in results),
"all_ge_threshold": (all(rate >= args.threshold for rate in rates) if args.threshold > 0 else None),

P2: Threshold validation can falsely pass when every run lacks a numeric predicted_per_second metric.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At scripts/probe_gemma4_context.py, line 155:

<comment>Threshold validation can falsely pass when every run lacks a numeric `predicted_per_second` metric.</comment>

<file context>
@@ -0,0 +1,176 @@
+        "cache_type_v": args.cache_type_v,
+        "threshold": args.threshold,
+        "all_ok": all(r["ok"] for r in results),
+        "all_ge_threshold": (all(rate >= args.threshold for rate in rates) if args.threshold > 0 else None),
+        "min_predicted_per_second": min(rates) if rates else None,
+        "avg_predicted_per_second": (sum(rates) / len(rates) if rates else None),
</file context>
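The false pass comes from `all()` being vacuously true on an empty sequence: if no run yields a numeric `predicted_per_second`, `rates` is empty and `all_ge_threshold` reports success. A hedged sketch of the fix, with a hypothetical `threshold_verdict` helper whose names mirror the snippet above (the surrounding script is assumed): require a rate from every run before declaring the threshold met.

```python
def threshold_verdict(rates, results, threshold):
    # all() over an empty list is vacuously True, so demand one numeric
    # rate per run before the threshold check can pass.
    if threshold <= 0:
        return None
    if not rates or len(rates) != len(results):
        return False
    return all(rate >= threshold for rate in rates)

# A run set with no numeric rates now fails instead of passing vacuously.
print(threshold_verdict([], [{"ok": True}], 60.0))
```

The `len(rates) != len(results)` comparison also catches the partial case where only some runs reported a rate.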
