Dv/cpu image bringup report #20

Open
DenisValeev wants to merge 7 commits into PrismML-Eng:main from DenisValeev:dv/cpu-image-bringup-report

Conversation

@DenisValeev

Summary

Adds a focused CPU image generation bring-up report for bonsai-image.

The report centers on the strongest demonstrated result: the unpacked transformer CPU path can produce coherent 128x128 outputs end to end, including:

  • plain ostrich
  • a coherent ostrich silhouette
  • a large centered red circle
  • a clean 4-quadrant multi-color layout

Why this is useful

This makes the current CPU status easier to understand and reproduce.

It documents that:

  • the unpacked CPU path can converge on globally coherent images
  • both geometric and object-level prompts can work on CPU
  • 128x128 is a practical validation target for CPU image generation

Guidance captured in the report

  • use 128x128+ for structure/composition validation
  • judge the active path primarily by final outputs
  • use the unpacked transformer CPU path as a reference-capable CPU configuration

Notes

  • this is a bring-up/status report, not a claim that every CPU failure mode is solved
  • no host-specific, secret, or private-environment details are included
IMG_0416

@khosravipasha
Contributor

oh nice, how fast was it?

Contributor

Copilot AI left a comment

Pull request overview

Adds an experimental CPU image-generation script plus accompanying bring-up/report docs to document and reproduce coherent 128x128 CPU outputs (notably on the unpacked transformer path) for bonsai-image.

Changes:

  • Added a standalone scripts/generate_cpu_experimental.py script to run Flux2 CPU diffusion with logging and optional step image dumps.
  • Added two markdown reports capturing validated CPU bring-up results and suggested practical guidance.
  • Documented an example command shape intended to reproduce the 128x128, 4-step CPU results.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| scripts/generate_cpu_experimental.py | New experimental CPU generation entrypoint (prompt encode → diffusion → VAE decode) with model loading helpers and detailed logging. |
| docs/upstream_issue_cpu_bringup.md | Focused bring-up/status note describing the strongest validated CPU results and reproduction shape. |
| docs/upstream_cpu_image_report_draft.md | Longer-form draft report consolidating the same CPU bring-up evidence and guidance. |

Comment on lines +21 to +24
REPO_ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str((REPO_ROOT / "vendor" / "image-studio").resolve()))

from backend_gpu.diffusion_klein import _mflux_empirical_mu # noqa: E402
Comment on lines +32 to +37
def mem_gib() -> float:
    with open("/proc/self/status") as fh:
        for line in fh:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024 / 1024
    return 0.0
Comment thread scripts/generate_cpu_experimental.py Outdated
parser.add_argument("--output", required=True)
parser.add_argument("--height", type=int, default=256)
parser.add_argument("--width", type=int, default=256)
parser.add_argument("--steps", type=int, default=1)
Comment on lines +354 to +356
if args.height % 32 != 0 or args.width % 32 != 0:
    raise SystemExit("height and width must be multiples of 32")

Comment on lines +48 to +54
python scripts/generate_cpu_experimental.py \
  --prompt 'ostrich' \
  --height 128 \
  --width 128 \
  --steps 4 \
  --seed 7 \
  --transformer-dir models/bonsai-image-4B-ternary-unpacked/transformer
Comment on lines +71 to +77
python scripts/generate_cpu_experimental.py \
  --prompt 'ostrich' \
  --height 128 \
  --width 128 \
  --steps 4 \
  --seed 7 \
  --transformer-dir models/bonsai-image-4B-ternary-unpacked/transformer
@DenisValeev
Author

> oh nice, how fast was it?

Per codex: fastest passing config here was 96x96, 2-step, fp32, 4 threads: 11.4s warm render, about 49.6s total including 38.2s setup. 128x128 at 2 steps was 13.2s warm / 54.7s total, and 128x128 at 4 steps was 33.0s warm / 80.2s total. Old 128x128 4-step baseline here was about 20m42s, so the search materially improved it.

This is with 4 Neoverse ARM vCPUs and 24 GB of RAM on a free-tier Oracle Cloud server. Will push the latest changes to this branch.
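
The timings above can be cross-checked with a little arithmetic. The sketch below assumes that setup time is simply total minus warm render time (the comment confirms this only for the 96x96 case, where 11.4s + 38.2s = 49.6s) and that the old baseline of "about 20m42s" is comparable to the new 128x128 4-step numbers. The helper names are illustrative and not part of the PR.

```python
# Timings quoted in the comment above, in seconds.
runs = {
    "96x96, 2 steps": {"warm": 11.4, "total": 49.6},
    "128x128, 2 steps": {"warm": 13.2, "total": 54.7},
    "128x128, 4 steps": {"warm": 33.0, "total": 80.2},
}
baseline_s = 20 * 60 + 42  # old 128x128 4-step baseline, "about 20m42s"


def setup_seconds(run):
    """Setup cost implied by total minus warm render time (an assumption)."""
    return round(run["total"] - run["warm"], 1)


def speedup(baseline, current):
    """How many times faster `current` is than `baseline`."""
    return round(baseline / current, 1)


for name, run in runs.items():
    print(f"{name}: implied setup ~{setup_seconds(run)}s")

four_step = runs["128x128, 4 steps"]
print("speedup vs baseline (total):", speedup(baseline_s, four_step["total"]))
print("speedup vs baseline (warm): ", speedup(baseline_s, four_step["warm"]))
```

Under those assumptions the 128x128 4-step run is roughly 15x faster than the old baseline end to end, and roughly 38x faster if only the warm render is counted.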

@DenisValeev
Author

image

@DenisValeev
Author

image

Comment thread docs/pr_title.txt Outdated
@@ -0,0 +1 @@
scripts: add warm CPU server benchmark helpers
Contributor

can delete this file probably

Comment thread docs/pr_body.md Outdated
@@ -0,0 +1,38 @@
## Summary
Contributor

do we need to commit this file?
should be in the PR itself

@khosravipasha
Contributor

Nice, that's faster than I imagined for CPU only. Is this running in fp16 or using the 1-bit or ternary packing? Kinda saw a mix of both when skimming through the code.
Happy to merge it after some cleanup and making it blend better with the rest of the demo, e.g. have a script that starts a server, etc.

@DenisValeev
Author

DenisValeev commented Jun 7, 2026 via email

@DenisValeev
Author

Pushed a cleanup pass to this branch.

What changed:

  • deleted docs/pr_title.txt
  • deleted docs/pr_body.md
  • fixed the Copilot nits in scripts/generate_cpu_experimental.py
    • only prepend vendor/image-studio if it exists
    • make RSS logging fail-safe when /proc/self/status is unavailable
    • default --steps to 4 to match the validated/report path
    • reject non-positive --steps early
  • fixed the example commands in both docs to include required --output
  • added scripts/start_cpu_image_server.sh so the warm CPU image path has a demo-shaped server entrypoint instead of only a raw Python module
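
For readers without the diff at hand, the script fixes listed above could look roughly like the sketch below. It is an illustration built from the snippets quoted in the review, not the code that was pushed: the function names `add_vendor_path` and `parse_args`, the `status_path` parameter, and the error message for `--steps` are assumptions.

```python
import argparse
import sys
from pathlib import Path


def add_vendor_path(repo_root: Path) -> bool:
    """Prepend vendor/image-studio to sys.path only when the directory exists."""
    vendor_dir = (repo_root / "vendor" / "image-studio").resolve()
    if not vendor_dir.is_dir():
        return False
    sys.path.insert(0, str(vendor_dir))
    return True


def mem_gib(status_path: str = "/proc/self/status") -> float:
    """Return resident set size in GiB, or 0.0 when it cannot be read."""
    try:
        with open(status_path) as fh:
            for line in fh:
                if line.startswith("VmRSS:"):
                    # VmRSS is reported in kB, so divide twice to reach GiB.
                    return int(line.split()[1]) / 1024 / 1024
    except (OSError, ValueError, IndexError):
        # Non-Linux hosts or restricted containers may not expose this file.
        pass
    return 0.0


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the subset of flags quoted in the review and validate them early."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", required=True)
    parser.add_argument("--height", type=int, default=256)
    parser.add_argument("--width", type=int, default=256)
    parser.add_argument("--steps", type=int, default=4)
    args = parser.parse_args(argv)
    if args.steps <= 0:
        raise SystemExit("steps must be a positive integer")
    if args.height % 32 != 0 or args.width % 32 != 0:
        raise SystemExit("height and width must be multiples of 32")
    return args
```

Making the status path a parameter is only there so the fallback can be exercised on any host, for example `mem_gib("/nonexistent")` returns `0.0` instead of raising.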

On the runtime question: the path here is not fp16 on this ARM CPU box.

  • model family: bonsai-image-4B-ternary
  • transformer path for the stronger CPU runs: unpacked transformer sibling, not GemLite dense reconstruction
  • main inference dtype on this host: float32
  • text encoder: quantized on disk, then dequantized for prompt encoding on CPU
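
As a generic illustration of what "unpacked" means for ternary weights, the sketch below packs values from {-1, 0, +1} four to a byte and expands them back into floats. The 2-bit code assignment, the per-tensor `scale`, and both function names are hypothetical and chosen only for this example; the on-disk format used by bonsai-image is not described in this thread and may differ.

```python
# Hypothetical 2-bit encoding, chosen for illustration only.
_CODE_TO_WEIGHT = {0: 0.0, 1: 1.0, 2: -1.0}
_WEIGHT_TO_CODE = {0: 0, 1: 1, -1: 2}


def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) four to a byte, low bits first."""
    packed = bytearray()
    for start in range(0, len(weights), 4):
        byte = 0
        for offset, w in enumerate(weights[start:start + 4]):
            byte |= _WEIGHT_TO_CODE[w] << (2 * offset)
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed, count, scale=1.0):
    """Expand `count` packed codes into plain floats multiplied by `scale`."""
    out = []
    for i in range(count):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        out.append(_CODE_TO_WEIGHT[code] * scale)
    return out


weights = [1, 0, -1, 1, -1]
packed = pack_ternary(weights)
print(len(packed), "bytes packed ->", unpack_ternary(packed, len(weights)))
```

The trade-off is memory for simplicity: a 2-bit code is 16 times smaller than a float32 value, but an unpacked float32 tensor can go straight into ordinary CPU matrix multiplies without a custom kernel.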

Heads-up on the warm timings: the faster numbers from the later server work are for cached prompt requests on a resident server. A brand-new uncached prompt still pays a large cold prompt-encode cost first on this CPU path, so warm cached latency and first-hit latency are very different. The warm-server helper now makes that split explicit.
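
The split between first-hit and warm cached latency can be sketched with a small prompt-embedding cache. This is a minimal illustration under assumptions, not the warm-server helper from this branch: `PromptCache`, `fake_encode`, and the cache key of `(prompt, max_seq)` are all hypothetical, and the sleep stands in for the real prompt-encode cost.

```python
import time


class PromptCache:
    """Cache prompt embeddings so repeat requests skip the encoder."""

    def __init__(self, encode_fn):
        # Hypothetical interface: any callable (prompt, max_seq) -> embedding.
        self._encode_fn = encode_fn
        self._cache = {}

    def get(self, prompt, max_seq):
        """Return (embedding, was_cached, seconds_spent)."""
        key = (prompt, max_seq)
        start = time.perf_counter()
        cached = key in self._cache
        if not cached:
            self._cache[key] = self._encode_fn(prompt, max_seq)
        return self._cache[key], cached, time.perf_counter() - start


def fake_encode(prompt, max_seq):
    """Stand-in for the real text encoder; sleeps to mimic a cold encode."""
    time.sleep(0.05)
    return [float(ord(c)) for c in prompt[:max_seq]]


cache = PromptCache(fake_encode)
_, hit1, t1 = cache.get("ostrich", 128)
_, hit2, t2 = cache.get("ostrich", 128)
print(f"first hit: cached={hit1}, {t1:.3f}s")
print(f"repeat:    cached={hit2}, {t2:.3f}s")
```

Reporting the cached flag alongside the timing is what keeps the two latencies from being confused in a benchmark summary.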

If you want, I can do one more follow-up pass after this and fold the CPU server start path more tightly into the existing demo flow.

@DenisValeev
Author

For anything substantial it's pretty slow.

But as a fire-and-forget job with a follow-up notification to Telegram, it may be of interest.


image 1024x1024 4-step total 1928.8s

seed: 3995096562
prewarm: 669.0s
restart_wait: 195.0s
render: 1031.8s
max_seq: 128

So about 1,930 seconds (roughly 32 minutes) end to end.

image

@DenisValeev
Author

For the above example run this:

python scripts/generate_cpu_experimental.py --prompt "Macro close-up of an iridescent peacock feather eye, emerald green and teal strands radiating outward, deep blue-black oval center, shimmering metallic texture, shallow depth of field, soft dark blurred background, glossy natural fibers, cinematic macro photography, high detail." --output outputs/cpu-peacock-feather.png --height 1024 --width 1024 --steps 4 --max-seq 128 --transformer-dir models/bonsai-image-4B-ternary-unpacked/transformer

@DenisValeev
Author

A lone samurai warrior in ornate black lacquer armor standing in mist, katana held low at his side, crimson silk cords and weathered metal plates, rain droplets glistening on the armor, stern shadowed face under a kabuto helmet, dramatic rim lighting, foggy bamboo forest background, shallow depth of field, cinematic composition, high detail, realistic historical texture, moody atmosphere, ultra-detailed photography style.

size: 1024x1024
steps: 4
seed: 2265673322
prewarm: 255.0s
restart_wait: 93.0s
render: 1031.6s
total: 1411.6s

image
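
The phase timings of the two 1024x1024 runs in this thread can be broken down as follows. The sketch uses only the numbers posted above; the run labels and the `breakdown` helper are illustrative, and the thread does not say what the remaining time outside the three listed phases covers.

```python
# Phase timings in seconds, copied from the two 1024x1024 4-step runs above.
runs = [
    {"label": "peacock feather", "prewarm": 669.0, "restart_wait": 195.0,
     "render": 1031.8, "total": 1928.8},
    {"label": "samurai", "prewarm": 255.0, "restart_wait": 93.0,
     "render": 1031.6, "total": 1411.6},
]


def breakdown(run):
    """Sum the listed phases and compare them with the reported total."""
    listed = run["prewarm"] + run["restart_wait"] + run["render"]
    return {
        "listed": round(listed, 1),
        "unlisted": round(run["total"] - listed, 1),
        "render_share": round(run["render"] / run["total"], 2),
    }


for run in runs:
    b = breakdown(run)
    print(f"{run['label']}: listed {b['listed']}s, "
          f"unlisted {b['unlisted']}s, render share {b['render_share']}")
```

Render time is nearly identical across the two runs (1031.8s and 1031.6s), so the difference in total time comes almost entirely from prewarm and restart wait.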
