_posts/2024-01-29-good-parallel-computer.md (+4 −3 lines)
@@ -16,7 +16,7 @@ Last April, I gave a [colloquium][UCSC Colloquium] (video) at the UCSC CSE progr
## Memory efficiency of sophisticated GPU programs
- I’ve been working on Vello, an advanced 2D vector graphics renderer, for many years. The CPU uploads a scene description in a simplified binary SVG-like format, then the compute shaders take care of the rest, producing a 2D rendered image at the end. The compute shaders [parse][stack monoid] tree structures, do advanced computational geometry for [stroke expansion], and sorting-like algorithms for binning. Internally, it's essentially a simple compiler, producing a separate optimized byte-code like program for each 16x16 pixel tile, then interpreting those programs. What it cannot do, a problem I am increasingly frustrated by, is run in bounded memory. Each stage produces intermediate data structures, and the number and size of these structures depends on the input in an unpredictable way. For example, changing a single transform in the encoded scene can result in profoundly different rendering plans.
+ I’ve been working on [Vello], an advanced 2D vector graphics renderer, for many years. The CPU uploads a scene description in a simplified binary SVG-like format, then the compute shaders take care of the rest, producing a 2D rendered image at the end. The compute shaders [parse][stack monoid] tree structures, do advanced computational geometry for [stroke expansion], and run sorting-like algorithms for binning. Internally, it's essentially a simple compiler, producing a separate optimized bytecode-like program for each 16x16 pixel tile, then interpreting those programs. What it cannot do, a problem I am increasingly frustrated by, is run in bounded memory. Each stage produces intermediate data structures, and the number and size of these structures depend on the input in an unpredictable way. For example, changing a single transform in the encoded scene can result in profoundly different rendering plans.
The problem is that the buffers for the intermediate results need to be allocated (under CPU control) before kicking off the pipeline. There are a number of imperfect potential solutions. We could estimate memory requirements on the CPU before starting a render, but that's expensive and may not be precise, resulting either in failure or waste. We could try a render, detect failure, and retry if buffers were exceeded, but doing readback from GPU to CPU is a big performance problem, and creates a significant architectural burden on other engines we'd interface with.
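The "try a render, detect failure, and retry" option can be sketched in a few lines. This is a hypothetical CPU-side skeleton, not the real Vello or wgpu API; the buffer sizes, growth factor, and the way overflow is reported are all illustrative assumptions. In a real engine, reading the failure flag back from the GPU is exactly the expensive round trip the paragraph above complains about.

```rust
/// Hypothetical overflow report: how many bytes the render actually needed.
/// In a real renderer this would come from a GPU->CPU readback.
struct Overflow {
    bytes_needed: usize,
}

/// Stand-in for running the full GPU pipeline against a preallocated
/// intermediate buffer. Succeeds only if the buffer was large enough.
fn try_render(buffer_bytes: usize, required_bytes: usize) -> Result<(), Overflow> {
    if buffer_bytes >= required_bytes {
        Ok(())
    } else {
        Err(Overflow { bytes_needed: required_bytes })
    }
}

/// Retry loop: grow the buffer (with 50% headroom) and re-run the whole
/// pipeline until it fits. Returns the final buffer size in bytes.
fn render_with_retry(mut buffer_bytes: usize, required_bytes: usize) -> usize {
    loop {
        match try_render(buffer_bytes, required_bytes) {
            Ok(()) => return buffer_bytes,
            Err(overflow) => {
                buffer_bytes = overflow.bytes_needed + overflow.bytes_needed / 2;
            }
        }
    }
}

fn main() {
    // Guess 1 MiB; the scene actually needs 4 MiB of intermediate storage,
    // so one failed render plus one retry is required.
    let final_size = render_with_retry(1 << 20, 4 << 20);
    assert!(final_size >= 4 << 20);
    println!("settled on {} bytes", final_size);
}
```

The sketch makes the architectural burden visible: every retry repeats the entire pipeline, and the caller has to be prepared for an arbitrary number of reallocations mid-frame.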
@@ -41,7 +41,7 @@ Perhaps more than anything else, the CM spurred tremendous research into paralle
### Cell
- Another important pioneering parallel computer was Cell, which shipped as part of the PlayStation 3 in 2006. That device shipped in fairly good volume (about 87.4 million units), and had fascinating applications including [high performance computing][PlayStation 3 cluster], but was a dead end; the Playstation 4 switched to a fairly vanilla Radeon GPU.
+ Another important pioneering parallel computer was Cell, which shipped as part of the PlayStation 3 in 2006. That device shipped in fairly good volume (about 87.4 million units), and had fascinating applications including [high performance computing][PlayStation 3 cluster], but was a dead end; the PlayStation 4 switched to a fairly vanilla rendering pipeline based on a Radeon GPU.
Probably one of the biggest challenges in the Cell was the programming model. In the version shipped on the PS3, there were 8 parallel cores, each with 256kB of static RAM, and each with 128 bit wide vector SIMD. The programmer had to manually copy data into local SRAM, where a kernel would then do some computation. There was little or no support for high level programming; thus people wanting to target this platform had to painstakingly architect and implement parallel algorithms.
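The Cell programming model described above can be illustrated generically: stage a chunk of main memory into a small local store, run the kernel there, and copy the results back. This is a hedged sketch in plain Rust, not the real SPU/MFC API; the function names and the explicit copies merely stand in for the manual DMA transfers a Cell programmer had to orchestrate.

```rust
// Each SPE on the PS3 had 256 kB of local static RAM.
const LOCAL_STORE_BYTES: usize = 256 * 1024;

/// Stand-in for an SPE kernel: it may only touch data already staged
/// into the local store.
fn spe_kernel(local: &mut [f32]) {
    for x in local.iter_mut() {
        *x *= 2.0; // some computation on the staged chunk
    }
}

/// Process a large array by streaming it through the local store,
/// one explicitly copied chunk at a time ("DMA in", compute, "DMA out").
fn process(main_memory: &mut [f32]) {
    let chunk_elems = LOCAL_STORE_BYTES / std::mem::size_of::<f32>();
    let mut local = vec![0.0f32; chunk_elems];
    for block in main_memory.chunks_mut(chunk_elems) {
        // "DMA in": copy from main memory into the local store.
        local[..block.len()].copy_from_slice(block);
        spe_kernel(&mut local[..block.len()]);
        // "DMA out": copy results back to main memory.
        block.copy_from_slice(&local[..block.len()]);
    }
}

fn main() {
    let mut data = vec![1.0f32; 100_000];
    process(&mut data);
    assert!(data.iter().all(|&x| x == 2.0));
    println!("processed {} elements", data.len());
}
```

Even in this toy form, the burden is clear: chunking, staging, and copy-back are the programmer's problem, which is why targeting the platform required painstakingly hand-architected parallel algorithms.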
@@ -53,7 +53,7 @@ The Cell had approximately 200 GFLOPS of total throughput, which was impressive
Perhaps the most poignant road not taken in the history of GPU design is Larrabee. The [2008 SIGGRAPH paper][Larrabee paper] makes a compelling case, but ultimately the project failed. It's hard to say why exactly, but I think it's possible it was just poor execution on Intel's part, and with more persistence and a couple of iterations to improve the shortcomings of the original version, it might well have succeeded. At heart, Larrabee is a standard x86 computer with wide (512 bit) SIMD units and just a bit of special hardware to optimize graphics tasks. Most graphics functions are implemented in software. Had it succeeded, it would easily have fulfilled my wishes; work creation and queuing are done in software and can be entirely dynamic at a fine level of granularity.
- Bits of Larrabee live on. The upcoming AVX10 instruction set is an evolution of Larrabee's AVX-512, and supports 32 lanes of f16 operations. In fact, Tom Forsyth, one of its creators, argues that [Larrabee did not indeed fail][Why didn't Larrabee fail?] but that its legacy is a success. Another valuable facet of legacy is ISPC, and Matt Pharr's blog on [The story of ispc] sheds light on the Larrabee project.
+ Bits of Larrabee live on. The upcoming AVX10 instruction set is an evolution of Larrabee's AVX-512, and supports 32 lanes of f16 operations. In fact, Tom Forsyth, one of its creators, argues that [Larrabee did not, in fact, fail][Why didn't Larrabee fail?] but that its legacy is a success. It did ship in modest volumes as Xeon Phi. Another valuable facet of its legacy is ISPC, and Matt Pharr's blog on [The story of ispc] sheds light on the Larrabee project.
Likely one of the problems of Larrabee was power consumption, which has emerged as one of the limiting factors in parallel computer performance. The fully coherent (total store order) memory hierarchy, while making software easier, also added to the cost of the system, and since then we've gained a lot of knowledge in how to write performant software in weaker memory models.
@@ -187,3 +187,4 @@ Progress on a good parallel computer would help my own little sliver of work, tr
[update from Tellusim]: https://tellusim.com/compute-raster/