diff --git a/book/i18n/ko/src/SUMMARY.md b/book/i18n/ko/src/SUMMARY.md index dd70147b..ba409758 100644 --- a/book/i18n/ko/src/SUMMARY.md +++ b/book/i18n/ko/src/SUMMARY.md @@ -12,23 +12,17 @@ - [Puzzle 1: Map](./puzzle_01/puzzle_01.md) - [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./puzzle_01/raw.md) - - [๐Ÿ’ก ๋ฏธ๋ฆฌ๋ณด๊ธฐ: LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ํ˜„๋Œ€์  ๋ฐฉ์‹](./puzzle_01/layout_tensor_preview.md) + - [๐Ÿ’ก ๋ฏธ๋ฆฌ๋ณด๊ธฐ: TileTensor๋ฅผ ํ™œ์šฉํ•œ ํ˜„๋Œ€์  ๋ฐฉ์‹](./puzzle_01/tile_tensor_preview.md) - [Puzzle 2: Zip](./puzzle_02/puzzle_02.md) - [Puzzle 3: ๊ฐ€๋“œ](./puzzle_03/puzzle_03.md) - [Puzzle 4: 2D Map](./puzzle_04/puzzle_04.md) - [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./puzzle_04/raw.md) - - [๐Ÿ“š LayoutTensor ์•Œ์•„๋ณด๊ธฐ](./puzzle_04/introduction_layout_tensor.md) - - [๐Ÿš€ ํ˜„๋Œ€์  2D ์—ฐ์‚ฐ](./puzzle_04/layout_tensor.md) + - [๐Ÿ“š TileTensor ์•Œ์•„๋ณด๊ธฐ](./puzzle_04/introduction_tile_tensor.md) + - [๐Ÿš€ ํ˜„๋Œ€์  2D ์—ฐ์‚ฐ](./puzzle_04/tile_tensor.md) - [Puzzle 5: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ](./puzzle_05/puzzle_05.md) - - [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./puzzle_05/raw.md) - - [๐Ÿ“ LayoutTensor ๋ฒ„์ „](./puzzle_05/layout_tensor.md) - [Puzzle 6: ๋ธ”๋ก](./puzzle_06/puzzle_06.md) - [Puzzle 7: 2D ๋ธ”๋ก](./puzzle_07/puzzle_07.md) - - [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./puzzle_07/raw.md) - - [๐Ÿ“ LayoutTensor ๋ฒ„์ „](./puzzle_07/layout_tensor.md) - [Puzzle 8: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ](./puzzle_08/puzzle_08.md) - - [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./puzzle_08/raw.md) - - [๐Ÿ“ LayoutTensor ๋ฒ„์ „](./puzzle_08/layout_tensor.md) # Part II: ๐Ÿž GPU ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น… @@ -44,11 +38,7 @@ # Part III: ๐Ÿงฎ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜ - [Puzzle 11: ํ’€๋ง](./puzzle_11/puzzle_11.md) - - [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./puzzle_11/raw.md) - - [๐Ÿ“ LayoutTensor ๋ฒ„์ „](./puzzle_11/layout_tensor.md) - [Puzzle 12: ๋‚ด์ ](./puzzle_12/puzzle_12.md) - - [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./puzzle_12/raw.md) - - [๐Ÿ“ LayoutTensor ๋ฒ„์ „](./puzzle_12/layout_tensor.md) - [Puzzle 13: 1D 
ํ•ฉ์„ฑ๊ณฑ](./puzzle_13/puzzle_13.md) - [๐Ÿ”ฐ ๊ธฐ๋ณธ ๋ฒ„์ „](./puzzle_13/simple.md) - [โญ ๋ธ”๋ก ๊ฒฝ๊ณ„ ๋ฒ„์ „](./puzzle_13/block_boundary.md) diff --git a/book/i18n/ko/src/introduction.md b/book/i18n/ko/src/introduction.md index 600ccb44..b1616edc 100644 --- a/book/i18n/ko/src/introduction.md +++ b/book/i18n/ko/src/introduction.md @@ -157,7 +157,7 @@ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” ์—ฐ์‚ฐ ์ž์ฒด๋ณด๋‹ค ๋ฐ์ดํ„ฐ๋ฅผ ์˜ฎ๊ธฐ๋Š” ๋น„์šฉ - ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ๊ณผ ๋ธ”๋ก ๊ตฌ์„ฑ ๋ฐฐ์šฐ๊ธฐ - ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ๊ฐ€๋“œ ์ดํ•ดํ•˜๊ธฐ -- ์›์‹œ ํฌ์ธํ„ฐ์™€ LayoutTensor ์ถ”์ƒํ™” ๋ชจ๋‘ ๋‹ค๋ค„๋ณด๊ธฐ +- ์›์‹œ ํฌ์ธํ„ฐ์™€ TileTensor ์ถ”์ƒํ™” ๋ชจ๋‘ ๋‹ค๋ค„๋ณด๊ธฐ - ์Šค๋ ˆ๋“œ ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ดˆ ์ตํžˆ๊ธฐ **Part II: GPU ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น… (ํผ์ฆ 9-10) โœ…** @@ -227,7 +227,7 @@ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” ์—ฐ์‚ฐ ์ž์ฒด๋ณด๋‹ค ๋ฐ์ดํ„ฐ๋ฅผ ์˜ฎ๊ธฐ๋Š” ๋น„์šฉ - AI ์›Œํฌ๋กœ๋“œ๋ฅผ ์œ„ํ•œ ํ…์„œ ์ฝ”์–ด ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ฐฐ์šฐ๊ธฐ - ํ˜„๋Œ€ GPU์˜ ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ฐฐ์šฐ๊ธฐ -์ด ์ฑ…์€ ๊ธฐ์กด ๋ฐฉ์‹๊ณผ ๋‹ฌ๋ฆฌ, ๋จผ์ € ์ €์ˆ˜์ค€ ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ž‘์œผ๋กœ ์ดํ•ด๋ฅผ ์Œ“์€ ๋’ค ์ ์ง„์ ์œผ๋กœ Mojo์˜ LayoutTensor ์ถ”์ƒํ™”๋กœ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด GPU ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์— ๋Œ€ํ•œ ๊นŠ์€ ์ดํ•ด์™€ ํ˜„๋Œ€์  ํ…์„œ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ•์˜ ์‹ค์šฉ์  ์ง€์‹์„ ๋ชจ๋‘ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. +์ด ์ฑ…์€ ๊ธฐ์กด ๋ฐฉ์‹๊ณผ ๋‹ฌ๋ฆฌ, ๋จผ์ € ์ €์ˆ˜์ค€ ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ž‘์œผ๋กœ ์ดํ•ด๋ฅผ ์Œ“์€ ๋’ค ์ ์ง„์ ์œผ๋กœ Mojo์˜ TileTensor ์ถ”์ƒํ™”๋กœ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด GPU ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์— ๋Œ€ํ•œ ๊นŠ์€ ์ดํ•ด์™€ ํ˜„๋Œ€์  ํ…์„œ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ•์˜ ์‹ค์šฉ์  ์ง€์‹์„ ๋ชจ๋‘ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ## ์‹œ์ž‘ํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? diff --git a/book/i18n/ko/src/puzzle_01/puzzle_01.md b/book/i18n/ko/src/puzzle_01/puzzle_01.md index 1498b7f9..dbbe968b 100644 --- a/book/i18n/ko/src/puzzle_01/puzzle_01.md +++ b/book/i18n/ko/src/puzzle_01/puzzle_01.md @@ -32,8 +32,8 @@ ์ง์ ‘ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋‹ค๋ฃจ๋ฉฐ GPU์˜ ๊ธฐ๋ณธ ์›๋ฆฌ๋ฅผ ์ตํž™๋‹ˆ๋‹ค. 
-### [๐Ÿ’ก ๋ฏธ๋ฆฌ๋ณด๊ธฐ: LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ํ˜„๋Œ€์  ๋ฐฉ์‹](./layout_tensor_preview.md) +### [๐Ÿ’ก ๋ฏธ๋ฆฌ๋ณด๊ธฐ: TileTensor๋ฅผ ํ™œ์šฉํ•œ ํ˜„๋Œ€์  ๋ฐฉ์‹](./tile_tensor_preview.md) -LayoutTensor๊ฐ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์–ด๋–ป๊ฒŒ ๋‹จ์ˆœํ™”ํ•˜๋Š”์ง€ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. ๋” ์•ˆ์ „ํ•˜๊ณ  ๊น”๋”ํ•œ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. +TileTensor๊ฐ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์–ด๋–ป๊ฒŒ ๋‹จ์ˆœํ™”ํ•˜๋Š”์ง€ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. ๋” ์•ˆ์ „ํ•˜๊ณ  ๊น”๋”ํ•œ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๐Ÿ’ก **ํŒ**: ๋‘ ๋ฐฉ์‹์„ ๋ชจ๋‘ ์ตํžˆ๋ฉด ํ˜„๋Œ€์ ์ธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์„ ๋” ๊นŠ์ด ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. diff --git a/book/i18n/ko/src/puzzle_01/layout_tensor_preview.md b/book/i18n/ko/src/puzzle_01/tile_tensor_preview.md similarity index 75% rename from book/i18n/ko/src/puzzle_01/layout_tensor_preview.md rename to book/i18n/ko/src/puzzle_01/tile_tensor_preview.md index 903cd802..87635036 100644 --- a/book/i18n/ko/src/puzzle_01/layout_tensor_preview.md +++ b/book/i18n/ko/src/puzzle_01/tile_tensor_preview.md @@ -1,6 +1,6 @@ -## ์™œ LayoutTensor๋ฅผ ๊ณ ๋ คํ•ด์•ผ ํ• ๊นŒ์š”? +## ์™œ TileTensor๋ฅผ ๊ณ ๋ คํ•ด์•ผ ํ• ๊นŒ์š”? ์•„๋ž˜ ๊ธฐ์กด ๊ตฌํ˜„์„ ๋ณด๋ฉด ๋ช‡ ๊ฐ€์ง€ ์ž ์žฌ์ ์ธ ๋ฌธ์ œ๋ฅผ ๋ฐœ๊ฒฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: @@ -32,9 +32,9 @@ idx = (batch * HEIGHT + row) * WIDTH + col idx = (batch * padded_height + row) * padded_width + col ``` -### LayoutTensor ๋ฏธ๋ฆฌ๋ณด๊ธฐ +### TileTensor ๋ฏธ๋ฆฌ๋ณด๊ธฐ -[LayoutTensor](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/)๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ด๋Ÿฐ ๊ฒฝ์šฐ๋ฅผ ํ›จ์”ฌ ๊น”๋”ํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: +[TileTensor](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/)๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ด๋Ÿฐ ๊ฒฝ์šฐ๋ฅผ ํ›จ์”ฌ ๊น”๋”ํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: ```mojo # ๋ฏธ๋ฆฌ๋ณด๊ธฐ - ์ง€๊ธˆ์€ ์ด ๋ฌธ๋ฒ•์„ ๋ชฐ๋ผ๋„ ๊ดœ์ฐฎ์Šต๋‹ˆ๋‹ค! 
@@ -42,7 +42,7 @@ output[i, j] = a[i, j] + 10.0 # 2D ์ธ๋ฑ์‹ฑ output[b, i, j] = a[b, i, j] + 10.0 # 3D ์ธ๋ฑ์‹ฑ ``` -Puzzle 4์—์„œ LayoutTensor๋ฅผ ์ž์„ธํžˆ ๋ฐฐ์šธ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค. ๊ทธ๋•Œ ์ด ๊ฐœ๋…๋“ค์ด ํ•„์ˆ˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ์ง€๊ธˆ์€ ๋‹ค์Œ ๋‚ด์šฉ์„ ์ดํ•ดํ•˜๋Š” ๋ฐ ์ง‘์ค‘ํ•˜์„ธ์š”: +Puzzle 4์—์„œ TileTensor๋ฅผ ์ž์„ธํžˆ ๋ฐฐ์šธ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค. ๊ทธ๋•Œ ์ด ๊ฐœ๋…๋“ค์ด ํ•„์ˆ˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ์ง€๊ธˆ์€ ๋‹ค์Œ ๋‚ด์šฉ์„ ์ดํ•ดํ•˜๋Š” ๋ฐ ์ง‘์ค‘ํ•˜์„ธ์š”: - ๊ธฐ๋ณธ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ - ๊ฐ„๋‹จํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด diff --git a/book/i18n/ko/src/puzzle_02/puzzle_02.md b/book/i18n/ko/src/puzzle_02/puzzle_02.md index 3381ff6f..480ba4c2 100644 --- a/book/i18n/ko/src/puzzle_02/puzzle_02.md +++ b/book/i18n/ko/src/puzzle_02/puzzle_02.md @@ -132,4 +132,4 @@ expected: HostBuffer([0.0, 2.0, 4.0, 6.0]) - ํ•œ ๋ฐฐ์—ด์„ ๋‹ค๋ฅธ ๋ฐฐ์—ด์— ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•ด์•ผ ํ•œ๋‹ค๋ฉด? - ์—ฌ๋Ÿฌ ๋ฐฐ์—ด์—์„œ ๋ณ‘ํ•ฉ(coalesced) ์ ‘๊ทผ์„ ์–ด๋–ป๊ฒŒ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ์„๊นŒ? -์ด๋Ÿฌํ•œ ์งˆ๋ฌธ๋“ค์€ Puzzle 4์˜ [LayoutTensor ์•Œ์•„๋ณด๊ธฐ](../puzzle_04/introduction_layout_tensor.md)์—์„œ ๋‹ค๋ฃน๋‹ˆ๋‹ค. +์ด๋Ÿฌํ•œ ์งˆ๋ฌธ๋“ค์€ Puzzle 4์˜ [TileTensor ์•Œ์•„๋ณด๊ธฐ](../puzzle_04/introduction_tile_tensor.md)์—์„œ ๋‹ค๋ฃน๋‹ˆ๋‹ค. diff --git a/book/i18n/ko/src/puzzle_03/puzzle_03.md b/book/i18n/ko/src/puzzle_03/puzzle_03.md index 79f70fd9..901181ec 100644 --- a/book/i18n/ko/src/puzzle_03/puzzle_03.md +++ b/book/i18n/ko/src/puzzle_03/puzzle_03.md @@ -160,4 +160,4 @@ if i < height and j < width and k < depth and i >= padding and j >= padding: ... ``` -์ด๋Ÿฐ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ํŒจํ„ด์€ Puzzle 4์˜ [LayoutTensor ์•Œ์•„๋ณด๊ธฐ](../puzzle_04/introduction_layout_tensor.md)์—์„œ ๋ฐฐ์šฐ๋ฉด ํ›จ์”ฌ ๊น”๋”ํ•ด์ง‘๋‹ˆ๋‹ค. LayoutTensor๋Š” ํ˜•ํƒœ ๊ด€๋ฆฌ ๊ธฐ๋Šฅ์„ ๊ธฐ๋ณธ์œผ๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. +์ด๋Ÿฐ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ํŒจํ„ด์€ Puzzle 4์˜ [TileTensor ์•Œ์•„๋ณด๊ธฐ](../puzzle_04/introduction_tile_tensor.md)์—์„œ ๋ฐฐ์šฐ๋ฉด ํ›จ์”ฌ ๊น”๋”ํ•ด์ง‘๋‹ˆ๋‹ค. TileTensor๋Š” ํ˜•ํƒœ ๊ด€๋ฆฌ ๊ธฐ๋Šฅ์„ ๊ธฐ๋ณธ์œผ๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. 
diff --git a/book/i18n/ko/src/puzzle_04/introduction_layout_tensor.md b/book/i18n/ko/src/puzzle_04/introduction_tile_tensor.md similarity index 65% rename from book/i18n/ko/src/puzzle_04/introduction_layout_tensor.md rename to book/i18n/ko/src/puzzle_04/introduction_tile_tensor.md index 35098d83..076a239b 100644 --- a/book/i18n/ko/src/puzzle_04/introduction_layout_tensor.md +++ b/book/i18n/ko/src/puzzle_04/introduction_tile_tensor.md @@ -1,11 +1,11 @@ -# LayoutTensor ์•Œ์•„๋ณด๊ธฐ +# TileTensor ์•Œ์•„๋ณด๊ธฐ ํผ์ฆ ํ’€์ด๋ฅผ ์ž ์‹œ ๋ฉˆ์ถ”๊ณ , GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋” ์ฆ๊ฒ๊ฒŒ ๋งŒ๋“ค์–ด์ค„ ๊ฐ•๋ ฅํ•œ ์ถ”์ƒํ™”๋ฅผ ๋ฏธ๋ฆฌ ์‚ดํŽด๋ด…์‹œ๋‹ค: -๐Ÿฅ ... ๋ฐ”๋กœ **[LayoutTensor](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/)** ์ž…๋‹ˆ๋‹ค. +๐Ÿฅ ... ๋ฐ”๋กœ **[TileTensor](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/)** ์ž…๋‹ˆ๋‹ค. -> ๐Ÿ’ก _LayoutTensor๊ฐ€ ์–ด๋–ค ์ผ์„ ํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ๋ง›๋ณด๊ธฐ๋กœ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. ์ง€๊ธˆ ๋ชจ๋“  ๊ฑธ ์ดํ•ดํ•  ํ•„์š”๋Š” ์—†์–ด์š” - ํผ์ฆ์„ ์ง„ํ–‰ํ•˜๋ฉด์„œ ๊ฐ ๊ธฐ๋Šฅ์„ ์ž์„ธํžˆ ์•Œ์•„๋ณผ ๊ฒ๋‹ˆ๋‹ค_. +> ๐Ÿ’ก _TileTensor๊ฐ€ ์–ด๋–ค ์ผ์„ ํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ๋ง›๋ณด๊ธฐ๋กœ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. ์ง€๊ธˆ ๋ชจ๋“  ๊ฑธ ์ดํ•ดํ•  ํ•„์š”๋Š” ์—†์–ด์š” - ํผ์ฆ์„ ์ง„ํ–‰ํ•˜๋ฉด์„œ ๊ฐ ๊ธฐ๋Šฅ์„ ์ž์„ธํžˆ ์•Œ์•„๋ณผ ๊ฒ๋‹ˆ๋‹ค_. ## ๋ฌธ์ œ: ์ ์  ๋ณต์žกํ•ด์ง€๋Š” ์ฝ”๋“œ @@ -32,9 +32,9 @@ if row < height and col < width: output[idx] = a[idx] + 10.0 ``` -## ํ•ด๊ฒฐ์ฑ…: LayoutTensor ๋ฏธ๋ฆฌ๋ณด๊ธฐ +## ํ•ด๊ฒฐ์ฑ…: TileTensor ๋ฏธ๋ฆฌ๋ณด๊ธฐ -LayoutTensor๋Š” ์ด๋Ÿฐ ๋ฌธ์ œ๋“ค์„ ๊น”๋”ํ•˜๊ฒŒ ํ•ด๊ฒฐํ•ด์ค๋‹ˆ๋‹ค. ์•ž์œผ๋กœ ๋ฐฐ์šธ ๋‚ด์šฉ์„ ์‚ด์ง ์—ฟ๋ณด๋ฉด: +TileTensor๋Š” ์ด๋Ÿฐ ๋ฌธ์ œ๋“ค์„ ๊น”๋”ํ•˜๊ฒŒ ํ•ด๊ฒฐํ•ด์ค๋‹ˆ๋‹ค. ์•ž์œผ๋กœ ๋ฐฐ์šธ ๋‚ด์šฉ์„ ์‚ด์ง ์—ฟ๋ณด๋ฉด: 1. **์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ**: ์ˆ˜๋™ ์˜คํ”„์…‹ ๊ณ„์‚ฐ ๋Œ€์‹  `tensor[i, j]` ์‚ฌ์šฉ 2. **์œ ์—ฐํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ**: ํ–‰ ์šฐ์„ , ์—ด ์šฐ์„ , ํƒ€์ผ ๊ตฌ์„ฑ ์ง€์› @@ -42,34 +42,36 @@ LayoutTensor๋Š” ์ด๋Ÿฐ ๋ฌธ์ œ๋“ค์„ ๊น”๋”ํ•˜๊ฒŒ ํ•ด๊ฒฐํ•ด์ค๋‹ˆ๋‹ค. 
์•ž์œผ๋กœ ## ์•ž์œผ๋กœ ๋ฐฐ์šธ ๋‚ด์šฉ ๋ง›๋ณด๊ธฐ -LayoutTensor๊ฐ€ ํ•  ์ˆ˜ ์žˆ๋Š” ์ผ์„ ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์‹œ๋กœ ์‚ดํŽด๋ด…์‹œ๋‹ค. ์ง€๊ธˆ ๋ชจ๋“  ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ์ดํ•ดํ•  ํ•„์š”๋Š” ์—†์Šต๋‹ˆ๋‹ค - ์•ž์œผ๋กœ ๋‚˜์˜ฌ ํผ์ฆ์—์„œ ๊ฐ ๊ธฐ๋Šฅ์„ ๊ผผ๊ผผํžˆ ๋‹ค๋ฃฐ ๊ฑฐ์˜ˆ์š”. +TileTensor๊ฐ€ ํ•  ์ˆ˜ ์žˆ๋Š” ์ผ์„ ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์‹œ๋กœ ์‚ดํŽด๋ด…์‹œ๋‹ค. ์ง€๊ธˆ ๋ชจ๋“  ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ์ดํ•ดํ•  ํ•„์š”๋Š” ์—†์Šต๋‹ˆ๋‹ค - ์•ž์œผ๋กœ ๋‚˜์˜ฌ ํผ์ฆ์—์„œ ๊ฐ ๊ธฐ๋Šฅ์„ ๊ผผ๊ผผํžˆ ๋‹ค๋ฃฐ ๊ฑฐ์˜ˆ์š”. ### ๊ธฐ๋ณธ ์‚ฌ์šฉ ์˜ˆ์‹œ ```mojo -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major # ๋ ˆ์ด์•„์›ƒ ์ •์˜ comptime HEIGHT = 2 comptime WIDTH = 3 -comptime layout = Layout.row_major(HEIGHT, WIDTH) +comptime layout = row_major[HEIGHT, WIDTH]() +comptime LayoutType = type_of(layout) # ํ…์„œ ์ƒ์„ฑ -tensor = LayoutTensor[dtype, layout](buffer.unsafe_ptr()) +tensor = TileTensor(buffer, layout) # ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์š”์†Œ ์ ‘๊ทผ tensor[0, 0] = 1.0 # ์ฒซ ๋ฒˆ์งธ ์š”์†Œ tensor[1, 2] = 2.0 # ๋งˆ์ง€๋ง‰ ์š”์†Œ ``` -`Layout`๊ณผ `LayoutTensor`์— ๋Œ€ํ•ด ๋” ์•Œ์•„๋ณด๋ ค๋ฉด [Mojo ๋งค๋‰ด์–ผ](https://docs.modular.com/mojo/manual/)์˜ ๊ฐ€์ด๋“œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”: +`Layout`๊ณผ `TileTensor`์— ๋Œ€ํ•ด ๋” ์•Œ์•„๋ณด๋ ค๋ฉด [Mojo ๋งค๋‰ด์–ผ](https://docs.modular.com/mojo/manual/)์˜ ๊ฐ€์ด๋“œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”: - [Introduction to layouts](https://docs.modular.com/mojo/manual/layout/layouts) -- [Using LayoutTensor](https://docs.modular.com/mojo/manual/layout/tensors) +- [Using TileTensor](https://docs.modular.com/mojo/manual/layout/tensors) ## ๊ฐ„๋‹จํ•œ ์˜ˆ์ œ -LayoutTensor์˜ ๊ธฐ๋ณธ์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ์˜ˆ์ œ๋กœ ๋ชจ๋“  ๊ฒƒ์„ ์ •๋ฆฌํ•ด๋ด…์‹œ๋‹ค: +TileTensor์˜ ๊ธฐ๋ณธ์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ์˜ˆ์ œ๋กœ ๋ชจ๋“  ๊ฒƒ์„ ์ •๋ฆฌํ•ด๋ด…์‹œ๋‹ค: ```mojo {{#include ../../../../src/puzzle_04/intro.mojo}} @@ -87,28 +89,28 @@ LayoutTensor์˜ ๊ธฐ๋ณธ์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ์˜ˆ์ œ๋กœ ๋ชจ๋“  ๊ฒƒ์„ ์ •๋ฆฌ
```bash -pixi run layout_tensor_intro +pixi run tile_tensor_intro ```
```bash -pixi run -e amd layout_tensor_intro +pixi run -e amd tile_tensor_intro ```
```bash -pixi run -e apple layout_tensor_intro +pixi run -e apple tile_tensor_intro ```
```bash -uv run poe layout_tensor_intro +uv run poe tile_tensor_intro ```
@@ -130,7 +132,7 @@ After: 3. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ์œผ๋กœ ํ•˜๋‚˜์˜ ์š”์†Œ๋ฅผ ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค 4. ๋ณ€๊ฒฝ ์‚ฌํ•ญ์ด ์ถœ๋ ฅ์— ๋ฐ˜์˜๋ฉ๋‹ˆ๋‹ค -์ด ๊ฐ„๋‹จํ•œ ์˜ˆ์ œ๋Š” LayoutTensor์˜ ํ•ต์‹ฌ ์žฅ์ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: +์ด ๊ฐ„๋‹จํ•œ ์˜ˆ์ œ๋Š” TileTensor์˜ ํ•ต์‹ฌ ์žฅ์ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: - ํ…์„œ ์ƒ์„ฑ๊ณผ ์ ‘๊ทผ์„ ์œ„ํ•œ ๊น”๋”ํ•œ ๋ฌธ๋ฒ• - ์ž๋™ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ @@ -143,6 +145,6 @@ After: - ๋ณต์žกํ•œ ํƒ€์ผ๋ง ์ „๋žต - ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ์—ฐ์‚ฐ -LayoutTensor์™€ ํ•จ๊ป˜ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์—ฌ์ •์„ ์‹œ์ž‘ํ•  ์ค€๋น„๊ฐ€ ๋๋‚˜์š”? ํผ์ฆ๋กœ ๋“ค์–ด๊ฐ€๋ด…์‹œ๋‹ค! +TileTensor์™€ ํ•จ๊ป˜ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์—ฌ์ •์„ ์‹œ์ž‘ํ•  ์ค€๋น„๊ฐ€ ๋๋‚˜์š”? ํผ์ฆ๋กœ ๋“ค์–ด๊ฐ€๋ด…์‹œ๋‹ค! ๐Ÿ’ก **ํŒ**: ์ง„ํ–‰ํ•˜๋ฉด์„œ ์ด ์˜ˆ์ œ๋ฅผ ๊ธฐ์–ตํ•ด๋‘์„ธ์š” - ์ด ๊ธฐ๋ณธ ๊ฐœ๋…์„ ๋ฐ”ํƒ•์œผ๋กœ ์ ์  ๋” ์ •๊ตํ•œ GPU ํ”„๋กœ๊ทธ๋žจ์„ ๋งŒ๋“ค์–ด๊ฐˆ ๊ฒ๋‹ˆ๋‹ค. diff --git a/book/i18n/ko/src/puzzle_04/puzzle_04.md b/book/i18n/ko/src/puzzle_04/puzzle_04.md index 522fe51e..d97ce9b4 100644 --- a/book/i18n/ko/src/puzzle_04/puzzle_04.md +++ b/book/i18n/ko/src/puzzle_04/puzzle_04.md @@ -51,12 +51,12 @@ ์ˆ˜๋™์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊ด€๋ฆฌํ•˜๋ฉด์„œ 2D ์ธ๋ฑ์‹ฑ์ด ์–ด๋–ป๊ฒŒ ๋™์ž‘ํ•˜๋Š”์ง€ ์•Œ์•„๋ด…๋‹ˆ๋‹ค. -### [๐Ÿ“š LayoutTensor ์•Œ์•„๋ณด๊ธฐ](./introduction_layout_tensor.md) +### [๐Ÿ“š TileTensor ์•Œ์•„๋ณด๊ธฐ](./introduction_tile_tensor.md) GPU์—์„œ ๋‹ค์ฐจ์› ๋ฐฐ์—ด ์—ฐ์‚ฐ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ฐ„ํŽธํ•˜๊ฒŒ ํ•ด์ฃผ๋Š” ๊ฐ•๋ ฅํ•œ ์ถ”์ƒํ™”๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. -### [๐Ÿš€ ํ˜„๋Œ€์  2D ์—ฐ์‚ฐ](./layout_tensor.md) +### [๐Ÿš€ ํ˜„๋Œ€์  2D ์—ฐ์‚ฐ](./tile_tensor.md) -์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ๊ณผ ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ๊ฐ–์ถ˜ LayoutTensor๋ฅผ ์ง์ ‘ ์จ๋ด…๋‹ˆ๋‹ค. +์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ๊ณผ ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ๊ฐ–์ถ˜ TileTensor๋ฅผ ์ง์ ‘ ์จ๋ด…๋‹ˆ๋‹ค. -๐Ÿ’ก **์ฐธ๊ณ **: ์ด ํผ์ฆ๋ถ€ํ„ฐ๋Š” ๋” ๊น”๋”ํ•˜๊ณ  ์•ˆ์ „ํ•œ GPU ์ฝ”๋“œ๋ฅผ ์œ„ํ•ด LayoutTensor๋ฅผ ์ฃผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. 
+๐Ÿ’ก **์ฐธ๊ณ **: ์ด ํผ์ฆ๋ถ€ํ„ฐ๋Š” ๋” ๊น”๋”ํ•˜๊ณ  ์•ˆ์ „ํ•œ GPU ์ฝ”๋“œ๋ฅผ ์œ„ํ•ด TileTensor๋ฅผ ์ฃผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. diff --git a/book/i18n/ko/src/puzzle_04/layout_tensor.md b/book/i18n/ko/src/puzzle_04/tile_tensor.md similarity index 65% rename from book/i18n/ko/src/puzzle_04/layout_tensor.md rename to book/i18n/ko/src/puzzle_04/tile_tensor.md index 5412e089..75499e48 100644 --- a/book/i18n/ko/src/puzzle_04/layout_tensor.md +++ b/book/i18n/ko/src/puzzle_04/tile_tensor.md @@ -1,10 +1,10 @@ -# LayoutTensor ๋ฒ„์ „ +# TileTensor ๋ฒ„์ „ ## ๊ฐœ์š” -2D _LayoutTensor_ `a`์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D _LayoutTensor_ `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. +2D _TileTensor_ `a`์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D _TileTensor_ `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. **์ฐธ๊ณ **: _์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค_. @@ -12,13 +12,13 @@ ์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: -- 2D ๋ฐฐ์—ด ์ ‘๊ทผ์— `LayoutTensor` ์‚ฌ์šฉํ•˜๊ธฐ +- 2D ๋ฐฐ์—ด ์ ‘๊ทผ์— `TileTensor` ์‚ฌ์šฉํ•˜๊ธฐ - `tensor[i, j]`๋กœ ์ง์ ‘ 2D ์ธ๋ฑ์‹ฑํ•˜๊ธฐ -- `LayoutTensor`์—์„œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ์ฒ˜๋ฆฌํ•˜๊ธฐ +- `TileTensor`์—์„œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ์ฒ˜๋ฆฌํ•˜๊ธฐ -ํ•ต์‹ฌ์€ `LayoutTensor`๊ฐ€ ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์ œ๊ณตํ•˜์—ฌ ๋‚ด๋ถ€ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์„ ์ถ”์ƒํ™”ํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด์„œ๋„ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. +ํ•ต์‹ฌ์€ `TileTensor`๊ฐ€ ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์ œ๊ณตํ•˜์—ฌ ๋‚ด๋ถ€ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์„ ์ถ”์ƒํ™”ํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด์„œ๋„ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. 
-- **2D ์ ‘๊ทผ**: `LayoutTensor`๋กœ ์ž์—ฐ์Šค๋Ÿฌ์šด \\((i,j)\\) ์ธ๋ฑ์‹ฑ +- **2D ์ ‘๊ทผ**: `TileTensor`๋กœ ์ž์—ฐ์Šค๋Ÿฌ์šด \\((i,j)\\) ์ธ๋ฑ์‹ฑ - **๋ฉ”๋ชจ๋ฆฌ ์ถ”์ƒํ™”**: ์ˆ˜๋™ ํ–‰ ์šฐ์„  ๊ณ„์‚ฐ ๋ถˆํ•„์š” - **๊ฐ€๋“œ ์กฐ๊ฑด**: ๋‘ ์ฐจ์› ๋ชจ๋‘ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํ•„์š” - **์Šค๋ ˆ๋“œ ๋ฒ”์œ„**: ์Šค๋ ˆ๋“œ \\((3 \times 3)\\)๊ฐ€ ํ…์„œ ์›์†Œ \\((2 \times 2)\\)๋ณด๋‹ค ๋งŽ์Œ @@ -26,10 +26,10 @@ ## ์™„์„ฑํ•  ์ฝ”๋“œ ```mojo -{{#include ../../../../../problems/p04/p04_layout_tensor.mojo:add_10_2d_layout_tensor}} +{{#include ../../../../../problems/p04/p04_tile_tensor.mojo:add_10_2d_tile_tensor}} ``` -์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p04/p04_layout_tensor.mojo +์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p04/p04_tile_tensor.mojo
ํŒ @@ -57,28 +57,28 @@
```bash -pixi run p04_layout_tensor +pixi run p04_tile_tensor ```
```bash -pixi run -e amd p04_layout_tensor +pixi run -e amd p04_tile_tensor ```
```bash -pixi run -e apple p04_layout_tensor +pixi run -e apple p04_tile_tensor ```
```bash -uv run poe p04_layout_tensor +uv run poe p04_tile_tensor ```
@@ -97,7 +97,7 @@ expected: HostBuffer([10.0, 11.0, 12.0, 13.0]) ```mojo -{{#include ../../../../../solutions/p04/p04_layout_tensor.mojo:add_10_2d_layout_tensor_solution}} +{{#include ../../../../../solutions/p04/p04_tile_tensor.mojo:add_10_2d_tile_tensor_solution}} ```
@@ -106,7 +106,7 @@ expected: HostBuffer([10.0, 11.0, 12.0, 13.0]) - `row = thread_idx.y`, `col = thread_idx.x`๋กœ 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฐ€์ ธ์˜ด - `if row < size and col < size`๋กœ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ ๋ฐฉ์ง€ -- `LayoutTensor`์˜ 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: `output[row, col] = a[row, col] + 10.0` +- `TileTensor`์˜ 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: `output[row, col] = a[row, col] + 10.0`
diff --git a/book/i18n/ko/src/puzzle_05/layout_tensor.md b/book/i18n/ko/src/puzzle_05/layout_tensor.md deleted file mode 100644 index 319ef21c..00000000 --- a/book/i18n/ko/src/puzzle_05/layout_tensor.md +++ /dev/null @@ -1,130 +0,0 @@ - - -# LayoutTensor ๋ฒ„์ „ - -## ๊ฐœ์š” - -1D LayoutTensor `a`์™€ `b`๋ฅผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ ๋”ํ•ด 2D LayoutTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. - -**์ฐธ๊ณ **: _์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค._ - -## ํ•ต์‹ฌ ๊ฐœ๋… - -์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: - -- ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ์— `LayoutTensor` ์‚ฌ์šฉํ•˜๊ธฐ -- ์„œ๋กœ ๋‹ค๋ฅธ ํ…์„œ ํฌ๊ธฐ ๋‹ค๋ฃจ๊ธฐ -- `LayoutTensor`๋กœ 2D ์ธ๋ฑ์‹ฑ ์ฒ˜๋ฆฌํ•˜๊ธฐ - -ํ•ต์‹ฌ์€ `LayoutTensor`๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ํ…์„œ ํฌ๊ธฐ \\((1, n)\\)์™€ \\((n, 1)\\)์„ \\((n,n)\\)์œผ๋กœ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด์„œ๋„ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. - -- **ํ…์„œ ํฌ๊ธฐ**: ์ž…๋ ฅ ๋ฒกํ„ฐ์˜ ํฌ๊ธฐ๋Š” \\((1, n)\\)๊ณผ \\((n, 1)\\) -- **๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ**: ๋‘ ์ฐจ์›์„ ๊ฒฐํ•ฉํ•ด \\((n,n)\\) ์ถœ๋ ฅ ์ƒ์„ฑ -- **๊ฐ€๋“œ ์กฐ๊ฑด**: ์ถœ๋ ฅ ํฌ๊ธฐ์— ๋Œ€ํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š” -- **์Šค๋ ˆ๋“œ ๋ฒ”์œ„**: ํ…์„œ ์›์†Œ \\((2 \times 2)\\)๋ณด๋‹ค ์Šค๋ ˆ๋“œ \\((3 \times 3)\\)๊ฐ€ ๋งŽ์Œ - -## ์™„์„ฑํ•  ์ฝ”๋“œ - -```mojo -{{#include ../../../../../problems/p05/p05_layout_tensor.mojo:broadcast_add_layout_tensor}} -``` - -์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p05/p05_layout_tensor.mojo - -
-ํŒ - -
- -1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: `row = thread_idx.y`, `col = thread_idx.x` -2. ๊ฐ€๋“œ ์ถ”๊ฐ€: `if row < size and col < size` -3. ๊ฐ€๋“œ ๋‚ด๋ถ€: LayoutTensor๋กœ `a`์™€ `b` ๊ฐ’์„ ์–ด๋–ป๊ฒŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ• ์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š” - -
-
- -## ์ฝ”๋“œ ์‹คํ–‰ - -์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: - -
-
- - - - -
-
- -```bash -pixi run p05_layout_tensor -``` - -
-
- -```bash -pixi run -e amd p05_layout_tensor -``` - -
-
- -```bash -pixi run -e apple p05_layout_tensor -``` - -
-
- -```bash -uv run poe p05_layout_tensor -``` - -
-
- -ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([1.0, 2.0, 11.0, 12.0]) -``` - -## ์†”๋ฃจ์…˜ - -
- - -```mojo -{{#include ../../../../../solutions/p05/p05_layout_tensor.mojo:broadcast_add_layout_tensor_solution}} -``` - -
- -LayoutTensor ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์™€ GPU ์Šค๋ ˆ๋“œ ๋งคํ•‘์˜ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค: - -1. **์Šค๋ ˆ๋“œ์—์„œ ํ–‰๋ ฌ๋กœ ๋งคํ•‘** - - - `thread_idx.y`๋กœ ํ–‰, `thread_idx.x`๋กœ ์—ด์— ์ ‘๊ทผ - - ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ์ด ์ถœ๋ ฅ ํ–‰๋ ฌ ๊ตฌ์กฐ์™€ ์ผ์น˜ - - ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ(3ร—3 ๊ทธ๋ฆฌ๋“œ)๋Š” ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ์ฒ˜๋ฆฌ - -2. **๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ž‘๋™ ๋ฐฉ์‹** - - ์ž…๋ ฅ `a`์˜ ํฌ๊ธฐ๋Š” `(1,n)`: `a[0,col]`์ด ํ–‰์„ ๊ฐ€๋กœ์งˆ๋Ÿฌ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ - - ์ž…๋ ฅ `b`์˜ ํฌ๊ธฐ๋Š” `(n,1)`: `b[row,0]`์ด ์—ด์„ ๊ฐ€๋กœ์งˆ๋Ÿฌ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ - - ์ถœ๋ ฅ์˜ ํฌ๊ธฐ๋Š” `(n,n)`: ๊ฐ ์›์†Œ๋Š” ํ•ด๋‹น ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’๋“ค์˜ ํ•ฉ - - ```txt - [ a0 a1 ] + [ b0 ] = [ a0+b0 a1+b0 ] - [ b1 ] [ a0+b1 a1+b1 ] - ``` - -3. **๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ** - - ๊ฐ€๋“œ ์กฐ๊ฑด `row < size and col < size`๋กœ ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ ๋ฐฉ์ง€ - - ํ–‰๋ ฌ ๋ฒ”์œ„์™€ ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ - - ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋•๋ถ„์— `a`์™€ `b`์— ๋Œ€ํ•œ ๋ณ„๋„ ๊ฒ€์‚ฌ ๋ถˆํ•„์š” - -์ด ํŒจํ„ด์€ ์ดํ›„ ํผ์ฆ์—์„œ ๋‹ค๋ฃฐ ๋” ๋ณต์žกํ•œ ํ…์„œ ์—ฐ์‚ฐ์˜ ๊ธฐ์ดˆ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. -
-
diff --git a/book/i18n/ko/src/puzzle_05/puzzle_05.md b/book/i18n/ko/src/puzzle_05/puzzle_05.md index 766ebef3..f572c2b8 100644 --- a/book/i18n/ko/src/puzzle_05/puzzle_05.md +++ b/book/i18n/ko/src/puzzle_05/puzzle_05.md @@ -4,9 +4,7 @@ ## ๊ฐœ์š” -๋ฒกํ„ฐ `a`์™€ `b`๋ฅผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ(broadcast)๋กœ ๋”ํ•ด 2D ํ–‰๋ ฌ `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. - -๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ **๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ**๋ž€ ์š”์†Œ๋ณ„ ์—ฐ์‚ฐ์„ ํ•  ๋•Œ ์ €์ฐจ์› ๋ฐฐ์—ด์„ ๊ณ ์ฐจ์› ๋ฐฐ์—ด์˜ ํ˜•์ƒ์— ๋งž๊ฒŒ ์ž๋™์œผ๋กœ ํ™•์žฅํ•˜๋Š” ๊ฒƒ์„ ๋งํ•ฉ๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ ๋ฉ”๋ชจ๋ฆฌ์— ๋ฐ์ดํ„ฐ๋ฅผ ๋ณต์ œํ•˜์ง€ ์•Š๊ณ , ์ถ”๊ฐ€ ์ฐจ์›์— ๊ฑธ์ณ ๊ฐ’์„ ๋…ผ๋ฆฌ์ ์œผ๋กœ ๋ฐ˜๋ณตํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, 2D ํ–‰๋ ฌ์˜ ๊ฐ ํ–‰(๋˜๋Š” ์—ด)์— 1D ๋ฒกํ„ฐ๋ฅผ ๋”ํ•  ๋•Œ ๋ฒกํ„ฐ๋ฅผ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ณต์‚ฌํ•˜์ง€ ์•Š์•„๋„ ๊ฐ™์€ ์š”์†Œ๊ฐ€ ์ž๋™์œผ๋กœ ๋ฐ˜๋ณต ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. +1D TileTensor `a`์™€ `b`๋ฅผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ ๋”ํ•ด 2D TileTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. **์ฐธ๊ณ **: _์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค._ @@ -15,19 +13,121 @@ ## ํ•ต์‹ฌ ๊ฐœ๋… -- ๋ฒกํ„ฐ๋ฅผ ํ–‰๋ ฌ๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•˜๊ธฐ -- 2D ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ -- ์„œ๋กœ ๋‹ค๋ฅธ ์ฐจ์› ๊ฐ„ ์—ฐ์‚ฐ -- ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ํŒจํ„ด +์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: + +- ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ์— `TileTensor` ์‚ฌ์šฉํ•˜๊ธฐ +- ์„œ๋กœ ๋‹ค๋ฅธ ํ…์„œ ํฌ๊ธฐ ๋‹ค๋ฃจ๊ธฐ +- `TileTensor`๋กœ 2D ์ธ๋ฑ์‹ฑ ์ฒ˜๋ฆฌํ•˜๊ธฐ + +ํ•ต์‹ฌ์€ `TileTensor`๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ํ…์„œ ํฌ๊ธฐ \\((1, n)\\)์™€ \\((n, 1)\\)์„ \\((n,n)\\)์œผ๋กœ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด์„œ๋„ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. 
+ +- **ํ…์„œ ํฌ๊ธฐ**: ์ž…๋ ฅ ๋ฒกํ„ฐ์˜ ํฌ๊ธฐ๋Š” \\((1, n)\\)๊ณผ \\((n, 1)\\) +- **๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ**: ๋‘ ์ฐจ์›์„ ๊ฒฐํ•ฉํ•ด \\((n,n)\\) ์ถœ๋ ฅ ์ƒ์„ฑ +- **๊ฐ€๋“œ ์กฐ๊ฑด**: ์ถœ๋ ฅ ํฌ๊ธฐ์— ๋Œ€ํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š” +- **์Šค๋ ˆ๋“œ ๋ฒ”์œ„**: ํ…์„œ ์›์†Œ \\((2 \times 2)\\)๋ณด๋‹ค ์Šค๋ ˆ๋“œ \\((3 \times 3)\\)๊ฐ€ ๋งŽ์Œ + +## ์™„์„ฑํ•  ์ฝ”๋“œ + +```mojo +{{#include ../../../../../problems/p05/p05.mojo:broadcast_add}} +``` + +์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p05/p05.mojo + +
+ํŒ + +
+ +1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: `row = thread_idx.y`, `col = thread_idx.x` +2. ๊ฐ€๋“œ ์ถ”๊ฐ€: `if row < size and col < size` +3. ๊ฐ€๋“œ ๋‚ด๋ถ€: TileTensor๋กœ `a`์™€ `b` ๊ฐ’์„ ์–ด๋–ป๊ฒŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ• ์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š” + +
+
+ +## ์ฝ”๋“œ ์‹คํ–‰ + +์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: + +
+
+ + + + +
+
+ +```bash +pixi run p05 +``` + +
+
+ +```bash +pixi run -e amd p05 +``` + +
+
+ +```bash +pixi run -e apple p05 +``` + +
+
+ +```bash +uv run poe p05 +``` + +
+
+ +ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค: + +```txt +out: HostBuffer([0.0, 0.0, 0.0, 0.0]) +expected: HostBuffer([1.0, 2.0, 11.0, 12.0]) +``` + +## ์†”๋ฃจ์…˜ + +
+ + +```mojo +{{#include ../../../../../solutions/p05/p05.mojo:broadcast_add_solution}} +``` + +
+ +TileTensor ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์™€ GPU ์Šค๋ ˆ๋“œ ๋งคํ•‘์˜ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค: -## ๊ตฌํ˜„ ๋ฐฉ์‹ +1. **์Šค๋ ˆ๋“œ์—์„œ ํ–‰๋ ฌ๋กœ ๋งคํ•‘** -### [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./raw.md) + - `thread_idx.y`๋กœ ํ–‰, `thread_idx.x`๋กœ ์—ด์— ์ ‘๊ทผ + - ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ์ด ์ถœ๋ ฅ ํ–‰๋ ฌ ๊ตฌ์กฐ์™€ ์ผ์น˜ + - ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ(3ร—3 ๊ทธ๋ฆฌ๋“œ)๋Š” ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ์ฒ˜๋ฆฌ -์ˆ˜๋™ ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค. +2. **๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ž‘๋™ ๋ฐฉ์‹** + - ์ž…๋ ฅ `a`์˜ ํฌ๊ธฐ๋Š” `(1,n)`: `a[0,col]`์ด ํ–‰์„ ๊ฐ€๋กœ์งˆ๋Ÿฌ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ + - ์ž…๋ ฅ `b`์˜ ํฌ๊ธฐ๋Š” `(n,1)`: `b[row,0]`์ด ์—ด์„ ๊ฐ€๋กœ์งˆ๋Ÿฌ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ + - ์ถœ๋ ฅ์˜ ํฌ๊ธฐ๋Š” `(n,n)`: ๊ฐ ์›์†Œ๋Š” ํ•ด๋‹น ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’๋“ค์˜ ํ•ฉ -### [๐Ÿ“ LayoutTensor ๋ฒ„์ „](./layout_tensor.md) + ```txt + [ a0 a1 ] + [ b0 ] = [ a0+b0 a1+b0 ] + [ b1 ] [ a0+b1 a1+b1 ] + ``` -์„œ๋กœ ๋‹ค๋ฅธ ์ฐจ์› ๊ฐ„ ์—ฐ์‚ฐ์„ LayoutTensor๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. +3. **๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ** + - ๊ฐ€๋“œ ์กฐ๊ฑด `row < size and col < size`๋กœ ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ ๋ฐฉ์ง€ + - ํ–‰๋ ฌ ๋ฒ”์œ„์™€ ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ + - ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋•๋ถ„์— `a`์™€ `b`์— ๋Œ€ํ•œ ๋ณ„๋„ ๊ฒ€์‚ฌ ๋ถˆํ•„์š” -๐Ÿ’ก **์ฐธ๊ณ **: ์ˆ˜๋™ ์ธ๋ฑ์‹ฑ๊ณผ ๋น„๊ตํ–ˆ์„ ๋•Œ LayoutTensor๊ฐ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์ฃผ๋Š”์ง€ ํ™•์ธํ•ด ๋ณด์„ธ์š”. +์ด ํŒจํ„ด์€ ์ดํ›„ ํผ์ฆ์—์„œ ๋‹ค๋ฃฐ ๋” ๋ณต์žกํ•œ ํ…์„œ ์—ฐ์‚ฐ์˜ ๊ธฐ์ดˆ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. +
+
diff --git a/book/i18n/ko/src/puzzle_05/raw.md b/book/i18n/ko/src/puzzle_05/raw.md deleted file mode 100644 index a6576518..00000000 --- a/book/i18n/ko/src/puzzle_05/raw.md +++ /dev/null @@ -1,127 +0,0 @@ - - -## ๊ฐœ์š” - -๋ฒกํ„ฐ `a`์™€ `b`๋ฅผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ ๋”ํ•ด 2D ํ–‰๋ ฌ `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. - -**์ฐธ๊ณ **: _์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค._ - -## ํ•ต์‹ฌ ๊ฐœ๋… - -์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: - -- 1D ๋ฒกํ„ฐ๋ฅผ ๊ฐ๊ฐ ๋‹ค๋ฅธ ์ฐจ์› ๋ฐฉํ–ฅ์œผ๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•˜๊ธฐ -- 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ ์ˆ˜ํ–‰ํ•˜๊ธฐ -- ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด์—์„œ ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌํ•˜๊ธฐ - -ํ•ต์‹ฌ์€ ๋‘ 1D ๋ฒกํ„ฐ์˜ ์›์†Œ๋“ค์„ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ 2D ์ถœ๋ ฅ ํ–‰๋ ฌ์— ๋งคํ•‘ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๊ณ , ์Šค๋ ˆ๋“œ ๊ฒฝ๊ณ„๋ฅผ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. - -- **๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ**: `a`์˜ ๊ฐ ์›์†Œ๊ฐ€ `b`์˜ ๊ฐ ์›์†Œ์™€ ๊ฒฐํ•ฉ -- **์Šค๋ ˆ๋“œ ๋งคํ•‘**: \\(2 \times 2\\) ์ถœ๋ ฅ์— \\((3 \times 3)\\) ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ ์‚ฌ์šฉ -- **๋ฒกํ„ฐ ์ ‘๊ทผ**: `a`์™€ `b`๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์ ‘๊ทผ ํŒจํ„ด ์‚ฌ์šฉ -- **๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ**: ํ–‰๋ ฌ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋Š” ์Šค๋ ˆ๋“œ๋ฅผ ๊ฐ€๋“œ๋กœ ์ฒ˜๋ฆฌ - -## ์™„์„ฑํ•  ์ฝ”๋“œ - -```mojo -{{#include ../../../../../problems/p05/p05.mojo:broadcast_add}} -``` - -์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p05/p05.mojo - -
-ํŒ - -
- -1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: `row = thread_idx.y`, `col = thread_idx.x` -2. ๊ฐ€๋“œ ์ถ”๊ฐ€: `if row < size and col < size` -3. ๊ฐ€๋“œ ๋‚ด๋ถ€: `a`์™€ `b` ๊ฐ’์„ ์–ด๋–ป๊ฒŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ• ์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š” - -
-
- -## ์ฝ”๋“œ ์‹คํ–‰ - -์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: - -
-
- - - - -
-
- -```bash -pixi run p05 -``` - -
-
- -```bash -pixi run -e amd p05 -``` - -
-
- -```bash -pixi run -e apple p05 -``` - -
-
- -```bash -uv run poe p05 -``` - -
-
- -ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([1.0, 2.0, 11.0, 12.0]) -``` - -## ์†”๋ฃจ์…˜ - -
- - -```mojo -{{#include ../../../../../solutions/p05/p05.mojo:broadcast_add_solution}} -``` - -
- -LayoutTensor ์ถ”์ƒํ™” ์—†์ด GPU ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์˜ ๊ธฐ๋ณธ ๊ฐœ๋…์„ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค: - -1. **์Šค๋ ˆ๋“œ์—์„œ ํ–‰๋ ฌ๋กœ ๋งคํ•‘** - - `thread_idx.y`๋กœ ํ–‰, `thread_idx.x`๋กœ ์—ด์— ์ ‘๊ทผ - - 2D ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ๋ฅผ ์ถœ๋ ฅ ํ–‰๋ ฌ ์›์†Œ์— ์ง์ ‘ ๋งคํ•‘ - - 3ร—3 ๊ทธ๋ฆฌ๋“œ์˜ ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ๋ฅผ 2ร—2 ์ถœ๋ ฅ์— ๋งž๊ฒŒ ์ฒ˜๋ฆฌ - -2. **๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ž‘๋™ ๋ฐฉ์‹** - - ๋ฒกํ„ฐ `a`๋Š” ์ˆ˜ํ‰ ๋ฐฉํ–ฅ์œผ๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ: ๊ฐ ํ–‰์—์„œ ๋™์ผํ•œ `a[col]` ์‚ฌ์šฉ - - ๋ฒกํ„ฐ `b`๋Š” ์ˆ˜์ง ๋ฐฉํ–ฅ์œผ๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ: ๊ฐ ์—ด์—์„œ ๋™์ผํ•œ `b[row]` ์‚ฌ์šฉ - - ๋‘ ๋ฒกํ„ฐ๋ฅผ ๋”ํ•ด ์ถœ๋ ฅ ์ƒ์„ฑ - - ```txt - [ a0 a1 ] + [ b0 ] = [ a0+b0 a1+b0 ] - [ b1 ] [ a0+b1 a1+b1 ] - ``` - -3. **๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ** - - ๋‹จ์ผ ๊ฐ€๋“œ ์กฐ๊ฑด `row < size and col < size`๋กœ ๋‘ ์ฐจ์› ๋ชจ๋‘ ์ฒ˜๋ฆฌ - - ์ž…๋ ฅ ๋ฒกํ„ฐ์™€ ์ถœ๋ ฅ ํ–‰๋ ฌ์˜ ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ ๋ฐฉ์ง€ - - 3ร—3 ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ๊ฐ€ 2ร—2 ๋ฐ์ดํ„ฐ๋ณด๋‹ค ํฌ๋ฏ€๋กœ ๋ฐ˜๋“œ์‹œ ํ•„์š” - -LayoutTensor ๋ฒ„์ „๊ณผ ๋น„๊ตํ•ด์„œ ๋™์ผํ•œ ๊ธฐ๋ณธ ๊ฐœ๋…์„ ์œ ์ง€ํ•˜๋ฉด์„œ ์ถ”์ƒํ™”๊ฐ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ์„ ์–ผ๋งˆ๋‚˜ ๋‹จ์ˆœํ•˜๊ฒŒ ๋งŒ๋“œ๋Š”์ง€ ํ™•์ธํ•ด ๋ณด์„ธ์š”. -
-
diff --git a/book/i18n/ko/src/puzzle_06/puzzle_06.md b/book/i18n/ko/src/puzzle_06/puzzle_06.md index 22c1e080..08909897 100644 --- a/book/i18n/ko/src/puzzle_06/puzzle_06.md +++ b/book/i18n/ko/src/puzzle_06/puzzle_06.md @@ -29,7 +29,7 @@ ์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p06/p06.mojo -> ์ฐธ๊ณ : ์ด ํผ์ฆ์˜ `LayoutTensor` ๋ฒ„์ „์€ ๊ฑฐ์˜ ๋™์ผํ•˜๋ฏ€๋กœ ๋…์ž์—๊ฒŒ ๋งก๊น๋‹ˆ๋‹ค. +> ์ฐธ๊ณ : ์ด ํผ์ฆ์˜ `TileTensor` ๋ฒ„์ „์€ ๊ฑฐ์˜ ๋™์ผํ•˜๋ฏ€๋กœ ๋…์ž์—๊ฒŒ ๋งก๊น๋‹ˆ๋‹ค.
ํŒ diff --git a/book/i18n/ko/src/puzzle_07/layout_tensor.md b/book/i18n/ko/src/puzzle_07/layout_tensor.md deleted file mode 100644 index 1c983fd8..00000000 --- a/book/i18n/ko/src/puzzle_07/layout_tensor.md +++ /dev/null @@ -1,160 +0,0 @@ - - -# LayoutTensor ๋ฒ„์ „ - -## ๊ฐœ์š” - -2D LayoutTensor `a`์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D LayoutTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. - -**์ฐธ๊ณ :** _๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ `a`์˜ ํ–‰๊ณผ ์—ด ํฌ๊ธฐ๋ณด๋‹ค ๋ชจ๋‘ ์ž‘์Šต๋‹ˆ๋‹ค._ - -## ํ•ต์‹ฌ ๊ฐœ๋… - -์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: - -- ์—ฌ๋Ÿฌ ๋ธ”๋ก๊ณผ ํ•จ๊ป˜ `LayoutTensor` ์‚ฌ์šฉํ•˜๊ธฐ -- 2D ๋ธ”๋ก ๊ตฌ์„ฑ์œผ๋กœ ํฐ ํ–‰๋ ฌ ์ฒ˜๋ฆฌํ•˜๊ธฐ -- ๋ธ”๋ก ์ธ๋ฑ์‹ฑ๊ณผ `LayoutTensor` ์ ‘๊ทผ ๊ฒฐํ•ฉํ•˜๊ธฐ - -ํ•ต์‹ฌ์€ `LayoutTensor`๊ฐ€ 2D ์ธ๋ฑ์‹ฑ์„ ๋‹จ์ˆœํ™”ํ•ด ์ฃผ์ง€๋งŒ, ํฐ ํ–‰๋ ฌ์—์„œ๋Š” ์—ฌ์ „ํžˆ ๋ธ”๋ก ๊ฐ„ ์กฐ์œจ์ด ํ•„์š”ํ•˜๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. - -## ๊ตฌ์„ฑ - -- **ํ–‰๋ ฌ ํฌ๊ธฐ**: \\(5 \times 5\\) ์›์†Œ -- **๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ**: `LayoutTensor`๊ฐ€ ํ–‰ ์šฐ์„  ๊ตฌ์„ฑ ๊ด€๋ฆฌ -- **๋ธ”๋ก ์กฐ์œจ**: ์—ฌ๋Ÿฌ ๋ธ”๋ก์œผ๋กœ ์ „์ฒด ํ–‰๋ ฌ ์ปค๋ฒ„ -- **2D ์ธ๋ฑ์‹ฑ**: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ํ•จ๊ป˜ ์ž์—ฐ์Šค๋Ÿฌ์šด \\((i,j)\\) ์ ‘๊ทผ -- **์ด ์Šค๋ ˆ๋“œ ์ˆ˜**: \\(25\\)๊ฐœ ์›์†Œ์— ๋Œ€ํ•ด \\(36\\)๊ฐœ -- **์Šค๋ ˆ๋“œ ๋งคํ•‘**: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ–‰๋ ฌ ์›์†Œ ํ•˜๋‚˜์”ฉ ์ฒ˜๋ฆฌ - -## ์™„์„ฑํ•  ์ฝ”๋“œ - -```mojo -{{#include ../../../../../problems/p07/p07_layout_tensor.mojo:add_10_blocks_2d_layout_tensor}} -``` - -์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p07/p07_layout_tensor.mojo - -
-ํŒ - -
- -1. ์ „์—ญ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ: `row = block_dim.y * block_idx.y + thread_idx.y`, `col = block_dim.x * block_idx.x + thread_idx.x` -2. ๊ฐ€๋“œ ์ถ”๊ฐ€: `if row < size and col < size` -3. ๊ฐ€๋“œ ๋‚ด๋ถ€: 2D LayoutTensor์— 10์„ ๋”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š” - -
-
- -## ์ฝ”๋“œ ์‹คํ–‰ - -์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: - -
-
- - - - -
-
- -```bash -pixi run p07_layout_tensor -``` - -
-
- -```bash -pixi run -e amd p07_layout_tensor -``` - -
-
- -```bash -pixi run -e apple p07_layout_tensor -``` - -
-
- -```bash -uv run poe p07_layout_tensor -``` - -
-
- -ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0]) -expected: HostBuffer([10.0, 11.0, 12.0, ... , 34.0]) -``` - -## ์†”๋ฃจ์…˜ - -
- - -```mojo -{{#include ../../../../../solutions/p07/p07_layout_tensor.mojo:add_10_blocks_2d_layout_tensor_solution}} -``` - -
- -LayoutTensor๊ฐ€ 2D ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค: - -1. **2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ** - - ์ „์—ญ ํ–‰(row): `block_dim.y * block_idx.y + thread_idx.y` - - ์ „์—ญ ์—ด(col): `block_dim.x * block_idx.x + thread_idx.x` - - ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ๋ฅผ ํ…์„œ ์›์†Œ์— ๋งคํ•‘: - - ```txt - 3ร—3 ๋ธ”๋ก์œผ๋กœ ๊ตฌ์„ฑ๋œ 5ร—5 ํ…์„œ: - - Block (0,0) Block (1,0) - [(0,0) (0,1) (0,2)] [(0,3) (0,4) * ] - [(1,0) (1,1) (1,2)] [(1,3) (1,4) * ] - [(2,0) (2,1) (2,2)] [(2,3) (2,4) * ] - - Block (0,1) Block (1,1) - [(3,0) (3,1) (3,2)] [(3,3) (3,4) * ] - [(4,0) (4,1) (4,2)] [(4,3) (4,4) * ] - [ * * * ] [ * * * ] - ``` - - (* = ์Šค๋ ˆ๋“œ๋Š” ์กด์žฌํ•˜์ง€๋งŒ ํ…์„œ ๊ฒฝ๊ณ„ ๋ฐ–) - -2. **LayoutTensor์˜ ์žฅ์ ** - - ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ: ์ˆ˜๋™ ์˜คํ”„์…‹ ๊ณ„์‚ฐ ๋Œ€์‹  `tensor[row, col]` ์‚ฌ์šฉ - - ์ž๋™ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™” - - ์ ‘๊ทผ ํŒจํ„ด ์˜ˆ์‹œ: - - ```txt - ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ: LayoutTensor: - row * size + col tensor[row, col] - (2,1) -> 11 (2,1) -> ๊ฐ™์€ ์›์†Œ - ``` - -3. **๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ** - - ๊ฐ€๋“œ `row < size and col < size`๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ์ƒํ™ฉ: - - ๋ถ€๋ถ„ ๋ธ”๋ก์—์„œ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋Š” ์Šค๋ ˆ๋“œ - - ํ…์„œ ๊ฒฝ๊ณ„์˜ ์—ฃ์ง€ ์ผ€์ด์Šค - - ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์€ LayoutTensor๊ฐ€ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ - - 25๊ฐœ ์›์†Œ๋ฅผ 36๊ฐœ ์Šค๋ ˆ๋“œ๋กœ ์ฒ˜๋ฆฌ (3ร—3 ๋ธ”๋ก์˜ 2ร—2 ๊ทธ๋ฆฌ๋“œ) - -4. **๋ธ”๋ก ์กฐ์œจ** - - ๊ฐ 3ร—3 ๋ธ”๋ก์ด 5ร—5 ํ…์„œ์˜ ์ผ๋ถ€๋ถ„์„ ๋‹ด๋‹น - - LayoutTensor๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ถ€๋ถ„: - - ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™” - - ํšจ์œจ์ ์ธ ์ ‘๊ทผ ํŒจํ„ด - - ๋ธ”๋ก ๊ฒฝ๊ณ„ ๊ฐ„ ์กฐ์œจ - - ์บ์‹œ ์นœํ™”์  ๋ฐ์ดํ„ฐ ์ ‘๊ทผ - -์ด ํŒจํ„ด์€ LayoutTensor๊ฐ€ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์Šค๋ ˆ๋“œ ์กฐ์œจ์„ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ 2D ๋ธ”๋ก ์ฒ˜๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. -
-
diff --git a/book/i18n/ko/src/puzzle_07/puzzle_07.md b/book/i18n/ko/src/puzzle_07/puzzle_07.md index 882da4d3..3cb2b593 100644 --- a/book/i18n/ko/src/puzzle_07/puzzle_07.md +++ b/book/i18n/ko/src/puzzle_07/puzzle_07.md @@ -4,7 +4,7 @@ ## ๊ฐœ์š” -ํ–‰๋ ฌ `a`์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. +2D TileTensor `a`์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D TileTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. **์ฐธ๊ณ :** _๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ `a`์˜ ํ–‰๊ณผ ์—ด ํฌ๊ธฐ๋ณด๋‹ค ๋ชจ๋‘ ์ž‘์Šต๋‹ˆ๋‹ค._ @@ -13,10 +13,13 @@ ## ํ•ต์‹ฌ ๊ฐœ๋… -- ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ -- ๊ทธ๋ฆฌ๋“œ์™€ ๋ธ”๋ก์˜ ์กฐ์œจ -- ์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์นœ ์ธ๋ฑ์‹ฑ -- ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด +์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: + +- ์—ฌ๋Ÿฌ ๋ธ”๋ก๊ณผ ํ•จ๊ป˜ `TileTensor` ์‚ฌ์šฉํ•˜๊ธฐ +- 2D ๋ธ”๋ก ๊ตฌ์„ฑ์œผ๋กœ ํฐ ํ–‰๋ ฌ ์ฒ˜๋ฆฌํ•˜๊ธฐ +- ๋ธ”๋ก ์ธ๋ฑ์‹ฑ๊ณผ `TileTensor` ์ ‘๊ทผ ๊ฒฐํ•ฉํ•˜๊ธฐ + +ํ•ต์‹ฌ์€ `TileTensor`๊ฐ€ 2D ์ธ๋ฑ์‹ฑ์„ ๋‹จ์ˆœํ™”ํ•ด ์ฃผ์ง€๋งŒ, ํฐ ํ–‰๋ ฌ์—์„œ๋Š” ์—ฌ์ „ํžˆ ๋ธ”๋ก ๊ฐ„ ์กฐ์œจ์ด ํ•„์š”ํ•˜๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. > ๐Ÿ”‘ **2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ ๋ฐฉ์‹** > @@ -47,14 +50,143 @@ > - ๋ธ”๋ก ๊ฐ„ ๊ฒน์นจ ์—†์Œ > - ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด -## ๊ตฌํ˜„ ๋ฐฉ์‹ +## ๊ตฌ์„ฑ + +- **ํ–‰๋ ฌ ํฌ๊ธฐ**: \\(5 \times 5\\) ์›์†Œ +- **๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ**: `TileTensor`๊ฐ€ ํ–‰ ์šฐ์„  ๊ตฌ์„ฑ ๊ด€๋ฆฌ +- **๋ธ”๋ก ์กฐ์œจ**: ์—ฌ๋Ÿฌ ๋ธ”๋ก์œผ๋กœ ์ „์ฒด ํ–‰๋ ฌ ์ปค๋ฒ„ +- **2D ์ธ๋ฑ์‹ฑ**: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ํ•จ๊ป˜ ์ž์—ฐ์Šค๋Ÿฌ์šด \\((i,j)\\) ์ ‘๊ทผ +- **์ด ์Šค๋ ˆ๋“œ ์ˆ˜**: \\(25\\)๊ฐœ ์›์†Œ์— ๋Œ€ํ•ด \\(36\\)๊ฐœ +- **์Šค๋ ˆ๋“œ ๋งคํ•‘**: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ–‰๋ ฌ ์›์†Œ ํ•˜๋‚˜์”ฉ ์ฒ˜๋ฆฌ + +## ์™„์„ฑํ•  ์ฝ”๋“œ + +```mojo +{{#include ../../../../../problems/p07/p07.mojo:add_10_blocks_2d}} +``` + +์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p07/p07.mojo + +
+ํŒ + +
+ +1. ์ „์—ญ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ: `row = block_dim.y * block_idx.y + thread_idx.y`, `col = block_dim.x * block_idx.x + thread_idx.x` +2. ๊ฐ€๋“œ ์ถ”๊ฐ€: `if row < size and col < size` +3. ๊ฐ€๋“œ ๋‚ด๋ถ€: 2D TileTensor์— 10์„ ๋”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š” + +
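
ํŒ 1๋ฒˆ์˜ ์ „์—ญ ์ธ๋ฑ์Šค ๊ณต์‹์€ ๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ์œผ๋กœ๋„ ์ฝ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: ์–ด๋–ค ์›์†Œ \\((row, col)\\)์„ ์–ด๋Š ๋ธ”๋ก์˜ ์–ด๋Š ์Šค๋ ˆ๋“œ๊ฐ€ ๋‹ด๋‹นํ•˜๋Š”์ง€ ์—ญ์‚ฐํ•  ์ˆ�˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” Mojo ์ปค๋„์ด ์•„๋‹ˆ๋ผ ๊ณต์‹ ํ™•์ธ์šฉ Python ์Šค์ผ€์น˜์ด๋ฉฐ, `TPB = 3`์€ ๋ณธ๋ฌธ์˜ \\(3 \times 3\\) ๋ธ”๋ก ๊ตฌ์„ฑ์„ ๊ทธ๋Œ€๋กœ ๊ฐ€์ •ํ•œ ๊ฐ’์ž…๋‹ˆ๋‹ค:

```python
TPB = 3  # ๋ธ”๋ก๋‹น 3x3 ์Šค๋ ˆ๋“œ (๋ณธ๋ฌธ ๊ตฌ์„ฑ ๊ฐ€์ •)

def owner(row, col):
    """(row, col) ์›์†Œ๋ฅผ ๋‹ด๋‹นํ•˜๋Š” (block_idx, thread_idx)๋ฅผ ์—ญ์‚ฐํ•ฉ๋‹ˆ๋‹ค."""
    block = (col // TPB, row // TPB)   # (block_idx.x, block_idx.y)
    thread = (col % TPB, row % TPB)    # (thread_idx.x, thread_idx.y)
    return block, thread

# ์ •๋ฐฉํ–ฅ ๊ณต์‹๊ณผ์˜ ์™•๋ณต ๊ฒ€์ฆ: row = TPB * block_idx.y + thread_idx.y
for row in range(5):
    for col in range(5):
        (bx, by), (tx, ty) = owner(row, col)
        assert row == TPB * by + ty and col == TPB * bx + tx

print(owner(4, 3))  # ((1, 1), (0, 1)) -> Block (1,1)์˜ thread (0,1)
```

์ด ์—ญ์‚ฐ์ด ์„ฑ๋ฆฝํ•˜๋ฏ€๋กœ ๊ฐ ์›์†Œ๋Š” ์ •ํ™•ํžˆ ํ•œ ์Šค๋ ˆ๋“œ์—๋งŒ ๋Œ€์‘๋˜๊ณ , ๋ธ”๋ก ๊ฐ„ ๊ฒน์นจ์ด ์—†์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.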
+
+ +## ์ฝ”๋“œ ์‹คํ–‰ + +์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: + +
+
+ + + + +
+
+ +```bash +pixi run p07 +``` + +
+
+ +```bash +pixi run -e amd p07 +``` + +
+
+ +```bash +pixi run -e apple p07 +``` + +
+
+ +```bash +uv run poe p07 +``` + +
+
+ +ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค: + +```txt +out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0]) +expected: HostBuffer([10.0, 11.0, 12.0, ... , 34.0]) +``` + +## ์†”๋ฃจ์…˜ + +
+ + +```mojo +{{#include ../../../../../solutions/p07/p07.mojo:add_10_blocks_2d_solution}} +``` + +
+ +TileTensor๊ฐ€ 2D ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค: + +1. **2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ** + - ์ „์—ญ ํ–‰(row): `block_dim.y * block_idx.y + thread_idx.y` + - ์ „์—ญ ์—ด(col): `block_dim.x * block_idx.x + thread_idx.x` + - ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ๋ฅผ ํ…์„œ ์›์†Œ์— ๋งคํ•‘: + + ```txt + 3ร—3 ๋ธ”๋ก์œผ๋กœ ๊ตฌ์„ฑ๋œ 5ร—5 ํ…์„œ: + + Block (0,0) Block (1,0) + [(0,0) (0,1) (0,2)] [(0,3) (0,4) * ] + [(1,0) (1,1) (1,2)] [(1,3) (1,4) * ] + [(2,0) (2,1) (2,2)] [(2,3) (2,4) * ] + + Block (0,1) Block (1,1) + [(3,0) (3,1) (3,2)] [(3,3) (3,4) * ] + [(4,0) (4,1) (4,2)] [(4,3) (4,4) * ] + [ * * * ] [ * * * ] + ``` + + (* = ์Šค๋ ˆ๋“œ๋Š” ์กด์žฌํ•˜์ง€๋งŒ ํ…์„œ ๊ฒฝ๊ณ„ ๋ฐ–) -### [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./raw.md) +2. **TileTensor์˜ ์žฅ์ ** + - ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ: ์ˆ˜๋™ ์˜คํ”„์…‹ ๊ณ„์‚ฐ ๋Œ€์‹  `tensor[row, col]` ์‚ฌ์šฉ + - ์ž๋™ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™” + - ์ ‘๊ทผ ํŒจํ„ด ์˜ˆ์‹œ: -์ˆ˜๋™ ์ธ๋ฑ์‹ฑ์œผ๋กœ ์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์นœ ์—ฐ์‚ฐ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค. + ```txt + ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ: TileTensor: + row * size + col tensor[row, col] + (2,1) -> 11 (2,1) -> ๊ฐ™์€ ์›์†Œ + ``` -### [๐Ÿ“ LayoutTensor ๋ฒ„์ „](./layout_tensor.md) +3. **๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ** + - ๊ฐ€๋“œ `row < size and col < size`๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ์ƒํ™ฉ: + - ๋ถ€๋ถ„ ๋ธ”๋ก์—์„œ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋Š” ์Šค๋ ˆ๋“œ + - ํ…์„œ ๊ฒฝ๊ณ„์˜ ์—ฃ์ง€ ์ผ€์ด์Šค + - ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์€ TileTensor๊ฐ€ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ + - 25๊ฐœ ์›์†Œ๋ฅผ 36๊ฐœ ์Šค๋ ˆ๋“œ๋กœ ์ฒ˜๋ฆฌ (3ร—3 ๋ธ”๋ก์˜ 2ร—2 ๊ทธ๋ฆฌ๋“œ) -LayoutTensor ๊ธฐ๋Šฅ์„ ํ™œ์šฉํ•ด ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ๋ฅผ ๊น”๋”ํ•˜๊ฒŒ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. +4. 
**๋ธ”๋ก ์กฐ์œจ** + - ๊ฐ 3ร—3 ๋ธ”๋ก์ด 5ร—5 ํ…์„œ์˜ ์ผ๋ถ€๋ถ„์„ ๋‹ด๋‹น + - TileTensor๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ถ€๋ถ„: + - ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™” + - ํšจ์œจ์ ์ธ ์ ‘๊ทผ ํŒจํ„ด + - ๋ธ”๋ก ๊ฒฝ๊ณ„ ๊ฐ„ ์กฐ์œจ + - ์บ์‹œ ์นœํ™”์  ๋ฐ์ดํ„ฐ ์ ‘๊ทผ -๐Ÿ’ก **์ฐธ๊ณ **: LayoutTensor๊ฐ€ ๋ธ”๋ก ๊ฐ„ ์กฐ์œจ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์–ผ๋งˆ๋‚˜ ๋‹จ์ˆœํ™”ํ•˜๋Š”์ง€ ํ™•์ธํ•ด ๋ณด์„ธ์š”. +์ด ํŒจํ„ด์€ TileTensor๊ฐ€ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์Šค๋ ˆ๋“œ ์กฐ์œจ์„ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ 2D ๋ธ”๋ก ์ฒ˜๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. +
+
diff --git a/book/i18n/ko/src/puzzle_07/raw.md b/book/i18n/ko/src/puzzle_07/raw.md deleted file mode 100644 index 33ba22fc..00000000 --- a/book/i18n/ko/src/puzzle_07/raw.md +++ /dev/null @@ -1,157 +0,0 @@ - - -## ๊ฐœ์š” - -ํ–‰๋ ฌ `a`์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. - -**์ฐธ๊ณ :** _๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ `a`์˜ ํ–‰๊ณผ ์—ด ํฌ๊ธฐ๋ณด๋‹ค ๋ชจ๋‘ ์ž‘์Šต๋‹ˆ๋‹ค._ - -## ํ•ต์‹ฌ ๊ฐœ๋… - -์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: - -- 2D ๋ธ”๋ก๊ณผ ์Šค๋ ˆ๋“œ ๋ฐฐ์น˜ ๋‹ค๋ฃจ๊ธฐ -- ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ํ–‰๋ ฌ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌํ•˜๊ธฐ -- 2D ์ธ๋ฑ์Šค์™€ ์„ ํ˜• ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๊ฐ„ ๋ณ€ํ™˜ํ•˜๊ธฐ - -ํ•ต์‹ฌ์€ ํ•˜๋‚˜์˜ ๋ธ”๋ก๋ณด๋‹ค ํฐ 2D ํ–‰๋ ฌ์„ ์ฒ˜๋ฆฌํ•  ๋•Œ ์—ฌ๋Ÿฌ ๋ธ”๋ก์˜ ์Šค๋ ˆ๋“œ๋“ค์ด ์–ด๋–ป๊ฒŒ ํ•จ๊ป˜ ์ž‘๋™ํ•˜๋Š”์ง€ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. - -## ๊ตฌ์„ฑ - -- **ํ–‰๋ ฌ ํฌ๊ธฐ**: \\(5 \times 5\\) ์›์†Œ -- **2D ๋ธ”๋ก**: ๊ฐ ๋ธ”๋ก์ด \\(3 \times 3\\) ์˜์—ญ ์ฒ˜๋ฆฌ -- **๊ทธ๋ฆฌ๋“œ ๋ ˆ์ด์•„์›ƒ**: \\(2 \times 2\\) ๊ทธ๋ฆฌ๋“œ์— ๋ธ”๋ก ๋ฐฐ์น˜ -- **์ด ์Šค๋ ˆ๋“œ ์ˆ˜**: \\(25\\)๊ฐœ ์›์†Œ์— ๋Œ€ํ•ด \\(36\\)๊ฐœ -- **๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด**: 2D ๋ฐ์ดํ„ฐ๋ฅผ ํ–‰ ์šฐ์„ ์œผ๋กœ ์ €์žฅ -- **์ปค๋ฒ„๋ฆฌ์ง€**: ๋ชจ๋“  ํ–‰๋ ฌ ์›์†Œ๊ฐ€ ๋น ์ง์—†์ด ์ฒ˜๋ฆฌ๋˜๋„๋ก ๋ณด์žฅ - -## ์™„์„ฑํ•  ์ฝ”๋“œ - -```mojo -{{#include ../../../../../problems/p07/p07.mojo:add_10_blocks_2d}} -``` - -์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p07/p07.mojo - -
-ํŒ - -
- -1. ์ „์—ญ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ: `row = block_dim.y * block_idx.y + thread_idx.y`, `col = block_dim.x * block_idx.x + thread_idx.x` -2. ๊ฐ€๋“œ ์ถ”๊ฐ€: `if row < size and col < size` -3. ๊ฐ€๋“œ ๋‚ด๋ถ€: ํ–‰ ์šฐ์„  ๋ฐฉ์‹์œผ๋กœ 10์„ ๋”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”! - -
-
- -## ์ฝ”๋“œ ์‹คํ–‰ - -์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: - -
-
- - - - -
-
- -```bash -pixi run p07 -``` - -
-
- -```bash -pixi run -e amd p07 -``` - -
-
- -```bash -pixi run -e apple p07 -``` - -
-
- -```bash -uv run poe p07 -``` - -
-
- -ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0]) -expected: HostBuffer([10.0, 11.0, 12.0, ... , 34.0]) -``` - -## ์†”๋ฃจ์…˜ - -
- - -```mojo -{{#include ../../../../../solutions/p07/p07.mojo:add_10_blocks_2d_solution}} -``` - -
- -์›์‹œ ๋ฉ”๋ชจ๋ฆฌ๋กœ 2D ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ๋ฅผ ๊ตฌํ˜„ํ•  ๋•Œ์˜ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค: - -1. **2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ** - - ์ „์—ญ ํ–‰(row): `block_dim.y * block_idx.y + thread_idx.y` - - ์ „์—ญ ์—ด(col): `block_dim.x * block_idx.x + thread_idx.x` - - ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ๋ฅผ ํ–‰๋ ฌ ์›์†Œ์— ๋งคํ•‘: - - ```txt - 3ร—3 ๋ธ”๋ก์œผ๋กœ ๊ตฌ์„ฑ๋œ 5ร—5 ํ–‰๋ ฌ: - - Block (0,0) Block (1,0) - [(0,0) (0,1) (0,2)] [(0,3) (0,4) * ] - [(1,0) (1,1) (1,2)] [(1,3) (1,4) * ] - [(2,0) (2,1) (2,2)] [(2,3) (2,4) * ] - - Block (0,1) Block (1,1) - [(3,0) (3,1) (3,2)] [(3,3) (3,4) * ] - [(4,0) (4,1) (4,2)] [(4,3) (4,4) * ] - [ * * * ] [ * * * ] - ``` - - (* = ์Šค๋ ˆ๋“œ๋Š” ์กด์žฌํ•˜์ง€๋งŒ ํ–‰๋ ฌ ๊ฒฝ๊ณ„ ๋ฐ–) - -2. **๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ** - - ํ–‰ ์šฐ์„  ์„ ํ˜• ๋ฉ”๋ชจ๋ฆฌ: `index = row * size + col` - - 5ร—5 ํ–‰๋ ฌ ์˜ˆ์‹œ: - - ```txt - 2D ์ธ๋ฑ์Šค: ์„ ํ˜• ๋ฉ”๋ชจ๋ฆฌ: - (2,1) -> 11 [00 01 02 03 04] - [05 06 07 08 09] - [10 11 12 13 14] - [15 16 17 18 19] - [20 21 22 23 24] - ``` - -3. **๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ** - - ๊ฐ€๋“œ `row < size and col < size`๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒฝ์šฐ: - - ๋ถ€๋ถ„ ๋ธ”๋ก์—์„œ ๋‚จ๋Š” ์Šค๋ ˆ๋“œ - - ํ–‰๋ ฌ ๊ฒฝ๊ณ„์˜ ์—ฃ์ง€ ์ผ€์ด์Šค - - 3ร—3 ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ 2ร—2 ๊ทธ๋ฆฌ๋“œ = 25๊ฐœ ์›์†Œ์— 36๊ฐœ ์Šค๋ ˆ๋“œ - -4. **๋ธ”๋ก ์กฐ์œจ** - - ๊ฐ 3ร—3 ๋ธ”๋ก์ด 5ร—5 ํ–‰๋ ฌ์˜ ์ผ๋ถ€๋ถ„์„ ๋‹ด๋‹น - - 2ร—2 ๋ธ”๋ก ๊ทธ๋ฆฌ๋“œ๋กœ ์ „์ฒด๋ฅผ ๋น ์ง์—†์ด ์ปค๋ฒ„ - - ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ๊ฒน์น˜๋Š” ์Šค๋ ˆ๋“œ ์ฒ˜๋ฆฌ - - ๋ธ”๋ก๋“ค์ด ํ•จ๊ป˜ ํšจ์œจ์ ์œผ๋กœ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ - -์ด ํŒจํ„ด์€ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ 2D ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃฐ ๋•Œ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๊ณผ ์Šค๋ ˆ๋“œ ์กฐ์œจ์„ ์–ด๋–ป๊ฒŒ ์œ ์ง€ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. -
-
diff --git a/book/i18n/ko/src/puzzle_08/layout_tensor.md b/book/i18n/ko/src/puzzle_08/layout_tensor.md deleted file mode 100644 index c05cdb4a..00000000 --- a/book/i18n/ko/src/puzzle_08/layout_tensor.md +++ /dev/null @@ -1,193 +0,0 @@ - - -## ๊ฐœ์š” - -1D LayoutTensor `a`์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 1D LayoutTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. - -**์ฐธ๊ณ :** _๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ `a`์˜ ํฌ๊ธฐ๋ณด๋‹ค ์ž‘์Šต๋‹ˆ๋‹ค._ - -## ํ•ต์‹ฌ ๊ฐœ๋… - -์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: - -- address_space๋ฅผ ํ™œ์šฉํ•œ LayoutTensor์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ๋Šฅ -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ์˜ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” -- LayoutTensor๋กœ ๋ธ”๋ก ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ๊ด€๋ฆฌํ•˜๊ธฐ - -ํ•ต์‹ฌ์€ LayoutTensor๊ฐ€ ๋ธ”๋ก ๋กœ์ปฌ ์ €์žฅ์†Œ์˜ ์„ฑ๋Šฅ์€ ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. - -## ๊ตฌ์„ฑ - -- ๋ฐฐ์—ด ํฌ๊ธฐ: `SIZE = 8` ์›์†Œ -- ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: `TPB = 4` -- ๋ธ”๋ก ์ˆ˜: 2 -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น `TPB`๊ฐœ ์›์†Œ - -## ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹๊ณผ์˜ ์ฃผ์š” ์ฐจ์ด์  - -1. **๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น**: [stack_allocation](https://docs.modular.com/mojo/std/memory/memory/stack_allocation/) ๋Œ€์‹  address_space๋ฅผ ์‚ฌ์šฉํ•œ [LayoutTensor](https://docs.modular.com/mojo/stdlib/layout/layout_tensor/LayoutTensor) ์‚ฌ์šฉ - - ```mojo - # ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹ - shared = stack_allocation[TPB, Scalar[dtype]]() - - # LayoutTensor ๋ฐฉ์‹ - shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() - ``` - -2. **๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ**: ๋™์ผํ•œ ๋ฌธ๋ฒ• - - ```mojo - # ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹ - shared[local_i] = a[global_i] - - # LayoutTensor ๋ฐฉ์‹ - shared[local_i] = a[global_i] - ``` - -3. 
**์•ˆ์ „ ๊ธฐ๋Šฅ**: - - - ํƒ€์ž… ์•ˆ์ „์„ฑ - - ๋ ˆ์ด์•„์›ƒ ๊ด€๋ฆฌ - - ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ ์ฒ˜๋ฆฌ - -> **์ฐธ๊ณ **: LayoutTensor๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์„ ์ฒ˜๋ฆฌํ•˜์ง€๋งŒ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ ์‹œ `barrier()`๋ฅผ ํ†ตํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๋Š” ์—ฌ์ „ํžˆ ์ง์ ‘ ๊ด€๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. - -**ํ•™์Šต ์ฐธ๊ณ **: ์ด ํผ์ฆ์—์„œ๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์—๋งŒ ์ ‘๊ทผํ•˜๋ฏ€๋กœ `barrier()`๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋” ๋ณต์žกํ•œ ์ƒํ™ฉ์—์„œ ํ•„์š”ํ•œ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. - -## ์™„์„ฑํ•  ์ฝ”๋“œ - -```mojo -{{#include ../../../../../problems/p08/p08_layout_tensor.mojo:add_10_shared_layout_tensor}} -``` - -์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p08/p08_layout_tensor.mojo - -
-ํŒ - -
- -1. address_space ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ LayoutTensor ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ -2. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: `shared[local_i] = a[global_i]` -3. `barrier()`๋กœ ๋™๊ธฐํ™” (ํ•™์Šต์šฉ - ์—ฌ๊ธฐ์„œ๋Š” ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Œ) -4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์Šค๋กœ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ -5. ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ์„ ๋ฐฉ์ง€ํ•˜๋Š” ๊ฐ€๋“œ - -
-
- -## ์ฝ”๋“œ ์‹คํ–‰ - -์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: - -
-
- - - - -
-
- -```bash -pixi run p08_layout_tensor -``` - -
-
- -```bash -pixi run -e amd p08_layout_tensor -``` - -
-
- -```bash -pixi run -e apple p08_layout_tensor -``` - -
-
- -```bash -uv run poe p08_layout_tensor -``` - -
-
- -ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0]) -``` - -## ์†”๋ฃจ์…˜ - -
- - -```mojo -{{#include ../../../../../solutions/p08/p08_layout_tensor.mojo:add_10_shared_layout_tensor_solution}} -``` - -
- -LayoutTensor๊ฐ€ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์„ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค: - -1. **LayoutTensor๋ฅผ ์‚ฌ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ** - - ์ „์—ญ ํ…์„œ: `a`์™€ `output` (๋А๋ฆผ, ๋ชจ๋“  ๋ธ”๋ก์—์„œ ๋ณด์ž„) - - ๊ณต์œ  ํ…์„œ: `shared` (๋น ๋ฆ„, ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋กœ์ปฌ) - - ๋ธ”๋ก๋‹น 4๊ฐœ ์Šค๋ ˆ๋“œ๋กœ 8๊ฐœ ์›์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์˜ˆ์‹œ: - - ```txt - ์ „์—ญ ํ…์„œ a: [1 1 1 1 | 1 1 1 1] # ์ž…๋ ฅ: ๋ชจ๋‘ 1 - - Block (0): Block (1): - shared[0..3] shared[0..3] - [1 1 1 1] [1 1 1 1] - ``` - -2. **์Šค๋ ˆ๋“œ ์กฐ์œจ** - - ๋กœ๋“œ ๋‹จ๊ณ„ (์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ): - - ```txt - Thread 0: shared[0] = a[0]=1 Thread 2: shared[2] = a[2]=1 - Thread 1: shared[1] = a[1]=1 Thread 3: shared[3] = a[3]=1 - barrier() โ†“ โ†“ โ†“ โ†“ # ๋ชจ๋“  ๋กœ๋“œ ์™„๋ฃŒ ๋Œ€๊ธฐ - ``` - - - ์ฒ˜๋ฆฌ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ํ…์„œ ๊ฐ’์— 10์„ ๋”ํ•จ - - ๊ฒฐ๊ณผ: `output[global_i] = shared[local_i] + 10 = 11` - - **์ฐธ๊ณ **: ์ด ๊ฒฝ์šฐ์—๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜(`shared[local_i]`)์—๋งŒ ์“ฐ๊ณ  ์ฝ์œผ๋ฏ€๋กœ `barrier()`๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•˜๋Š” ์ƒํ™ฉ์—์„œ ํ•„์ˆ˜์ ์ธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. - -3. **LayoutTensor์˜ ์žฅ์ ** - - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: - - ```txt - # address_space๋ฅผ ์‚ฌ์šฉํ•œ ๊น”๋”ํ•œ LayoutTensor API - shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() - ``` - - - ์ „์—ญ๊ณผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ชจ๋‘ ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ: - - ```txt - Block 0 ์ถœ๋ ฅ: [11 11 11 11] - Block 1 ์ถœ๋ ฅ: [11 11 11 11] - ``` - - - ๋‚ด์žฅ๋œ ๋ ˆ์ด์•„์›ƒ ๊ด€๋ฆฌ์™€ ํƒ€์ž… ์•ˆ์ „์„ฑ - -4. 
**๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด** - - ๋กœ๋“œ: ์ „์—ญ ํ…์„œ โ†’ ๊ณต์œ  ํ…์„œ (์ตœ์ ํ™”๋จ) - - ๋™๊ธฐํ™”: ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „๊ณผ ๋™์ผํ•œ `barrier()` ํ•„์š” - - ์ฒ˜๋ฆฌ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐ’์— 10 ๋”ํ•˜๊ธฐ - - ์ €์žฅ: ๊ฒฐ๊ณผ(11)๋ฅผ ์ „์—ญ ํ…์„œ์— ์“ฐ๊ธฐ - -์ด ํŒจํ„ด์€ LayoutTensor๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์„ฑ๋Šฅ ์ด์ ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๋” ํŽธ๋ฆฌํ•œ API์™€ ๋‚ด์žฅ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. -
-
diff --git a/book/i18n/ko/src/puzzle_08/puzzle_08.md b/book/i18n/ko/src/puzzle_08/puzzle_08.md index caccb88c..0137735d 100644 --- a/book/i18n/ko/src/puzzle_08/puzzle_08.md +++ b/book/i18n/ko/src/puzzle_08/puzzle_08.md @@ -4,21 +4,167 @@ ## ๊ฐœ์š” -๋ฒกํ„ฐ `a`์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด ๋ฒกํ„ฐ `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. +1D TileTensor `a`์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 1D TileTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. **์ฐธ๊ณ :** _๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ `a`์˜ ํฌ๊ธฐ๋ณด๋‹ค ์ž‘์Šต๋‹ˆ๋‹ค._ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ๊ฐํ™” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ๊ฐํ™” -## ๊ตฌํ˜„ ๋ฐฉ์‹ +## ํ•ต์‹ฌ ๊ฐœ๋… -### [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./raw.md) +์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: -๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์™€ ๋™๊ธฐํ™”๋ฅผ ์ˆ˜๋™์œผ๋กœ ๊ด€๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค. +- address_space๋ฅผ ํ™œ์šฉํ•œ TileTensor์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ๋Šฅ +- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ์˜ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” +- TileTensor๋กœ ๋ธ”๋ก ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ๊ด€๋ฆฌํ•˜๊ธฐ -### [๐Ÿ“ LayoutTensor ๋ฒ„์ „](./layout_tensor.md) +ํ•ต์‹ฌ์€ TileTensor๊ฐ€ ๋ธ”๋ก ๋กœ์ปฌ ์ €์žฅ์†Œ์˜ ์„ฑ๋Šฅ์€ ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. -LayoutTensor์— ๋‚ด์žฅ๋œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ๊ธฐ๋Šฅ์„ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค. +## ๊ตฌ์„ฑ -๐Ÿ’ก **์ฐธ๊ณ **: LayoutTensor๊ฐ€ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๊ฒฝํ—˜ํ•ด ๋ณด์„ธ์š”. +- ๋ฐฐ์—ด ํฌ๊ธฐ: `SIZE = 8` ์›์†Œ +- ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: `TPB = 4` +- ๋ธ”๋ก ์ˆ˜: 2 +- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น `TPB`๊ฐœ ์›์†Œ + +> **๊ฒฝ๊ณ **: ๊ฐ ๋ธ”๋ก์—๋Š” ํ•ด๋‹น ๋ธ”๋ก์˜ ์Šค๋ ˆ๋“œ๋“ค์ด ์ฝ๊ณ  ์“ธ ์ˆ˜ ์žˆ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์–‘์ด _์ƒ์ˆ˜_๋กœ ๊ณ ์ •๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ’์€ ํŒŒ์ด์ฌ ๋ฆฌํ„ฐ๋Ÿด ์ƒ์ˆ˜์—ฌ์•ผ ํ•˜๋ฉฐ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์“ด ํ›„์—๋Š” [barrier](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/)๋ฅผ ํ˜ธ์ถœํ•ด ์Šค๋ ˆ๋“œ๋“ค์ด ๊ต์ฐจํ•˜์ง€ ์•Š๋„๋ก ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. 
+ +**ํ•™์Šต ์ฐธ๊ณ **: ์ด ํผ์ฆ์—์„œ๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์—๋งŒ ์ ‘๊ทผํ•˜๋ฏ€๋กœ `barrier()`๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋” ๋ณต์žกํ•œ ์ƒํ™ฉ์—์„œ ํ•„์š”ํ•œ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. + +## ์™„์„ฑํ•  ์ฝ”๋“œ + +```mojo +{{#include ../../../../../problems/p08/p08.mojo:add_10_shared}} +``` + +์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p08/p08.mojo + +
+ํŒ + +
+ +1. address_space ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ TileTensor ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ +2. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: `shared[local_i] = a[global_i]` +3. `barrier()`๋กœ ๋™๊ธฐํ™” (ํ•™์Šต์šฉ - ์—ฌ๊ธฐ์„œ๋Š” ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Œ) +4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์Šค๋กœ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ +5. ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ์„ ๋ฐฉ์ง€ํ•˜๋Š” ๊ฐ€๋“œ + +
+
+ +## ์ฝ”๋“œ ์‹คํ–‰ + +์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: + +
+
+ + + + +
+
+ +```bash +pixi run p08 +``` + +
+
+ +```bash +pixi run -e amd p08 +``` + +
+
+ +```bash +pixi run -e apple p08 +``` + +
+
+ +```bash +uv run poe p08 +``` + +
+
+ +ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค: + +```txt +out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) +expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0]) +``` + +## ์†”๋ฃจ์…˜ + +
+ + +```mojo +{{#include ../../../../../solutions/p08/p08.mojo:add_10_shared_solution}} +``` + +
+ +TileTensor๊ฐ€ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์„ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค: + +1. **TileTensor๋ฅผ ์‚ฌ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ** + - ์ „์—ญ ํ…์„œ: `a`์™€ `output` (๋А๋ฆผ, ๋ชจ๋“  ๋ธ”๋ก์—์„œ ๋ณด์ž„) + - ๊ณต์œ  ํ…์„œ: `shared` (๋น ๋ฆ„, ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋กœ์ปฌ) + - ๋ธ”๋ก๋‹น 4๊ฐœ ์Šค๋ ˆ๋“œ๋กœ 8๊ฐœ ์›์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์˜ˆ์‹œ: + + ```txt + ์ „์—ญ ํ…์„œ a: [1 1 1 1 | 1 1 1 1] # ์ž…๋ ฅ: ๋ชจ๋‘ 1 + + Block (0): Block (1): + shared[0..3] shared[0..3] + [1 1 1 1] [1 1 1 1] + ``` + +2. **์Šค๋ ˆ๋“œ ์กฐ์œจ** + - ๋กœ๋“œ ๋‹จ๊ณ„ (์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ): + + ```txt + Thread 0: shared[0] = a[0]=1 Thread 2: shared[2] = a[2]=1 + Thread 1: shared[1] = a[1]=1 Thread 3: shared[3] = a[3]=1 + barrier() โ†“ โ†“ โ†“ โ†“ # ๋ชจ๋“  ๋กœ๋“œ ์™„๋ฃŒ ๋Œ€๊ธฐ + ``` + + - ์ฒ˜๋ฆฌ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ํ…์„œ ๊ฐ’์— 10์„ ๋”ํ•จ + - ๊ฒฐ๊ณผ: `output[global_i] = shared[local_i] + 10 = 11` + + **์ฐธ๊ณ **: ์ด ๊ฒฝ์šฐ์—๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜(`shared[local_i]`)์—๋งŒ ์“ฐ๊ณ  ์ฝ์œผ๋ฏ€๋กœ `barrier()`๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•˜๋Š” ์ƒํ™ฉ์—์„œ ํ•„์ˆ˜์ ์ธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. + +3. **TileTensor์˜ ์žฅ์ ** + - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: + + ```txt + # address_space๋ฅผ ์‚ฌ์šฉํ•œ ๊น”๋”ํ•œ TileTensor API + shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]()) + ``` + + - ์ „์—ญ๊ณผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ชจ๋‘ ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ: + + ```txt + Block 0 ์ถœ๋ ฅ: [11 11 11 11] + Block 1 ์ถœ๋ ฅ: [11 11 11 11] + ``` + + - ๋‚ด์žฅ๋œ ๋ ˆ์ด์•„์›ƒ ๊ด€๋ฆฌ์™€ ํƒ€์ž… ์•ˆ์ „์„ฑ + +4. 
**๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด** + - ๋กœ๋“œ: ์ „์—ญ ํ…์„œ โ†’ ๊ณต์œ  ํ…์„œ (์ตœ์ ํ™”๋จ) + - ๋™๊ธฐํ™”: ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „๊ณผ ๋™์ผํ•œ `barrier()` ํ•„์š” + - ์ฒ˜๋ฆฌ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐ’์— 10 ๋”ํ•˜๊ธฐ + - ์ €์žฅ: ๊ฒฐ๊ณผ(11)๋ฅผ ์ „์—ญ ํ…์„œ์— ์“ฐ๊ธฐ + +์ด ํŒจํ„ด์€ TileTensor๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์„ฑ๋Šฅ ์ด์ ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๋” ํŽธ๋ฆฌํ•œ API์™€ ๋‚ด์žฅ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. +
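
์ด ๋กœ๋“œ → ๋™๊ธฐํ™” → ์ฒ˜๋ฆฌ → ์ €์žฅ ํ๋ฆ„์€ ์ˆœ์ฐจ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์œผ๋กœ๋„ ๋”ฐ๋ผ๊ฐ€ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” GPU ๋ณ‘๋ ฌ์„ฑ์ด ์—†๋Š” Python ์Šค์ผ€์น˜์ด๋ฉฐ(`SIZE = 8`, `TPB = 4`๋Š” ๋ณธ๋ฌธ ๊ตฌ์„ฑ ๊ฐ’), ์‹ค์ œ ์ปค๋„์—์„œ `barrier()`๊ฐ€ ๋†“์ด๋Š” ์ง€์ ์€ ์ฃผ์„์œผ๋กœ๋งŒ ํ‘œ์‹œํ–ˆ์Šต๋‹ˆ๋‹ค:

```python
SIZE, TPB = 8, 4
a = [1.0] * SIZE          # ์ž…๋ ฅ: ๋ชจ๋‘ 1
output = [0.0] * SIZE

for block_idx in range(SIZE // TPB):      # 2๊ฐœ ๋ธ”๋ก
    shared = [0.0] * TPB                  # ๋ธ”๋ก ๋กœ์ปฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ
    for local_i in range(TPB):            # ๋กœ๋“œ ๋‹จ๊ณ„: ์ „์—ญ -> ๊ณต์œ 
        global_i = TPB * block_idx + local_i
        if global_i < SIZE:               # ๊ฐ€๋“œ
            shared[local_i] = a[global_i]
    # barrier(): ์‹ค์ œ GPU์—์„œ๋Š” ์—ฌ๊ธฐ์„œ ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ๋กœ๋“œ ์™„๋ฃŒ๋ฅผ ๋Œ€๊ธฐ
    for local_i in range(TPB):            # ์ฒ˜๋ฆฌ/์ €์žฅ ๋‹จ๊ณ„: ๊ณต์œ  -> ์ „์—ญ
        global_i = TPB * block_idx + local_i
        if global_i < SIZE:
            output[global_i] = shared[local_i] + 10.0

print(output)  # [11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0]
```

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ `shared`๊ฐ€ ๋ธ”๋ก๋งˆ๋‹ค ์ƒˆ๋กœ ๋งŒ๋“ค์–ด์ง„๋‹ค๋Š” ์ (๋ธ”๋ก ๋กœ์ปฌ ๋ฒ”์œ„)๋„ ์ด ์Šค์ผ€์น˜์˜ ๋ฐ˜๋ณต๋ฌธ ๊ตฌ์กฐ์— ๊ทธ๋Œ€๋กœ ๋“œ๋Ÿฌ๋‚ฉ๋‹ˆ๋‹ค.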
+
diff --git a/book/i18n/ko/src/puzzle_08/raw.md b/book/i18n/ko/src/puzzle_08/raw.md deleted file mode 100644 index 18cf77f1..00000000 --- a/book/i18n/ko/src/puzzle_08/raw.md +++ /dev/null @@ -1,168 +0,0 @@ - - -## ๊ฐœ์š” - -๋ฒกํ„ฐ `a`์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”. - -**์ฐธ๊ณ :** _๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ `a`์˜ ํฌ๊ธฐ๋ณด๋‹ค ์ž‘์Šต๋‹ˆ๋‹ค._ - -## ํ•ต์‹ฌ ๊ฐœ๋… - -์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: - -- ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋‚ด์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉํ•˜๊ธฐ -- ๋ฐฐ๋ฆฌ์–ด(barrier)๋กœ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”ํ•˜๊ธฐ -- ๋ธ”๋ก ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ์ €์žฅ์†Œ ๊ด€๋ฆฌํ•˜๊ธฐ - -ํ•ต์‹ฌ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๋Š” ๋น ๋ฅธ ๋กœ์ปฌ ์ €์žฅ์†Œ๋ผ๋Š” ์ , ๊ทธ๋ฆฌ๊ณ  ์ด๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ ์Šค๋ ˆ๋“œ ๊ฐ„ ์กฐ์œจ์ด ํ•„์š”ํ•˜๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. - -## ๊ตฌ์„ฑ - -- ๋ฐฐ์—ด ํฌ๊ธฐ: `SIZE = 8` ์›์†Œ -- ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: `TPB = 4` -- ๋ธ”๋ก ์ˆ˜: 2 -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น `TPB`๊ฐœ ์›์†Œ - -์ฐธ๊ณ : - -- **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ**: ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ๋“ค์ด ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋Š” ๋น ๋ฅธ ์ €์žฅ์†Œ -- **์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”**: `barrier()`๋ฅผ ์‚ฌ์šฉํ•œ ์กฐ์œจ -- **๋ฉ”๋ชจ๋ฆฌ ๋ฒ”์œ„**: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋Š” ๋ธ”๋ก ๋‚ด์—์„œ๋งŒ ๋ณด์ž„ -- **์ ‘๊ทผ ํŒจํ„ด**: ๋กœ์ปฌ ์ธ๋ฑ์Šค vs ์ „์—ญ ์ธ๋ฑ์Šค - -> **์ฃผ์˜**: ๊ฐ ๋ธ”๋ก์ด ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํฌ๊ธฐ๋Š” _์ƒ์ˆ˜_ ๋กœ ์ •ํ•ด์ ธ์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ฐ’์€ ๋ณ€์ˆ˜๊ฐ€ ์•„๋‹Œ ๋ฆฌํ„ฐ๋Ÿด Python ์ƒ์ˆ˜์—ฌ์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์“ด ํ›„์—๋Š” [barrier](https://docs.modular.com/mojo/stdlib/gpu/sync/barrier/)๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ ์•ž์„œ๊ฐ€์ง€ ์•Š๋„๋ก ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. - -**ํ•™์Šต ์ฐธ๊ณ **: ์ด ํผ์ฆ์—์„œ๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์—๋งŒ ์ ‘๊ทผํ•˜๋ฏ€๋กœ `barrier()`๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋” ๋ณต์žกํ•œ ์ƒํ™ฉ์—์„œ ํ•„์š”ํ•œ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. 
- -## ์™„์„ฑํ•  ์ฝ”๋“œ - -```mojo -{{#include ../../../../../problems/p08/p08.mojo:add_10_shared}} -``` - -์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p08/p08.mojo - -
-ํŒ - -
- -1. `barrier()`๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ ์™„๋ฃŒ ๋Œ€๊ธฐ (ํ•™์Šต์šฉ - ์—ฌ๊ธฐ์„œ๋Š” ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Œ) -2. `local_i`๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: `shared[local_i]` -3. `global_i`๋กœ ์ถœ๋ ฅ: `output[global_i]` -4. ๊ฐ€๋“œ ์ถ”๊ฐ€: `if global_i < size` - -
-
- -## ์ฝ”๋“œ ์‹คํ–‰ - -์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: - -
-
- - - - -
-
- -```bash -pixi run p08 -``` - -
-
- -```bash -pixi run -e amd p08 -``` - -
-
- -```bash -pixi run -e apple p08 -``` - -
-
- -```bash -uv run poe p08 -``` - -
-
- -ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0]) -``` - -## ์†”๋ฃจ์…˜ - -
- - -```mojo -{{#include ../../../../../solutions/p08/p08.mojo:add_10_shared_solution}} -``` - -
- -GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์˜ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค: - -1. **๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ** - - ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ: `a`์™€ `output` ๋ฐฐ์—ด (๋А๋ฆผ, ๋ชจ๋“  ๋ธ”๋ก์—์„œ ๋ณด์ž„) - - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: `shared` ๋ฐฐ์—ด (๋น ๋ฆ„, ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋กœ์ปฌ) - - ๋ธ”๋ก๋‹น 4๊ฐœ ์Šค๋ ˆ๋“œ๋กœ 8๊ฐœ ์›์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์˜ˆ์‹œ: - - ```txt - ์ „์—ญ ๋ฐฐ์—ด a: [1 1 1 1 | 1 1 1 1] # ์ž…๋ ฅ: ๋ชจ๋‘ 1 - - Block (0): Block (1): - shared[0..3] shared[0..3] - [1 1 1 1] [1 1 1 1] - ``` - -2. **์Šค๋ ˆ๋“œ ์กฐ์œจ** - - ๋กœ๋“œ ๋‹จ๊ณ„: - - ```txt - Thread 0: shared[0] = a[0]=1 Thread 2: shared[2] = a[2]=1 - Thread 1: shared[1] = a[1]=1 Thread 3: shared[3] = a[3]=1 - barrier() โ†“ โ†“ โ†“ โ†“ # ๋ชจ๋“  ๋กœ๋“œ ์™„๋ฃŒ ๋Œ€๊ธฐ - ``` - - - ์ฒ˜๋ฆฌ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐ’์— 10์„ ๋”ํ•จ - - ๊ฒฐ๊ณผ: `output[i] = shared[local_i] + 10 = 11` - - **์ฐธ๊ณ **: ์ด ๊ฒฝ์šฐ์—๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜(`shared[local_i]`)์—๋งŒ ์“ฐ๊ณ  ์ฝ์œผ๋ฏ€๋กœ `barrier()`๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•˜๋Š” ์ƒํ™ฉ์—์„œ ํ•„์ˆ˜์ ์ธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. - -3. **์ธ๋ฑ์Šค ๋งคํ•‘** - - ์ „์—ญ ์ธ๋ฑ์Šค: `block_dim.x * block_idx.x + thread_idx.x` - - ```txt - Block 0 ์ถœ๋ ฅ: [11 11 11 11] - Block 1 ์ถœ๋ ฅ: [11 11 11 11] - ``` - - - ๋กœ์ปฌ ์ธ๋ฑ์Šค: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์— `thread_idx.x` ์‚ฌ์šฉ - - ```txt - ๋‘ ๋ธ”๋ก ๋ชจ๋‘ ์ฒ˜๋ฆฌ: 1 + 10 = 11 - ``` - -4. **๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด** - - ๋กœ๋“œ: ์ „์—ญ โ†’ ๊ณต์œ  (๋ณ‘ํ•ฉ ์ฝ๊ธฐ๋กœ 1 ๊ฐ’๋“ค ๋กœ๋“œ) - - ๋™๊ธฐํ™”: `barrier()`๋กœ ๋ชจ๋“  ๋กœ๋“œ ์™„๋ฃŒ ๋ณด์žฅ - - ์ฒ˜๋ฆฌ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐ’์— 10 ๋”ํ•˜๊ธฐ - - ์ €์žฅ: ๊ฒฐ๊ณผ(11)๋ฅผ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ์“ฐ๊ธฐ - -์ด ํŒจํ„ด์€ ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ ์กฐ์œจ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ์„ ์ตœ์ ํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. -
-
diff --git a/book/i18n/ko/src/puzzle_09/puzzle_09.md b/book/i18n/ko/src/puzzle_09/puzzle_09.md index 40ade8b5..7a8610e3 100644 --- a/book/i18n/ko/src/puzzle_09/puzzle_09.md +++ b/book/i18n/ko/src/puzzle_09/puzzle_09.md @@ -80,7 +80,7 @@ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ˜„์‹ค์ด ๋ฐ”๋กœ ์ด๊ฒƒ์ž…๋‹ˆ๋‹ค. **์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ **๋กœ์ง ๋ฒ„๊ทธ ์กฐ์‚ฌ** - ๊ฒฐ๊ณผ๊ฐ€ ํ‹€๋ฆฐ ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น… -- LayoutTensor ๊ธฐ๋ฐ˜์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์กฐ์‚ฌ +- TileTensor ๊ธฐ๋ฐ˜์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์กฐ์‚ฌ - ์ตœ์ ํ™”๋กœ ๋ณ€์ˆ˜๊ฐ€ ์‚ฌ๋ผ์กŒ์„ ๋•Œ ์‹คํ–‰ ํ๋ฆ„ ๋ถ„์„ํ•˜๊ธฐ - ๋ฐ˜๋ณต๋ฌธ ๊ฒฝ๊ณ„์™€ ๋ฐ˜๋ณต ํšŸ์ˆ˜ ๋ถ„์„ํ•˜๊ธฐ - ํ‹€๋ฆฐ ๊ฒฐ๊ณผ์—์„œ ํŒจํ„ด ์ฐพ์•„๋‚ด๊ธฐ diff --git a/book/i18n/ko/src/puzzle_09/second_case.md b/book/i18n/ko/src/puzzle_09/second_case.md index 75dcf48e..595a8537 100644 --- a/book/i18n/ko/src/puzzle_09/second_case.md +++ b/book/i18n/ko/src/puzzle_09/second_case.md @@ -11,7 +11,7 @@ - **[์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€](./first_case.md)**: ๋ช…ํ™•ํ•œ ํฌ๋ž˜์‹œ ์‹ ํ˜ธ(`CUDA_ERROR_ILLEGAL_ADDRESS`)๊ฐ€ ์กฐ์‚ฌ๋ฅผ ์•ˆ๋‚ดํ•จ - **๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€**: ํฌ๋ž˜์‹œ๋„ ์—†๊ณ  ์—๋Ÿฌ ๋ฉ”์‹œ์ง€๋„ ์—†์Œ - ํƒ์ •์ฒ˜๋Ÿผ ํŒŒํ—ค์ณ์•ผ ํ•˜๋Š” ๋ฏธ๋ฌ˜ํ•˜๊ฒŒ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋งŒ ์žˆ์Œ -์ด๋ฒˆ ์ค‘๊ธ‰ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ๋Š” `LayoutTensor` ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” **์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜**๋ฅผ ์กฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ํ”„๋กœ๊ทธ๋žจ์€ ์„ฑ๊ณต์ ์œผ๋กœ ์‹คํ–‰๋˜์ง€๋งŒ ์ž˜๋ชป๋œ ์ถœ๋ ฅ์„ ๋‚ด๋Š”๋ฐ, ์‹ค์ œ ๊ฐœ๋ฐœ์—์„œ ํ›จ์”ฌ ํ”ํ•˜๋ฉด์„œ๋„ ๊นŒ๋‹ค๋กœ์šด ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค์ž…๋‹ˆ๋‹ค. +์ด๋ฒˆ ์ค‘๊ธ‰ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ๋Š” `TileTensor` ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” **์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜**๋ฅผ ์กฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ํ”„๋กœ๊ทธ๋žจ์€ ์„ฑ๊ณต์ ์œผ๋กœ ์‹คํ–‰๋˜์ง€๋งŒ ์ž˜๋ชป๋œ ์ถœ๋ ฅ์„ ๋‚ด๋Š”๋ฐ, ์‹ค์ œ ๊ฐœ๋ฐœ์—์„œ ํ›จ์”ฌ ํ”ํ•˜๋ฉด์„œ๋„ ๊นŒ๋‹ค๋กœ์šด ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค์ž…๋‹ˆ๋‹ค. **์‚ฌ์ „ ์ค€๋น„**: [Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ](./essentials.md)๊ณผ [ํƒ์ • ์ˆ˜์‚ฌ: ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€](./first_case.md)๋ฅผ ๋จผ์ € ์™„๋ฃŒํ•ด์„œ CUDA-GDB ์›Œํฌํ”Œ๋กœ์šฐ์™€ ์ฒด๊ณ„์ ์ธ ๋””๋ฒ„๊น… ๊ธฐ๋ฒ•์„ ์ตํ˜€๋‘์„ธ์š”. 
์•„๋ž˜ ๋ช…๋ น์„ ์‹คํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”: @@ -23,7 +23,7 @@ pixi run -e nvidia setup-cuda-gdb ์ด๋ฒˆ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: -- **LayoutTensor ๋””๋ฒ„๊น…**: ๊ตฌ์กฐํ™”๋œ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ ํŒจํ„ด ์กฐ์‚ฌํ•˜๊ธฐ +- **TileTensor ๋””๋ฒ„๊น…**: ๊ตฌ์กฐํ™”๋œ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ ํŒจํ„ด ์กฐ์‚ฌํ•˜๊ธฐ - **๋กœ์ง ๋ฒ„๊ทธ ํƒ์ง€**: ํฌ๋ž˜์‹œํ•˜์ง€ ์•Š๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์ฐพ๊ธฐ - **๋ฐ˜๋ณต๋ฌธ ๊ฒฝ๊ณ„ ๋ถ„์„**: ๋ฐ˜๋ณต ํšŸ์ˆ˜ ๋ฌธ์ œ ์ดํ•ดํ•˜๊ธฐ - **๊ฒฐ๊ณผ ํŒจํ„ด ๋ถ„์„**: ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ๋กœ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ๊ฑฐ์Šฌ๋Ÿฌ ์˜ฌ๋ผ๊ฐ€๊ธฐ @@ -158,14 +158,14 @@ Each position should sum its neighbors: [left + center + right] CUDA thread hit application kernel entry function breakpoint, p09_process_sliding_window_... <<<(1,1,1),(4,1,1)>>> (output=..., input=...) at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:30 -30 input: LayoutTensor[mut=False, dtype, vector_layout], +30 input: TileTensor[mut=False, dtype, vector_layout], ``` #### Step 4: ๋ฉ”์ธ ๋กœ์ง์œผ๋กœ ์ด๋™ ```bash (cuda-gdb) n -29 output: LayoutTensor[mut=True, dtype, vector_layout], +29 output: TileTensor[mut=True, dtype, vector_layout], (cuda-gdb) n 32 thread_id = thread_idx.x (cuda-gdb) n @@ -193,7 +193,7 @@ Cannot access memory at address 0x0 Attempt to take address of value not located in memory. ``` -**โŒ ๋ฌธ์ œ**: LayoutTensor ์ง์ ‘ ์ธ๋ฑ์‹ฑ์ด ์ž‘๋™ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. +**โŒ ๋ฌธ์ œ**: TileTensor ์ง์ ‘ ์ธ๋ฑ์‹ฑ์ด ์ž‘๋™ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ```bash (cuda-gdb) p a.ptr[0] @@ -202,7 +202,7 @@ $2 = {0} $3 = {{0}, {1}, {2}, {3}} ``` -**๐ŸŽฏ ๋ŒํŒŒ๊ตฌ**: `a.ptr[0]@4`๋กœ ์ „์ฒด ์ž…๋ ฅ ๋ฐฐ์—ด์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ์ด๊ฒƒ์ด LayoutTensor ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒ€์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. +**๐ŸŽฏ ๋ŒํŒŒ๊ตฌ**: `a.ptr[0]@4`๋กœ ์ „์ฒด ์ž…๋ ฅ ๋ฐฐ์—ด์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ์ด๊ฒƒ์ด TileTensor ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒ€์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. 
### 3๋‹จ๊ณ„: ํ•ต์‹ฌ ๋ฐ˜๋ณต๋ฌธ ์กฐ์‚ฌ @@ -356,9 +356,9 @@ for offset in range(ITER): # โ† 2๋ฒˆ๋งŒ ๋ฐ˜๋ณต: [0, 1] - **ํ˜ธ์ŠคํŠธ ์ถœ๋ ฅ ํŒจํ„ด**์ด ์ค‘์š”ํ•œ ๋””๋ฒ„๊น… ๋‹จ์„œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค - **์†Œ์Šค ์ฝ”๋“œ ์ถ”๋ก **์ด ์ œํ•œ๋œ ๋””๋ฒ„๊ฑฐ ๊ธฐ๋Šฅ์„ ๋ณด์™„ํ•ฉ๋‹ˆ๋‹ค -**LayoutTensor ๋””๋ฒ„๊น…**: +**TileTensor ๋””๋ฒ„๊น…**: -- LayoutTensor ์ถ”์ƒํ™”๋ฅผ ์‚ฌ์šฉํ•ด๋„ ๊ทผ๋ณธ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฒ„๊ทธ๋Š” ๊ทธ๋Œ€๋กœ ๋“œ๋Ÿฌ๋‚ฉ๋‹ˆ๋‹ค +- TileTensor ์ถ”์ƒํ™”๋ฅผ ์‚ฌ์šฉํ•ด๋„ ๊ทผ๋ณธ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฒ„๊ทธ๋Š” ๊ทธ๋Œ€๋กœ ๋“œ๋Ÿฌ๋‚ฉ๋‹ˆ๋‹ค - ํ…์„œ ๋‚ด์šฉ์„ ๊ฒ€์‚ฌํ•˜๋ ค ํ•˜๊ธฐ๋ณด๋‹ค ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋กœ์ง์— ์ง‘์ค‘ํ•˜์„ธ์š” - ์ฒด๊ณ„์ ์ธ ์ถ”๋ก ์œผ๋กœ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ๊ณผ ์‹ค์ œ๋กœ ์ ‘๊ทผํ•˜๋Š” ๊ฒƒ์„ ์ถ”์ ํ•˜์„ธ์š” diff --git a/book/i18n/ko/src/puzzle_09/third_case.md b/book/i18n/ko/src/puzzle_09/third_case.md index 1c12a643..d6b8b02e 100644 --- a/book/i18n/ko/src/puzzle_09/third_case.md +++ b/book/i18n/ko/src/puzzle_09/third_case.md @@ -12,7 +12,7 @@ - **[๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€](./second_case.md)**: ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ ์ถœ๋ ฅ โ†’ ํŒจํ„ด ๋ถ„์„ โ†’ ๋กœ์ง ๋ฒ„๊ทธ ๋ฐœ๊ฒฌ - **์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€**: ํ”„๋กœ๊ทธ๋žจ ๋ฌดํ•œ ์ •์ง€ โ†’ ์Šค๋ ˆ๋“œ ์ƒํƒœ ์กฐ์‚ฌ โ†’ ์กฐ์œจ ๋ฒ„๊ทธ ๋ฐœ๊ฒฌ -์ด ๊ณ ๊ธ‰ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, LayoutTensor ์—ฐ์‚ฐ, ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”๊ฐ€ ์–ฝํžŒ **์Šค๋ ˆ๋“œ ์กฐ์œจ ์‹คํŒจ**๋ฅผ ์กฐ์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค - ์ด์ „ ์‚ฌ๋ก€๋“ค์—์„œ ์ตํžŒ ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ๊ธฐ์ˆ ์„ ์ด๋™์›ํ•ฉ๋‹ˆ๋‹ค. +์ด ๊ณ ๊ธ‰ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, TileTensor ์—ฐ์‚ฐ, ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”๊ฐ€ ์–ฝํžŒ **์Šค๋ ˆ๋“œ ์กฐ์œจ ์‹คํŒจ**๋ฅผ ์กฐ์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค - ์ด์ „ ์‚ฌ๋ก€๋“ค์—์„œ ์ตํžŒ ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ๊ธฐ์ˆ ์„ ์ด๋™์›ํ•ฉ๋‹ˆ๋‹ค. 
**์‚ฌ์ „ ์ค€๋น„**: [Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ](./essentials.md), [ํƒ์ • ์ˆ˜์‚ฌ: ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€](./first_case.md), [ํƒ์ • ์ˆ˜์‚ฌ: ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€](./second_case.md)๋ฅผ ๋จผ์ € ์™„๋ฃŒํ•ด์„œ CUDA-GDB ์›Œํฌํ”Œ๋กœ์šฐ, ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ์˜ ํ•œ๊ณ„, ์ฒด๊ณ„์ ์ธ ๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ•์„ ์ดํ•ดํ•˜์„ธ์š”. ์•„๋ž˜ ์„ค์ • ๋ช…๋ น์„ ์‹คํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”: @@ -25,7 +25,7 @@ pixi run -e nvidia setup-cuda-gdb ์ด๋ฒˆ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: - **๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ํƒ์ง€**: ์Šค๋ ˆ๋“œ๋“ค์ด ๋™๊ธฐํ™” ์ง€์ ์—์„œ ์˜์›ํžˆ ๊ธฐ๋‹ค๋ฆฌ๊ฒŒ ๋˜๋Š” ์ƒํ™ฉ ์‹๋ณ„ํ•˜๊ธฐ -- **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์œจ**: LayoutTensor๋ฅผ ์‚ฌ์šฉํ•œ ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ +- **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์œจ**: TileTensor๋ฅผ ์‚ฌ์šฉํ•œ ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ - **์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ๋ถ„์„**: ์ผ๋ถ€ ์Šค๋ ˆ๋“œ๊ฐ€ ๋‹ค๋ฅธ ์ฝ”๋“œ ๊ฒฝ๋กœ๋ฅผ ํƒˆ ๋•Œ ๋””๋ฒ„๊น…ํ•˜๊ธฐ - **์Šค๋ ˆ๋“œ ์กฐ์œจ ๋””๋ฒ„๊น…**: CUDA-GDB๋กœ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ์‹คํŒจ ๋ถ„์„ํ•˜๊ธฐ @@ -155,7 +155,7 @@ Waiting for GPU computation to complete... CUDA thread hit application kernel entry function breakpoint, p09_collaborative_filter_Orig6A6AcB6A6A_1882ca334fc2d34b2b9c4fa338df6c07<<<(1,1,1),(4,1,1)>>> ( output=..., a=...) 
at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:56 -56 a: LayoutTensor[mut=False, dtype, vector_layout], +56 a: TileTensor[mut=False, dtype, vector_layout], ``` **๐Ÿ” ์ฃผ์š” ๊ด€์ฐฐ**: @@ -169,7 +169,7 @@ CUDA thread hit application kernel entry function breakpoint, p09_collaborative_ ```bash (cuda-gdb) n -55 output: LayoutTensor[mut=True, dtype, vector_layout], +55 output: TileTensor[mut=True, dtype, vector_layout], (cuda-gdb) n 58 thread_id = thread_idx.x (cuda-gdb) n @@ -311,13 +311,13 @@ if thread_id < SIZE - 1: # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ง„์ž…ํ•˜์ง€ ์•Š์Œ ```mojo def collaborative_filter( - output: LayoutTensor[mut=True, dtype, vector_layout], - a: LayoutTensor[mut=False, dtype, vector_layout], + output: TileTensor[mut=True, dtype, vector_layout], + a: TileTensor[mut=False, dtype, vector_layout], ): thread_id = thread_idx.x - shared_workspace = LayoutTensor[ + shared_workspace = TileTensor[ dtype, - Layout.row_major(SIZE-1), + row_major[SIZE-1](), MutAnyOrigin, address_space = AddressSpace.SHARED, ].stack_allocation() @@ -364,7 +364,7 @@ def collaborative_filter( - **๋ฐฐ๋ฆฌ์–ด ๊ทœ์น™**: ๋ธ”๋ก์˜ **๋ชจ๋“ ** ์Šค๋ ˆ๋“œ๊ฐ€ **๊ฐ™์€** ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•ด์•ผ ํ•จ - **์กฐ๊ฑด๋ถ€ ์‹คํ–‰์˜ ํ•จ์ •**: ์–ด๋–ค if๋ฌธ์ด๋“  ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ๋ฅผ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ์Œ - **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์œจ**: ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•ด ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜์— ์ฃผ์˜ ํ•„์š” -- **LayoutTensor๊ฐ€ ๊ต์ฐฉ ์ƒํƒœ๋ฅผ ๋ง‰์•„์ฃผ์ง€ ์•Š์Œ**: ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋ผ๋„ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™”๋Š” ์—ฌ์ „ํžˆ ํ•„์š” +- **TileTensor๊ฐ€ ๊ต์ฐฉ ์ƒํƒœ๋ฅผ ๋ง‰์•„์ฃผ์ง€ ์•Š์Œ**: ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋ผ๋„ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™”๋Š” ์—ฌ์ „ํžˆ ํ•„์š” **๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ**: ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ๋Š” GPU ๋ฒ„๊ทธ ์ค‘ ๋””๋ฒ„๊น…ํ•˜๊ธฐ ๊ฐ€์žฅ ์–ด๋ ค์šด ์œ ํ˜•์— ์†ํ•ฉ๋‹ˆ๋‹ค: diff --git a/book/i18n/ko/src/puzzle_10/memcheck.md b/book/i18n/ko/src/puzzle_10/memcheck.md index c708f244..2127623a 100644 --- a/book/i18n/ko/src/puzzle_10/memcheck.md +++ b/book/i18n/ko/src/puzzle_10/memcheck.md @@ -8,13 +8,13 @@ 
**ํ•ต์‹ฌ ํ†ต์ฐฐ**: GPU ํ”„๋กœ๊ทธ๋žจ์€ ๋ถˆ๋ฒ•์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ˆ˜ํ–‰ํ•˜๋ฉด์„œ๋„ ๋™์‹œ์— "์˜ฌ๋ฐ”๋ฅธ" ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. -**์„ ํ–‰ ํ•™์Šต**: [Puzzle 4 LayoutTensor](../puzzle_04/introduction_layout_tensor.md)์™€ ๊ธฐ๋ณธ์ ์ธ GPU ๋ฉ”๋ชจ๋ฆฌ ๊ฐœ๋…์— ๋Œ€ํ•œ ์ดํ•ด๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. +**์„ ํ–‰ ํ•™์Šต**: [Puzzle 4 TileTensor](../puzzle_04/introduction_tile_tensor.md)์™€ ๊ธฐ๋ณธ์ ์ธ GPU ๋ฉ”๋ชจ๋ฆฌ ๊ฐœ๋…์— ๋Œ€ํ•œ ์ดํ•ด๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ## ์กฐ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ์˜ ๋ฐœ๊ฒฌ ### ํ…Œ์ŠคํŠธ๋Š” ํ†ต๊ณผํ–ˆ์ง€๋งŒ, ์ฝ”๋“œ๊ฐ€ ์ •๋ง ์˜ฌ๋ฐ”๋ฅธ ๊ฑธ๊นŒ? -์–ผํ• ๋ฌดํ•ดํ•ด ๋ณด์ด๊ณ  ์™„๋ฒฝํ•˜๊ฒŒ ๋™์ž‘ํ•˜๋Š” ๋“ฏํ•œ ํ”„๋กœ๊ทธ๋žจ์œผ๋กœ ์‹œ์ž‘ํ•ด ๋ด…์‹œ๋‹ค (๊ฐ€๋“œ๊ฐ€ ์—†๋Š” [Puzzle 04](../puzzle_04/layout_tensor.md)์ž…๋‹ˆ๋‹ค): +์–ผํ• ๋ฌดํ•ดํ•ด ๋ณด์ด๊ณ  ์™„๋ฒฝํ•˜๊ฒŒ ๋™์ž‘ํ•˜๋Š” ๋“ฏํ•œ ํ”„๋กœ๊ทธ๋žจ์œผ๋กœ ์‹œ์ž‘ํ•ด ๋ด…์‹œ๋‹ค (๊ฐ€๋“œ๊ฐ€ ์—†๋Š” [Puzzle 04](../puzzle_04/tile_tensor.md)์ž…๋‹ˆ๋‹ค): ```mojo {{#include ../../../../../problems/p10/p10.mojo:add_10_2d_no_guard}} @@ -163,10 +163,10 @@ Running memory bug example (bounds checking issue)... ### ํ•ด๊ฒฐ์ฑ… -[Puzzle 04](../puzzle_04/layout_tensor.md)์—์„œ ๋ณธ ๊ฒƒ์ฒ˜๋Ÿผ, ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค: +[Puzzle 04](../puzzle_04/tile_tensor.md)์—์„œ ๋ณธ ๊ฒƒ์ฒ˜๋Ÿผ, ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค: ```mojo -{{#include ../../../../../solutions/p04/p04_layout_tensor.mojo:add_10_2d_layout_tensor_solution}} +{{#include ../../../../../solutions/p04/p04_tile_tensor.mojo:add_10_2d_tile_tensor_solution}} ``` ํ•ด๊ฒฐ์ฑ…์€ ๊ฐ„๋‹จํ•ฉ๋‹ˆ๋‹ค: **๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๊ธฐ ์ „์— ํ•ญ์ƒ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๋ฐ์ดํ„ฐ ์ฐจ์›์— ๋Œ€ํ•ด ๊ฒ€์ฆ**ํ•˜์„ธ์š”. diff --git a/book/i18n/ko/src/puzzle_10/puzzle_10.md b/book/i18n/ko/src/puzzle_10/puzzle_10.md index dd72314c..273c2bf6 100644 --- a/book/i18n/ko/src/puzzle_10/puzzle_10.md +++ b/book/i18n/ko/src/puzzle_10/puzzle_10.md @@ -125,7 +125,7 @@ GPU ๊ฒ€์‚ฌ๋ฅผ ํ•˜๋ ค๋ฉด **๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ ํƒ์ •**์ด ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. 
 - Puzzle 1-8์—์„œ ๋‹ค๋ฃฌ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋… (๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ, ์Šค๋ ˆ๋“œ ์กฐ์œจ, ๋ฐฐ๋ฆฌ์–ด)
 - **[ํ˜ธํ™˜ NVIDIA GPU ํ•˜๋“œ์›จ์–ด](https://docs.modular.com/max/faq#gpu-requirements)**
 - `compute-sanitizer` ์ ‘๊ทผ์„ ์œ„ํ•œ `pixi` ํŒจํ‚ค์ง€ ๋งค๋‹ˆ์ € ํ™˜๊ฒฝ ์„ค์ •
-- **์„ ํ–‰ ํผ์ฆ**: [Puzzle 4](../puzzle_04/introduction_layout_tensor.md)์™€ [Puzzle 8](../puzzle_08/layout_tensor.md) ์ˆ™์ง€ ๊ถŒ์žฅ
+- **์„ ํ–‰ ํผ์ฆ**: [Puzzle 4](../puzzle_04/introduction_tile_tensor.md)์™€ [Puzzle 8](../puzzle_08/puzzle_08.md) ์ˆ™์ง€ ๊ถŒ์žฅ
 
 **๋ชฉํ‘œ**:
diff --git a/book/i18n/ko/src/puzzle_11/layout_tensor.md b/book/i18n/ko/src/puzzle_11/layout_tensor.md
deleted file mode 100644
index 188752cb..00000000
--- a/book/i18n/ko/src/puzzle_11/layout_tensor.md
+++ /dev/null
@@ -1,186 +0,0 @@
-
-
-## ๊ฐœ์š”
-
-1D LayoutTensor `a`์—์„œ ๊ฐ ์œ„์น˜์˜ ์ง์ „ 3๊ฐœ ๊ฐ’์˜ ํ•ฉ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D LayoutTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.
-
-**์ฐธ๊ณ :** _๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค._
-
-## ํ•ต์‹ฌ ๊ฐœ๋…
-
-์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:
-
-- LayoutTensor๋กœ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ตฌํ˜„ํ•˜๊ธฐ
-- [Puzzle 8](../puzzle_08/layout_tensor.md)์—์„œ ๋‹ค๋ฃฌ LayoutTensor ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌํ•˜๊ธฐ
-- ํšจ์œจ์ ์ธ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด
-- ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ
-
-ํ•ต์‹ฌ์€ LayoutTensor๊ฐ€ ํšจ์œจ์ ์ธ ์œˆ๋„์šฐ ๊ธฐ๋ฐ˜ ์—ฐ์‚ฐ์€ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ฐ„์†Œํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
- -## ๊ตฌ์„ฑ - -- ๋ฐฐ์—ด ํฌ๊ธฐ: `SIZE = 8` -- ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: `TPB = 8` -- ์œˆ๋„์šฐ ํฌ๊ธฐ: 3 -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: `TPB`๊ฐœ - -์ฐธ๊ณ : - -- **LayoutTensor ํ• ๋‹น**: `LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` ์‚ฌ์šฉ -- **์œˆ๋„์šฐ ์ ‘๊ทผ**: 3๊ฐœ์งœ๋ฆฌ ์œˆ๋„์šฐ์— ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ -- **๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ**: ์ฒ˜์Œ ๋‘ ์œ„์น˜๋Š” ํŠน์ˆ˜ ์ผ€์ด์Šค -- **๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด**: ์Šค๋ ˆ๋“œ๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ 1ํšŒ - -## ์™„์„ฑํ•  ์ฝ”๋“œ - -```mojo -{{#include ../../../../../problems/p11/p11_layout_tensor.mojo:pooling_layout_tensor}} -``` - -์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p11/p11_layout_tensor.mojo - -
-ํŒ - -
- -1. LayoutTensor์™€ ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ -2. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: `shared[local_i] = a[global_i]` -3. ์ฒ˜์Œ ๋‘ ์œ„์น˜๋ฅผ ํŠน์ˆ˜ ์ผ€์ด์Šค๋กœ ์ฒ˜๋ฆฌ -4. ์œˆ๋„์šฐ ์—ฐ์‚ฐ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ -5. ๊ฒฝ๊ณ„ ์ดˆ๊ณผ ์ ‘๊ทผ์— ๊ฐ€๋“œ ์ถ”๊ฐ€ - -
-
- -## ์ฝ”๋“œ ์‹คํ–‰ - -์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: - -
-
- - - - -
-
- -```bash -pixi run p11_layout_tensor -``` - -
-
- -```bash -pixi run -e amd p11_layout_tensor -``` - -
-
- -```bash -pixi run -e apple p11_layout_tensor -``` - -
-
- -```bash -uv run poe p11_layout_tensor -``` - -
-
- -ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]) -``` - -## ์†”๋ฃจ์…˜ - -
- - -```mojo -{{#include ../../../../../solutions/p11/p11_layout_tensor.mojo:pooling_layout_tensor_solution}} -``` - -
- -LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ํ•ฉ๊ณ„ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ์ฃผ์š” ๋‹จ๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: - -1. **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •** - - LayoutTensor๊ฐ€ ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๋ธ”๋ก ๋กœ์ปฌ ์ €์žฅ์†Œ๋ฅผ ์ƒ์„ฑ: - - ```txt - shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() - ``` - - - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์”ฉ ๋กœ๋“œ: - - ```txt - Input array: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] - Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] - ``` - - - `barrier()`๋กœ ๋ชจ๋“  ๋ฐ์ดํ„ฐ ๋กœ๋“œ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ - -2. **๊ฒฝ๊ณ„ ์ผ€์ด์Šค** - - ์œ„์น˜ 0: ํ•˜๋‚˜๋งŒ - - ```txt - output[0] = shared[0] = 0.0 - ``` - - - ์œ„์น˜ 1: ์ฒ˜์Œ ๋‘ ๊ฐ’์˜ ํ•ฉ - - ```txt - output[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0 - ``` - -3. **๋ฉ”์ธ ์œˆ๋„์šฐ ์—ฐ์‚ฐ** - - ์œ„์น˜ 2 ์ดํ›„: - - ```txt - Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0 - Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0 - Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0 - ... - ``` - - - LayoutTensor์˜ ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ: - - ```txt - # 3๊ฐœ์งœ๋ฆฌ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ - window_sum = shared[i-2] + shared[i-1] + shared[i] - ``` - -4. **๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด** - - ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๊ณต์œ  ํ…์„œ๋กœ ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ - - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ํšจ์œจ์ ์ธ ์ด์›ƒ ์ ‘๊ทผ - - LayoutTensor์˜ ์žฅ์ : - - ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ - - ์ž์—ฐ์Šค๋Ÿฌ์šด ์œˆ๋„์šฐ ์ธ๋ฑ์‹ฑ - - ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ - - ์ „ ๊ณผ์ •์— ๊ฑธ์นœ ํƒ€์ž… ์•ˆ์ „์„ฑ - -๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์„ฑ๋Šฅ๊ณผ LayoutTensor์˜ ์•ˆ์ „์„ฑ ๋ฐ ํŽธ์˜์„ฑ์„ ๊ฒฐํ•ฉํ•œ ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค: - -- ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์†Œํ™” -- ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ฐ„์†Œํ™” -- ๊น”๋”ํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ -- ๋ณ‘ํ•ฉ ์ ‘๊ทผ ํŒจํ„ด ์œ ์ง€ - -์ตœ์ข… ์ถœ๋ ฅ์€ ๋ˆ„์  ์œˆ๋„์šฐ ํ•ฉ๊ณ„์ž…๋‹ˆ๋‹ค: - -```txt -[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0] -``` - -
-
diff --git a/book/i18n/ko/src/puzzle_11/puzzle_11.md b/book/i18n/ko/src/puzzle_11/puzzle_11.md index 10904fe5..80e8540e 100644 --- a/book/i18n/ko/src/puzzle_11/puzzle_11.md +++ b/book/i18n/ko/src/puzzle_11/puzzle_11.md @@ -4,21 +4,188 @@ ## ๊ฐœ์š” -๋ฒกํ„ฐ `a`์—์„œ ๊ฐ ์œ„์น˜์˜ ์ง์ „ 3๊ฐœ ๊ฐ’์˜ ํ•ฉ์„ ๊ณ„์‚ฐํ•˜์—ฌ ๋ฒกํ„ฐ `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. +1D TileTensor `a`์—์„œ ๊ฐ ์œ„์น˜์˜ ์ง์ „ 3๊ฐœ ๊ฐ’์˜ ํ•ฉ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D TileTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. **์ฐธ๊ณ :** _๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค._ Pooling ์‹œ๊ฐํ™” Pooling ์‹œ๊ฐํ™” -## ๊ตฌํ˜„ ๋ฐฉ์‹ +## ํ•ต์‹ฌ ๊ฐœ๋… -### [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./raw.md) +์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: -์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ์„ ์ˆ˜๋™ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ์™€ ๋™๊ธฐํ™”๋กœ ์ง์ ‘ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค. +- TileTensor๋กœ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ตฌํ˜„ํ•˜๊ธฐ +- [Puzzle 8](../puzzle_08/puzzle_08.md)์—์„œ ๋‹ค๋ฃฌ TileTensor ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌํ•˜๊ธฐ +- ํšจ์œจ์ ์ธ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด +- ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ -### [๐Ÿ“ LayoutTensor ๋ฒ„์ „](./layout_tensor.md) +ํ•ต์‹ฌ์€ TileTensor๊ฐ€ ํšจ์œจ์ ์ธ ์œˆ๋„์šฐ ๊ธฐ๋ฐ˜ ์—ฐ์‚ฐ์€ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ฐ„์†Œํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. -LayoutTensor์˜ ๊ธฐ๋Šฅ์„ ํ™œ์šฉํ•ด ํšจ์œจ์ ์ธ ์œˆ๋„์šฐ ๊ธฐ๋ฐ˜ ์—ฐ์‚ฐ๊ณผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. +## ๊ตฌ์„ฑ -๐Ÿ’ก **์ฐธ๊ณ **: LayoutTensor๋กœ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ์ด ์–ผ๋งˆ๋‚˜ ๊ฐ„๊ฒฐํ•ด์ง€๋Š”์ง€ ํ™•์ธํ•ด ๋ณด์„ธ์š”. ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๋„ ๊ทธ๋Œ€๋กœ ์œ ์ง€๋ฉ๋‹ˆ๋‹ค. 
+- ๋ฐฐ์—ด ํฌ๊ธฐ: `SIZE = 8` +- ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: `TPB = 8` +- ์œˆ๋„์šฐ ํฌ๊ธฐ: 3 +- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: `TPB`๊ฐœ + +์ฐธ๊ณ : + +- **TileTensor ํ• ๋‹น**: `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]())` ์‚ฌ์šฉ +- **์œˆ๋„์šฐ ์ ‘๊ทผ**: 3๊ฐœ์งœ๋ฆฌ ์œˆ๋„์šฐ์— ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ +- **๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ**: ์ฒ˜์Œ ๋‘ ์œ„์น˜๋Š” ํŠน์ˆ˜ ์ผ€์ด์Šค +- **๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด**: ์Šค๋ ˆ๋“œ๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ 1ํšŒ + +## ์™„์„ฑํ•  ์ฝ”๋“œ + +```mojo +{{#include ../../../../../problems/p11/p11.mojo:pooling}} +``` + +์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p11/p11.mojo + +
+ํŒ + +
+ +1. TileTensor์™€ ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ +2. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: `shared[local_i] = a[global_i]` +3. ์ฒ˜์Œ ๋‘ ์œ„์น˜๋ฅผ ํŠน์ˆ˜ ์ผ€์ด์Šค๋กœ ์ฒ˜๋ฆฌ +4. ์œˆ๋„์šฐ ์—ฐ์‚ฐ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ +5. ๊ฒฝ๊ณ„ ์ดˆ๊ณผ ์ ‘๊ทผ์— ๊ฐ€๋“œ ์ถ”๊ฐ€ + +
+
+ +## ์ฝ”๋“œ ์‹คํ–‰ + +์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: + +
+
+ + + + +
+
+ +```bash +pixi run p11 +``` + +
+
+ +```bash +pixi run -e amd p11 +``` + +
+
+ +```bash +pixi run -e apple p11 +``` + +
+
+ +```bash +uv run poe p11 +``` + +
+
+ +ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: + +```txt +out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) +expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]) +``` + +## ์†”๋ฃจ์…˜ + +
+ + +```mojo +{{#include ../../../../../solutions/p11/p11.mojo:pooling_solution}} +``` + +
+ +TileTensor๋ฅผ ํ™œ์šฉํ•œ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ํ•ฉ๊ณ„ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ์ฃผ์š” ๋‹จ๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: + +1. **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •** + - TileTensor๊ฐ€ ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๋ธ”๋ก ๋กœ์ปฌ ์ €์žฅ์†Œ๋ฅผ ์ƒ์„ฑ: + + ```txt + shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]()) + ``` + + - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์”ฉ ๋กœ๋“œ: + + ```txt + Input array: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] + Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] + ``` + + - `barrier()`๋กœ ๋ชจ๋“  ๋ฐ์ดํ„ฐ ๋กœ๋“œ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ + +2. **๊ฒฝ๊ณ„ ์ผ€์ด์Šค** + - ์œ„์น˜ 0: ํ•˜๋‚˜๋งŒ + + ```txt + output[0] = shared[0] = 0.0 + ``` + + - ์œ„์น˜ 1: ์ฒ˜์Œ ๋‘ ๊ฐ’์˜ ํ•ฉ + + ```txt + output[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0 + ``` + +3. **๋ฉ”์ธ ์œˆ๋„์šฐ ์—ฐ์‚ฐ** + - ์œ„์น˜ 2 ์ดํ›„: + + ```txt + Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0 + Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0 + Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0 + ... + ``` + + - TileTensor์˜ ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ: + + ```txt + # 3๊ฐœ์งœ๋ฆฌ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ + window_sum = shared[i-2] + shared[i-1] + shared[i] + ``` + +4. **๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด** + - ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๊ณต์œ  ํ…์„œ๋กœ ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ + - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ํšจ์œจ์ ์ธ ์ด์›ƒ ์ ‘๊ทผ + - TileTensor์˜ ์žฅ์ : + - ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ + - ์ž์—ฐ์Šค๋Ÿฌ์šด ์œˆ๋„์šฐ ์ธ๋ฑ์‹ฑ + - ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ + - ์ „ ๊ณผ์ •์— ๊ฑธ์นœ ํƒ€์ž… ์•ˆ์ „์„ฑ + +๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์„ฑ๋Šฅ๊ณผ TileTensor์˜ ์•ˆ์ „์„ฑ ๋ฐ ํŽธ์˜์„ฑ์„ ๊ฒฐํ•ฉํ•œ ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค: + +- ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์†Œํ™” +- ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ฐ„์†Œํ™” +- ๊น”๋”ํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ +- ๋ณ‘ํ•ฉ ์ ‘๊ทผ ํŒจํ„ด ์œ ์ง€ + +์ตœ์ข… ์ถœ๋ ฅ์€ ๋ˆ„์  ์œˆ๋„์šฐ ํ•ฉ๊ณ„์ž…๋‹ˆ๋‹ค: + +```txt +[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0] +``` + +
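+์œ„ ํ’€์ด์˜ ์ธ๋ฑ์‹ฑ ๋กœ์ง์€ GPU ์—†์ด๋„ ๊ฒ€์ฆํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” Mojo/TileTensor API๊ฐ€ ์•„๋‹ˆ๋ผ, ์ปค๋„์˜ ๊ฒฝ๊ณ„ ์ผ€์ด์Šค์™€ ์œˆ๋„์šฐ ๊ณ„์‚ฐ๋งŒ ๊ทธ๋Œ€๋กœ ์˜ฎ๊ธด ๊ฐ€์ •์ ์ธ Python CPU ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค (๋ณ€์ˆ˜ ์ด๋ฆ„ `shared`, `local_i`, `global_i`๋Š” ๋ณธ๋ฌธ ์„ค๋ช…์—์„œ ๋”ฐ์™”์Šต๋‹ˆ๋‹ค):

```python
# ๊ฐ€์ •: ๋ณธ๋ฌธ ๊ตฌ์„ฑ๊ณผ ๋™์ผํ•˜๊ฒŒ SIZE = TPB = 8, ๋ธ”๋ก 1๊ฐœ
TPB = 8
SIZE = 8
a = [float(i) for i in range(SIZE)]
output = [0.0] * SIZE

# 1) ๊ฐ "์Šค๋ ˆ๋“œ"๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ๊ฐ’ ํ•˜๋‚˜์”ฉ ๋กœ๋“œ
shared = [0.0] * TPB
for local_i in range(TPB):
    global_i = local_i  # ๋ธ”๋ก์ด ํ•˜๋‚˜๋ฟ์ด๋ฏ€๋กœ ๋‘ ์ธ๋ฑ์Šค๊ฐ€ ๊ฐ™์Œ
    if global_i < SIZE:  # ๊ฒฝ๊ณ„ ๊ฐ€๋“œ
        shared[local_i] = a[global_i]
# ์—ฌ๊ธฐ๊ฐ€ barrier()์— ํ•ด๋‹น: ๋ชจ๋“  ๋กœ๋“œ๊ฐ€ ๋๋‚œ ๋’ค์—๋งŒ ์œˆ๋„์šฐ ๊ณ„์‚ฐ ์‹œ์ž‘

# 2) ๊ฒฝ๊ณ„ ์ผ€์ด์Šค(์œ„์น˜ 0, 1)์™€ ๋ฉ”์ธ ์œˆ๋„์šฐ ์—ฐ์‚ฐ
for local_i in range(TPB):
    global_i = local_i
    if global_i >= SIZE:  # ๊ฒฝ๊ณ„ ์ดˆ๊ณผ ์ ‘๊ทผ ๊ฐ€๋“œ
        continue
    if global_i == 0:
        output[0] = shared[0]
    elif global_i == 1:
        output[1] = shared[0] + shared[1]
    else:
        # 3๊ฐœ์งœ๋ฆฌ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ
        output[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )

print(output)  # [0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
```

์‹ค์ œ ์ปค๋„์—์„œ๋Š” ์œ„ ๋‘ ๋ฃจํ”„๊ฐ€ ๊ฐ๊ฐ "๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์‹คํ–‰ํ•˜๋Š” ํ•œ ๋ฒˆ์˜ ๋‹จ๊ณ„"์— ํ•ด๋‹นํ•˜๊ณ , ๊ทธ ์‚ฌ์ด์˜ ์ˆœ์„œ ๋ณด์žฅ์ด ๋ฐ”๋กœ `barrier()`์˜ ์—ญํ• ์ž…๋‹ˆ๋‹ค.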
+
diff --git a/book/i18n/ko/src/puzzle_11/raw.md b/book/i18n/ko/src/puzzle_11/raw.md deleted file mode 100644 index 1a100ae5..00000000 --- a/book/i18n/ko/src/puzzle_11/raw.md +++ /dev/null @@ -1,176 +0,0 @@ - - -## ๊ฐœ์š” - -๋ฒกํ„ฐ `a`์—์„œ ๊ฐ ์œ„์น˜์˜ ์ง์ „ 3๊ฐœ ๊ฐ’์˜ ํ•ฉ์„ ๊ณ„์‚ฐํ•˜์—ฌ ๋ฒกํ„ฐ `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. - -**์ฐธ๊ณ :** _๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค._ - -## ํ•ต์‹ฌ ๊ฐœ๋… - -์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: - -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ตฌํ˜„ํ•˜๊ธฐ -- ํ’€๋ง์˜ ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ -- ์ด์›ƒ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ์„ ์œ„ํ•œ ์Šค๋ ˆ๋“œ ๊ฐ„ ํ˜‘๋ ฅ - -ํ•ต์‹ฌ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ด ์œˆ๋„์šฐ ๋‚ด ๊ฐ’๋“ค์— ํšจ์œจ์ ์œผ๋กœ ์ ‘๊ทผํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์‹œํ€€์Šค ์•ž๋ถ€๋ถ„์€ ํŠน๋ณ„ํžˆ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. - -## ๊ตฌ์„ฑ - -- ๋ฐฐ์—ด ํฌ๊ธฐ: `SIZE = 8` -- ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: `TPB = 8` -- ์œˆ๋„์šฐ ํฌ๊ธฐ: 3 -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: `TPB`๊ฐœ - -์ฐธ๊ณ : - -- **์œˆ๋„์šฐ ์ ‘๊ทผ**: ๊ฐ ์ถœ๋ ฅ์€ ์ด์ „ ์ตœ๋Œ€ 3๊ฐœ ๊ฐ’์— ์˜์กดํ•ฉ๋‹ˆ๋‹ค -- **๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ**: ์ฒ˜์Œ ๋‘ ์œ„์น˜๋Š” ํŠน๋ณ„ํ•œ ์ฒ˜๋ฆฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค -- **๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด**: ์Šค๋ ˆ๋“œ๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ 1ํšŒ -- **์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”**: ์œˆ๋„์šฐ ์—ฐ์‚ฐ ์ „์— ์กฐ์œจ ํ•„์š” - -## ์™„์„ฑํ•  ์ฝ”๋“œ - -```mojo -{{#include ../../../../../problems/p11/p11.mojo:pooling}} -``` - -์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p11/p11.mojo - -
-ํŒ - -
- -1. ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œํ•˜๊ณ  `barrier()` ํ˜ธ์ถœ -2. ํŠน์ˆ˜ ์ผ€์ด์Šค: `output[0] = shared[0]`, `output[1] = shared[0] + shared[1]` -3. ์ผ๋ฐ˜ ์ผ€์ด์Šค: `if 1 < global_i < size` -4. ์„ธ ๊ฐ’์˜ ํ•ฉ: `shared[local_i - 2] + shared[local_i - 1] + shared[local_i]` - -
-
- -## ์ฝ”๋“œ ์‹คํ–‰ - -์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: - -
-
- - - - -
-
- -```bash -pixi run p11 -``` - -
-
- -```bash -pixi run -e amd p11 -``` - -
-
- -```bash -pixi run -e apple p11 -``` - -
-
- -```bash -uv run poe p11 -``` - -
-
- -ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]) -``` - -## ์†”๋ฃจ์…˜ - -
- - -```mojo -{{#include ../../../../../solutions/p11/p11.mojo:pooling_solution}} -``` - -
- -๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ํ•ฉ๊ณ„ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ์ฃผ์š” ๋‹จ๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: - -1. **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •** - - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— `TPB`๊ฐœ ํ• ๋‹น: - - ```txt - Input array: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] - Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] - ``` - - - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ํ•˜๋‚˜์”ฉ ๋กœ๋“œ - - `barrier()`๋กœ ๋ชจ๋“  ๋ฐ์ดํ„ฐ ๋กœ๋“œ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ - -2. **๊ฒฝ๊ณ„ ์ผ€์ด์Šค** - - ์œ„์น˜ 0: ํ•˜๋‚˜๋งŒ - - ```txt - output[0] = shared[0] = 0.0 - ``` - - - ์œ„์น˜ 1: ์ฒ˜์Œ ๋‘ ๊ฐ’์˜ ํ•ฉ - - ```txt - output[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0 - ``` - -3. **๋ฉ”์ธ ์œˆ๋„์šฐ ์—ฐ์‚ฐ** - - ์œ„์น˜ 2 ์ดํ›„: - - ```txt - Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0 - Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0 - Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0 - ... - ``` - - - ๋กœ์ปฌ ์ธ๋ฑ์Šค๋ฅผ ์‚ฌ์šฉํ•œ ์œˆ๋„์šฐ ๊ณ„์‚ฐ: - - ```txt - # 3๊ฐœ์งœ๋ฆฌ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ - window_sum = shared[i-2] + shared[i-1] + shared[i] - ``` - -4. **๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด** - - ์Šค๋ ˆ๋“œ๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ - - ์Šค๋ ˆ๋“œ๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ - - ์ด์›ƒ ์ ‘๊ทผ์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ - - ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์œ ์ง€ - -์ด ๋ฐฉ์‹์˜ ์„ฑ๋Šฅ ์ตœ์ ํ™” ํฌ์ธํŠธ: - -- ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์†Œํ™” -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋น ๋ฅธ ์ด์›ƒ ์กฐํšŒ -- ๊น”๋”ํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ -- ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ - -์ตœ์ข… ์ถœ๋ ฅ์€ ๋ˆ„์  ์œˆ๋„์šฐ ํ•ฉ๊ณ„๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: - -```txt -[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0] -``` - -
-
diff --git a/book/i18n/ko/src/puzzle_12/layout_tensor.md b/book/i18n/ko/src/puzzle_12/layout_tensor.md deleted file mode 100644 index 986f60e7..00000000 --- a/book/i18n/ko/src/puzzle_12/layout_tensor.md +++ /dev/null @@ -1,230 +0,0 @@ - - -## ๊ฐœ์š” - -1D LayoutTensor `a`์™€ 1D LayoutTensor `b`์˜ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D LayoutTensor `output`(๋‹จ์ผ ๊ฐ’)์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. - -**์ฐธ๊ณ :** _๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ๋ธ”๋ก๋‹น ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค._ - -## ํ•ต์‹ฌ ๊ฐœ๋… - -์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: - -- [Puzzle 8](../puzzle_08/layout_tensor.md), [Puzzle 11](../puzzle_11/layout_tensor.md)์—์„œ ์ด์–ด์ง€๋Š” LayoutTensor ๊ธฐ๋ฐ˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ -- `address_space`๋ฅผ ํ™œ์šฉํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ -- ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ํ˜‘๋ ฅํ•ด ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด๊ฐ€๋Š” ๊ณผ์ • -- ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ํ…์„œ ์—ฐ์‚ฐ - -ํ•ต์‹ฌ์€ LayoutTensor๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ฐ„์†Œํ™”ํ•˜๋ฉด์„œ๋„, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์˜ ํšจ์œจ์€ ๊ทธ๋Œ€๋กœ ์‚ด๋ฆฌ๋Š” ๋ฐฉ์‹์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. - -## ๊ตฌ์„ฑ - -- ๋ฒกํ„ฐ ํฌ๊ธฐ: `SIZE = 8` -- ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: `TPB = 8` -- ๋ธ”๋ก ์ˆ˜: 1 -- ์ถœ๋ ฅ ํฌ๊ธฐ: 1 -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: `TPB`๊ฐœ - -์ฐธ๊ณ : - -- **LayoutTensor ํ• ๋‹น**: `LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` ์‚ฌ์šฉ -- **์š”์†Œ ์ ‘๊ทผ**: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์ž๋™์œผ๋กœ ๋”ฐ๋ผ์˜ค๋Š” ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ -- **๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ**: ์ž…๋ ฅ์šฉ๊ณผ ์ถœ๋ ฅ์šฉ ๋ ˆ์ด์•„์›ƒ์„ ๋”ฐ๋กœ ๊ตฌ์„ฑ -- **์Šค๋ ˆ๋“œ ์กฐ์œจ**: ๋™์ผํ•œ ๋™๊ธฐํ™” ํŒจํ„ด์œผ๋กœ `barrier()` ์‚ฌ์šฉ - -## ์™„์„ฑํ•  ์ฝ”๋“œ - -```mojo -{{#include ../../../../../problems/p12/p12_layout_tensor.mojo:dot_product_layout_tensor}} -``` - -์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p12/p12_layout_tensor.mojo - -
-ํŒ - -
- -1. LayoutTensor์™€ `address_space`๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ -2. `shared[local_i]`์— `a[global_i] * b[global_i]`๋ฅผ ์ €์žฅ -3. `barrier()`์™€ ํ•จ๊ป˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด ์ ์šฉ -4. ์Šค๋ ˆ๋“œ 0์ด ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ `output[0]`์— ๊ธฐ๋ก - -
-
- -## ์ฝ”๋“œ ์‹คํ–‰ - -์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: - -
-
- - - - -
-
- -```bash -pixi run p12_layout_tensor -``` - -
-
- -```bash -pixi run -e amd p12_layout_tensor -``` - -
-
- -```bash -pixi run -e apple p12_layout_tensor -``` - -
-
- -```bash -uv run poe p12_layout_tensor -``` - -
-
- -ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: - -```txt -out: HostBuffer([0.0]) -expected: HostBuffer([140.0]) -``` - -## ์†”๋ฃจ์…˜ - -
- - -```mojo -{{#include ../../../../../solutions/p12/p12_layout_tensor.mojo:dot_product_layout_tensor_solution}} -``` - -
- -LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์œผ๋กœ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค: - -### 1๋‹จ๊ณ„: ์š”์†Œ๋ณ„ ๊ณฑ์…ˆ - -๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ง๊ด€์ ์ธ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๊ณฑ์…ˆ ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์”ฉ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค: - -```mojo -shared[local_i] = a[global_i] * b[global_i] -``` - -### 2๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ - -๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜์ž…๋‹ˆ๋‹ค: - -```txt -์ดˆ๊ธฐ๊ฐ’: [0*0 1*1 2*2 3*3 4*4 5*5 6*6 7*7] - = [0 1 4 9 16 25 36 49] - -Step 1: [0+16 1+25 4+36 9+49 16 25 36 49] - = [16 26 40 58 16 25 36 49] - -Step 2: [16+40 26+58 40 58 16 25 36 49] - = [56 84 40 58 16 25 36 49] - -Step 3: [56+84 84 40 58 16 25 36 49] - = [140 84 40 58 16 25 36 49] -``` - -### ๊ตฌํ˜„์˜ ํ•ต์‹ฌ ํŠน์ง• - -1. **๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ**: - - `address_space` ํŒŒ๋ผ๋ฏธํ„ฐ ํ•˜๋‚˜๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊น”๋”ํ•˜๊ฒŒ ํ• ๋‹น - - ํƒ€์ž… ์•ˆ์ „ํ•œ ์—ฐ์‚ฐ์ด ๋ณด์žฅ๋˜๊ณ  - - ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์ž๋™์œผ๋กœ ๋”ฐ๋ผ์˜ค๋ฉฐ - - ์ธ๋ฑ์‹ฑ๋„ ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ - -2. **์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”**: - - ์ดˆ๊ธฐ ๊ณฑ์…ˆ์ด ๋๋‚˜๋ฉด `barrier()` - - ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ์‚ฌ์ด๋งˆ๋‹ค `barrier()` - - ์Šค๋ ˆ๋“œ ๊ฐ„ ์•ˆ์ „ํ•œ ์กฐ์œจ ๋ณด์žฅ - -3. **๋ฆฌ๋•์…˜ ๋กœ์ง**: - - ```mojo - stride = TPB // 2 - while stride > 0: - if local_i < stride: - shared[local_i] += shared[local_i + stride] - barrier() - stride //= 2 - ``` - -4. **์„ฑ๋Šฅ์ƒ ์ด์ **: - - \\(O(\log n)\\) ์‹œ๊ฐ„ ๋ณต์žก๋„ - - ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ - - ์ตœ์†Œํ•œ์˜ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ - - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ํšจ์œจ์  ํ™œ์šฉ - -LayoutTensor ๋ฒ„์ „์€ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์˜ ํšจ์œจ์€ ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•˜๋ฉด์„œ, ์—ฌ๊ธฐ์— ๋”ํ•ด: - -- ํƒ€์ž… ์•ˆ์ „์„ฑ์ด ํ•œ์ธต ๊ฐ•ํ™”๋˜๊ณ  -- ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๊ฐ€ ๋” ๊น”๋”ํ•ด์ง€๋ฉฐ -- ๋ ˆ์ด์•„์›ƒ์„ ์ž๋™์œผ๋กœ ์ธ์‹ํ•˜๊ณ  -- ์ธ๋ฑ์‹ฑ ๋ฌธ๋ฒ•๋„ ์ž์—ฐ์Šค๋Ÿฌ์›Œ์ง‘๋‹ˆ๋‹ค - -### ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”์˜ ์ค‘์š”์„ฑ - -๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ์‚ฌ์ด์˜ `barrier()`๋Š” ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. 
๊ทธ ์ด์œ ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค: - -`barrier()`๊ฐ€ ์—†์œผ๋ฉด ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค: - -```text -์ดˆ๊ธฐ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: [0 1 4 9 16 25 36 49] - -Step 1 (stride = 4): -Thread 0 ์ฝ๊ธฐ: shared[0] = 0, shared[4] = 16 -Thread 1 ์ฝ๊ธฐ: shared[1] = 1, shared[5] = 25 -Thread 2 ์ฝ๊ธฐ: shared[2] = 4, shared[6] = 36 -Thread 3 ์ฝ๊ธฐ: shared[3] = 9, shared[7] = 49 - -barrier ์—†์ด: -- Thread 0 ์“ฐ๊ธฐ: shared[0] = 0 + 16 = 16 -- Thread 1์ด Thread 0๋ณด๋‹ค ๋จผ์ € ๋‹ค์Œ ๋‹จ๊ณ„(stride = 2)๋กœ ๋„˜์–ด๊ฐ€์„œ - 16์ด ์•„๋‹Œ ์ด์ „ ๊ฐ’ shared[0] = 0์„ ์ฝ์–ด๋ฒ„๋ฆฝ๋‹ˆ๋‹ค! -``` - -`barrier()`๊ฐ€ ์žˆ์œผ๋ฉด: - -```text -Step 1 (stride = 4): -๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ•ฉ์„ ๊ธฐ๋ก: -[16 26 40 58 16 25 36 49] -barrier()๊ฐ€ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์—๊ฒŒ ์ด ๊ฐ’๋“ค์ด ๋ณด์ด๋„๋ก ๋ณด์žฅ - -Step 2 (stride = 2): -์ด์ œ ์—…๋ฐ์ดํŠธ๋œ ๊ฐ’์„ ์•ˆ์ „ํ•˜๊ฒŒ ์ฝ์„ ์ˆ˜ ์žˆ์Œ: -Thread 0: shared[0] = 16 + 40 = 56 -Thread 1: shared[1] = 26 + 58 = 84 -``` - -`barrier()`๋Š” ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค: - -1. ํ˜„์žฌ ๋‹จ๊ณ„์˜ ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ๋๋‚œ ๋’ค์—์•ผ ๋‹ค์Œ์œผ๋กœ ๋„˜์–ด๊ฐ -2. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์‹  ๊ฐ’์„ ๋ณผ ์ˆ˜ ์žˆ์Œ -3. ์–ด๋–ค ์Šค๋ ˆ๋“œ๋„ ์•ž์„œ ๋‚˜๊ฐ€์ง€ ์•Š์Œ -4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•ญ์ƒ ์ผ๊ด€๋œ ์ƒํƒœ๋ฅผ ์œ ์ง€ - -์ด๋Ÿฐ ๋™๊ธฐํ™” ์ง€์ ์ด ์—†์œผ๋ฉด: - -- ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ  -- ์Šค๋ ˆ๋“œ๊ฐ€ ์ด๋ฏธ ์ง€๋‚œ ๊ฐ’์„ ์ฝ๊ฒŒ ๋˜๋ฉฐ -- ์‹คํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์ง€๊ณ  -- ์ตœ์ข… ํ•ฉ๊ณ„๊ฐ€ ํ‹€์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค - -
-
diff --git a/book/i18n/ko/src/puzzle_12/puzzle_12.md b/book/i18n/ko/src/puzzle_12/puzzle_12.md index 7420e049..ccfdb365 100644 --- a/book/i18n/ko/src/puzzle_12/puzzle_12.md +++ b/book/i18n/ko/src/puzzle_12/puzzle_12.md @@ -4,7 +4,7 @@ ## ๊ฐœ์š” -๋ฒกํ„ฐ `a`์™€ ๋ฒกํ„ฐ `b`์˜ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜์—ฌ `output`(๋‹จ์ผ ๊ฐ’)์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. ๋‚ด์ ์€ ํฌ๊ธฐ๊ฐ€ ๊ฐ™์€ ๋‘ ๋ฒกํ„ฐ์—์„œ ๋Œ€์‘ํ•˜๋Š” ์›์†Œ๋ผ๋ฆฌ ๊ณฑํ•œ ๋’ค, ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ๋ชจ๋‘ ๋”ํ•ด ํ•˜๋‚˜์˜ ์ˆซ์ž(์Šค์นผ๋ผ)๋ฅผ ๊ตฌํ•˜๋Š” ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค. +1D TileTensor `a`์™€ 1D TileTensor `b`์˜ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D TileTensor `output`(๋‹จ์ผ ๊ฐ’)์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. ๋‚ด์ ์€ ํฌ๊ธฐ๊ฐ€ ๊ฐ™์€ ๋‘ ๋ฒกํ„ฐ์—์„œ ๋Œ€์‘ํ•˜๋Š” ์›์†Œ๋ผ๋ฆฌ ๊ณฑํ•œ ๋’ค, ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ๋ชจ๋‘ ๋”ํ•ด ํ•˜๋‚˜์˜ ์ˆซ์ž(์Šค์นผ๋ผ)๋ฅผ ๊ตฌํ•˜๋Š” ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๋‘ ๋ฒกํ„ฐ๊ฐ€ ๋‹ค์Œ๊ณผ ๊ฐ™์„ ๋•Œ: @@ -19,14 +19,225 @@ ๋‚ด์  ์‹œ๊ฐํ™” ๋‚ด์  ์‹œ๊ฐํ™” -## ๊ตฌํ˜„ ๋ฐฉ์‹ +## ํ•ต์‹ฌ ๊ฐœ๋… -### [๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹](./raw.md) +์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ: -์ˆ˜๋™ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ์™€ ๋™๊ธฐํ™”๋กœ ๋ฆฌ๋•์…˜์„ ๋ฐ‘๋ฐ”๋‹ฅ๋ถ€ํ„ฐ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค. +- [Puzzle 8](../puzzle_08/puzzle_08.md), [Puzzle 11](../puzzle_11/puzzle_11.md)์—์„œ ์ด์–ด์ง€๋Š” TileTensor ๊ธฐ๋ฐ˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ +- `address_space`๋ฅผ ํ™œ์šฉํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ +- ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ํ˜‘๋ ฅํ•ด ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด๊ฐ€๋Š” ๊ณผ์ • +- ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ํ…์„œ ์—ฐ์‚ฐ -### [๐Ÿ“ LayoutTensor ๋ฒ„์ „](./layout_tensor.md) +ํ•ต์‹ฌ์€ TileTensor๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ฐ„์†Œํ™”ํ•˜๋ฉด์„œ๋„, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์˜ ํšจ์œจ์€ ๊ทธ๋Œ€๋กœ ์‚ด๋ฆฌ๋Š” ๋ฐฉ์‹์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. -LayoutTensor๋ฅผ ํ™œ์šฉํ•ด ๋ฆฌ๋•์…˜๊ณผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๋” ๊ฐ„๊ฒฐํ•˜๊ฒŒ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. +## ๊ตฌ์„ฑ -๐Ÿ’ก **์ฐธ๊ณ **: LayoutTensor๋กœ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์–ผ๋งˆ๋‚˜ ๊น”๋”ํ•ด์ง€๋Š”์ง€ ํ™•์ธํ•ด ๋ณด์„ธ์š”. ํšจ์œจ์€ ๊ทธ๋Œ€๋กœ์ž…๋‹ˆ๋‹ค. 
+- ๋ฒกํ„ฐ ํฌ๊ธฐ: `SIZE = 8` +- ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: `TPB = 8` +- ๋ธ”๋ก ์ˆ˜: 1 +- ์ถœ๋ ฅ ํฌ๊ธฐ: 1 +- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: `TPB`๊ฐœ + +์ฐธ๊ณ : + +- **TileTensor ํ• ๋‹น**: `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]())` ์‚ฌ์šฉ +- **์š”์†Œ ์ ‘๊ทผ**: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์ž๋™์œผ๋กœ ๋”ฐ๋ผ์˜ค๋Š” ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ +- **๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ**: ์ž…๋ ฅ์šฉ๊ณผ ์ถœ๋ ฅ์šฉ ๋ ˆ์ด์•„์›ƒ์„ ๋”ฐ๋กœ ๊ตฌ์„ฑ +- **์Šค๋ ˆ๋“œ ์กฐ์œจ**: ๋™์ผํ•œ ๋™๊ธฐํ™” ํŒจํ„ด์œผ๋กœ `barrier()` ์‚ฌ์šฉ + +## ์™„์„ฑํ•  ์ฝ”๋“œ + +```mojo +{{#include ../../../../../problems/p12/p12.mojo:dot_product}} +``` + +์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p12/p12.mojo + +
+ํŒ + +
+ +1. TileTensor์™€ `address_space`๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ +2. `shared[local_i]`์— `a[global_i] * b[global_i]`๋ฅผ ์ €์žฅ +3. `barrier()`์™€ ํ•จ๊ป˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด ์ ์šฉ +4. ์Šค๋ ˆ๋“œ 0์ด ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ `output[0]`์— ๊ธฐ๋ก + +
+
+ +## ์ฝ”๋“œ ์‹คํ–‰ + +์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: + +
+
+ + + + +
+
+ +```bash +pixi run p12 +``` + +
+
+ +```bash +pixi run -e amd p12 +``` + +
+
+ +```bash +pixi run -e apple p12 +``` + +
+
+ +```bash +uv run poe p12 +``` + +
+
+ +ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: + +```txt +out: HostBuffer([0.0]) +expected: HostBuffer([140.0]) +``` + +## ์†”๋ฃจ์…˜ + +
+ + +```mojo +{{#include ../../../../../solutions/p12/p12.mojo:dot_product_solution}} +``` + +
+ +TileTensor๋ฅผ ํ™œ์šฉํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์œผ๋กœ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค: + +### 1๋‹จ๊ณ„: ์š”์†Œ๋ณ„ ๊ณฑ์…ˆ + +๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ง๊ด€์ ์ธ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๊ณฑ์…ˆ ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์”ฉ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค: + +```mojo +shared[local_i] = a[global_i] * b[global_i] +``` + +### 2๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ + +๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜์ž…๋‹ˆ๋‹ค: + +```txt +์ดˆ๊ธฐ๊ฐ’: [0*0 1*1 2*2 3*3 4*4 5*5 6*6 7*7] + = [0 1 4 9 16 25 36 49] + +Step 1: [0+16 1+25 4+36 9+49 16 25 36 49] + = [16 26 40 58 16 25 36 49] + +Step 2: [16+40 26+58 40 58 16 25 36 49] + = [56 84 40 58 16 25 36 49] + +Step 3: [56+84 84 40 58 16 25 36 49] + = [140 84 40 58 16 25 36 49] +``` + +### ๊ตฌํ˜„์˜ ํ•ต์‹ฌ ํŠน์ง• + +1. **๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ**: + - `address_space` ํŒŒ๋ผ๋ฏธํ„ฐ ํ•˜๋‚˜๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊น”๋”ํ•˜๊ฒŒ ํ• ๋‹น + - ํƒ€์ž… ์•ˆ์ „ํ•œ ์—ฐ์‚ฐ์ด ๋ณด์žฅ๋˜๊ณ  + - ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์ž๋™์œผ๋กœ ๋”ฐ๋ผ์˜ค๋ฉฐ + - ์ธ๋ฑ์‹ฑ๋„ ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ + +2. **์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”**: + - ์ดˆ๊ธฐ ๊ณฑ์…ˆ์ด ๋๋‚˜๋ฉด `barrier()` + - ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ์‚ฌ์ด๋งˆ๋‹ค `barrier()` + - ์Šค๋ ˆ๋“œ ๊ฐ„ ์•ˆ์ „ํ•œ ์กฐ์œจ ๋ณด์žฅ + +3. **๋ฆฌ๋•์…˜ ๋กœ์ง**: + + ```mojo + stride = TPB // 2 + while stride > 0: + if local_i < stride: + shared[local_i] += shared[local_i + stride] + barrier() + stride //= 2 + ``` + +4. **์„ฑ๋Šฅ์ƒ ์ด์ **: + - \\(O(\log n)\\) ์‹œ๊ฐ„ ๋ณต์žก๋„ + - ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ + - ์ตœ์†Œํ•œ์˜ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ + - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ํšจ์œจ์  ํ™œ์šฉ + +TileTensor ๋ฒ„์ „์€ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์˜ ํšจ์œจ์€ ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•˜๋ฉด์„œ, ์—ฌ๊ธฐ์— ๋”ํ•ด: + +- ํƒ€์ž… ์•ˆ์ „์„ฑ์ด ํ•œ์ธต ๊ฐ•ํ™”๋˜๊ณ  +- ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๊ฐ€ ๋” ๊น”๋”ํ•ด์ง€๋ฉฐ +- ๋ ˆ์ด์•„์›ƒ์„ ์ž๋™์œผ๋กœ ์ธ์‹ํ•˜๊ณ  +- ์ธ๋ฑ์‹ฑ ๋ฌธ๋ฒ•๋„ ์ž์—ฐ์Šค๋Ÿฌ์›Œ์ง‘๋‹ˆ๋‹ค + +### ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”์˜ ์ค‘์š”์„ฑ + +๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ์‚ฌ์ด์˜ `barrier()`๋Š” ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. 
๊ทธ ์ด์œ ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค: + +`barrier()`๊ฐ€ ์—†์œผ๋ฉด ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค: + +```text +์ดˆ๊ธฐ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: [0 1 4 9 16 25 36 49] + +Step 1 (stride = 4): +Thread 0 ์ฝ๊ธฐ: shared[0] = 0, shared[4] = 16 +Thread 1 ์ฝ๊ธฐ: shared[1] = 1, shared[5] = 25 +Thread 2 ์ฝ๊ธฐ: shared[2] = 4, shared[6] = 36 +Thread 3 ์ฝ๊ธฐ: shared[3] = 9, shared[7] = 49 + +barrier ์—†์ด: +- Thread 0 ์“ฐ๊ธฐ: shared[0] = 0 + 16 = 16 +- Thread 1์ด Thread 0๋ณด๋‹ค ๋จผ์ € ๋‹ค์Œ ๋‹จ๊ณ„(stride = 2)๋กœ ๋„˜์–ด๊ฐ€์„œ + 16์ด ์•„๋‹Œ ์ด์ „ ๊ฐ’ shared[0] = 0์„ ์ฝ์–ด๋ฒ„๋ฆฝ๋‹ˆ๋‹ค! +``` + +`barrier()`๊ฐ€ ์žˆ์œผ๋ฉด: + +```text +Step 1 (stride = 4): +๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ•ฉ์„ ๊ธฐ๋ก: +[16 26 40 58 16 25 36 49] +barrier()๊ฐ€ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์—๊ฒŒ ์ด ๊ฐ’๋“ค์ด ๋ณด์ด๋„๋ก ๋ณด์žฅ + +Step 2 (stride = 2): +์ด์ œ ์—…๋ฐ์ดํŠธ๋œ ๊ฐ’์„ ์•ˆ์ „ํ•˜๊ฒŒ ์ฝ์„ ์ˆ˜ ์žˆ์Œ: +Thread 0: shared[0] = 16 + 40 = 56 +Thread 1: shared[1] = 26 + 58 = 84 +``` + +`barrier()`๋Š” ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค: + +1. ํ˜„์žฌ ๋‹จ๊ณ„์˜ ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ๋๋‚œ ๋’ค์—์•ผ ๋‹ค์Œ์œผ๋กœ ๋„˜์–ด๊ฐ +2. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์‹  ๊ฐ’์„ ๋ณผ ์ˆ˜ ์žˆ์Œ +3. ์–ด๋–ค ์Šค๋ ˆ๋“œ๋„ ์•ž์„œ ๋‚˜๊ฐ€์ง€ ์•Š์Œ +4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•ญ์ƒ ์ผ๊ด€๋œ ์ƒํƒœ๋ฅผ ์œ ์ง€ + +์ด๋Ÿฐ ๋™๊ธฐํ™” ์ง€์ ์ด ์—†์œผ๋ฉด: + +- ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ  +- ์Šค๋ ˆ๋“œ๊ฐ€ ์ด๋ฏธ ์ง€๋‚œ ๊ฐ’์„ ์ฝ๊ฒŒ ๋˜๋ฉฐ +- ์‹คํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์ง€๊ณ  +- ์ตœ์ข… ํ•ฉ๊ณ„๊ฐ€ ํ‹€์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค + +
+
diff --git a/book/i18n/ko/src/puzzle_12/raw.md b/book/i18n/ko/src/puzzle_12/raw.md deleted file mode 100644 index ecf9a719..00000000 --- a/book/i18n/ko/src/puzzle_12/raw.md +++ /dev/null @@ -1,229 +0,0 @@ - - -## ๊ฐœ์š” - -๋ฒกํ„ฐ `a`์™€ ๋ฒกํ„ฐ `b`์˜ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜์—ฌ `output`(๋‹จ์ผ ๊ฐ’)์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. - -**์ฐธ๊ณ :** _๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ๋ธ”๋ก๋‹น ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค._ - -## ํ•ต์‹ฌ ๊ฐœ๋… - -์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ: - -- ์—ฌ๋Ÿฌ ๊ฐ’์„ ํ•˜๋‚˜๋กœ ํ•ฉ์น˜๋Š” ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜(parallel reduction) ๊ตฌํ˜„ํ•˜๊ธฐ -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ค‘๊ฐ„ ๊ฒฐ๊ณผ ์ €์žฅํ•˜๊ธฐ -- ์Šค๋ ˆ๋“œ๋ผ๋ฆฌ ํ˜‘๋ ฅํ•˜์—ฌ ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ ๋งŒ๋“ค๊ธฐ - -ํ•ต์‹ฌ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์™€ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์„ ํ™œ์šฉํ•ด, ํฉ์–ด์ ธ ์žˆ๋Š” ๊ฐ’๋“ค์„ ํšจ์œจ์ ์œผ๋กœ ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ๋กœ ๋ชจ์•„๊ฐ€๋Š” ๊ณผ์ •์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. - -## ๊ตฌ์„ฑ - -- ๋ฒกํ„ฐ ํฌ๊ธฐ: `SIZE = 8` -- ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: `TPB = 8` -- ๋ธ”๋ก ์ˆ˜: 1 -- ์ถœ๋ ฅ ํฌ๊ธฐ: 1 -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: `TPB`๊ฐœ - -์ฐธ๊ณ : - -- **์š”์†Œ ์ ‘๊ทผ**: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ `a`์™€ `b`์—์„œ ๋Œ€์‘ํ•˜๋Š” ์š”์†Œ๋ฅผ ์ฝ์Œ -- **๋ถ€๋ถ„ ๊ฒฐ๊ณผ**: ์ค‘๊ฐ„ ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๊ณ  ์ €์žฅ -- **์Šค๋ ˆ๋“œ ์กฐ์œจ**: ๊ฒฐ๊ณผ๋ฅผ ํ•ฉ์น˜๊ธฐ ์ „์— ๋™๊ธฐํ™” -- **์ตœ์ข… ๋ฆฌ๋•์…˜**: ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ์Šค์นผ๋ผ ์ถœ๋ ฅ์œผ๋กœ ๋ณ€ํ™˜ - -_์ฐธ๊ณ : ์ด ๋ฌธ์ œ์—์„œ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ ํšŸ์ˆ˜๋ฅผ ์‹ ๊ฒฝ ์“ธ ํ•„์š”๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ๊ทธ ๋ฌธ์ œ๋Š” ๋‚˜์ค‘์— ๋‹ค๋ฃจ๊ฒ ์Šต๋‹ˆ๋‹ค._ - -## ์™„์„ฑํ•  ์ฝ”๋“œ - -```mojo -{{#include ../../../../../problems/p12/p12.mojo:dot_product}} -``` - -์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p12/p12.mojo - -
-ํŒ - -
- -1. `shared[local_i]`์— `a[global_i] * b[global_i]`๋ฅผ ์ €์žฅ -2. `barrier()`๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ ๋™๊ธฐํ™” -3. ์Šค๋ ˆ๋“œ 0์ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ๋ชจ๋“  ๊ณฑ์„ ํ•ฉ์‚ฐ -4. ์ตœ์ข… ํ•ฉ๊ณ„๋ฅผ `output[0]`์— ๊ธฐ๋ก - -
-
- -## ์ฝ”๋“œ ์‹คํ–‰ - -์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”: - -
-
- - - - -
-
- -```bash -pixi run p12 -``` - -
-
- -```bash -pixi run -e amd p12 -``` - -
-
- -```bash -pixi run -e apple p12 -``` - -
-
- -```bash -uv run poe p12 -``` - -
-
- -ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: - -```txt -out: HostBuffer([0.0]) -expected: HostBuffer([140.0]) -``` - -## ์†”๋ฃจ์…˜ - -
- - -```mojo -{{#include ../../../../../solutions/p12/p12.mojo:dot_product_solution}} -``` - -
- -๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค: - -### 1๋‹จ๊ณ„: ์š”์†Œ๋ณ„ ๊ณฑ์…ˆ - -๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณฑ์…ˆ ํ•˜๋‚˜๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค: - -```txt -Thread i: shared[i] = a[i] * b[i] -``` - -### 2๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ - -ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๋ฅผ ๋งค ๋‹จ๊ณ„๋งˆ๋‹ค ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์ด๋Š” ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค: - -```txt -์ดˆ๊ธฐ๊ฐ’: [0*0 1*1 2*2 3*3 4*4 5*5 6*6 7*7] - = [0 1 4 9 16 25 36 49] - -Step 1: [0+16 1+25 4+36 9+49 16 25 36 49] - = [16 26 40 58 16 25 36 49] - -Step 2: [16+40 26+58 40 58 16 25 36 49] - = [56 84 40 58 16 25 36 49] - -Step 3: [56+84 84 40 58 16 25 36 49] - = [140 84 40 58 16 25 36 49] -``` - -### ๊ตฌํ˜„์˜ ํ•ต์‹ฌ ํŠน์ง• - -1. **๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด**: - - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ •ํ™•ํžˆ ๋‘ ๊ฐ’์„ ๋กœ๋“œ (`a[i]`, `b[i]`) - - ์ค‘๊ฐ„ ๊ฒฐ๊ณผ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ - - ์ตœ์ข… ๊ฒฐ๊ณผ๋Š” ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— 1ํšŒ ๊ธฐ๋ก - -2. **์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”**: - - ์ดˆ๊ธฐ ๊ณฑ์…ˆ ํ›„ `barrier()` - - ๊ฐ ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ํ›„ `barrier()` - - ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ๊ฐ„ ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€ - -3. **๋ฆฌ๋•์…˜ ๋กœ์ง**: - - ```mojo - stride = TPB // 2 - while stride > 0: - if local_i < stride: - shared[local_i] += shared[local_i + stride] - barrier() - stride //= 2 - ``` - - - ๋งค ๋‹จ๊ณ„๋งˆ๋‹ค stride๋ฅผ ์ ˆ๋ฐ˜์œผ๋กœ - - ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๋งŒ ๋ง์…ˆ ์ˆ˜ํ–‰ - - ์ž‘์—… ํšจ์œจ์„ฑ ์œ ์ง€ - -4. **์„ฑ๋Šฅ ๊ณ ๋ ค ์‚ฌํ•ญ**: - - \\(n\\)๊ฐœ ์š”์†Œ์— ๋Œ€ํ•ด \\(\log_2(n)\\) ๋‹จ๊ณ„ - - ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด - - ์ตœ์†Œํ•œ์˜ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ - - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ํšจ์œจ์  ํ™œ์šฉ - -์ด ๊ตฌํ˜„์€ ์ˆœ์ฐจ ์‹คํ–‰์˜ \\(O(n)\\)์— ๋น„ํ•ด \\(O(\log n)\\) ์‹œ๊ฐ„ ๋ณต์žก๋„๋ฅผ ๋‹ฌ์„ฑํ•˜๋ฉฐ, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์œ„๋ ฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. - -### ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”์˜ ์ค‘์š”์„ฑ - -๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ์‚ฌ์ด์˜ `barrier()`๋Š” ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. 
๊ทธ ์ด์œ ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค: - -`barrier()`๊ฐ€ ์—†์œผ๋ฉด ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค: - -```text -์ดˆ๊ธฐ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: [0 1 4 9 16 25 36 49] - -Step 1 (stride = 4): -Thread 0 ์ฝ๊ธฐ: shared[0] = 0, shared[4] = 16 -Thread 1 ์ฝ๊ธฐ: shared[1] = 1, shared[5] = 25 -Thread 2 ์ฝ๊ธฐ: shared[2] = 4, shared[6] = 36 -Thread 3 ์ฝ๊ธฐ: shared[3] = 9, shared[7] = 49 - -barrier ์—†์ด: -- Thread 0 ์“ฐ๊ธฐ: shared[0] = 0 + 16 = 16 -- Thread 1์ด Thread 0๋ณด๋‹ค ๋จผ์ € ๋‹ค์Œ ๋‹จ๊ณ„(stride = 2)๋กœ ๋„˜์–ด๊ฐ€์„œ - 16์ด ์•„๋‹Œ ์ด์ „ ๊ฐ’ shared[0] = 0์„ ์ฝ์–ด๋ฒ„๋ฆฝ๋‹ˆ๋‹ค! -``` - -`barrier()`๊ฐ€ ์žˆ์œผ๋ฉด: - -```text -Step 1 (stride = 4): -๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ•ฉ์„ ๊ธฐ๋ก: -[16 26 40 58 16 25 36 49] -barrier()๊ฐ€ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์—๊ฒŒ ์ด ๊ฐ’๋“ค์ด ๋ณด์ด๋„๋ก ๋ณด์žฅ - -Step 2 (stride = 2): -์ด์ œ ์—…๋ฐ์ดํŠธ๋œ ๊ฐ’์„ ์•ˆ์ „ํ•˜๊ฒŒ ์ฝ์„ ์ˆ˜ ์žˆ์Œ: -Thread 0: shared[0] = 16 + 40 = 56 -Thread 1: shared[1] = 26 + 58 = 84 -``` - -`barrier()`๋Š” ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค: - -1. ํ˜„์žฌ ๋‹จ๊ณ„์˜ ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ๋๋‚œ ๋’ค์—์•ผ ๋‹ค์Œ์œผ๋กœ ๋„˜์–ด๊ฐ -2. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์‹  ๊ฐ’์„ ๋ณผ ์ˆ˜ ์žˆ์Œ -3. ์–ด๋–ค ์Šค๋ ˆ๋“œ๋„ ์•ž์„œ ๋‚˜๊ฐ€์ง€ ์•Š์Œ -4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•ญ์ƒ ์ผ๊ด€๋œ ์ƒํƒœ๋ฅผ ์œ ์ง€ - -์ด๋Ÿฐ ๋™๊ธฐํ™” ์ง€์ ์ด ์—†์œผ๋ฉด: - -- ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ  -- ์Šค๋ ˆ๋“œ๊ฐ€ ์ด๋ฏธ ์ง€๋‚œ ๊ฐ’์„ ์ฝ๊ฒŒ ๋˜๋ฉฐ -- ์‹คํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์ง€๊ณ  -- ์ตœ์ข… ํ•ฉ๊ณ„๊ฐ€ ํ‹€์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค - -
-
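The tree-based reduction walked through in the removed solution text above can be sketched on the CPU in plain Python. This is a sequential simulation, not the Mojo kernel: the inner `for local_i` loop stands in for the threads that are active at a given stride, and finishing the loop before halving the stride plays the role of `barrier()`. `TPB` and the input values mirror the walkthrough's example.

```python
TPB = 8  # threads per block, as in the walkthrough

def dot_product_reduction(a, b):
    # Step 1: element-wise products, as if thread i wrote shared[i].
    shared = [a[i] * b[i] for i in range(TPB)]
    # Step 2: tree reduction; halve the number of active "threads"
    # each round. Completing the loop before shrinking the stride is
    # the sequential analogue of barrier() between reduction steps.
    stride = TPB // 2
    while stride > 0:
        for local_i in range(stride):
            shared[local_i] += shared[local_i + stride]
        stride //= 2
    return shared[0]

vals = list(range(TPB))
print(dot_product_reduction(vals, vals))  # 140, matching the walkthrough
```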
diff --git a/book/i18n/ko/src/puzzle_13/block_boundary.md b/book/i18n/ko/src/puzzle_13/block_boundary.md index 6ab3d2db..4ed215ce 100644 --- a/book/i18n/ko/src/puzzle_13/block_boundary.md +++ b/book/i18n/ko/src/puzzle_13/block_boundary.md @@ -2,7 +2,7 @@ # ๋ธ”๋ก ๊ฒฝ๊ณ„ ๋ฒ„์ „ -1D LayoutTensor `a`์™€ 1D LayoutTensor `b`์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D LayoutTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. +1D TileTensor `a`์™€ 1D TileTensor `b`์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D TileTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. **์ฐธ๊ณ :** _์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค._ @@ -34,7 +34,7 @@
-1. `LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น +1. `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + CONV_2 - 1]())`์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น 2. ๋ฉ”์ธ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: `shared_a[local_i] = a[global_i]` 3. ๊ฒฝ๊ณ„ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: `if local_i < CONV_2 - 1`์ผ ๋•Œ ๋‹ค์Œ ๋ธ”๋ก์˜ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ 4. ์ปค๋„ ๋กœ๋“œ: `shared_b[local_i] = b[local_i]` @@ -127,8 +127,8 @@ Block 1 ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: [8 9 10 11 12 13 14 0|0 0 0] // ๋‘ ๋ฒˆ์งธ ๋ธ”๋ก. ```mojo # ํ•ฉ์„ฑ๊ณฑ ์œˆ๋„์šฐ์— ํ•„์š”ํ•œ ํŒจ๋”ฉ์„ ๋จผ์ € ๊ณ ๋ ค - shared_a = LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() - shared_b = LayoutTensor[dtype, Layout.row_major(CONV_2), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() + shared_a = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + CONV_2 - 1]()) + shared_b = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[CONV_2]()) ``` ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๋ธ”๋ก ๋ฐ์ดํ„ฐ์™€ ๊ฒน์นจ ์˜์—ญ์„ ๋ชจ๋‘ ๋‹ด๊ธฐ์— ์ถฉ๋ถ„ํ•œ ๊ณต๊ฐ„์ด ํ™•๋ณด๋ฉ๋‹ˆ๋‹ค. 
@@ -242,7 +242,7 @@ Thread 4๋ถ€ํ„ฐ๋Š” `global_i + j < SIZE_2`๊ฐ€ `False`๊ฐ€ ๋˜์–ด ํ•ด๋‹น ๋ฐ˜๋ณต์„ - ์ ์ ˆํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํ†ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ - ์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ํ†ตํ•œ ๋†’์€ ์„ฑ๋Šฅ -- LayoutTensor ์ถ”์ƒํ™”๋ฅผ ํ™œ์šฉํ•œ ๊น”๋”ํ•œ ์ฝ”๋“œ ๊ตฌ์กฐ +- TileTensor ์ถ”์ƒํ™”๋ฅผ ํ™œ์šฉํ•œ ๊น”๋”ํ•œ ์ฝ”๋“œ ๊ตฌ์กฐ - ์ตœ์†Œํ•œ์˜ ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ - ์ˆ˜ํ•™์ ์œผ๋กœ ๊ฑด์ „ํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ diff --git a/book/i18n/ko/src/puzzle_13/puzzle_13.md b/book/i18n/ko/src/puzzle_13/puzzle_13.md index 9f74646c..34f50da0 100644 --- a/book/i18n/ko/src/puzzle_13/puzzle_13.md +++ b/book/i18n/ko/src/puzzle_13/puzzle_13.md @@ -2,14 +2,14 @@ # Puzzle 13: 1D ํ•ฉ์„ฑ๊ณฑ -> ## LayoutTensor๋กœ ์ „ํ™˜ํ•˜๊ธฐ +> ## TileTensor๋กœ ์ „ํ™˜ํ•˜๊ธฐ > > ์ง€๊ธˆ๊นŒ์ง€ GPU ํผ์ฆ ์—ฌ์ •์—์„œ GPU ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ์— ๋Œ€ํ•œ ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ ๋ฐฉ์‹์„ ํ•จ๊ป˜ ์‚ดํŽด๋ณด์•˜์Šต๋‹ˆ๋‹ค: > > 1. [UnsafePointer](https://docs.modular.com/mojo/std/memory/unsafe_pointer/UnsafePointer/)๋ฅผ ์‚ฌ์šฉํ•œ ํฌ์ธํ„ฐ ์ง์ ‘ ์กฐ์ž‘ ๋ฐฉ์‹์˜ raw ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ -> 2. ๊ฐ•๋ ฅํ•œ `address_space` ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹นํ•˜๋Š”, ๋ณด๋‹ค ๊ตฌ์กฐํ™”๋œ [LayoutTensor](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/) +> 2. ๊ฐ•๋ ฅํ•œ `address_space` ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹นํ•˜๋Š”, ๋ณด๋‹ค ๊ตฌ์กฐํ™”๋œ [TileTensor](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/) > -> ์ด ํผ์ฆ๋ถ€ํ„ฐ๋Š” `LayoutTensor`๋กœ ์™„์ „ํžˆ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ์ถ”์ƒํ™”๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค: +> ์ด ํผ์ฆ๋ถ€ํ„ฐ๋Š” `TileTensor`๋กœ ์™„์ „ํžˆ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ์ถ”์ƒํ™”๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค: > > - ํƒ€์ž… ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด > - ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ์˜ ๋ช…ํ™•ํ•œ ํ‘œํ˜„ @@ -24,7 +24,7 @@ ์‹ ํ˜ธ ์ฒ˜๋ฆฌ์™€ ์ด๋ฏธ์ง€ ๋ถ„์„์—์„œ ํ•ฉ์„ฑ๊ณฑ(convolution)์€ ๋‘ ์‹œํ€€์Šค๋ฅผ ๊ฒฐํ•ฉํ•ด ์ƒˆ๋กœ์šด ์‹œํ€€์Šค๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š” ํ•ต์‹ฌ ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค. 
์ด ํผ์ฆ์—์„œ๋Š” ์ž…๋ ฅ ๋ฐฐ์—ด ์œ„๋กœ ์ปค๋„์„ ์Šฌ๋ผ์ด๋”ฉํ•˜๋ฉด์„œ ๊ฐ ์ถœ๋ ฅ ์›์†Œ๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ์„ GPU์—์„œ ๊ตฌํ˜„ํ•ด ๋ด…๋‹ˆ๋‹ค. -`LayoutTensor` ์ถ”์ƒํ™”๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฒกํ„ฐ `a`์™€ ๋ฒกํ„ฐ `b`์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ฒฐ๊ณผ๋ฅผ `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. +`TileTensor` ์ถ”์ƒํ™”๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฒกํ„ฐ `a`์™€ ๋ฒกํ„ฐ `b`์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ฒฐ๊ณผ๋ฅผ `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. **์ฐธ๊ณ :** _์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค._ @@ -47,9 +47,9 @@ for i in range(SIZE): ์ด ํผ์ฆ์€ ๋‹จ๊ณ„์ ์œผ๋กœ ์ดํ•ด๋ฅผ ์Œ“์•„๊ฐˆ ์ˆ˜ ์žˆ๋„๋ก ๋‘ ํŒŒํŠธ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค: - [๐Ÿ”ฐ ๊ธฐ๋ณธ ๋ฒ„์ „](./simple.md) - ์—ฌ๊ธฐ์„œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์„ธ์š”. ๋‹จ์ผ ๋ธ”๋ก์—์„œ LayoutTensor์™€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ํ•ฉ์„ฑ๊ณฑ ๊ตฌํ˜„์˜ ๊ธฐ์ดˆ๋ฅผ ์ตํž™๋‹ˆ๋‹ค. + ์—ฌ๊ธฐ์„œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์„ธ์š”. ๋‹จ์ผ ๋ธ”๋ก์—์„œ TileTensor์™€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ํ•ฉ์„ฑ๊ณฑ ๊ตฌํ˜„์˜ ๊ธฐ์ดˆ๋ฅผ ์ตํž™๋‹ˆ๋‹ค. - [โญ ๋ธ”๋ก ๊ฒฝ๊ณ„ ๋ฒ„์ „](./block_boundary.md) - ์ด์–ด์„œ ๋ธ”๋ก ๊ฒฝ๊ณ„๋ฅผ ๋„˜์–ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ ํ•ด์•ผ ํ•˜๋Š” ๋” ๊นŒ๋‹ค๋กœ์šด ๊ฒฝ์šฐ์— ๋„์ „ํ•ฉ๋‹ˆ๋‹ค. LayoutTensor์˜ ๊ธฐ๋Šฅ์„ ๋ณธ๊ฒฉ์ ์œผ๋กœ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค. + ์ด์–ด์„œ ๋ธ”๋ก ๊ฒฝ๊ณ„๋ฅผ ๋„˜์–ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ ํ•ด์•ผ ํ•˜๋Š” ๋” ๊นŒ๋‹ค๋กœ์šด ๊ฒฝ์šฐ์— ๋„์ „ํ•ฉ๋‹ˆ๋‹ค. TileTensor์˜ ๊ธฐ๋Šฅ์„ ๋ณธ๊ฒฉ์ ์œผ๋กœ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋ฒ„์ „์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์Šค๋ ˆ๋“œ ๊ฐ„ ํ˜‘๋ ฅ ์ธก๋ฉด์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ๋„์ „ ๊ณผ์ œ๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ ๋ฒ„์ „์—์„œ ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ์˜ ์›๋ฆฌ๋ฅผ ์ตํžŒ ๋‹ค์Œ, ๋ธ”๋ก ๊ฒฝ๊ณ„ ๋ฒ„์ „์—์„œ๋Š” ์‹ค์ œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋งˆ์ฃผ์น˜๋Š” ๋ณต์žกํ•œ ์ƒํ™ฉ์„ ๋‹ค๋ฃจ๋Š” ๋Šฅ๋ ฅ์„ ์‹œํ—˜ํ•ด ๋ด…๋‹ˆ๋‹ค. 
diff --git a/book/i18n/ko/src/puzzle_13/simple.md b/book/i18n/ko/src/puzzle_13/simple.md index 0ef11d8d..d7b9303c 100644 --- a/book/i18n/ko/src/puzzle_13/simple.md +++ b/book/i18n/ko/src/puzzle_13/simple.md @@ -2,7 +2,7 @@ # ๋‹จ์ผ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•œ ๊ธฐ๋ณธ ๋ฒ„์ „ -1D LayoutTensor `a`์™€ 1D LayoutTensor `b`์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D LayoutTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. +1D TileTensor `a`์™€ 1D TileTensor `b`์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D TileTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. **์ฐธ๊ณ :** _์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค._ @@ -43,7 +43,7 @@
-1. `LayoutTensor[dtype, Layout.row_major(SIZE), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น +1. `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[SIZE]())`์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น 2. ์ž…๋ ฅ์„ `shared_a[local_i]`์—, ์ปค๋„์„ `shared_b[local_i]`์— ๋กœ๋“œ 3. ๋ฐ์ดํ„ฐ ๋กœ๋“œ ํ›„ `barrier()` ํ˜ธ์ถœ 4. ๊ฒฝ๊ณ„ ์•ˆ์—์„œ ๊ณฑ์„ ํ•ฉ์‚ฐ: `if local_i + j < SIZE` @@ -179,7 +179,7 @@ expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0]) - `var`์™€ `output.element_type`์œผ๋กœ ์ ์ ˆํ•œ ํƒ€์ž… ์ถ”๋ก  - `@parameter` ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋กœ ํ•ฉ์„ฑ๊ณฑ ๋ฃจํ”„๋ฅผ ์ปดํŒŒ์ผ ํƒ€์ž„์— ์ „๊ฐœ - ์—„๊ฒฉํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ ํ™•๋ณด - - LayoutTensor์˜ ํƒ€์ž… ์‹œ์Šคํ…œ์œผ๋กœ ์ฝ”๋“œ ์•ˆ์ „์„ฑ ํ–ฅ์ƒ + - TileTensor์˜ ํƒ€์ž… ์‹œ์Šคํ…œ์œผ๋กœ ์ฝ”๋“œ ์•ˆ์ „์„ฑ ํ–ฅ์ƒ 3. **๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ**: - ์ž…๋ ฅ ๋ฐฐ์—ด๊ณผ ์ปค๋„ ๋ชจ๋‘ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ diff --git a/book/i18n/ko/src/puzzle_14/complete.md b/book/i18n/ko/src/puzzle_14/complete.md index 773a0eee..64136f82 100644 --- a/book/i18n/ko/src/puzzle_14/complete.md +++ b/book/i18n/ko/src/puzzle_14/complete.md @@ -2,7 +2,7 @@ # ์™„์„ฑ ๋ฒ„์ „ -1D LayoutTensor `a`์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D LayoutTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. +1D TileTensor `a`์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D TileTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. **์ฐธ๊ณ :** _`a`์˜ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ๊ฒฝ์šฐ, ์˜ฌ๋ฐ”๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ์–ป์œผ๋ ค๋ฉด ์—ฌ๋Ÿฌ ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค._ diff --git a/book/i18n/ko/src/puzzle_14/puzzle_14.md b/book/i18n/ko/src/puzzle_14/puzzle_14.md index 23c2b1a3..cf13e437 100644 --- a/book/i18n/ko/src/puzzle_14/puzzle_14.md +++ b/book/i18n/ko/src/puzzle_14/puzzle_14.md @@ -6,7 +6,7 @@ ๋ˆ„์  ํ•ฉ(prefix sum, _scan_ ์ด๋ผ๊ณ ๋„ ํ•ฉ๋‹ˆ๋‹ค)์€ ์‹œํ€€์Šค์˜ ๊ฐ’์„ ์ฐจ๋ก€๋กœ ๋”ํ•ด ๋‚˜๊ฐ€๋Š” ๊ธฐ๋ณธ์ ์ธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. 
์ •๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋ถ€ํ„ฐ ๊ณผํ•™ ์‹œ๋ฎฌ๋ ˆ์ด์…˜๊นŒ์ง€ ์ˆ˜๋งŽ์€ ๋ณ‘๋ ฌ ์‘์šฉ์˜ ํ•ต์‹ฌ์— ์ž๋ฆฌํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์ˆซ์ž ์‹œํ€€์Šค๋ฅผ ๋ˆ„์  ํ•ฉ๊ณ„๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค. ์ˆœ์ฐจ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๊ธฐ๋Š” ๊ฐ„๋‹จํ•˜์ง€๋งŒ, GPU์—์„œ ํšจ์œจ์ ์œผ๋กœ ๋งŒ๋“ค๋ ค๋ฉด ๊ธฐ๋ฐœํ•œ ๋ณ‘๋ ฌ์  ์‚ฌ๊ณ ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค! -1D LayoutTensor `a`์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D LayoutTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. +1D TileTensor `a`์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D TileTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. **์ฐธ๊ณ :** _`a`์˜ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ๊ฒฝ์šฐ, ๊ฐ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋งŒ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค._ diff --git a/book/i18n/ko/src/puzzle_14/simple.md b/book/i18n/ko/src/puzzle_14/simple.md index 8dce3df9..9d67eba9 100644 --- a/book/i18n/ko/src/puzzle_14/simple.md +++ b/book/i18n/ko/src/puzzle_14/simple.md @@ -2,7 +2,7 @@ # ๊ธฐ๋ณธ ๋ฒ„์ „ -1D LayoutTensor `a`์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D LayoutTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. +1D TileTensor `a`์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D TileTensor `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. 
**์ฐธ๊ณ :** _`a`์˜ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ๊ฒฝ์šฐ, ๊ฐ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋งŒ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค._ @@ -15,11 +15,11 @@ ์ฐธ๊ณ : -- **๋ฐ์ดํ„ฐ ๋กœ๋”ฉ**: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ LayoutTensor ์ ‘๊ทผ์„ ํ†ตํ•ด ์›์†Œ ํ•˜๋‚˜๋ฅผ ๋กœ๋“œ -- **๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด**: address_space๋ฅผ ์ง€์ •ํ•œ LayoutTensor๋กœ ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅ +- **๋ฐ์ดํ„ฐ ๋กœ๋”ฉ**: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ TileTensor ์ ‘๊ทผ์„ ํ†ตํ•ด ์›์†Œ ํ•˜๋‚˜๋ฅผ ๋กœ๋“œ +- **๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด**: address_space๋ฅผ ์ง€์ •ํ•œ TileTensor๋กœ ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅ - **์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”**: ์—ฐ์‚ฐ ๋‹จ๊ณ„ ๊ฐ„ ์กฐ์œจ - **์ ‘๊ทผ ํŒจํ„ด**: ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ -- **ํƒ€์ž… ์•ˆ์ „์„ฑ**: LayoutTensor์˜ ํƒ€์ž… ์‹œ์Šคํ…œ ํ™œ์šฉ +- **ํƒ€์ž… ์•ˆ์ „์„ฑ**: TileTensor์˜ ํƒ€์ž… ์‹œ์Šคํ…œ ํ™œ์šฉ ## ์™„์„ฑํ•  ์ฝ”๋“œ diff --git a/book/i18n/ko/src/puzzle_15/puzzle_15.md b/book/i18n/ko/src/puzzle_15/puzzle_15.md index 309b002d..4420577f 100644 --- a/book/i18n/ko/src/puzzle_15/puzzle_15.md +++ b/book/i18n/ko/src/puzzle_15/puzzle_15.md @@ -4,7 +4,7 @@ ## ๊ฐœ์š” -2D ํ–‰๋ ฌ `a`์˜ ๊ฐ ํ–‰์— ๋Œ€ํ•ด ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ LayoutTensor๋ฅผ ์‚ฌ์šฉํ•ด `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. +2D ํ–‰๋ ฌ `a`์˜ ๊ฐ ํ–‰์— ๋Œ€ํ•ด ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ TileTensor๋ฅผ ์‚ฌ์šฉํ•ด `output`์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. ์ถ• ํ•ฉ๊ณ„ ์‹œ๊ฐํ™” ์ถ• ํ•ฉ๊ณ„ ์‹œ๊ฐํ™” @@ -13,12 +13,12 @@ ์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ: -- LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ํ–‰๋ ฌ ์ฐจ์› ๋ฐฉํ–ฅ์˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ +- TileTensor๋ฅผ ํ™œ์šฉํ•œ ํ–‰๋ ฌ ์ฐจ์› ๋ฐฉํ–ฅ์˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ - ๋ธ”๋ก ์ขŒํ‘œ๋ฅผ ์ด์šฉํ•œ ๋ฐ์ดํ„ฐ ๋ถ„ํ•  - ํšจ์œจ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด - ๋‹ค์ฐจ์› ํ…์„œ ๋ ˆ์ด์•„์›ƒ ๋‹ค๋ฃจ๊ธฐ -ํ•ต์‹ฌ์€ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ ํ–‰๋ ฌ์˜ ํ–‰์— ๋งคํ•‘ํ•˜๊ณ , LayoutTensor์˜ ์ฐจ์›๋ณ„ ์ธ๋ฑ์‹ฑ์„ ํ™œ์šฉํ•˜๋ฉด์„œ ๊ฐ ๋ธ”๋ก ๋‚ด์—์„œ ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. 
+ํ•ต์‹ฌ์€ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ ํ–‰๋ ฌ์˜ ํ–‰์— ๋งคํ•‘ํ•˜๊ณ , TileTensor์˜ ์ฐจ์›๋ณ„ ์ธ๋ฑ์‹ฑ์„ ํ™œ์šฉํ•˜๋ฉด์„œ ๊ฐ ๋ธ”๋ก ๋‚ด์—์„œ ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ## ๊ตฌ์„ฑ @@ -26,8 +26,8 @@ - ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \\(\\text{TPB} = 8\\) - ๊ทธ๋ฆฌ๋“œ ํฌ๊ธฐ: \\(1 \\times \\text{BATCH}\\) - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น \\(\\text{TPB}\\)๊ฐœ ์›์†Œ -- ์ž…๋ ฅ ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(BATCH, SIZE)` -- ์ถœ๋ ฅ ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(BATCH, 1)` +- ์ž…๋ ฅ ๋ ˆ์ด์•„์›ƒ: `row_major[BATCH, SIZE]()` +- ์ถœ๋ ฅ ๋ ˆ์ด์•„์›ƒ: `row_major[BATCH, 1]()` ํ–‰๋ ฌ ์‹œ๊ฐํ™”: @@ -118,12 +118,12 @@ expected: HostBuffer([15.0, 51.0, 87.0, 123.0])
-LayoutTensor๋ฅผ ํ™œ์šฉํ•ด 2D ํ–‰๋ ฌ์˜ ํ–‰ ๋ฐฉํ–ฅ ํ•ฉ๊ณ„๋ฅผ ๋ณ‘๋ ฌ๋กœ ๊ตฌํ•˜๋Š” ๋ฆฌ๋•์…˜ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค: +TileTensor๋ฅผ ํ™œ์šฉํ•ด 2D ํ–‰๋ ฌ์˜ ํ–‰ ๋ฐฉํ–ฅ ํ•ฉ๊ณ„๋ฅผ ๋ณ‘๋ ฌ๋กœ ๊ตฌํ•˜๋Š” ๋ฆฌ๋•์…˜ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค: ### ํ–‰๋ ฌ ๋ ˆ์ด์•„์›ƒ๊ณผ ๋ธ”๋ก ๋งคํ•‘ ```txt -Input Matrix (4ร—6) with LayoutTensor: Block Assignment: +Input Matrix (4ร—6) with TileTensor: Block Assignment: [[ a[0,0] a[0,1] a[0,2] a[0,3] a[0,4] a[0,5] ] โ†’ Block(0,0) [ a[1,0] a[1,1] a[1,2] a[1,3] a[1,4] a[1,5] ] โ†’ Block(0,1) [ a[2,0] a[2,1] a[2,2] a[2,3] a[2,4] a[2,5] ] โ†’ Block(0,2) @@ -158,9 +158,9 @@ Input Matrix (4ร—6) with LayoutTensor: Block Assignment: - ๊ฐ ๋ธ”๋ก์ด ํ•˜๋‚˜์˜ ํ–‰ ์ „์ฒด๋ฅผ ์ฒ˜๋ฆฌ 2. **๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด**: - - ์ž…๋ ฅ์— LayoutTensor 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: `a[batch, local_i]` + - ์ž…๋ ฅ์— TileTensor 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: `a[batch, local_i]` - ํšจ์œจ์ ์ธ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ - - ์ถœ๋ ฅ์— LayoutTensor 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: `output[batch, 0]` + - ์ถœ๋ ฅ์— TileTensor 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: `output[batch, 0]` 3. **๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๋กœ์ง**: @@ -198,7 +198,7 @@ Input Matrix (4ร—6) with LayoutTensor: Block Assignment: ### ์„ฑ๋Šฅ ์ตœ์ ํ™” 1. 
**๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ**: - - LayoutTensor๋ฅผ ํ†ตํ•œ ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ + - TileTensor๋ฅผ ํ†ตํ•œ ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ - ๋น ๋ฅธ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ - ํ–‰ ๊ฒฐ๊ณผ๋‹น ํ•œ ๋ฒˆ์˜ ์“ฐ๊ธฐ diff --git "a/book/i18n/ko/src/puzzle_16/na\303\257ve.md" "b/book/i18n/ko/src/puzzle_16/na\303\257ve.md" index 57be8804..6b16a048 100644 --- "a/book/i18n/ko/src/puzzle_16/na\303\257ve.md" +++ "b/book/i18n/ko/src/puzzle_16/na\303\257ve.md" @@ -26,9 +26,9 @@ ๋ ˆ์ด์•„์›ƒ ๊ตฌ์„ฑ: -- ์ž…๋ ฅ A: `Layout.row_major(SIZE, SIZE)` -- ์ž…๋ ฅ B: `Layout.row_major(SIZE, SIZE)` -- ์ถœ๋ ฅ: `Layout.row_major(SIZE, SIZE)` +- ์ž…๋ ฅ A: `row_major[SIZE, SIZE]()` +- ์ž…๋ ฅ B: `row_major[SIZE, SIZE]()` +- ์ถœ๋ ฅ: `row_major[SIZE, SIZE]()` ## ์™„์„ฑํ•  ์ฝ”๋“œ @@ -110,7 +110,7 @@ expected: HostBuffer([4.0, 6.0, 12.0, 22.0])
-LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ๊ธฐ๋ณธ ํ–‰๋ ฌ ๊ณฑ์…ˆ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค: +TileTensor๋ฅผ ํ™œ์šฉํ•œ ๊ธฐ๋ณธ ํ–‰๋ ฌ ๊ณฑ์…ˆ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค: ### ํ–‰๋ ฌ ๋ ˆ์ด์•„์›ƒ (2ร—2 ์˜ˆ์‹œ) diff --git a/book/i18n/ko/src/puzzle_16/shared_memory.md b/book/i18n/ko/src/puzzle_16/shared_memory.md index ae76bb3f..4fd595b6 100644 --- a/book/i18n/ko/src/puzzle_16/shared_memory.md +++ b/book/i18n/ko/src/puzzle_16/shared_memory.md @@ -10,13 +10,13 @@ ์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ: -- LayoutTensor๋ฅผ ์‚ฌ์šฉํ•œ ๋ธ”๋ก ๋กœ์ปฌ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ +- TileTensor๋ฅผ ์‚ฌ์šฉํ•œ ๋ธ”๋ก ๋กœ์ปฌ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ - ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ํŒจํ„ด - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์ ํ™” - 2D ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•œ ํ˜‘๋ ฅ์  ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ -- ํ–‰๋ ฌ ์—ฐ์‚ฐ์— LayoutTensor๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉํ•˜๊ธฐ +- ํ–‰๋ ฌ ์—ฐ์‚ฐ์— TileTensor๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉํ•˜๊ธฐ -ํ•ต์‹ฌ์€ LayoutTensor๋ฅผ ํ†ตํ•ด ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋น„์šฉ์ด ํฐ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. +ํ•ต์‹ฌ์€ TileTensor๋ฅผ ํ†ตํ•ด ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋น„์šฉ์ด ํฐ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ## ๊ตฌ์„ฑ @@ -26,15 +26,15 @@ ๋ ˆ์ด์•„์›ƒ ๊ตฌ์„ฑ: -- ์ž…๋ ฅ A: `Layout.row_major(SIZE, SIZE)` -- ์ž…๋ ฅ B: `Layout.row_major(SIZE, SIZE)` -- ์ถœ๋ ฅ: `Layout.row_major(SIZE, SIZE)` -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: `TPB ร— TPB` ํฌ๊ธฐ์˜ LayoutTensor 2๊ฐœ +- ์ž…๋ ฅ A: `row_major[SIZE, SIZE]()` +- ์ž…๋ ฅ B: `row_major[SIZE, SIZE]()` +- ์ถœ๋ ฅ: `row_major[SIZE, SIZE]()` +- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: `TPB ร— TPB` ํฌ๊ธฐ์˜ TileTensor 2๊ฐœ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ: ```txt -Global Memory (LayoutTensor): Shared Memory (LayoutTensor): +Global Memory (TileTensor): Shared Memory (TileTensor): A[i,j]: Direct access a_shared[local_row, local_col] B[i,j]: Direct access b_shared[local_row, local_col] ``` @@ -119,7 +119,7 @@ expected: HostBuffer([4.0, 6.0, 12.0, 22.0])
-LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ตฌํ˜„์€ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค: +TileTensor๋ฅผ ํ™œ์šฉํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ตฌํ˜„์€ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค: ### ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ @@ -140,9 +140,9 @@ Matrix B: b_shared: (๋น„์Šทํ•œ ๋ ˆ์ด์•„์›ƒ) 1. **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •**: ```mojo - # address_space๋ฅผ ์ง€์ •ํ•œ LayoutTensor๋กœ 2D ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ…์„œ ์ƒ์„ฑ - a_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() - b_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() + # address_space๋ฅผ ์ง€์ •ํ•œ TileTensor๋กœ 2D ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ…์„œ ์ƒ์„ฑ + a_shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB, TPB]()) + b_shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB, TPB]()) ``` 2. **์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ**: @@ -160,7 +160,7 @@ Matrix B: b_shared: (๋น„์Šทํ•œ ๋ ˆ์ด์•„์›ƒ) 3. **๋ฐ์ดํ„ฐ ๋กœ๋”ฉ**: ```mojo - # LayoutTensor ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œ + # TileTensor ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œ if row < size and col < size: a_shared[local_row, local_col] = a[row, col] b_shared[local_row, local_col] = b[row, col] @@ -219,13 +219,13 @@ Matrix B: b_shared: (๋น„์Šทํ•œ ๋ ˆ์ด์•„์›ƒ) ### ์ฃผ์š” ์–ธ์–ด ๊ธฐ๋Šฅ -1. **LayoutTensor์˜ ์žฅ์ **: +1. **TileTensor์˜ ์žฅ์ **: - ์ง์ ‘ 2D ์ธ๋ฑ์‹ฑ์œผ๋กœ ์ฝ”๋“œ ๋‹จ์ˆœํ™” - `element_type`์„ ํ†ตํ•œ ํƒ€์ž… ์•ˆ์ „์„ฑ - ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ 2. 
**๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น**: - - address_space๋ฅผ ์ง€์ •ํ•œ LayoutTensor๋กœ ๊ตฌ์กฐํ™”๋œ ํ• ๋‹น + - address_space๋ฅผ ์ง€์ •ํ•œ TileTensor๋กœ ๊ตฌ์กฐํ™”๋œ ํ• ๋‹น - ์ž…๋ ฅ ํ…์„œ์™€ ๋™์ผํ•œ ํ–‰ ์šฐ์„  ๋ ˆ์ด์•„์›ƒ - ํšจ์œจ์  ์ ‘๊ทผ์„ ์œ„ํ•œ ์ ์ ˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ @@ -255,7 +255,7 @@ Matrix B: b_shared: (๋น„์Šทํ•œ ๋ ˆ์ด์•„์›ƒ) - ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšŸ์ˆ˜ ๊ฐ์†Œ - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ -- LayoutTensor์˜ ํšจ์œจ์ ์ธ 2D ์ธ๋ฑ์‹ฑ ํ™œ์šฉ +- TileTensor์˜ ํšจ์œจ์ ์ธ 2D ์ธ๋ฑ์‹ฑ ํ™œ์šฉ - ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ์œ ์ง€
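Before the tiled version below, the blocking idea itself can be sketched sequentially in Python: the output is processed in `TPB × TPB` tiles, and each pass of the `bk` loop "stages" one tile of A and one of B, the way the kernel cooperatively copies into `a_shared`/`b_shared` between `barrier()` calls. The matrix sizes in the check are assumptions for illustration.

```python
TPB = 3  # tile edge, mirroring the block size

def tiled_matmul(a, b):
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, TPB):          # tile row (block_idx.y)
        for bj in range(0, n, TPB):      # tile col (block_idx.x)
            for bk in range(0, n, TPB):  # one k-tile = one shared-memory load phase
                # "Load" the two tiles; in the kernel this is the
                # cooperative copy into shared memory plus a barrier().
                a_tile = [row[bk:bk + TPB] for row in a[bi:bi + TPB]]
                b_tile = [row[bj:bj + TPB] for row in b[bk:bk + TPB]]
                # Accumulate this k-tile's partial products.
                for i in range(len(a_tile)):
                    for j in range(len(b_tile[0])):
                        for k in range(len(b_tile)):
                            out[bi + i][bj + j] += a_tile[i][k] * b_tile[k][j]
    return out

# Check against a plain matmul on an assumed 4×4 input (ragged last tile).
a = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
ref = [[sum(a[i][k] * a[k][j] for k in range(4)) for j in range(4)]
       for i in range(4)]
assert tiled_matmul(a, a) == ref
```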
diff --git a/book/i18n/ko/src/puzzle_16/tiled.md b/book/i18n/ko/src/puzzle_16/tiled.md index ec297bfc..88158781 100644 --- a/book/i18n/ko/src/puzzle_16/tiled.md +++ b/book/i18n/ko/src/puzzle_16/tiled.md @@ -4,28 +4,28 @@ ## ๊ฐœ์š” -LayoutTensor๋ฅผ ์‚ฌ์šฉํ•œ ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์œผ๋กœ ์ •๋ฐฉ ํ–‰๋ ฌ \\(A\\) ์™€ \\(B\\) ๋ฅผ ๊ณฑํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. ํฐ ํ–‰๋ ฌ์„ ์ž‘์€ ์กฐ๊ฐ(ํƒ€์ผ)์œผ๋กœ ๋‚˜๋ˆ„์–ด ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. +TileTensor๋ฅผ ์‚ฌ์šฉํ•œ ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์œผ๋กœ ์ •๋ฐฉ ํ–‰๋ ฌ \\(A\\) ์™€ \\(B\\) ๋ฅผ ๊ณฑํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. ํฐ ํ–‰๋ ฌ์„ ์ž‘์€ ์กฐ๊ฐ(ํƒ€์ผ)์œผ๋กœ ๋‚˜๋ˆ„์–ด ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ## ํ•ต์‹ฌ ๊ฐœ๋… -- LayoutTensor๋ฅผ ์‚ฌ์šฉํ•œ ํ–‰๋ ฌ ํƒ€์ผ๋ง์œผ๋กœ ํšจ์œจ์ ์ธ ์—ฐ์‚ฐ +- TileTensor๋ฅผ ์‚ฌ์šฉํ•œ ํ–‰๋ ฌ ํƒ€์ผ๋ง์œผ๋กœ ํšจ์œจ์ ์ธ ์—ฐ์‚ฐ - ์ ์ ˆํ•œ ๋ ˆ์ด์•„์›ƒ์„ ์‚ฌ์šฉํ•œ ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์œจ - TensorBuilder๋ฅผ ํ†ตํ•œ ํšจ์œจ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ -- LayoutTensor ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•œ ํƒ€์ผ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ +- TileTensor ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•œ ํƒ€์ผ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ## ๊ตฌ์„ฑ - ํ–‰๋ ฌ ํฌ๊ธฐ: \\(\\text{SIZE\_TILED} = 9\\) - ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \\(\\text{TPB} \times \\text{TPB} = 3 \times 3\\) - ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: \\(3 \times 3\\) ๋ธ”๋ก -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น \\(\\text{TPB} \times \\text{TPB}\\) LayoutTensor 2๊ฐœ +- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น \\(\\text{TPB} \times \\text{TPB}\\) TileTensor 2๊ฐœ ๋ ˆ์ด์•„์›ƒ ๊ตฌ์„ฑ: -- ์ž…๋ ฅ A: `Layout.row_major(SIZE_TILED, SIZE_TILED)` -- ์ž…๋ ฅ B: `Layout.row_major(SIZE_TILED, SIZE_TILED)` -- ์ถœ๋ ฅ: `Layout.row_major(SIZE_TILED, SIZE_TILED)` -- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TensorBuilder๋ฅผ ์‚ฌ์šฉํ•œ `TPB ร— TPB` LayoutTensor 2๊ฐœ +- ์ž…๋ ฅ A: `row_major[SIZE_TILED, SIZE_TILED]()` +- ์ž…๋ ฅ B: `row_major[SIZE_TILED, SIZE_TILED]()` +- ์ถœ๋ ฅ: `row_major[SIZE_TILED, SIZE_TILED]()` +- ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TensorBuilder๋ฅผ ์‚ฌ์šฉํ•œ `TPB ร— TPB` TileTensor 2๊ฐœ ## ํƒ€์ผ๋ง ์ „๋žต @@ -37,7 +37,7 @@ Grid Layout (3ร—3): Thread Layout per Block (3ร—3): 
[B10][B11][B12] [T10 T11 T12] [B20][B21][B22] [T20 T21 T22] -๊ฐ ๋ธ”๋ก์€ LayoutTensor ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ํƒ€์ผ์„ ์ฒ˜๋ฆฌ +๊ฐ ๋ธ”๋ก์€ TileTensor ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ํƒ€์ผ์„ ์ฒ˜๋ฆฌ ``` ### ํƒ€์ผ ์ฒ˜๋ฆฌ ๋‹จ๊ณ„ @@ -318,7 +318,7 @@ expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 41 ์ด ๊ตฌํ˜„์€ ๋‹ค์Œ์„ ํ†ตํ•ด ๋†’์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค: -- LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ +- TileTensor๋ฅผ ํ™œ์šฉํ•œ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ - ์ตœ์ ์˜ ํƒ€์ผ๋ง ์ „๋žต - ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” - ์„ธ์‹ฌํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ @@ -326,7 +326,7 @@ expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 41
-## ์†”๋ฃจ์…˜: ๊ด€์šฉ์  LayoutTensor ํƒ€์ผ๋ง +## ์†”๋ฃจ์…˜: ๊ด€์šฉ์  TileTensor ํƒ€์ผ๋ง
@@ -337,14 +337,14 @@ expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 41
-๊ด€์šฉ์  ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์€ Mojo์˜ LayoutTensor API์™€ ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ํ™œ์šฉํ•˜์—ฌ ๊น”๋”ํ•œ ๊ตฌํ˜„์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. +๊ด€์šฉ์  ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์€ Mojo์˜ TileTensor API์™€ ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ํ™œ์šฉํ•˜์—ฌ ๊น”๋”ํ•œ ๊ตฌํ˜„์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. **ํ•ต์‹ฌ ํฌ์ธํŠธ: ์ด ๊ตฌํ˜„์€ ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ๋ณ‘ํ•ฉ ๋กœ๋”ฉ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ‘œ์ค€ A ร— B ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.** **์ด ๊ตฌํ˜„์ด ํ•˜๋Š” ๊ฒƒ:** - **ํ–‰๋ ฌ ์—ฐ์‚ฐ**: ํ‘œ์ค€ \\(A \times B\\) ๊ณฑ์…ˆ (\\(A \times B^T\\) ๊ฐ€ ์•„๋‹˜) -- **๋กœ๋”ฉ ํŒจํ„ด**: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ `Layout.row_major(1, TPB)`๋กœ ๋ณ‘ํ•ฉ ์ ‘๊ทผ +- **๋กœ๋”ฉ ํŒจํ„ด**: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ `row_major[1, TPB]()`๋กœ ๋ณ‘ํ•ฉ ์ ‘๊ทผ - **์—ฐ์‚ฐ**: `acc += a_shared[local_row, k] * b_shared[k, local_col]` - **๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ**: ๋กœ๋”ฉ ์‹œ ์ „์น˜ ์—†์Œ - ๋‘ ํ–‰๋ ฌ์„ ๊ฐ™์€ ๋ฐฉํ–ฅ์œผ๋กœ ๋กœ๋“œ @@ -356,7 +356,7 @@ expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 41 \\((9 \times 9)\\) ํ–‰๋ ฌ ํฌ๊ธฐ์—์„œ๋Š” ์™„๋ฒฝํ•œ ํƒ€์ผ๋ง์ด ์ด๋ฃจ์–ด์ ธ ๋ชจ๋“  ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ๋ถˆํ•„์š”ํ•ฉ๋‹ˆ๋‹ค: -1. **LayoutTensor ํƒ€์ผ API** +1. **TileTensor ํƒ€์ผ API** ```mojo out_tile = output.tile[TPB, TPB](block_idx.y, block_idx.x) @@ -364,7 +364,7 @@ expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 41 b_tile = b.tile[TPB, TPB](idx, block_idx.x) ``` - ์ˆ˜๋™ ์ขŒํ‘œ ๊ณ„์‚ฐ ์—†์ด "(block_idx.y, block_idx.x) ์œ„์น˜์˜ ํƒ€์ผ์„ ๊ฐ€์ ธ์˜จ๋‹ค"๋ฅผ ์ง์ ‘ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ [๋ฌธ์„œ](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/#tile)๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”. + ์ˆ˜๋™ ์ขŒํ‘œ ๊ณ„์‚ฐ ์—†์ด "(block_idx.y, block_idx.x) ์œ„์น˜์˜ ํƒ€์ผ์„ ๊ฐ€์ ธ์˜จ๋‹ค"๋ฅผ ์ง์ ‘ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ [๋ฌธ์„œ](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/#tile)๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”. 2. **๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ** @@ -395,14 +395,14 @@ expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 41 3. 
**์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ ˆ์ด์•„์›ƒ** ```mojo - comptime load_a_layout = Layout.row_major(1, TPB) # ๋ณ‘ํ•ฉ ๋กœ๋”ฉ - comptime load_b_layout = Layout.row_major(1, TPB) # ๋ณ‘ํ•ฉ ๋กœ๋”ฉ + comptime load_a_layout = row_major[1, TPB]() # ๋ณ‘ํ•ฉ ๋กœ๋”ฉ + comptime load_b_layout = row_major[1, TPB]() # ๋ณ‘ํ•ฉ ๋กœ๋”ฉ # ์ฐธ๊ณ : ํ‘œ์ค€ A ร— B ๊ณฑ์…ˆ์—์„œ ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ๊ฐ™์€ ๋ ˆ์ด์•„์›ƒ์„ ์‚ฌ์šฉ ``` **ํ˜„์žฌ ๊ตฌํ˜„์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ถ„์„:** - ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ๋ณ‘ํ•ฉ ๋กœ๋”ฉ์„ ์œ„ํ•ด `Layout.row_major(1, TPB)`๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค: + ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ๋ณ‘ํ•ฉ ๋กœ๋”ฉ์„ ์œ„ํ•ด `row_major[1, TPB]()`๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค: - `load_a_layout`: ์Šค๋ ˆ๋“œ๋“ค์ด ํ˜‘๋ ฅํ•˜์—ฌ ํ–‰๋ ฌ A ํ–‰์˜ ์—ฐ์† ์›์†Œ๋ฅผ ๋กœ๋“œ - `load_b_layout`: ์Šค๋ ˆ๋“œ๋“ค์ด ํ˜‘๋ ฅํ•˜์—ฌ ํ–‰๋ ฌ B ํ–‰์˜ ์—ฐ์† ์›์†Œ๋ฅผ ๋กœ๋“œ - **ํ•ต์‹ฌ**: ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ์€ ๋ณต์‚ฌ ์‹œ ์Šค๋ ˆ๋“œ ๊ฐ„ ํ˜‘๋ ฅ ๋ฐฉ์‹์„ ๊ฒฐ์ •ํ•˜๋ฉฐ, ์ตœ์ข… ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ๊ณผ๋Š” ๋ณ„๊ฐœ์ž…๋‹ˆ๋‹ค @@ -424,11 +424,11 @@ expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 41 - Matrix A ํƒ€์ผ: ์Šค๋ ˆ๋“œ๋“ค์ด A[block_row, k], A[block_row, k+1], A[block_row, k+2]... ๋กœ๋“œ (์—ฐ์†) - Matrix B ํƒ€์ผ: ์Šค๋ ˆ๋“œ๋“ค์ด B[k, block_col], B[k, block_col+1], B[k, block_col+2]... ๋กœ๋“œ (์—ฐ์†) - Layout.row_major(1, TPB)๋กœ ๋‘ ํŒจํ„ด ๋ชจ๋‘ ๋ณ‘ํ•ฉ + row_major[1, TPB]()๋กœ ๋‘ ํŒจํ„ด ๋ชจ๋‘ ๋ณ‘ํ•ฉ ``` **์„ธ ๊ฐ€์ง€ ๋ณ„๊ฐœ์˜ ๋ฉ”๋ชจ๋ฆฌ ๊ณ ๋ ค์‚ฌํ•ญ:** - 1. **์ „์—ญโ†’๊ณต์œ  ๋ณ‘ํ•ฉ**: `Layout.row_major(1, TPB)`๋กœ ๋ณ‘ํ•ฉ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ณด์žฅ + 1. **์ „์—ญโ†’๊ณต์œ  ๋ณ‘ํ•ฉ**: `row_major[1, TPB]()`๋กœ ๋ณ‘ํ•ฉ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ณด์žฅ 2. **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ**: `a_shared[local_row, k] * b_shared[k, local_col]`๋กœ ๋ฑ…ํฌ ์ถฉ๋Œ ํšŒํ”ผ 3. 
**ํ–‰๋ ฌ ์—ฐ์‚ฐ**: ์—ฐ์‚ฐ ํŒจํ„ด์ด A ร— B๋ฅผ ๊ฒฐ์ • (A ร— B^T๊ฐ€ ์•„๋‹˜) @@ -467,7 +467,7 @@ expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 41 | ๊ธฐ๋Šฅ | ์ˆ˜๋™ Tiling | ๊ด€์šฉ์  Tiling | |---------|--------------|------------------| -| ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ | ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์žˆ๋Š” ์ง์ ‘ ์ธ๋ฑ์‹ฑ | LayoutTensor ํƒ€์ผ API | +| ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ | ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์žˆ๋Š” ์ง์ ‘ ์ธ๋ฑ์‹ฑ | TileTensor ํƒ€์ผ API | | ํƒ€์ผ ๋กœ๋”ฉ | ์›์†Œ๋ณ„ ๋ช…์‹œ์  ๋ณต์‚ฌ | ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„์˜ ๋ฒŒํฌ ์ „์†ก | | ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ | ์ˆ˜๋™ ์ดˆ๊ธฐํ™” (๋ฐฉ์–ด์ ) | ๋ณต์‚ฌ ํ•จ์ˆ˜๊ฐ€ ๊ด€๋ฆฌ | | ์ฝ”๋“œ ๋ณต์žก๋„ | ๋ช…์‹œ์  ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋‹ค์†Œ ์žฅํ™ฉ | ๊ณ ์ˆ˜์ค€ API๋กœ ๋” ๊ฐ„๊ฒฐ | @@ -483,7 +483,7 @@ expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 41 **ํ˜„์žฌ ๊ตฌํ˜„ ์š”์•ฝ:** -- ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ `Layout.row_major(1, TPB)` ์‚ฌ์šฉ +- ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ `row_major[1, TPB]()` ์‚ฌ์šฉ - ํ‘œ์ค€ A ร— B ๊ณฑ์…ˆ ์ˆ˜ํ–‰ - ๋ณต์‚ฌ ์ค‘ ๋ฐ์ดํ„ฐ ์ „์น˜ ์—†์Œ @@ -494,8 +494,8 @@ expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 41 ```mojo # ์˜ˆ์‹œ: A ร— B๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์ „ ์ „์น˜๋œ ํ–‰๋ ฌ B^T๋ฅผ ๋กœ๋“œ # (ํ˜„์žฌ ๊ตฌํ˜„์—์„œ๋Š” ์ด๋ ‡๊ฒŒ ํ•˜์ง€ ์•Š์Œ) -comptime load_b_layout = Layout.row_major(TPB, 1) # B^T๋ฅผ ๋ณ‘ํ•ฉ ์ ‘๊ทผ์œผ๋กœ ๋กœ๋“œ -comptime store_b_layout = Layout.row_major(1, TPB) # ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— B๋กœ ์ €์žฅ +comptime load_b_layout = row_major[TPB, 1]() # B^T๋ฅผ ๋ณ‘ํ•ฉ ์ ‘๊ทผ์œผ๋กœ ๋กœ๋“œ +comptime store_b_layout = row_major[1, TPB]() # ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— B๋กœ ์ €์žฅ copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile) ``` @@ -508,7 +508,7 @@ copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store **ํ•ต์‹ฌ ๊ตฌ๋ถ„:** -- **ํ˜„์žฌ ๊ตฌํ˜„**: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ํ‘œ์ค€ \\(A \times B\\) ๊ณฑ์…ˆ์— `Layout.row_major(1, TPB)` ์‚ฌ์šฉ +- **ํ˜„์žฌ ๊ตฌํ˜„**: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ํ‘œ์ค€ \\(A \times B\\) ๊ณฑ์…ˆ์— 
`row_major[1, TPB]()` ์‚ฌ์šฉ - **์ „์น˜ ๋กœ๋”ฉ ์˜ˆ์‹œ**: ์ด๋ฏธ ์ „์น˜๋œ ๋ฐ์ดํ„ฐ๋‚˜ ๋‹ค๋ฅธ ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์ฒ˜๋ฆฌํ•  ๋•Œ ๋‹ค๋ฅธ ๋ ˆ์ด์•„์›ƒ ์‚ฌ์šฉ ์ด๊ฒƒ์€ Mojo์˜ ์ฒ ํ•™์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: ์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ์— ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ๋„, ํ•„์š”ํ•  ๋•Œ ์ €์ˆ˜์ค€ ์ œ์–ด๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. @@ -520,13 +520,13 @@ copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store **๊ด€์šฉ์  ํƒ€์ผ๋ง ๊ตฌํ˜„์ด ์‹ค์ œ๋กœ ํ•˜๋Š” ๊ฒƒ:** 1. **ํ–‰๋ ฌ ์—ฐ์‚ฐ**: ํ‘œ์ค€ A ร— B ๊ณฑ์…ˆ -2. **๋ฉ”๋ชจ๋ฆฌ ๋กœ๋”ฉ**: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ `Layout.row_major(1, TPB)`๋กœ ๋ณ‘ํ•ฉ ์ ‘๊ทผ +2. **๋ฉ”๋ชจ๋ฆฌ ๋กœ๋”ฉ**: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ `row_major[1, TPB]()`๋กœ ๋ณ‘ํ•ฉ ์ ‘๊ทผ 3. **์—ฐ์‚ฐ ํŒจํ„ด**: `acc += a_shared[local_row, k] * b_shared[k, local_col]` 4. **๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ**: ๋กœ๋”ฉ ์‹œ ์ „์น˜ ์—†์Œ **์ด๊ฒƒ์ด ์ตœ์ ์ธ ์ด์œ :** -- **๋ณ‘ํ•ฉ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ**: `Layout.row_major(1, TPB)`๋กœ ํšจ์œจ์ ์ธ ๋กœ๋”ฉ ๋ณด์žฅ +- **๋ณ‘ํ•ฉ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ**: `row_major[1, TPB]()`๋กœ ํšจ์œจ์ ์ธ ๋กœ๋”ฉ ๋ณด์žฅ - **๋ฑ…ํฌ ์ถฉ๋Œ ํšŒํ”ผ**: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์ถฉ๋Œ์„ ๋ฐฉ์ง€ - **ํ‘œ์ค€ ์•Œ๊ณ ๋ฆฌ์ฆ˜**: ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ธ ํ–‰๋ ฌ ๊ณฑ์…ˆ ํŒจํ„ด์„ ๊ตฌํ˜„ diff --git a/book/i18n/ko/src/puzzle_17/puzzle_17.md b/book/i18n/ko/src/puzzle_17/puzzle_17.md index 6095c119..d3d6b0cd 100644 --- a/book/i18n/ko/src/puzzle_17/puzzle_17.md +++ b/book/i18n/ko/src/puzzle_17/puzzle_17.md @@ -143,7 +143,7 @@ Verification passed: Custom kernel results match NumPy calculation 3. **์ปค์Šคํ…€ op ๋“ฑ๋ก**: - `@compiler.register("conv1d")` ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๊ฐ€ ์—ฐ์‚ฐ์„ MAX ๊ทธ๋ž˜ํ”„์— ๋…ธ์ถœ. 
[@compiler.register](https://docs.modular.com/mojo/manual/decorators/compiler-register/) ์ฐธ๊ณ  - `execute` ๋ฉ”์„œ๋“œ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ธํ„ฐํŽ˜์ด์Šค(์ž…๋ ฅ, ์ถœ๋ ฅ, ์ปจํ…์ŠคํŠธ) ์ •์˜ - - ์ž…์ถœ๋ ฅ ํ…์„œ๊ฐ€ ์ปค๋„์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก LayoutTensor๋กœ ๋ณ€ํ™˜ + - ์ž…์ถœ๋ ฅ ํ…์„œ๊ฐ€ ์ปค๋„์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก TileTensor๋กœ ๋ณ€ํ™˜ - Device context๊ฐ€ GPU ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น๊ณผ ์ปค๋„ ์‹คํ–‰ ๊ด€๋ฆฌ 4. **์ปค๋„ ์‹คํ–‰**: @@ -182,7 +182,7 @@ Verification passed: Custom kernel results match NumPy calculation kernel_tensor = kernel.to_layout_tensor() ``` - - MAX ๊ทธ๋ž˜ํ”„ ํ…์„œ๋ฅผ Mojo LayoutTensor๋กœ ๋ณ€ํ™˜ + - MAX ๊ทธ๋ž˜ํ”„ ํ…์„œ๋ฅผ Mojo TileTensor๋กœ ๋ณ€ํ™˜ - ์ปค๋„์ด ํ…์„œ๋ฅผ ์ง์ ‘ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์คŒ - ์ปดํŒŒ์ผ ํƒ€์ž„ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด ๋ ˆ์ด์•„์›ƒ ์ถ”์ถœ diff --git a/book/i18n/ko/src/puzzle_18/puzzle_18.md b/book/i18n/ko/src/puzzle_18/puzzle_18.md index 3e5d71f3..4b9fa15b 100644 --- a/book/i18n/ko/src/puzzle_18/puzzle_18.md +++ b/book/i18n/ko/src/puzzle_18/puzzle_18.md @@ -44,8 +44,8 @@ GPU ๊ตฌํ˜„์—์„œ๋Š” ์ตœ๋Œ“๊ฐ’ ์ฐพ๊ธฐ์™€ ์ง€์ˆ˜ ํ•ฉ ๊ณ„์‚ฐ ๋ชจ๋‘์— ๋ณ‘๋ ฌ ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์„ค์ •: -- ์ž…๋ ฅ ํ…์„œ: `Layout.row_major(SIZE)` -- ์ถœ๋ ฅ ํ…์„œ: `Layout.row_major(SIZE)` +- ์ž…๋ ฅ ํ…์„œ: `row_major[SIZE]()` +- ์ถœ๋ ฅ ํ…์„œ: `row_major[SIZE]()` - ์ปค์Šคํ…€ op ํŒŒ๋ผ๋ฏธํ„ฐ: `{"input_size": input_tensor.shape[0]}` ์ด ํผ์ฆ์˜ ํ•ต์‹ฌ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: @@ -259,8 +259,8 @@ def softmax_gpu_kernel[ input_size: Int, dtype: DType = DType.float32, ]( - output: LayoutTensor[mut=True, dtype, layout], - input: LayoutTensor[mut=False, dtype, layout], + output: TileTensor[mut=True, dtype, layout], + input: TileTensor[mut=False, dtype, layout], ) ``` @@ -275,8 +275,8 @@ def softmax_gpu_kernel[ #### ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ```mojo -shared_max = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() -shared_sum = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), 
MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() +shared_max = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[BLOCK_DIM_X]()) +shared_sum = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[BLOCK_DIM_X]()) ``` ์ปค๋„์€ ๋‘ ๊ฐœ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„ํผ๋ฅผ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค: diff --git a/book/i18n/ko/src/puzzle_19/puzzle_19.md b/book/i18n/ko/src/puzzle_19/puzzle_19.md index 9957a39c..1673b32e 100644 --- a/book/i18n/ko/src/puzzle_19/puzzle_19.md +++ b/book/i18n/ko/src/puzzle_19/puzzle_19.md @@ -83,10 +83,10 @@ GPU ๊ตฌํ˜„์€ **์ด์ „ ํผ์ฆ์—์„œ ์ตœ์ ํ™”๋œ ์ปค๋„๋“ค์„ ์žฌ์‚ฌ์šฉํ•˜๊ณ  ๋ ˆ์ด์•„์›ƒ ์„ค์ •: -- ์ฟผ๋ฆฌ ํ…์„œ: `Layout.row_major(d)` -- ํ‚ค ํ…์„œ: `Layout.row_major(seq_len, d)` -- ๊ฐ’ ํ…์„œ: `Layout.row_major(seq_len, d)` -- ์ถœ๋ ฅ ํ…์„œ: `Layout.row_major(d)` +- ์ฟผ๋ฆฌ ํ…์„œ: `row_major[d]()` +- ํ‚ค ํ…์„œ: `row_major[seq_len, d]()` +- ๊ฐ’ ํ…์„œ: `row_major[seq_len, d]()` +- ์ถœ๋ ฅ ํ…์„œ: `row_major[d]()` - ์ปค์Šคํ…€ op ํŒŒ๋ผ๋ฏธํ„ฐ: `{"seq_len": seq_len, "d": d, "dtype": dtype}` ์ด ํผ์ฆ์˜ ํ•ต์‹ฌ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: @@ -123,7 +123,7 @@ GPU ๊ตฌํ˜„์€ **์ด์ „ ํผ์ฆ์—์„œ ์ตœ์ ํ™”๋œ ์ปค๋„๋“ค์„ ์žฌ์‚ฌ์šฉํ•˜๊ณ  **์ „์น˜ ์ปค๋„ ๊ตฌํ˜„ ๊ฐ€์ด๋“œ:** -1. **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •**: `LayoutTensor[dtype, Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`์„ ์‚ฌ์šฉํ•˜์—ฌ `TRANSPOSE_BLOCK_DIM_XY` ร— `TRANSPOSE_BLOCK_DIM_XY` ํฌ๊ธฐ์˜ ์ •์‚ฌ๊ฐํ˜• ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์Šค๋ ˆ๋“œ ๊ฐ„ ํšจ์œจ์ ์ธ ๋ฐ์ดํ„ฐ ๊ตํ™˜์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. +1. **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •**: `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY]())`์„ ์‚ฌ์šฉํ•˜์—ฌ `TRANSPOSE_BLOCK_DIM_XY` ร— `TRANSPOSE_BLOCK_DIM_XY` ํฌ๊ธฐ์˜ ์ •์‚ฌ๊ฐํ˜• ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. 
์ด๋ฅผ ํ†ตํ•ด ์Šค๋ ˆ๋“œ ๊ฐ„ ํšจ์œจ์ ์ธ ๋ฐ์ดํ„ฐ ๊ตํ™˜์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. 2. **์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ**: ์Šค๋ ˆ๋“œ๋ฅผ ํ–‰๋ ฌ ์š”์†Œ์— ๋งคํ•‘ํ•ฉ๋‹ˆ๋‹ค: - `local_row = thread_idx.y`, `local_col = thread_idx.x` (๋ธ”๋ก ๋‚ด ์œ„์น˜) diff --git a/book/i18n/ko/src/puzzle_23/elementwise.md b/book/i18n/ko/src/puzzle_23/elementwise.md index 14c88253..ec223236 100644 --- a/book/i18n/ko/src/puzzle_23/elementwise.md +++ b/book/i18n/ko/src/puzzle_23/elementwise.md @@ -12,7 +12,7 @@ - `elementwise`๋ฅผ ํ™œ์šฉํ•œ **ํ•จ์ˆ˜ํ˜• GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ** - GPU ์Šค๋ ˆ๋“œ ๋‚ด์˜ **์ž๋™ SIMD ๋ฒกํ„ฐํ™”** -- ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•œ **LayoutTensor ์—ฐ์‚ฐ** +- ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•œ **TileTensor ์—ฐ์‚ฐ** - **GPU ์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ** vs SIMD ์—ฐ์‚ฐ - ์ค‘์ฒฉ ํ•จ์ˆ˜์—์„œ์˜ **์บก์ฒ˜ ์˜๋ฏธ๋ก ** @@ -26,7 +26,7 @@ - ๋ฒกํ„ฐ ํฌ๊ธฐ: `SIZE = 1024` - ๋ฐ์ดํ„ฐ ํƒ€์ž…: `DType.float32` - SIMD ํญ: ํƒ€๊ฒŸ ์˜์กด์  (GPU ์•„ํ‚คํ…์ฒ˜์™€ ๋ฐ์ดํ„ฐ ํƒ€์ž…์— ๋”ฐ๋ผ ๊ฒฐ์ •) -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(SIZE)` (1D ํ–‰ ์šฐ์„ ) +- ๋ ˆ์ด์•„์›ƒ: `row_major[SIZE]()` (1D ํ–‰ ์šฐ์„ ) ## ์™„์„ฑํ•  ์ฝ”๋“œ diff --git a/book/i18n/ko/src/puzzle_23/puzzle_23.md b/book/i18n/ko/src/puzzle_23/puzzle_23.md index e2764486..4a57fb24 100644 --- a/book/i18n/ko/src/puzzle_23/puzzle_23.md +++ b/book/i18n/ko/src/puzzle_23/puzzle_23.md @@ -73,7 +73,7 @@ vectorized: 13.38ms โ† ์ž๋™ ์ตœ์ ํ™” ์˜ค๋ฒ„ํ—ค๋“œ - **๊ธฐ๋ณธ GPU ๊ฐœ๋…**: ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ, ์Šค๋ ˆ๋“œ ์‹คํ–‰, SIMD ์—ฐ์‚ฐ - **Mojo ๊ธฐ์ดˆ**: ํŒŒ๋ผ๋ฏธํ„ฐ ํ•จ์ˆ˜, ์ปดํŒŒ์ผ ํƒ€์ž„ ํŠน์ˆ˜ํ™”, ์บก์ฒ˜ ์˜๋ฏธ๋ก  -- **LayoutTensor ์—ฐ์‚ฐ**: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘ +- **TileTensor ์—ฐ์‚ฐ**: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘ - **GPU ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ**: ๋ฒ„ํผ ํ• ๋‹น, ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๋™๊ธฐํ™” ## ํ•™์Šต ๊ฒฝ๋กœ @@ -88,7 +88,7 @@ vectorized: 13.38ms โ† ์ž๋™ ์ตœ์ ํ™” ์˜ค๋ฒ„ํ—ค๋“œ - `elementwise`๋ฅผ ํ™œ์šฉํ•œ ํ•จ์ˆ˜ํ˜• GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ - GPU ์Šค๋ ˆ๋“œ ๋‚ด์˜ ์ž๋™ SIMD ๋ฒกํ„ฐํ™” -- ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•œ 
LayoutTensor ์—ฐ์‚ฐ +- ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•œ TileTensor ์—ฐ์‚ฐ - ์ค‘์ฒฉ ํ•จ์ˆ˜์—์„œ์˜ ์บก์ฒ˜ ์˜๋ฏธ๋ก  **ํ•ต์‹ฌ ํŒจํ„ด:** diff --git a/book/i18n/ko/src/puzzle_23/tile.md b/book/i18n/ko/src/puzzle_23/tile.md index 3f828604..ad7cc70d 100644 --- a/book/i18n/ko/src/puzzle_23/tile.md +++ b/book/i18n/ko/src/puzzle_23/tile.md @@ -33,7 +33,7 @@ Mojo์˜ ํƒ€์ผ๋ง ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์—ฌ ๋™์ผํ•œ ๋ฒกํ„ฐ ๋ง์…ˆ ์—ฐ์‚ฐ์„ ๊ตฌ - ํƒ€์ผ ํฌ๊ธฐ: `TILE_SIZE = 32` - ๋ฐ์ดํ„ฐ ํƒ€์ž…: `DType.float32` - SIMD ํญ: GPU ์˜์กด์  (ํƒ€์ผ ๋‚ด ์—ฐ์‚ฐ์šฉ) -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(SIZE)` (1D ํ–‰ ์šฐ์„ ) +- ๋ ˆ์ด์•„์›ƒ: `row_major[SIZE]()` (1D ํ–‰ ์šฐ์„ ) ## ์™„์„ฑํ•  ์ฝ”๋“œ @@ -60,7 +60,7 @@ num_tiles = (size + tile_size - 1) // tile_size # ์˜ฌ๋ฆผ ๋‚˜๋ˆ—์…ˆ ### 2. **ํƒ€์ผ ์ถ”์ถœ ํŒจํ„ด** -[LayoutTensor `.tile` ๋ฌธ์„œ](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/#tile)๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”. +[TileTensor `.tile` ๋ฌธ์„œ](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/#tile)๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”. ```mojo tile_id = indices[0] # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌํ•  ํƒ€์ผ ํ•˜๋‚˜๋ฅผ ๋ฐ›์Œ diff --git a/book/i18n/ko/src/puzzle_23/vectorize.md b/book/i18n/ko/src/puzzle_23/vectorize.md index f0d6d19c..f80957dc 100644 --- a/book/i18n/ko/src/puzzle_23/vectorize.md +++ b/book/i18n/ko/src/puzzle_23/vectorize.md @@ -34,7 +34,7 @@ - ํƒ€์ผ ํฌ๊ธฐ: `TILE_SIZE = 32` - ๋ฐ์ดํ„ฐ ํƒ€์ž…: `DType.float32` - SIMD ํญ: GPU ์˜์กด์  -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(SIZE)` (1D ํ–‰ ์šฐ์„ ) +- ๋ ˆ์ด์•„์›ƒ: `row_major[SIZE]()` (1D ํ–‰ ์šฐ์„ ) ## 1. ์ˆ˜๋™ ๋ฒกํ„ฐํ™” ๋ฐฉ์‹ diff --git a/book/i18n/ko/src/puzzle_24/puzzle_24.md b/book/i18n/ko/src/puzzle_24/puzzle_24.md index d68652ec..2ae13f47 100644 --- a/book/i18n/ko/src/puzzle_24/puzzle_24.md +++ b/book/i18n/ko/src/puzzle_24/puzzle_24.md @@ -51,9 +51,9 @@ GPU ๋ธ”๋ก (์˜ˆ: 256 ์Šค๋ ˆ๋“œ) ```mojo # 1. 
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ๋ฆฌ๋•์…˜ # ์•ž์„œ ์‚ดํŽด๋ณธ ๋ณต์žกํ•œ ํŒจํ„ด (p12.mojo): -shared = LayoutTensor[ +shared = TileTensor[ dtype, - Layout.row_major(WARP_SIZE), + row_major[WARP_SIZE](), MutAnyOrigin, address_space = AddressSpace.SHARED, ].stack_allocation() @@ -95,7 +95,7 @@ total = sum(partial_product) # ๋‚ด๋ถ€์ ์œผ๋กœ ๋ฐฐ๋ฆฌ์–ด๋„, ๊ฒฝ์Ÿ ์ƒํƒœ๋„ - **Part VI ํ•จ์ˆ˜ํ˜• ํŒจํ„ด**: elementwise, tiled, vectorize ์ ‘๊ทผ ๋ฐฉ์‹ - **GPU ์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ**: ๋ธ”๋ก, ์›Œํ”„, ์Šค๋ ˆ๋“œ์— ๋Œ€ํ•œ ์ดํ•ด -- **LayoutTensor ์—ฐ์‚ฐ**: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘ +- **TileTensor ์—ฐ์‚ฐ**: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘ - **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐœ๋…**: ๋ฐฐ๋ฆฌ์–ด์™€ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ์™œ ๋ณต์žกํ•œ์ง€ ## ํ•™์Šต ๊ฒฝ๋กœ diff --git a/book/i18n/ko/src/puzzle_24/warp_sum.md b/book/i18n/ko/src/puzzle_24/warp_sum.md index b0e5cac1..b635fa95 100644 --- a/book/i18n/ko/src/puzzle_24/warp_sum.md +++ b/book/i18n/ko/src/puzzle_24/warp_sum.md @@ -27,7 +27,7 @@ - ๋ฐ์ดํ„ฐ ํƒ€์ž…: `DType.float32` - ๋ธ”๋ก ๊ตฌ์„ฑ: `(WARP_SIZE, 1)` ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ - ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: `(1, 1)` ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜ -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(SIZE)` (1D ํ–‰ ์šฐ์„ ) +- ๋ ˆ์ด์•„์›ƒ: `row_major[SIZE]()` (1D ํ–‰ ์šฐ์„ ) ## ๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ณต์žก์„ฑ (Puzzle 12์—์„œ) diff --git a/book/i18n/ko/src/puzzle_25/puzzle_25.md b/book/i18n/ko/src/puzzle_25/puzzle_25.md index c5c4f397..b57819da 100644 --- a/book/i18n/ko/src/puzzle_25/puzzle_25.md +++ b/book/i18n/ko/src/puzzle_25/puzzle_25.md @@ -50,9 +50,9 @@ GPU ์›Œํ”„ (32 ์Šค๋ ˆ๋“œ, SIMT ๋ก์Šคํ… ์‹คํ–‰) ```mojo # ๋ณต์žกํ•œ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด (๊ธฐ์กด ๋ฐฉ์‹): -shared = LayoutTensor[ +shared = TileTensor[ dtype, - Layout.row_major(WARP_SIZE), + row_major[WARP_SIZE](), MutAnyOrigin, address_space = AddressSpace.SHARED, ].stack_allocation() @@ -91,7 +91,7 @@ else: - **Part VII ์›Œํ”„ ๊ธฐ์ดˆ**: SIMT ์‹คํ–‰๊ณผ ๊ธฐ๋ณธ ์›Œํ”„ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ดํ•ด ([Puzzle 24: ์›Œํ”„ ๊ธฐ์ดˆ](../puzzle_24/puzzle_24.md) ์ฐธ๊ณ ) - **GPU ์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ**: 
๋ธ”๋ก, ์›Œํ”„, ๋ ˆ์ธ ๋ฒˆํ˜ธ ๋งค๊ธฐ๊ธฐ -- **LayoutTensor ์—ฐ์‚ฐ**: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘ +- **TileTensor ์—ฐ์‚ฐ**: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘ - **๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ**: ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ฐ€์žฅ์ž๋ฆฌ ์ผ€์ด์Šค ๊ด€๋ฆฌ ## ํ•™์Šต ๊ฒฝ๋กœ diff --git a/book/i18n/ko/src/puzzle_25/warp_broadcast.md b/book/i18n/ko/src/puzzle_25/warp_broadcast.md index 17386452..49672ad2 100644 --- a/book/i18n/ko/src/puzzle_25/warp_broadcast.md +++ b/book/i18n/ko/src/puzzle_25/warp_broadcast.md @@ -82,7 +82,7 @@ result = use_collective_value(collective_value) - ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: `(1, 1)` ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜ - ๋ธ”๋ก ๊ตฌ์„ฑ: `(WARP_SIZE, 1)` ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ - ๋ฐ์ดํ„ฐ ํƒ€์ž…: `DType.float32` -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(SIZE)` (1D row-major) +- ๋ ˆ์ด์•„์›ƒ: `row_major[SIZE]()` (1D row-major) ### ์™„์„ฑํ•  ์ฝ”๋“œ diff --git a/book/i18n/ko/src/puzzle_25/warp_shuffle_down.md b/book/i18n/ko/src/puzzle_25/warp_shuffle_down.md index b9f0e906..108f73ee 100644 --- a/book/i18n/ko/src/puzzle_25/warp_shuffle_down.md +++ b/book/i18n/ko/src/puzzle_25/warp_shuffle_down.md @@ -31,7 +31,7 @@ - ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: `(1, 1)` ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜ - ๋ธ”๋ก ๊ตฌ์„ฑ: `(WARP_SIZE, 1)` ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ - ๋ฐ์ดํ„ฐ ํƒ€์ž…: `DType.float32` -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(SIZE)` (1D row-major) +- ๋ ˆ์ด์•„์›ƒ: `row_major[SIZE]()` (1D row-major) ### shuffle_down ๊ฐœ๋… diff --git a/book/i18n/ko/src/puzzle_26/puzzle_26.md b/book/i18n/ko/src/puzzle_26/puzzle_26.md index 0f79af92..e44d00e7 100644 --- a/book/i18n/ko/src/puzzle_26/puzzle_26.md +++ b/book/i18n/ko/src/puzzle_26/puzzle_26.md @@ -50,9 +50,9 @@ Offset 1: Lane 0 โ†” Lane 1, Lane 2 โ†” Lane 3, ..., Lane 30 โ†” Lane 31 ```mojo # ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ (๊ธฐ์กด ๋ฐฉ์‹ - Puzzle 14 ์ฐธ๊ณ ): -shared = LayoutTensor[ +shared = TileTensor[ dtype, - Layout.row_major(WARP_SIZE), + row_major[WARP_SIZE](), MutAnyOrigin, address_space = AddressSpace.SHARED, ].stack_allocation() diff --git 
a/book/i18n/ko/src/puzzle_26/warp_prefix_sum.md b/book/i18n/ko/src/puzzle_26/warp_prefix_sum.md index 0eb81776..961b50a4 100644 --- a/book/i18n/ko/src/puzzle_26/warp_prefix_sum.md +++ b/book/i18n/ko/src/puzzle_26/warp_prefix_sum.md @@ -28,7 +28,7 @@ - ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: `(1, 1)` ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜ - ๋ธ”๋ก ๊ตฌ์„ฑ: `(WARP_SIZE, 1)` ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ - ๋ฐ์ดํ„ฐ ํƒ€์ž…: `DType.float32` -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(SIZE)` (1D row-major) +- ๋ ˆ์ด์•„์›ƒ: `row_major[SIZE]()` (1D row-major) ### prefix_sum์˜ ์ด์  diff --git a/book/i18n/ko/src/puzzle_26/warp_shuffle_xor.md b/book/i18n/ko/src/puzzle_26/warp_shuffle_xor.md index 7aa21e6c..638f8966 100644 --- a/book/i18n/ko/src/puzzle_26/warp_shuffle_xor.md +++ b/book/i18n/ko/src/puzzle_26/warp_shuffle_xor.md @@ -31,7 +31,7 @@ - ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: `(1, 1)` ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜ - ๋ธ”๋ก ๊ตฌ์„ฑ: `(WARP_SIZE, 1)` ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ - ๋ฐ์ดํ„ฐ ํƒ€์ž…: `DType.float32` -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(SIZE)` (1D row-major) +- ๋ ˆ์ด์•„์›ƒ: `row_major[SIZE]()` (1D row-major) ### shuffle_xor ๊ฐœ๋… diff --git a/book/i18n/ko/src/puzzle_27/block_broadcast.md b/book/i18n/ko/src/puzzle_27/block_broadcast.md index 8f0fbffe..31474e02 100644 --- a/book/i18n/ko/src/puzzle_27/block_broadcast.md +++ b/book/i18n/ko/src/puzzle_27/block_broadcast.md @@ -27,7 +27,7 @@ - ๋ฐ์ดํ„ฐ ํƒ€์ž…: `DType.float32` - ๋ธ”๋ก ๊ตฌ์„ฑ: `(128, 1)` ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (`TPB = 128`) - ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: `(1, 1)` ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜ -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(SIZE)` (์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ ๋ชจ๋‘ 1D row-major) +- ๋ ˆ์ด์•„์›ƒ: `row_major[SIZE]()` (์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ ๋ชจ๋‘ 1D row-major) - ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: 1-8 ๋ฐ˜๋ณต ๊ฐ’, ํ‰๊ท  = 4.5 - ์˜ˆ์ƒ ์ถœ๋ ฅ: ํ‰๊ท ์ด 1.0์ธ ์ •๊ทœํ™”๋œ ๋ฒกํ„ฐ @@ -102,7 +102,7 @@ output[global_i] = my_value / mean ### 2. 
**๋ฐ์ดํ„ฐ ๋กœ๋”ฉ๊ณผ ํ•ฉ๊ณ„ ๊ณ„์‚ฐ (์ต์ˆ™ํ•œ ํŒจํ„ด)** -๊ธฐ์กด LayoutTensor ํŒจํ„ด์œผ๋กœ ์š”์†Œ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค: +๊ธฐ์กด TileTensor ํŒจํ„ด์œผ๋กœ ์š”์†Œ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค: ```mojo var my_value: Scalar[dtype] = 0.0 @@ -257,7 +257,7 @@ Output mean: 1.0 global_i = block_dim.x * block_idx.x + thread_idx.x // ์ž…๋ ฅ ๋ฐฐ์—ด ์œ„์น˜์— ๋งคํ•‘ local_i = thread_idx.x // ๋ธ”๋ก ๋‚ด ์œ„์น˜ (0-127) -LayoutTensor ํŒจํ„ด์„ ์‚ฌ์šฉํ•œ ๋ณ‘๋ ฌ ์š”์†Œ ๋กœ๋”ฉ: +TileTensor ํŒจํ„ด์„ ์‚ฌ์šฉํ•œ ๋ณ‘๋ ฌ ์š”์†Œ ๋กœ๋”ฉ: ์Šค๋ ˆ๋“œ 0: my_value = input_data[0][0] = 1.0 // ์ฒซ ๋ฒˆ์งธ ์ˆœํ™˜ ๊ฐ’ ์Šค๋ ˆ๋“œ 1: my_value = input_data[1][0] = 2.0 // ๋‘ ๋ฒˆ์งธ ์ˆœํ™˜ ๊ฐ’ ์Šค๋ ˆ๋“œ 7: my_value = input_data[7][0] = 8.0 // ๋งˆ์ง€๋ง‰ ์ˆœํ™˜ ๊ฐ’ @@ -375,10 +375,10 @@ block.broadcast() ์‹คํ–‰ ํ›„: ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ฆ๋ช… ๊ฐ€๋Šฅํ•˜๊ฒŒ ์˜ฌ๋ฐ”๋ฅธ ์ˆ˜ํ•™์  ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ``` -### **[Puzzle 12](../puzzle_12/layout_tensor.md) (๊ธฐ์ดˆ ํŒจํ„ด)๊ณผ์˜ ์—ฐ๊ฒฐ:** +### **[Puzzle 12](../puzzle_12/tile_tensor.md) (๊ธฐ์ดˆ ํŒจํ„ด)๊ณผ์˜ ์—ฐ๊ฒฐ:** - **์Šค๋ ˆ๋“œ ์กฐ์œจ์˜ ์ง„ํ™”**: ๋™์ผํ•œ `global_i`, `local_i` ํŒจํ„ด์ด์ง€๋งŒ ๋ธ”๋ก ๊ธฐ๋ณธ ์š”์†Œ ์‚ฌ์šฉ -- **๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด**: ๋™์ผํ•œ LayoutTensor SIMD ์ถ”์ถœ `[0]`์ด์ง€๋งŒ ์ตœ์ ํ™”๋œ ์›Œํฌํ”Œ๋กœ์šฐ +- **๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด**: ๋™์ผํ•œ TileTensor SIMD ์ถ”์ถœ `[0]`์ด์ง€๋งŒ ์ตœ์ ํ™”๋œ ์›Œํฌํ”Œ๋กœ์šฐ - **๋ณต์žก์„ฑ ์ œ๊ฑฐ**: 20์ค„ ์ด์ƒ์˜ ์ˆ˜๋™ ๋ฐฐ๋ฆฌ์–ด๋ฅผ 2๊ฐœ์˜ ๋ธ”๋ก ์—ฐ์‚ฐ์œผ๋กœ ๋Œ€์ฒด - **๊ต์œก์  ์ง„ํ–‰**: ์ˆ˜๋™ โ†’ ์ž๋™, ๋ณต์žก โ†’ ๋‹จ์ˆœ, ์˜ค๋ฅ˜ ๋ฐœ์ƒ ๊ฐ€๋Šฅ โ†’ ์‹ ๋ขฐ์„ฑ @@ -457,7 +457,7 @@ block.broadcast() ์‹คํ–‰ ํ›„: **์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์ง„ํ–‰:** -1. **์ˆ˜๋™ ์กฐ์œจ** ([Puzzle 12](../puzzle_12/layout_tensor.md)): ๋ณ‘๋ ฌ ๊ธฐ์ดˆ ์ดํ•ด +1. **์ˆ˜๋™ ์กฐ์œจ** ([Puzzle 12](../puzzle_12/tile_tensor.md)): ๋ณ‘๋ ฌ ๊ธฐ์ดˆ ์ดํ•ด 2. **์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ** ([Puzzle 24](../puzzle_24/warp_sum.md)): ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ํŒจํ„ด ํ•™์Šต 3. **๋ธ”๋ก ๋ฆฌ๋•์…˜** ([`block.sum()`](./block_sum.md)): ์ „์ฒดโ†’ํ•˜๋‚˜ ํ†ต์‹  ํ•™์Šต 4. 
**๋ธ”๋ก ์Šค์บ”** ([`block.prefix_sum()`](./block_prefix_sum.md)): ์ „์ฒดโ†’๊ฐ๊ฐ ํ†ต์‹  ํ•™์Šต diff --git a/book/i18n/ko/src/puzzle_27/block_prefix_sum.md b/book/i18n/ko/src/puzzle_27/block_prefix_sum.md index b3b8c469..cbfcaceb 100644 --- a/book/i18n/ko/src/puzzle_27/block_prefix_sum.md +++ b/book/i18n/ko/src/puzzle_27/block_prefix_sum.md @@ -28,7 +28,7 @@ - ๋ธ”๋ก ๊ตฌ์„ฑ: `(128, 1)` ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (`TPB = 128`) - ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: `(1, 1)` ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜ - ๊ตฌ๊ฐ„ ์ˆ˜: `NUM_BINS = 8` (๋ฒ”์œ„ [0.0, 0.125), [0.125, 0.25) ๋“ฑ) -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(SIZE)` (1D row-major) +- ๋ ˆ์ด์•„์›ƒ: `row_major[SIZE]()` (1D row-major) - ๋ธ”๋ก๋‹น ์›Œํ”„ ์ˆ˜: `128 / WARP_SIZE` (GPU์— ๋”ฐ๋ผ 2๊ฐœ ๋˜๋Š” 4๊ฐœ) ## ๋„์ „ ๊ณผ์ œ: ๋ณ‘๋ ฌ ๊ตฌ๊ฐ„ ์ถ”์ถœ @@ -134,7 +134,7 @@ if belongs_to_target == 1: bin_output[Int(offset[0])] = my_value # ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•ด SIMD๋ฅผ Int๋กœ ๋ณ€ํ™˜ ``` -์ด๊ฒƒ์€ [Puzzle 12](../puzzle_12/layout_tensor.md)์˜ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํŒจํ„ด๊ณผ ๋™์ผํ•˜์ง€๋งŒ, ์กฐ๊ฑด์ด "๋Œ€์ƒ ๊ตฌ๊ฐ„์— ์†ํ•˜๋Š”์ง€"๋กœ ๋ฐ”๋€Œ์—ˆ์Šต๋‹ˆ๋‹ค. +์ด๊ฒƒ์€ [Puzzle 12](../puzzle_12/tile_tensor.md)์˜ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํŒจํ„ด๊ณผ ๋™์ผํ•˜์ง€๋งŒ, ์กฐ๊ฑด์ด "๋Œ€์ƒ ๊ตฌ๊ฐ„์— ์†ํ•˜๋Š”์ง€"๋กœ ๋ฐ”๋€Œ์—ˆ์Šต๋‹ˆ๋‹ค. ### 6. 
**์ตœ์ข… ๊ฐœ์ˆ˜ ๊ณ„์‚ฐ** @@ -152,7 +152,7 @@ if local_i == tpb - 1: # ๋ธ”๋ก์˜ ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ ์ด์ „ ํผ์ฆ์˜ ํŒจํ„ด์„ ๊ธฐ์–ตํ•˜์„ธ์š”: -- `LayoutTensor` ์ธ๋ฑ์‹ฑ์€ SIMD๋ฅผ ๋ฐ˜ํ™˜: `input_data[i][0]` +- `TileTensor` ์ธ๋ฑ์‹ฑ์€ SIMD๋ฅผ ๋ฐ˜ํ™˜: `input_data[i][0]` - `block.prefix_sum()`์€ SIMD๋ฅผ ๋ฐ˜ํ™˜: `offset[0]`์œผ๋กœ ์ถ”์ถœ - ๋ฐฐ์—ด ์ธ๋ฑ์‹ฑ์€ `Int`๊ฐ€ ํ•„์š”: `bin_output[...]`์— `Int(offset[0])` @@ -254,14 +254,14 @@ Bin 7 extracted elements: ## **๋‹จ๊ณ„๋ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:** -### **1๋‹จ๊ณ„: ์š”์†Œ ์ฒ˜๋ฆฌ ([Puzzle 12](../puzzle_12/layout_tensor.md) ๋‚ด์ ๊ณผ ์œ ์‚ฌ)** +### **1๋‹จ๊ณ„: ์š”์†Œ ์ฒ˜๋ฆฌ ([Puzzle 12](../puzzle_12/tile_tensor.md) ๋‚ด์ ๊ณผ ์œ ์‚ฌ)** ``` ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ (์ต์ˆ™ํ•œ ํŒจํ„ด): global_i = block_dim.x * block_idx.x + thread_idx.x // ์ „์—ญ ์š”์†Œ ์ธ๋ฑ์Šค local_i = thread_idx.x // ๋กœ์ปฌ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค -์š”์†Œ ๋กœ๋”ฉ (LayoutTensor ํŒจํ„ด๊ณผ ๋™์ผ): +์š”์†Œ ๋กœ๋”ฉ (TileTensor ํŒจํ„ด๊ณผ ๋™์ผ): ์Šค๋ ˆ๋“œ 0: my_value = input_data[0][0] = 0.00 ์Šค๋ ˆ๋“œ 1: my_value = input_data[1][0] = 0.01 ์Šค๋ ˆ๋“œ 13: my_value = input_data[13][0] = 0.13 @@ -330,11 +330,11 @@ belongs_to_target=1์ธ ์Šค๋ ˆ๋“œ๋งŒ ๊ธฐ๋ก: ## **์ด ๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :** -### **[Puzzle 12](../puzzle_12/layout_tensor.md) (๊ธฐ์กด ๋‚ด์ )๊ณผ์˜ ์—ฐ๊ฒฐ:** +### **[Puzzle 12](../puzzle_12/tile_tensor.md) (๊ธฐ์กด ๋‚ด์ )๊ณผ์˜ ์—ฐ๊ฒฐ:** - **๋™์ผํ•œ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ**: `global_i`์™€ `local_i` ํŒจํ„ด - **๋™์ผํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ**: `if global_i < size` ๊ฒ€์ฆ -- **๋™์ผํ•œ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ**: `[0]`์„ ์‚ฌ์šฉํ•œ LayoutTensor SIMD ์ถ”์ถœ +- **๋™์ผํ•œ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ**: `[0]`์„ ์‚ฌ์šฉํ•œ TileTensor SIMD ์ถ”์ถœ ### **[`block.sum()`](./block_sum.md) (์ด ํผ์ฆ์˜ ์•ž๋ถ€๋ถ„)๊ณผ์˜ ์—ฐ๊ฒฐ:** diff --git a/book/i18n/ko/src/puzzle_27/block_sum.md b/book/i18n/ko/src/puzzle_27/block_sum.md index d1da7eaf..da3c945d 100644 --- a/book/i18n/ko/src/puzzle_27/block_sum.md +++ b/book/i18n/ko/src/puzzle_27/block_sum.md @@ -27,12 +27,12 @@ - 
๋ฐ์ดํ„ฐ ํƒ€์ž…: `DType.float32` - ๋ธ”๋ก ๊ตฌ์„ฑ: `(128, 1)` ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (`TPB = 128`) - ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: `(1, 1)` ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜ -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(SIZE)` (1D row-major) +- ๋ ˆ์ด์•„์›ƒ: `row_major[SIZE]()` (1D row-major) - ๋ธ”๋ก๋‹น ์›Œํ”„ ์ˆ˜: `128 / WARP_SIZE` (NVIDIA์—์„œ 4๊ฐœ, AMD์—์„œ 2๊ฐœ ๋˜๋Š” 4๊ฐœ) ## ๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ณต์žก์„ฑ (Puzzle 12์—์„œ) -[Puzzle 12](../puzzle_12/layout_tensor.md)์˜ ๋ณต์žกํ•œ ๋ฐฉ์‹์„ ๋– ์˜ฌ๋ ค ๋ด…์‹œ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ฐฐ๋ฆฌ์–ด, ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ํ•„์š”ํ–ˆ์Šต๋‹ˆ๋‹ค: +[Puzzle 12](../puzzle_12/tile_tensor.md)์˜ ๋ณต์žกํ•œ ๋ฐฉ์‹์„ ๋– ์˜ฌ๋ ค ๋ด…์‹œ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ฐฐ๋ฆฌ์–ด, ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ํ•„์š”ํ–ˆ์Šต๋‹ˆ๋‹ค: ```mojo {{#include ../../../../../solutions/p27/p27.mojo:traditional_dot_product_solution}} @@ -173,9 +173,9 @@ Just like warp.sum() but for the entire block ๊ฐ ์Šค๋ ˆ๋“œ๋Š” ๋ฒกํ„ฐ `a`์™€ `b`์—์„œ ํ•˜๋‚˜์˜ ์š”์†Œ ์Œ์„ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋“ค์„ ์Šค๋ ˆ๋“œ ๊ฐ„์— ํ•ฉ์‚ฐํ•  ์ˆ˜ ์žˆ๋Š” "๋ถ€๋ถ„ ๊ฒฐ๊ณผ"๋กœ ํ•ฉ์น˜๋Š” ์—ฐ์‚ฐ์€ ๋ฌด์—‡์ผ๊นŒ์š”? -### 3. **LayoutTensor ์ธ๋ฑ์‹ฑ ํŒจํ„ด** +### 3. **TileTensor ์ธ๋ฑ์‹ฑ ํŒจํ„ด** -`LayoutTensor` ์š”์†Œ์— ์ ‘๊ทผํ•  ๋•Œ, ์ธ๋ฑ์‹ฑ์ด SIMD ๊ฐ’์„ ๋ฐ˜ํ™˜ํ•œ๋‹ค๋Š” ์ ์„ ๊ธฐ์–ตํ•˜์„ธ์š”. ์‚ฐ์ˆ  ์—ฐ์‚ฐ์„ ์œ„ํ•ด ์Šค์นผ๋ผ ๊ฐ’์„ ์ถ”์ถœํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. +`TileTensor` ์š”์†Œ์— ์ ‘๊ทผํ•  ๋•Œ, ์ธ๋ฑ์‹ฑ์ด SIMD ๊ฐ’์„ ๋ฐ˜ํ™˜ํ•œ๋‹ค๋Š” ์ ์„ ๊ธฐ์–ตํ•˜์„ธ์š”. ์‚ฐ์ˆ  ์—ฐ์‚ฐ์„ ์œ„ํ•ด ์Šค์นผ๋ผ ๊ฐ’์„ ์ถ”์ถœํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ### 4. 
**[block.sum()](https://docs.modular.com/mojo/std/gpu/primitives/block/sum) API ๊ฐœ๋…** diff --git a/book/i18n/ko/src/puzzle_28/puzzle_28.md b/book/i18n/ko/src/puzzle_28/puzzle_28.md index 2be71f31..89021720 100644 --- a/book/i18n/ko/src/puzzle_28/puzzle_28.md +++ b/book/i18n/ko/src/puzzle_28/puzzle_28.md @@ -164,7 +164,7 @@ wait_and_compute() # โ† ๋‚˜๋จธ์ง€ ~400 ์‚ฌ์ดํด๋งŒ ๋Œ€๊ธฐ ํ›„ ์—ฐ์‚ฐ - ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: ๊ทธ๋ฆฌ๋“œ๋‹น `(VECTOR_SIZE // CONV_TILE_SIZE, 1)` ๋ธ”๋ก (64๊ฐœ ๋ธ”๋ก) - ์ปค๋„ ํฌ๊ธฐ: `KERNEL_SIZE = 5` (Puzzle 13๊ณผ ๋™์ผํ•œ ๊ฐ„๋‹จํ•œ 1D ํ•ฉ์„ฑ๊ณฑ) - ๋ฐ์ดํ„ฐ ํƒ€์ž…: `DType.float32` -- ๋ ˆ์ด์•„์›ƒ: `Layout.row_major(VECTOR_SIZE)` (1D row-major) +- ๋ ˆ์ด์•„์›ƒ: `row_major[VECTOR_SIZE]()` (1D row-major) ### ๋น„๋™๊ธฐ ๋ณต์‚ฌ์˜ ๊ธฐํšŒ @@ -327,13 +327,13 @@ uv run poe p28 ```mojo # Phase 1: Launch async copy for input tile input_tile = input.tile[CONV_TILE_SIZE](block_idx.x) -comptime load_layout = Layout.row_major(THREADS_PER_BLOCK_ASYNC) +comptime load_layout = row_major[THREADS_PER_BLOCK_ASYNC]() copy_dram_to_sram_async[thread_layout=load_layout](input_shared, input_tile) ``` -- **ํƒ€์ผ ์ƒ์„ฑ**: `input.tile[CONV_TILE_SIZE](block_idx.x)`๋Š” `block_idx.x * 256`์—์„œ ์‹œ์ž‘ํ•˜๋Š” 256๊ฐœ ์š”์†Œ์˜ ์ž…๋ ฅ ๋ฐฐ์—ด ๋ทฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. Mojo์˜ [`tile` ๋ฉ”์„œ๋“œ](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/#tile)๋Š” ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋‚˜ ์ œ๋กœ ํŒจ๋”ฉ์„ **์ˆ˜ํ–‰ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค**. ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ธ๋ฑ์Šค ์ ‘๊ทผ์€ ๋ฏธ์ •์˜ ๋™์ž‘์„ ์ดˆ๋ž˜ํ•ฉ๋‹ˆ๋‹ค. ๊ตฌํ˜„์—์„œ ํƒ€์ผ ํฌ๊ธฐ์™€ offset์ด ์œ ํšจํ•œ ๋ฐฐ์—ด ๋ฒ”์œ„ ๋‚ด์— ์žˆ๋Š”์ง€ ํ™•์ธํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. +- **ํƒ€์ผ ์ƒ์„ฑ**: `input.tile[CONV_TILE_SIZE](block_idx.x)`๋Š” `block_idx.x * 256`์—์„œ ์‹œ์ž‘ํ•˜๋Š” 256๊ฐœ ์š”์†Œ์˜ ์ž…๋ ฅ ๋ฐฐ์—ด ๋ทฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. Mojo์˜ [`tile` ๋ฉ”์„œ๋“œ](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/#tile)๋Š” ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋‚˜ ์ œ๋กœ ํŒจ๋”ฉ์„ **์ˆ˜ํ–‰ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค**. 
๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ธ๋ฑ์Šค ์ ‘๊ทผ์€ ๋ฏธ์ •์˜ ๋™์ž‘์„ ์ดˆ๋ž˜ํ•ฉ๋‹ˆ๋‹ค. ๊ตฌํ˜„์—์„œ ํƒ€์ผ ํฌ๊ธฐ์™€ offset์ด ์œ ํšจํ•œ ๋ฐฐ์—ด ๋ฒ”์œ„ ๋‚ด์— ์žˆ๋Š”์ง€ ํ™•์ธํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. -- **์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ**: `Layout.row_major(THREADS_PER_BLOCK_ASYNC, 1)`๋Š” ๋ธ”๋ก ๊ตฌ์„ฑ๊ณผ ์ผ์น˜ํ•˜๋Š” `256 x 1` ๋ ˆ์ด์•„์›ƒ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ **ํ•„์ˆ˜**์ž…๋‹ˆ๋‹ค - ์ตœ์ ์˜ ๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•ด ๋ ˆ์ด์•„์›ƒ์ด ๋ฌผ๋ฆฌ์  ์Šค๋ ˆ๋“œ ๋ฐฐ์น˜์™€ ์ผ์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ ˆ์ด์•„์›ƒ์ด ์ผ์น˜ํ•˜์ง€ ์•Š์œผ๋ฉด ์Šค๋ ˆ๋“œ๊ฐ€ ๋น„์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ์— ์ ‘๊ทผํ•˜์—ฌ ๋ณ‘ํ•ฉ์ด ๊นจ์ง€๊ณ  ์„ฑ๋Šฅ์ด ์‹ฌ๊ฐํ•˜๊ฒŒ ์ €ํ•˜๋ฉ๋‹ˆ๋‹ค. +- **์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ**: `row_major[THREADS_PER_BLOCK_ASYNC, 1]()`๋Š” ๋ธ”๋ก ๊ตฌ์„ฑ๊ณผ ์ผ์น˜ํ•˜๋Š” `256 x 1` ๋ ˆ์ด์•„์›ƒ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ **ํ•„์ˆ˜**์ž…๋‹ˆ๋‹ค - ์ตœ์ ์˜ ๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•ด ๋ ˆ์ด์•„์›ƒ์ด ๋ฌผ๋ฆฌ์  ์Šค๋ ˆ๋“œ ๋ฐฐ์น˜์™€ ์ผ์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ ˆ์ด์•„์›ƒ์ด ์ผ์น˜ํ•˜์ง€ ์•Š์œผ๋ฉด ์Šค๋ ˆ๋“œ๊ฐ€ ๋น„์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ์— ์ ‘๊ทผํ•˜์—ฌ ๋ณ‘ํ•ฉ์ด ๊นจ์ง€๊ณ  ์„ฑ๋Šฅ์ด ์‹ฌ๊ฐํ•˜๊ฒŒ ์ €ํ•˜๋ฉ๋‹ˆ๋‹ค. - **๋น„๋™๊ธฐ ๋ณต์‚ฌ ์‹œ์ž‘**: `copy_dram_to_sram_async`๋Š” DRAM์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ์˜ ๋ฐฑ๊ทธ๋ผ์šด๋“œ ์ „์†ก์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค. ํ•˜๋“œ์›จ์–ด๊ฐ€ 256๊ฐœ์˜ float(1KB)๋ฅผ ๋ณต์‚ฌํ•˜๋Š” ๋™์•ˆ ๋ธ”๋ก์€ ๊ณ„์† ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค. @@ -412,7 +412,7 @@ Total Time = MAX(Input_Transfer_Time, Kernel_Transfer_Time) + Compute_Time #### **ํ•ต์‹ฌ ๊ธฐ์ˆ ์  ํ†ต์ฐฐ** -1. **์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ ๋งค์นญ**: `Layout.row_major(256, 1)` ๋ ˆ์ด์•„์›ƒ์ด ๋ธ”๋ก์˜ `(256, 1)` ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ๊ณผ ์ •ํ™•ํžˆ ์ผ์น˜ํ•˜์—ฌ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. +1. **์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ ๋งค์นญ**: `row_major[256, 1]()` ๋ ˆ์ด์•„์›ƒ์ด ๋ธ”๋ก์˜ `(256, 1)` ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ๊ณผ ์ •ํ™•ํžˆ ์ผ์น˜ํ•˜์—ฌ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. 2. 
**๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€**: ์ ์ ˆํ•œ ์ˆœ์„œ ์ง€์ •(๋น„๋™๊ธฐ ๋ณต์‚ฌ โ†’ ์ปค๋„ ๋กœ๋“œ โ†’ ๋Œ€๊ธฐ โ†’ ๋ฐฐ๋ฆฌ์–ด โ†’ ์—ฐ์‚ฐ)์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ๋ชจ๋“  ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค. diff --git a/book/i18n/ko/src/puzzle_32/conflict_free_patterns.md b/book/i18n/ko/src/puzzle_32/conflict_free_patterns.md index dc2972bd..5dc47afa 100644 --- a/book/i18n/ko/src/puzzle_32/conflict_free_patterns.md +++ b/book/i18n/ko/src/puzzle_32/conflict_free_patterns.md @@ -356,7 +356,7 @@ constant = shared[0] # All threads read same address - hardware optimized **3. ํŒจ๋”ฉ ๊ธฐ๋ฒ•:** ```mojo -shared = LayoutTensor[dtype, Layout.row_major(TPB + 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() # Shift access patterns +shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + 1]()) # Shift access patterns ``` **4. ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„:** diff --git a/book/i18n/ko/src/puzzle_33/puzzle_33.md b/book/i18n/ko/src/puzzle_33/puzzle_33.md index a730eb0b..4912e19e 100644 --- a/book/i18n/ko/src/puzzle_33/puzzle_33.md +++ b/book/i18n/ko/src/puzzle_33/puzzle_33.md @@ -178,9 +178,9 @@ Total: 4ร—2 = 8 warps, each handling 32ร—32 output region ๋ ˆ์ด์•„์›ƒ ์„ค์ •: -- ์ž…๋ ฅ A: `Layout.row_major(SIZE, SIZE)` -- ์ž…๋ ฅ B: `Layout.row_major(SIZE, SIZE)` -- ์ถœ๋ ฅ C: `Layout.row_major(SIZE, SIZE)` +- ์ž…๋ ฅ A: `row_major[SIZE, SIZE]()` +- ์ž…๋ ฅ B: `row_major[SIZE, SIZE]()` +- ์ถœ๋ ฅ C: `row_major[SIZE, SIZE]()` - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ธ”๋ก ํฌ๊ธฐ ํƒ€์ผ ## ๋„์ „ ๊ณผ์ œ diff --git a/book/i18n/ko/src/puzzle_34/advanced_cluster_patterns.md b/book/i18n/ko/src/puzzle_34/advanced_cluster_patterns.md index bf675187..8fb33014 100644 --- a/book/i18n/ko/src/puzzle_34/advanced_cluster_patterns.md +++ b/book/i18n/ko/src/puzzle_34/advanced_cluster_patterns.md @@ -39,7 +39,7 @@ - **์›Œํ”„ ํฌ๊ธฐ**: `WARP_SIZE = 32` ์›Œํ”„๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (NVIDIA ํ‘œ์ค€) - **๋ธ”๋ก๋‹น ์›Œํ”„ ์ˆ˜**: `TPB / 
WARP_SIZE = 8` ์›Œํ”„ - **๋ฐ์ดํ„ฐ ํƒ€์ž…**: `DType.float32` -- **๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ**: ์ž…๋ ฅ `Layout.row_major(SIZE)`, ์ถœ๋ ฅ `Layout.row_major(CLUSTER_SIZE)` +- **๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ**: ์ž…๋ ฅ `row_major[SIZE]()`, ์ถœ๋ ฅ `row_major[CLUSTER_SIZE]()` **์ฒ˜๋ฆฌ ๋ถ„๋ฐฐ:** diff --git a/book/i18n/ko/src/puzzle_34/cluster_collective_ops.md b/book/i18n/ko/src/puzzle_34/cluster_collective_ops.md index 1016d94f..116140af 100644 --- a/book/i18n/ko/src/puzzle_34/cluster_collective_ops.md +++ b/book/i18n/ko/src/puzzle_34/cluster_collective_ops.md @@ -42,8 +42,8 @@ - **๋ธ”๋ก ์„ค์ •**: `TPB = 256` ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ `(256, 1)` - **๊ทธ๋ฆฌ๋“œ ์„ค์ •**: `CLUSTER_SIZE = 4` ํด๋Ÿฌ์Šคํ„ฐ๋‹น ๋ธ”๋ก ์ˆ˜ `(4, 1)` - **๋ฐ์ดํ„ฐ ํƒ€์ž…**: `DType.float32` -- **๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ**: ์ž…๋ ฅ `Layout.row_major(SIZE)`, ์ถœ๋ ฅ `Layout.row_major(1)` -- **์ž„์‹œ ์ €์žฅ์†Œ**: ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ `Layout.row_major(CLUSTER_SIZE)` +- **๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ**: ์ž…๋ ฅ `row_major[SIZE]()`, ์ถœ๋ ฅ `row_major[1]()` +- **์ž„์‹œ ์ €์žฅ์†Œ**: ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ `row_major[CLUSTER_SIZE]()` **์˜ˆ์ƒ ๊ฒฐ๊ณผ**: ์ˆ˜์—ด `0, 0.01, 0.02, ..., 10.23`์˜ ํ•ฉ = **523,776** diff --git a/book/i18n/ko/src/puzzle_34/cluster_coordination_basics.md b/book/i18n/ko/src/puzzle_34/cluster_coordination_basics.md index d011d478..67d391ce 100644 --- a/book/i18n/ko/src/puzzle_34/cluster_coordination_basics.md +++ b/book/i18n/ko/src/puzzle_34/cluster_coordination_basics.md @@ -37,7 +37,7 @@ - **๋ธ”๋ก ์„ค์ •**: `TPB = 256` ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ `(256, 1)` - **๊ทธ๋ฆฌ๋“œ ์„ค์ •**: `CLUSTER_SIZE = 4` ํด๋Ÿฌ์Šคํ„ฐ๋‹น ๋ธ”๋ก ์ˆ˜ `(4, 1)` - **๋ฐ์ดํ„ฐ ํƒ€์ž…**: `DType.float32` -- **๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ**: ์ž…๋ ฅ `Layout.row_major(SIZE)`, ์ถœ๋ ฅ `Layout.row_major(CLUSTER_SIZE)` +- **๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ**: ์ž…๋ ฅ `row_major[SIZE]()`, ์ถœ๋ ฅ `row_major[CLUSTER_SIZE]()` **์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋ถ„๋ฐฐ:** @@ -67,7 +67,7 @@ ### **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ •** -- `LayoutTensor[dtype, Layout.row_major(tpb), MutAnyOrigin, 
address_space = AddressSpace.SHARED].stack_allocation()`์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค ([Puzzle 8์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ดˆ](../puzzle_08/puzzle_08.md) ์ฐธ๊ณ ) +- `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[tpb]())`์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค ([Puzzle 8์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ดˆ](../puzzle_08/puzzle_08.md) ์ฐธ๊ณ ) - `block_id + 1`๋กœ ์Šค์ผ€์ผ๋งํ•˜์—ฌ ๋ธ”๋ก๋งˆ๋‹ค ๊ณ ์œ ํ•œ ์Šค์ผ€์ผ๋ง์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค - ์ž…๋ ฅ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ ์‹œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค ([Puzzle 3์˜ ๊ฐ€๋“œ ํŒจํ„ด](../puzzle_03/puzzle_03.md)) @@ -155,7 +155,7 @@ block_id = Int(block_idx.x) # Block index for reliable **๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋ฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ:** -- ๊ฐ ๋ธ”๋ก์ด ์ž์ฒด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ž‘์—… ๊ณต๊ฐ„์„ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค: `LayoutTensor[dtype, Layout.row_major(tpb), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` +- ๊ฐ ๋ธ”๋ก์ด ์ž์ฒด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ž‘์—… ๊ณต๊ฐ„์„ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค: `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[tpb]())` - **์Šค์ผ€์ผ๋ง ์ „๋žต**: `data_scale = Float32(block_id + 1)`๋กœ ๊ฐ ๋ธ”๋ก์ด ๋‹ค๋ฅด๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค - Block 0: 1.0๋ฐฐ, Block 1: 2.0๋ฐฐ, Block 2: 3.0๋ฐฐ, Block 3: 4.0๋ฐฐ - **๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ**: `if global_i < size:`๋กœ ๋ฒ”์œ„ ๋ฐ– ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md index 7094598a..3714ce93 100644 --- a/book/src/SUMMARY.md +++ b/book/src/SUMMARY.md @@ -10,23 +10,17 @@ - [Puzzle 1: Map](./puzzle_01/puzzle_01.md) - [๐Ÿ”ฐ Raw Memory Approach](./puzzle_01/raw.md) - - [๐Ÿ’ก Preview: Modern Approach with LayoutTensor](./puzzle_01/layout_tensor_preview.md) + - [๐Ÿ’ก Preview: Modern Approach with TileTensor](./puzzle_01/tile_tensor_preview.md) - [Puzzle 2: Zip](./puzzle_02/puzzle_02.md) - [Puzzle 3: Guards](./puzzle_03/puzzle_03.md) - [Puzzle 4: 2D Map](./puzzle_04/puzzle_04.md) - [๐Ÿ”ฐ Raw Memory Approach](./puzzle_04/raw.md) - - [๐Ÿ“š Learn about 
LayoutTensor](./puzzle_04/introduction_layout_tensor.md) - - [๐Ÿš€ Modern 2D Operations](./puzzle_04/layout_tensor.md) + - [๐Ÿ“š Learn about TileTensor](./puzzle_04/introduction_tile_tensor.md) + - [๐Ÿš€ Modern 2D Operations](./puzzle_04/tile_tensor.md) - [Puzzle 5: Broadcast](./puzzle_05/puzzle_05.md) - - [๐Ÿ”ฐ Raw Memory Approach](./puzzle_05/raw.md) - - [๐Ÿ“ LayoutTensor Version](./puzzle_05/layout_tensor.md) - [Puzzle 6: Blocks](./puzzle_06/puzzle_06.md) - [Puzzle 7: 2D Blocks](./puzzle_07/puzzle_07.md) - - [๐Ÿ”ฐ Raw Memory Approach](./puzzle_07/raw.md) - - [๐Ÿ“ LayoutTensor Version](./puzzle_07/layout_tensor.md) - [Puzzle 8: Shared Memory](./puzzle_08/puzzle_08.md) - - [๐Ÿ”ฐ Raw Memory Approach](./puzzle_08/raw.md) - - [๐Ÿ“ LayoutTensor Version](./puzzle_08/layout_tensor.md) # Part II: ๐Ÿž Debugging GPU Programs @@ -42,11 +36,7 @@ # Part III: ๐Ÿงฎ GPU Algorithms - [Puzzle 11: Pooling](./puzzle_11/puzzle_11.md) - - [๐Ÿ”ฐ Raw Memory Approach](./puzzle_11/raw.md) - - [๐Ÿ“ LayoutTensor Version](./puzzle_11/layout_tensor.md) - [Puzzle 12: Dot Product](./puzzle_12/puzzle_12.md) - - [๐Ÿ”ฐ Raw Memory Approach](./puzzle_12/raw.md) - - [๐Ÿ“ LayoutTensor Version](./puzzle_12/layout_tensor.md) - [Puzzle 13: 1D Convolution](./puzzle_13/puzzle_13.md) - [๐Ÿ”ฐ Simple Version](./puzzle_13/simple.md) - [โญ Block Boundary Version](./puzzle_13/block_boundary.md) diff --git a/book/src/introduction.md b/book/src/introduction.md index b7f6c99c..ba9f9fec 100644 --- a/book/src/introduction.md +++ b/book/src/introduction.md @@ -156,7 +156,7 @@ This book takes you on a journey from first principles to advanced GPU programmi - Learn thread indexing and block organization - Understand memory access patterns and guards -- Work with both raw pointers and LayoutTensor abstractions +- Work with both raw pointers and TileTensor abstractions - Learn shared memory basics for inter-thread communication **Part II: Debugging GPU programs (Puzzles 9-10) โœ…** @@ -226,7 +226,7 @@ This book 
takes you on a journey from first principles to advanced GPU programmi - Program tensor cores for AI workloads - Learn cluster programming in modern GPUs -The book uniquely challenges the status quo approach by first building understanding with low-level memory manipulation, then gradually transitioning to Mojo's LayoutTensor abstractions. This provides both deep understanding of GPU memory patterns and practical knowledge of modern tensor-based approaches. +The book uniquely challenges the status quo approach by first building understanding with low-level memory manipulation, then gradually transitioning to Mojo's TileTensor abstractions. This provides both deep understanding of GPU memory patterns and practical knowledge of modern tensor-based approaches. ## Ready to get started? diff --git a/book/src/puzzle_01/puzzle_01.md b/book/src/puzzle_01/puzzle_01.md index cc1f175d..8a47ee05 100644 --- a/book/src/puzzle_01/puzzle_01.md +++ b/book/src/puzzle_01/puzzle_01.md @@ -30,8 +30,8 @@ For each position \\(i\\): Start with direct memory manipulation to understand GPU fundamentals. -### [๐Ÿ’ก Preview: Modern Approach with LayoutTensor](./layout_tensor_preview.md) +### [๐Ÿ’ก Preview: Modern Approach with TileTensor](./tile_tensor_preview.md) -See how LayoutTensor simplifies GPU programming with safer, cleaner code. +See how TileTensor simplifies GPU programming with safer, cleaner code. ๐Ÿ’ก **Tip**: Understanding both approaches leads to better appreciation of modern GPU programming patterns. diff --git a/book/src/puzzle_01/layout_tensor_preview.md b/book/src/puzzle_01/tile_tensor_preview.md similarity index 76% rename from book/src/puzzle_01/layout_tensor_preview.md rename to book/src/puzzle_01/tile_tensor_preview.md index 0c411e9d..62b971b8 100644 --- a/book/src/puzzle_01/layout_tensor_preview.md +++ b/book/src/puzzle_01/tile_tensor_preview.md @@ -1,4 +1,4 @@ -## Why consider LayoutTensor? +## Why consider TileTensor? 
Looking at our traditional implementation below, you might notice some potential issues: @@ -30,9 +30,9 @@ idx = (batch * HEIGHT + row) * WIDTH + col idx = (batch * padded_height + row) * padded_width + col ``` -### LayoutTensor preview +### TileTensor preview -[LayoutTensor](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/) will help us handle these cases more elegantly: +[TileTensor](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/) will help us handle these cases more elegantly: ```mojo # Future preview - don't worry about this syntax yet! @@ -40,7 +40,7 @@ output[i, j] = a[i, j] + 10.0 # 2D indexing output[b, i, j] = a[b, i, j] + 10.0 # 3D indexing ``` -We'll learn about LayoutTensor in detail in Puzzle 4, where these concepts become essential. For now, focus on understanding: +We'll learn about TileTensor in detail in Puzzle 4, where these concepts become essential. For now, focus on understanding: - Basic thread indexing - Simple memory access patterns diff --git a/book/src/puzzle_02/puzzle_02.md b/book/src/puzzle_02/puzzle_02.md index 56f7e099..21d50cf5 100644 --- a/book/src/puzzle_02/puzzle_02.md +++ b/book/src/puzzle_02/puzzle_02.md @@ -130,4 +130,4 @@ While this direct indexing works for simple element-wise operations, consider: - What if we need to broadcast one array to another? - How to ensure coalesced access across multiple arrays? -These questions will be addressed when we [introduce LayoutTensor in Puzzle 4](../puzzle_04/introduction_layout_tensor.md). +These questions will be addressed when we [introduce TileTensor in Puzzle 4](../puzzle_04/introduction_tile_tensor.md). diff --git a/book/src/puzzle_03/puzzle_03.md b/book/src/puzzle_03/puzzle_03.md index d1081936..148034cf 100644 --- a/book/src/puzzle_03/puzzle_03.md +++ b/book/src/puzzle_03/puzzle_03.md @@ -159,4 +159,4 @@ if i < height and j < width and k < depth and i >= padding and j >= padding: ... 
``` -These boundary handling patterns will become more elegant when we [learn about LayoutTensor in Puzzle 4](../puzzle_04/introduction_layout_tensor.md), which provides built-in shape management. +These boundary handling patterns will become more elegant when we [learn about TileTensor in Puzzle 4](../puzzle_04/introduction_tile_tensor.md), which provides built-in shape management. diff --git a/book/src/puzzle_04/intro.mojo b/book/src/puzzle_04/intro.mojo index 49a66bde..10b248b9 100644 --- a/book/src/puzzle_04/intro.mojo +++ b/book/src/puzzle_04/intro.mojo @@ -1,15 +1,17 @@ from std.gpu.host import DeviceContext -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major comptime HEIGHT = 2 comptime WIDTH = 3 comptime dtype = DType.float32 -comptime layout = Layout.row_major(HEIGHT, WIDTH) +comptime layout = row_major[HEIGHT, WIDTH]() +comptime LayoutType = type_of(layout) -def kernel[ - dtype: DType, layout: Layout -](tensor: LayoutTensor[dtype, layout, MutAnyOrigin]): +def kernel( + tensor: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], +): print("Before:") print(tensor) tensor[0, 0] += 1 @@ -22,9 +24,9 @@ def main() raises: a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH) a.enqueue_fill(0) - tensor = LayoutTensor[dtype, layout, MutAnyOrigin](a) + tensor = TileTensor(a, layout) # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper - ctx.enqueue_function[kernel[dtype, layout], kernel[dtype, layout]]( + ctx.enqueue_function[kernel, kernel]( tensor, grid_dim=1, block_dim=1 ) diff --git a/book/src/puzzle_04/introduction_layout_tensor.md b/book/src/puzzle_04/introduction_tile_tensor.md similarity index 67% rename from book/src/puzzle_04/introduction_layout_tensor.md rename to book/src/puzzle_04/introduction_tile_tensor.md index 868a6fe0..c01a29d3 100644 --- a/book/src/puzzle_04/introduction_layout_tensor.md +++ b/book/src/puzzle_04/introduction_tile_tensor.md @@ -1,9 
+1,9 @@ -# Introduction to LayoutTensor +# Introduction to TileTensor Let's take a quick break from solving puzzles to preview a powerful abstraction that will make our GPU programming journey more enjoyable: -๐Ÿฅ ... the **[LayoutTensor](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/)**. +๐Ÿฅ ... the **[TileTensor](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/)**. -> ๐Ÿ’ก _This is a motivational overview of LayoutTensor's capabilities. Don't worry about understanding everything now - we'll explore each feature in depth as we progress through the puzzles_. +> ๐Ÿ’ก _This is a motivational overview of TileTensor's capabilities. Don't worry about understanding everything now - we'll explore each feature in depth as we progress through the puzzles_. ## The challenge: Growing complexity @@ -30,9 +30,9 @@ if row < height and col < width: output[idx] = a[idx] + 10.0 ``` -## The solution: A peek at LayoutTensor +## The solution: A peek at TileTensor -LayoutTensor will help us tackle these challenges with elegant solutions. Here's a glimpse of what's coming: +TileTensor will help us tackle these challenges with elegant solutions. Here's a glimpse of what's coming: 1. **Natural Indexing**: Use `tensor[i, j]` instead of manual offset calculations 2. **Flexible Memory Layouts**: Support for row-major, column-major, and tiled organizations @@ -40,34 +40,36 @@ LayoutTensor will help us tackle these challenges with elegant solutions. Here's ## A taste of what's ahead -Let's look at a few examples of what LayoutTensor can do. Don't worry about understanding all the details now - we'll cover each feature thoroughly in upcoming puzzles. +Let's look at a few examples of what TileTensor can do. Don't worry about understanding all the details now - we'll cover each feature thoroughly in upcoming puzzles. 
### Basic usage example

```mojo
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major

# Define layout
comptime HEIGHT = 2
comptime WIDTH = 3
-comptime layout = Layout.row_major(HEIGHT, WIDTH)
+comptime layout = row_major[HEIGHT, WIDTH]()
+comptime LayoutType = type_of(layout)

# Create tensor
-tensor = LayoutTensor[dtype, layout](buffer.unsafe_ptr())
+tensor = TileTensor(buffer, layout)

# Access elements naturally
tensor[0, 0] = 1.0 # First element
tensor[1, 2] = 2.0 # Last element
```

-To learn more about `Layout` and `LayoutTensor`, see these guides from the [Mojo manual](https://docs.modular.com/mojo/manual/)
+To learn more about tile layouts and `TileTensor`, see these guides from the [Mojo manual](https://docs.modular.com/mojo/manual/):

- [Introduction to layouts](https://docs.modular.com/mojo/manual/layout/layouts)
-- [Using LayoutTensor](https://docs.modular.com/mojo/manual/layout/tensors)
+- [Using TileTensor](https://docs.modular.com/mojo/manual/layout/tensors)

## Quick example

-Let's put everything together with a simple example that demonstrates the basics of LayoutTensor:
+Let's put everything together with a simple example that demonstrates the basics of TileTensor:

```mojo
{{#include ./intro.mojo}}
```

@@ -85,28 +87,28 @@ When we run this code with:
```bash -pixi run layout_tensor_intro +pixi run tile_tensor_intro ```
```bash -pixi run -e amd layout_tensor_intro +pixi run -e amd tile_tensor_intro ```
```bash -pixi run -e apple layout_tensor_intro +pixi run -e apple tile_tensor_intro ```
```bash -uv run poe layout_tensor_intro +uv run poe tile_tensor_intro ```
@@ -128,7 +130,7 @@ Let's break down what's happening: 3. Using natural indexing, we modify a single element 4. The change is reflected in our output -This simple example demonstrates key LayoutTensor benefits: +This simple example demonstrates key TileTensor benefits: - Clean syntax for tensor creation and access - Automatic memory layout handling @@ -141,6 +143,6 @@ While this example is straightforward, the same patterns will scale to complex G - Complex tiling strategies - Hardware-accelerated computations -Ready to start your GPU programming journey with LayoutTensor? Let's dive into the puzzles! +Ready to start your GPU programming journey with TileTensor? Let's dive into the puzzles! ๐Ÿ’ก **Tip**: Keep this example in mind as we progress - we'll build upon these fundamental concepts to create increasingly sophisticated GPU programs. diff --git a/book/src/puzzle_04/puzzle_04.md b/book/src/puzzle_04/puzzle_04.md index 58615d62..6acf7cf7 100644 --- a/book/src/puzzle_04/puzzle_04.md +++ b/book/src/puzzle_04/puzzle_04.md @@ -44,10 +44,10 @@ For each position \\((i,j)\\): ### [๐Ÿ”ฐ Raw memory approach](./raw.md) Learn how 2D indexing works with manual memory management. -### [๐Ÿ“š Learn about LayoutTensor](./introduction_layout_tensor.md) +### [๐Ÿ“š Learn about TileTensor](./introduction_tile_tensor.md) Discover a powerful abstraction that simplifies multi-dimensional array operations and memory management on GPU. -### [๐Ÿš€ Modern 2D operations](./layout_tensor.md) -Put LayoutTensor into practice with natural 2D indexing and automatic bounds checking. +### [๐Ÿš€ Modern 2D operations](./tile_tensor.md) +Put TileTensor into practice with natural 2D indexing and automatic bounds checking. -๐Ÿ’ก **Note**: From this puzzle onward, we'll primarily use LayoutTensor for cleaner, safer GPU code. +๐Ÿ’ก **Note**: From this puzzle onward, we'll primarily use TileTensor for cleaner, safer GPU code. 
diff --git a/book/src/puzzle_04/layout_tensor.md b/book/src/puzzle_04/tile_tensor.md similarity index 65% rename from book/src/puzzle_04/layout_tensor.md rename to book/src/puzzle_04/tile_tensor.md index b6a2d431..7f7896e0 100644 --- a/book/src/puzzle_04/layout_tensor.md +++ b/book/src/puzzle_04/tile_tensor.md @@ -1,8 +1,8 @@ -# LayoutTensor Version +# TileTensor Version ## Overview -Implement a kernel that adds 10 to each position of 2D _LayoutTensor_ `a` and stores it in 2D _LayoutTensor_ `output`. +Implement a kernel that adds 10 to each position of 2D _TileTensor_ `a` and stores it in 2D _TileTensor_ `output`. **Note:** _You have more threads than positions_. @@ -10,13 +10,13 @@ Implement a kernel that adds 10 to each position of 2D _LayoutTensor_ `a` and st In this puzzle, you'll learn about: -- Using `LayoutTensor` for 2D array access +- Using `TileTensor` for 2D array access - Direct 2D indexing with `tensor[i, j]` -- Handling bounds checking with `LayoutTensor` +- Handling bounds checking with `TileTensor` -The key insight is that `LayoutTensor` provides a natural 2D indexing interface, abstracting away the underlying memory layout while still requiring bounds checking. +The key insight is that `TileTensor` provides a natural 2D indexing interface, abstracting away the underlying memory layout while still requiring bounds checking. 
-- **2D access**: Natural \\((i,j)\\) indexing with `LayoutTensor` +- **2D access**: Natural \\((i,j)\\) indexing with `TileTensor` - **Memory abstraction**: No manual row-major calculation needed - **Guard condition**: Still need bounds checking in both dimensions - **Thread bounds**: More threads \\((3 \times 3)\\) than tensor elements \\((2 \times 2)\\) @@ -24,10 +24,10 @@ The key insight is that `LayoutTensor` provides a natural 2D indexing interface, ## Code to complete ```mojo -{{#include ../../../problems/p04/p04_layout_tensor.mojo:add_10_2d_layout_tensor}} +{{#include ../../../problems/p04/p04_tile_tensor.mojo:add_10_2d_tile_tensor}} ``` -View full file: problems/p04/p04_layout_tensor.mojo +View full file: problems/p04/p04_tile_tensor.mojo
Tips @@ -55,28 +55,28 @@ To test your solution, run the following command in your terminal:
```bash -pixi run p04_layout_tensor +pixi run p04_tile_tensor ```
```bash -pixi run -e amd p04_layout_tensor +pixi run -e amd p04_tile_tensor ```
```bash -pixi run -e apple p04_layout_tensor +pixi run -e apple p04_tile_tensor ```
```bash -uv run poe p04_layout_tensor +uv run poe p04_tile_tensor ```
@@ -95,7 +95,7 @@ expected: HostBuffer([10.0, 11.0, 12.0, 13.0]) ```mojo -{{#include ../../../solutions/p04/p04_layout_tensor.mojo:add_10_2d_layout_tensor_solution}} +{{#include ../../../solutions/p04/p04_tile_tensor.mojo:add_10_2d_tile_tensor_solution}} ```
@@ -104,7 +104,7 @@ This solution: - Gets 2D thread indices with `row = thread_idx.y`, `col = thread_idx.x` - Guards against out-of-bounds with `if row < size and col < size` -- Uses `LayoutTensor`'s 2D indexing: `output[row, col] = a[row, col] + 10.0` +- Uses `TileTensor`'s 2D indexing: `output[row, col] = a[row, col] + 10.0`
diff --git a/book/src/puzzle_05/layout_tensor.md b/book/src/puzzle_05/layout_tensor.md deleted file mode 100644 index 873be966..00000000 --- a/book/src/puzzle_05/layout_tensor.md +++ /dev/null @@ -1,128 +0,0 @@ -# LayoutTensor Version - -## Overview - -Implement a kernel that broadcast adds 1D LayoutTensor `a` and 1D LayoutTensor `b` and stores it in 2D LayoutTensor `output`. - -**Note:** _You have more threads than positions._ - -## Key concepts - -In this puzzle, you'll learn about: - -- Using `LayoutTensor` for broadcast operations -- Working with different tensor shapes -- Handling 2D indexing with `LayoutTensor` - -The key insight is that `LayoutTensor` allows natural broadcasting through different tensor shapes: \\((1, n)\\) and \\((n, 1)\\) to \\((n,n)\\), while still requiring bounds checking. - -- **Tensor shapes**: Input vectors have shapes \\((1, n)\\) and \\((n, 1)\\) -- **Broadcasting**: Output combines both dimensions to \\((n,n)\\) -- **Guard condition**: Still need bounds checking for output size -- **Thread bounds**: More threads \\((3 \times 3)\\) than tensor elements \\((2 \times 2)\\) - -## Code to complete - -```mojo -{{#include ../../../problems/p05/p05_layout_tensor.mojo:broadcast_add_layout_tensor}} -``` - -View full file: problems/p05/p05_layout_tensor.mojo - -
-Tips - -
- -1. Get 2D indices: `row = thread_idx.y`, `col = thread_idx.x` -2. Add guard: `if row < size and col < size` -3. Inside guard: think about how to broadcast values of `a` and `b` as LayoutTensors - -
-
- -## Running the code - -To test your solution, run the following command in your terminal: - -
-
- - - - -
-
- -```bash -pixi run p05_layout_tensor -``` - -
-
- -```bash -pixi run -e amd p05_layout_tensor -``` - -
-
- -```bash -pixi run -e apple p05_layout_tensor -``` - -
-
- -```bash -uv run poe p05_layout_tensor -``` - -
-
- -Your output will look like this if the puzzle isn't solved yet: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([1.0, 2.0, 11.0, 12.0]) -``` - -## Solution - -
- - -```mojo -{{#include ../../../solutions/p05/p05_layout_tensor.mojo:broadcast_add_layout_tensor_solution}} -``` - -
- -This solution demonstrates key concepts of LayoutTensor broadcasting and GPU thread mapping: - -1. **Thread to matrix mapping** - - - Uses `thread_idx.y` for row access and `thread_idx.x` for column access - - Natural 2D indexing matches the output matrix structure - - Excess threads (3ร—3 grid) are handled by bounds checking - -2. **Broadcasting mechanics** - - Input `a` has shape `(1,n)`: `a[0,col]` broadcasts across rows - - Input `b` has shape `(n,1)`: `b[row,0]` broadcasts across columns - - Output has shape `(n,n)`: Each element is sum of corresponding broadcasts - - ```txt - [ a0 a1 ] + [ b0 ] = [ a0+b0 a1+b0 ] - [ b1 ] [ a0+b1 a1+b1 ] - ``` - -3. **Bounds Checking** - - Guard condition `row < size and col < size` prevents out-of-bounds access - - Handles both matrix bounds and excess threads efficiently - - No need for separate checks for `a` and `b` due to broadcasting - -This pattern forms the foundation for more complex tensor operations we'll explore in later puzzles. -
-
diff --git a/book/src/puzzle_05/puzzle_05.md b/book/src/puzzle_05/puzzle_05.md index 67f534b8..e3409ca9 100644 --- a/book/src/puzzle_05/puzzle_05.md +++ b/book/src/puzzle_05/puzzle_05.md @@ -2,7 +2,7 @@ ## Overview -Implement a kernel that broadcast adds vector `a` and vector `b` and stores it in 2D matrix `output`. +Implement a kernel that broadcast adds 1D TileTensor `a` and 1D TileTensor `b` and stores it in 2D TileTensor `output`. **Broadcasting** in parallel programming refers to the operation where lower-dimensional arrays are automatically expanded to match the shape of higher-dimensional arrays during element-wise operations. Instead of physically replicating data in memory, values are logically repeated across the additional dimensions. For example, adding a 1D vector to each row (or column) of a 2D matrix applies the same vector elements repeatedly without creating multiple copies. @@ -12,17 +12,124 @@ Implement a kernel that broadcast adds vector `a` and vector `b` and stores it i Broadcast visualization ## Key concepts -- Broadcasting vectors to matrix -- 2D thread management -- Mixed dimension operations -- Memory layout patterns -## Implementation approaches +In this puzzle, you'll learn about: -### [๐Ÿ”ฐ Raw memory approach](./raw.md) -Learn how to handle broadcasting with manual memory indexing. +- Broadcasting 1D vectors across different dimensions with `TileTensor` +- Using 2D thread indices to map GPU threads to a 2D output matrix +- Working with different tensor shapes for mixed-dimension operations +- Handling boundary conditions in broadcast patterns -### [๐Ÿ“ LayoutTensor Version](./layout_tensor.md) -Use LayoutTensor to handle mixed-dimension operations. +The key insight is that `TileTensor` allows natural broadcasting through different tensor shapes: \\((1, n)\\) and \\((n, 1)\\) to \\((n,n)\\), while still requiring bounds checking. -๐Ÿ’ก **Note**: Notice how LayoutTensor simplifies broadcasting compared to manual indexing. 
+- **Tensor shapes**: Input vectors have shapes \\((1, n)\\) and \\((n, 1)\\) +- **Broadcasting**: Each element of `a` combines with each element of `b`; output expands both dimensions to \\((n,n)\\) +- **Access patterns**: `a[0, col]` broadcasts horizontally across rows; `b[row, 0]` broadcasts vertically across columns +- **Guard condition**: Still need bounds checking for output size +- **Thread bounds**: More threads \\((3 \times 3)\\) than tensor elements \\((2 \times 2)\\) + +## Code to complete + +```mojo +{{#include ../../../problems/p05/p05.mojo:broadcast_add}} +``` + +View full file: problems/p05/p05.mojo + +
+Tips + +
+ +1. Get 2D indices: `row = thread_idx.y`, `col = thread_idx.x` +2. Add guard: `if row < size and col < size` +3. Inside guard: think about how to broadcast values of `a` and `b` as TileTensors + +
+
+ +## Running the code + +To test your solution, run the following command in your terminal: + +
+
+ + + + +
+
+ +```bash +pixi run p05 +``` + +
+
+ +```bash +pixi run -e amd p05 +``` + +
+
+ +```bash +pixi run -e apple p05 +``` + +
+
+ +```bash +uv run poe p05 +``` + +
+
+ +Your output will look like this if the puzzle isn't solved yet: + +```txt +out: HostBuffer([0.0, 0.0, 0.0, 0.0]) +expected: HostBuffer([1.0, 2.0, 11.0, 12.0]) +``` + +## Solution + +
+ + +```mojo +{{#include ../../../solutions/p05/p05.mojo:broadcast_add_solution}} +``` + +
+ +This solution demonstrates key concepts of TileTensor broadcasting and GPU thread mapping: + +1. **Thread to matrix mapping** + + - Uses `thread_idx.y` for row access and `thread_idx.x` for column access + - Natural 2D indexing matches the output matrix structure + - Excess threads (3ร—3 grid) are handled by bounds checking + +2. **Broadcasting mechanics** + - Input `a` has shape `(1,n)`: `a[0,col]` broadcasts across rows + - Input `b` has shape `(n,1)`: `b[row,0]` broadcasts across columns + - Output has shape `(n,n)`: Each element is sum of corresponding broadcasts + + ```txt + [ a0 a1 ] + [ b0 ] = [ a0+b0 a1+b0 ] + [ b1 ] [ a0+b1 a1+b1 ] + ``` + +3. **Bounds Checking** + - Guard condition `row < size and col < size` prevents out-of-bounds access + - Handles both matrix bounds and excess threads efficiently + - No need for separate checks for `a` and `b` due to broadcasting + +This pattern forms the foundation for more complex tensor operations we'll explore in later puzzles. +
+
diff --git a/book/src/puzzle_05/raw.md b/book/src/puzzle_05/raw.md deleted file mode 100644 index 9ddb7d22..00000000 --- a/book/src/puzzle_05/raw.md +++ /dev/null @@ -1,125 +0,0 @@ -## Overview - -Implement a kernel that broadcast adds vector `a` and vector `b` and stores it in 2D matrix `output`. - -**Note:** _You have more threads than positions._ - -## Key concepts - -In this puzzle, you'll learn about: - -- Broadcasting 1D vectors across different dimensions -- Using 2D thread indices for broadcast operations -- Handling boundary conditions in broadcast patterns - -The key insight is understanding how to map elements from two 1D vectors to create a 2D output matrix through broadcasting, while handling thread bounds correctly. - -- **Broadcasting**: Each element of `a` combines with each element of `b` -- **Thread mapping**: 2D thread grid \\((3 \times 3)\\) for \\(2 \times 2\\) output -- **Vector access**: Different access patterns for `a` and `b` -- **Bounds checking**: Guard against threads outside matrix dimensions - -## Code to complete - -```mojo -{{#include ../../../problems/p05/p05.mojo:broadcast_add}} -``` - -View full file: problems/p05/p05.mojo - -
-Tips - -
- -1. Get 2D indices: `row = thread_idx.y`, `col = thread_idx.x` -2. Add guard: `if row < size and col < size` -3. Inside guard: think about how to broadcast values of `a` and `b` - -
-
- -## Running the code - -To test your solution, run the following command in your terminal: - -
-
- - - - -
-
- -```bash -pixi run p05 -``` - -
-
- -```bash -pixi run -e amd p05 -``` - -
-
- -```bash -pixi run -e apple p05 -``` - -
-
- -```bash -uv run poe p05 -``` - -
-
- -Your output will look like this if the puzzle isn't solved yet: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([1.0, 2.0, 11.0, 12.0]) -``` - -## Solution - -
- - -```mojo -{{#include ../../../solutions/p05/p05.mojo:broadcast_add_solution}} -``` - -
- -This solution demonstrates fundamental GPU broadcasting concepts without LayoutTensor abstraction: - -1. **Thread to matrix mapping** - - Uses `thread_idx.y` for row access and `thread_idx.x` for column access - - Direct mapping from 2D thread grid to output matrix elements - - Handles excess threads (3ร—3 grid) for 2ร—2 output matrix - -2. **Broadcasting mechanics** - - Vector `a` broadcasts horizontally: same `a[col]` used across each row - - Vector `b` broadcasts vertically: same `b[row]` used across each column - - Output combines both vectors through addition - - ```txt - [ a0 a1 ] + [ b0 ] = [ a0+b0 a1+b0 ] - [ b1 ] [ a0+b1 a1+b1 ] - ``` - -3. **Bounds checking** - - Single guard condition `row < size and col < size` handles both dimensions - - Prevents out-of-bounds access for both input vectors and output matrix - - Required due to 3ร—3 thread grid being larger than 2ร—2 data - -Compare this with the LayoutTensor version to see how the abstraction simplifies broadcasting operations while maintaining the same underlying concepts. -
-
diff --git a/book/src/puzzle_06/puzzle_06.md b/book/src/puzzle_06/puzzle_06.md index 61566ebf..7b528a6c 100644 --- a/book/src/puzzle_06/puzzle_06.md +++ b/book/src/puzzle_06/puzzle_06.md @@ -4,6 +4,8 @@ Implement a kernel that adds 10 to each position of vector `a` and stores it in `output`. +A **thread block** (or just **block**) is a group of threads that execute together on a single GPU multiprocessor. All threads in a block share the same shared memory and can synchronize with each other. When data is larger than one block can handle, the GPU schedules multiple blocks โ€” each block independently processes its portion of the data. The global position of a thread is computed from both its position within the block (`thread_idx.x`) and which block it belongs to (`block_idx.x`): `global_i = block_dim.x * block_idx.x + thread_idx.x`. + **Note:** _You have fewer threads per block than the size of a._ Blocks visualization @@ -27,7 +29,7 @@ The key insight is understanding how blocks of threads work together to process View full file: problems/p06/p06.mojo -> Note: The `LayoutTensor` variant of this puzzle is very similar so we leave it to the reader. +> Note: The `TileTensor` variant of this puzzle is very similar so we leave it to the reader.
Tips diff --git a/book/src/puzzle_07/layout_tensor.md b/book/src/puzzle_07/layout_tensor.md deleted file mode 100644 index 95d6e30a..00000000 --- a/book/src/puzzle_07/layout_tensor.md +++ /dev/null @@ -1,158 +0,0 @@ -# LayoutTensor Version - -## Overview - -Implement a kernel that adds 10 to each position of 2D LayoutTensor `a` and stores it in 2D LayoutTensor `output`. - -**Note:** _You have fewer threads per block than the size of `a` in both directions._ - -## Key concepts - -In this puzzle, you'll learn about: - -- Using `LayoutTensor` with multiple blocks -- Handling large matrices with 2D block organization -- Combining block indexing with `LayoutTensor` access - -The key insight is that `LayoutTensor` simplifies 2D indexing while still requiring proper block coordination for large matrices. - -## Configuration - -- **Matrix size**: \\(5 \times 5\\) elements -- **Layout handling**: `LayoutTensor` manages row-major organization -- **Block coordination**: Multiple blocks cover the full matrix -- **2D indexing**: Natural \\((i,j)\\) access with bounds checking -- **Total threads**: \\(36\\) for \\(25\\) elements -- **Thread mapping**: Each thread processes one matrix element - -## Code to complete - -```mojo -{{#include ../../../problems/p07/p07_layout_tensor.mojo:add_10_blocks_2d_layout_tensor}} -``` - -View full file: problems/p07/p07_layout_tensor.mojo - -
-Tips - -
- -1. Calculate global indices: `row = block_dim.y * block_idx.y + thread_idx.y`, `col = block_dim.x * block_idx.x + thread_idx.x` -2. Add guard: `if row < size and col < size` -3. Inside guard: think about how to add 10 to 2D LayoutTensor - -
-
- -## Running the code - -To test your solution, run the following command in your terminal: - -
-
- - - - -
-
- -```bash -pixi run p07_layout_tensor -``` - -
-
- -```bash -pixi run -e amd p07_layout_tensor -``` - -
-
- -```bash -pixi run -e apple p07_layout_tensor -``` - -
-
- -```bash -uv run poe p07_layout_tensor -``` - -
-
- -Your output will look like this if the puzzle isn't solved yet: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0]) -expected: HostBuffer([10.0, 11.0, 12.0, ... , 34.0]) -``` - -## Solution - -
- - -```mojo -{{#include ../../../solutions/p07/p07_layout_tensor.mojo:add_10_blocks_2d_layout_tensor_solution}} -``` - -
- -This solution demonstrates how LayoutTensor simplifies 2D block-based processing: - -1. **2D thread indexing** - - Global row: `block_dim.y * block_idx.y + thread_idx.y` - - Global col: `block_dim.x * block_idx.x + thread_idx.x` - - Maps thread grid to tensor elements: - - ```txt - 5ร—5 tensor with 3ร—3 blocks: - - Block (0,0) Block (1,0) - [(0,0) (0,1) (0,2)] [(0,3) (0,4) * ] - [(1,0) (1,1) (1,2)] [(1,3) (1,4) * ] - [(2,0) (2,1) (2,2)] [(2,3) (2,4) * ] - - Block (0,1) Block (1,1) - [(3,0) (3,1) (3,2)] [(3,3) (3,4) * ] - [(4,0) (4,1) (4,2)] [(4,3) (4,4) * ] - [ * * * ] [ * * * ] - ``` - - (* = thread exists but outside tensor bounds) - -2. **LayoutTensor benefits** - - Natural 2D indexing: `tensor[row, col]` instead of manual offset calculation - - Automatic memory layout optimization - - Example access pattern: - - ```txt - Raw memory: LayoutTensor: - row * size + col tensor[row, col] - (2,1) -> 11 (2,1) -> same element - ``` - -3. **Bounds checking** - - Guard `row < size and col < size` handles: - - Excess threads in partial blocks - - Edge cases at tensor boundaries - - Automatic memory layout handling by LayoutTensor - - 36 threads (2ร—2 blocks of 3ร—3) for 25 elements - -4. **Block coordination** - - Each 3ร—3 block processes part of 5ร—5 tensor - - LayoutTensor handles: - - Memory layout optimization - - Efficient access patterns - - Block boundary coordination - - Cache-friendly data access - -This pattern shows how LayoutTensor simplifies 2D block processing while maintaining optimal memory access patterns and thread coordination. -
-
diff --git a/book/src/puzzle_07/puzzle_07.md b/book/src/puzzle_07/puzzle_07.md index 4a35aef0..ffea68dc 100644 --- a/book/src/puzzle_07/puzzle_07.md +++ b/book/src/puzzle_07/puzzle_07.md @@ -2,7 +2,7 @@ ## Overview -Implement a kernel that adds 10 to each position of matrix `a` and stores it in `output`. +Implement a kernel that adds 10 to each position of 2D TileTensor `a` and stores it in 2D TileTensor `output`. **Note:** _You have fewer threads per block than the size of `a` in both directions._ @@ -11,10 +11,13 @@ Implement a kernel that adds 10 to each position of matrix `a` and stores it in ## Key concepts -- Block-based processing -- Grid-block coordination -- Multi-block indexing -- Memory access patterns +In this puzzle, you'll learn about: + +- Using `TileTensor` with multiple blocks +- Handling large matrices with 2D block organization +- Combining block indexing with `TileTensor` access + +The key insight is that `TileTensor` simplifies 2D indexing while still requiring proper block coordination for large matrices. > ๐Ÿ”‘ **2D thread indexing convention** > @@ -43,12 +46,143 @@ Implement a kernel that adds 10 to each position of matrix `a` and stores it in > - No overlap between blocks > - Efficient memory access patterns -## Implementation approaches +## Configuration + +- **Matrix size**: \\(5 \times 5\\) elements +- **Layout handling**: `TileTensor` manages row-major organization +- **Block coordination**: Multiple blocks cover the full matrix +- **2D indexing**: Natural \\((i,j)\\) access with bounds checking +- **Total threads**: \\(36\\) for \\(25\\) elements +- **Thread mapping**: Each thread processes one matrix element + +## Code to complete + +```mojo +{{#include ../../../problems/p07/p07.mojo:add_10_blocks_2d}} +``` + +View full file: problems/p07/p07.mojo + +
+Tips + +
+ +1. Calculate global indices: `row = block_dim.y * block_idx.y + thread_idx.y`, `col = block_dim.x * block_idx.x + thread_idx.x` +2. Add guard: `if row < size and col < size` +3. Inside guard: think about how to add 10 to 2D TileTensor + +
+
+ +## Running the code + +To test your solution, run the following command in your terminal: + +
+
+ + + + +
+
+ +```bash +pixi run p07 +``` + +
+
+ +```bash +pixi run -e amd p07 +``` + +
+
+ +```bash +pixi run -e apple p07 +``` + +
+
+ +```bash +uv run poe p07 +``` + +
+
+ +Your output will look like this if the puzzle isn't solved yet: + +```txt +out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0]) +expected: HostBuffer([10.0, 11.0, 12.0, ... , 34.0]) +``` + +## Solution + +
+ + +```mojo +{{#include ../../../solutions/p07/p07.mojo:add_10_blocks_2d_solution}} +``` + +
+ +This solution demonstrates how TileTensor simplifies 2D block-based processing: + +1. **2D thread indexing** + - Global row: `block_dim.y * block_idx.y + thread_idx.y` + - Global col: `block_dim.x * block_idx.x + thread_idx.x` + - Maps thread grid to tensor elements: + + ```txt + 5ร—5 tensor with 3ร—3 blocks: + + Block (0,0) Block (1,0) + [(0,0) (0,1) (0,2)] [(0,3) (0,4) * ] + [(1,0) (1,1) (1,2)] [(1,3) (1,4) * ] + [(2,0) (2,1) (2,2)] [(2,3) (2,4) * ] + + Block (0,1) Block (1,1) + [(3,0) (3,1) (3,2)] [(3,3) (3,4) * ] + [(4,0) (4,1) (4,2)] [(4,3) (4,4) * ] + [ * * * ] [ * * * ] + ``` + + (* = thread exists but outside tensor bounds) + +2. **TileTensor benefits** + - Natural 2D indexing: `tensor[row, col]` instead of manual offset calculation + - Automatic memory layout optimization + - Example access pattern: + + ```txt + Raw memory: TileTensor: + row * size + col tensor[row, col] + (2,1) -> 11 (2,1) -> same element + ``` -### [๐Ÿ”ฐ Raw memory approach](./raw.md) -Learn how to handle multi-block operations with manual indexing. +3. **Bounds checking** + - Guard `row < size and col < size` handles: + - Excess threads in partial blocks + - Edge cases at tensor boundaries + - Automatic memory layout handling by TileTensor + - 36 threads (2ร—2 blocks of 3ร—3) for 25 elements -### [๐Ÿ“ LayoutTensor Version](./layout_tensor.md) -Use LayoutTensor features to elegantly handle block-based processing. +4. **Block coordination** + - Each 3ร—3 block processes part of 5ร—5 tensor + - TileTensor handles: + - Memory layout optimization + - Efficient access patterns + - Block boundary coordination + - Cache-friendly data access -๐Ÿ’ก **Note**: See how LayoutTensor simplifies block coordination and memory access patterns. +This pattern shows how TileTensor simplifies 2D block processing while maintaining optimal memory access patterns and thread coordination. +
+
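The block/thread index arithmetic from the solution above can be sanity-checked with a small host-side sketch in plain Python (an illustration only, not Mojo; the constant names `SIZE`, `BLOCK`, and `GRID` are ours):

```python
# Host-side sketch of the 2D block/thread index mapping and bounds guard.
SIZE = 5   # 5x5 tensor
BLOCK = 3  # 3x3 threads per block
GRID = 2   # 2x2 grid of blocks

covered = set()
for block_y in range(GRID):
    for block_x in range(GRID):
        for thread_y in range(BLOCK):
            for thread_x in range(BLOCK):
                # Same arithmetic as block_dim.y * block_idx.y + thread_idx.y, etc.
                row = BLOCK * block_y + thread_y
                col = BLOCK * block_x + thread_x
                if row < SIZE and col < SIZE:  # bounds guard
                    covered.add(row * SIZE + col)  # row-major linear offset

# 36 threads are launched, but exactly the 25 in-bounds elements are covered.
print(len(covered))  # 25
print(2 * SIZE + 1)  # 11, the linear offset of element (2,1)
```

Running the loop confirms that every element is touched exactly once and that the guard silently discards the 11 out-of-bounds threads.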
diff --git a/book/src/puzzle_07/raw.md b/book/src/puzzle_07/raw.md deleted file mode 100644 index 006e5842..00000000 --- a/book/src/puzzle_07/raw.md +++ /dev/null @@ -1,155 +0,0 @@ -## Overview - -Implement a kernel that adds 10 to each position of matrix `a` and stores it in `output`. - -**Note:** _You have fewer threads per block than the size of `a` in both directions._ - -## Key concepts - -In this puzzle, you'll learn about: - -- Working with 2D block and thread arrangements -- Handling matrix data larger than block size -- Converting between 2D and linear memory access - -The key insight is understanding how to coordinate multiple blocks of threads to process a 2D matrix that's larger than a single block's dimensions. - -## Configuration - -- **Matrix size**: \\(5 \times 5\\) elements -- **2D blocks**: Each block processes a \\(3 \times 3\\) region -- **Grid layout**: Blocks arranged in \\(2 \times 2\\) grid -- **Total threads**: \\(36\\) for \\(25\\) elements -- **Memory pattern**: Row-major storage for 2D data -- **Coverage**: Ensuring all matrix elements are processed - -## Code to complete - -```mojo -{{#include ../../../problems/p07/p07.mojo:add_10_blocks_2d}} -``` - -View full file: problems/p07/p07.mojo - -
-Tips - -
- -1. Calculate global indices: `row = block_dim.y * block_idx.y + thread_idx.y`, `col = block_dim.x * block_idx.x + thread_idx.x` -2. Add guard: `if row < size and col < size` -3. Inside guard: think about how to add 10 in row-major way! - -
-
- -## Running the code - -To test your solution, run the following command in your terminal: - -
-
- - - - -
-
- -```bash -pixi run p07 -``` - -
-
- -```bash -pixi run -e amd p07 -``` - -
-
- -```bash -pixi run -e apple p07 -``` - -
-
- -```bash -uv run poe p07 -``` - -
-
- -Your output will look like this if the puzzle isn't solved yet: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0]) -expected: HostBuffer([10.0, 11.0, 12.0, ... , 34.0]) -``` - -## Solution - -
- - -```mojo -{{#include ../../../solutions/p07/p07.mojo:add_10_blocks_2d_solution}} -``` - -
- -This solution demonstrates key concepts of 2D block-based processing with raw memory: - -1. **2D thread indexing** - - Global row: `block_dim.y * block_idx.y + thread_idx.y` - - Global col: `block_dim.x * block_idx.x + thread_idx.x` - - Maps thread grid to matrix elements: - - ```txt - 5ร—5 matrix with 3ร—3 blocks: - - Block (0,0) Block (1,0) - [(0,0) (0,1) (0,2)] [(0,3) (0,4) * ] - [(1,0) (1,1) (1,2)] [(1,3) (1,4) * ] - [(2,0) (2,1) (2,2)] [(2,3) (2,4) * ] - - Block (0,1) Block (1,1) - [(3,0) (3,1) (3,2)] [(3,3) (3,4) * ] - [(4,0) (4,1) (4,2)] [(4,3) (4,4) * ] - [ * * * ] [ * * * ] - ``` - - (* = thread exists but outside matrix bounds) - -2. **Memory layout** - - Row-major linear memory: `index = row * size + col` - - Example for 5ร—5 matrix: - - ```txt - 2D indices: Linear memory: - (2,1) -> 11 [00 01 02 03 04] - [05 06 07 08 09] - [10 11 12 13 14] - [15 16 17 18 19] - [20 21 22 23 24] - ``` - -3. **Bounds checking** - - Guard `row < size and col < size` handles: - - Excess threads in partial blocks - - Edge cases at matrix boundaries - - 2ร—2 block grid with 3ร—3 threads each = 36 threads for 25 elements - -4. **Block coordination** - - Each 3ร—3 block processes part of 5ร—5 matrix - - 2ร—2 grid of blocks ensures full coverage - - Overlapping threads handled by bounds check - - Efficient parallel processing across blocks - -This pattern shows how to handle 2D data larger than block size while maintaining efficient memory access and thread coordination. -
-
diff --git a/book/src/puzzle_08/layout_tensor.md b/book/src/puzzle_08/layout_tensor.md deleted file mode 100644 index e85ec4a7..00000000 --- a/book/src/puzzle_08/layout_tensor.md +++ /dev/null @@ -1,191 +0,0 @@ -## Overview - -Implement a kernel that adds 10 to each position of a 1D LayoutTensor `a` and stores it in 1D LayoutTensor `output`. - -**Note:** _You have fewer threads per block than the size of `a`._ - -## Key concepts - -In this puzzle, you'll learn about: - -- Using LayoutTensor's shared memory features with address_space -- Thread synchronization with shared memory -- Block-local data management with LayoutTensor - -The key insight is how LayoutTensor simplifies shared memory management while maintaining the performance benefits of block-local storage. - -## Configuration - -- Array size: `SIZE = 8` elements -- Threads per block: `TPB = 4` -- Number of blocks: 2 -- Shared memory: `TPB` elements per block - -## Key differences from raw approach - -1. **Memory allocation**: We will use [LayoutTensor](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor) with address_space instead of [stack_allocation](https://docs.modular.com/mojo/std/memory/memory/stack_allocation/) - - ```mojo - # Raw approach - shared = stack_allocation[TPB, Scalar[dtype]]() - - # LayoutTensor approach - shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() - ``` - -2. **Memory access**: Same syntax - - ```mojo - # Raw approach - shared[local_i] = a[global_i] - - # LayoutTensor approach - shared[local_i] = a[global_i] - ``` - -3. **Safety features**: - - - Type safety - - Layout management - - Memory alignment handling - -> **Note**: LayoutTensor handles memory layout, but you still need to manage thread synchronization with `barrier()` when using shared memory. 
- -**Educational Note**: In this specific puzzle, the `barrier()` isn't strictly necessary since each thread only accesses its own shared memory location. However, it's included to teach proper shared memory synchronization patterns for more complex scenarios where threads need to coordinate access to shared data. - -## Code to complete - -```mojo -{{#include ../../../problems/p08/p08_layout_tensor.mojo:add_10_shared_layout_tensor}} -``` - -View full file: problems/p08/p08_layout_tensor.mojo - -
-Tips - -
- -1. Create shared memory with LayoutTensor using address_space parameter -2. Load data with natural indexing: `shared[local_i] = a[global_i]` -3. Synchronize with `barrier()` (educational - not strictly needed here) -4. Process data using shared memory indices -5. Guard against out-of-bounds access - -
-
- -## Running the code - -To test your solution, run the following command in your terminal: - -
-
- - - - -
-
- -```bash -pixi run p08_layout_tensor -``` - -
-
- -```bash -pixi run -e amd p08_layout_tensor -``` - -
-
- -```bash -pixi run -e apple p08_layout_tensor -``` - -
-
- -```bash -uv run poe p08_layout_tensor -``` - -
-
- -Your output will look like this if the puzzle isn't solved yet: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0]) -``` - -## Solution - -
- - -```mojo -{{#include ../../../solutions/p08/p08_layout_tensor.mojo:add_10_shared_layout_tensor_solution}} -``` - -
- -This solution demonstrates how LayoutTensor simplifies shared memory usage while maintaining performance: - -1. **Memory hierarchy with LayoutTensor** - - Global tensors: `a` and `output` (slow, visible to all blocks) - - Shared tensor: `shared` (fast, thread-block local) - - Example for 8 elements with 4 threads per block: - - ```txt - Global tensor a: [1 1 1 1 | 1 1 1 1] # Input: all ones - - Block (0): Block (1): - shared[0..3] shared[0..3] - [1 1 1 1] [1 1 1 1] - ``` - -2. **Thread coordination** - - Load phase with natural indexing: - - ```txt - Thread 0: shared[0] = a[0]=1 Thread 2: shared[2] = a[2]=1 - Thread 1: shared[1] = a[1]=1 Thread 3: shared[3] = a[3]=1 - barrier() โ†“ โ†“ โ†“ โ†“ # Wait for all loads - ``` - - - Process phase: Each thread adds 10 to its shared tensor value - - Result: `output[global_i] = shared[local_i] + 10 = 11` - - **Note**: In this specific case, the `barrier()` isn't strictly necessary since each thread only writes to and reads from its own shared memory location (`shared[local_i]`). However, it's included for educational purposes to demonstrate proper shared memory synchronization patterns that are essential when threads need to access each other's data. - -3. **LayoutTensor benefits** - - Shared memory allocation: - - ```txt - # Clean LayoutTensor API with address_space - shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() - ``` - - - Natural indexing for both global and shared: - - ```txt - Block 0 output: [11 11 11 11] - Block 1 output: [11 11 11 11] - ``` - - - Built-in layout management and type safety - -4. 
**Memory access pattern** - - Load: Global tensor โ†’ Shared tensor (optimized) - - Sync: Same `barrier()` requirement as raw version - - Process: Add 10 to shared values - - Store: Write 11s back to global tensor - -This pattern shows how LayoutTensor maintains the performance benefits of shared memory while providing a more ergonomic API and built-in features. -
-
diff --git a/book/src/puzzle_08/puzzle_08.md b/book/src/puzzle_08/puzzle_08.md
index 2b8ced1e..15081cad 100644
--- a/book/src/puzzle_08/puzzle_08.md
+++ b/book/src/puzzle_08/puzzle_08.md
@@ -2,19 +2,169 @@

## Overview

-Implement a kernel that adds 10 to each position of a vector `a` and stores it in vector `output`.
+Implement a kernel that adds 10 to each position of a 1D TileTensor `a` and stores it in 1D TileTensor `output`.
+
+**Shared memory** is fast, on-chip storage that is visible to all threads within the same block. Unlike global memory (which all blocks can access but is slow), shared memory has latency comparable to the L1 cache. Each block gets its own private shared memory region: threads in one block cannot see the shared memory of another block. Because threads can read and write the same shared memory locations, coordination via `barrier()` is required to prevent one thread from reading a value before another thread has finished writing it.

**Note:** _You have fewer threads per block than the size of `a`._

Shared memory visualization
Shared memory visualization

-## Implementation approaches
+## Key concepts
+
+In this puzzle, you'll learn about:
+
+- Using TileTensor's shared memory features with address_space
+- Thread synchronization with shared memory
+- Block-local data management with TileTensor
+
+The key insight is how TileTensor simplifies shared memory management while maintaining the performance benefits of block-local storage.
+
+## Configuration
+
+- Array size: `SIZE = 8` elements
+- Threads per block: `TPB = 4`
+- Number of blocks: 2
+- Shared memory: `TPB` elements per block
+
+> **Warning**: Each block can only have a _constant_ amount of shared memory that threads in that block can read and write to. The size must be a compile-time constant, not a runtime variable. After writing to shared memory you need to call [barrier](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/) to ensure that every thread has finished writing before any thread reads the shared data.
+ +**Educational Note**: In this specific puzzle, the `barrier()` isn't strictly necessary since each thread only accesses its own shared memory location. However, it's included to teach proper shared memory synchronization patterns for more complex scenarios where threads need to coordinate access to shared data. + +## Code to complete + +```mojo +{{#include ../../../problems/p08/p08.mojo:add_10_shared}} +``` + +View full file: problems/p08/p08.mojo + +
+Tips + +
+ +1. Create shared memory with TileTensor using address_space parameter +2. Load data with natural indexing: `shared[local_i] = a[global_i]` +3. Synchronize with `barrier()` (educational - not strictly needed here) +4. Process data using shared memory indices +5. Guard against out-of-bounds access + +
+
+ +## Running the code + +To test your solution, run the following command in your terminal: + +
+
+ + + + +
+
+ +```bash +pixi run p08 +``` + +
+
+ +```bash +pixi run -e amd p08 +``` + +
+
+ +```bash +pixi run -e apple p08 +``` + +
+
+ +```bash +uv run poe p08 +``` + +
+
+ +Your output will look like this if the puzzle isn't solved yet: + +```txt +out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) +expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0]) +``` + +## Solution + +
+ + +```mojo +{{#include ../../../solutions/p08/p08.mojo:add_10_shared_solution}} +``` + +
+

+This solution demonstrates how TileTensor simplifies shared memory usage while maintaining performance:
+
+1. **Memory hierarchy with TileTensor**
+   - Global tensors: `a` and `output` (slow, visible to all blocks)
+   - Shared tensor: `shared` (fast, thread-block local)
+   - Example for 8 elements with 4 threads per block:
+
+   ```txt
+   Global tensor a: [1 1 1 1 | 1 1 1 1]  # Input: all ones
+
+   Block (0):      Block (1):
+   shared[0..3]    shared[0..3]
+   [1 1 1 1]       [1 1 1 1]
+   ```
+
+2. **Thread coordination**
+   - Load phase with natural indexing:
+
+   ```txt
+   Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
+   Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
+   barrier()   ↓   ↓   ↓   ↓   # Wait for all loads
+   ```
+
+   - Process phase: Each thread adds 10 to its shared tensor value
+   - Result: `output[global_i] = shared[local_i] + 10 = 11`
+
+   **Note**: In this specific case, the `barrier()` isn't strictly necessary since each thread only writes to and reads from its own shared memory location (`shared[local_i]`). However, it's included for educational purposes to demonstrate proper shared memory synchronization patterns that are essential when threads need to access each other's data.
+
+3. **TileTensor benefits**
+   - Shared memory allocation:
+
+   ```txt
+   # Clean TileTensor API with address_space
+   shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]())
+   ```
+
+   - Natural indexing for both global and shared:
+
+   ```txt
+   Block 0 output: [11 11 11 11]
+   Block 1 output: [11 11 11 11]
+   ```

-### [🔰 Raw memory approach](./raw.md)
-Learn how to manually manage shared memory and synchronization.

+   - Built-in layout management and type safety

-### [📝 LayoutTensor Version](./layout_tensor.md)
-Use LayoutTensor's built-in shared memory management features.

+4. **Memory access pattern**
+   - Load: Global tensor → Shared tensor (optimized)
+   - Sync: Same `barrier()` requirement as raw version
+   - Process: Add 10 to shared values
+   - Store: Write 11s back to global tensor

-💡 **Note**: Experience how LayoutTensor simplifies shared memory operations while maintaining performance.

+This pattern shows how TileTensor maintains the performance benefits of shared memory while providing a more ergonomic API and built-in features.
+
+
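The load / barrier / process ordering described above can be simulated on the CPU with Python threads (an analogy for the synchronization pattern only; `threading.Barrier` stands in for the GPU `barrier()`, and this is not the TileTensor API):

```python
import threading

# Simulate one GPU block at a time: TPB threads load into block-local
# "shared" storage, synchronize at a barrier, then each writes its output.
TPB = 4
a = [1.0] * 8        # global input (all ones, as in the example)
output = [0.0] * 8   # global output
shared = [0.0] * TPB # block-local scratch, reused per block
barrier = threading.Barrier(TPB)

def worker(block_idx: int, local_i: int) -> None:
    global_i = TPB * block_idx + local_i
    shared[local_i] = a[global_i]  # load phase: global -> shared
    barrier.wait()                 # all loads finish before anyone proceeds
    if global_i < len(output):     # bounds guard, as in the kernel
        output[global_i] = shared[local_i] + 10.0

for block_idx in range(2):  # two "blocks", run one after the other
    threads = [threading.Thread(target=worker, args=(block_idx, i)) for i in range(TPB)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

print(output)  # [11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0]
```

Removing the `barrier.wait()` here would not change the result for this particular access pattern, which mirrors the educational note above: the barrier matters as soon as threads read each other's slots.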
diff --git a/book/src/puzzle_08/raw.md b/book/src/puzzle_08/raw.md deleted file mode 100644 index 7729bf8a..00000000 --- a/book/src/puzzle_08/raw.md +++ /dev/null @@ -1,166 +0,0 @@ -## Overview - -Implement a kernel that adds 10 to each position of a vector `a` and stores it in `output`. - -**Note:** _You have fewer threads per block than the size of `a`._ - -## Key concepts - -In this puzzle, you'll learn about: - -- Using shared memory within thread blocks -- Synchronizing threads with barriers -- Managing block-local data storage - -The key insight is understanding how shared memory provides fast, block-local storage that all threads in a block can access, requiring careful coordination between threads. - -## Configuration - -- Array size: `SIZE = 8` elements -- Threads per block: `TPB = 4` -- Number of blocks: 2 -- Shared memory: `TPB` elements per block - -Notes: - -- **Shared memory**: Fast storage shared by threads in a block -- **Thread sync**: Coordination using `barrier()` -- **Memory scope**: Shared memory only visible within block -- **Access pattern**: Local vs global indexing - -> **Warning**: Each block can only have a _constant_ amount of shared memory that threads in that block can read and write to. This needs to be a literal python constant, not a variable. After writing to shared memory you need to call [barrier](https://docs.modular.com/mojo/std/gpu/sync/sync/barrier/) to ensure that threads do not cross. - -**Educational Note**: In this specific puzzle, the `barrier()` isn't strictly necessary since each thread only accesses its own shared memory location. However, it's included to teach proper shared memory synchronization patterns for more complex scenarios where threads need to coordinate access to shared data. - -## Code to complete - -```mojo -{{#include ../../../problems/p08/p08.mojo:add_10_shared}} -``` - -View full file: problems/p08/p08.mojo - -
-Tips - -
- -1. Wait for shared memory load with `barrier()` (educational - not strictly needed here) -2. Use `local_i` to access shared memory: `shared[local_i]` -3. Use `global_i` for output: `output[global_i]` -4. Add guard: `if global_i < size` - -
-
- -## Running the code - -To test your solution, run the following command in your terminal: - -
-
- - - - -
-
- -```bash -pixi run p08 -``` - -
-
- -```bash -pixi run -e amd p08 -``` - -
-
- -```bash -pixi run -e apple p08 -``` - -
-
- -```bash -uv run poe p08 -``` - -
-
- -Your output will look like this if the puzzle isn't solved yet: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0]) -``` - -## Solution - -
- - -```mojo -{{#include ../../../solutions/p08/p08.mojo:add_10_shared_solution}} -``` - -
- -This solution demonstrates key concepts of shared memory usage in GPU programming: - -1. **Memory hierarchy** - - Global memory: `a` and `output` arrays (slow, visible to all blocks) - - Shared memory: `shared` array (fast, thread-block local) - - Example for 8 elements with 4 threads per block: - - ```txt - Global array a: [1 1 1 1 | 1 1 1 1] # Input: all ones - - Block (0): Block (1): - shared[0..3] shared[0..3] - [1 1 1 1] [1 1 1 1] - ``` - -2. **Thread coordination** - - Load phase: - - ```txt - Thread 0: shared[0] = a[0]=1 Thread 2: shared[2] = a[2]=1 - Thread 1: shared[1] = a[1]=1 Thread 3: shared[3] = a[3]=1 - barrier() โ†“ โ†“ โ†“ โ†“ # Wait for all loads - ``` - - - Process phase: Each thread adds 10 to its shared memory value - - Result: `output[i] = shared[local_i] + 10 = 11` - - **Note**: In this specific case, the `barrier()` isn't strictly necessary since each thread only writes to and reads from its own shared memory location (`shared[local_i]`). However, it's included for educational purposes to demonstrate proper shared memory synchronization patterns that are essential when threads need to access each other's data. - -3. **Index mapping** - - Global index: `block_dim.x * block_idx.x + thread_idx.x` - - ```txt - Block 0 output: [11 11 11 11] - Block 1 output: [11 11 11 11] - ``` - - - Local index: `thread_idx.x` for shared memory access - - ```txt - Both blocks process: 1 + 10 = 11 - ``` - -4. **Memory access pattern** - - Load: Global โ†’ Shared (coalesced reads of 1s) - - Sync: `barrier()` ensures all loads complete - - Process: Add 10 to shared values - - Store: Write 11s back to global memory - -This pattern shows how to use shared memory to optimize data access while maintaining thread coordination within blocks. -
-
diff --git a/book/src/puzzle_09/puzzle_09.md b/book/src/puzzle_09/puzzle_09.md index 805aee36..6715ba58 100644 --- a/book/src/puzzle_09/puzzle_09.md +++ b/book/src/puzzle_09/puzzle_09.md @@ -78,7 +78,7 @@ This puzzle takes you through a carefully designed progression from basic debugg **Logic bug investigation** - Debug a program with wrong results -- Investigate LayoutTensor-based algorithmic errors +- Investigate TileTensor-based algorithmic errors - Learn execution flow analysis when variables are optimized out - Learn loop boundary analysis and iteration counting - Practice pattern recognition in incorrect results diff --git a/book/src/puzzle_09/second_case.md b/book/src/puzzle_09/second_case.md index 5827e8e6..eaf7d2e0 100644 --- a/book/src/puzzle_09/second_case.md +++ b/book/src/puzzle_09/second_case.md @@ -8,7 +8,7 @@ Building on your [crash debugging skills from the First Case](./first_case.md), - **[First Case](./first_case.md)**: Clear crash signals (`CUDA_ERROR_ILLEGAL_ADDRESS`) guided your investigation - **Second Case**: No crashes, no error messages - just subtly wrong results that require detective work -This intermediate-level debugging challenge covers investigating **algorithmic errors** using `LayoutTensor` operations, where the program runs successfully but produces wrong output - a much more common (and trickier) real-world debugging scenario. +This intermediate-level debugging challenge covers investigating **algorithmic errors** using `TileTensor` operations, where the program runs successfully but produces wrong output - a much more common (and trickier) real-world debugging scenario. **Prerequisites**: Complete [Mojo GPU Debugging Essentials](./essentials.md) and [Detective Work: First Case](./first_case.md) to understand CUDA-GDB workflow and systematic debugging techniques. 
Make sure you run the setup: @@ -20,7 +20,7 @@ pixi run -e nvidia setup-cuda-gdb In this debugging challenge, you'll learn about: -- **LayoutTensor debugging**: Investigating structured data access patterns +- **TileTensor debugging**: Investigating structured data access patterns - **Logic bug detection**: Finding algorithmic errors that don't crash - **Loop boundary analysis**: Understanding iteration count problems - **Result pattern analysis**: Using output data to trace back to root causes @@ -155,14 +155,14 @@ Each position should sum its neighbors: [left + center + right] CUDA thread hit application kernel entry function breakpoint, p09_process_sliding_window_... <<<(1,1,1),(4,1,1)>>> (output=..., input=...) at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:30 -30 input: LayoutTensor[mut=False, dtype, vector_layout], +30 input: TileTensor[mut=False, dtype, vector_layout], ``` #### Step 4: Navigate to the main logic ```bash (cuda-gdb) n -29 output: LayoutTensor[mut=True, dtype, vector_layout], +29 output: TileTensor[mut=True, dtype, vector_layout], (cuda-gdb) n 32 thread_id = thread_idx.x (cuda-gdb) n @@ -190,7 +190,7 @@ Cannot access memory at address 0x0 Attempt to take address of value not located in memory. ``` -**โŒ Problem**: Direct LayoutTensor indexing doesn't work. +**โŒ Problem**: Direct TileTensor indexing doesn't work. ```bash (cuda-gdb) p a.ptr[0] @@ -199,7 +199,7 @@ $2 = {0} $3 = {{0}, {1}, {2}, {3}} ``` -**๐ŸŽฏ BREAKTHROUGH**: `a.ptr[0]@4` shows the full input array! This is how we can inspect LayoutTensor data. +**๐ŸŽฏ BREAKTHROUGH**: `a.ptr[0]@4` shows the full input array! This is how we can inspect TileTensor data. 
### Phase 3: The critical loop investigation

@@ -353,9 +353,9 @@

 for offset in range(ITER):  # ← Only 2 iterations: [0, 1]

- **Host output patterns** provide crucial debugging clues
- **Source code reasoning** complements limited debugger capabilities

-**LayoutTensor Debugging**:
+**TileTensor Debugging**:

-- Even with LayoutTensor abstractions, underlying algorithmic bugs still manifest
+- Even with TileTensor abstractions, underlying algorithmic bugs still manifest
- Focus on the algorithm logic rather than trying to inspect tensor contents
- Use systematic reasoning to trace what each thread should vs actually accesses

diff --git a/book/src/puzzle_09/third_case.md b/book/src/puzzle_09/third_case.md
index ff6ce70d..94b87aa2 100644
--- a/book/src/puzzle_09/third_case.md
+++ b/book/src/puzzle_09/third_case.md
@@ -9,7 +9,7 @@

- **[Second Case](./second_case.md)**: Program produces wrong results → Analyze patterns → Find logic bugs
- **Third Case**: Program hangs forever → Investigate thread states → Find coordination bugs

-This advanced-level debugging challenge teaches you to investigate **thread coordination failures** using shared memory, LayoutTensor operations, and barrier synchronization - combining all the systematic investigation skills from the previous cases.
+This advanced-level debugging challenge teaches you to investigate **thread coordination failures** using shared memory, TileTensor operations, and barrier synchronization - combining all the systematic investigation skills from the previous cases.

**Prerequisites**: Complete [Mojo GPU Debugging Essentials](./essentials.md), [Detective Work: First Case](./first_case.md), and [Detective Work: Second Case](./second_case.md) to understand CUDA-GDB workflow, variable inspection limitations, and systematic debugging approaches.
Make sure you run the setup: @@ -21,7 +21,7 @@ pixi run -e nvidia setup-cuda-gdb In this debugging challenge, you'll learn about: - **Barrier deadlock detection**: Identifying when threads wait forever at synchronization points -- **Shared memory coordination**: Understanding thread cooperation patterns with LayoutTensor +- **Shared memory coordination**: Understanding thread cooperation patterns with TileTensor - **Conditional execution analysis**: Debugging when some threads take different code paths - **Thread coordination debugging**: Using CUDA-GDB to analyze multi-thread synchronization failures @@ -146,7 +146,7 @@ Waiting for GPU computation to complete... CUDA thread hit application kernel entry function breakpoint, p09_collaborative_filter_Orig6A6AcB6A6A_1882ca334fc2d34b2b9c4fa338df6c07<<<(1,1,1),(4,1,1)>>> ( output=..., a=...) at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:56 -56 a: LayoutTensor[mut=False, dtype, vector_layout], +56 a: TileTensor[mut=False, dtype, vector_layout], ``` **๐Ÿ” Key Observations**: @@ -158,7 +158,7 @@ CUDA thread hit application kernel entry function breakpoint, p09_collaborative_ #### Step 4: navigate through initialization ```bash (cuda-gdb) n -55 output: LayoutTensor[mut=True, dtype, vector_layout], +55 output: TileTensor[mut=True, dtype, vector_layout], (cuda-gdb) n 58 thread_id = thread_idx.x (cuda-gdb) n @@ -289,13 +289,13 @@ if thread_id < SIZE - 1: # Not all threads enter **The Fix**: Move the barrier outside the conditional block: ```mojo def collaborative_filter( - output: LayoutTensor[mut=True, dtype, vector_layout], - a: LayoutTensor[mut=False, dtype, vector_layout], + output: TileTensor[mut=True, dtype, vector_layout], + a: TileTensor[mut=False, dtype, vector_layout], ): thread_id = thread_idx.x - shared_workspace = LayoutTensor[ + shared_workspace = TileTensor[ dtype, - Layout.row_major(SIZE-1), + row_major[SIZE-1](), MutAnyOrigin, address_space = AddressSpace.SHARED, ].stack_allocation() @@ 
-339,7 +339,7 @@ def collaborative_filter( - **Barrier rule**: ALL threads in a block must reach the SAME barrier - **Conditional execution pitfalls**: Any if-statement can cause thread divergence - **Shared memory coordination**: Requires careful barrier placement for correct synchronization -- **LayoutTensor doesn't prevent deadlocks**: Higher-level abstractions still need correct synchronization +- **TileTensor doesn't prevent deadlocks**: Higher-level abstractions still need correct synchronization **๐Ÿ’ก Key Insight**: Barrier deadlocks are among the hardest GPU bugs to debug because: - **No visible error** - just infinite waiting diff --git a/book/src/puzzle_10/memcheck.md b/book/src/puzzle_10/memcheck.md index 697825f6..7ccf050a 100644 --- a/book/src/puzzle_10/memcheck.md +++ b/book/src/puzzle_10/memcheck.md @@ -6,13 +6,13 @@ Learn how to detect memory violations that can silently corrupt GPU programs, ev **Key insight**: A GPU program can produce "correct" results while simultaneously performing illegal memory accesses. -**Prerequisites**: Understanding of [Puzzle 4 LayoutTensor](../puzzle_04/introduction_layout_tensor.md) and basic GPU memory concepts. +**Prerequisites**: Understanding of [Puzzle 4 TileTensor](../puzzle_04/introduction_tile_tensor.md) and basic GPU memory concepts. ## The silent memory bug discovery ### Test passes, but is my code actually correct? 
-Let's start with a seemingly innocent program that appears to work perfectly (this is [Puzzle 04](../puzzle_04/layout_tensor.md) without guards): +Let's start with a seemingly innocent program that appears to work perfectly (this is [Puzzle 04](../puzzle_04/tile_tensor.md) without guards): ```mojo {{#include ../../../problems/p10/p10.mojo:add_10_2d_no_guard}} @@ -151,10 +151,10 @@ The program has **7 total errors** despite passing all tests: ### The solution -As we saw in [Puzzle 04](../puzzle_04/layout_tensor.md), we need to bound-check as follows: +As we saw in [Puzzle 04](../puzzle_04/tile_tensor.md), we need to bound-check as follows: ```mojo -{{#include ../../../solutions/p04/p04_layout_tensor.mojo:add_10_2d_layout_tensor_solution}} +{{#include ../../../solutions/p04/p04_tile_tensor.mojo:add_10_2d_tile_tensor_solution}} ``` The fix is simple: **always validate thread indices against data dimensions** before accessing memory. diff --git a/book/src/puzzle_10/puzzle_10.md b/book/src/puzzle_10/puzzle_10.md index 19822550..5d990c61 100644 --- a/book/src/puzzle_10/puzzle_10.md +++ b/book/src/puzzle_10/puzzle_10.md @@ -123,7 +123,7 @@ But like any good detective, you'll learn to: - GPU programming concepts from Puzzles 1-8 (memory management, thread coordination, barriers) - **[Compatible NVIDIA GPU hardware](https://docs.modular.com/max/faq#gpu-requirements)** - Environment setup with `pixi` package manager for accessing `compute-sanitizer` -- **Prior puzzles**: Familiarity with [Puzzle 4](../puzzle_04/introduction_layout_tensor.md) and [Puzzle 8](../puzzle_08/layout_tensor.md) are recommended +- **Prior puzzles**: Familiarity with [Puzzle 4](../puzzle_04/introduction_tile_tensor.md) and [Puzzle 8](../puzzle_08/tile_tensor.md) are recommended **What you'll gain**: diff --git a/book/src/puzzle_11/layout_tensor.md b/book/src/puzzle_11/layout_tensor.md deleted file mode 100644 index a97dceea..00000000 --- a/book/src/puzzle_11/layout_tensor.md +++ /dev/null @@ -1,184 
+0,0 @@ -## Overview - -Implement a kernel that compute the running sum of the last 3 positions of 1D LayoutTensor `a` and stores it in 1D LayoutTensor `output`. - -**Note:** _You have 1 thread per position. You only need 1 global read and 1 global write per thread._ - -## Key concepts - -In this puzzle, you'll learn about: - -- Using LayoutTensor for sliding window operations -- Managing shared memory with LayoutTensor address_space that we saw in [puzzle_08](../puzzle_08/layout_tensor.md) -- Efficient neighbor access patterns -- Boundary condition handling - -The key insight is how LayoutTensor simplifies shared memory management while maintaining efficient window-based operations. - -## Configuration - -- Array size: `SIZE = 8` elements -- Threads per block: `TPB = 8` -- Window size: 3 elements -- Shared memory: `TPB` elements - -Notes: - -- **LayoutTensor allocation**: Use `LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` -- **Window access**: Natural indexing for 3-element windows -- **Edge handling**: Special cases for first two positions -- **Memory pattern**: One shared memory load per thread - -## Code to complete - -```mojo -{{#include ../../../problems/p11/p11_layout_tensor.mojo:pooling_layout_tensor}} -``` - -View full file: problems/p11/p11_layout_tensor.mojo - -
-Tips - -
- -1. Create shared memory with LayoutTensor using address_space -2. Load data with natural indexing: `shared[local_i] = a[global_i]` -3. Handle special cases for first two elements -4. Use shared memory for window operations -5. Guard against out-of-bounds access - -
-
- -## Running the code - -To test your solution, run the following command in your terminal: - -
-
- - - - -
-
- -```bash -pixi run p11_layout_tensor -``` - -
-
- -```bash -pixi run -e amd p11_layout_tensor -``` - -
-
- -```bash -pixi run -e apple p11_layout_tensor -``` - -
-
- -```bash -uv run poe p11_layout_tensor -``` - -
-
- -Your output will look like this if the puzzle isn't solved yet: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]) -``` - -## Solution - -
- - -```mojo -{{#include ../../../solutions/p11/p11_layout_tensor.mojo:pooling_layout_tensor_solution}} -``` - -
- -The solution implements a sliding window sum using LayoutTensor with these key steps: - -1. **Shared memory setup** - - LayoutTensor creates block-local storage with address_space: - - ```txt - shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() - ``` - - - Each thread loads one element: - - ```txt - Input array: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] - Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] - ``` - - - `barrier()` ensures all data is loaded - -2. **Boundary cases** - - Position 0: Single element - - ```txt - output[0] = shared[0] = 0.0 - ``` - - - Position 1: Sum of first two elements - - ```txt - output[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0 - ``` - -3. **Main window operation** - - For positions 2 and beyond: - - ```txt - Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0 - Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0 - Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0 - ... - ``` - - - Natural indexing with LayoutTensor: - - ```txt - # Sliding window of 3 elements - window_sum = shared[i-2] + shared[i-1] + shared[i] - ``` - -4. **Memory access pattern** - - One global read per thread into shared tensor - - Efficient neighbor access through shared memory - - LayoutTensor benefits: - - Automatic bounds checking - - Natural window indexing - - Layout-aware memory access - - Type safety throughout - -This approach combines the performance of shared memory with LayoutTensor's safety and ergonomics: - -- Minimizes global memory access -- Simplifies window operations -- Handles boundaries cleanly -- Maintains coalesced access patterns - -The final output shows the cumulative window sums: - -```txt -[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0] -``` - -
-
diff --git a/book/src/puzzle_11/puzzle_11.md b/book/src/puzzle_11/puzzle_11.md index 4d82b43d..ff52a3ad 100644 --- a/book/src/puzzle_11/puzzle_11.md +++ b/book/src/puzzle_11/puzzle_11.md @@ -2,19 +2,190 @@ ## Overview -Implement a kernel that computes the running sum of the last 3 positions of vector `a` and stores it in vector `output`. +Implement a kernel that computes the running sum of the last 3 positions of 1D TileTensor `a` and stores it in 1D TileTensor `output`. + +**Pooling** is an operation that condenses a region of values into a single summary value, such as their sum, maximum, or average. A **sliding window** applies this condensation repeatedly by moving a fixed-size window one step at a time across the input, producing one output value per window position. Here the window is 3 elements wide and the summary function is a sum, so each output element equals the sum of the current element and the two preceding it (with special cases at the boundaries where fewer than 3 elements are available). **Note:** _You have 1 thread per position. You only need 1 global read and 1 global write per thread._ Pooling visualization Pooling visualization -## Implementation approaches +## Key concepts + +In this puzzle, you'll learn about: + +- Using TileTensor for sliding window operations +- Managing shared memory with the TileTensor address_space parameter that we saw in [puzzle 8](../puzzle_08/puzzle_08.md) +- Efficient neighbor access patterns +- Boundary condition handling + +The key insight is how TileTensor simplifies shared memory management while maintaining efficient window-based operations. 
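+The windowed sum can be sketched on the CPU in plain Python; this is only an illustrative reference for the math (the name `pool_sum` is our own), not the Mojo kernel you will write:
+
+```python
+# CPU reference for the 3-element sliding-window sum (illustrative only;
+# the puzzle implements this as a parallel Mojo GPU kernel).
+def pool_sum(a, window=3):
+    out = []
+    for i in range(len(a)):
+        lo = max(0, i - window + 1)   # near the start, fewer than 3 elements exist
+        out.append(sum(a[lo:i + 1]))
+    return out
+
+print(pool_sum([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]))
+# [0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
+```
+
+Each output position reads a window ending at itself, which is why the GPU version stages the block's data in shared memory: neighboring threads reuse the same elements.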
+ +## Configuration + +- Array size: `SIZE = 8` elements +- Threads per block: `TPB = 8` +- Window size: 3 elements +- Shared memory: `TPB` elements + +Notes: + +- **TileTensor allocation**: Use `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]())` +- **Window access**: Natural indexing for 3-element windows +- **Edge handling**: Special cases for first two positions +- **Memory pattern**: One shared memory load per thread + +## Code to complete + +```mojo +{{#include ../../../problems/p11/p11.mojo:pooling}} +``` + +View full file: problems/p11/p11.mojo + +
+Tips + +
+ +1. Create shared memory with TileTensor using address_space +2. Load data with natural indexing: `shared[local_i] = a[global_i]` +3. Handle special cases for first two elements +4. Use shared memory for window operations +5. Guard against out-of-bounds access + +
+
+ +## Running the code + +To test your solution, run the following command in your terminal: + +
+
+ + + + +
+
+ +```bash +pixi run p11 +``` + +
+
+ +```bash +pixi run -e amd p11 +``` + +
+
+ +```bash +pixi run -e apple p11 +``` + +
+
+ +```bash +uv run poe p11 +``` + +
+
+ +Your output will look like this if the puzzle isn't solved yet: + +```txt +out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) +expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]) +``` + +## Solution + +
+ + +```mojo +{{#include ../../../solutions/p11/p11.mojo:pooling_solution}} +``` + +
+ +The solution implements a sliding window sum using TileTensor with these key steps: + +1. **Shared memory setup** + - TileTensor creates block-local storage with address_space: + + ```txt + shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]()) + ``` + + - Each thread loads one element: + + ```txt + Input array: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] + Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] + ``` + + - `barrier()` ensures all data is loaded + +2. **Boundary cases** + - Position 0: Single element + + ```txt + output[0] = shared[0] = 0.0 + ``` + + - Position 1: Sum of first two elements + + ```txt + output[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0 + ``` + +3. **Main window operation** + - For positions 2 and beyond: + + ```txt + Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0 + Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0 + Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0 + ... + ``` + + - Natural indexing with TileTensor: + + ```txt + # Sliding window of 3 elements + window_sum = shared[i-2] + shared[i-1] + shared[i] + ``` + +4. **Memory access pattern** + - One global read per thread into shared tensor + - Efficient neighbor access through shared memory + - TileTensor benefits: + - Automatic bounds checking + - Natural window indexing + - Layout-aware memory access + - Type safety throughout + +This approach combines the performance of shared memory with TileTensor's safety and ergonomics: + +- Minimizes global memory access +- Simplifies window operations +- Handles boundaries cleanly +- Maintains coalesced access patterns -### [๐Ÿ”ฐ Raw memory approach](./raw.md) -Learn how to implement sliding window operations with manual memory management and synchronization. 
+The final output shows the cumulative window sums: -### [๐Ÿ“ LayoutTensor Version](./layout_tensor.md) -Use LayoutTensor's features for efficient window-based operations and shared memory management. +```txt +[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0] +``` -๐Ÿ’ก **Note**: See how LayoutTensor simplifies sliding window operations while maintaining efficient memory access patterns. +
+
diff --git a/book/src/puzzle_11/raw.md b/book/src/puzzle_11/raw.md deleted file mode 100644 index 1bf27ecd..00000000 --- a/book/src/puzzle_11/raw.md +++ /dev/null @@ -1,174 +0,0 @@ -## Overview - -Implement a kernel that compute the running sum of the last 3 positions of vector `a` and stores it in vector `output`. - -**Note:** _You have 1 thread per position. You only need 1 global read and 1 global write per thread._ - -## Key concepts - -In this puzzle, you'll learn about: - -- Using shared memory for sliding window operations -- Handling boundary conditions in pooling -- Coordinating thread access to neighboring elements - -The key insight is understanding how to efficiently access a window of elements using shared memory, with special handling for the first elements in the sequence. - -## Configuration - -- Array size: `SIZE = 8` elements -- Threads per block: `TPB = 8` -- Window size: 3 elements -- Shared memory: `TPB` elements - -Notes: - -- **Window access**: Each output depends on up to 3 previous elements -- **Edge handling**: First two positions need special treatment -- **Memory pattern**: One shared memory load per thread -- **Thread sync**: Coordination before window operations - -## Code to complete - -```mojo -{{#include ../../../problems/p11/p11.mojo:pooling}} -``` - -View full file: problems/p11/p11.mojo - -
-Tips - -
- -1. Load data and call `barrier()` -2. Special cases: `output[0] = shared[0]`, `output[1] = shared[0] + shared[1]` -3. General case: `if 1 < global_i < size` -4. Sum three elements: `shared[local_i - 2] + shared[local_i - 1] + shared[local_i]` - -
-
- -## Running the code - -To test your solution, run the following command in your terminal: - -
-
- - - - -
-
- -```bash -pixi run p11 -``` - -
-
- -```bash -pixi run -e amd p11 -``` - -
-
- -```bash -pixi run -e apple p11 -``` - -
-
- -```bash -uv run poe p11 -``` - -
-
- -Your output will look like this if the puzzle isn't solved yet: - -```txt -out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) -expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]) -``` - -## Solution - -
- - -```mojo -{{#include ../../../solutions/p11/p11.mojo:pooling_solution}} -``` - -
- -The solution implements a sliding window sum using shared memory with these key steps: - -1. **Shared memory setup** - - Allocates `TPB` elements in shared memory: - - ```txt - Input array: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] - Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0] - ``` - - - Each thread loads one element from global memory - - `barrier()` ensures all data is loaded - -2. **Boundary cases** - - Position 0: Single element - - ```txt - output[0] = shared[0] = 0.0 - ``` - - - Position 1: Sum of first two elements - - ```txt - output[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0 - ``` - -3. **Main window operation** - - For positions 2 and beyond: - - ```txt - Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0 - Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0 - Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0 - ... - ``` - - - Window calculation using local indices: - - ```txt - # Sliding window of 3 elements - window_sum = shared[i-2] + shared[i-1] + shared[i] - ``` - -4. **Memory access pattern** - - One global read per thread into shared memory - - One global write per thread from shared memory - - Uses shared memory for efficient neighbor access - - Maintains coalesced memory access pattern - -This approach optimizes performance through: - -- Minimal global memory access -- Fast shared memory neighbor lookups -- Clean boundary handling -- Efficient memory coalescing - -The final output shows the cumulative window sums: - -```txt -[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0] -``` - -
-
diff --git a/book/src/puzzle_12/layout_tensor.md b/book/src/puzzle_12/layout_tensor.md deleted file mode 100644 index 3038bef7..00000000 --- a/book/src/puzzle_12/layout_tensor.md +++ /dev/null @@ -1,228 +0,0 @@ -## Overview - -Implement a kernel that computes the dot product of 1D LayoutTensor `a` and 1D LayoutTensor `b` and stores it in 1D LayoutTensor `output` (single number). - -**Note:** _You have 1 thread per position. You only need 2 global reads per thread and 1 global write per thread block._ - -## Key concepts - -This puzzle covers: - -- Similar to the [puzzle 8](../puzzle_08/layout_tensor.md) and [puzzle 11](../puzzle_11/layout_tensor.md), implementing parallel reduction with LayoutTensor -- Managing shared memory using LayoutTensor with address_space -- Coordinating threads for collective operations -- Using layout-aware tensor operations - -The key insight is how LayoutTensor simplifies memory management while maintaining efficient parallel reduction patterns. - -## Configuration - -- Vector size: `SIZE = 8` elements -- Threads per block: `TPB = 8` -- Number of blocks: 1 -- Output size: 1 element -- Shared memory: `TPB` elements - -Notes: - -- **LayoutTensor allocation**: Use `LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` -- **Element access**: Natural indexing with bounds checking -- **Layout handling**: Separate layouts for input and output -- **Thread coordination**: Same synchronization patterns with `barrier()` - -## Code to complete - -```mojo -{{#include ../../../problems/p12/p12_layout_tensor.mojo:dot_product_layout_tensor}} -``` - -View full file: problems/p12/p12_layout_tensor.mojo - -
-Tips - -
- -1. Create shared memory with LayoutTensor using address_space -2. Store `a[global_i] * b[global_i]` in `shared[local_i]` -3. Use parallel reduction pattern with `barrier()` -4. Let thread 0 write final result to `output[0]` - -
-
- -## Running the code - -To test your solution, run the following command in your terminal: - -
-
- - - - -
-
- -```bash -pixi run p12_layout_tensor -``` - -
-
- -```bash -pixi run -e amd p12_layout_tensor -``` - -
-
- -```bash -pixi run -e apple p12_layout_tensor -``` - -
-
- -```bash -uv run poe p12_layout_tensor -``` - -
-
- -Your output will look like this if the puzzle isn't solved yet: - -```txt -out: HostBuffer([0.0]) -expected: HostBuffer([140.0]) -``` - -## Solution - -
- - -```mojo -{{#include ../../../solutions/p12/p12_layout_tensor.mojo:dot_product_layout_tensor_solution}} -``` - -
- -The solution implements a parallel reduction for dot product using LayoutTensor. Here's the detailed breakdown: - -### Phase 1: Element-wise Multiplication - -Each thread performs one multiplication with natural indexing: - -```mojo -shared[local_i] = a[global_i] * b[global_i] -``` - -### Phase 2: Parallel Reduction - -Tree-based reduction with layout-aware operations: - -```txt -Initial: [0*0 1*1 2*2 3*3 4*4 5*5 6*6 7*7] - = [0 1 4 9 16 25 36 49] - -Step 1: [0+16 1+25 4+36 9+49 16 25 36 49] - = [16 26 40 58 16 25 36 49] - -Step 2: [16+40 26+58 40 58 16 25 36 49] - = [56 84 40 58 16 25 36 49] - -Step 3: [56+84 84 40 58 16 25 36 49] - = [140 84 40 58 16 25 36 49] -``` - -### Key implementation features - -1. **Memory Management**: - - Clean shared memory allocation with LayoutTensor address_space parameter - - Type-safe operations with LayoutTensor - - Automatic bounds checking - - Layout-aware indexing - -2. **Thread Synchronization**: - - `barrier()` after initial multiplication - - `barrier()` between reduction steps - - Safe thread coordination - -3. **Reduction Logic**: - - ```mojo - stride = TPB // 2 - while stride > 0: - if local_i < stride: - shared[local_i] += shared[local_i + stride] - barrier() - stride //= 2 - ``` - -4. **Performance Benefits**: - - \\(O(\log n)\\) time complexity - - Coalesced memory access - - Minimal thread divergence - - Efficient shared memory usage - -The LayoutTensor version maintains the same efficient parallel reduction while providing: - -- Better type safety -- Cleaner memory management -- Layout awareness -- Natural indexing syntax - -### Barrier synchronization importance - -The `barrier()` between reduction steps is critical for correctness. 
Here's why: - -Without `barrier()`, race conditions occur: - -```text -Initial shared memory: [0 1 4 9 16 25 36 49] - -Step 1 (stride = 4): -Thread 0 reads: shared[0] = 0, shared[4] = 16 -Thread 1 reads: shared[1] = 1, shared[5] = 25 -Thread 2 reads: shared[2] = 4, shared[6] = 36 -Thread 3 reads: shared[3] = 9, shared[7] = 49 - -Without barrier: -- Thread 0 writes: shared[0] = 0 + 16 = 16 -- Thread 1 starts next step (stride = 2) before Thread 0 finishes - and reads old value shared[0] = 0 instead of 16! -``` - -With `barrier()`: - -```text -Step 1 (stride = 4): -All threads write their sums: -[16 26 40 58 16 25 36 49] -barrier() ensures ALL threads see these values - -Step 2 (stride = 2): -Now threads safely read the updated values: -Thread 0: shared[0] = 16 + 40 = 56 -Thread 1: shared[1] = 26 + 58 = 84 -``` - -The `barrier()` ensures: - -1. All writes from current step complete -2. All threads see updated values -3. No thread starts next iteration early -4. Consistent shared memory state - -Without these synchronization points, we could get: - -- Memory race conditions -- Threads reading stale values -- Non-deterministic results -- Incorrect final sum - -
-
diff --git a/book/src/puzzle_12/puzzle_12.md b/book/src/puzzle_12/puzzle_12.md index 15937eac..88b390b9 100644 --- a/book/src/puzzle_12/puzzle_12.md +++ b/book/src/puzzle_12/puzzle_12.md @@ -2,7 +2,7 @@ ## Overview -Implement a kernel that computes the dot product of vector `a` and vector `b` and stores it in `output` (single number). The dot product is an operation that takes two vectors of the same size and returns a single number (a scalar). It is calculated by multiplying corresponding elements from each vector and then summing those products. +Implement a kernel that computes the dot product of 1D TileTensor `a` and 1D TileTensor `b` and stores it in 1D TileTensor `output` (single number). The dot product is an operation that takes two vectors of the same size and returns a single number (a scalar). It is calculated by multiplying corresponding elements from each vector and then summing those products. For example, if you have two vectors: @@ -17,12 +17,227 @@ For example, if you have two vectors: Dot product visualization Dot product visualization -## Implementation approaches +## Key concepts -### [๐Ÿ”ฐ Raw memory approach](./raw.md) -Learn how to implement the reduction with manual memory management and synchronization. +**Parallel reduction** is an algorithm that combines \\(n\\) values into one using a binary operation (here, addition) in \\(O(\log n)\\) steps instead of \\(O(n)\\) sequential steps. In each step, half the active threads each add one value into another, halving the number of remaining partial results. After \\(\log_2 n\\) steps, thread 0 holds the final sum. This tree-shaped computation requires a `barrier()` between steps so no thread reads a partially-updated value. -### [๐Ÿ“ LayoutTensor Version](./layout_tensor.md) -Use LayoutTensor's features for efficient reduction and shared memory management. +This puzzle covers: -๐Ÿ’ก **Note**: See how LayoutTensor simplifies efficient memory access patterns. 
+- Similar to [puzzle 8](../puzzle_08/puzzle_08.md) and [puzzle 11](../puzzle_11/puzzle_11.md), implementing parallel reduction with TileTensor +- Managing shared memory using TileTensor with address_space +- Coordinating threads for collective operations +- Using layout-aware tensor operations + +The key insight is how TileTensor simplifies memory management while maintaining efficient parallel reduction patterns. + +## Configuration + +- Vector size: `SIZE = 8` elements +- Threads per block: `TPB = 8` +- Number of blocks: 1 +- Output size: 1 element +- Shared memory: `TPB` elements + +Notes: + +- **TileTensor allocation**: Use `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]())` +- **Element access**: Natural indexing with bounds checking +- **Layout handling**: Separate layouts for input and output +- **Thread coordination**: Same synchronization patterns with `barrier()` + +## Code to complete + +```mojo +{{#include ../../../problems/p12/p12.mojo:dot_product}} +``` + +View full file: problems/p12/p12.mojo + +
+Tips + +
+ +1. Create shared memory with TileTensor using address_space +2. Store `a[global_i] * b[global_i]` in `shared[local_i]` +3. Use parallel reduction pattern with `barrier()` +4. Let thread 0 write final result to `output[0]` + +
+
+ +## Running the code + +To test your solution, run the following command in your terminal: + +
+
+ + + + +
+
+ +```bash +pixi run p12 +``` + +
+
+ +```bash +pixi run -e amd p12 +``` + +
+
+ +```bash +pixi run -e apple p12 +``` + +
+
+ +```bash +uv run poe p12 +``` + +
+
+ +Your output will look like this if the puzzle isn't solved yet: + +```txt +out: HostBuffer([0.0]) +expected: HostBuffer([140.0]) +``` + +## Solution + +
+ + +```mojo +{{#include ../../../solutions/p12/p12.mojo:dot_product_solution}} +``` + +
+ +The solution implements a parallel reduction for dot product using TileTensor. Here's the detailed breakdown: + +### Phase 1: Element-wise Multiplication + +Each thread performs one multiplication with natural indexing: + +```mojo +shared[local_i] = a[global_i] * b[global_i] +``` + +### Phase 2: Parallel Reduction + +Tree-based reduction with layout-aware operations: + +```txt +Initial: [0*0 1*1 2*2 3*3 4*4 5*5 6*6 7*7] + = [0 1 4 9 16 25 36 49] + +Step 1: [0+16 1+25 4+36 9+49 16 25 36 49] + = [16 26 40 58 16 25 36 49] + +Step 2: [16+40 26+58 40 58 16 25 36 49] + = [56 84 40 58 16 25 36 49] + +Step 3: [56+84 84 40 58 16 25 36 49] + = [140 84 40 58 16 25 36 49] +``` + +### Key implementation features + +1. **Memory Management**: + - Clean shared memory allocation with TileTensor address_space parameter + - Type-safe operations with TileTensor + - Automatic bounds checking + - Layout-aware indexing + +2. **Thread Synchronization**: + - `barrier()` after initial multiplication + - `barrier()` between reduction steps + - Safe thread coordination + +3. **Reduction Logic**: + + ```mojo + stride = TPB // 2 + while stride > 0: + if local_i < stride: + shared[local_i] += shared[local_i + stride] + barrier() + stride //= 2 + ``` + +4. **Performance Benefits**: + - \\(O(\log n)\\) time complexity + - Coalesced memory access + - Minimal thread divergence + - Efficient shared memory usage + +The TileTensor version maintains the same efficient parallel reduction while providing: + +- Better type safety +- Cleaner memory management +- Layout awareness +- Natural indexing syntax + +### Barrier synchronization importance + +The `barrier()` between reduction steps is critical for correctness. 
Here's why: + +Without `barrier()`, race conditions occur: + +```text +Initial shared memory: [0 1 4 9 16 25 36 49] + +Step 1 (stride = 4): +Thread 0 reads: shared[0] = 0, shared[4] = 16 +Thread 1 reads: shared[1] = 1, shared[5] = 25 +Thread 2 reads: shared[2] = 4, shared[6] = 36 +Thread 3 reads: shared[3] = 9, shared[7] = 49 + +Without barrier: +- Thread 0 writes: shared[0] = 0 + 16 = 16 +- Thread 1 starts next step (stride = 2) before Thread 0 finishes + and reads old value shared[0] = 0 instead of 16! +``` + +With `barrier()`: + +```text +Step 1 (stride = 4): +All threads write their sums: +[16 26 40 58 16 25 36 49] +barrier() ensures ALL threads see these values + +Step 2 (stride = 2): +Now threads safely read the updated values: +Thread 0: shared[0] = 16 + 40 = 56 +Thread 1: shared[1] = 26 + 58 = 84 +``` + +The `barrier()` ensures: + +1. All writes from current step complete +2. All threads see updated values +3. No thread starts next iteration early +4. Consistent shared memory state + +Without these synchronization points, we could get: + +- Memory race conditions +- Threads reading stale values +- Non-deterministic results +- Incorrect final sum + +
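+The tree reduction and its barrier discipline can be mimicked sequentially in Python; finishing every addition at one stride level before shrinking the stride plays the role of `barrier()`. This is a CPU sketch with our own names, assuming a power-of-two length like `TPB = 8`, not the Mojo solution:
+
+```python
+# CPU sketch of the tree-based dot-product reduction. Completing each
+# stride level before the next mimics what barrier() guarantees on the GPU.
+# Assumes len(a) is a power of two (here 8, matching TPB).
+def dot_product_reduce(a, b):
+    shared = [x * y for x, y in zip(a, b)]   # phase 1: element-wise products
+    stride = len(shared) // 2
+    while stride > 0:
+        for i in range(stride):              # phase 2: halve active "threads"
+            shared[i] += shared[i + stride]
+        stride //= 2                         # level done == barrier passed
+    return shared[0]
+
+v = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
+print(dot_product_reduce(v, v))  # 140.0
+```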
+
diff --git a/book/src/puzzle_12/raw.md b/book/src/puzzle_12/raw.md deleted file mode 100644 index 1788f7e6..00000000 --- a/book/src/puzzle_12/raw.md +++ /dev/null @@ -1,228 +0,0 @@ -## Overview - -Implement a kernel that computes the dot product of vector `a` and vector `b` and stores it in `output` (single number). - -**Note:** _You have 1 thread per position. You only need 2 global reads per thread and 1 global write per thread block._ - -## Key concepts - -This puzzle covers: - -- Implementing parallel reduction operations -- Using shared memory for intermediate results -- Coordinating threads for collective operations - -The key insight is understanding how to efficiently combine multiple values into a single result using parallel computation and shared memory. - -## Configuration - -- Vector size: `SIZE = 8` elements -- Threads per block: `TPB = 8` -- Number of blocks: 1 -- Output size: 1 element -- Shared memory: `TPB` elements - -Notes: - -- **Element access**: Each thread reads corresponding elements from `a` and `b` -- **Partial results**: Computing and storing intermediate values -- **Thread coordination**: Synchronizing before combining results -- **Final reduction**: Converting partial results to scalar output - -_Note: For this problem, you don't need to worry about number of shared reads. We will -handle that challenge later._ - -## Code to complete - -```mojo -{{#include ../../../problems/p12/p12.mojo:dot_product}} -``` - -View full file: problems/p12/p12.mojo - -
-Tips - -
- -1. Store `a[global_i] * b[global_i]` in `shared[local_i]` -2. Call `barrier()` to synchronize -3. Use thread 0 to sum all products in shared memory -4. Write final sum to `output[0]` - -
-
- -## Running the code - -To test your solution, run the following command in your terminal: - -
-
- - - - -
-
- -```bash -pixi run p12 -``` - -
-
- -```bash -pixi run -e amd p12 -``` - -
-
- -```bash -pixi run -e apple p12 -``` - -
-
- -```bash -uv run poe p12 -``` - -
-
- -Your output will look like this if the puzzle isn't solved yet: - -```txt -out: HostBuffer([0.0]) -expected: HostBuffer([140.0]) -``` - -## Solution - -
- - -```mojo -{{#include ../../../solutions/p12/p12.mojo:dot_product_solution}} -``` - -
- -The solution implements a parallel reduction algorithm for dot product computation using shared memory. Here's a detailed breakdown: - -### Phase 1: Element-wise Multiplication - -Each thread performs one multiplication: - -```txt -Thread i: shared[i] = a[i] * b[i] -``` - -### Phase 2: Parallel Reduction - -The reduction uses a tree-based approach that halves active threads in each step: - -```txt -Initial: [0*0 1*1 2*2 3*3 4*4 5*5 6*6 7*7] - = [0 1 4 9 16 25 36 49] - -Step 1: [0+16 1+25 4+36 9+49 16 25 36 49] - = [16 26 40 58 16 25 36 49] - -Step 2: [16+40 26+58 40 58 16 25 36 49] - = [56 84 40 58 16 25 36 49] - -Step 3: [56+84 84 40 58 16 25 36 49] - = [140 84 40 58 16 25 36 49] -``` - -### Key implementation features - -1. **Memory Access Pattern**: - - Each thread loads exactly two values from global memory (`a[i]`, `b[i]`) - - Uses shared memory for intermediate results - - Final result written once to global memory - -2. **Thread Synchronization**: - - `barrier()` after initial multiplication - - `barrier()` after each reduction step - - Prevents race conditions between reduction steps - -3. **Reduction Logic**: - - ```mojo - stride = TPB // 2 - while stride > 0: - if local_i < stride: - shared[local_i] += shared[local_i + stride] - barrier() - stride //= 2 - ``` - - - Halves stride in each step - - Only active threads perform additions - - Maintains work efficiency - -4. **Performance Considerations**: - - \\(\log_2(n)\\) steps for \\(n\\) elements - - Coalesced memory access pattern - - Minimal thread divergence - - Efficient use of shared memory - -This implementation achieves \\(O(\log n)\\) time complexity compared to \\(O(n)\\) in sequential execution, demonstrating the power of parallel reduction algorithms. - -### Barrier synchronization importance - -The `barrier()` between reduction steps is critical for correctness. 
Here's why: - -Without `barrier()`, race conditions occur: - -```text -Initial shared memory: [0 1 4 9 16 25 36 49] - -Step 1 (stride = 4): -Thread 0 reads: shared[0] = 0, shared[4] = 16 -Thread 1 reads: shared[1] = 1, shared[5] = 25 -Thread 2 reads: shared[2] = 4, shared[6] = 36 -Thread 3 reads: shared[3] = 9, shared[7] = 49 - -Without barrier: -- Thread 0 writes: shared[0] = 0 + 16 = 16 -- Thread 1 starts next step (stride = 2) before Thread 0 finishes - and reads old value shared[0] = 0 instead of 16! -``` - -With `barrier()`: - -```text -Step 1 (stride = 4): -All threads write their sums: -[16 26 40 58 16 25 36 49] -barrier() ensures ALL threads see these values - -Step 2 (stride = 2): -Now threads safely read the updated values: -Thread 0: shared[0] = 16 + 40 = 56 -Thread 1: shared[1] = 26 + 58 = 84 -``` - -The `barrier()` ensures: - -1. All writes from current step complete -2. All threads see updated values -3. No thread starts next iteration early -4. Consistent shared memory state - -Without these synchronization points, we could get: - -- Memory race conditions -- Threads reading stale values -- Non-deterministic results -- Incorrect final sum - -
-
diff --git a/book/src/puzzle_13/block_boundary.md b/book/src/puzzle_13/block_boundary.md index 66cac6a2..dfb570b4 100644 --- a/book/src/puzzle_13/block_boundary.md +++ b/book/src/puzzle_13/block_boundary.md @@ -1,6 +1,6 @@ # Block Boundary Version -Implement a kernel that computes a 1D convolution between 1D LayoutTensor `a` and 1D LayoutTensor `b` and stores it in 1D LayoutTensor `output`. +Implement a kernel that computes a 1D convolution between 1D TileTensor `a` and 1D TileTensor `b` and stores it in 1D TileTensor `output`. **Note:** _You need to handle the general case. You only need 2 global reads and 1 global write per thread._ @@ -32,7 +32,7 @@ Notes:
-1. Use `LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` for shared memory +1. Use `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + CONV_2 - 1]())` for shared memory 2. Load main data: `shared_a[local_i] = a[global_i]` 3. Load boundary: `if local_i < CONV_2 - 1` handle next block data 4. Load kernel: `shared_b[local_i] = b[local_i]` @@ -125,8 +125,8 @@ Size calculation: ```mojo # First: account for padding needed for convolution window - shared_a = LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() - shared_b = LayoutTensor[dtype, Layout.row_major(CONV_2), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() + shared_a = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + CONV_2 - 1]()) + shared_b = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[CONV_2]()) ``` This allocation pattern ensures we have enough space for both the block's data and the overlap region. @@ -239,7 +239,7 @@ This implementation achieves efficient cross-block convolution while maintaining - Memory safety through proper bounds checking - High performance through optimized memory access -- Clean code structure using LayoutTensor abstractions +- Clean code structure using TileTensor abstractions - Minimal synchronization overhead - Mathematically sound boundary handling diff --git a/book/src/puzzle_13/puzzle_13.md b/book/src/puzzle_13/puzzle_13.md index b36a3d4b..28775394 100644 --- a/book/src/puzzle_13/puzzle_13.md +++ b/book/src/puzzle_13/puzzle_13.md @@ -1,13 +1,13 @@ # Puzzle 13: 1D Convolution -> ## Moving to LayoutTensor +> ## Moving to TileTensor > > So far in our GPU puzzle journey, we've been exploring two parallel approaches to GPU memory management: > > 1. 
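+The bounds check `if i + j < len(a)` in the kernel's inner loop can be validated against a plain-Python reference. This sketch (our own helper name, not the book's solution code) computes the same convolution sequentially, treating reads past the end of `a` as zero:
+
+```python
+# CPU reference for the puzzle's 1D convolution:
+# out[i] = sum_j a[i + j] * b[j], with reads past the end of `a` as 0.
+def conv_1d(a, b):
+    out = []
+    for i in range(len(a)):
+        acc = 0.0
+        for j in range(len(b)):
+            if i + j < len(a):      # same bounds check the kernel performs
+                acc += a[i + j] * b[j]
+        out.append(acc)
+    return out
+
+print(conv_1d([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.0]))
+# [5.0, 8.0, 11.0, 14.0, 5.0, 0.0]
+```
+
+The last positions shrink toward zero because part of the window falls off the end of `a`; in the block-boundary version the analogous reads fall into the next block's data, which is why the shared buffer is padded to `TPB + CONV_2 - 1` elements.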
Raw memory management with direct pointer manipulation using [UnsafePointer](https://docs.modular.com/mojo/std/memory/unsafe_pointer/UnsafePointer/) -> 2. The more structured [LayoutTensor](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/) with its powerful address_space parameter for memory allocation +> 2. The more structured [TileTensor](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/) with its powerful address_space parameter for memory allocation > -> Starting from this puzzle, we're transitioning exclusively to using `LayoutTensor`. This abstraction provides several benefits: +> Starting from this puzzle, we're transitioning exclusively to using `TileTensor`. This abstraction provides several benefits: > - Type-safe memory access patterns > - Clear representation of data layouts > - Better code maintainability @@ -23,7 +23,7 @@ In signal processing and image analysis, convolution is a fundamental operation that combines two sequences to produce a third sequence. This puzzle challenges you to implement a 1D convolution on the GPU, where each output element is computed by sliding a kernel over an input array. -Implement a kernel that computes a 1D convolution between vector `a` and vector `b` and stores it in `output` using the `LayoutTensor` abstraction. +Implement a kernel that computes a 1D convolution between vector `a` and vector `b` and stores it in `output` using the `TileTensor` abstraction. **Note:** _You need to handle the general case. You only need 2 global reads and 1 global write per thread._ @@ -46,9 +46,9 @@ for i in range(SIZE): This puzzle is split into two parts to help you build understanding progressively: - [Simple Version with Single Block](./simple.md) - Start here to learn the basics of implementing convolution with shared memory in a single block using LayoutTensor. + Start here to learn the basics of implementing convolution with shared memory in a single block using TileTensor. 
- [Block Boundary Version](./block_boundary.md) - Then tackle the more challenging case where data needs to be shared across block boundaries, leveraging LayoutTensor's capabilities. + Then tackle the more challenging case where data needs to be shared across block boundaries, leveraging TileTensor's capabilities. Each version presents unique challenges in terms of memory access patterns and thread coordination. The simple version helps you understand the basic convolution operation, while the complete version tests your ability to handle more complex scenarios that arise in real-world GPU programming. diff --git a/book/src/puzzle_13/simple.md b/book/src/puzzle_13/simple.md index 9c427dd0..db98b6da 100644 --- a/book/src/puzzle_13/simple.md +++ b/book/src/puzzle_13/simple.md @@ -1,6 +1,6 @@ # Simple Case with Single Block -Implement a kernel that computes a 1D convolution between 1D LayoutTensor `a` and 1D LayoutTensor `b` and stores it in 1D LayoutTensor `output`. +Implement a kernel that computes a 1D convolution between 1D TileTensor `a` and 1D TileTensor `b` and stores it in 1D TileTensor `output`. **Note:** _You need to handle the general case. You only need 2 global reads and 1 global write per thread._ @@ -41,7 +41,7 @@ Notes:
-1. Use `LayoutTensor[dtype, Layout.row_major(SIZE), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` for shared memory allocation +1. Use `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[SIZE]())` for shared memory allocation 2. Load input to `shared_a[local_i]` and kernel to `shared_b[local_i]` 3. Call `barrier()` after loading 4. Sum products within bounds: `if local_i + j < SIZE` @@ -177,7 +177,7 @@ Kernel b: [0 1 2] - Uses `var` for proper type inference with `output.element_type` - Employs `@parameter` decorator to unroll the convolution loop at compile time - Maintains strict bounds checking for memory safety - - Leverages LayoutTensor's type system for better code safety + - Leverages TileTensor's type system for better code safety 3. **Memory Management**: - Uses shared memory for both input array and kernel diff --git a/book/src/puzzle_14/complete.md b/book/src/puzzle_14/complete.md index 032b84ce..972b016a 100644 --- a/book/src/puzzle_14/complete.md +++ b/book/src/puzzle_14/complete.md @@ -1,6 +1,6 @@ # Complete Version -Implement a kernel that computes a prefix-sum over 1D LayoutTensor `a` and stores it in 1D LayoutTensor `output`. +Implement a kernel that computes a prefix-sum over 1D TileTensor `a` and stores it in 1D TileTensor `output`. **Note:** _If the size of `a` is greater than the block size, we need to synchronize across multiple blocks to get the correct result._ diff --git a/book/src/puzzle_14/puzzle_14.md b/book/src/puzzle_14/puzzle_14.md index 9b3bf220..226892cd 100644 --- a/book/src/puzzle_14/puzzle_14.md +++ b/book/src/puzzle_14/puzzle_14.md @@ -4,7 +4,7 @@ Prefix sum (also known as _scan_) is a fundamental parallel algorithm that computes running totals of a sequence. Found at the heart of many parallel applications - from sorting algorithms to scientific simulations - it transforms a sequence of numbers into their running totals. 
While simple to compute sequentially, making this efficient on a GPU requires clever parallel thinking! -Implement a kernel that computes a prefix-sum over 1D LayoutTensor `a` and stores it in 1D LayoutTensor `output`. +Implement a kernel that computes a prefix-sum over 1D TileTensor `a` and stores it in 1D TileTensor `output`. **Note:** _If the size of `a` is greater than the block size, only store the sum of each block._ diff --git a/book/src/puzzle_14/simple.md b/book/src/puzzle_14/simple.md index 46692d50..88513776 100644 --- a/book/src/puzzle_14/simple.md +++ b/book/src/puzzle_14/simple.md @@ -1,6 +1,6 @@ # Simple Version -Implement a kernel that computes a prefix-sum over 1D LayoutTensor `a` and stores it in 1D LayoutTensor `output`. +Implement a kernel that computes a prefix-sum over 1D TileTensor `a` and stores it in 1D TileTensor `output`. **Note:** _If the size of `a` is greater than the block size, only store the sum of each block._ @@ -13,11 +13,11 @@ Implement a kernel that computes a prefix-sum over 1D LayoutTensor `a` and store Notes: -- **Data loading**: Each thread loads one element using LayoutTensor access -- **Memory pattern**: Shared memory for intermediate results using LayoutTensor with address_space +- **Data loading**: Each thread loads one element using TileTensor access +- **Memory pattern**: Shared memory for intermediate results using TileTensor with address_space - **Thread sync**: Coordination between computation phases - **Access pattern**: Stride-based parallel computation -- **Type safety**: Leveraging LayoutTensor's type system +- **Type safety**: Leveraging TileTensor's type system ## Code to complete diff --git a/book/src/puzzle_15/puzzle_15.md b/book/src/puzzle_15/puzzle_15.md index 1af82d8a..4880f3d1 100644 --- a/book/src/puzzle_15/puzzle_15.md +++ b/book/src/puzzle_15/puzzle_15.md @@ -2,7 +2,7 @@ ## Overview -Implement a kernel that computes a sum over each row of 2D matrix `a` and stores it in `output` using LayoutTensor. 
+Implement a kernel that computes a sum over each row of 2D matrix `a` and stores it in `output` using TileTensor. Axis sum visualization Axis sum visualization @@ -11,12 +11,12 @@ Implement a kernel that computes a sum over each row of 2D matrix `a` and stores This puzzle covers: -- Parallel reduction along matrix dimensions using LayoutTensor +- Parallel reduction along matrix dimensions using TileTensor - Using block coordinates for data partitioning - Efficient shared memory reduction patterns - Working with multi-dimensional tensor layouts -The key insight is understanding how to map thread blocks to matrix rows and perform efficient parallel reduction within each block while leveraging LayoutTensor's dimensional indexing. +The key insight is understanding how to map thread blocks to matrix rows and perform efficient parallel reduction within each block while leveraging TileTensor's dimensional indexing. ## Configuration @@ -24,8 +24,8 @@ The key insight is understanding how to map thread blocks to matrix rows and per - Threads per block: \\(\\text{TPB} = 8\\) - Grid dimensions: \\(1 \\times \\text{BATCH}\\) - Shared memory: \\(\\text{TPB}\\) elements per block -- Input layout: `Layout.row_major(BATCH, SIZE)` -- Output layout: `Layout.row_major(BATCH, 1)` +- Input layout: `row_major[BATCH, SIZE]()` +- Output layout: `row_major[BATCH, 1]()` Matrix visualization: @@ -116,12 +116,12 @@ expected: HostBuffer([15.0, 51.0, 87.0, 123.0])
-The solution implements a parallel row-wise sum reduction for a 2D matrix using LayoutTensor. Here's a comprehensive breakdown: +The solution implements a parallel row-wise sum reduction for a 2D matrix using TileTensor. Here's a comprehensive breakdown: ### Matrix layout and block mapping ```txt -Input Matrix (4ร—6) with LayoutTensor: Block Assignment: +Input Matrix (4ร—6) with TileTensor: Block Assignment: [[ a[0,0] a[0,1] a[0,2] a[0,3] a[0,4] a[0,5] ] โ†’ Block(0,0) [ a[1,0] a[1,1] a[1,2] a[1,3] a[1,4] a[1,5] ] โ†’ Block(0,1) [ a[2,0] a[2,1] a[2,2] a[2,3] a[2,4] a[2,5] ] โ†’ Block(0,2) @@ -156,9 +156,9 @@ Input Matrix (4ร—6) with LayoutTensor: Block Assignment: - Each block processes one complete row 2. **Memory Access Pattern**: - - LayoutTensor 2D indexing for input: `a[batch, local_i]` + - TileTensor 2D indexing for input: `a[batch, local_i]` - Shared memory for efficient reduction - - LayoutTensor 2D indexing for output: `output[batch, 0]` + - TileTensor 2D indexing for output: `output[batch, 0]` 3. **Parallel Reduction Logic**: @@ -196,7 +196,7 @@ Input Matrix (4ร—6) with LayoutTensor: Block Assignment: ### Performance optimizations 1. **Memory Efficiency**: - - Coalesced memory access through LayoutTensor + - Coalesced memory access through TileTensor - Shared memory for fast reduction - Single write per row result diff --git "a/book/src/puzzle_16/na\303\257ve.md" "b/book/src/puzzle_16/na\303\257ve.md" index 936f65aa..429e5d27 100644 --- "a/book/src/puzzle_16/na\303\257ve.md" +++ "b/book/src/puzzle_16/na\303\257ve.md" @@ -24,9 +24,9 @@ The key insight is understanding how to map 2D thread indices to matrix elements Layout configuration: -- Input A: `Layout.row_major(SIZE, SIZE)` -- Input B: `Layout.row_major(SIZE, SIZE)` -- Output: `Layout.row_major(SIZE, SIZE)` +- Input A: `row_major[SIZE, SIZE]()` +- Input B: `row_major[SIZE, SIZE]()` +- Output: `row_major[SIZE, SIZE]()` ## Code to complete @@ -108,7 +108,7 @@ expected: HostBuffer([4.0, 6.0, 12.0, 22.0])
-The naive matrix multiplication using LayoutTensor follows this basic approach: +The naive matrix multiplication using TileTensor follows this basic approach: ### Matrix layout (2ร—2 example) diff --git a/book/src/puzzle_16/shared_memory.md b/book/src/puzzle_16/shared_memory.md index 8b8c41d2..5cf51fd2 100644 --- a/book/src/puzzle_16/shared_memory.md +++ b/book/src/puzzle_16/shared_memory.md @@ -8,13 +8,13 @@ This puzzle implements matrix multiplication for square matrices \\(A\\) and \\( This puzzle covers: -- Block-local memory management with LayoutTensor +- Block-local memory management with TileTensor - Thread synchronization patterns - Memory access optimization using shared memory - Collaborative data loading with 2D indexing -- Efficient use of LayoutTensor for matrix operations +- Efficient use of TileTensor for matrix operations -The central concept involves utilizing fast shared memory through LayoutTensor to minimize costly global memory accesses. +The central concept involves utilizing fast shared memory through TileTensor to minimize costly global memory accesses. ## Configuration @@ -24,15 +24,15 @@ The central concept involves utilizing fast shared memory through LayoutTensor t Layout configuration: -- Input A: `Layout.row_major(SIZE, SIZE)` -- Input B: `Layout.row_major(SIZE, SIZE)` -- Output: `Layout.row_major(SIZE, SIZE)` -- Shared Memory: Two `TPB ร— TPB` LayoutTensors +- Input A: `row_major[SIZE, SIZE]()` +- Input B: `row_major[SIZE, SIZE]()` +- Output: `row_major[SIZE, SIZE]()` +- Shared Memory: Two `TPB ร— TPB` TileTensors Memory organization: ```txt -Global Memory (LayoutTensor): Shared Memory (LayoutTensor): +Global Memory (TileTensor): Shared Memory (TileTensor): A[i,j]: Direct access a_shared[local_row, local_col] B[i,j]: Direct access b_shared[local_row, local_col] ``` @@ -117,7 +117,7 @@ expected: HostBuffer([4.0, 6.0, 12.0, 22.0])
-The shared memory implementation with LayoutTensor improves performance through efficient memory access patterns: +The shared memory implementation with TileTensor improves performance through efficient memory access patterns: ### Memory organization @@ -138,9 +138,9 @@ Matrix B: b_shared: (similar layout) 1. **Shared Memory Setup**: ```mojo - # Create 2D shared memory tensors using LayoutTensor with address_space - a_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() - b_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() + # Create 2D shared memory tensors using TileTensor with address_space + a_shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB, TPB]()) + b_shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB, TPB]()) ``` 2. **Thread Indexing**: @@ -158,7 +158,7 @@ Matrix B: b_shared: (similar layout) 3. **Data Loading**: ```mojo - # Load data into shared memory using LayoutTensor indexing + # Load data into shared memory using TileTensor indexing if row < size and col < size: a_shared[local_row, local_col] = a[row, col] b_shared[local_row, local_col] = b[row, col] @@ -217,13 +217,13 @@ Matrix B: b_shared: (similar layout) ### Key language features -1. **LayoutTensor benefits**: +1. **TileTensor benefits**: - Direct 2D indexing simplifies code - Type safety through `element_type` - Efficient memory layout handling 2. 
**Shared memory allocation**: -   - LayoutTensor with address_space for structured allocation +   - `stack_allocation` with address_space for structured allocation    - Row-major layout matching input tensors    - Proper alignment for efficient access @@ -253,7 +253,7 @@ This implementation significantly improves performance over the naive version by  - Reducing global memory accesses - Enabling data reuse through shared memory -- Using efficient 2D indexing with LayoutTensor +- Using efficient 2D indexing with TileTensor - Maintaining proper thread synchronization
diff --git a/book/src/puzzle_16/tiled.md b/book/src/puzzle_16/tiled.md index 27f5e5df..3701ebc2 100644 --- a/book/src/puzzle_16/tiled.md +++ b/book/src/puzzle_16/tiled.md @@ -2,28 +2,28 @@ ## Overview -Implement a kernel that multiplies square matrices \\(A\\) and \\(B\\) using tiled matrix multiplication with LayoutTensor. This approach handles large matrices by processing them in smaller chunks (tiles). +Implement a kernel that multiplies square matrices \\(A\\) and \\(B\\) using tiled matrix multiplication with TileTensor. This approach handles large matrices by processing them in smaller chunks (tiles). ## Key concepts -- Matrix tiling with LayoutTensor for efficient computation +- Matrix tiling with TileTensor for efficient computation - Multi-block coordination with proper layouts - Efficient shared memory usage through TensorBuilder -- Boundary handling for tiles with LayoutTensor indexing +- Boundary handling for tiles with TileTensor indexing ## Configuration - Matrix size: \\(\\text{SIZE\_TILED} = 9\\) - Threads per block: \\(\\text{TPB} \times \\text{TPB} = 3 \times 3\\) - Grid dimensions: \\(3 \times 3\\) blocks -- Shared memory: Two \\(\\text{TPB} \times \\text{TPB}\\) LayoutTensors per block +- Shared memory: Two \\(\\text{TPB} \times \\text{TPB}\\) TileTensors per block Layout configuration: -- Input A: `Layout.row_major(SIZE_TILED, SIZE_TILED)` -- Input B: `Layout.row_major(SIZE_TILED, SIZE_TILED)` -- Output: `Layout.row_major(SIZE_TILED, SIZE_TILED)` -- Shared Memory: Two `TPB ร— TPB` LayoutTensors using TensorBuilder +- Input A: `row_major[SIZE_TILED, SIZE_TILED]()` +- Input B: `row_major[SIZE_TILED, SIZE_TILED]()` +- Output: `row_major[SIZE_TILED, SIZE_TILED]()` +- Shared Memory: Two `TPB ร— TPB` TileTensors using TensorBuilder ## Tiling strategy @@ -35,7 +35,7 @@ Grid Layout (3ร—3): Thread Layout per Block (3ร—3): [B10][B11][B12] [T10 T11 T12] [B20][B21][B22] [T20 T21 T22] -Each block processes a tile using LayoutTensor indexing +Each block 
processes a tile using TileTensor indexing ``` ### Tile processing steps @@ -316,7 +316,7 @@ Key performance features: This implementation achieves high performance through: -- Efficient use of LayoutTensor for memory access +- Efficient use of TileTensor for memory access - Optimal tiling strategy - Proper thread synchronization - Careful boundary handling @@ -324,7 +324,7 @@ This implementation achieves high performance through:
-## Solution: Idiomatic LayoutTensor tiling +## Solution: Idiomatic TileTensor tiling
@@ -335,14 +335,14 @@ This implementation achieves high performance through:
-The idiomatic tiled matrix multiplication leverages Mojo's LayoutTensor API and asynchronous memory operations for a beautifully clean implementation. +The idiomatic tiled matrix multiplication leverages Mojo's TileTensor API and asynchronous memory operations for a beautifully clean implementation. **๐Ÿ”‘ Key Point: This implementation performs standard matrix multiplication A ร— B using coalesced loading for both matrices.** **What this implementation does:** - **Matrix operation**: Standard \\(A \times B\\) multiplication (not \\(A \times B^T\\)) -- **Loading pattern**: Both matrices use `Layout.row_major(1, TPB)` for coalesced access +- **Loading pattern**: Both matrices use `row_major[1, TPB]()` for coalesced access - **Computation**: `acc += a_shared[local_row, k] * b_shared[k, local_col]` - **Data layout**: No transposition during loading - both matrices loaded in same orientation @@ -354,7 +354,7 @@ The idiomatic tiled matrix multiplication leverages Mojo's LayoutTensor API and With the \\((9 \times 9)\\) matrix size, we get perfect tiling that eliminates all boundary checks: -1. **LayoutTensor tile API** +1. **TileTensor tile API** ```mojo out_tile = output.tile[TPB, TPB](block_idx.y, block_idx.x) @@ -362,7 +362,7 @@ With the \\((9 \times 9)\\) matrix size, we get perfect tiling that eliminates a b_tile = b.tile[TPB, TPB](idx, block_idx.x) ``` - This directly expresses "get the tile at position (block_idx.y, block_idx.x)" without manual coordinate calculation. See the [documentation](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/#tile) for more details. + This directly expresses "get the tile at position (block_idx.y, block_idx.x)" without manual coordinate calculation. See the [documentation](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/#tile) for more details. 2. **Asynchronous memory operations** @@ -393,14 +393,14 @@ With the \\((9 \times 9)\\) matrix size, we get perfect tiling that eliminates a 3. 
**Optimized memory access layouts** ```mojo - comptime load_a_layout = Layout.row_major(1, TPB) # Coalesced loading - comptime load_b_layout = Layout.row_major(1, TPB) # Coalesced loading + comptime load_a_layout = row_major[1, TPB]() # Coalesced loading + comptime load_b_layout = row_major[1, TPB]() # Coalesced loading # Note: Both matrices use the same layout for standard A ร— B multiplication ``` **Memory Access Analysis for Current Implementation:** - Both matrices use `Layout.row_major(1, TPB)` for coalesced loading from global memory: + Both matrices use `row_major[1, TPB]()` for coalesced loading from global memory: - `load_a_layout`: Threads cooperate to load consecutive elements from matrix A rows - `load_b_layout`: Threads cooperate to load consecutive elements from matrix B rows - **Key insight**: Thread layout determines how threads cooperate during copy, not the final data layout @@ -422,11 +422,11 @@ With the \\((9 \times 9)\\) matrix size, we get perfect tiling that eliminates a - Matrix A tile: threads load A[block_row, k], A[block_row, k+1], A[block_row, k+2]... (consecutive) - Matrix B tile: threads load B[k, block_col], B[k, block_col+1], B[k, block_col+2]... (consecutive) - Both patterns are coalesced with Layout.row_major(1, TPB) + Both patterns are coalesced with row_major[1, TPB]() ``` **Three separate memory concerns:** - 1. **Global-to-shared coalescing**: `Layout.row_major(1, TPB)` ensures coalesced global memory access + 1. **Global-to-shared coalescing**: `row_major[1, TPB]()` ensures coalesced global memory access 2. **Shared memory computation**: `a_shared[local_row, k] * b_shared[k, local_col]` avoids bank conflicts 3. 
**Matrix operation**: The computation pattern determines this is A ร— B, not A ร— B^T @@ -465,7 +465,7 @@ This implementation shows how high-level abstractions can express complex GPU al | Feature | Manual Tiling | Idiomatic Tiling | |---------|--------------|------------------| -| Memory access | Direct indexing with bounds checks | LayoutTensor tile API | +| Memory access | Direct indexing with bounds checks | TileTensor tile API | | Tile loading | Explicit element-by-element copying | Dedicated copy engine bulk transfers | | Shared memory | Manual initialization (defensive) | Managed by copy functions | | Code complexity | More verbose with explicit indexing | More concise with higher-level APIs | @@ -481,7 +481,7 @@ The current implementation does NOT use transposed loading. This section is pure **Current implementation recap:** -- Uses `Layout.row_major(1, TPB)` for both matrices +- Uses `row_major[1, TPB]()` for both matrices - Performs standard A ร— B multiplication - No data transposition during copy @@ -492,8 +492,8 @@ While this puzzle uses standard coalesced loading for both matrices, the layout ```mojo # Example: Loading pre-transposed matrix B^T to compute A ร— B # (This is NOT what the current implementation does) -comptime load_b_layout = Layout.row_major(TPB, 1) # Load B^T with coalesced access -comptime store_b_layout = Layout.row_major(1, TPB) # Store as B in shared memory +comptime load_b_layout = row_major[TPB, 1]() # Load B^T with coalesced access +comptime store_b_layout = row_major[1, TPB]() # Store as B in shared memory copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile) ``` @@ -506,7 +506,7 @@ copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store **Key distinction:** -- **Current implementation**: Both matrices use `Layout.row_major(1, TPB)` for standard \\(A \times B\\) multiplication +- **Current implementation**: Both matrices use `row_major[1, TPB]()` 
for standard \\(A \times B\\) multiplication - **Transposed loading example**: Would use different layouts to handle pre-transposed data or different matrix operations This demonstrates Mojo's philosophy: providing low-level control when needed while maintaining high-level abstractions for common cases. @@ -518,13 +518,13 @@ This demonstrates Mojo's philosophy: providing low-level control when needed whi **What the idiomatic tiled implementation actually does:** 1. **Matrix Operation**: Standard A ร— B multiplication -2. **Memory Loading**: Both matrices use `Layout.row_major(1, TPB)` for coalesced access +2. **Memory Loading**: Both matrices use `row_major[1, TPB]()` for coalesced access 3. **Computation Pattern**: `acc += a_shared[local_row, k] * b_shared[k, local_col]` 4. **Data Layout**: No transposition during loading **Why this is optimal:** -- **Coalesced global memory access**: `Layout.row_major(1, TPB)` ensures efficient loading +- **Coalesced global memory access**: `row_major[1, TPB]()` ensures efficient loading - **Bank conflict avoidance**: Shared memory access pattern avoids conflicts - **Standard algorithm**: Implements the most common matrix multiplication pattern diff --git a/book/src/puzzle_17/puzzle_17.md b/book/src/puzzle_17/puzzle_17.md index ef8f2ec5..fd7d0fe1 100644 --- a/book/src/puzzle_17/puzzle_17.md +++ b/book/src/puzzle_17/puzzle_17.md @@ -141,7 +141,7 @@ Let's break down how this works in the larger context: 3. **Custom op registration**: - The `@compiler.register("conv1d")` decorator exposes our operation to MAX Graph. See [@compiler.register](https://docs.modular.com/mojo/manual/decorators/compiler-register/) - The `execute` method parameters define the interface (inputs, outputs, context) - - Input/output tensors are converted to LayoutTensors for use in our kernel + - Input/output tensors are converted to TileTensors for use in our kernel - Device context manages GPU memory allocation and kernel execution 4. 
**Kernel execution**: @@ -180,7 +180,7 @@ Let's break down how this works in the larger context: kernel_tensor = kernel.to_layout_tensor() ``` - - MAX Graph tensors are converted to Mojo LayoutTensors + - MAX Graph tensors are converted to Mojo TileTensors - This allows our kernel to work with them directly - The layouts are extracted for compile-time optimization diff --git a/book/src/puzzle_18/puzzle_18.md b/book/src/puzzle_18/puzzle_18.md index b21e84e8..e1884e5c 100644 --- a/book/src/puzzle_18/puzzle_18.md +++ b/book/src/puzzle_18/puzzle_18.md @@ -42,8 +42,8 @@ Our GPU implementation uses parallel reduction for both finding the maximum valu Layout configuration: -- Input tensor: `Layout.row_major(SIZE)` -- Output tensor: `Layout.row_major(SIZE)` +- Input tensor: `row_major[SIZE]()` +- Output tensor: `row_major[SIZE]()` - Custom op parameters: `{"input_size": input_tensor.shape[0]}` Key aspects of this puzzle include: @@ -257,8 +257,8 @@ def softmax_gpu_kernel[ input_size: Int, dtype: DType = DType.float32, ]( - output: LayoutTensor[mut=True, dtype, layout], - input: LayoutTensor[mut=False, dtype, layout], + output: TileTensor[mut=True, dtype, layout], + input: TileTensor[mut=False, dtype, layout], ) ``` @@ -273,8 +273,8 @@ The kernel is parameterized with: #### Shared memory allocation ```mojo -shared_max = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() -shared_sum = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() +shared_max = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[BLOCK_DIM_X]()) +shared_sum = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[BLOCK_DIM_X]()) ``` The kernel allocates two shared memory buffers: diff --git a/book/src/puzzle_19/puzzle_19.md b/book/src/puzzle_19/puzzle_19.md index 53cea305..b581784e 100644 --- a/book/src/puzzle_19/puzzle_19.md 
+++ b/book/src/puzzle_19/puzzle_19.md @@ -81,10 +81,10 @@ Our GPU implementation **reuses and combines optimized kernels from previous puz Layout configuration: -- Query tensor: `Layout.row_major(d)` -- Key tensor: `Layout.row_major(seq_len, d)` -- Value tensor: `Layout.row_major(seq_len, d)` -- Output tensor: `Layout.row_major(d)` +- Query tensor: `row_major[d]()` +- Key tensor: `row_major[seq_len, d]()` +- Value tensor: `row_major[seq_len, d]()` +- Output tensor: `row_major[d]()` - Custom op parameters: `{"seq_len": seq_len, "d": d, "dtype": dtype}` Key aspects of this puzzle include: @@ -121,7 +121,7 @@ To complete this puzzle, we'll leverage the tiled matmul kernel from [Puzzle 16] **Transpose Kernel Implementation Guide:** -1. **Shared Memory Setup**: Use `LayoutTensor[dtype, Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` to create a square `TRANSPOSE_BLOCK_DIM_XY` ร— `TRANSPOSE_BLOCK_DIM_XY` shared memory tile for efficient data exchange between threads +1. **Shared Memory Setup**: Use `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY]())` to create a square `TRANSPOSE_BLOCK_DIM_XY` ร— `TRANSPOSE_BLOCK_DIM_XY` shared memory tile for efficient data exchange between threads 2. 
**Thread Indexing**: Map threads to matrix elements: - `local_row = thread_idx.y`, `local_col = thread_idx.x` (position within the block) diff --git a/book/src/puzzle_23/elementwise.md b/book/src/puzzle_23/elementwise.md index 0942d44d..cb0ad5df 100644 --- a/book/src/puzzle_23/elementwise.md +++ b/book/src/puzzle_23/elementwise.md @@ -10,7 +10,7 @@ This puzzle covers: - **Functional GPU programming** with `elementwise` - **Automatic SIMD vectorization** within GPU threads -- **LayoutTensor operations** for safe memory access +- **TileTensor operations** for safe memory access - **GPU thread hierarchy** vs SIMD operations - **Capturing semantics** in nested functions @@ -24,7 +24,7 @@ The implementation covers fundamental patterns applicable to all GPU functional - Vector size: `SIZE = 1024` - Data type: `DType.float32` - SIMD width: Target-dependent (determined by GPU architecture and data type) -- Layout: `Layout.row_major(SIZE)` (1D row-major) +- Layout: `row_major[SIZE]()` (1D row-major) ## Code to complete diff --git a/book/src/puzzle_23/puzzle_23.md b/book/src/puzzle_23/puzzle_23.md index 47aeff5d..c34ebc2d 100644 --- a/book/src/puzzle_23/puzzle_23.md +++ b/book/src/puzzle_23/puzzle_23.md @@ -71,7 +71,7 @@ Before diving into functional patterns, ensure you're comfortable with: - **Basic GPU concepts**: Memory hierarchy, thread execution, SIMD operations - **Mojo fundamentals**: Parameter functions, compile-time specialization, capturing semantics -- **LayoutTensor operations**: Loading, storing, and tensor manipulation +- **TileTensor operations**: Loading, storing, and tensor manipulation - **GPU memory management**: Buffer allocation, host-device synchronization ## Learning path @@ -86,7 +86,7 @@ Start with the foundation: automatic thread management and SIMD vectorization. 
- Functional GPU programming with `elementwise` - Automatic SIMD vectorization within GPU threads -- LayoutTensor operations for safe memory access +- TileTensor operations for safe memory access - Capturing semantics in nested functions **Key pattern:** diff --git a/book/src/puzzle_23/tile.md b/book/src/puzzle_23/tile.md index 2a26d174..bd0000b1 100644 --- a/book/src/puzzle_23/tile.md +++ b/book/src/puzzle_23/tile.md @@ -31,7 +31,7 @@ But with a completely different execution strategy optimized for memory hierarch - Tile size: `TILE_SIZE = 32` - Data type: `DType.float32` - SIMD width: GPU-dependent (for operations within tiles) -- Layout: `Layout.row_major(SIZE)` (1D row-major) +- Layout: `row_major[SIZE]()` (1D row-major) ## Code to complete @@ -58,7 +58,7 @@ For a 1024-element vector with `TILE_SIZE=32`: `1024 รท 32 = 32` tiles exactly. ### 2. **Tile extraction pattern** -Check out the [LayoutTensor `.tile` documentation](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/#tile). +Check out the [TileTensor `.tile` documentation](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/#tile). ```mojo tile_id = indices[0] # Each thread gets one tile to process diff --git a/book/src/puzzle_23/vectorize.md b/book/src/puzzle_23/vectorize.md index 6098ac58..5dd31e81 100644 --- a/book/src/puzzle_23/vectorize.md +++ b/book/src/puzzle_23/vectorize.md @@ -32,7 +32,7 @@ But with sophisticated vectorization strategies for maximum performance. - Tile size: `TILE_SIZE = 32` - Data type: `DType.float32` - SIMD width: GPU-dependent -- Layout: `Layout.row_major(SIZE)` (1D row-major) +- Layout: `row_major[SIZE]()` (1D row-major) ## 1. Manual vectorization approach diff --git a/book/src/puzzle_24/puzzle_24.md b/book/src/puzzle_24/puzzle_24.md index 59733f5d..d261d862 100644 --- a/book/src/puzzle_24/puzzle_24.md +++ b/book/src/puzzle_24/puzzle_24.md @@ -49,9 +49,9 @@ Learn the core warp primitives from `gpu.primitives.warp`: ```mojo # 1. 
Reduction through shared memory # Complex pattern we have seen earlier (from p12.mojo): -shared = LayoutTensor[ -    dtype, -    Layout.row_major(WARP_SIZE), -    MutAnyOrigin, -    address_space = AddressSpace.SHARED, -].stack_allocation() +shared = stack_allocation[ +    dtype=dtype, +    address_space = AddressSpace.SHARED, +]( +    row_major[WARP_SIZE]() +) @@ -93,7 +93,7 @@ Before diving into warp programming, ensure you're comfortable with: - **Part V functional patterns**: Elementwise, tiled, and vectorized approaches - **GPU thread hierarchy**: Understanding blocks, warps, and threads -- **LayoutTensor operations**: Loading, storing, and tensor manipulation +- **TileTensor operations**: Loading, storing, and tensor manipulation - **Shared memory concepts**: Why barriers and tree reduction are complex ## Learning path diff --git a/book/src/puzzle_24/warp_sum.md index 1289baff..c04db78f 100644 --- a/book/src/puzzle_24/warp_sum.md +++ b/book/src/puzzle_24/warp_sum.md @@ -25,7 +25,7 @@ But the implementation teaches fundamental patterns for all warp-level GPU progr - Data type: `DType.float32` - Block configuration: `(WARP_SIZE, 1)` threads per block - Grid configuration: `(1, 1)` blocks per grid -- Layout: `Layout.row_major(SIZE)` (1D row-major) +- Layout: `row_major[SIZE]()` (1D row-major) ## The traditional complexity (from Puzzle 12) diff --git a/book/src/puzzle_25/puzzle_25.md index ac88c36f..8f0b7798 100644 --- a/book/src/puzzle_25/puzzle_25.md +++ b/book/src/puzzle_25/puzzle_25.md @@ -48,9 +48,9 @@ Learn the core communication primitives from `gpu.primitives.warp`: ```mojo # Complex neighbor access pattern (traditional approach): -shared = LayoutTensor[ -    dtype, -    Layout.row_major(WARP_SIZE), -    MutAnyOrigin, -    address_space = AddressSpace.SHARED, -].stack_allocation() +shared = stack_allocation[ +    dtype=dtype, +    address_space = AddressSpace.SHARED, +]( +    row_major[WARP_SIZE]() +) @@ -89,7 +89,7 @@ Before diving into warp communication, ensure you're comfortable with: - **Part VII warp fundamentals**: Understanding SIMT execution and basic warp 
operations (see [Puzzle 24](../puzzle_24/puzzle_24.md)) - **GPU thread hierarchy**: Blocks, warps, and lane numbering -- **LayoutTensor operations**: Loading, storing, and tensor manipulation +- **TileTensor operations**: Loading, storing, and tensor manipulation - **Boundary condition handling**: Managing edge cases in parallel algorithms ## Learning path diff --git a/book/src/puzzle_25/warp_broadcast.md b/book/src/puzzle_25/warp_broadcast.md index b6729a70..25d9f6d1 100644 --- a/book/src/puzzle_25/warp_broadcast.md +++ b/book/src/puzzle_25/warp_broadcast.md @@ -80,7 +80,7 @@ Implement a basic broadcast pattern where lane 0 computes a block-level statisti - Grid configuration: `(1, 1)` blocks per grid - Block configuration: `(WARP_SIZE, 1)` threads per block - Data type: `DType.float32` -- Layout: `Layout.row_major(SIZE)` (1D row-major) +- Layout: `row_major[SIZE]()` (1D row-major) ### Code to complete diff --git a/book/src/puzzle_25/warp_shuffle_down.md b/book/src/puzzle_25/warp_shuffle_down.md index 55664f9e..e034fc18 100644 --- a/book/src/puzzle_25/warp_shuffle_down.md +++ b/book/src/puzzle_25/warp_shuffle_down.md @@ -29,7 +29,7 @@ This transforms complex neighbor access patterns into simple warp-level operatio - Grid configuration: `(1, 1)` blocks per grid - Block configuration: `(WARP_SIZE, 1)` threads per block - Data type: `DType.float32` -- Layout: `Layout.row_major(SIZE)` (1D row-major) +- Layout: `row_major[SIZE]()` (1D row-major) ### The shuffle_down concept diff --git a/book/src/puzzle_26/puzzle_26.md b/book/src/puzzle_26/puzzle_26.md index 962a606f..6ddad887 100644 --- a/book/src/puzzle_26/puzzle_26.md +++ b/book/src/puzzle_26/puzzle_26.md @@ -48,9 +48,9 @@ Learn the sophisticated communication primitives from `gpu.primitives.warp`: ```mojo # Complex parallel reduction (traditional approach - from Puzzle 14): -shared = LayoutTensor[ +shared = TileTensor[ dtype, - Layout.row_major(WARP_SIZE), + row_major[WARP_SIZE](), MutAnyOrigin, address_space = 
AddressSpace.SHARED, ].stack_allocation() diff --git a/book/src/puzzle_26/warp_prefix_sum.md b/book/src/puzzle_26/warp_prefix_sum.md index 4a0d6574..bf861e87 100644 --- a/book/src/puzzle_26/warp_prefix_sum.md +++ b/book/src/puzzle_26/warp_prefix_sum.md @@ -26,7 +26,7 @@ This transforms multi-phase shared memory algorithms into elegant single-functio - Grid configuration: `(1, 1)` blocks per grid - Block configuration: `(WARP_SIZE, 1)` threads per block - Data type: `DType.float32` -- Layout: `Layout.row_major(SIZE)` (1D row-major) +- Layout: `row_major[SIZE]()` (1D row-major) ### The `prefix_sum` advantage diff --git a/book/src/puzzle_26/warp_shuffle_xor.md b/book/src/puzzle_26/warp_shuffle_xor.md index 925b48d6..51415b98 100644 --- a/book/src/puzzle_26/warp_shuffle_xor.md +++ b/book/src/puzzle_26/warp_shuffle_xor.md @@ -29,7 +29,7 @@ This transforms complex parallel algorithms into elegant butterfly communication - Grid configuration: `(1, 1)` blocks per grid - Block configuration: `(WARP_SIZE, 1)` threads per block - Data type: `DType.float32` -- Layout: `Layout.row_major(SIZE)` (1D row-major) +- Layout: `row_major[SIZE]()` (1D row-major) ### The `shuffle_xor` concept diff --git a/book/src/puzzle_27/block_broadcast.md b/book/src/puzzle_27/block_broadcast.md index 8e0a9fd9..6edcaefc 100644 --- a/book/src/puzzle_27/block_broadcast.md +++ b/book/src/puzzle_27/block_broadcast.md @@ -25,7 +25,7 @@ Each thread contributes to the mean calculation, then receives the broadcast mea - Data type: `DType.float32` - Block configuration: `(128, 1)` threads per block (`TPB = 128`) - Grid configuration: `(1, 1)` blocks per grid -- Layout: `Layout.row_major(SIZE)` (1D row-major for input and output) +- Layout: `row_major[SIZE]()` (1D row-major for input and output) - Test data: Values cycling 1-8, so mean = 4.5 - Expected output: Normalized vector with mean = 1.0 @@ -100,7 +100,7 @@ The algorithm follows the perfect block operations pattern: ### 2. 
**Data loading and sum computation (familiar patterns)** -Load your element using the established LayoutTensor pattern: +Load your element using the established TileTensor pattern: ```mojo var my_value: Scalar[dtype] = 0.0 @@ -255,7 +255,7 @@ Thread indexing (consistent across all puzzles): global_i = block_dim.x * block_idx.x + thread_idx.x // Maps to input array position local_i = thread_idx.x // Position within block (0-127) -Parallel element loading using LayoutTensor pattern: +Parallel element loading using TileTensor pattern: Thread 0: my_value = input_data[0][0] = 1.0 // First cycle value Thread 1: my_value = input_data[1][0] = 2.0 // Second cycle value Thread 7: my_value = input_data[7][0] = 8.0 // Last cycle value @@ -373,10 +373,10 @@ Mathematical proof of correctness: Algorithm produces provably correct mathematical result. ``` -### **Connection to [Puzzle 12](../puzzle_12/layout_tensor.md) (foundational patterns):** +### **Connection to [Puzzle 12](../puzzle_12/tile_tensor.md) (foundational patterns):** - **Thread coordination evolution**: Same `global_i`, `local_i` patterns but with block primitives -- **Memory access patterns**: Same LayoutTensor SIMD extraction `[0]` but optimized workflow +- **Memory access patterns**: Same TileTensor SIMD extraction `[0]` but optimized workflow - **Complexity elimination**: Replaces 20+ lines of manual barriers with 2 block operations - **Educational progression**: Manual โ†’ automated, complex โ†’ simple, error-prone โ†’ reliable @@ -454,7 +454,7 @@ Mean normalization is the perfect educational example of this fundamental patter **Complete block operations progression:** -1. **Manual coordination** ([Puzzle 12](../puzzle_12/layout_tensor.md)): Understand parallel fundamentals +1. **Manual coordination** ([Puzzle 12](../puzzle_12/tile_tensor.md)): Understand parallel fundamentals 2. **Warp primitives** ([Puzzle 24](../puzzle_24/warp_sum.md)): Learn hardware-accelerated patterns 3. 
**Block reduction** ([`block.sum()`](./block_sum.md)): Learn allโ†’one communication 4. **Block scan** ([`block.prefix_sum()`](./block_prefix_sum.md)): Learn allโ†’each communication diff --git a/book/src/puzzle_27/block_prefix_sum.md b/book/src/puzzle_27/block_prefix_sum.md index afdd50bf..d9cbf3c6 100644 --- a/book/src/puzzle_27/block_prefix_sum.md +++ b/book/src/puzzle_27/block_prefix_sum.md @@ -26,7 +26,7 @@ Each thread determines its element's bin assignment, with `block.prefix_sum()` c - Block configuration: `(128, 1)` threads per block (`TPB = 128`) - Grid configuration: `(1, 1)` blocks per grid - Number of bins: `NUM_BINS = 8` (ranges [0.0, 0.125), [0.125, 0.25), etc.) -- Layout: `Layout.row_major(SIZE)` (1D row-major) +- Layout: `row_major[SIZE]()` (1D row-major) - Warps per block: `128 / WARP_SIZE` (2 or 4 warps depending on GPU) ## The challenge: Parallel bin extraction @@ -132,7 +132,7 @@ if belongs_to_target == 1: bin_output[Int(offset[0])] = my_value # Convert SIMD to Int for indexing ``` -This is just like the bounds checking pattern from [Puzzle 12](../puzzle_12/layout_tensor.md), but now the condition is "belongs to target bin." +This is just like the bounds checking pattern from [Puzzle 12](../puzzle_12/tile_tensor.md), but now the condition is "belongs to target bin." ### 6. 
**Final count computation** @@ -150,7 +150,7 @@ if local_i == tpb - 1: # Last thread in block Remember the patterns from previous puzzles: -- `LayoutTensor` indexing returns SIMD: `input_data[i][0]` +- `TileTensor` indexing returns SIMD: `input_data[i][0]` - `block.prefix_sum()` returns SIMD: `offset[0]` to extract - Array indexing needs `Int`: `Int(offset[0])` for `bin_output[...]` @@ -252,14 +252,14 @@ The `block.prefix_sum()` kernel demonstrates advanced parallel coordination patt ## **Step-by-step algorithm walkthrough:** -### **Phase 1: Element processing (like [Puzzle 12](../puzzle_12/layout_tensor.md) dot product)** +### **Phase 1: Element processing (like [Puzzle 12](../puzzle_12/tile_tensor.md) dot product)** ``` Thread indexing (familiar pattern): global_i = block_dim.x * block_idx.x + thread_idx.x // Global element index local_i = thread_idx.x // Local thread index -Element loading (like LayoutTensor pattern): +Element loading (like TileTensor pattern): Thread 0: my_value = input_data[0][0] = 0.00 Thread 1: my_value = input_data[1][0] = 0.01 Thread 13: my_value = input_data[13][0] = 0.13 @@ -328,11 +328,11 @@ Last thread computes total (not thread 0!): ## **Why this advanced algorithm works:** -### **Connection to [Puzzle 12](../puzzle_12/layout_tensor.md) (Traditional dot product):** +### **Connection to [Puzzle 12](../puzzle_12/tile_tensor.md) (Traditional dot product):** - **Same thread indexing**: `global_i` and `local_i` patterns - **Same bounds checking**: `if global_i < size` validation -- **Same data loading**: LayoutTensor SIMD extraction with `[0]` +- **Same data loading**: TileTensor SIMD extraction with `[0]` ### **Connection to [`block.sum()`](./block_sum.md) (earlier in this puzzle):** diff --git a/book/src/puzzle_27/block_sum.md b/book/src/puzzle_27/block_sum.md index 179e7d4f..89317757 100644 --- a/book/src/puzzle_27/block_sum.md +++ b/book/src/puzzle_27/block_sum.md @@ -25,12 +25,12 @@ But the implementation teaches fundamental patterns 
for all block-level GPU prog - Data type: `DType.float32` - Block configuration: `(128, 1)` threads per block (`TPB = 128`) - Grid configuration: `(1, 1)` blocks per grid -- Layout: `Layout.row_major(SIZE)` (1D row-major) +- Layout: `row_major[SIZE]()` (1D row-major) - Warps per block: `128 / WARP_SIZE` (4 warps on NVIDIA, 2 or 4 warps on AMD) ## The traditional complexity (from Puzzle 12) -Recall the complex approach from [Puzzle 12](../puzzle_12/layout_tensor.md) that required shared memory, barriers, and tree reduction: +Recall the complex approach from [Puzzle 12](../puzzle_12/tile_tensor.md) that required shared memory, barriers, and tree reduction: ```mojo {{#include ../../../solutions/p27/p27.mojo:traditional_dot_product_solution}} @@ -171,9 +171,9 @@ Every block reduction follows the same conceptual pattern: Each thread should handle one element pair from vectors `a` and `b`. What operation combines these into a "partial result" that can be summed across threads? -### 3. **LayoutTensor indexing patterns** +### 3. **TileTensor indexing patterns** -When accessing `LayoutTensor` elements, remember that indexing returns SIMD values. You'll need to extract the scalar value for arithmetic operations. +When accessing `TileTensor` elements, remember that indexing returns SIMD values. You'll need to extract the scalar value for arithmetic operations. ### 4. 
**[block.sum()](https://docs.modular.com/mojo/std/gpu/primitives/block/sum) API concepts** diff --git a/book/src/puzzle_28/puzzle_28.md b/book/src/puzzle_28/puzzle_28.md index 8a1fb7c3..9976e13d 100644 --- a/book/src/puzzle_28/puzzle_28.md +++ b/book/src/puzzle_28/puzzle_28.md @@ -162,7 +162,7 @@ This concept becomes particularly important when implementing async copy operati - Grid configuration: `(VECTOR_SIZE // CONV_TILE_SIZE, 1)` blocks per grid (64 blocks) - Kernel size: `KERNEL_SIZE = 5` (simple 1D convolution, same as Puzzle 13) - Data type: `DType.float32` -- Layout: `Layout.row_major(VECTOR_SIZE)` (1D row-major) +- Layout: `row_major[VECTOR_SIZE]()` (1D row-major) ### The async copy opportunity @@ -325,13 +325,13 @@ The async copy overlap solution demonstrates how to hide memory latency by overl ```mojo # Phase 1: Launch async copy for input tile input_tile = input.tile[CONV_TILE_SIZE](block_idx.x) -comptime load_layout = Layout.row_major(THREADS_PER_BLOCK_ASYNC) +comptime load_layout = row_major[THREADS_PER_BLOCK_ASYNC]() copy_dram_to_sram_async[thread_layout=load_layout](input_shared, input_tile) ``` -- **Tile Creation**: `input.tile[CONV_TILE_SIZE](block_idx.x)` creates a 256-element view of the input array starting at `block_idx.x * 256`. The Mojo [`tile` method](https://docs.modular.com/mojo/kernels/layout/layout_tensor/LayoutTensor/#tile) does **NOT** perform bounds checking or zero-padding. Accessing out-of-bounds indices results in undefined behavior. The implementation must ensure the tile size and offset remain within valid array bounds. +- **Tile Creation**: `input.tile[CONV_TILE_SIZE](block_idx.x)` creates a 256-element view of the input array starting at `block_idx.x * 256`. The Mojo [`tile` method](https://docs.modular.com/mojo/kernels/layout/tile_tensor/TileTensor/#tile) does **NOT** perform bounds checking or zero-padding. Accessing out-of-bounds indices results in undefined behavior. 
The implementation must ensure the tile size and offset remain within valid array bounds. -- **Thread Layout**: `Layout.row_major(THREADS_PER_BLOCK_ASYNC, 1)` creates a `256 x 1` layout that matches our block organization. This is **critical** - the layout must match the physical thread arrangement for optimal coalesced memory access. When layouts mismatch, threads may access non-contiguous memory addresses, breaking coalescing and severely degrading performance. +- **Thread Layout**: `row_major[THREADS_PER_BLOCK_ASYNC, 1]()` creates a `256 x 1` layout that matches our block organization. This is **critical** - the layout must match the physical thread arrangement for optimal coalesced memory access. When layouts mismatch, threads may access non-contiguous memory addresses, breaking coalescing and severely degrading performance. - **Async Copy Launch**: `copy_dram_to_sram_async` initiates a background transfer from DRAM to shared memory. The hardware copies 256 floats (1KB) while the block continues executing. @@ -410,7 +410,7 @@ Total Time = MAX(Input_Transfer_Time, Kernel_Transfer_Time) + Compute_Time #### **Key technical insights** -1. **Thread Layout Matching**: The `Layout.row_major(256, 1)` layout precisely matches the block's `(256, 1)` thread organization, enabling optimal memory coalescing. +1. **Thread Layout Matching**: The `row_major[256, 1]()` layout precisely matches the block's `(256, 1)` thread organization, enabling optimal memory coalescing. 2. **Race Condition Avoidance**: Proper sequencing (async copy โ†’ kernel load โ†’ wait โ†’ barrier โ†’ compute) eliminates all race conditions that could corrupt shared memory. diff --git a/book/src/puzzle_32/conflict_free_patterns.md b/book/src/puzzle_32/conflict_free_patterns.md index dcda0edf..08cc5550 100644 --- a/book/src/puzzle_32/conflict_free_patterns.md +++ b/book/src/puzzle_32/conflict_free_patterns.md @@ -354,7 +354,7 @@ constant = shared[0] # All threads read same address - hardware optimized **3. 
Padding techniques:** ```mojo -shared = LayoutTensor[dtype, Layout.row_major(TPB + 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() # Shift access patterns +shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + 1]()) # Shift access patterns ``` **4. Access pattern analysis:** diff --git a/book/src/puzzle_33/puzzle_33.md b/book/src/puzzle_33/puzzle_33.md index f4abfe51..254c0b26 100644 --- a/book/src/puzzle_33/puzzle_33.md +++ b/book/src/puzzle_33/puzzle_33.md @@ -176,9 +176,9 @@ Your task is to complete the `tensor_core_matrix_multiplication` function. The s Layout configuration: -- Input A: `Layout.row_major(SIZE, SIZE)` -- Input B: `Layout.row_major(SIZE, SIZE)` -- Output C: `Layout.row_major(SIZE, SIZE)` +- Input A: `row_major[SIZE, SIZE]()` +- Input B: `row_major[SIZE, SIZE]()` +- Output C: `row_major[SIZE, SIZE]()` - Shared Memory: Block-sized tiles with async copy operations ## The challenge diff --git a/book/src/puzzle_34/advanced_cluster_patterns.md b/book/src/puzzle_34/advanced_cluster_patterns.md index 0e5ad190..a5b9cdcf 100644 --- a/book/src/puzzle_34/advanced_cluster_patterns.md +++ b/book/src/puzzle_34/advanced_cluster_patterns.md @@ -37,7 +37,7 @@ Real-world GPU algorithms often require **hierarchical coordination** where diff - **Warp Size**: `WARP_SIZE = 32` threads per warp (NVIDIA standard) - **Warps per Block**: `TPB / WARP_SIZE = 8` warps - **Data Type**: `DType.float32` -- **Memory Layout**: Input `Layout.row_major(SIZE)`, Output `Layout.row_major(CLUSTER_SIZE)` +- **Memory Layout**: Input `row_major[SIZE]()`, Output `row_major[CLUSTER_SIZE]()` **Processing Distribution:** diff --git a/book/src/puzzle_34/cluster_collective_ops.md b/book/src/puzzle_34/cluster_collective_ops.md index da6d0588..4e2965d5 100644 --- a/book/src/puzzle_34/cluster_collective_ops.md +++ b/book/src/puzzle_34/cluster_collective_ops.md @@ -40,8 +40,8 @@ Single blocks (as learned in [Puzzle 
27](../puzzle_27/puzzle_27.md)) are limited - **Block Configuration**: `TPB = 256` threads per block `(256, 1)` - **Grid Configuration**: `CLUSTER_SIZE = 4` blocks per cluster `(4, 1)` - **Data Type**: `DType.float32` -- **Memory Layout**: Input `Layout.row_major(SIZE)`, Output `Layout.row_major(1)` -- **Temporary Storage**: `Layout.row_major(CLUSTER_SIZE)` for partial results +- **Memory Layout**: Input `row_major[SIZE]()`, Output `row_major[1]()` +- **Temporary Storage**: `row_major[CLUSTER_SIZE]()` for partial results **Expected Result**: Sum of sequence `0, 0.01, 0.02, ..., 10.23` = **523,776** diff --git a/book/src/puzzle_34/cluster_coordination_basics.md b/book/src/puzzle_34/cluster_coordination_basics.md index feac8b14..bbe99959 100644 --- a/book/src/puzzle_34/cluster_coordination_basics.md +++ b/book/src/puzzle_34/cluster_coordination_basics.md @@ -35,7 +35,7 @@ Traditional single-block algorithms like those in [Puzzle 27](../puzzle_27/puzzl - **Block Configuration**: `TPB = 256` threads per block `(256, 1)` - **Grid Configuration**: `CLUSTER_SIZE = 4` blocks per cluster `(4, 1)` - **Data Type**: `DType.float32` -- **Memory Layout**: Input `Layout.row_major(SIZE)`, Output `Layout.row_major(CLUSTER_SIZE)` +- **Memory Layout**: Input `row_major[SIZE]()`, Output `row_major[CLUSTER_SIZE]()` **Thread Block Distribution:** @@ -65,7 +65,7 @@ Traditional single-block algorithms like those in [Puzzle 27](../puzzle_27/puzzl ### **Shared memory coordination** -- Allocate shared memory using `LayoutTensor[dtype, Layout.row_major(tpb), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` (see [shared memory basics from Puzzle 8](../puzzle_08/puzzle_08.md)) +- Allocate shared memory using `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[tpb]())` (see [shared memory basics from Puzzle 8](../puzzle_08/puzzle_08.md)) - Process input data scaled by `block_id + 1` to create distinct scaling per block - Use bounds checking when 
accessing input data (pattern from [guards in Puzzle 3](../puzzle_03/puzzle_03.md)) @@ -153,7 +153,7 @@ block_id = Int(block_idx.x) # Block index for reliable **Shared memory allocation and data processing:** -- Each block allocates its own shared memory workspace: `LayoutTensor[dtype, Layout.row_major(tpb), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` +- Each block allocates its own shared memory workspace: `stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[tpb]())` - **Scaling strategy**: `data_scale = Float32(block_id + 1)` ensures each block processes data differently - Block 0: multiplies by 1.0, Block 1: by 2.0, Block 2: by 3.0, Block 3: by 4.0 - **Bounds checking**: `if global_i < size:` prevents out-of-bounds memory access diff --git a/pixi.toml b/pixi.toml index c886bba0..90cd7c4f 100644 --- a/pixi.toml +++ b/pixi.toml @@ -83,24 +83,21 @@ p03 = "mojo problems/p03/p03.mojo" viz03 = "cd book/src/puzzle_03 && python puzzle_03_viz.py" p04 = "mojo problems/p04/p04.mojo" -p04_layout_tensor = "mojo problems/p04/p04_layout_tensor.mojo" +p04_tile_tensor = "mojo problems/p04/p04_tile_tensor.mojo" viz04 = "cd book/src/puzzle_04 && python puzzle_04_viz.py" thread_indexing = "cd book/src/puzzle_04 && python thread_indexing_viz.py" -layout_tensor_intro = "mojo book/src/puzzle_04/intro.mojo" +tile_tensor_intro = "mojo book/src/puzzle_04/intro.mojo" p05 = "mojo problems/p05/p05.mojo" -p05_layout_tensor = "mojo problems/p05/p05_layout_tensor.mojo" viz05 = "cd book/src/puzzle_05 && python puzzle_05_viz.py" p06 = "mojo problems/p06/p06.mojo" viz06 = "cd book/src/puzzle_06 && python puzzle_06_viz.py" p07 = "mojo problems/p07/p07.mojo" -p07_layout_tensor = "mojo problems/p07/p07_layout_tensor.mojo" viz07 = "cd book/src/puzzle_07 && python puzzle_07_viz.py" p08 = "mojo problems/p08/p08.mojo" -p08_layout_tensor = "mojo problems/p08/p08_layout_tensor.mojo" viz08 = "cd book/src/puzzle_08 && python puzzle_08_viz.py" p09 = "mojo 
problems/p09/p09.mojo" @@ -108,11 +105,9 @@ p09 = "mojo problems/p09/p09.mojo" p10 = "mojo problems/p10/p10.mojo" p11 = "mojo problems/p11/p11.mojo" -p11_layout_tensor = "mojo problems/p11/p11_layout_tensor.mojo" viz11 = "cd book/src/puzzle_11 && python puzzle_11_viz.py" p12 = "mojo problems/p12/p12.mojo" -p12_layout_tensor = "mojo problems/p12/p12_layout_tensor.mojo" viz12 = "cd book/src/puzzle_12 && python puzzle_12_viz.py" p13 = "mojo problems/p13/p13.mojo" diff --git a/problems/p04/p04_layout_tensor.mojo b/problems/p04/p04_tile_tensor.mojo similarity index 72% rename from problems/p04/p04_layout_tensor.mojo rename to problems/p04/p04_tile_tensor.mojo index ad8aff51..f4f963f0 100644 --- a/problems/p04/p04_layout_tensor.mojo +++ b/problems/p04/p04_tile_tensor.mojo @@ -1,19 +1,21 @@ from std.gpu import thread_idx from std.gpu.host import DeviceContext -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major from std.testing import assert_equal -# ANCHOR: add_10_2d_layout_tensor +# ANCHOR: add_10_2d_tile_tensor comptime SIZE = 2 comptime BLOCKS_PER_GRID = 1 comptime THREADS_PER_BLOCK = (3, 3) comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE, SIZE) +comptime layout = row_major[SIZE, SIZE]() +comptime LayoutType = type_of(layout) def add_10_2d( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], size: Int, ): var row = thread_idx.y @@ -21,15 +23,15 @@ def add_10_2d( # FILL ME IN (roughly 2 lines) -# ANCHOR_END: add_10_2d_layout_tensor +# ANCHOR_END: add_10_2d_tile_tensor def main() raises: with DeviceContext() as ctx: var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) out_buf.enqueue_fill(0) - var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out_buf) - print("out shape:", out_tensor.shape[0](), "x", 
out_tensor.shape[1]()) + var out_tensor = TileTensor(out_buf, layout) + print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]()) var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) expected.enqueue_fill(0) @@ -41,7 +43,7 @@ def main() raises: a_host[i] = Scalar[dtype](i) expected[i] = a_host[i] + 10 - var a_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](a) + var a_tensor = TileTensor(a, layout) ctx.enqueue_function[add_10_2d, add_10_2d]( out_tensor, diff --git a/problems/p05/p05.mojo b/problems/p05/p05.mojo index 8dc43341..335f6feb 100644 --- a/problems/p05/p05.mojo +++ b/problems/p05/p05.mojo @@ -1,6 +1,7 @@ -from std.memory import UnsafePointer from std.gpu import thread_idx from std.gpu.host import DeviceContext +from layout import TileTensor +from layout.tile_layout import row_major from std.testing import assert_equal # ANCHOR: broadcast_add @@ -8,12 +9,18 @@ comptime SIZE = 2 comptime BLOCKS_PER_GRID = 1 comptime THREADS_PER_BLOCK = (3, 3) comptime dtype = DType.float32 +comptime out_layout = row_major[SIZE, SIZE]() +comptime a_layout = row_major[1, SIZE]() +comptime b_layout = row_major[SIZE, 1]() +comptime OutLayout = type_of(out_layout) +comptime ALayout = type_of(a_layout) +comptime BLayout = type_of(b_layout) def broadcast_add( - output: UnsafePointer[Scalar[dtype], MutAnyOrigin], - a: UnsafePointer[Scalar[dtype], MutAnyOrigin], - b: UnsafePointer[Scalar[dtype], MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, BLayout, ImmutAnyOrigin], size: Int, ): var row = thread_idx.y @@ -24,10 +31,15 @@ def broadcast_add( # ANCHOR_END: broadcast_add def main() raises: with DeviceContext() as ctx: - var out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) - out.enqueue_fill(0) - var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) - expected.enqueue_fill(0) + var out_buf = 
ctx.enqueue_create_buffer[dtype](SIZE * SIZE) + out_buf.enqueue_fill(0) + var out_tensor = TileTensor(out_buf, out_layout) + print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]()) + + var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) + expected_buf.enqueue_fill(0) + var expected_tensor = TileTensor(expected_buf, out_layout) + var a = ctx.enqueue_create_buffer[dtype](SIZE) a.enqueue_fill(0) var b = ctx.enqueue_create_buffer[dtype](SIZE) @@ -39,12 +51,15 @@ def main() raises: for i in range(SIZE): for j in range(SIZE): - expected[i * SIZE + j] = a_host[j] + b_host[i] + expected_tensor[i, j] = a_host[j] + b_host[i] + + var a_tensor = TileTensor[mut=False, dtype, ALayout](a, a_layout) + var b_tensor = TileTensor[mut=False, dtype, BLayout](b, b_layout) ctx.enqueue_function[broadcast_add, broadcast_add]( - out, - a, - b, + out_tensor, + a_tensor, + b_tensor, SIZE, grid_dim=BLOCKS_PER_GRID, block_dim=THREADS_PER_BLOCK, @@ -52,10 +67,12 @@ def main() raises: ctx.synchronize() - with out.map_to_host() as out_host: - print("out:", out_host) - print("expected:", expected) + with out_buf.map_to_host() as out_buf_host: + print("out:", out_buf_host) + print("expected:", expected_buf) for i in range(SIZE): for j in range(SIZE): - assert_equal(out_host[i * SIZE + j], expected[i * SIZE + j]) + assert_equal( + out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j] + ) print("Puzzle 05 complete โœ…") diff --git a/problems/p05/p05_layout_tensor.mojo b/problems/p05/p05_layout_tensor.mojo deleted file mode 100644 index 1e65f5a0..00000000 --- a/problems/p05/p05_layout_tensor.mojo +++ /dev/null @@ -1,81 +0,0 @@ -from std.gpu import thread_idx -from std.gpu.host import DeviceContext -from layout import Layout, LayoutTensor -from std.testing import assert_equal - -# ANCHOR: broadcast_add_layout_tensor -comptime SIZE = 2 -comptime BLOCKS_PER_GRID = 1 -comptime THREADS_PER_BLOCK = (3, 3) -comptime dtype = DType.float32 -comptime out_layout = 
Layout.row_major(SIZE, SIZE) -comptime a_layout = Layout.row_major(1, SIZE) -comptime b_layout = Layout.row_major(SIZE, 1) - - -def broadcast_add[ - out_layout: Layout, - a_layout: Layout, - b_layout: Layout, -]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, b_layout, ImmutAnyOrigin], - size: Int, -): - var row = thread_idx.y - var col = thread_idx.x - # FILL ME IN (roughly 2 lines) - - -# ANCHOR_END: broadcast_add_layout_tensor -def main() raises: - with DeviceContext() as ctx: - var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) - out_buf.enqueue_fill(0) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf) - print("out shape:", out_tensor.shape[0](), "x", out_tensor.shape[1]()) - - var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) - expected_buf.enqueue_fill(0) - var expected_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin]( - expected_buf - ) - - var a = ctx.enqueue_create_buffer[dtype](SIZE) - a.enqueue_fill(0) - var b = ctx.enqueue_create_buffer[dtype](SIZE) - b.enqueue_fill(0) - with a.map_to_host() as a_host, b.map_to_host() as b_host: - for i in range(SIZE): - a_host[i] = Scalar[dtype](i + 1) - b_host[i] = Scalar[dtype](i * 10) - - for i in range(SIZE): - for j in range(SIZE): - expected_tensor[i, j] = a_host[j] + b_host[i] - - var a_tensor = LayoutTensor[dtype, a_layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, b_layout, ImmutAnyOrigin](b) - - comptime kernel = broadcast_add[out_layout, a_layout, b_layout] - ctx.enqueue_function[kernel, kernel]( - out_tensor, - a_tensor, - b_tensor, - SIZE, - grid_dim=BLOCKS_PER_GRID, - block_dim=THREADS_PER_BLOCK, - ) - - ctx.synchronize() - - with out_buf.map_to_host() as out_buf_host: - print("out:", out_buf_host) - print("expected:", expected_buf) - for i in range(SIZE): - for j in range(SIZE): - assert_equal( - out_buf_host[i * SIZE + j], expected_buf[i * SIZE + 
j] - ) - print("Puzzle 05 complete โœ…") diff --git a/problems/p07/p07.mojo b/problems/p07/p07.mojo index f6eaa3bb..51e2de6a 100644 --- a/problems/p07/p07.mojo +++ b/problems/p07/p07.mojo @@ -1,6 +1,7 @@ -from std.memory import UnsafePointer from std.gpu import thread_idx, block_idx, block_dim from std.gpu.host import DeviceContext +from layout import TileTensor +from layout.tile_layout import row_major from std.testing import assert_equal # ANCHOR: add_10_blocks_2d @@ -8,11 +9,15 @@ comptime SIZE = 5 comptime BLOCKS_PER_GRID = (2, 2) comptime THREADS_PER_BLOCK = (3, 3) comptime dtype = DType.float32 +comptime out_layout = row_major[SIZE, SIZE]() +comptime a_layout = row_major[SIZE, SIZE]() +comptime OutLayout = type_of(out_layout) +comptime ALayout = type_of(a_layout) def add_10_blocks_2d( - output: UnsafePointer[Scalar[dtype], MutAnyOrigin], - a: UnsafePointer[Scalar[dtype], MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin], size: Int, ): var row = block_dim.y * block_idx.y + thread_idx.y @@ -25,10 +30,13 @@ def add_10_blocks_2d( def main() raises: with DeviceContext() as ctx: - var out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) - out.enqueue_fill(0) - var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) - expected.enqueue_fill(1) + var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) + out_buf.enqueue_fill(0) + var out_tensor = TileTensor(out_buf, out_layout) + + var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) + expected_buf.enqueue_fill(1) + var a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) a.enqueue_fill(1) @@ -37,11 +45,13 @@ def main() raises: for i in range(SIZE): var k = j * SIZE + i a_host[k] = Scalar[dtype](k) - expected[k] = Scalar[dtype](k + 10) + expected_buf[k] = Scalar[dtype](k + 10) + + var a_tensor = TileTensor[mut=False, dtype, ALayout](a, a_layout) ctx.enqueue_function[add_10_blocks_2d, add_10_blocks_2d]( - 
out, - a, + out_tensor, + a_tensor, SIZE, grid_dim=BLOCKS_PER_GRID, block_dim=THREADS_PER_BLOCK, @@ -49,10 +59,17 @@ def main() raises: ctx.synchronize() - with out.map_to_host() as out_host: - print("out:", out_host) - print("expected:", expected) + var expected_tensor = TileTensor(expected_buf, out_layout) + + with out_buf.map_to_host() as out_buf_host: + print( + "out:", + TileTensor(out_buf_host, out_layout), + ) + print("expected:", expected_tensor) for i in range(SIZE): for j in range(SIZE): - assert_equal(out_host[i * SIZE + j], expected[i * SIZE + j]) + assert_equal( + out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j] + ) print("Puzzle 07 complete โœ…") diff --git a/problems/p07/p07_layout_tensor.mojo b/problems/p07/p07_layout_tensor.mojo deleted file mode 100644 index 604ac552..00000000 --- a/problems/p07/p07_layout_tensor.mojo +++ /dev/null @@ -1,78 +0,0 @@ -from std.gpu import thread_idx, block_idx, block_dim -from std.gpu.host import DeviceContext -from layout import Layout, LayoutTensor -from std.testing import assert_equal - -# ANCHOR: add_10_blocks_2d_layout_tensor -comptime SIZE = 5 -comptime BLOCKS_PER_GRID = (2, 2) -comptime THREADS_PER_BLOCK = (3, 3) -comptime dtype = DType.float32 -comptime out_layout = Layout.row_major(SIZE, SIZE) -comptime a_layout = Layout.row_major(SIZE, SIZE) - - -def add_10_blocks_2d[ - out_layout: Layout, - a_layout: Layout, -]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin], - size: Int, -): - var row = block_dim.y * block_idx.y + thread_idx.y - var col = block_dim.x * block_idx.x + thread_idx.x - # FILL ME IN (roughly 2 lines) - - -# ANCHOR_END: add_10_blocks_2d_layout_tensor - - -def main() raises: - with DeviceContext() as ctx: - var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) - out_buf.enqueue_fill(0) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf) - - var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * 
SIZE) - expected_buf.enqueue_fill(1) - - var a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) - a.enqueue_fill(1) - - with a.map_to_host() as a_host: - for j in range(SIZE): - for i in range(SIZE): - var k = j * SIZE + i - a_host[k] = Scalar[dtype](k) - expected_buf[k] = Scalar[dtype](k + 10) - - var a_tensor = LayoutTensor[dtype, a_layout, ImmutAnyOrigin](a) - - comptime kernel = add_10_blocks_2d[out_layout, a_layout] - ctx.enqueue_function[kernel, kernel]( - out_tensor, - a_tensor, - SIZE, - grid_dim=BLOCKS_PER_GRID, - block_dim=THREADS_PER_BLOCK, - ) - - ctx.synchronize() - - var expected_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin]( - expected_buf - ) - - with out_buf.map_to_host() as out_buf_host: - print( - "out:", - LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf_host), - ) - print("expected:", expected_tensor) - for i in range(SIZE): - for j in range(SIZE): - assert_equal( - out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j] - ) - print("Puzzle 07 complete โœ…") diff --git a/problems/p08/p08.mojo b/problems/p08/p08.mojo index 2f994b19..b89c6fc2 100644 --- a/problems/p08/p08.mojo +++ b/problems/p08/p08.mojo @@ -1,7 +1,9 @@ -from std.memory import UnsafePointer, stack_allocation from std.gpu import thread_idx, block_idx, block_dim, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.testing import assert_equal # ANCHOR: add_10_shared @@ -10,26 +12,26 @@ comptime SIZE = 8 comptime BLOCKS_PER_GRID = (2, 1) comptime THREADS_PER_BLOCK = (TPB, 1) comptime dtype = DType.float32 +comptime layout = row_major[SIZE]() +comptime LayoutType = type_of(layout) def add_10_shared( - output: UnsafePointer[Scalar[dtype], MutAnyOrigin], - a: UnsafePointer[Scalar[dtype], MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, 
ImmutAnyOrigin], size: Int, ): + # Allocate shared memory using stack_allocation var shared = stack_allocation[ - TPB, - Scalar[dtype], - address_space=AddressSpace.SHARED, - ]() + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[TPB]()) + var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x - # Load local data into shared memory + if global_i < size: shared[local_i] = a[global_i] - # wait for all threads to complete - # works within a thread block barrier() # FILL ME IN (roughly 2 lines) @@ -44,9 +46,13 @@ def main() raises: out.enqueue_fill(0) var a = ctx.enqueue_create_buffer[dtype](SIZE) a.enqueue_fill(1) + + var out_tensor = TileTensor(out, layout) + var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout) + ctx.enqueue_function[add_10_shared, add_10_shared]( - out, - a, + out_tensor, + a_tensor, SIZE, grid_dim=BLOCKS_PER_GRID, block_dim=THREADS_PER_BLOCK, @@ -54,7 +60,6 @@ def main() raises: var expected = ctx.enqueue_create_host_buffer[dtype](SIZE) expected.enqueue_fill(11) - ctx.synchronize() with out.map_to_host() as out_host: diff --git a/problems/p08/p08_layout_tensor.mojo b/problems/p08/p08_layout_tensor.mojo deleted file mode 100644 index 4856d2c3..00000000 --- a/problems/p08/p08_layout_tensor.mojo +++ /dev/null @@ -1,73 +0,0 @@ -from std.gpu import thread_idx, block_idx, block_dim, barrier -from std.gpu.host import DeviceContext -from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor -from std.testing import assert_equal - -# ANCHOR: add_10_shared_layout_tensor -comptime TPB = 4 -comptime SIZE = 8 -comptime BLOCKS_PER_GRID = (2, 1) -comptime THREADS_PER_BLOCK = (TPB, 1) -comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE) - - -def add_10_shared_layout_tensor[ - layout: Layout -]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, ImmutAnyOrigin], - size: Int, -): - # Allocate shared memory using LayoutTensor with 
explicit address_space - var shared = LayoutTensor[ - dtype, - Layout.row_major(TPB), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - - var global_i = block_dim.x * block_idx.x + thread_idx.x - var local_i = thread_idx.x - - if global_i < size: - shared[local_i] = a[global_i] - - barrier() - - # FILL ME IN (roughly 2 lines) - - -# ANCHOR_END: add_10_shared_layout_tensor - - -def main() raises: - with DeviceContext() as ctx: - var out = ctx.enqueue_create_buffer[dtype](SIZE) - out.enqueue_fill(0) - var a = ctx.enqueue_create_buffer[dtype](SIZE) - a.enqueue_fill(1) - - var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out) - var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a) - - comptime kernel = add_10_shared_layout_tensor[layout] - ctx.enqueue_function[kernel, kernel]( - out_tensor, - a_tensor, - SIZE, - grid_dim=BLOCKS_PER_GRID, - block_dim=THREADS_PER_BLOCK, - ) - - var expected = ctx.enqueue_create_host_buffer[dtype](SIZE) - expected.enqueue_fill(11) - ctx.synchronize() - - with out.map_to_host() as out_host: - print("out:", out_host) - print("expected:", expected) - for i in range(SIZE): - assert_equal(out_host[i], expected[i]) - print("Puzzle 08 complete โœ…") diff --git a/problems/p09/p09.mojo b/problems/p09/p09.mojo index 3455c9f0..38467f2c 100644 --- a/problems/p09/p09.mojo +++ b/problems/p09/p09.mojo @@ -2,7 +2,9 @@ from std.memory import UnsafePointer from std.gpu import thread_idx, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.testing import assert_equal from std.sys import argv @@ -11,7 +13,8 @@ comptime MATRIX_SIZE = 3 comptime BLOCKS_PER_GRID = 1 comptime THREADS_PER_BLOCK = SIZE comptime dtype = DType.float32 -comptime vector_layout = Layout.row_major(SIZE) +comptime vector_layout = 
row_major[SIZE]() +comptime VectorLayout = type_of(vector_layout) comptime ITER = 2 @@ -29,8 +32,8 @@ def add_10( # ANCHOR: second_crash def process_sliding_window( - output: LayoutTensor[dtype, vector_layout, MutAnyOrigin], - a: LayoutTensor[dtype, vector_layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, VectorLayout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, VectorLayout, ImmutAnyOrigin], ): var thread_id = thread_idx.x @@ -52,18 +55,15 @@ def process_sliding_window( # ANCHOR: third_crash def collaborative_filter( - output: LayoutTensor[dtype, vector_layout, MutAnyOrigin], - a: LayoutTensor[dtype, vector_layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, VectorLayout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, VectorLayout, ImmutAnyOrigin], ): var thread_id = thread_idx.x # Shared memory workspace for collaborative processing - var shared_workspace = LayoutTensor[ - dtype, - Layout.row_major(SIZE - 1), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + var shared_workspace = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[SIZE - 1]()) # Phase 1: Initialize shared workspace (all threads participate) if thread_id < SIZE - 1: @@ -139,13 +139,11 @@ def main() raises: for i in range(SIZE): input_host[i] = Scalar[dtype](i) - # Create LayoutTensors for structured access - input_tensor = LayoutTensor[dtype, vector_layout, ImmutAnyOrigin]( - input_buf - ) - output_tensor = LayoutTensor[dtype, vector_layout, MutAnyOrigin]( - output_buf + # Create TileTensors for structured access + input_tensor = TileTensor[mut=False, dtype, VectorLayout]( + input_buf, vector_layout ) + output_tensor = TileTensor(output_buf, vector_layout) print("Input array: [0, 1, 2, 3]") print("Computing sliding window sums (window size = 3)...") @@ -216,13 +214,11 @@ def main() raises: for i in range(SIZE): input_host[i] = Scalar[dtype](i + 1) - # Create LayoutTensors - input_tensor = LayoutTensor[dtype, 
vector_layout, ImmutAnyOrigin]( - input_buf - ) - output_tensor = LayoutTensor[dtype, vector_layout, MutAnyOrigin]( - output_buf + # Create TileTensors + input_tensor = TileTensor[mut=False, dtype, VectorLayout]( + input_buf, vector_layout ) + output_tensor = TileTensor(output_buf, vector_layout) print("Input array: [1, 2, 3, 4]") print("Applying collaborative filter using shared memory...") diff --git a/problems/p10/p10.mojo b/problems/p10/p10.mojo index d64fd3b4..e42e37e0 100644 --- a/problems/p10/p10.mojo +++ b/problems/p10/p10.mojo @@ -1,7 +1,9 @@ from std.gpu import thread_idx, block_dim, block_idx, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.testing import assert_equal from std.sys import argv @@ -11,23 +13,21 @@ comptime SIZE = 2 comptime BLOCKS_PER_GRID = 1 comptime THREADS_PER_BLOCK = (3, 3) comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE, SIZE) +comptime layout = row_major[SIZE, SIZE]() +comptime LayoutType = type_of(layout) def shared_memory_race( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], size: Int, ): var row = thread_idx.y var col = thread_idx.x - var shared_sum = LayoutTensor[ - dtype, - Layout.row_major(1), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + var shared_sum = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[1]()) if row < size and col < size: shared_sum[0] += a[row, col] @@ -43,8 +43,8 @@ def shared_memory_race( # ANCHOR: add_10_2d_no_guard def add_10_2d( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, ImmutAnyOrigin], + 
output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], size: Int, ): var row = thread_idx.y @@ -68,10 +68,8 @@ def main() raises: with DeviceContext() as ctx: var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) out_buf.enqueue_fill(0) - var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin]( - out_buf - ).reshape[layout]() - print("out shape:", out_tensor.shape[0](), "x", out_tensor.shape[1]()) + var out_tensor = TileTensor(out_buf, layout) + print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]()) var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) expected.enqueue_fill(0) @@ -81,9 +79,7 @@ def main() raises: for i in range(SIZE * SIZE): a_host[i] = Scalar[dtype](i) - var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a).reshape[ - layout - ]() + var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout) if flag == "--memory-bug": print("Running memory bug example (bounds checking issue)...") diff --git a/problems/p11/p11.mojo b/problems/p11/p11.mojo index 06b82580..c6ed142f 100644 --- a/problems/p11/p11.mojo +++ b/problems/p11/p11.mojo @@ -1,7 +1,9 @@ -from std.memory import UnsafePointer, stack_allocation from std.gpu import thread_idx, block_idx, block_dim, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.testing import assert_equal # ANCHOR: pooling @@ -10,21 +12,23 @@ comptime SIZE = 8 comptime BLOCKS_PER_GRID = (1, 1) comptime THREADS_PER_BLOCK = (TPB, 1) comptime dtype = DType.float32 +comptime layout = row_major[SIZE]() +comptime LayoutType = type_of(layout) def pooling( - output: UnsafePointer[Scalar[dtype], MutAnyOrigin], - a: UnsafePointer[Scalar[dtype], MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, 
LayoutType, ImmutAnyOrigin], size: Int, ): + # Allocate shared memory using stack_allocation var shared = stack_allocation[ - TPB, - Scalar[dtype], - address_space=AddressSpace.SHARED, - ]() + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[TPB]()) + var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x - # FILL ME IN (roughly 10 lines) + # FIX ME IN (roughly 10 lines) # ANCHOR_END: pooling @@ -36,13 +40,17 @@ def main() raises: out.enqueue_fill(0) var a = ctx.enqueue_create_buffer[dtype](SIZE) a.enqueue_fill(0) + with a.map_to_host() as a_host: for i in range(SIZE): a_host[i] = Scalar[dtype](i) + var out_tensor = TileTensor(out, layout) + var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout) + ctx.enqueue_function[pooling, pooling]( - out, - a, + out_tensor, + a_tensor, SIZE, grid_dim=BLOCKS_PER_GRID, block_dim=THREADS_PER_BLOCK, @@ -50,7 +58,6 @@ def main() raises: var expected = ctx.enqueue_create_host_buffer[dtype](SIZE) expected.enqueue_fill(0) - ctx.synchronize() with a.map_to_host() as a_host: @@ -59,7 +66,6 @@ def main() raises: var s = Scalar[dtype](0) for j in range(max(i - 2, 0), i + 1): s += ptr[j] - expected[i] = s with out.map_to_host() as out_host: diff --git a/problems/p11/p11_layout_tensor.mojo b/problems/p11/p11_layout_tensor.mojo deleted file mode 100644 index f7e293f1..00000000 --- a/problems/p11/p11_layout_tensor.mojo +++ /dev/null @@ -1,78 +0,0 @@ -from std.gpu import thread_idx, block_idx, block_dim, barrier -from std.gpu.host import DeviceContext -from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor -from std.testing import assert_equal - -# ANCHOR: pooling_layout_tensor -comptime TPB = 8 -comptime SIZE = 8 -comptime BLOCKS_PER_GRID = (1, 1) -comptime THREADS_PER_BLOCK = (TPB, 1) -comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE) - - -def pooling[ - layout: Layout -]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: 
LayoutTensor[dtype, layout, ImmutAnyOrigin], - size: Int, -): - # Allocate shared memory using tensor builder - var shared = LayoutTensor[ - dtype, - Layout.row_major(TPB), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - - var global_i = block_dim.x * block_idx.x + thread_idx.x - var local_i = thread_idx.x - # FIX ME IN (roughly 10 lines) - - -# ANCHOR_END: pooling_layout_tensor - - -def main() raises: - with DeviceContext() as ctx: - var out = ctx.enqueue_create_buffer[dtype](SIZE) - out.enqueue_fill(0) - var a = ctx.enqueue_create_buffer[dtype](SIZE) - a.enqueue_fill(0) - - with a.map_to_host() as a_host: - for i in range(SIZE): - a_host[i] = Scalar[dtype](i) - - var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out) - var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a) - - ctx.enqueue_function[pooling[layout], pooling[layout]]( - out_tensor, - a_tensor, - SIZE, - grid_dim=BLOCKS_PER_GRID, - block_dim=THREADS_PER_BLOCK, - ) - - var expected = ctx.enqueue_create_host_buffer[dtype](SIZE) - expected.enqueue_fill(0) - ctx.synchronize() - - with a.map_to_host() as a_host: - var ptr = a_host - for i in range(SIZE): - var s = Scalar[dtype](0) - for j in range(max(i - 2, 0), i + 1): - s += ptr[j] - expected[i] = s - - with out.map_to_host() as out_host: - print("out:", out_host) - print("expected:", expected) - for i in range(SIZE): - assert_equal(out_host[i], expected[i]) - print("Puzzle 11 complete โœ…") diff --git a/problems/p12/p12.mojo b/problems/p12/p12.mojo index 4b2e1153..bdd36908 100644 --- a/problems/p12/p12.mojo +++ b/problems/p12/p12.mojo @@ -1,21 +1,29 @@ -from std.memory import UnsafePointer, stack_allocation -from std.gpu import thread_idx, block_idx, block_dim, barrier -from std.gpu.host import DeviceContext -from std.gpu.memory import AddressSpace from std.testing import assert_equal +from std.gpu.host import DeviceContext # ANCHOR: dot_product +from std.gpu import thread_idx, block_idx, block_dim, barrier 
+from std.gpu.memory import AddressSpace +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation + + comptime TPB = 8 comptime SIZE = 8 comptime BLOCKS_PER_GRID = (1, 1) comptime THREADS_PER_BLOCK = (TPB, 1) comptime dtype = DType.float32 +comptime layout = row_major[SIZE]() +comptime out_layout = row_major[1]() +comptime LayoutType = type_of(layout) +comptime OutLayout = type_of(out_layout) def dot_product( - output: UnsafePointer[Scalar[dtype], MutAnyOrigin], - a: UnsafePointer[Scalar[dtype], MutAnyOrigin], - b: UnsafePointer[Scalar[dtype], MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], size: Int, ): # FILL ME IN (roughly 13 lines) @@ -33,15 +41,20 @@ def main() raises: a.enqueue_fill(0) var b = ctx.enqueue_create_buffer[dtype](SIZE) b.enqueue_fill(0) + with a.map_to_host() as a_host, b.map_to_host() as b_host: for i in range(SIZE): a_host[i] = Scalar[dtype](i) b_host[i] = Scalar[dtype](i) + var out_tensor = TileTensor(out, out_layout) + var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout) + var b_tensor = TileTensor[mut=False, dtype, LayoutType](b, layout) + ctx.enqueue_function[dot_product, dot_product]( - out, - a, - b, + out_tensor, + a_tensor, + b_tensor, SIZE, grid_dim=BLOCKS_PER_GRID, block_dim=THREADS_PER_BLOCK, @@ -49,7 +62,6 @@ def main() raises: var expected = ctx.enqueue_create_host_buffer[dtype](1) expected.enqueue_fill(0) - ctx.synchronize() with a.map_to_host() as a_host, b.map_to_host() as b_host: diff --git a/problems/p12/p12_layout_tensor.mojo b/problems/p12/p12_layout_tensor.mojo deleted file mode 100644 index 691730cf..00000000 --- a/problems/p12/p12_layout_tensor.mojo +++ /dev/null @@ -1,74 +0,0 @@ -from std.testing import assert_equal -from std.gpu.host import DeviceContext - -# ANCHOR: 
dot_product_layout_tensor -from std.gpu import thread_idx, block_idx, block_dim, barrier -from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor - - -comptime TPB = 8 -comptime SIZE = 8 -comptime BLOCKS_PER_GRID = (1, 1) -comptime THREADS_PER_BLOCK = (TPB, 1) -comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE) -comptime out_layout = Layout.row_major(1) - - -def dot_product[ - in_layout: Layout, out_layout: Layout -]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - size: Int, -): - # FILL ME IN (roughly 13 lines) - ... - - -# ANCHOR_END: dot_product_layout_tensor - - -def main() raises: - with DeviceContext() as ctx: - var out = ctx.enqueue_create_buffer[dtype](1) - out.enqueue_fill(0) - var a = ctx.enqueue_create_buffer[dtype](SIZE) - a.enqueue_fill(0) - var b = ctx.enqueue_create_buffer[dtype](SIZE) - b.enqueue_fill(0) - - with a.map_to_host() as a_host, b.map_to_host() as b_host: - for i in range(SIZE): - a_host[i] = Scalar[dtype](i) - b_host[i] = Scalar[dtype](i) - - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out) - var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b) - - comptime kernel = dot_product[layout, out_layout] - ctx.enqueue_function[kernel, kernel]( - out_tensor, - a_tensor, - b_tensor, - SIZE, - grid_dim=BLOCKS_PER_GRID, - block_dim=THREADS_PER_BLOCK, - ) - - var expected = ctx.enqueue_create_host_buffer[dtype](1) - expected.enqueue_fill(0) - ctx.synchronize() - - with a.map_to_host() as a_host, b.map_to_host() as b_host: - for i in range(SIZE): - expected[0] += a_host[i] * b_host[i] - - with out.map_to_host() as out_host: - print("out:", out_host) - print("expected:", expected) - assert_equal(out_host[0], expected[0]) - print("Puzzle 12 complete โœ…") diff --git a/problems/p13/p13.mojo 
b/problems/p13/p13.mojo index 1da7b0b0..2430bc85 100644 --- a/problems/p13/p13.mojo +++ b/problems/p13/p13.mojo @@ -1,7 +1,9 @@ from std.gpu import thread_idx, block_idx, block_dim, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.sys import argv from std.testing import assert_equal @@ -12,17 +14,18 @@ comptime CONV = 3 comptime BLOCKS_PER_GRID = (1, 1) comptime THREADS_PER_BLOCK = (TPB, 1) comptime dtype = DType.float32 -comptime in_layout = Layout.row_major(SIZE) -comptime out_layout = Layout.row_major(SIZE) -comptime conv_layout = Layout.row_major(CONV) - - -def conv_1d_simple[ - in_layout: Layout, out_layout: Layout, conv_layout: Layout -]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin], +comptime in_layout = row_major[SIZE]() +comptime InLayout = type_of(in_layout) +comptime out_layout = row_major[SIZE]() +comptime OutLayout = type_of(out_layout) +comptime conv_layout = row_major[CONV]() +comptime ConvLayout = type_of(conv_layout) + + +def conv_1d_simple( + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, ConvLayout, ImmutAnyOrigin], ): var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x @@ -36,17 +39,18 @@ comptime SIZE_2 = 15 comptime CONV_2 = 4 comptime BLOCKS_PER_GRID_2 = (2, 1) comptime THREADS_PER_BLOCK_2 = (TPB, 1) -comptime in_2_layout = Layout.row_major(SIZE_2) -comptime out_2_layout = Layout.row_major(SIZE_2) -comptime conv_2_layout = Layout.row_major(CONV_2) - - -def conv_1d_block_boundary[ - in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType -]( - output: 
LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin], +comptime in_2_layout = row_major[SIZE_2]() +comptime In2Layout = type_of(in_2_layout) +comptime out_2_layout = row_major[SIZE_2]() +comptime Out2Layout = type_of(out_2_layout) +comptime conv_2_layout = row_major[CONV_2]() +comptime Conv2Layout = type_of(conv_2_layout) + + +def conv_1d_block_boundary( + output: TileTensor[mut=True, dtype, Out2Layout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, In2Layout, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, Conv2Layout, ImmutAnyOrigin], ): var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x @@ -84,11 +88,12 @@ def main() raises: ) if argv()[1] == "--simple": - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out) - var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, conv_layout, ImmutAnyOrigin](b) - comptime kernel = conv_1d_simple[in_layout, out_layout, conv_layout] - ctx.enqueue_function[kernel, kernel]( + var out_tensor = TileTensor(out, out_layout) + var a_tensor = TileTensor[mut=False, dtype, InLayout](a, in_layout) + var b_tensor = TileTensor[mut=False, dtype, ConvLayout]( + b, conv_layout + ) + ctx.enqueue_function[conv_1d_simple, conv_1d_simple]( out_tensor, a_tensor, b_tensor, @@ -96,15 +101,16 @@ def main() raises: block_dim=THREADS_PER_BLOCK, ) else: - var out_tensor = LayoutTensor[dtype, out_2_layout, MutAnyOrigin]( - out + var out_tensor = TileTensor(out, out_2_layout) + var a_tensor = TileTensor[mut=False, dtype, In2Layout]( + a, in_2_layout + ) + var b_tensor = TileTensor[mut=False, dtype, Conv2Layout]( + b, conv_2_layout ) - var a_tensor = LayoutTensor[dtype, in_2_layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, conv_2_layout, ImmutAnyOrigin](b) - comptime kernel = conv_1d_block_boundary[ - in_2_layout, out_2_layout, conv_2_layout, dtype 
- ] - ctx.enqueue_function[kernel, kernel]( + ctx.enqueue_function[ + conv_1d_block_boundary, conv_1d_block_boundary + ]( out_tensor, a_tensor, b_tensor, diff --git a/problems/p14/p14.mojo b/problems/p14/p14.mojo index e48e0c5f..a3673699 100644 --- a/problems/p14/p14.mojo +++ b/problems/p14/p14.mojo @@ -1,7 +1,9 @@ from std.gpu import thread_idx, block_idx, block_dim, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.sys import argv from std.math import log2 from std.testing import assert_equal @@ -12,14 +14,13 @@ comptime SIZE = 8 comptime BLOCKS_PER_GRID = (1, 1) comptime THREADS_PER_BLOCK = (TPB, 1) comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE) +comptime layout = row_major[SIZE]() +comptime LayoutType = type_of(layout) -def prefix_sum_simple[ - layout: Layout -]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, ImmutAnyOrigin], +def prefix_sum_simple( + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], size: Int, ): var global_i = block_dim.x * block_idx.x + thread_idx.x @@ -34,16 +35,16 @@ comptime SIZE_2 = 15 comptime BLOCKS_PER_GRID_2 = (2, 1) comptime THREADS_PER_BLOCK_2 = (TPB, 1) comptime EXTENDED_SIZE = SIZE_2 + 2 # up to 2 blocks -comptime layout_2 = Layout.row_major(SIZE_2) -comptime extended_layout = Layout.row_major(EXTENDED_SIZE) +comptime layout_2 = row_major[SIZE_2]() +comptime Layout2Type = type_of(layout_2) +comptime extended_layout = row_major[EXTENDED_SIZE]() +comptime ExtendedLayoutType = type_of(extended_layout) # Kernel 1: Compute local prefix sums and store block sums in out -def prefix_sum_local_phase[ - out_layout: Layout, in_layout: Layout -]( - output: LayoutTensor[dtype, out_layout, 
MutAnyOrigin], - a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], +def prefix_sum_local_phase( + output: TileTensor[mut=True, dtype, ExtendedLayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, Layout2Type, ImmutAnyOrigin], size: Int, ): var global_i = block_dim.x * block_idx.x + thread_idx.x @@ -52,9 +53,10 @@ def prefix_sum_local_phase[ # Kernel 2: Add block sums to their respective blocks -def prefix_sum_block_sum_phase[ - layout: Layout -](output: LayoutTensor[dtype, layout, MutAnyOrigin], size: Int): +def prefix_sum_block_sum_phase( + output: TileTensor[mut=True, dtype, ExtendedLayoutType, MutAnyOrigin], + size: Int, +): var global_i = block_dim.x * block_idx.x + thread_idx.x # FILL ME IN (roughly 3 lines) @@ -91,11 +93,10 @@ def main() raises: a_host[i] = Scalar[dtype](i) if use_simple: - a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a) - out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out) + a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout) + out_tensor = TileTensor(out, layout) - comptime kernel = prefix_sum_simple[layout] - ctx.enqueue_function[kernel, kernel]( + ctx.enqueue_function[prefix_sum_simple, prefix_sum_simple]( out_tensor, a_tensor, size, @@ -103,15 +104,16 @@ def main() raises: block_dim=THREADS_PER_BLOCK, ) else: - var a_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin](a) - var out_tensor = LayoutTensor[dtype, extended_layout, MutAnyOrigin]( - out + var a_tensor = TileTensor[mut=False, dtype, Layout2Type]( + a, layout_2 ) + var out_tensor = TileTensor(out, extended_layout) # ANCHOR: prefix_sum_complete_block_level_sync # Phase 1: Local prefix sums - comptime kernel = prefix_sum_local_phase[extended_layout, layout_2] - ctx.enqueue_function[kernel, kernel]( + ctx.enqueue_function[ + prefix_sum_local_phase, prefix_sum_local_phase + ]( out_tensor, a_tensor, size, @@ -123,8 +125,9 @@ def main() raises: # No explicit ctx.synchronize() needed in this case. 
# Phase 2: Add block sums - comptime kernel2 = prefix_sum_block_sum_phase[extended_layout] - ctx.enqueue_function[kernel2, kernel2]( + ctx.enqueue_function[ + prefix_sum_block_sum_phase, prefix_sum_block_sum_phase + ]( out_tensor, size, grid_dim=BLOCKS_PER_GRID_2, diff --git a/problems/p15/p15.mojo b/problems/p15/p15.mojo index 4a4f79a2..c9f7ead5 100644 --- a/problems/p15/p15.mojo +++ b/problems/p15/p15.mojo @@ -4,7 +4,9 @@ from std.gpu.host import DeviceContext # ANCHOR: axis_sum from std.gpu import thread_idx, block_idx, block_dim, barrier from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation comptime TPB = 8 @@ -13,15 +15,15 @@ comptime SIZE = 6 comptime BLOCKS_PER_GRID = (1, BATCH) comptime THREADS_PER_BLOCK = (TPB, 1) comptime dtype = DType.float32 -comptime in_layout = Layout.row_major(BATCH, SIZE) -comptime out_layout = Layout.row_major(BATCH, 1) +comptime in_layout = row_major[BATCH, SIZE]() +comptime InLayout = type_of(in_layout) +comptime out_layout = row_major[BATCH, 1]() +comptime OutLayout = type_of(out_layout) -def axis_sum[ - in_layout: Layout, out_layout: Layout -]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], +def axis_sum( + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin], size: Int, ): var global_i = block_dim.x * block_idx.x + thread_idx.x @@ -44,11 +46,10 @@ def main() raises: for col in range(SIZE): inp_host[row * SIZE + col] = Scalar[dtype](row * SIZE + col) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out) - var inp_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](inp) + var out_tensor = TileTensor(out, out_layout) + var inp_tensor = TileTensor[mut=False, dtype, InLayout](inp, in_layout) - comptime kernel = axis_sum[in_layout, 
out_layout] - ctx.enqueue_function[kernel, kernel]( + ctx.enqueue_function[axis_sum, axis_sum]( out_tensor, inp_tensor, SIZE, diff --git a/problems/p16/p16.mojo b/problems/p16/p16.mojo index 8ff079ad..e16cf873 100644 --- a/problems/p16/p16.mojo +++ b/problems/p16/p16.mojo @@ -5,7 +5,9 @@ from std.gpu.host import DeviceContext # ANCHOR: naive_matmul from std.gpu import thread_idx, block_idx, block_dim, barrier from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation comptime TPB = 3 @@ -13,15 +15,14 @@ comptime SIZE = 2 comptime BLOCKS_PER_GRID = (1, 1) comptime THREADS_PER_BLOCK = (TPB, TPB) comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE, SIZE) +comptime layout = row_major[SIZE, SIZE]() +comptime LayoutType = type_of(layout) -def naive_matmul[ - layout: Layout, size: Int -]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, layout, ImmutAnyOrigin], +def naive_matmul( + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], ): var row = block_dim.y * block_idx.y + thread_idx.y var col = block_dim.x * block_idx.x + thread_idx.x @@ -32,12 +33,10 @@ def naive_matmul[ # ANCHOR: single_block_matmul -def single_block_matmul[ - layout: Layout, size: Int -]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, layout, ImmutAnyOrigin], +def single_block_matmul( + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], ): var row = block_dim.y * block_idx.y + thread_idx.y var col = block_dim.x * 
block_idx.x + thread_idx.x
@@ -52,15 +51,14 @@ def single_block_matmul[
 comptime SIZE_TILED = 9
 comptime BLOCKS_PER_GRID_TILED = (3, 3) # each block convers 3x3 elements
 comptime THREADS_PER_BLOCK_TILED = (TPB, TPB)
-comptime layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)
+comptime layout_tiled = row_major[SIZE_TILED, SIZE_TILED]()
+comptime LayoutTiledType = type_of(layout_tiled)

-def matmul_tiled[
-    layout: Layout, size: Int
-](
-    output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
-    a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
+def matmul_tiled(
+    output: TileTensor[mut=True, dtype, LayoutTiledType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
 ):
     var local_row = thread_idx.y
     var local_col = thread_idx.x
@@ -109,13 +107,12 @@ def main() raises:
                     inp1_host[i * size + k] * inp2_host[k * size + j]
                 )

-    var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-    var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp1)
-    var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp2)
+    var out_tensor = TileTensor(out, layout)
+    var a_tensor = TileTensor[mut=False, dtype, LayoutType](inp1, layout)
+    var b_tensor = TileTensor[mut=False, dtype, LayoutType](inp2, layout)

     if argv()[1] == "--naive":
-        comptime kernel = naive_matmul[layout, SIZE]
-        ctx.enqueue_function[kernel, kernel](
+        ctx.enqueue_function[naive_matmul, naive_matmul](
             out_tensor,
             a_tensor,
             b_tensor,
@@ -123,8 +120,7 @@ def main() raises:
             block_dim=THREADS_PER_BLOCK,
         )
     elif argv()[1] == "--single-block":
-        comptime kernel = single_block_matmul[layout, SIZE]
-        ctx.enqueue_function[kernel, kernel](
+        ctx.enqueue_function[single_block_matmul, single_block_matmul](
             out_tensor,
             a_tensor,
             b_tensor,
@@ -133,18 +129,15 @@ def main() raises:
         )
     elif argv()[1] == "--tiled":
         # Need to update the layout of the tensors to the tiled layout
-        var out_tensor_tiled = LayoutTensor[
-            dtype, layout_tiled, MutAnyOrigin
-        ](out)
-        var a_tensor_tiled = LayoutTensor[
-            dtype, layout_tiled, ImmutAnyOrigin
-        ](inp1)
-        var b_tensor_tiled = LayoutTensor[
-            dtype, layout_tiled, ImmutAnyOrigin
-        ](inp2)
-
-        comptime kernel = matmul_tiled[layout_tiled, SIZE_TILED]
-        ctx.enqueue_function[kernel, kernel](
+        var out_tensor_tiled = TileTensor(out, layout_tiled)
+        var a_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+            inp1, layout_tiled
+        )
+        var b_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+            inp2, layout_tiled
+        )
+
+        ctx.enqueue_function[matmul_tiled, matmul_tiled](
             out_tensor_tiled,
             a_tensor_tiled,
             b_tensor_tiled,
diff --git a/problems/p17/op/conv1d.mojo b/problems/p17/op/conv1d.mojo
index 808de38c..97de1bb7 100644
--- a/problems/p17/op/conv1d.mojo
+++ b/problems/p17/op/conv1d.mojo
@@ -1,7 +1,9 @@
 from std.gpu import thread_idx, block_idx, block_dim, barrier
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation

 # ANCHOR: conv1d_kernel
 comptime TPB = 15
@@ -9,32 +11,26 @@
 comptime BLOCKS_PER_GRID = (2, 1)

 def conv1d_kernel[
-    in_layout: Layout,
-    out_layout: Layout,
-    conv_layout: Layout,
     input_size: Int,
     conv_size: Int,
+    OutLayout: TensorLayout,
+    InLayout: TensorLayout,
+    ConvLayout: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, in_layout, MutAnyOrigin],
-    kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+    kernel: TileTensor[mut=False, dtype, ConvLayout, ImmutAnyOrigin],
 ):
     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
     # first: need to account for padding
-    var shared_a = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB + conv_size - 1),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var shared_b = LayoutTensor[
-        dtype,
-        Layout.row_major(conv_size),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared_a = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB + conv_size - 1]())
+    var shared_b = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[conv_size]())

     if global_i < input_size:
         shared_a[local_i] = input[global_i]
@@ -53,7 +49,7 @@ def conv1d_kernel[
     barrier()

     if global_i < input_size:
-        var local_sum: output.element_type = 0
+        var local_sum: output.ElementType = 0

         comptime for j in range(conv_size):
             if local_i + j < TPB + conv_size - 1:
@@ -92,9 +88,6 @@ struct Conv1DCustomOp:
         var output_tensor = output.to_layout_tensor()
         var input_tensor = input.to_layout_tensor()
         var kernel_tensor = kernel.to_layout_tensor()
-        comptime in_layout = input_tensor.layout
-        comptime out_layout = output_tensor.layout
-        comptime conv_layout = kernel_tensor.layout

         comptime if target == "gpu":
             var gpu_ctx = ctx.get_device_context()
diff --git a/problems/p18/op/softmax.mojo b/problems/p18/op/softmax.mojo
index 8839d538..7b3025ec 100644
--- a/problems/p18/op/softmax.mojo
+++ b/problems/p18/op/softmax.mojo
@@ -4,26 +4,28 @@
 from std.memory import UnsafePointer
 from std.gpu import thread_idx, block_idx, block_dim, barrier
 from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from std.math import exp
 from std.bit import log2_ceil
 from std.utils.numerics import max_finite, min_finite

 comptime SIZE = 128 # This must be equal to INPUT_SIZE in p18.py
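An aside from the patch itself: `conv1d_kernel` stages a `TPB + conv_size - 1` element halo of the input in shared memory and guards every tap against running past the end of the buffer, so out-of-range taps contribute zero. For readers following the math rather than the Mojo, here is a plain-Python sketch of the same zero-padded convolution (illustrative only; the function name and list-based types are ours, not the Mojo API):

```python
def conv1d(signal, taps):
    # output[i] = sum_j signal[i + j] * taps[j], with zero padding past the
    # end of the signal -- the same bounds guard the kernel applies per thread.
    n, k = len(signal), len(taps)
    out = []
    for i in range(n):
        acc = 0.0
        for j in range(k):
            if i + j < n:  # out-of-bounds taps contribute zero
                acc += signal[i + j] * taps[j]
        out.append(acc)
    return out
```

The `if i + j < n` check plays the role of both bounds guards in the kernel (`global_i < input_size` and the shared-memory extent check).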
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
 comptime GRID_DIM_X = 1
 # Tree-based reduction require the number of threads to be the next power of two >= SIZE for correctness.
 comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)

 def softmax_gpu_kernel[
-    layout: Layout,
     input_size: Int,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     comptime assert (
         dtype.is_floating_point()
@@ -37,12 +39,11 @@ def softmax_gpu_kernel[

 # ANCHOR: softmax_cpu_kernel
 def softmax_cpu_kernel[
-    layout: Layout,
     input_size: Int,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     comptime assert (
         dtype.is_floating_point()
@@ -71,12 +72,12 @@ struct SoftmaxCustomOp:
         ctx: DeviceContextPtr,
     ) raises:
         # Note: rebind is necessary now but it shouldn't be!
-        var output_tensor = rebind[LayoutTensor[dtype, layout, MutAnyOrigin]](
-            output.to_layout_tensor()
-        )
-        var input_tensor = rebind[LayoutTensor[dtype, layout, ImmutAnyOrigin]](
-            input.to_layout_tensor()
-        )
+        var output_tensor = rebind[
+            TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin]
+        ](output.to_layout_tensor())
+        var input_tensor = rebind[
+            TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin]
+        ](input.to_layout_tensor())

         comptime if target == "gpu":
             var gpu_ctx = ctx.get_device_context()
@@ -91,7 +92,7 @@ struct SoftmaxCustomOp:
                 0,
             )

-            comptime kernel = softmax_gpu_kernel[layout, input_size, dtype]
+            comptime kernel = softmax_gpu_kernel[input_size, dtype]
             gpu_ctx.enqueue_function[kernel, kernel](
                 output_tensor,
                 input_tensor,
@@ -100,8 +101,6 @@ struct SoftmaxCustomOp:
             )

         elif target == "cpu":
-            softmax_cpu_kernel[layout, input_size, dtype](
-                output_tensor, input_tensor
-            )
+            softmax_cpu_kernel[input_size, dtype](output_tensor, input_tensor)
         else:
             raise Error("Unsupported target: " + target)
diff --git a/problems/p18/test/test_softmax.mojo b/problems/p18/test/test_softmax.mojo
index 70b25871..483a2321 100644
--- a/problems/p18/test/test_softmax.mojo
+++ b/problems/p18/test/test_softmax.mojo
@@ -1,12 +1,14 @@
 from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
 from std.testing import assert_almost_equal
 from std.bit import log2_ceil

 from op import softmax_gpu_kernel, softmax_cpu_kernel

 comptime SIZE = 128
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
 comptime GRID_DIM_X = 1
 comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)
 comptime dtype = DType.float32
@@ -21,9 +23,7 @@ def test_softmax() raises:
     # for CPU testing
     var expected = ctx.enqueue_create_host_buffer[DType.float32](SIZE)
     expected.enqueue_fill(0)
-    var expected_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-        expected
-    )
+    var expected_tensor = TileTensor(expected, layout)

     # Initialize input with more reasonable values
     with inp.map_to_host() as inp_host:
         for i in range(SIZE):
@@ -34,21 +34,19 @@
             print(inp_host[i], end=" ")
         print()

     # Create layout tensors for CPU calculation
-    input_host_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-        inp_host
+    input_host_tensor = TileTensor[mut=False, dtype, LayoutType](
+        inp_host, layout
     )

     # for GPU testing
-    var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-    var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
+    var output_tensor = TileTensor(out, layout)
+    var input_tensor = TileTensor[mut=False, dtype, LayoutType](inp, layout)

     # Compute expected results using our CPU kernel
-    softmax_cpu_kernel[layout, SIZE, dtype](
-        expected_tensor, input_host_tensor
-    )
+    softmax_cpu_kernel[SIZE, dtype](expected_tensor, input_host_tensor)

     # Run GPU kernel
-    comptime kernel = softmax_gpu_kernel[layout, SIZE, dtype]
+    comptime kernel = softmax_gpu_kernel[SIZE, dtype]
     ctx.enqueue_function[kernel, kernel](
         output_tensor,
         input_tensor,
diff --git a/problems/p19/op/attention.mojo b/problems/p19/op/attention.mojo
index 3e71bdb2..99aca727 100644
--- a/problems/p19/op/attention.mojo
+++ b/problems/p19/op/attention.mojo
@@ -2,7 +2,9 @@
 from std.memory import UnsafePointer
 from std.gpu import thread_idx, block_idx, block_dim, barrier
 from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer
 from std.gpu.memory import AddressSpace, async_copy_wait_all
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
 from layout.layout_tensor import copy_dram_to_sram_async
 from std.math import exp
 from std.bit import log2_ceil
@@ -22,24 +24,24 @@
 comptime SOFTMAX_BLOCK_DIM_X = 1 << log2_ceil(SEQ_LEN)

 # Tiled matrix multiplication (from p16), updated to:
-# 1) Support different layouts for input (a, b) and output LayoutTensors.
+# 1) Support different layouts for input (a, b) and output TileTensors.
 # 2) Handle cases where the inner dimension is not a multiple of MATMUL_BLOCK_DIM_XY.
 # 3) Explicitly check for out-of-bounds elements.
-# The approach still tiles all three LayoutTensors (a, b, and output) into identical square tiles
+# The approach still tiles all three TileTensors (a, b, and output) into identical square tiles
 # of size (MATMUL_BLOCK_DIM_XY x MATMUL_BLOCK_DIM_XY) with each thread loading one element
 # from a and b, and writing one element to output.
 def matmul_idiomatic_tiled[
-    a_layout: Layout,
-    b_layout: Layout,
-    out_layout: Layout,
     rows: Int,
     cols: Int,
     inner: Int,
+    OutLayout: TensorLayout,
+    ALayout: TensorLayout,
+    BLayout: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, a_layout, MutAnyOrigin],
-    b: LayoutTensor[dtype, b_layout, MutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, ALayout, MutAnyOrigin],
+    b: TileTensor[mut=False, dtype, BLayout, MutAnyOrigin],
 ):
     """Updated idiomatic tiled matrix multiplication from p16."""
     var local_row = thread_idx.y
@@ -51,26 +53,23 @@ def matmul_idiomatic_tiled[
     var out_tile = output.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
         block_idx.y, block_idx.x
     )
-    var a_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var b_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var acc: output.element_type = 0
-
-    comptime load_a_layout = Layout.row_major(
+    comptime shared_layout = row_major[
         MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
-    ) # Coalesced loading
-    comptime load_b_layout = Layout.row_major(
+    ]()
+    var a_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](shared_layout)
+    var b_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](shared_layout)
+    var acc: output.ElementType = 0
+
+    comptime load_a_layout = row_major[
         MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
-    ) # Coalesced loading
+    ]() # Coalesced loading
+    comptime load_b_layout = row_major[
+        MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
+    ]() # Coalesced loading

     comptime for idx in range(
         (inner + MATMUL_BLOCK_DIM_XY - 1) // MATMUL_BLOCK_DIM_XY
@@ -118,14 +117,14 @@ def matmul_idiomatic_tiled[

 # ANCHOR: transpose_kernel
 def transpose_kernel[
-    layout_in: Layout, # Layout for input matrix (seq_len, d)
-    layout_out: Layout, # Layout for output matrix (d, seq_len)
     rows: Int,
     cols: Int,
+    OutLayout: TensorLayout,
+    InLayout: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
-    inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    inp: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
 ):
     # FILL ME IN (roughly 18 lines)
     ...
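As an aside to the patch: `matmul_idiomatic_tiled` marches the inner dimension one `MATMUL_BLOCK_DIM_XY`-wide tile at a time, each pass accumulating a partial product per output element, with a guard for an inner dimension that is not a multiple of the tile size. The loop structure can be sketched in plain Python (illustrative only; names and the list-of-lists representation are ours, not the Mojo API):

```python
def matmul_tiled(a, b, tile=2):
    # Walk the inner dimension in tile-sized steps, as the shared-memory
    # kernel does: each outer iteration corresponds to one staged tile pair.
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for t in range(0, inner, tile):
        for i in range(rows):
            for j in range(cols):
                # min() guards the last, possibly partial, tile -- the same
                # role as the kernel's out-of-bounds checks
                for k in range(t, min(t + tile, inner)):
                    out[i][j] += a[i][k] * b[k][j]
    return out
```

On the GPU the per-tile `barrier()` calls separate the load and compute phases; in this serial sketch that separation is simply the `t` loop boundary.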
@@ -136,28 +135,23 @@ def transpose_kernel[

 # Apply softmax to attention scores taken from p16
 def softmax_gpu_kernel[
-    layout: Layout,
     input_size: Int,
+    LayoutType: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, MutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
 ):
     comptime assert (
         dtype.is_floating_point()
     ), "dtype must be a floating-point type"
-    var shared_max = LayoutTensor[
-        dtype,
-        Layout.row_major(SOFTMAX_BLOCK_DIM_X),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var shared_sum = LayoutTensor[
-        dtype,
-        Layout.row_major(SOFTMAX_BLOCK_DIM_X),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    comptime softmax_layout = row_major[SOFTMAX_BLOCK_DIM_X]()
+    var shared_max = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](softmax_layout)
+    var shared_sum = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](softmax_layout)
     var global_i = thread_idx.x

     # Initialize out-of-bounds (shared_max[local_i], global_i >= input_size) shared memory addresses to the minimum
@@ -208,18 +202,18 @@ def softmax_gpu_kernel[

 # CPU implementation for vector attention
 def attention_cpu_kernel[
-    layout_q: Layout,
-    layout_k: Layout,
-    layout_v: Layout,
-    layout_out: Layout,
     seq_len: Int,
     d: Int,
+    OutLayout: TensorLayout,
+    QLayout: TensorLayout,
+    KLayout: TensorLayout,
+    VLayout: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
-    q: LayoutTensor[dtype, layout_q, MutAnyOrigin],
-    k: LayoutTensor[dtype, layout_k, ImmutAnyOrigin],
-    v: LayoutTensor[dtype, layout_v, MutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    q: TileTensor[mut=False, dtype, QLayout, MutAnyOrigin],
+    k: TileTensor[mut=False, dtype, KLayout, ImmutAnyOrigin],
+    v: TileTensor[mut=False, dtype, VLayout, MutAnyOrigin],
 ):
     """CPU implementation of vector attention."""
     var scores = List[Float32]()
@@ -273,25 +267,30 @@ struct AttentionCustomOp:
         ctx: DeviceContextPtr,
     ) raises:
         # Define layouts
-        comptime layout_q = Layout.row_major(d)
-        comptime layout_k = Layout.row_major(seq_len, d)
-        comptime layout_v = Layout.row_major(seq_len, d)
-        comptime layout_out = Layout.row_major(d)
-        comptime layout_scores = Layout.row_major(seq_len)
+        comptime layout_q = row_major[d]()
+        comptime layout_k = row_major[seq_len, d]()
+        comptime layout_v = row_major[seq_len, d]()
+        comptime layout_out = row_major[d]()
+        comptime layout_scores = row_major[seq_len]()
+        comptime QLayout = type_of(layout_q)
+        comptime KLayout = type_of(layout_k)
+        comptime VLayout = type_of(layout_v)
+        comptime OutLayout = type_of(layout_out)
+        comptime ScoresLayout = type_of(layout_scores)

         # Convert to layout tensors
         var output_tensor = rebind[
-            LayoutTensor[dtype, layout_out, MutAnyOrigin]
+            TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin]
         ](output.to_layout_tensor())
-        var q_tensor = rebind[LayoutTensor[dtype, layout_q, MutAnyOrigin]](
-            q.to_layout_tensor()
-        )
-        var k_tensor = rebind[LayoutTensor[dtype, layout_k, ImmutAnyOrigin]](
-            k.to_layout_tensor()
-        )
-        var v_tensor = rebind[LayoutTensor[dtype, layout_v, MutAnyOrigin]](
-            v.to_layout_tensor()
-        )
+        var q_tensor = rebind[
+            TileTensor[mut=False, dtype, QLayout, MutAnyOrigin]
+        ](q.to_layout_tensor())
+        var k_tensor = rebind[
+            TileTensor[mut=False, dtype, KLayout, ImmutAnyOrigin]
+        ](k.to_layout_tensor())
+        var v_tensor = rebind[
+            TileTensor[mut=False, dtype, VLayout, MutAnyOrigin]
+        ](v.to_layout_tensor())

         comptime if target == "gpu":
             # ANCHOR: attention_orchestration
@@ -299,15 +298,20 @@ struct AttentionCustomOp:

             # Define layouts for matrix multiplication
             # Q reshaped to (1, d)
-            comptime layout_q_2d = Layout.row_major(1, d)
+            comptime layout_q_2d = row_major[1, d]()
+            comptime Q2DLayout = type_of(layout_q_2d)
             # K^T is (d, seq_len)
-            comptime layout_k_t = Layout.row_major(d, seq_len)
+            comptime layout_k_t = row_major[d, seq_len]()
+            comptime KTLayout = type_of(layout_k_t)
             # Scores as (1, seq_len)
-            comptime layout_scores_2d = Layout.row_major(1, seq_len)
+            comptime layout_scores_2d = row_major[1, seq_len]()
+            comptime Scores2DLayout = type_of(layout_scores_2d)
             # Weights as (1, seq_len)
-            comptime layout_weights_2d = Layout.row_major(1, seq_len)
+            comptime layout_weights_2d = row_major[1, seq_len]()
+            comptime Weights2DLayout = type_of(layout_weights_2d)
             # Result as (1, d)
-            comptime layout_result_2d = Layout.row_major(1, d)
+            comptime layout_result_2d = row_major[1, d]()
+            comptime Result2DLayout = type_of(layout_result_2d)

             # Transpose implementation limited to square (TRANSPOSE_BLOCK_DIM_XY x TRANSPOSE_BLOCK_DIM_XY) thread blocks
             comptime transpose_threads_per_block = (
@@ -344,7 +348,7 @@ struct AttentionCustomOp:
                 seq_len
             ) # Reused for scores and weights
-            var k_t = LayoutTensor[dtype, layout_k_t, MutAnyOrigin](k_t_buf)
+            var k_t = TileTensor(k_t_buf, layout_k_t)

             # Step 1: Reshape Q from (d,) to (1, d) - no buffer needed
             # FILL ME IN 1 line
@@ -373,9 +377,9 @@ struct AttentionCustomOp:
             # ANCHOR_END: attention_orchestration

         elif target == "cpu":
-            attention_cpu_kernel[
-                layout_q, layout_k, layout_v, layout_out, seq_len, d, dtype
-            ](output_tensor, q_tensor, k_tensor, v_tensor)
+            attention_cpu_kernel[seq_len, d, dtype](
+                output_tensor, q_tensor, k_tensor, v_tensor
+            )
         else:
             raise Error("Unsupported target: " + target)
diff --git a/problems/p20/op/conv1d.mojo b/problems/p20/op/conv1d.mojo
index 07d29d9a..7c6bec92 100644
--- a/problems/p20/op/conv1d.mojo
+++ b/problems/p20/op/conv1d.mojo
@@ -2,7 +2,9 @@
 from std.gpu import thread_idx, block_idx, block_dim, barrier
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
 from std.sys import argv
 from std.testing import assert_equal

@@ -12,32 +14,26 @@
 comptime BLOCKS_PER_GRID = (2, 1)

 # ANCHOR: conv1d_kernel
 def conv1d_kernel[
-    in_layout: Layout,
-    out_layout: Layout,
-    conv_layout: Layout,
     input_size: Int,
     conv_size: Int,
+    OutLayout: TensorLayout,
+    InLayout: TensorLayout,
+    ConvLayout: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, in_layout, MutAnyOrigin],
-    kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, InLayout, MutAnyOrigin],
+    kernel: TileTensor[mut=False, dtype, ConvLayout, MutAnyOrigin],
 ):
     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
     # first: need to account for padding
-    var shared_a = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB + conv_size - 1),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var shared_b = LayoutTensor[
-        dtype,
-        Layout.row_major(conv_size),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared_a = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB + conv_size - 1]())
+    var shared_b = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[conv_size]())

     if global_i < input_size:
         shared_a[local_i] = input[global_i]
@@ -58,7 +54,7 @@ def conv1d_kernel[
     barrier()

     if global_i < input_size:
-        var local_sum: output.element_type = 0
+        var local_sum: output.ElementType = 0

         comptime for j in range(conv_size):
             if local_i + j < TPB + conv_size - 1:
@@ -95,9 +91,6 @@ struct Conv1DCustomOp:
         var out_tensor = output.to_layout_tensor()
         var input_tensor = input.to_layout_tensor()
         var kernel_tensor = kernel.to_layout_tensor()
-        comptime in_layout = input_tensor.layout
-        comptime out_layout = out_tensor.layout
-        comptime conv_layout = kernel_tensor.layout

         comptime if target == "gpu":
             var gpu_ctx = ctx.get_device_context()
@@ -111,9 +104,7 @@ struct Conv1DCustomOp:
                 ),
                 0,
             )
-            comptime kernel = conv1d_kernel[
-                in_layout, out_layout, conv_layout, input_size, conv_size
-            ]
+            comptime kernel = conv1d_kernel[input_size, conv_size]
             gpu_ctx.enqueue_function[kernel, kernel](
                 out_tensor,
                 input_tensor,
diff --git a/problems/p21/op/embedding.mojo b/problems/p21/op/embedding.mojo
index 8108d7ba..22e73d51 100644
--- a/problems/p21/op/embedding.mojo
+++ b/problems/p21/op/embedding.mojo
@@ -1,7 +1,8 @@
 from std.math import ceildiv
 from std.gpu import thread_idx, block_idx, block_dim, grid_dim, barrier
 from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
 from std.sys import argv
 from std.testing import assert_equal

@@ -10,18 +11,18 @@
 comptime THREADS_PER_BLOCK = 256

 def embedding_kernel_coalesced[
-    indices_layout: Layout,
-    weights_layout: Layout,
-    out_layout: Layout,
     batch_size: Int,
     seq_len: Int,
     vocab_size: Int,
     embed_dim: Int,
+    OutLayout: TensorLayout,
+    IndicesLayout: TensorLayout,
+    WeightsLayout: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    indices: LayoutTensor[DType.int32, indices_layout, MutAnyOrigin],
-    weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    indices: TileTensor[mut=False, DType.int32, IndicesLayout, MutAnyOrigin],
+    weights: TileTensor[mut=False, dtype, WeightsLayout, MutAnyOrigin],
 ):
     """
     Memory-coalescing focused embedding kernel.
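For context on what the two p21 kernels compute: an embedding lookup is a pure gather — every `(batch, seq)` position copies one `embed_dim`-long row out of the weight table; the kernels differ only in how threads are mapped onto those copies. In sketch form (plain Python, illustrative names only):

```python
def embedding_lookup(indices, weights):
    # output[b][s] = weights[indices[b][s]]: one weight-table row per token id.
    return [[list(weights[token_id]) for token_id in row] for row in indices]
```

The GPU versions produce exactly this result; "coalesced" vs "2D grid" only changes which thread writes which element of the copied rows.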
@@ -54,18 +55,18 @@ def embedding_kernel_coalesced[

 # ANCHOR: embedding_kernel_2d
 def embedding_kernel_2d[
-    indices_layout: Layout,
-    weights_layout: Layout,
-    out_layout: Layout,
     batch_size: Int,
     seq_len: Int,
     vocab_size: Int,
     embed_dim: Int,
+    OutLayout: TensorLayout,
+    IndicesLayout: TensorLayout,
+    WeightsLayout: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    indices: LayoutTensor[DType.int32, indices_layout, MutAnyOrigin],
-    weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    indices: TileTensor[mut=False, DType.int32, IndicesLayout, MutAnyOrigin],
+    weights: TileTensor[mut=False, dtype, WeightsLayout, MutAnyOrigin],
 ):
     """
     2D grid non-coalesced embedding kernel.
@@ -128,10 +129,6 @@ struct EmbeddingCustomOp:
         var indices_tensor = indices.to_layout_tensor()
         var weights_tensor = weights.to_layout_tensor()

-        comptime indices_layout = indices_tensor.layout
-        comptime weights_layout = weights_tensor.layout
-        comptime out_layout = output_tensor.layout
-
         comptime if target == "gpu":
             var gpu_ctx = ctx.get_device_context()
@@ -152,9 +149,6 @@ struct EmbeddingCustomOp:

             # Compile and launch optimized kernel
             comptime kernel = embedding_kernel_coalesced[
-                indices_layout,
-                weights_layout,
-                out_layout,
                 batch_size,
                 seq_len,
                 vocab_size,
@@ -210,10 +204,6 @@ struct Embedding2DCustomOp:
         var indices_tensor = indices.to_layout_tensor()
         var weights_tensor = weights.to_layout_tensor()

-        comptime indices_layout = indices_tensor.layout
-        comptime weights_layout = weights_tensor.layout
-        comptime out_layout = output_tensor.layout
-
         comptime if target == "gpu":
             var gpu_ctx = ctx.get_device_context()
@@ -237,9 +227,6 @@ struct Embedding2DCustomOp:

             # Compile and launch 2D kernel
             comptime kernel = embedding_kernel_2d[
-                indices_layout,
-                weights_layout,
-                out_layout,
                 batch_size,
                 seq_len,
                 vocab_size,
diff --git a/problems/p22/op/layernorm_linear.mojo b/problems/p22/op/layernorm_linear.mojo
index 8519c015..dc659b12 100644
--- a/problems/p22/op/layernorm_linear.mojo
+++ b/problems/p22/op/layernorm_linear.mojo
@@ -2,7 +2,9 @@
 from std.math import sqrt
 from std.gpu import thread_idx, block_idx, block_dim, barrier
 from std.gpu.memory import AddressSpace, async_copy_wait_all
 from std.os.atomic import Atomic
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
 from layout.layout_tensor import copy_dram_to_sram_async
 import compiler
 from std.runtime.asyncrt import DeviceContextPtr
@@ -20,17 +22,17 @@
 comptime dtype = DType.float32

 # ANCHOR: matmul_idiomatic_tiled
 # Idiomatic tiled matmul from p19.mojo
 def matmul_idiomatic_tiled[
-    a_layout: Layout,
-    b_layout: Layout,
-    out_layout: Layout,
     rows: Int,
     cols: Int,
     inner: Int,
+    OutLayout: TensorLayout,
+    ALayout: TensorLayout,
+    BLayout: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, a_layout, MutAnyOrigin],
-    b: LayoutTensor[dtype, b_layout, MutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, ALayout, MutAnyOrigin],
+    b: TileTensor[mut=False, dtype, BLayout, MutAnyOrigin],
 ):
     """Idiomatic tiled matrix multiplication from p19."""
     var local_row = thread_idx.y
@@ -42,26 +44,23 @@ def matmul_idiomatic_tiled[
     var out_tile = output.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
         block_idx.y, block_idx.x
     )
-    var a_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var b_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var acc: output.element_type = 0
-
-    comptime load_a_layout = Layout.row_major(
+    comptime shared_layout = row_major[
         MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
-    ) # Coalesced loading
-    comptime load_b_layout = Layout.row_major(
+    ]()
+    var a_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](shared_layout)
+    var b_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](shared_layout)
+    var acc: output.ElementType = 0
+
+    comptime load_a_layout = row_major[
         MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
-    ) # Coalesced loading
+    ]() # Coalesced loading
+    comptime load_b_layout = row_major[
+        MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
+    ]() # Coalesced loading

     comptime for idx in range(
         (inner + MATMUL_BLOCK_DIM_XY - 1) // MATMUL_BLOCK_DIM_XY
@@ -112,17 +111,17 @@ def matmul_idiomatic_tiled[

 # ANCHOR: layernorm_kernel
 def layernorm_kernel[
-    input_layout: Layout,
-    ln_params_layout: Layout,
-    output_layout: Layout,
     batch_size: Int,
     seq_len: Int,
     hidden_dim: Int,
+    OutputLayout: TensorLayout,
+    InputLayout: TensorLayout,
+    LnParamsLayout: TensorLayout,
 ](
-    output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
-    ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
-    ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin],
+    ln_weight: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
+    ln_bias: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
 ):
     var batch_idx = block_idx.x
     var seq_idx = block_idx.y
@@ -147,24 +146,24 @@ def layernorm_kernel[

 # ANCHOR: transpose_kernel
 def transpose_kernel[
-    layout_in: Layout,
-    layout_out: Layout,
     rows: Int,
     cols: Int,
+    OutLayout: TensorLayout,
+    InLayout: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
-    inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    inp: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
 ):
     """Transpose matrix using shared memory tiling for coalesced access.

     We will learn more about coalesced access in the next part.
     """
-    var shared_tile = LayoutTensor[
-        dtype,
-        Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    comptime shared_layout = row_major[
+        TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY
+    ]()
+    var shared_tile = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](shared_layout)

     var local_row = thread_idx.y
     var local_col = thread_idx.x
@@ -191,16 +190,16 @@ def transpose_kernel[

 # ANCHOR: add_bias_kernel
 def add_bias_kernel[
-    input_layout: Layout,
-    bias_layout: Layout,
-    output_layout: Layout,
     batch_size: Int,
     seq_len: Int,
     output_dim: Int,
+    OutputLayout: TensorLayout,
+    InputLayout: TensorLayout,
+    BiasLayout: TensorLayout,
 ](
-    output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, input_layout, MutAnyOrigin],
-    bias: LayoutTensor[dtype, bias_layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, InputLayout, MutAnyOrigin],
+    bias: TileTensor[mut=False, dtype, BiasLayout, ImmutAnyOrigin],
 ):
     """Simple bias addition."""
     var batch_idx = block_idx.x
@@ -220,22 +219,22 @@ def add_bias_kernel[

 # ANCHOR: minimal_fused_forward_kernel
 def minimal_fused_kernel[
-    input_layout: Layout,
-    ln_params_layout: Layout,
-    weight_layout: Layout,
-    bias_layout: Layout,
-    output_layout: Layout,
     batch_size: Int,
     seq_len: Int,
     hidden_dim: Int,
     output_dim: Int,
+    OutputLayout: TensorLayout,
+    InputLayout: TensorLayout,
+    LnParamsLayout: TensorLayout,
+    WeightLayout: TensorLayout,
+    BiasLayout: TensorLayout,
 ](
-    output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
-    ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
-    ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
-    linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin],
-    linear_bias: LayoutTensor[dtype, bias_layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin],
+    ln_weight: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
+    ln_bias: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
+    linear_weight: TileTensor[mut=False, dtype, WeightLayout, ImmutAnyOrigin],
+    linear_bias: TileTensor[mut=False, dtype, BiasLayout, ImmutAnyOrigin],
 ):
     """Minimal fused kernel - one thread per sequence position to avoid
     redundancy.
     """
@@ -261,30 +260,32 @@ def minimal_fused_kernel[

 # ANCHOR: minimal_fused_backward_kernel
 def minimal_fused_kernel_backward[
-    grad_output_layout: Layout,
-    input_layout: Layout,
-    ln_params_layout: Layout,
-    weight_layout: Layout,
-    grad_input_layout: Layout,
-    grad_ln_weight_layout: Layout,
-    grad_ln_bias_layout: Layout,
-    grad_weight_layout: Layout,
-    grad_bias_layout: Layout,
     batch_size: Int,
     seq_len: Int,
     hidden_dim: Int,
     output_dim: Int,
+    GradInputLayout: TensorLayout,
+    GradLnWeightLayout: TensorLayout,
+    GradLnBiasLayout: TensorLayout,
+    GradWeightLayout: TensorLayout,
+    GradBiasLayout: TensorLayout,
+    GradOutputLayout: TensorLayout,
+    InputLayout: TensorLayout,
+    LnParamsLayout: TensorLayout,
+    WeightLayout: TensorLayout,
 ](
-    grad_input: LayoutTensor[dtype, grad_input_layout, MutAnyOrigin],
-    grad_ln_weight: LayoutTensor[dtype, grad_ln_weight_layout, MutAnyOrigin],
-    grad_ln_bias: LayoutTensor[dtype, grad_ln_bias_layout, MutAnyOrigin],
-    grad_weight: LayoutTensor[dtype, grad_weight_layout, MutAnyOrigin],
-    grad_bias: LayoutTensor[dtype, grad_bias_layout, MutAnyOrigin],
-    grad_output: LayoutTensor[dtype, grad_output_layout, ImmutAnyOrigin],
-    input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
-    ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
-    ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
-    linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin],
+    grad_input: TileTensor[mut=True, dtype, GradInputLayout, MutAnyOrigin],
+    grad_ln_weight: TileTensor[
+        mut=True, dtype, GradLnWeightLayout, MutAnyOrigin
+    ],
+    grad_ln_bias: TileTensor[mut=True, dtype, GradLnBiasLayout, MutAnyOrigin],
+    grad_weight: TileTensor[mut=True, dtype, GradWeightLayout, MutAnyOrigin],
+    grad_bias: TileTensor[mut=True, dtype, GradBiasLayout, MutAnyOrigin],
+    grad_output: TileTensor[mut=False, dtype, GradOutputLayout, ImmutAnyOrigin],
+    input: TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin],
+    ln_weight: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
+    ln_bias: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
+    linear_weight: TileTensor[mut=False, dtype, WeightLayout, ImmutAnyOrigin],
 ):
     """Fused backward kernel using atomic operations for safe gradient
     accumulation.
     """
@@ -372,25 +373,30 @@ struct LayerNormLinearCustomOp:
         comptime weight_layout = linear_weight.static_spec.to_layout()
         comptime bias_layout = linear_bias.static_spec.to_layout()
         comptime output_layout = output.static_spec.to_layout()
+        comptime InputLayout = type_of(input_layout)
+        comptime LnParamsLayout = type_of(ln_params_layout)
+        comptime WeightLayout = type_of(weight_layout)
+        comptime BiasLayout = type_of(bias_layout)
+        comptime OutputLayout = type_of(output_layout)

         # Note: rebind is necessary now but it shouldn't be!
var output_tensor = rebind[ - LayoutTensor[dtype, output_layout, MutAnyOrigin] + TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin] ](output.to_layout_tensor()) var input_tensor = rebind[ - LayoutTensor[dtype, input_layout, ImmutAnyOrigin] + TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin] ](input.to_layout_tensor()) var ln_weight_tensor = rebind[ - LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin] + TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin] ](ln_weight.to_layout_tensor()) var ln_bias_tensor = rebind[ - LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin] + TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin] ](ln_bias.to_layout_tensor()) var linear_weight_tensor = rebind[ - LayoutTensor[dtype, weight_layout, ImmutAnyOrigin] + TileTensor[mut=False, dtype, WeightLayout, ImmutAnyOrigin] ](linear_weight.to_layout_tensor()) var linear_bias_tensor = rebind[ - LayoutTensor[dtype, bias_layout, ImmutAnyOrigin] + TileTensor[mut=False, dtype, BiasLayout, ImmutAnyOrigin] ](linear_bias.to_layout_tensor()) comptime if target == "gpu": @@ -400,11 +406,6 @@ struct LayerNormLinearCustomOp: comptime if algorithm == "fused": # fused case - one thread per sequence position comptime kernel = minimal_fused_kernel[ - input_layout, - ln_params_layout, - weight_layout, - bias_layout, - output_layout, batch_size, seq_len, hidden_dim, @@ -426,15 +427,12 @@ struct LayerNormLinearCustomOp: var normalized_buffer = gpu_ctx.enqueue_create_buffer[dtype]( batch_size * seq_len * hidden_dim ) - var normalized_tensor = LayoutTensor[ - dtype, input_layout, MutAnyOrigin - ](normalized_buffer) + var normalized_tensor = TileTensor[ + mut=True, dtype, InputLayout, MutAnyOrigin + ](normalized_buffer, input_layout) # Step 1: LayerNorm kernel comptime kernel = layernorm_kernel[ - input_layout, - ln_params_layout, - input_layout, batch_size, seq_len, hidden_dim, @@ -457,19 +455,26 @@ struct LayerNormLinearCustomOp: var matmul_buffer = 
gpu_ctx.enqueue_create_buffer[dtype]( batch_size * seq_len * output_dim ) - var matmul_tensor = LayoutTensor[ - dtype, output_layout, MutAnyOrigin - ](matmul_buffer) + var matmul_tensor = TileTensor[ + mut=True, dtype, OutputLayout, MutAnyOrigin + ](matmul_buffer, output_layout) # Create transposed weight matrix: [output_dim, hidden_dim] -> [hidden_dim, output_dim] var transposed_weight_buffer = gpu_ctx.enqueue_create_buffer[ dtype ](hidden_dim * output_dim) - var transposed_weight_tensor = LayoutTensor[ + comptime transposed_weight_layout = row_major[ + hidden_dim, output_dim + ]() + comptime TransposedWeightLayout = type_of( + transposed_weight_layout + ) + var transposed_weight_tensor = TileTensor[ + mut=True, dtype, - Layout.row_major(hidden_dim, output_dim), + TransposedWeightLayout, MutAnyOrigin, - ](transposed_weight_buffer) + ](transposed_weight_buffer, transposed_weight_layout) # Transpose the weight matrix var transpose_blocks_x = ( @@ -479,8 +484,6 @@ struct LayerNormLinearCustomOp: output_dim + TRANSPOSE_BLOCK_DIM_XY - 1 ) // TRANSPOSE_BLOCK_DIM_XY comptime kernel2 = transpose_kernel[ - weight_layout, - transposed_weight_tensor.layout, output_dim, hidden_dim, ] @@ -492,17 +495,20 @@ struct LayerNormLinearCustomOp: ) # Reshape tensors for matmul: [batch*seq, hidden] @ [hidden, output] -> [batch*seq, output] - var flat_normalized = normalized_tensor.reshape[ - Layout.row_major(batch_size * seq_len, hidden_dim) + comptime flat_normalized_layout = row_major[ + batch_size * seq_len, hidden_dim ]() - var flat_matmul = matmul_tensor.reshape[ - Layout.row_major(batch_size * seq_len, output_dim) + comptime FlatNormalizedLayout = type_of(flat_normalized_layout) + comptime flat_matmul_layout = row_major[ + batch_size * seq_len, output_dim ]() + comptime FlatMatmulLayout = type_of(flat_matmul_layout) + var flat_normalized = normalized_tensor.reshape[ + flat_normalized_layout + ]() + var flat_matmul = matmul_tensor.reshape[flat_matmul_layout]() comptime kernel3 = 
matmul_idiomatic_tiled[ - flat_normalized.layout, - transposed_weight_tensor.layout, - flat_matmul.layout, batch_size * seq_len, output_dim, hidden_dim, @@ -516,14 +522,15 @@ struct LayerNormLinearCustomOp: ) # Step 3: Add bias - reshape matmul result back to 3D for bias addition + comptime reshaped_matmul_layout = row_major[ + batch_size, seq_len, output_dim + ]() + comptime ReshapedMatmulLayout = type_of(reshaped_matmul_layout) var reshaped_matmul = matmul_tensor.reshape[ - Layout.row_major(batch_size, seq_len, output_dim) + reshaped_matmul_layout ]() comptime kernel4 = add_bias_kernel[ - reshaped_matmul.layout, - bias_layout, - output_layout, batch_size, seq_len, output_dim, @@ -612,36 +619,45 @@ struct LayerNormLinearBackwardCustomOp: comptime grad_ln_bias_layout = grad_ln_bias.static_spec.to_layout() comptime grad_weight_layout = grad_weight.static_spec.to_layout() comptime grad_bias_layout = grad_bias.static_spec.to_layout() + comptime GradOutputLayout = type_of(grad_output_layout) + comptime InputLayout = type_of(input_layout) + comptime LnParamsLayout = type_of(ln_params_layout) + comptime WeightLayout = type_of(weight_layout) + comptime GradInputLayout = type_of(grad_input_layout) + comptime GradLnWeightLayout = type_of(grad_ln_weight_layout) + comptime GradLnBiasLayout = type_of(grad_ln_bias_layout) + comptime GradWeightLayout = type_of(grad_weight_layout) + comptime GradBiasLayout = type_of(grad_bias_layout) var grad_input_tensor = rebind[ - LayoutTensor[dtype, grad_input_layout, MutAnyOrigin] + TileTensor[mut=True, dtype, GradInputLayout, MutAnyOrigin] ](grad_input.to_layout_tensor()) var grad_ln_weight_tensor = rebind[ - LayoutTensor[dtype, grad_ln_weight_layout, MutAnyOrigin] + TileTensor[mut=True, dtype, GradLnWeightLayout, MutAnyOrigin] ](grad_ln_weight.to_layout_tensor()) var grad_ln_bias_tensor = rebind[ - LayoutTensor[dtype, grad_ln_bias_layout, MutAnyOrigin] + TileTensor[mut=True, dtype, GradLnBiasLayout, MutAnyOrigin] 
](grad_ln_bias.to_layout_tensor()) var grad_weight_tensor = rebind[ - LayoutTensor[dtype, grad_weight_layout, MutAnyOrigin] + TileTensor[mut=True, dtype, GradWeightLayout, MutAnyOrigin] ](grad_weight.to_layout_tensor()) var grad_bias_tensor = rebind[ - LayoutTensor[dtype, grad_bias_layout, MutAnyOrigin] + TileTensor[mut=True, dtype, GradBiasLayout, MutAnyOrigin] ](grad_bias.to_layout_tensor()) var grad_output_tensor = rebind[ - LayoutTensor[dtype, grad_output_layout, ImmutAnyOrigin] + TileTensor[mut=False, dtype, GradOutputLayout, ImmutAnyOrigin] ](grad_output.to_layout_tensor()) var input_tensor = rebind[ - LayoutTensor[dtype, input_layout, ImmutAnyOrigin] + TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin] ](input.to_layout_tensor()) var ln_weight_tensor = rebind[ - LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin] + TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin] ](ln_weight.to_layout_tensor()) var ln_bias_tensor = rebind[ - LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin] + TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin] ](ln_bias.to_layout_tensor()) var linear_weight_tensor = rebind[ - LayoutTensor[dtype, weight_layout, ImmutAnyOrigin] + TileTensor[mut=False, dtype, WeightLayout, ImmutAnyOrigin] ](linear_weight.to_layout_tensor()) comptime if target == "gpu": @@ -649,15 +665,6 @@ struct LayerNormLinearBackwardCustomOp: # Launch backward kernel comptime kernel = minimal_fused_kernel_backward[ - grad_output_layout, - input_layout, - ln_params_layout, - weight_layout, - grad_input_layout, - grad_ln_weight_layout, - grad_ln_bias_layout, - grad_weight_layout, - grad_bias_layout, batch_size, seq_len, hidden_dim, diff --git a/problems/p23/p23.mojo b/problems/p23/p23.mojo index 69843290..5b390bb4 100644 --- a/problems/p23/p23.mojo +++ b/problems/p23/p23.mojo @@ -1,7 +1,9 @@ from std.gpu import thread_idx, block_dim, block_idx, barrier from std.gpu.host import DeviceContext from std.gpu.host.compile import get_gpu_target -from 
layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major, TensorLayout +from layout.tile_tensor import stack_allocation from std.utils import IndexList from std.math import log2 from std.algorithm.functional import elementwise, vectorize @@ -12,17 +14,18 @@ from std.benchmark import Bench, BenchConfig, Bencher, BenchId, keep # ANCHOR: elementwise_add comptime SIZE = 1024 comptime rank = 1 -comptime layout = Layout.row_major(SIZE) +comptime layout = row_major[SIZE]() +comptime LayoutType = type_of(layout) comptime dtype = DType.float32 comptime SIMD_WIDTH = simd_width_of[dtype, target=get_gpu_target()]() def elementwise_add[ - layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int + LayoutT: TensorLayout, dtype: DType, simd_width: Int, rank: Int, size: Int ]( - output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin], - a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], - b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], ctx: DeviceContext, ) raises: @parameter @@ -34,7 +37,7 @@ def elementwise_add[ print("idx:", idx) # FILL IN (2 to 4 lines) - elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx) + elementwise[add, SIMD_WIDTH, target="gpu"](size, ctx) # ANCHOR_END: elementwise_add @@ -45,16 +48,16 @@ comptime TILE_SIZE = 32 def tiled_elementwise_add[ - layout: Layout, + LayoutT: TensorLayout, dtype: DType, simd_width: Int, rank: Int, size: Int, tile_size: Int, ]( - output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin], - a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], - b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutT, 
MutAnyOrigin], ctx: DeviceContext, ) raises: @parameter @@ -79,7 +82,7 @@ def tiled_elementwise_add[ # ANCHOR: manual_vectorized_tiled_elementwise_add def manual_vectorized_tiled_elementwise_add[ - layout: Layout, + LayoutT: TensorLayout, dtype: DType, simd_width: Int, num_threads_per_tile: Int, @@ -87,9 +90,9 @@ def manual_vectorized_tiled_elementwise_add[ size: Int, tile_size: Int, ]( - output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin], - a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], - b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], ctx: DeviceContext, ) raises: # Each tile contains tile_size groups of simd_width elements @@ -120,7 +123,7 @@ def manual_vectorized_tiled_elementwise_add[ # ANCHOR: vectorize_within_tiles_elementwise_add def vectorize_within_tiles_elementwise_add[ - layout: Layout, + LayoutT: TensorLayout, dtype: DType, simd_width: Int, num_threads_per_tile: Int, @@ -128,9 +131,9 @@ def vectorize_within_tiles_elementwise_add[ size: Int, tile_size: Int, ]( - output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin], - a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], - b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], ctx: DeviceContext, ) raises: # Each tile contains tile_size elements (not SIMD groups) @@ -171,7 +174,8 @@ def benchmark_elementwise_parameterized[ test_size: Int, tile_size: Int ](mut b: Bencher) raises: var bench_ctx = DeviceContext() - comptime layout = Layout.row_major(test_size) + comptime bench_layout = row_major[test_size]() + comptime BenchLayoutType = type_of(bench_layout) var out = bench_ctx.enqueue_create_buffer[dtype](test_size) 
out.enqueue_fill(0) var a = bench_ctx.enqueue_create_buffer[dtype](test_size) @@ -184,20 +188,20 @@ def benchmark_elementwise_parameterized[ a_host[i] = Scalar[dtype](2 * i) b_host[i] = Scalar[dtype](2 * i + 1) - var a_tensor = LayoutTensor[mut=False, dtype, layout, MutAnyOrigin]( - a.unsafe_ptr() - ) - var b_tensor = LayoutTensor[mut=False, dtype, layout, MutAnyOrigin]( - b_buf.unsafe_ptr() - ) - var out_tensor = LayoutTensor[mut=True, dtype, layout, MutAnyOrigin]( - out.unsafe_ptr() + var a_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](a, bench_layout) + var b_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](b_buf, bench_layout) + var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin]( + out, bench_layout ) @parameter @always_inline def elementwise_workflow(ctx: DeviceContext) raises: - elementwise_add[layout, dtype, SIMD_WIDTH, rank, test_size]( + elementwise_add[BenchLayoutType, dtype, SIMD_WIDTH, rank, test_size]( out_tensor, a_tensor, b_tensor, ctx ) @@ -212,7 +216,8 @@ def benchmark_tiled_parameterized[ test_size: Int, tile_size: Int ](mut b: Bencher) raises: var bench_ctx = DeviceContext() - comptime layout = Layout.row_major(test_size) + comptime bench_layout = row_major[test_size]() + comptime BenchLayoutType = type_of(bench_layout) var out = bench_ctx.enqueue_create_buffer[dtype](test_size) out.enqueue_fill(0) var a = bench_ctx.enqueue_create_buffer[dtype](test_size) @@ -225,15 +230,21 @@ def benchmark_tiled_parameterized[ a_host[i] = Scalar[dtype](2 * i) b_host[i] = Scalar[dtype](2 * i + 1) - var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr()) - var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr()) - var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) + var a_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](a, bench_layout) + var b_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, 
ImmutAnyOrigin + ](b_buf, bench_layout) + var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin]( + out, bench_layout + ) @parameter @always_inline def tiled_workflow(ctx: DeviceContext) raises: tiled_elementwise_add[ - layout, dtype, SIMD_WIDTH, rank, test_size, tile_size + BenchLayoutType, dtype, SIMD_WIDTH, rank, test_size, tile_size ](out_tensor, a_tensor, b_tensor, ctx) b.iter_custom[tiled_workflow](bench_ctx) @@ -247,7 +258,8 @@ def benchmark_manual_vectorized_parameterized[ test_size: Int, tile_size: Int ](mut b: Bencher) raises: var bench_ctx = DeviceContext() - comptime layout = Layout.row_major(test_size) + comptime bench_layout = row_major[test_size]() + comptime BenchLayoutType = type_of(bench_layout) var out = bench_ctx.enqueue_create_buffer[dtype](test_size) out.enqueue_fill(0) var a = bench_ctx.enqueue_create_buffer[dtype](test_size) @@ -260,15 +272,21 @@ def benchmark_manual_vectorized_parameterized[ a_host[i] = Scalar[dtype](2 * i) b_host[i] = Scalar[dtype](2 * i + 1) - var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr()) - var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr()) - var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) + var a_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](a, bench_layout) + var b_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](b_buf, bench_layout) + var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin]( + out, bench_layout + ) @parameter @always_inline def manual_vectorized_workflow(ctx: DeviceContext) raises: manual_vectorized_tiled_elementwise_add[ - layout, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size + BenchLayoutType, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size ](out_tensor, a_tensor, b_tensor, ctx) b.iter_custom[manual_vectorized_workflow](bench_ctx) @@ -282,7 +300,8 @@ def benchmark_vectorized_parameterized[ test_size: Int, tile_size: Int ](mut b: 
Bencher) raises: var bench_ctx = DeviceContext() - comptime layout = Layout.row_major(test_size) + comptime bench_layout = row_major[test_size]() + comptime BenchLayoutType = type_of(bench_layout) var out = bench_ctx.enqueue_create_buffer[dtype](test_size) out.enqueue_fill(0) var a = bench_ctx.enqueue_create_buffer[dtype](test_size) @@ -295,15 +314,21 @@ def benchmark_vectorized_parameterized[ a_host[i] = Scalar[dtype](2 * i) b_host[i] = Scalar[dtype](2 * i + 1) - var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr()) - var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr()) - var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) + var a_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](a, bench_layout) + var b_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](b_buf, bench_layout) + var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin]( + out, bench_layout + ) @parameter @always_inline def vectorized_workflow(ctx: DeviceContext) raises: vectorize_within_tiles_elementwise_add[ - layout, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size + BenchLayoutType, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size ](out_tensor, a_tensor, b_tensor, ctx) b.iter_custom[vectorized_workflow](bench_ctx) @@ -328,8 +353,12 @@ def main() raises: b_host[i] = Scalar[dtype](2 * i + 1) expected[i] = a_host[i] + b_host[i] - var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr()) - var b_tensor = LayoutTensor[mut=False, dtype, layout](b.unsafe_ptr()) + var a_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin]( + a, layout + ) + var b_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin]( + b, layout + ) ctx.synchronize() @@ -337,8 +366,10 @@ def main() raises: print("simd_width:", SIMD_WIDTH) if argv()[1] == "--elementwise": - out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) - elementwise_add[layout, dtype, 
SIMD_WIDTH, rank, SIZE]( + var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin]( + out, layout + ) + elementwise_add[LayoutType, dtype, SIMD_WIDTH, rank, SIZE]( out_tensor, a_tensor, b_tensor, ctx ) @@ -350,11 +381,13 @@ def main() raises: print("Puzzle 23 complete โœ…") elif argv()[1] == "--tiled": - out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) - print("tile size:", TILE_SIZE) - tiled_elementwise_add[layout, dtype, SIMD_WIDTH, rank, SIZE, TILE_SIZE]( - out_tensor, a_tensor, b_tensor, ctx + var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin]( + out, layout ) + print("tile size:", TILE_SIZE) + tiled_elementwise_add[ + LayoutType, dtype, SIMD_WIDTH, rank, SIZE, TILE_SIZE + ](out_tensor, a_tensor, b_tensor, ctx) with out.map_to_host() as out_host: print("out:", out_host) @@ -364,10 +397,12 @@ def main() raises: print("Puzzle 23 complete โœ…") elif argv()[1] == "--manual-vectorized": - out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) + var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin]( + out, layout + ) print("tile size:", TILE_SIZE) manual_vectorized_tiled_elementwise_add[ - layout, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE + LayoutType, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE ](out_tensor, a_tensor, b_tensor, ctx) with out.map_to_host() as out_host: @@ -378,10 +413,12 @@ def main() raises: print("Puzzle 23 complete โœ…") elif argv()[1] == "--vectorized": - out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) + var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin]( + out, layout + ) print("tile size:", TILE_SIZE) vectorize_within_tiles_elementwise_add[ - layout, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE + LayoutType, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE ](out_tensor, a_tensor, b_tensor, ctx) with out.map_to_host() as out_host: diff --git a/problems/p24/p24.mojo b/problems/p24/p24.mojo index 10ae8958..ed901d15 100644 --- 
a/problems/p24/p24.mojo +++ b/problems/p24/p24.mojo @@ -4,7 +4,9 @@ from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer from std.gpu.memory import AddressSpace from std.gpu.primitives.warp import sum as warp_sum, WARP_SIZE from std.algorithm.functional import elementwise -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.utils import IndexList from std.sys import argv, simd_width_of, align_of from std.testing import assert_equal @@ -27,26 +29,25 @@ comptime BLOCKS_PER_GRID = (1, 1) comptime THREADS_PER_BLOCK = (WARP_SIZE, 1) comptime dtype = DType.float32 comptime SIMD_WIDTH = simd_width_of[dtype]() -comptime in_layout = Layout.row_major(SIZE) -comptime out_layout = Layout.row_major(1) +comptime in_layout = row_major[SIZE]() +comptime InLayoutType = type_of(in_layout) +comptime out_layout = row_major[1]() +comptime OutLayoutType = type_of(out_layout) def traditional_dot_product_p12_style[ - in_layout: Layout, out_layout: Layout, size: Int + size: Int ]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin], ): """ This is the complex approach from p12_layout_tensor.mojo - kept for comparison. 
""" - var shared = LayoutTensor[ - dtype, - Layout.row_major(WARP_SIZE), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + var shared = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[WARP_SIZE]()) var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x @@ -73,11 +74,11 @@ def traditional_dot_product_p12_style[ # ANCHOR: simple_warp_kernel def simple_warp_dot_product[ - in_layout: Layout, out_layout: Layout, size: Int + size: Int ]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin], ): var global_i = block_dim.x * block_idx.x + thread_idx.x # FILL IN (6 lines at most) @@ -88,16 +89,14 @@ def simple_warp_dot_product[ # ANCHOR: functional_warp_approach def functional_warp_dot_product[ - layout: Layout, - out_layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int, ]( - output: LayoutTensor[mut=True, dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], - b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, InLayoutType, MutAnyOrigin], + b: TileTensor[mut=False, dtype, InLayoutType, MutAnyOrigin], ctx: DeviceContext, ) raises: @parameter @@ -162,8 +161,10 @@ def benchmark_simple_warp_parameterized[ test_size: Int ](mut bencher: Bencher) raises: comptime n_warps = test_size // WARP_SIZE - comptime in_layout = Layout.row_major(test_size) - comptime out_layout = Layout.row_major(n_warps) + comptime bench_in_layout = row_major[test_size]() + comptime BenchInLayoutType = type_of(bench_in_layout) + comptime bench_out_layout = 
row_major[n_warps]() + comptime BenchOutLayoutType = type_of(bench_out_layout) comptime n_threads = WARP_SIZE comptime n_blocks = (ceildiv(test_size, n_threads), 1) @@ -182,16 +183,18 @@ def benchmark_simple_warp_parameterized[ rand_int[dtype, test_size](b) expected_output[dtype, n_warps](expected, a, b) - var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out) + var a_tensor = TileTensor[ + mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin + ](a, bench_in_layout) + var b_tensor = TileTensor[ + mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin + ](b, bench_in_layout) + var out_tensor = TileTensor(out, bench_out_layout) @parameter @always_inline def traditional_workflow(ctx: DeviceContext) raises: - comptime kernel = simple_warp_dot_product[ - in_layout, out_layout, test_size - ] + comptime kernel = simple_warp_dot_product[test_size] ctx.enqueue_function[kernel, kernel]( out_tensor, a_tensor, @@ -214,8 +217,10 @@ def benchmark_functional_warp_parameterized[ test_size: Int ](mut bencher: Bencher) raises: comptime n_warps = test_size // WARP_SIZE - comptime in_layout = Layout.row_major(test_size) - comptime out_layout = Layout.row_major(n_warps) + comptime bench_in_layout = row_major[test_size]() + comptime BenchInLayoutType = type_of(bench_in_layout) + comptime bench_out_layout = row_major[n_warps]() + comptime BenchOutLayoutType = type_of(bench_out_layout) var bench_ctx = DeviceContext() @@ -232,16 +237,20 @@ def benchmark_functional_warp_parameterized[ rand_int[dtype, test_size](b) expected_output[dtype, n_warps](expected, a, b) - var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out) + var a_tensor = TileTensor[ + mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin + ](a, 
bench_in_layout) + var b_tensor = TileTensor[ + mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin + ](b, bench_in_layout) + var out_tensor = TileTensor(out, bench_out_layout) @parameter @always_inline def functional_warp_workflow(ctx: DeviceContext) raises: - functional_warp_dot_product[ - in_layout, out_layout, dtype, SIMD_WIDTH, 1, test_size - ](out_tensor, a_tensor, b_tensor, ctx) + functional_warp_dot_product[dtype, SIMD_WIDTH, 1, test_size]( + out_tensor, a_tensor, b_tensor, ctx + ) bencher.iter_custom[functional_warp_workflow](bench_ctx) check_result[dtype, n_warps](out, expected) @@ -257,8 +266,10 @@ def benchmark_traditional_parameterized[ test_size: Int ](mut bencher: Bencher) raises: comptime n_warps = test_size // WARP_SIZE - comptime in_layout = Layout.row_major(test_size) - comptime out_layout = Layout.row_major(n_warps) + comptime bench_in_layout = row_major[test_size]() + comptime BenchInLayoutType = type_of(bench_in_layout) + comptime bench_out_layout = row_major[n_warps]() + comptime BenchOutLayoutType = type_of(bench_out_layout) comptime n_blocks = (ceildiv(test_size, WARP_SIZE), 1) var bench_ctx = DeviceContext() @@ -276,16 +287,20 @@ def benchmark_traditional_parameterized[ rand_int[dtype, test_size](b) expected_output[dtype, n_warps](expected, a, b) - var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out) + var a_tensor = TileTensor[ + mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin + ](a, bench_in_layout) + var b_tensor = TileTensor[ + mut=False, dtype, BenchInLayoutType, ImmutAnyOrigin + ](b, bench_in_layout) + var out_tensor = TileTensor(out, bench_out_layout) @parameter @always_inline def traditional_workflow(ctx: DeviceContext) raises: ctx.enqueue_function[ - traditional_dot_product_p12_style[in_layout, out_layout, test_size], - traditional_dot_product_p12_style[in_layout, out_layout, 
test_size],
+            traditional_dot_product_p12_style[test_size],
+            traditional_dot_product_p12_style[test_size],
         ](
             out_tensor,
             a_tensor,
@@ -318,9 +333,13 @@ def main() raises:
         var expected = ctx.enqueue_create_host_buffer[dtype](n_warps)
         expected.enqueue_fill(0)

-        var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
-        var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
-        var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b)
+        var out_tensor = TileTensor(out, out_layout)
+        var a_tensor = TileTensor[
+            mut=False, dtype, InLayoutType, ImmutAnyOrigin
+        ](a, in_layout)
+        var b_tensor = TileTensor[
+            mut=False, dtype, InLayoutType, ImmutAnyOrigin
+        ](b, in_layout)

         with a.map_to_host() as a_host, b.map_to_host() as b_host:
             for i in range(SIZE):
@@ -329,12 +348,8 @@ def main() raises:
         if argv()[1] == "--traditional":
             ctx.enqueue_function[
-                traditional_dot_product_p12_style[
-                    in_layout, out_layout, SIZE
-                ],
-                traditional_dot_product_p12_style[
-                    in_layout, out_layout, SIZE
-                ],
+                traditional_dot_product_p12_style[SIZE],
+                traditional_dot_product_p12_style[SIZE],
             ](
                 out_tensor,
                 a_tensor,
@@ -344,8 +359,8 @@ def main() raises:
             )
         elif argv()[1] == "--kernel":
             ctx.enqueue_function[
-                simple_warp_dot_product[in_layout, out_layout, SIZE],
-                simple_warp_dot_product[in_layout, out_layout, SIZE],
+                simple_warp_dot_product[SIZE],
+                simple_warp_dot_product[SIZE],
             ](
                 out_tensor,
                 a_tensor,
@@ -354,9 +369,9 @@ def main() raises:
                 block_dim=THREADS_PER_BLOCK,
             )
         elif argv()[1] == "--functional":
-            functional_warp_dot_product[
-                in_layout, out_layout, dtype, SIMD_WIDTH, 1, SIZE
-            ](out_tensor, a_tensor, b_tensor, ctx)
+            functional_warp_dot_product[dtype, SIMD_WIDTH, 1, SIZE](
+                out_tensor, a_tensor, b_tensor, ctx
+            )

         expected_output[dtype, n_warps](expected, a, b)
         check_result[dtype, n_warps, True](out, expected)
         print("Puzzle 24 complete ✅")
diff --git a/problems/p25/p25.mojo b/problems/p25/p25.mojo
index 8aaa6edf..ba29aa7e 100644
--- a/problems/p25/p25.mojo
+++ b/problems/p25/p25.mojo
@@ -1,7 +1,8 @@
 from std.gpu import thread_idx, block_idx, block_dim, lane_id
 from std.gpu.host import DeviceContext
 from std.gpu.primitives.warp import shuffle_down, broadcast, WARP_SIZE
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
 from std.sys import argv
 from std.testing import assert_equal, assert_almost_equal
@@ -10,14 +11,15 @@
 comptime SIZE = WARP_SIZE
 comptime BLOCKS_PER_GRID = (1, 1)
 comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
 comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)

 def neighbor_difference[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     """
     Compute finite differences: output[i] = input[i+1] - input[i]
@@ -36,14 +38,15 @@
 comptime SIZE_2 = 64
 comptime BLOCKS_PER_GRID_2 = (2, 1)
 comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
-comptime layout_2 = Layout.row_major(SIZE_2)
+comptime layout_2 = row_major[SIZE_2]()
+comptime LayoutType_2 = type_of(layout_2)

 def moving_average_3[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType_2, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType_2, ImmutAnyOrigin],
 ):
     """
     Compute 3-point moving average: output[i] = (input[i] + input[i+1] + input[i+2]) / 3
@@ -61,10 +64,10 @@

 # ANCHOR: broadcast_shuffle_coordination
 def broadcast_shuffle_coordination[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     """
     Combine broadcast() and shuffle_down() for advanced warp coordination.
@@ -74,7 +77,7 @@
     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var lane = Int(lane_id())
     if global_i < size:
-        var scale_factor: output.element_type = 0.0
+        var scale_factor: output.ElementType = 0.0

         # FILL IN (roughly 14 lines)
@@ -84,10 +87,10 @@

 # ANCHOR: basic_broadcast
 def basic_broadcast[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     """
     Basic broadcast: Lane 0 computes a block-local value, broadcasts it to all lanes.
@@ -96,7 +99,7 @@
     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var lane = Int(lane_id())
     if global_i < size:
-        var broadcast_value: output.element_type = 0.0
+        var broadcast_value: output.ElementType = 0.0

         # FILL IN (roughly 10 lines)
@@ -106,10 +109,10 @@

 # ANCHOR: conditional_broadcast
 def conditional_broadcast[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     """
     Conditional broadcast: Lane 0 makes a decision based on block-local data, broadcasts it to all lanes.
@@ -118,7 +121,7 @@
     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var lane = Int(lane_id())
     if global_i < size:
-        var decision_value: output.element_type = 0.0
+        var decision_value: output.ElementType = 0.0

         # FILL IN (roughly 10 lines)
@@ -145,14 +148,12 @@ def test_neighbor_difference() raises:
         for i in range(SIZE):
             input_host[i] = Scalar[dtype](i * i)

-        var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-            input_buf
-        )
-        var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-            output_buf
-        )
+        var input_tensor = TileTensor[
+            mut=False, dtype, LayoutType, ImmutAnyOrigin
+        ](input_buf, layout)
+        var output_tensor = TileTensor(output_buf, layout)

-        comptime kernel = neighbor_difference[layout, SIZE]
+        comptime kernel = neighbor_difference[SIZE]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
@@ -193,14 +194,12 @@ def test_moving_average() raises:
         for i in range(1, SIZE_2):
             input_host[i] = input_host[i - 1] + Scalar[dtype](i + 1)

-        var input_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin](
-            input_buf
-        )
-        var output_tensor = LayoutTensor[dtype, layout_2, MutAnyOrigin](
-            output_buf
-        )
+        var input_tensor = TileTensor[
+            mut=False, dtype, LayoutType_2, ImmutAnyOrigin
+        ](input_buf, layout_2)
+        var output_tensor = TileTensor(output_buf, layout_2)

-        comptime kernel = moving_average_3[layout_2, SIZE_2]
+        comptime kernel = moving_average_3[SIZE_2]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
@@ -263,14 +262,12 @@ def test_broadcast_shuffle_coordination() raises:
             else:
                 input_host[i] = Scalar[dtype](((i - 4) % 4) * 2 + 1)

-        var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-            input_buf
-        )
-        var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-            output_buf
-        )
+        var input_tensor = TileTensor[
+            mut=False, dtype, LayoutType, ImmutAnyOrigin
+        ](input_buf, layout)
+        var output_tensor = TileTensor(output_buf, layout)

-        comptime kernel = broadcast_shuffle_coordination[layout, SIZE]
+        comptime kernel = broadcast_shuffle_coordination[SIZE]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
@@ -317,14 +314,12 @@ def test_basic_broadcast() raises:
         for i in range(SIZE):
             input_host[i] = Scalar[dtype](i + 1)

-        var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-            input_buf
-        )
-        var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-            output_buf
-        )
+        var input_tensor = TileTensor[
+            mut=False, dtype, LayoutType, ImmutAnyOrigin
+        ](input_buf, layout)
+        var output_tensor = TileTensor(output_buf, layout)

-        comptime kernel = basic_broadcast[layout, SIZE]
+        comptime kernel = basic_broadcast[SIZE]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
@@ -377,14 +372,12 @@ def test_conditional_broadcast() raises:
         for i in range(SIZE):
             input_host[i] = test_values[i % len(test_values)]

-        var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-            input_buf
-        )
-        var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-            output_buf
-        )
+        var input_tensor = TileTensor[
+            mut=False, dtype, LayoutType, ImmutAnyOrigin
+        ](input_buf, layout)
+        var output_tensor = TileTensor(output_buf, layout)

-        comptime kernel = conditional_broadcast[layout, SIZE]
+        comptime kernel = conditional_broadcast[SIZE]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
diff --git a/problems/p26/p26.mojo b/problems/p26/p26.mojo
index d4e82266..33e127fe 100644
--- a/problems/p26/p26.mojo
+++ b/problems/p26/p26.mojo
@@ -1,7 +1,8 @@
 from std.gpu import thread_idx, block_idx, block_dim, lane_id
 from std.gpu.host import DeviceContext
 from std.gpu.primitives.warp import shuffle_xor, prefix_sum, WARP_SIZE
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
 from std.sys import argv
 from std.testing import assert_equal, assert_almost_equal
@@ -10,14 +11,15 @@
 comptime SIZE = WARP_SIZE
 comptime BLOCKS_PER_GRID = (1, 1)
 comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
 comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)

 def butterfly_pair_swap[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     """
     Basic butterfly pair swap: Exchange values between adjacent pairs using XOR pattern.
@@ -35,10 +37,10 @@

 # ANCHOR: butterfly_parallel_max
 def butterfly_parallel_max[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     """
     Parallel maximum reduction using butterfly pattern.
@@ -59,14 +61,15 @@
 comptime SIZE_2 = 64
 comptime BLOCKS_PER_GRID_2 = (2, 1)
 comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
-comptime layout_2 = Layout.row_major(SIZE_2)
+comptime layout_2 = row_major[SIZE_2]()
+comptime LayoutType_2 = type_of(layout_2)

 def butterfly_conditional_max[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType_2, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType_2, ImmutAnyOrigin],
 ):
     """
     Conditional butterfly maximum: Perform butterfly max reduction, but only store result
@@ -88,10 +91,10 @@

 # ANCHOR: warp_inclusive_prefix_sum
 def warp_inclusive_prefix_sum[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     """
     Inclusive prefix sum using warp primitive:
@@ -123,10 +126,10 @@

 # ANCHOR: warp_partition
 def warp_partition[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
     pivot: Float32,
 ):
     """
@@ -167,14 +170,12 @@ def test_butterfly_pair_swap() raises:
         for i in range(SIZE):
             input_host[i] = Scalar[dtype](i)

-        var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-            input_buf
-        )
-        var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-            output_buf
-        )
+        var input_tensor = TileTensor[
+            mut=False, dtype, LayoutType, ImmutAnyOrigin
+        ](input_buf, layout)
+        var output_tensor = TileTensor(output_buf, layout)

-        comptime kernel = butterfly_pair_swap[layout, SIZE]
+        comptime kernel = butterfly_pair_swap[SIZE]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
@@ -218,14 +219,12 @@ def test_butterfly_parallel_max() raises:
         # Make sure we have a clear maximum
        input_host[SIZE - 1] = 1000.0

-        var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-            input_buf
-        )
-        var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-            output_buf
-        )
+        var input_tensor = TileTensor[
+            mut=False, dtype, LayoutType, ImmutAnyOrigin
+        ](input_buf, layout)
+        var output_tensor = TileTensor(output_buf, layout)

-        comptime kernel = butterfly_parallel_max[layout, SIZE]
+        comptime kernel = butterfly_parallel_max[SIZE]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
@@ -264,14 +263,12 @@ def test_butterfly_conditional_max() raises:
             else:
                 input_host[i] = Scalar[dtype](i % 10)

-        var input_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin](
-            input_buf
-        )
-        var output_tensor = LayoutTensor[dtype, layout_2, MutAnyOrigin](
-            output_buf
-        )
+        var input_tensor = TileTensor[
+            mut=False, dtype, LayoutType_2, ImmutAnyOrigin
+        ](input_buf, layout_2)
+        var output_tensor = TileTensor(output_buf, layout_2)

-        comptime kernel = butterfly_conditional_max[layout_2, SIZE_2]
+        comptime kernel = butterfly_conditional_max[SIZE_2]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
@@ -324,14 +321,12 @@ def test_warp_inclusive_prefix_sum() raises:
         for i in range(SIZE):
             input_host[i] = Scalar[dtype](i + 1)

-        var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-            input_buf
-        )
-        var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-            output_buf
-        )
+        var input_tensor = TileTensor[
+            mut=False, dtype, LayoutType, ImmutAnyOrigin
+        ](input_buf, layout)
+        var output_tensor = TileTensor(output_buf, layout)

-        comptime kernel = warp_inclusive_prefix_sum[layout, SIZE]
+        comptime kernel = warp_inclusive_prefix_sum[SIZE]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
@@ -390,14 +385,12 @@ def test_warp_partition() raises:
         for i in range(SIZE):
             input_host[i] = Scalar[dtype](test_values[i % len(test_values)])

-        var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-            input_buf
-        )
-        var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-            output_buf
-        )
+        var input_tensor = TileTensor[
+            mut=False, dtype, LayoutType, ImmutAnyOrigin
+        ](input_buf, layout)
+        var output_tensor = TileTensor(output_buf, layout)

-        comptime kernel = warp_partition[layout, SIZE]
+        comptime kernel = warp_partition[SIZE]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
diff --git a/problems/p27/p27.mojo b/problems/p27/p27.mojo
index 78835fc3..6dbbd7de 100644
--- a/problems/p27/p27.mojo
+++ b/problems/p27/p27.mojo
@@ -4,7 +4,9 @@
 from std.gpu.primitives.warp import WARP_SIZE
 from std.gpu.primitives import block
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from std.sys import argv
 from std.testing import assert_equal
 from std.math import floor
@@ -12,22 +14,19 @@

 # ANCHOR: traditional_dot_product
 def traditional_dot_product[
-    in_layout: Layout, out_layout: Layout, tpb: Int
+    tpb: Int
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
     size: Int,
 ):
     """Traditional dot product using shared memory + barriers + tree reduction.

     Educational but complex - shows the manual coordination needed."""
-    var shared = LayoutTensor[
-        dtype,
-        Layout.row_major(tpb),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[tpb]())

     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
@@ -58,17 +57,19 @@
 comptime SIZE = 128
 comptime TPB = 128
 comptime NUM_BINS = 8
-comptime in_layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(1)
+comptime in_layout = row_major[SIZE]()
+comptime InLayoutType = type_of(in_layout)
+comptime out_layout = row_major[1]()
+comptime OutLayoutType = type_of(out_layout)
 comptime dtype = DType.float32

 def block_sum_dot_product[
-    in_layout: Layout, out_layout: Layout, tpb: Int
+    tpb: Int
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
     size: Int,
 ):
     """Dot product using block.sum() - convenience function like warp.sum()!
@@ -83,15 +84,18 @@
 # ANCHOR_END: block_sum_dot_product

 # ANCHOR: block_histogram
-comptime bin_layout = Layout.row_major(SIZE)  # Max SIZE elements per bin
+comptime bin_layout = row_major[SIZE]()  # Max SIZE elements per bin
+comptime BinLayoutType = type_of(bin_layout)

 def block_histogram_bin_extract[
-    in_layout: Layout, bin_layout: Layout, out_layout: Layout, tpb: Int
+    tpb: Int
 ](
-    input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
-    bin_output: LayoutTensor[dtype, bin_layout, MutAnyOrigin],
-    count_output: LayoutTensor[DType.int32, out_layout, MutAnyOrigin],
+    input_data: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
+    bin_output: TileTensor[mut=True, dtype, BinLayoutType, MutAnyOrigin],
+    count_output: TileTensor[
+        mut=True, DType.int32, OutLayoutType, MutAnyOrigin
+    ],
     size: Int,
     target_bin: Int,
     num_bins: Int,
@@ -133,14 +137,15 @@

 # ANCHOR: block_normalize
-comptime vector_layout = Layout.row_major(SIZE)
+comptime vector_layout = row_major[SIZE]()
+comptime VectorLayoutType = type_of(vector_layout)

 def block_normalize_vector[
-    in_layout: Layout, out_layout: Layout, tpb: Int
+    tpb: Int
 ](
-    input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
-    output_data: LayoutTensor[dtype, out_layout, MutAnyOrigin],
+    input_data: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
+    output_data: TileTensor[mut=True, dtype, VectorLayoutType, MutAnyOrigin],
     size: Int,
 ):
     """Vector mean normalization using block.sum() + block.broadcast() combination.
@@ -208,14 +213,16 @@ def main() raises:
         print("TPB:", TPB)
         print("Expected result:", expected)

-        a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
-        b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b_buf)
-        out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+        a_tensor = TileTensor[
+            mut=False, dtype, InLayoutType, ImmutAnyOrigin
+        ](a, in_layout)
+        b_tensor = TileTensor[
+            mut=False, dtype, InLayoutType, ImmutAnyOrigin
+        ](b_buf, in_layout)
+        out_tensor = TileTensor(out, out_layout)

         # Traditional approach: works perfectly when size == TPB
-        comptime kernel = traditional_dot_product[
-            in_layout, out_layout, TPB
-        ]
+        comptime kernel = traditional_dot_product[TPB]
         ctx.enqueue_function[kernel, kernel](
             out_tensor,
             a_tensor,
@@ -253,12 +260,16 @@ def main() raises:
         print("TPB:", TPB)
         print("Expected result:", expected)

-        a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
-        b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b_buf)
-        out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+        a_tensor = TileTensor[
+            mut=False, dtype, InLayoutType, ImmutAnyOrigin
+        ](a, in_layout)
+        b_tensor = TileTensor[
+            mut=False, dtype, InLayoutType, ImmutAnyOrigin
+        ](b_buf, in_layout)
+        out_tensor = TileTensor(out, out_layout)

         # Block.sum(): Same result with dramatically simpler code!
-        comptime kernel = block_sum_dot_product[in_layout, out_layout, TPB]
+        comptime kernel = block_sum_dot_product[TPB]
         ctx.enqueue_function[kernel, kernel](
             out_tensor,
             a_tensor,
@@ -307,9 +318,9 @@ def main() raises:
             print("...")
         print()

-        input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
-            input_buf
-        )
+        input_tensor = TileTensor[
+            mut=False, dtype, InLayoutType, ImmutAnyOrigin
+        ](input_buf, in_layout)

         # Demonstrate histogram for each bin using block.prefix_sum()
         for target_bin in range(NUM_BINS):
@@ -329,17 +340,11 @@ def main() raises:
             var bin_count = ctx.enqueue_create_buffer[DType.int32](1)
            bin_count.enqueue_fill(0)

-            var bin_tensor = LayoutTensor[dtype, bin_layout, MutAnyOrigin](
-                bin_data
-            )
-            var count_tensor = LayoutTensor[
-                DType.int32, out_layout, MutAnyOrigin
-            ](bin_count)
+            var bin_tensor = TileTensor(bin_data, bin_layout)
+            var count_tensor = TileTensor(bin_count, out_layout)

             # Execute histogram kernel for this specific bin
-            comptime kernel = block_histogram_bin_extract[
-                in_layout, bin_layout, out_layout, TPB
-            ]
+            comptime kernel = block_histogram_bin_extract[TPB]
             ctx.enqueue_function[kernel, kernel](
                 input_tensor,
                 bin_tensor,
@@ -405,17 +410,13 @@ def main() raises:
         print("Mean value:", mean_value)
         print()

-        input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
-            input_buf
-        )
-        var output_tensor = LayoutTensor[
-            dtype, vector_layout, MutAnyOrigin
-        ](output_buf)
+        input_tensor = TileTensor[
+            mut=False, dtype, InLayoutType, ImmutAnyOrigin
+        ](input_buf, in_layout)
+        var output_tensor = TileTensor(output_buf, vector_layout)

         # Execute vector normalization kernel
-        comptime kernel = block_normalize_vector[
-            in_layout, vector_layout, TPB
-        ]
+        comptime kernel = block_normalize_vector[TPB]
         ctx.enqueue_function[kernel, kernel](
             input_tensor,
             output_tensor,
diff --git a/problems/p28/p28.mojo b/problems/p28/p28.mojo
index 86b48d7c..ec1c8928 100644
--- a/problems/p28/p28.mojo
+++ b/problems/p28/p28.mojo
@@ -1,7 +1,9 @@
 from std.gpu import thread_idx, block_idx, block_dim, grid_dim, barrier
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace, async_copy_wait_all
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from layout.layout_tensor import copy_dram_to_sram_async
 from std.sys import argv, info
 from std.testing import assert_equal, assert_almost_equal
@@ -17,15 +19,18 @@ comptime BLOCKS_PER_GRID_ASYNC = (
 ) // CONV_TILE_SIZE
 comptime THREADS_PER_BLOCK_ASYNC = 256
 comptime dtype = DType.float32
-comptime layout_async = Layout.row_major(VECTOR_SIZE)
+comptime layout_async = row_major[VECTOR_SIZE]()
+comptime LayoutAsyncType = type_of(layout_async)
+comptime kernel_layout = row_major[KERNEL_SIZE]()
+comptime KernelLayoutType = type_of(kernel_layout)

 def async_copy_overlap_convolution[
-    dtype: DType, layout: Layout
+    dtype: DType
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
-    kernel: LayoutTensor[dtype, Layout.row_major(KERNEL_SIZE), ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutAsyncType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutAsyncType, ImmutAnyOrigin],
+    kernel: TileTensor[mut=False, dtype, KernelLayoutType, ImmutAnyOrigin],
 ):
     """Demonstrates async copy operations building on p14 patterns.
@@ -34,18 +39,12 @@ def async_copy_overlap_convolution[
     """
     # Shared memory buffers (like p14, but without .fill(0) to avoid race)
-    var input_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(CONV_TILE_SIZE),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var kernel_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(KERNEL_SIZE),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var input_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[CONV_TILE_SIZE]())
+    var kernel_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[KERNEL_SIZE]())

     # FILL IN HERE (roughly 19 lines)
@@ -73,17 +72,15 @@ def test_async_copy_overlap_convolution() raises:
         for i in range(KERNEL_SIZE):
             kernel_host[i] = Scalar[dtype](i + 1)

-        var input_tensor = LayoutTensor[dtype, layout_async, ImmutAnyOrigin](
-            input_buf
+        var input_tensor = TileTensor[
+            mut=False, dtype, LayoutAsyncType, ImmutAnyOrigin
+        ](input_buf, layout_async)
+        var output_tensor = TileTensor(output_buf, layout_async)
+        var kernel_tensor = TileTensor[mut=False, dtype, KernelLayoutType](
+            kernel_buf, kernel_layout
         )
-        var output_tensor = LayoutTensor[dtype, layout_async, MutAnyOrigin](
-            output_buf
-        )
-        var kernel_tensor = LayoutTensor[
-            mut=False, dtype, Layout.row_major(KERNEL_SIZE)
-        ](kernel_buf)

-        comptime kernel = async_copy_overlap_convolution[dtype, layout_async]
+        comptime kernel = async_copy_overlap_convolution[dtype]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
diff --git a/problems/p29/p29.mojo b/problems/p29/p29.mojo
index a645526f..89156067 100644
--- a/problems/p29/p29.mojo
+++ b/problems/p29/p29.mojo
@@ -9,7 +9,9 @@ from std.gpu.sync import (
 )
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace, async_copy_wait_all
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from layout.layout_tensor import copy_dram_to_sram_async
 from std.sys import argv, info
 from std.testing import assert_true, assert_almost_equal
@@ -21,7 +23,8 @@
 comptime SIZE = 1024  # Image size (1D for simplicity)
 comptime BLOCKS_PER_GRID = (4, 1)
 comptime THREADS_PER_BLOCK = (TPB, 1)
 comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)

 # Multi-stage processing configuration
 comptime STAGE1_THREADS = TPB // 2
@@ -29,11 +32,9 @@
 comptime STAGE2_THREADS = TPB // 2
 comptime BLUR_RADIUS = 2

-def multi_stage_image_blur_pipeline[
-    layout: Layout
-](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def multi_stage_image_blur_pipeline(
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
     size: Int,
 ):
     """Multi-stage image blur pipeline with barrier coordination.
@@ -44,18 +45,12 @@
     """
     # Shared memory buffers for pipeline stages
-    var input_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var blur_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var input_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB]())
+    var blur_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB]())

     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
@@ -88,11 +83,9 @@
 comptime STENCIL_ITERATIONS = 3
 comptime BUFFER_COUNT = 2

-def double_buffered_stencil_computation[
-    layout: Layout
-](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def double_buffered_stencil_computation(
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
     size: Int,
 ):
     """Double-buffered stencil computation with memory barrier coordination.
@@ -102,38 +95,23 @@
     """
     # Double-buffering: Two shared memory buffers
-    var buffer_A = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var buffer_B = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var buffer_A = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB]())
+    var buffer_B = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB]())

     # Memory barriers for coordinating buffer swaps
-    var init_barrier = LayoutTensor[
-        DType.uint64,
-        Layout.row_major(1),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var iter_barrier = LayoutTensor[
-        DType.uint64,
-        Layout.row_major(1),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var final_barrier = LayoutTensor[
-        DType.uint64,
-        Layout.row_major(1),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var init_barrier = stack_allocation[
+        dtype=DType.uint64, address_space=AddressSpace.SHARED
+    ](row_major[1]())
+    var iter_barrier = stack_allocation[
+        dtype=DType.uint64, address_space=AddressSpace.SHARED
+    ](row_major[1]())
+    var final_barrier = stack_allocation[
+        dtype=DType.uint64, address_space=AddressSpace.SHARED
+    ](row_major[1]())

     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
@@ -205,11 +183,13 @@ def test_multi_stage_pipeline() raises:
             # Create a simple wave pattern for blurring
             inp_host[i] = Scalar[dtype](i % 10) + Scalar[dtype](i) / 100.0

-        # Create LayoutTensors
-        var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-        var inp_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
+        # Create TileTensors
+        var out_tensor = TileTensor(out, layout)
+        var inp_tensor = TileTensor[
+            mut=False, dtype, LayoutType, ImmutAnyOrigin
+        ](inp, layout)

-        comptime kernel = multi_stage_image_blur_pipeline[layout]
+        comptime kernel = multi_stage_image_blur_pipeline
         ctx.enqueue_function[kernel, kernel](
             out_tensor,
             inp_tensor,
@@ -267,11 +247,13 @@ def test_double_buffered_stencil() raises:
             # Create a step pattern that will be smoothed by stencil
             inp_host[i] = Scalar[dtype](1.0 if i % 20 < 10 else 0.0)

-        # Create LayoutTensors for Puzzle 29B
-        var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-        var inp_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
+        # Create TileTensors for Puzzle 29B
+        var out_tensor = TileTensor(out, layout)
+        var inp_tensor = TileTensor[
+            mut=False, dtype, LayoutType, ImmutAnyOrigin
+        ](inp, layout)

-        comptime kernel = double_buffered_stencil_computation[layout]
+        comptime kernel = double_buffered_stencil_computation
         ctx.enqueue_function[kernel, kernel](
             out_tensor,
             inp_tensor,
diff --git a/problems/p30/p30.mojo b/problems/p30/p30.mojo
index 57ca41dd..4cee7205 100644
--- a/problems/p30/p30.mojo
+++ b/problems/p30/p30.mojo
@@ -1,6 +1,7 @@
 from std.gpu import thread_idx, block_dim, block_idx
 from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
 from std.sys import argv
 from std.testing import assert_almost_equal
 from std.benchmark import Bench, BenchConfig, Bencher, BenchId, keep
@@ -12,16 +13,15 @@ comptime BLOCKS_PER_GRID = (
     1,
 )  # Enough blocks to cover all elements
 comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)

 # ANCHOR: kernel1
-def kernel1[
-    layout: Layout
-](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def kernel1(
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
     size: Int,
 ):
     var i = block_dim.x * block_idx.x + thread_idx.x
@@ -33,12 +33,10 @@

 # ANCHOR: kernel2
-def kernel2[
-    layout: Layout
-](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def kernel2(
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
     size: Int,
 ):
     var tid = block_idx.x * block_dim.x + thread_idx.x
@@ -54,12 +52,10 @@

 # ANCHOR: kernel3
-def kernel3[
-    layout: Layout
-](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def kernel3(
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
     size: Int,
 ):
     var tid = block_idx.x * block_dim.x + thread_idx.x
@@ -81,7 +77,8 @@ def benchmark_kernel1_parameterized[test_size: Int](mut b: Bencher) raises:
     @parameter
     @always_inline
     def kernel1_workflow(ctx: DeviceContext) raises:
-        comptime layout = Layout.row_major(test_size)
+        comptime layout = row_major[test_size]()
+        comptime LayoutType = type_of(layout)
         var out = ctx.enqueue_create_buffer[dtype](test_size)
         out.enqueue_fill(0)
         var a = ctx.enqueue_create_buffer[dtype](test_size)
@@ -94,11 +91,11 @@ def benchmark_kernel1_parameterized[test_size: Int](mut b: Bencher) raises:
             a_host[i] = Scalar[dtype](i + 1)
             b_host[i] = Scalar[dtype](i + 2)

-        var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-        var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
-        var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b_buf)
+        var out_tensor = TileTensor(out, layout)
+        var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+        var b_tensor = TileTensor[mut=False, dtype, LayoutType](b_buf, layout)

-        ctx.enqueue_function[kernel1[layout], kernel1[layout]](
+        ctx.enqueue_function[kernel1, kernel1](
             out_tensor,
             a_tensor,
             b_tensor,
@@ -119,7 +116,8 @@ def benchmark_kernel2_parameterized[test_size: Int](mut b: Bencher) raises:
     @parameter
     @always_inline
     def kernel2_workflow(ctx: DeviceContext) raises:
-        comptime layout = Layout.row_major(test_size)
+        comptime layout = row_major[test_size]()
+        comptime LayoutType = type_of(layout)
         var out = ctx.enqueue_create_buffer[dtype](test_size)
         out.enqueue_fill(0)
         var a = ctx.enqueue_create_buffer[dtype](test_size)
@@ -132,11 +130,11 @@ def benchmark_kernel2_parameterized[test_size: Int](mut b: Bencher) raises:
             a_host[i] = Scalar[dtype](i + 1)
             b_host[i] = Scalar[dtype](i + 2)

-        var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-        var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
-        var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b_buf)
+        var out_tensor = TileTensor(out, layout)
+        var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+        var b_tensor = TileTensor[mut=False, dtype, LayoutType](b_buf, layout)

-        ctx.enqueue_function[kernel2[layout], kernel2[layout]](
+        ctx.enqueue_function[kernel2, kernel2](
             out_tensor,
             a_tensor,
             b_tensor,
@@ -157,7 +155,8 @@ def benchmark_kernel3_parameterized[test_size: Int](mut b: Bencher) raises:
     @parameter
     @always_inline
     def kernel3_workflow(ctx: DeviceContext) raises:
-        comptime layout = Layout.row_major(test_size)
+        comptime layout = row_major[test_size]()
+        comptime LayoutType = type_of(layout)
         var out = ctx.enqueue_create_buffer[dtype](test_size)
         out.enqueue_fill(0)
         var a = ctx.enqueue_create_buffer[dtype](test_size)
@@ -170,11 +169,11 @@ def benchmark_kernel3_parameterized[test_size: Int](mut b: Bencher) raises:
             a_host[i] = Scalar[dtype](i + 1)
             b_host[i] = Scalar[dtype](i + 2)

-        var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-        var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
-        var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b_buf)
+        var out_tensor = TileTensor(out, layout)
+        var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+        var b_tensor = TileTensor[mut=False, dtype, LayoutType](b_buf, layout)

-        ctx.enqueue_function[kernel3[layout], kernel3[layout]](
+        ctx.enqueue_function[kernel3, kernel3](
             out_tensor,
             a_tensor,
             b_tensor,
@@ -206,12 +205,12 @@ def test_kernel1() raises:
             a_host[i] = Scalar[dtype](i + 1)
             b_host[i] = Scalar[dtype](i + 2)

-        # Create LayoutTensors
-        var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-        var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
-        var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b)
+        # Create TileTensors
+        var out_tensor = TileTensor(out, layout)
+        var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+        var b_tensor = TileTensor[mut=False, dtype, LayoutType](b, layout)

-        ctx.enqueue_function[kernel1[layout], kernel1[layout]](
+        ctx.enqueue_function[kernel1, kernel1](
             out_tensor,
             a_tensor,
             b_tensor,
@@ -249,12 +248,12 @@ def test_kernel2() raises:
             a_host[i] = Scalar[dtype](i + 1)
             b_host[i] = Scalar[dtype](i + 2)

-        # Create LayoutTensors
-        var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-        var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
-        var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b)
+        # Create TileTensors
+        var out_tensor = TileTensor(out, layout)
+        var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+        var b_tensor = TileTensor[mut=False, dtype, LayoutType](b, layout)

-        ctx.enqueue_function[kernel2[layout], kernel2[layout]](
+        ctx.enqueue_function[kernel2, kernel2](
             out_tensor,
             a_tensor,
             b_tensor,
@@ -295,12 +294,12 @@ def test_kernel3() raises:
             a_host[i] = Scalar[dtype](i + 1)
             b_host[i] = Scalar[dtype](i + 2)

-        # Create LayoutTensors
-        var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-        var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
-        var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b)
+        # Create TileTensors
+        var out_tensor = TileTensor(out, layout)
+        var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+        var b_tensor = TileTensor[mut=False, dtype, LayoutType](b, layout)

-        ctx.enqueue_function[kernel3[layout], kernel3[layout]](
+        ctx.enqueue_function[kernel3, kernel3](
             out_tensor,
             a_tensor,
             b_tensor,
diff --git a/problems/p31/p31.mojo b/problems/p31/p31.mojo
index d70f583b..36930648 100644
--- a/problems/p31/p31.mojo
+++ b/problems/p31/p31.mojo
@@ -1,7 +1,9 @@
 from std.gpu import thread_idx, block_dim, block_idx, barrier
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from std.sys import argv
 from std.testing import assert_almost_equal
 from std.benchmark import Bench, BenchConfig, Bencher, BenchId, keep
@@ -11,15 +13,14 @@
 comptime SIZE = 32 * 1024 * 1024  # 32M elements - larger workload to show occup
 comptime THREADS_PER_BLOCK = (1024, 1)
 comptime BLOCKS_PER_GRID = (SIZE // 1024, 1)
 comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
 comptime ALPHA = Scalar[dtype](2.5)  # SAXPY coefficient

-def minimal_kernel[
-    layout: Layout
-](
-    y: LayoutTensor[dtype, layout, MutAnyOrigin],
-    x: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def minimal_kernel(
+    y: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    x: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
     alpha: Float32,
     size: Int,
 ):
@@ -35,23 +36,20 @@

 # ANCHOR: sophisticated_kernel
-def sophisticated_kernel[
-    layout: Layout
-](
-    y: LayoutTensor[dtype, layout, MutAnyOrigin],
-    x:
LayoutTensor[dtype, layout, ImmutAnyOrigin], +def sophisticated_kernel( + y: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + x: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], alpha: Float32, size: Int, ): """Sophisticated SAXPY kernel - over-engineered with excessive resource usage. """ # Maximum shared memory allocation (close to 48KB limit) - var shared_cache = LayoutTensor[ - dtype, - Layout.row_major(1024 * 12), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() # 48KB + var shared_cache = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ]( + row_major[1024 * 12]() + ) # 48KB var i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x @@ -132,23 +130,20 @@ def sophisticated_kernel[ # ANCHOR: balanced_kernel -def balanced_kernel[ - layout: Layout -]( - y: LayoutTensor[dtype, layout, MutAnyOrigin], - x: LayoutTensor[dtype, layout, ImmutAnyOrigin], +def balanced_kernel( + y: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + x: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], alpha: Float32, size: Int, ): """Balanced SAXPY kernel - efficient optimization with moderate resources. 
""" # Reasonable shared memory usage for effective caching (16KB) - var shared_cache = LayoutTensor[ - dtype, - Layout.row_major(1024 * 4), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() # 16KB total + var shared_cache = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ]( + row_major[1024 * 4]() + ) # 16KB total var i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x @@ -195,7 +190,8 @@ def benchmark_minimal_parameterized[test_size: Int](mut b: Bencher) raises: @parameter @always_inline def minimal_workflow(ctx: DeviceContext) raises: - comptime layout = Layout.row_major(test_size) + comptime layout = row_major[test_size]() + comptime LayoutType = type_of(layout) var y = ctx.enqueue_create_buffer[dtype](test_size) y.enqueue_fill(0) var x = ctx.enqueue_create_buffer[dtype](test_size) @@ -206,10 +202,10 @@ def benchmark_minimal_parameterized[test_size: Int](mut b: Bencher) raises: x_host[i] = Scalar[dtype](i + 1) y_host[i] = Scalar[dtype](i + 2) - var y_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](y) - var x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x) + var y_tensor = TileTensor(y, layout) + var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout) - comptime kernel = minimal_kernel[layout] + comptime kernel = minimal_kernel ctx.enqueue_function[kernel, kernel]( y_tensor, x_tensor, @@ -233,7 +229,8 @@ def benchmark_sophisticated_parameterized[ @parameter @always_inline def sophisticated_workflow(ctx: DeviceContext) raises: - comptime layout = Layout.row_major(test_size) + comptime layout = row_major[test_size]() + comptime LayoutType = type_of(layout) var y = ctx.enqueue_create_buffer[dtype](test_size) y.enqueue_fill(0) var x = ctx.enqueue_create_buffer[dtype](test_size) @@ -244,10 +241,10 @@ def benchmark_sophisticated_parameterized[ x_host[i] = Scalar[dtype](i + 1) y_host[i] = Scalar[dtype](i + 2) - var y_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](y) - var 
x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x) + var y_tensor = TileTensor(y, layout) + var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout) - comptime kernel = sophisticated_kernel[layout] + comptime kernel = sophisticated_kernel ctx.enqueue_function[kernel, kernel]( y_tensor, x_tensor, @@ -269,7 +266,8 @@ def benchmark_balanced_parameterized[test_size: Int](mut b: Bencher) raises: @parameter @always_inline def balanced_workflow(ctx: DeviceContext) raises: - comptime layout = Layout.row_major(test_size) + comptime layout = row_major[test_size]() + comptime LayoutType = type_of(layout) var y = ctx.enqueue_create_buffer[dtype](test_size) y.enqueue_fill(0) var x = ctx.enqueue_create_buffer[dtype](test_size) @@ -280,10 +278,10 @@ def benchmark_balanced_parameterized[test_size: Int](mut b: Bencher) raises: x_host[i] = Scalar[dtype](i + 1) y_host[i] = Scalar[dtype](i + 2) - var y_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](y) - var x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x) + var y_tensor = TileTensor(y, layout) + var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout) - comptime kernel = balanced_kernel[layout] + comptime kernel = balanced_kernel ctx.enqueue_function[kernel, kernel]( y_tensor, x_tensor, @@ -314,11 +312,11 @@ def test_minimal() raises: x_host[i] = Scalar[dtype](i + 1) y_host[i] = Scalar[dtype](i + 2) - # Create LayoutTensors - var y_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](y) - var x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x) + # Create TileTensors + var y_tensor = TileTensor(y, layout) + var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout) - comptime kernel = minimal_kernel[layout] + comptime kernel = minimal_kernel ctx.enqueue_function[kernel, kernel]( y_tensor, x_tensor, @@ -357,11 +355,11 @@ def test_sophisticated() raises: x_host[i] = Scalar[dtype](i + 1) y_host[i] = Scalar[dtype](i + 2) - # Create LayoutTensors - var y_tensor = LayoutTensor[dtype, 
layout, MutAnyOrigin](y) - var x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x) + # Create TileTensors + var y_tensor = TileTensor(y, layout) + var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout) - comptime kernel = sophisticated_kernel[layout] + comptime kernel = sophisticated_kernel ctx.enqueue_function[kernel, kernel]( y_tensor, x_tensor, @@ -401,11 +399,11 @@ def test_balanced() raises: x_host[i] = Scalar[dtype](i + 1) y_host[i] = Scalar[dtype](i + 2) - # Create LayoutTensors - var y_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](y) - var x_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](x) + # Create TileTensors + var y_tensor = TileTensor(y, layout) + var x_tensor = TileTensor[mut=False, dtype, LayoutType](x, layout) - comptime kernel = balanced_kernel[layout] + comptime kernel = balanced_kernel ctx.enqueue_function[kernel, kernel]( y_tensor, x_tensor, diff --git a/problems/p32/p32.mojo b/problems/p32/p32.mojo index 21e3c543..d8b406af 100644 --- a/problems/p32/p32.mojo +++ b/problems/p32/p32.mojo @@ -1,7 +1,9 @@ from std.gpu import thread_idx, block_dim, block_idx, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.sys import argv from std.testing import assert_almost_equal from std.benchmark import Bench, BenchConfig, Bencher, BenchId, keep @@ -12,14 +14,13 @@ comptime TPB = 256 # Threads per block - divisible by 32 (warp size) comptime THREADS_PER_BLOCK = (TPB, 1) comptime BLOCKS_PER_GRID = (SIZE // TPB, 1) comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE) +comptime layout = row_major[SIZE]() +comptime LayoutType = type_of(layout) -def no_conflict_kernel[ - layout: Layout -]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - input: LayoutTensor[dtype, layout, ImmutAnyOrigin], +def 
no_conflict_kernel( + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], size: Int, ): """Perfect shared memory access - no bank conflicts. @@ -29,12 +30,9 @@ def no_conflict_kernel[ """ # Shared memory buffer - each thread loads one element - var shared_buf = LayoutTensor[ - dtype, - Layout.row_major(TPB), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + var shared_buf = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[TPB]()) var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x @@ -58,26 +56,21 @@ def no_conflict_kernel[ # ANCHOR: two_way_conflict_kernel -def two_way_conflict_kernel[ - layout: Layout -]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - input: LayoutTensor[dtype, layout, ImmutAnyOrigin], +def two_way_conflict_kernel( + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], size: Int, ): """Stride-2 shared memory access - creates 2-way bank conflicts. - Threads 0,16 → Bank 0, Threads 1,17 → Bank 1, etc. + Threads 0,16 -> Bank 0, Threads 1,17 -> Bank 1, etc. Each bank serves 2 threads, doubling access time. 
""" # Shared memory buffer - stride-2 access pattern creates conflicts - var shared_buf = LayoutTensor[ - dtype, - Layout.row_major(TPB), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + var shared_buf = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[TPB]()) var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x @@ -111,7 +104,8 @@ def benchmark_no_conflict[test_size: Int](mut b: Bencher) raises: @parameter @always_inline def kernel_workflow(ctx: DeviceContext) raises: - comptime layout = Layout.row_major(test_size) + comptime layout = row_major[test_size]() + comptime LayoutType = type_of(layout) var out = ctx.enqueue_create_buffer[dtype](test_size) out.enqueue_fill(0) var input_buf = ctx.enqueue_create_buffer[dtype](test_size) @@ -121,12 +115,12 @@ def benchmark_no_conflict[test_size: Int](mut b: Bencher) raises: for i in range(test_size): input_host[i] = Scalar[dtype](i + 1) - var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) - var input_tensor = LayoutTensor[mut=False, dtype, layout]( - input_buf.unsafe_ptr() + var out_tensor = TileTensor(out, layout) + var input_tensor = TileTensor[mut=False, dtype, LayoutType]( + input_buf, layout ) - comptime kernel = no_conflict_kernel[layout] + comptime kernel = no_conflict_kernel ctx.enqueue_function[kernel, kernel]( out_tensor, input_tensor, @@ -147,7 +141,8 @@ def benchmark_two_way_conflict[test_size: Int](mut b: Bencher) raises: @parameter @always_inline def kernel_workflow(ctx: DeviceContext) raises: - comptime layout = Layout.row_major(test_size) + comptime layout = row_major[test_size]() + comptime LayoutType = type_of(layout) var out = ctx.enqueue_create_buffer[dtype](test_size) out.enqueue_fill(0) var input_buf = ctx.enqueue_create_buffer[dtype](test_size) @@ -157,12 +152,12 @@ def benchmark_two_way_conflict[test_size: Int](mut b: Bencher) raises: for i in range(test_size): input_host[i] = 
Scalar[dtype](i + 1) - var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) - var input_tensor = LayoutTensor[mut=False, dtype, layout]( - input_buf.unsafe_ptr() + var out_tensor = TileTensor(out, layout) + var input_tensor = TileTensor[mut=False, dtype, LayoutType]( + input_buf, layout ) - comptime kernel = two_way_conflict_kernel[layout] + comptime kernel = two_way_conflict_kernel ctx.enqueue_function[kernel, kernel]( out_tensor, input_tensor, @@ -189,12 +184,12 @@ def test_no_conflict() raises: for i in range(SIZE): input_host[i] = Scalar[dtype](i + 1) - var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) - var input_tensor = LayoutTensor[mut=False, dtype, layout]( - input_buf.unsafe_ptr() + var out_tensor = TileTensor(out, layout) + var input_tensor = TileTensor[mut=False, dtype, LayoutType]( + input_buf, layout ) - comptime kernel = no_conflict_kernel[layout] + comptime kernel = no_conflict_kernel ctx.enqueue_function[kernel, kernel]( out_tensor, input_tensor, @@ -223,12 +218,12 @@ def test_two_way_conflict() raises: for i in range(SIZE): input_host[i] = Scalar[dtype](i + 1) - var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) - var input_tensor = LayoutTensor[mut=False, dtype, layout]( - input_buf.unsafe_ptr() + var out_tensor = TileTensor(out, layout) + var input_tensor = TileTensor[mut=False, dtype, LayoutType]( + input_buf, layout ) - comptime kernel = two_way_conflict_kernel[layout] + comptime kernel = two_way_conflict_kernel ctx.enqueue_function[kernel, kernel]( out_tensor, input_tensor, diff --git a/problems/p33/p33.mojo b/problems/p33/p33.mojo index 9ffa00a4..1e18af61 100644 --- a/problems/p33/p33.mojo +++ b/problems/p33/p33.mojo @@ -1,7 +1,9 @@ from std.gpu import thread_idx, block_idx, block_dim, barrier, WARP_SIZE from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace, async_copy_wait_all -from layout import Layout, LayoutTensor +from layout import TileTensor 
+from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from layout.tensor_core import TensorCore from layout.layout_tensor import copy_dram_to_sram_async from std.utils import Index @@ -10,7 +12,8 @@ from std.testing import assert_equal, assert_almost_equal comptime dtype = DType.float32 comptime SIZE = 1024 -comptime layout = Layout.row_major(SIZE, SIZE) +comptime layout = row_major[SIZE, SIZE]() +comptime LayoutType = type_of(layout) comptime BLOCK_DIM_COUNT = 2 comptime TILE_SIZE = 32 @@ -23,11 +26,11 @@ comptime THREADS_PER_BLOCK_TILED = (TILE_SIZE, TILE_SIZE) # ANCHOR: matmul_idiomatic_tiled_solution def matmul_idiomatic_tiled[ - layout: Layout, size: Int + size: Int ]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], ): # Use block_dim to get actual tile size dynamically var tile_size_x = block_dim.x @@ -40,23 +43,17 @@ def matmul_idiomatic_tiled[ # Get the tile of the output matrix that this thread block is responsible for var out_tile = output.tile[TILE_SIZE, TILE_SIZE](block_idx.y, block_idx.x) - var a_shared = LayoutTensor[ - dtype, - Layout.row_major(TILE_SIZE, TILE_SIZE), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - var b_shared = LayoutTensor[ - dtype, - Layout.row_major(TILE_SIZE, TILE_SIZE), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - - var acc: output.element_type = 0 - - comptime load_a_layout = Layout.row_major(1, TILE_SIZE) # Coalesced loading - comptime load_b_layout = Layout.row_major(1, TILE_SIZE) # Coalesced loading + var a_shared = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[TILE_SIZE, TILE_SIZE]()) + var b_shared 
= stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[TILE_SIZE, TILE_SIZE]()) + + var acc: output.ElementType = 0 + + comptime load_a_layout = row_major[1, TILE_SIZE]() # Coalesced loading + comptime load_b_layout = row_major[1, TILE_SIZE]() # Coalesced loading # Note: Both matrices stored in same orientation for correct matrix multiplication # Transposed loading would be useful if B were pre-transposed in global memory @@ -121,9 +118,6 @@ comptime BLOCKS_PER_GRID_TENSOR_CORE = ( def tensor_core_matrix_multiplication[ dtype: DType, - layout_a: Layout, - layout_b: Layout, - layout_c: Layout, BM: Int, BN: Int, BK: Int, @@ -133,13 +127,13 @@ def tensor_core_matrix_multiplication[ MMA_N: Int, MMA_K: Int, ]( - A: LayoutTensor[dtype, layout_a, ImmutAnyOrigin], - B: LayoutTensor[dtype, layout_b, ImmutAnyOrigin], - C: LayoutTensor[dtype, layout_c, MutAnyOrigin], + A: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], + B: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], + C: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], ): - comptime M = C.shape[0]() - comptime N = C.shape[1]() - comptime K = A.shape[1]() + comptime M = C.dim[0]() + comptime N = C.dim[1]() + comptime K = A.dim[1]() var warp_id = thread_idx.x // WARP_SIZE var warps_in_n = BN // WN @@ -155,26 +149,17 @@ def tensor_core_matrix_multiplication[ var mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]() # Shared SRAM tiles (no padding to stay under shared memory limit) - var A_sram_tile = LayoutTensor[ - A.dtype, - Layout.row_major(BM, BK), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - var B_sram_tile = LayoutTensor[ - B.dtype, - Layout.row_major(BK, BN), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + var A_sram_tile = stack_allocation[ + dtype=A.dtype, address_space=AddressSpace.SHARED + ](row_major[BM, BK]()) + var B_sram_tile = stack_allocation[ + dtype=B.dtype, 
address_space=AddressSpace.SHARED + ](row_major[BK, BN]()) # One per-warp accumulator tile of shape [WM, WN] - var C_warp_accum = LayoutTensor[ - C.dtype, - Layout.row_major(WM, WN), - MutAnyOrigin, - address_space=AddressSpace.GENERIC, - ].stack_allocation() + var C_warp_accum = stack_allocation[ + dtype=C.dtype, address_space=AddressSpace.GENERIC + ](row_major[WM, WN]()) # Zero initialize accumulator (only for active warps) if warp_is_active: @@ -190,12 +175,12 @@ def tensor_core_matrix_multiplication[ var B_dram_tile = B.tile[BK, BN](k_i, block_idx.x) copy_dram_to_sram_async[ - thread_layout=Layout.row_major(4, 8), + thread_layout=row_major[4, 8](), num_threads=256, block_dim_count=BLOCK_DIM_COUNT, ](A_sram_tile.vectorize[1, 4](), A_dram_tile.vectorize[1, 4]()) copy_dram_to_sram_async[ - thread_layout=Layout.row_major(4, 8), + thread_layout=row_major[4, 8](), num_threads=256, block_dim_count=BLOCK_DIM_COUNT, ](B_sram_tile.vectorize[1, 4](), B_dram_tile.vectorize[1, 4]()) @@ -274,19 +259,18 @@ def main() raises: inp1_host[i * SIZE + k] * inp2_host[k * SIZE + j] ) # Create layout tensors - var out_tensor_core_layout = LayoutTensor[dtype, layout]( - out_tensor_core.unsafe_ptr() + var out_tensor_core_layout = TileTensor(out_tensor_core, layout) + var a_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin]( + inp1, layout + ) + var b_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin]( + inp2, layout ) - var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp1) - var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp2) if mode == "--tensor-core": print("\n=== Running ACTUAL Tensor Core Matrix Multiplication ===") comptime kernel = tensor_core_matrix_multiplication[ dtype, - layout, - layout, - layout, BM, BN, BK, @@ -313,12 +297,10 @@ def main() raises: # Create separate buffer for tiled result out_tiled = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) out_tiled.enqueue_fill(0) - out_tiled_layout = LayoutTensor[dtype, layout]( 
- out_tiled.unsafe_ptr() - ) + out_tiled_layout = TileTensor(out_tiled, layout) # Run idiomatic tiled version with proper 2D block configuration - comptime kernel = matmul_idiomatic_tiled[layout, SIZE] + comptime kernel = matmul_idiomatic_tiled[SIZE] ctx.enqueue_function[kernel, kernel]( out_tiled_layout, a_tensor, @@ -341,9 +323,6 @@ def main() raises: print("\n--- Test 1: Tensor Core vs CPU Reference ---") comptime kernel = tensor_core_matrix_multiplication[ dtype, - layout, - layout, - layout, BM, BN, BK, @@ -420,11 +399,9 @@ def main() raises: print("\n--- Test 2: Idiomatic Tiled vs CPU Reference ---") out_tiled = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) out_tiled.enqueue_fill(0) - out_tiled_layout = LayoutTensor[dtype, layout]( - out_tiled.unsafe_ptr() - ) + out_tiled_layout = TileTensor(out_tiled, layout) - comptime kernel2 = matmul_idiomatic_tiled[layout, SIZE] + comptime kernel2 = matmul_idiomatic_tiled[SIZE] ctx.enqueue_function[kernel2, kernel2]( out_tiled_layout, a_tensor, diff --git a/problems/p34/p34.mojo b/problems/p34/p34.mojo index 373f9c13..71c5c9af 100644 --- a/problems/p34/p34.mojo +++ b/problems/p34/p34.mojo @@ -8,7 +8,9 @@ from std.gpu.primitives.cluster import ( elect_one_sync, ) from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.sys import argv from std.testing import assert_equal, assert_almost_equal, assert_true @@ -16,16 +18,20 @@ comptime SIZE = 1024 comptime TPB = 256 comptime CLUSTER_SIZE = 4 comptime dtype = DType.float32 -comptime in_layout = Layout.row_major(SIZE) -comptime out_layout = Layout.row_major(1) +comptime in_layout = row_major[SIZE]() +comptime InLayoutType = type_of(in_layout) +comptime out_layout = row_major[1]() +comptime OutLayoutType = type_of(out_layout) +comptime cluster_layout = row_major[CLUSTER_SIZE]() +comptime ClusterLayoutType = 
type_of(cluster_layout) # ANCHOR: cluster_coordination_basics def cluster_coordination_basics[ - in_layout: Layout, out_layout: Layout, tpb: Int + tpb: Int ]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, ClusterLayoutType, MutAnyOrigin], + input: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin], size: Int, ): """Real cluster coordination using SM90+ cluster APIs.""" @@ -36,12 +42,9 @@ def cluster_coordination_basics[ var my_block_rank = Int(block_rank_in_cluster()) var block_id = block_idx.x - var shared_data = LayoutTensor[ - dtype, - Layout.row_major(tpb), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + var shared_data = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[tpb]()) # FIX: Use block_idx.x for data distribution instead of cluster rank # Each block should process different portions of the data @@ -77,13 +80,11 @@ def cluster_coordination_basics[ # ANCHOR: cluster_collective_operations def cluster_collective_operations[ - in_layout: Layout, out_layout: Layout, tpb: Int + tpb: Int ]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - temp_storage: LayoutTensor[ - dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin - ], + output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin], + input: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin], + temp_storage: TileTensor[mut=True, dtype, ClusterLayoutType, MutAnyOrigin], size: Int, ): """Cluster-wide collective operations using real cluster APIs.""" @@ -98,10 +99,10 @@ def cluster_collective_operations[ # ANCHOR: advanced_cluster_patterns def advanced_cluster_patterns[ - in_layout: Layout, out_layout: Layout, tpb: Int + tpb: Int ]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], + output: 
TileTensor[mut=True, dtype, ClusterLayoutType, MutAnyOrigin], + input: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin], size: Int, ): """Advanced cluster programming using cluster masks and relaxed synchronization. @@ -135,16 +136,12 @@ def main() raises: for i in range(SIZE): input_host[i] = Scalar[dtype](i % 10) * 0.1 - input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin]( - input_buf - ) - output_tensor = LayoutTensor[ - dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin - ](output_buf) + input_tensor = TileTensor[ + mut=False, dtype, InLayoutType, ImmutAnyOrigin + ](input_buf, in_layout) + output_tensor = TileTensor(output_buf, cluster_layout) - comptime kernel = cluster_coordination_basics[ - in_layout, Layout.row_major(CLUSTER_SIZE), TPB - ] + comptime kernel = cluster_coordination_basics[TPB] ctx.enqueue_function[kernel, kernel]( output_tensor, input_tensor, @@ -199,19 +196,13 @@ def main() raises: print("Expected sum:", expected_sum) - input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin]( - input_buf - ) - var output_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin]( - output_buf - ) - var temp_tensor = LayoutTensor[ - dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin - ](temp_buf) + input_tensor = TileTensor[ + mut=False, dtype, InLayoutType, ImmutAnyOrigin + ](input_buf, in_layout) + var output_tensor = TileTensor(output_buf, out_layout) + var temp_tensor = TileTensor(temp_buf, cluster_layout) - comptime kernel = cluster_collective_operations[ - in_layout, out_layout, TPB - ] + comptime kernel = cluster_collective_operations[TPB] ctx.enqueue_function[kernel, kernel]( output_tensor, input_tensor, @@ -251,16 +242,12 @@ def main() raises: Scalar[dtype](i % 50) * 0.02 ) # Pattern for testing - input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin]( - input_buf - ) - output_tensor = LayoutTensor[ - dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin - ](output_buf) + input_tensor = TileTensor[ + mut=False, dtype, 
InLayoutType, ImmutAnyOrigin + ](input_buf, in_layout) + output_tensor = TileTensor(output_buf, cluster_layout) - comptime kernel = advanced_cluster_patterns[ - in_layout, Layout.row_major(CLUSTER_SIZE), TPB - ] + comptime kernel = advanced_cluster_patterns[TPB] ctx.enqueue_function[kernel, kernel]( output_tensor, input_tensor, diff --git a/solutions/p04/p04_layout_tensor.mojo b/solutions/p04/p04_tile_tensor.mojo similarity index 70% rename from solutions/p04/p04_layout_tensor.mojo rename to solutions/p04/p04_tile_tensor.mojo index 394b7a26..c47d4b94 100644 --- a/solutions/p04/p04_layout_tensor.mojo +++ b/solutions/p04/p04_tile_tensor.mojo @@ -1,19 +1,21 @@ from std.gpu import thread_idx from std.gpu.host import DeviceContext -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major from std.testing import assert_equal comptime SIZE = 2 comptime BLOCKS_PER_GRID = 1 comptime THREADS_PER_BLOCK = (3, 3) comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE, SIZE) +comptime layout = row_major[SIZE, SIZE]() +comptime LayoutType = type_of(layout) -# ANCHOR: add_10_2d_layout_tensor_solution +# ANCHOR: add_10_2d_tile_tensor_solution def add_10_2d( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], size: Int, ): var row = thread_idx.y @@ -22,17 +24,15 @@ def add_10_2d( output[row, col] = a[row, col] + 10.0 -# ANCHOR_END: add_10_2d_layout_tensor_solution +# ANCHOR_END: add_10_2d_tile_tensor_solution def main() raises: with DeviceContext() as ctx: var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) out_buf.enqueue_fill(0) - var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin]( - out_buf - ).reshape[layout]() - print("out shape:", out_tensor.shape[0](), "x", out_tensor.shape[1]()) + var out_tensor = TileTensor(out_buf, 
layout) + print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]()) var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) expected.enqueue_fill(0) @@ -44,9 +44,7 @@ def main() raises: a_host[i] = Scalar[dtype](i) expected[i] = a_host[i] + 10 - var a_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](a).reshape[ - layout - ]() + var a_tensor = TileTensor(a, layout) ctx.enqueue_function[add_10_2d, add_10_2d]( out_tensor, diff --git a/solutions/p05/p05.mojo b/solutions/p05/p05.mojo index eeb61d5d..224a57e5 100644 --- a/solutions/p05/p05.mojo +++ b/solutions/p05/p05.mojo @@ -1,25 +1,32 @@ -from std.memory import UnsafePointer from std.gpu import thread_idx from std.gpu.host import DeviceContext +from layout import TileTensor +from layout.tile_layout import row_major from std.testing import assert_equal comptime SIZE = 2 comptime BLOCKS_PER_GRID = 1 comptime THREADS_PER_BLOCK = (3, 3) comptime dtype = DType.float32 +comptime out_layout = row_major[SIZE, SIZE]() +comptime a_layout = row_major[1, SIZE]() +comptime b_layout = row_major[SIZE, 1]() +comptime OutLayout = type_of(out_layout) +comptime ALayout = type_of(a_layout) +comptime BLayout = type_of(b_layout) # ANCHOR: broadcast_add_solution def broadcast_add( - output: UnsafePointer[Scalar[dtype], MutAnyOrigin], - a: UnsafePointer[Scalar[dtype], MutAnyOrigin], - b: UnsafePointer[Scalar[dtype], MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, BLayout, ImmutAnyOrigin], size: Int, ): var row = thread_idx.y var col = thread_idx.x if row < size and col < size: - output[row * size + col] = a[col] + b[row] + output[row, col] = a[0, col] + b[row, 0] # ANCHOR_END: broadcast_add_solution @@ -27,10 +34,15 @@ def broadcast_add( def main() raises: with DeviceContext() as ctx: - var out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) - out.enqueue_fill(0) - var expected = 
ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) - expected.enqueue_fill(0) + var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) + out_buf.enqueue_fill(0) + var out_tensor = TileTensor(out_buf, out_layout) + print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]()) + + var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) + expected_buf.enqueue_fill(0) + var expected_tensor = TileTensor(expected_buf, out_layout) + var a = ctx.enqueue_create_buffer[dtype](SIZE) a.enqueue_fill(0) var b = ctx.enqueue_create_buffer[dtype](SIZE) @@ -40,14 +52,17 @@ def main() raises: a_host[i] = Scalar[dtype](i + 1) b_host[i] = Scalar[dtype](i * 10) - for y in range(SIZE): - for x in range(SIZE): - expected[y * SIZE + x] = a_host[x] + b_host[y] + for i in range(SIZE): + for j in range(SIZE): + expected_tensor[i, j] = a_host[j] + b_host[i] + + var a_tensor = TileTensor[mut=False, dtype, ALayout](a, a_layout) + var b_tensor = TileTensor[mut=False, dtype, BLayout](b, b_layout) ctx.enqueue_function[broadcast_add, broadcast_add]( - out, - a, - b, + out_tensor, + a_tensor, + b_tensor, SIZE, grid_dim=BLOCKS_PER_GRID, block_dim=THREADS_PER_BLOCK, @@ -55,10 +70,12 @@ def main() raises: ctx.synchronize() - with out.map_to_host() as out_host: - print("out:", out_host) - print("expected:", expected) - for y in range(SIZE): - for x in range(SIZE): - assert_equal(out_host[y * SIZE + x], expected[y * SIZE + x]) + with out_buf.map_to_host() as out_buf_host: + print("out:", out_buf_host) + print("expected:", expected_buf) + for i in range(SIZE): + for j in range(SIZE): + assert_equal( + out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j] + ) print("Puzzle 05 complete โœ…") diff --git a/solutions/p05/p05_layout_tensor.mojo b/solutions/p05/p05_layout_tensor.mojo deleted file mode 100644 index 3573c21d..00000000 --- a/solutions/p05/p05_layout_tensor.mojo +++ /dev/null @@ -1,84 +0,0 @@ -from std.gpu import thread_idx -from std.gpu.host import DeviceContext -from layout 
import Layout, LayoutTensor -from std.testing import assert_equal - -comptime SIZE = 2 -comptime BLOCKS_PER_GRID = 1 -comptime THREADS_PER_BLOCK = (3, 3) -comptime dtype = DType.float32 -comptime out_layout = Layout.row_major(SIZE, SIZE) -comptime a_layout = Layout.row_major(1, SIZE) -comptime b_layout = Layout.row_major(SIZE, 1) - - -# ANCHOR: broadcast_add_layout_tensor_solution -def broadcast_add[ - out_layout: Layout, - a_layout: Layout, - b_layout: Layout, -]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, b_layout, ImmutAnyOrigin], - size: Int, -): - var row = thread_idx.y - var col = thread_idx.x - if row < size and col < size: - output[row, col] = a[0, col] + b[row, 0] - - -# ANCHOR_END: broadcast_add_layout_tensor_solution - - -def main() raises: - with DeviceContext() as ctx: - var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) - out_buf.enqueue_fill(0) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf) - print("out shape:", out_tensor.shape[0](), "x", out_tensor.shape[1]()) - - var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) - expected_buf.enqueue_fill(0) - var expected_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin]( - expected_buf - ) - - var a = ctx.enqueue_create_buffer[dtype](SIZE) - a.enqueue_fill(0) - var b = ctx.enqueue_create_buffer[dtype](SIZE) - b.enqueue_fill(0) - with a.map_to_host() as a_host, b.map_to_host() as b_host: - for i in range(SIZE): - a_host[i] = Scalar[dtype](i + 1) - b_host[i] = Scalar[dtype](i * 10) - - for i in range(SIZE): - for j in range(SIZE): - expected_tensor[i, j] = a_host[j] + b_host[i] - - var a_tensor = LayoutTensor[dtype, a_layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, b_layout, ImmutAnyOrigin](b) - - comptime kernel = broadcast_add[out_layout, a_layout, b_layout] - ctx.enqueue_function[kernel, kernel]( - out_tensor, - a_tensor, - b_tensor, - SIZE, - 
grid_dim=BLOCKS_PER_GRID, - block_dim=THREADS_PER_BLOCK, - ) - - ctx.synchronize() - - with out_buf.map_to_host() as out_buf_host: - print("out:", out_buf_host) - print("expected:", expected_buf) - for i in range(SIZE): - for j in range(SIZE): - assert_equal( - out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j] - ) - print("Puzzle 05 complete โœ…") diff --git a/solutions/p07/p07.mojo b/solutions/p07/p07.mojo index 35639053..9523bb37 100644 --- a/solutions/p07/p07.mojo +++ b/solutions/p07/p07.mojo @@ -1,24 +1,29 @@ -from std.memory import UnsafePointer from std.gpu import thread_idx, block_idx, block_dim from std.gpu.host import DeviceContext +from layout import TileTensor +from layout.tile_layout import row_major from std.testing import assert_equal comptime SIZE = 5 comptime BLOCKS_PER_GRID = (2, 2) comptime THREADS_PER_BLOCK = (3, 3) comptime dtype = DType.float32 +comptime out_layout = row_major[SIZE, SIZE]() +comptime a_layout = row_major[SIZE, SIZE]() +comptime OutLayout = type_of(out_layout) +comptime ALayout = type_of(a_layout) # ANCHOR: add_10_blocks_2d_solution def add_10_blocks_2d( - output: UnsafePointer[Scalar[dtype], MutAnyOrigin], - a: UnsafePointer[Scalar[dtype], MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin], size: Int, ): var row = block_dim.y * block_idx.y + thread_idx.y var col = block_dim.x * block_idx.x + thread_idx.x if row < size and col < size: - output[row * size + col] = a[row * size + col] + 10.0 + output[row, col] = a[row, col] + 10.0 # ANCHOR_END: add_10_blocks_2d_solution @@ -26,10 +31,13 @@ def add_10_blocks_2d( def main() raises: with DeviceContext() as ctx: - var out = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) - out.enqueue_fill(0) - var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) - expected.enqueue_fill(1) + var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) + out_buf.enqueue_fill(0) + var out_tensor = 
TileTensor(out_buf, out_layout) + + var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) + expected_buf.enqueue_fill(1) + var a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) a.enqueue_fill(1) @@ -38,11 +46,13 @@ def main() raises: for i in range(SIZE): var k = j * SIZE + i a_host[k] = Scalar[dtype](k) - expected[k] = Scalar[dtype](k + 10) + expected_buf[k] = Scalar[dtype](k + 10) + + var a_tensor = TileTensor[mut=False, dtype, ALayout](a, a_layout) ctx.enqueue_function[add_10_blocks_2d, add_10_blocks_2d]( - out, - a, + out_tensor, + a_tensor, SIZE, grid_dim=BLOCKS_PER_GRID, block_dim=THREADS_PER_BLOCK, @@ -50,10 +60,17 @@ def main() raises: ctx.synchronize() - with out.map_to_host() as out_host: - print("out:", out_host) - print("expected:", expected) + var expected_tensor = TileTensor(expected_buf, out_layout) + + with out_buf.map_to_host() as out_buf_host: + print( + "out:", + TileTensor(out_buf_host, out_layout), + ) + print("expected:", expected_tensor) for i in range(SIZE): for j in range(SIZE): - assert_equal(out_host[i * SIZE + j], expected[i * SIZE + j]) + assert_equal( + out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j] + ) print("Puzzle 07 complete โœ…") diff --git a/solutions/p07/p07_layout_tensor.mojo b/solutions/p07/p07_layout_tensor.mojo deleted file mode 100644 index 2f1b397f..00000000 --- a/solutions/p07/p07_layout_tensor.mojo +++ /dev/null @@ -1,79 +0,0 @@ -from std.gpu import thread_idx, block_idx, block_dim -from std.gpu.host import DeviceContext -from layout import Layout, LayoutTensor -from std.testing import assert_equal - -comptime SIZE = 5 -comptime BLOCKS_PER_GRID = (2, 2) -comptime THREADS_PER_BLOCK = (3, 3) -comptime dtype = DType.float32 -comptime out_layout = Layout.row_major(SIZE, SIZE) -comptime a_layout = Layout.row_major(SIZE, SIZE) - - -# ANCHOR: add_10_blocks_2d_layout_tensor_solution -def add_10_blocks_2d[ - out_layout: Layout, - a_layout: Layout, -]( - output: LayoutTensor[dtype, out_layout, 
MutAnyOrigin], - a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin], - size: Int, -): - var row = block_dim.y * block_idx.y + thread_idx.y - var col = block_dim.x * block_idx.x + thread_idx.x - if row < size and col < size: - output[row, col] = a[row, col] + 10.0 - - -# ANCHOR_END: add_10_blocks_2d_layout_tensor_solution - - -def main() raises: - with DeviceContext() as ctx: - var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) - out_buf.enqueue_fill(0) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf) - - var expected_buf = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) - expected_buf.enqueue_fill(1) - - var a = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) - a.enqueue_fill(1) - - with a.map_to_host() as a_host: - for j in range(SIZE): - for i in range(SIZE): - var k = j * SIZE + i - a_host[k] = Scalar[dtype](k) - expected_buf[k] = Scalar[dtype](k + 10) - - var a_tensor = LayoutTensor[dtype, a_layout, ImmutAnyOrigin](a) - - comptime kernel = add_10_blocks_2d[out_layout, a_layout] - ctx.enqueue_function[kernel, kernel]( - out_tensor, - a_tensor, - SIZE, - grid_dim=BLOCKS_PER_GRID, - block_dim=THREADS_PER_BLOCK, - ) - - ctx.synchronize() - - var expected_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin]( - expected_buf - ) - - with out_buf.map_to_host() as out_buf_host: - print( - "out:", - LayoutTensor[dtype, out_layout, MutAnyOrigin](out_buf_host), - ) - print("expected:", expected_tensor) - for i in range(SIZE): - for j in range(SIZE): - assert_equal( - out_buf_host[i * SIZE + j], expected_buf[i * SIZE + j] - ) - print("Puzzle 07 complete โœ…") diff --git a/solutions/p08/p08.mojo b/solutions/p08/p08.mojo index 3349960c..035d744a 100644 --- a/solutions/p08/p08.mojo +++ b/solutions/p08/p08.mojo @@ -1,7 +1,9 @@ -from std.memory import UnsafePointer, stack_allocation from std.gpu import thread_idx, block_idx, block_dim, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace +from layout import 
TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.testing import assert_equal comptime TPB = 4 @@ -9,33 +11,33 @@ comptime SIZE = 8 comptime BLOCKS_PER_GRID = (2, 1) comptime THREADS_PER_BLOCK = (TPB, 1) comptime dtype = DType.float32 +comptime layout = row_major[SIZE]() +comptime LayoutType = type_of(layout) # ANCHOR: add_10_shared_solution -def add_10_shared( - output: UnsafePointer[Scalar[dtype], MutAnyOrigin], - a: UnsafePointer[Scalar[dtype], MutAnyOrigin], +def add_10_shared_tile_tensor( + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], size: Int, ): + # Allocate shared memory using stack_allocation var shared = stack_allocation[ - TPB, - Scalar[dtype], - address_space=AddressSpace.SHARED, - ]() + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[TPB]()) + var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x - # Load local data into shared memory + if global_i < size: shared[local_i] = a[global_i] - # Wait for all threads to complete (works within a thread block). # Note: barrier is not strictly needed here since each thread only accesses # its own shared memory location. However, it's included to teach proper # shared memory synchronization patterns for more complex scenarios where # threads need to coordinate access to shared data. 
barrier() - # process using shared memory if global_i < size: output[global_i] = shared[local_i] + 10 @@ -49,9 +51,15 @@ def main() raises: out.enqueue_fill(0) var a = ctx.enqueue_create_buffer[dtype](SIZE) a.enqueue_fill(1) - ctx.enqueue_function[add_10_shared, add_10_shared]( - out, - a, + + var out_tensor = TileTensor(out, layout) + var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout) + + ctx.enqueue_function[ + add_10_shared_tile_tensor, add_10_shared_tile_tensor + ]( + out_tensor, + a_tensor, SIZE, grid_dim=BLOCKS_PER_GRID, block_dim=THREADS_PER_BLOCK, @@ -59,7 +67,6 @@ def main() raises: var expected = ctx.enqueue_create_host_buffer[dtype](SIZE) expected.enqueue_fill(11) - ctx.synchronize() with out.map_to_host() as out_host: diff --git a/solutions/p08/p08_layout_tensor.mojo b/solutions/p08/p08_layout_tensor.mojo deleted file mode 100644 index 9864fe8e..00000000 --- a/solutions/p08/p08_layout_tensor.mojo +++ /dev/null @@ -1,78 +0,0 @@ -from std.gpu import thread_idx, block_idx, block_dim, barrier -from std.gpu.host import DeviceContext -from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor -from std.testing import assert_equal - -comptime TPB = 4 -comptime SIZE = 8 -comptime BLOCKS_PER_GRID = (2, 1) -comptime THREADS_PER_BLOCK = (TPB, 1) -comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE) - - -# ANCHOR: add_10_shared_layout_tensor_solution -def add_10_shared_layout_tensor[ - layout: Layout -]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, ImmutAnyOrigin], - size: Int, -): - # Allocate shared memory using tensor builder - var shared = LayoutTensor[ - dtype, - Layout.row_major(TPB), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - - var global_i = block_dim.x * block_idx.x + thread_idx.x - var local_i = thread_idx.x - - if global_i < size: - shared[local_i] = a[global_i] - - # Note: barrier is not strictly needed here since each 
thread only accesses - # its own shared memory location. However, it's included to teach proper - # shared memory synchronization patterns for more complex scenarios where - # threads need to coordinate access to shared data. - barrier() - - if global_i < size: - output[global_i] = shared[local_i] + 10 - - -# ANCHOR_END: add_10_shared_layout_tensor_solution - - -def main() raises: - with DeviceContext() as ctx: - var out = ctx.enqueue_create_buffer[dtype](SIZE) - out.enqueue_fill(0) - var a = ctx.enqueue_create_buffer[dtype](SIZE) - a.enqueue_fill(1) - - var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out) - var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a) - - comptime kernel = add_10_shared_layout_tensor[layout] - ctx.enqueue_function[kernel, kernel]( - out_tensor, - a_tensor, - SIZE, - grid_dim=BLOCKS_PER_GRID, - block_dim=THREADS_PER_BLOCK, - ) - - var expected = ctx.enqueue_create_host_buffer[dtype](SIZE) - expected.enqueue_fill(11) - ctx.synchronize() - - with out.map_to_host() as out_host: - print("out:", out_host) - print("expected:", expected) - for i in range(SIZE): - assert_equal(out_host[i], expected[i]) - print("Puzzle 08 complete โœ…") diff --git a/solutions/p10/p10.mojo b/solutions/p10/p10.mojo index fa7bce95..7c25c4ed 100644 --- a/solutions/p10/p10.mojo +++ b/solutions/p10/p10.mojo @@ -1,7 +1,9 @@ from std.gpu import thread_idx, block_dim, block_idx, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.testing import assert_equal from std.sys import argv from std.os.atomic import Atomic @@ -12,24 +14,22 @@ comptime SIZE = 2 comptime BLOCKS_PER_GRID = 1 comptime THREADS_PER_BLOCK = (3, 3) comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE, SIZE) +comptime layout = row_major[SIZE, SIZE]() +comptime 
LayoutType = type_of(layout) def shared_memory_race( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], size: Int, ): """Fixed: sequential access with barriers eliminates race conditions.""" var row = thread_idx.y var col = thread_idx.x - var shared_sum = LayoutTensor[ - dtype, - Layout.row_major(1), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + var shared_sum = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[1]()) # Only thread 0 does all the accumulation work to prevent races if row == 0 and col == 0: @@ -53,8 +53,8 @@ def shared_memory_race( # ANCHOR: add_10_2d_solution def add_10_2d( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], size: Int, ): var row = thread_idx.y @@ -79,10 +79,8 @@ def main() raises: with DeviceContext() as ctx: var out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE) out_buf.enqueue_fill(0) - var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin]( - out_buf - ).reshape[layout]() - print("out shape:", out_tensor.shape[0](), "x", out_tensor.shape[1]()) + var out_tensor = TileTensor(out_buf, layout) + print("out shape:", out_tensor.dim[0](), "x", out_tensor.dim[1]()) var expected = ctx.enqueue_create_host_buffer[dtype](SIZE * SIZE) expected.enqueue_fill(0) @@ -92,9 +90,7 @@ def main() raises: for i in range(SIZE * SIZE): a_host[i] = Scalar[dtype](i) - var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a).reshape[ - layout - ]() + var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout) if flag == "--memory-bug": print("Running memory bug example (bounds checking issue)...") diff --git 
a/solutions/p11/p11.mojo b/solutions/p11/p11.mojo index de3e243d..89d16c70 100644 --- a/solutions/p11/p11.mojo +++ b/solutions/p11/p11.mojo @@ -1,7 +1,9 @@ -from std.memory import UnsafePointer, stack_allocation from std.gpu import thread_idx, block_idx, block_dim, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.testing import assert_equal comptime TPB = 8 @@ -9,30 +11,37 @@ comptime SIZE = 8 comptime BLOCKS_PER_GRID = (1, 1) comptime THREADS_PER_BLOCK = (TPB, 1) comptime dtype = DType.float32 +comptime layout = row_major[SIZE]() +comptime LayoutType = type_of(layout) # ANCHOR: pooling_solution def pooling( - output: UnsafePointer[Scalar[dtype], MutAnyOrigin], - a: UnsafePointer[Scalar[dtype], MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], size: Int, ): + # Allocate shared memory using stack_allocation var shared = stack_allocation[ - TPB, - Scalar[dtype], - address_space=AddressSpace.SHARED, - ]() + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[TPB]()) + var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x + + # Load data into shared memory if global_i < size: shared[local_i] = a[global_i] + # Synchronize threads within block barrier() + # Handle first two special cases if global_i == 0: output[0] = shared[0] elif global_i == 1: output[1] = shared[0] + shared[1] + # Handle general case elif 1 < global_i < size: output[global_i] = ( shared[local_i - 2] + shared[local_i - 1] + shared[local_i] @@ -48,13 +57,17 @@ def main() raises: out.enqueue_fill(0) var a = ctx.enqueue_create_buffer[dtype](SIZE) a.enqueue_fill(0) + with a.map_to_host() as a_host: for i in range(SIZE): a_host[i] = Scalar[dtype](i) + var out_tensor = TileTensor(out, layout) + var a_tensor = 
TileTensor[mut=False, dtype, LayoutType](a, layout) + ctx.enqueue_function[pooling, pooling]( - out, - a, + out_tensor, + a_tensor, SIZE, grid_dim=BLOCKS_PER_GRID, block_dim=THREADS_PER_BLOCK, @@ -62,7 +75,6 @@ def main() raises: var expected = ctx.enqueue_create_host_buffer[dtype](SIZE) expected.enqueue_fill(0) - ctx.synchronize() with a.map_to_host() as a_host: @@ -71,7 +83,6 @@ def main() raises: var s = Scalar[dtype](0) for j in range(max(i - 2, 0), i + 1): s += ptr[j] - expected[i] = s with out.map_to_host() as out_host: diff --git a/solutions/p11/p11_layout_tensor.mojo b/solutions/p11/p11_layout_tensor.mojo deleted file mode 100644 index 7cfa112c..00000000 --- a/solutions/p11/p11_layout_tensor.mojo +++ /dev/null @@ -1,95 +0,0 @@ -from std.gpu import thread_idx, block_idx, block_dim, barrier -from std.gpu.host import DeviceContext -from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor -from std.testing import assert_equal - -comptime TPB = 8 -comptime SIZE = 8 -comptime BLOCKS_PER_GRID = (1, 1) -comptime THREADS_PER_BLOCK = (TPB, 1) -comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE) - - -# ANCHOR: pooling_layout_tensor_solution -def pooling[ - layout: Layout -]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - a: LayoutTensor[dtype, layout, ImmutAnyOrigin], - size: Int, -): - # Allocate shared memory using tensor builder - var shared = LayoutTensor[ - dtype, - Layout.row_major(TPB), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - - var global_i = block_dim.x * block_idx.x + thread_idx.x - var local_i = thread_idx.x - - # Load data into shared memory - if global_i < size: - shared[local_i] = a[global_i] - - # Synchronize threads within block - barrier() - - # Handle first two special cases - if global_i == 0: - output[0] = shared[0] - elif global_i == 1: - output[1] = shared[0] + shared[1] - # Handle general case - elif 1 < global_i < size: - output[global_i] = ( - 
shared[local_i - 2] + shared[local_i - 1] + shared[local_i] - ) - - -# ANCHOR_END: pooling_layout_tensor_solution - - -def main() raises: - with DeviceContext() as ctx: - var out = ctx.enqueue_create_buffer[dtype](SIZE) - out.enqueue_fill(0) - var a = ctx.enqueue_create_buffer[dtype](SIZE) - a.enqueue_fill(0) - - with a.map_to_host() as a_host: - for i in range(SIZE): - a_host[i] = Scalar[dtype](i) - - var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out) - var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a) - - ctx.enqueue_function[pooling[layout], pooling[layout]]( - out_tensor, - a_tensor, - SIZE, - grid_dim=BLOCKS_PER_GRID, - block_dim=THREADS_PER_BLOCK, - ) - - var expected = ctx.enqueue_create_host_buffer[dtype](SIZE) - expected.enqueue_fill(0) - ctx.synchronize() - - with a.map_to_host() as a_host: - var ptr = a_host - for i in range(SIZE): - var s = Scalar[dtype](0) - for j in range(max(i - 2, 0), i + 1): - s += ptr[j] - expected[i] = s - - with out.map_to_host() as out_host: - print("out:", out_host) - print("expected:", expected) - for i in range(SIZE): - assert_equal(out_host[i], expected[i]) - print("Puzzle 11 complete โœ…") diff --git a/solutions/p12/p12.mojo b/solutions/p12/p12.mojo index 99393982..b8b962dd 100644 --- a/solutions/p12/p12.mojo +++ b/solutions/p12/p12.mojo @@ -1,7 +1,9 @@ -from std.memory import UnsafePointer, stack_allocation from std.gpu import thread_idx, block_idx, block_dim, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.testing import assert_equal comptime TPB = 8 @@ -9,37 +11,33 @@ comptime SIZE = 8 comptime BLOCKS_PER_GRID = (1, 1) comptime THREADS_PER_BLOCK = (TPB, 1) comptime dtype = DType.float32 +comptime layout = row_major[SIZE]() +comptime out_layout = row_major[1]() +comptime LayoutType = type_of(layout) +comptime OutLayout = 
type_of(out_layout) # ANCHOR: dot_product_solution def dot_product( - output: UnsafePointer[Scalar[dtype], MutAnyOrigin], - a: UnsafePointer[Scalar[dtype], MutAnyOrigin], - b: UnsafePointer[Scalar[dtype], MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin], size: Int, ): var shared = stack_allocation[ - TPB, - Scalar[dtype], - address_space=AddressSpace.SHARED, - ]() + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[TPB]()) var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x + + # Compute element-wise multiplication into shared memory if global_i < size: shared[local_i] = a[global_i] * b[global_i] + # Synchronize threads within block barrier() - # The following causes race condition: all threads writing to the same location - # out[0] += shared[local_i] - - # Instead can do parallel reduction in shared memory as opposed to - # global memory which has no guarantee on synchronization. - # Loops using global memory can cause thread divergence because - # fundamentally GPUs execute threads in warps (groups of 32 threads typically) - # and warps can be scheduled independently. - # However, shared memory does not have such issues as long as we use `barrier()` - # correctly when we're in the same thread block. 
+ # Parallel reduction in shared memory var stride = TPB // 2 while stride > 0: if local_i < stride: @@ -48,7 +46,7 @@ def dot_product( barrier() stride //= 2 - # only thread 0 writes the final result + # Only thread 0 writes the final result if local_i == 0: output[0] = shared[0] @@ -64,15 +62,20 @@ def main() raises: a.enqueue_fill(0) var b = ctx.enqueue_create_buffer[dtype](SIZE) b.enqueue_fill(0) + with a.map_to_host() as a_host, b.map_to_host() as b_host: for i in range(SIZE): a_host[i] = Scalar[dtype](i) b_host[i] = Scalar[dtype](i) + var out_tensor = TileTensor(out, out_layout) + var a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout) + var b_tensor = TileTensor[mut=False, dtype, LayoutType](b, layout) + ctx.enqueue_function[dot_product, dot_product]( - out, - a, - b, + out_tensor, + a_tensor, + b_tensor, SIZE, grid_dim=BLOCKS_PER_GRID, block_dim=THREADS_PER_BLOCK, @@ -80,7 +83,6 @@ def main() raises: var expected = ctx.enqueue_create_host_buffer[dtype](1) expected.enqueue_fill(0) - ctx.synchronize() with a.map_to_host() as a_host, b.map_to_host() as b_host: diff --git a/solutions/p12/p12_layout_tensor.mojo b/solutions/p12/p12_layout_tensor.mojo deleted file mode 100644 index a5359bb5..00000000 --- a/solutions/p12/p12_layout_tensor.mojo +++ /dev/null @@ -1,98 +0,0 @@ -from std.gpu import thread_idx, block_idx, block_dim, barrier -from std.gpu.host import DeviceContext -from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor -from std.testing import assert_equal - -comptime TPB = 8 -comptime SIZE = 8 -comptime BLOCKS_PER_GRID = (1, 1) -comptime THREADS_PER_BLOCK = (TPB, 1) -comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE) -comptime out_layout = Layout.row_major(1) - - -# ANCHOR: dot_product_layout_tensor_solution -def dot_product[ - in_layout: Layout, out_layout: Layout -]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - b: 
LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - size: Int, -): - var shared = LayoutTensor[ - dtype, - Layout.row_major(TPB), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - var global_i = block_dim.x * block_idx.x + thread_idx.x - var local_i = thread_idx.x - - # Compute element-wise multiplication into shared memory - if global_i < size: - shared[local_i] = a[global_i] * b[global_i] - - # Synchronize threads within block - barrier() - - # Parallel reduction in shared memory - var stride = TPB // 2 - while stride > 0: - if local_i < stride: - shared[local_i] += shared[local_i + stride] - - barrier() - stride //= 2 - - # Only thread 0 writes the final result - if local_i == 0: - output[0] = shared[0] - - -# ANCHOR_END: dot_product_layout_tensor_solution - - -def main() raises: - with DeviceContext() as ctx: - var out = ctx.enqueue_create_buffer[dtype](1) - out.enqueue_fill(0) - var a = ctx.enqueue_create_buffer[dtype](SIZE) - a.enqueue_fill(0) - var b = ctx.enqueue_create_buffer[dtype](SIZE) - b.enqueue_fill(0) - - with a.map_to_host() as a_host, b.map_to_host() as b_host: - for i in range(SIZE): - a_host[i] = Scalar[dtype](i) - b_host[i] = Scalar[dtype](i) - - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out) - var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](b) - - comptime kernel = dot_product[layout, out_layout] - ctx.enqueue_function[kernel, kernel]( - out_tensor, - a_tensor, - b_tensor, - SIZE, - grid_dim=BLOCKS_PER_GRID, - block_dim=THREADS_PER_BLOCK, - ) - - var expected = ctx.enqueue_create_host_buffer[dtype](1) - expected.enqueue_fill(0) - ctx.synchronize() - - with a.map_to_host() as a_host, b.map_to_host() as b_host: - for i in range(SIZE): - expected[0] += a_host[i] * b_host[i] - - with out.map_to_host() as out_host: - print("out:", out_host) - print("expected:", expected) - assert_equal(out_host[0], expected[0]) - print("Puzzle 
12 complete โœ…") diff --git a/solutions/p13/p13.mojo b/solutions/p13/p13.mojo index 7307a6f0..aacf2514 100644 --- a/solutions/p13/p13.mojo +++ b/solutions/p13/p13.mojo @@ -1,7 +1,9 @@ from std.gpu import thread_idx, block_idx, block_dim, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major +from layout.tile_tensor import stack_allocation from std.sys import argv from std.testing import assert_equal @@ -11,33 +13,28 @@ comptime CONV = 3 comptime BLOCKS_PER_GRID = (1, 1) comptime THREADS_PER_BLOCK = (TPB, 1) comptime dtype = DType.float32 -comptime in_layout = Layout.row_major(SIZE) -comptime out_layout = Layout.row_major(SIZE) -comptime conv_layout = Layout.row_major(CONV) +comptime in_layout = row_major[SIZE]() +comptime out_layout = row_major[SIZE]() +comptime conv_layout = row_major[CONV]() +comptime InLayout = type_of(in_layout) +comptime OutLayout = type_of(out_layout) +comptime ConvLayout = type_of(conv_layout) # ANCHOR: conv_1d_simple_solution -def conv_1d_simple[ - in_layout: Layout, out_layout: Layout, conv_layout: Layout -]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin], +def conv_1d_simple( + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin], + b: TileTensor[mut=False, dtype, ConvLayout, ImmutAnyOrigin], ): var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x - var shared_a = LayoutTensor[ - dtype, - Layout.row_major(SIZE), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - var shared_b = LayoutTensor[ - dtype, - Layout.row_major(CONV), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + var shared_a = stack_allocation[ + dtype=dtype, 
address_space=AddressSpace.SHARED
+    ](row_major[SIZE]())
+    var shared_b = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[CONV]())
 
     if global_i < SIZE:
         shared_a[local_i] = a[global_i]
@@ -58,8 +55,8 @@ def conv_1d_simple[
     # Safe and correct:
     if global_i < SIZE:
         # Note: using `var` allows us to include the type in the type inference
-        # `out.element_type` is available in LayoutTensor
-        var local_sum: output.element_type = 0
+        # `out.ElementType` is available in TileTensor
+        var local_sum: output.ElementType = 0
 
         # Note: `@parameter` decorator unrolls the loop at compile time given `CONV` is a compile-time constant
        # See: https://docs.modular.com/mojo/manual/decorators/parameter/#parametric-for-statement
@@ -77,34 +74,29 @@
 comptime SIZE_2 = 15
 comptime CONV_2 = 4
 comptime BLOCKS_PER_GRID_2 = (2, 1)
 comptime THREADS_PER_BLOCK_2 = (TPB, 1)
-comptime in_2_layout = Layout.row_major(SIZE_2)
-comptime out_2_layout = Layout.row_major(SIZE_2)
-comptime conv_2_layout = Layout.row_major(CONV_2)
+comptime in_2_layout = row_major[SIZE_2]()
+comptime out_2_layout = row_major[SIZE_2]()
+comptime conv_2_layout = row_major[CONV_2]()
+comptime In2Layout = type_of(in_2_layout)
+comptime Out2Layout = type_of(out_2_layout)
+comptime Conv2Layout = type_of(conv_2_layout)
 
 
 # ANCHOR: conv_1d_block_boundary_solution
-def conv_1d_block_boundary[
-    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
-](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin],
+def conv_1d_block_boundary(
+    output: TileTensor[mut=True, dtype, Out2Layout, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, In2Layout, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, Conv2Layout, ImmutAnyOrigin],
 ):
     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
 
     # first: need to account for padding
-    var shared_a = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB + CONV_2 - 1),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var shared_b = LayoutTensor[
-        dtype,
-        Layout.row_major(CONV_2),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared_a = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB + CONV_2 - 1]())
+    var shared_b = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[CONV_2]())
     if global_i < SIZE_2:
         shared_a[local_i] = a[global_i]
     else:
@@ -127,7 +119,7 @@
     barrier()
 
     if global_i < SIZE_2:
-        var local_sum: output.element_type = 0
+        var local_sum: output.ElementType = 0
 
         comptime for j in range(CONV_2):
             if global_i + j < SIZE_2:
@@ -158,11 +150,12 @@ def main() raises:
             b_host[i] = Scalar[dtype](i)
 
         if argv()[1] == "--simple":
-            var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
-            var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
-            var b_tensor = LayoutTensor[dtype, conv_layout, ImmutAnyOrigin](b)
-            comptime kernel = conv_1d_simple[in_layout, out_layout, conv_layout]
-            ctx.enqueue_function[kernel, kernel](
+            var out_tensor = TileTensor(out, out_layout)
+            var a_tensor = TileTensor[mut=False, dtype, InLayout](a, in_layout)
+            var b_tensor = TileTensor[mut=False, dtype, ConvLayout](
+                b, conv_layout
+            )
+            ctx.enqueue_function[conv_1d_simple, conv_1d_simple](
                 out_tensor,
                 a_tensor,
                 b_tensor,
@@ -170,15 +163,16 @@
                 block_dim=THREADS_PER_BLOCK,
             )
         elif argv()[1] == "--block-boundary":
-            var out_tensor = LayoutTensor[dtype, out_2_layout, MutAnyOrigin](
-                out
+            var out_tensor = TileTensor(out, out_2_layout)
+            var a_tensor = TileTensor[mut=False, dtype, In2Layout](
+                a, in_2_layout
+            )
+            var b_tensor = TileTensor[mut=False, dtype, Conv2Layout](
+                b, conv_2_layout
             )
-            var a_tensor = LayoutTensor[dtype, in_2_layout, ImmutAnyOrigin](a)
-            var b_tensor = LayoutTensor[dtype, conv_2_layout, ImmutAnyOrigin](b)
-            comptime kernel = conv_1d_block_boundary[
-                in_2_layout, out_2_layout, conv_2_layout, dtype
-            ]
-            ctx.enqueue_function[kernel, kernel](
+            ctx.enqueue_function[
+                conv_1d_block_boundary, conv_1d_block_boundary
+            ](
                 out_tensor,
                 a_tensor,
                 b_tensor,
diff --git a/solutions/p14/p14.mojo b/solutions/p14/p14.mojo
index 55d6a800..794b46ee 100644
--- a/solutions/p14/p14.mojo
+++ b/solutions/p14/p14.mojo
@@ -1,7 +1,9 @@
 from std.gpu import thread_idx, block_idx, block_dim, barrier
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from std.sys import argv
 from std.math import log2
 from std.testing import assert_equal
@@ -11,25 +13,21 @@
 comptime SIZE = 8
 comptime BLOCKS_PER_GRID = (1, 1)
 comptime THREADS_PER_BLOCK = (TPB, 1)
 comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
 
 
 # ANCHOR: prefix_sum_simple_solution
-def prefix_sum_simple[
-    layout: Layout
-](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def prefix_sum_simple(
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
     size: Int,
 ):
     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
-    var shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB]())
 
     if global_i < size:
         shared[local_i] = a[global_i]
@@ -37,7 +35,7 @@ def prefix_sum_simple[
     var offset = 1
     for i in range(Int(log2(Scalar[dtype](TPB)))):
-        var current_val: output.element_type = 0
+        var current_val: output.ElementType = 0
         if local_i >= offset and local_i < size:
             current_val = shared[local_i - offset]  # read
@@ -59,28 +57,25 @@
 comptime SIZE_2 = 15
 comptime BLOCKS_PER_GRID_2 = (2, 1)
 comptime THREADS_PER_BLOCK_2 = (TPB, 1)
 comptime EXTENDED_SIZE = SIZE_2 + 2  # up to 2 blocks
-comptime layout_2 = Layout.row_major(SIZE_2)
-comptime extended_layout = Layout.row_major(EXTENDED_SIZE)
+comptime layout_2 = row_major[SIZE_2]()
+comptime extended_layout = row_major[EXTENDED_SIZE]()
+comptime Layout2Type = type_of(layout_2)
+comptime ExtendedLayout = type_of(extended_layout)
 
 
 # ANCHOR: prefix_sum_complete_solution
 # Kernel 1: Compute local prefix sums and store block sums in out
-def prefix_sum_local_phase[
-    out_layout: Layout, in_layout: Layout
-](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+def prefix_sum_local_phase(
+    output: TileTensor[mut=True, dtype, ExtendedLayout, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, Layout2Type, ImmutAnyOrigin],
     size: Int,
 ):
     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
-    var shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB]())
 
     # Load data into shared memory
     # Example with SIZE_2=15, TPB=8, BLOCKS=2:
@@ -104,7 +99,7 @@
     # Block 1 follows same pattern to get [8,17,27,38,50,63,77,???]
     var offset = 1
     for i in range(Int(log2(Scalar[dtype](TPB)))):
-        var current_val: output.element_type = 0
+        var current_val: output.ElementType = 0
         if local_i >= offset and local_i < TPB:
             current_val = shared[local_i - offset]  # read
@@ -132,9 +127,10 @@ def prefix_sum_local_phase[
 
 # Kernel 2: Add block sums to their respective blocks
-def prefix_sum_block_sum_phase[
-    layout: Layout
-](output: LayoutTensor[dtype, layout, MutAnyOrigin], size: Int):
+def prefix_sum_block_sum_phase(
+    output: TileTensor[mut=True, dtype, ExtendedLayout, MutAnyOrigin],
+    size: Int,
+):
     var global_i = block_dim.x * block_idx.x + thread_idx.x
 
     # Second pass: add previous block's sum to each element
@@ -172,11 +168,10 @@ def main() raises:
             a_host[i] = Scalar[dtype](i)
 
         if use_simple:
-            a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](a)
-            out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
+            a_tensor = TileTensor[mut=False, dtype, LayoutType](a, layout)
+            out_tensor = TileTensor(out, layout)
 
-            comptime kernel = prefix_sum_simple[layout]
-            ctx.enqueue_function[kernel, kernel](
+            ctx.enqueue_function[prefix_sum_simple, prefix_sum_simple](
                 out_tensor,
                 a_tensor,
                 size,
@@ -184,15 +179,16 @@
                 block_dim=THREADS_PER_BLOCK,
             )
         else:
-            var a_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin](a)
-            var out_tensor = LayoutTensor[dtype, extended_layout, MutAnyOrigin](
-                out
+            var a_tensor = TileTensor[mut=False, dtype, Layout2Type](
+                a, layout_2
             )
+            var out_tensor = TileTensor(out, extended_layout)
 
             # ANCHOR: prefix_sum_complete_block_level_sync
             # Phase 1: Local prefix sums
-            comptime kernel = prefix_sum_local_phase[extended_layout, layout_2]
-            ctx.enqueue_function[kernel, kernel](
+            ctx.enqueue_function[
+                prefix_sum_local_phase, prefix_sum_local_phase
+            ](
                 out_tensor,
                 a_tensor,
                 size,
@@ -201,8 +197,9 @@
             )
 
             # Phase 2: Add block sums
-            comptime kernel2 = prefix_sum_block_sum_phase[extended_layout]
-            ctx.enqueue_function[kernel2, kernel2](
+            ctx.enqueue_function[
+                prefix_sum_block_sum_phase, prefix_sum_block_sum_phase
+            ](
                 out_tensor,
                 size,
                 grid_dim=BLOCKS_PER_GRID_2,
diff --git a/solutions/p15/p15.mojo b/solutions/p15/p15.mojo
index 218c34b7..df06ed5a 100644
--- a/solutions/p15/p15.mojo
+++ b/solutions/p15/p15.mojo
@@ -1,7 +1,9 @@
 from std.gpu import thread_idx, block_idx, block_dim, barrier
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from std.testing import assert_equal
 
 comptime TPB = 8
@@ -10,27 +12,24 @@
 comptime SIZE = 6
 comptime BLOCKS_PER_GRID = (1, BATCH)
 comptime THREADS_PER_BLOCK = (TPB, 1)
 comptime dtype = DType.float32
-comptime in_layout = Layout.row_major(BATCH, SIZE)
-comptime out_layout = Layout.row_major(BATCH, 1)
+comptime in_layout = row_major[BATCH, SIZE]()
+comptime out_layout = row_major[BATCH, 1]()
+comptime InLayout = type_of(in_layout)
+comptime OutLayout = type_of(out_layout)
 
 
 # ANCHOR: axis_sum_solution
-def axis_sum[
-    in_layout: Layout, out_layout: Layout
-](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+def axis_sum(
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
     size: Int,
 ):
     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
     var batch = block_idx.y
-    var cache = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var cache = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB]())
 
     # Visualize:
     # Block(0,0): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 0: [0,1,2,3,4,5]
@@ -52,7 +51,7 @@ def axis_sum[
     var stride = TPB // 2
     while stride > 0:
         # Read phase: all threads read the values they need first to avoid race conditions
-        var temp_val: output.element_type = 0
+        var temp_val: output.ElementType = 0
         if local_i < stride:
             temp_val = cache[local_i + stride]
@@ -84,11 +83,10 @@ def main() raises:
             for col in range(SIZE):
                 inp_host[row * SIZE + col] = Scalar[dtype](row * SIZE + col)
 
-    var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
-    var inp_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](inp)
+    var out_tensor = TileTensor(out, out_layout)
+    var inp_tensor = TileTensor[mut=False, dtype, InLayout](inp, in_layout)
 
-    comptime kernel = axis_sum[in_layout, out_layout]
-    ctx.enqueue_function[kernel, kernel](
+    ctx.enqueue_function[axis_sum, axis_sum](
         out_tensor,
         inp_tensor,
         SIZE,
diff --git a/solutions/p16/p16.mojo b/solutions/p16/p16.mojo
index 62b76e00..e800a5f1 100644
--- a/solutions/p16/p16.mojo
+++ b/solutions/p16/p16.mojo
@@ -1,7 +1,9 @@
 from std.gpu import thread_idx, block_idx, block_dim, barrier
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from std.sys import argv
 from std.testing import assert_equal
 
@@ -10,22 +12,23 @@
 comptime SIZE = 2
 comptime BLOCKS_PER_GRID = (1, 1)
 comptime THREADS_PER_BLOCK = (TPB, TPB)
 comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE, SIZE)
+comptime layout = row_major[SIZE, SIZE]()
+comptime LayoutType = type_of(layout)
 
 
 # ANCHOR: naive_matmul_solution
 def naive_matmul[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     var row = block_dim.y * block_idx.y + thread_idx.y
     var col = block_dim.x * block_idx.x + thread_idx.x
 
     if row < size and col < size:
-        var acc: output.element_type = 0
+        var acc: output.ElementType = 0
 
         comptime for k in range(size):
             acc += a[row, k] * b[k, col]
@@ -38,29 +41,23 @@ def naive_matmul[
 # ANCHOR: single_block_matmul_solution
 def single_block_matmul[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     var row = block_dim.y * block_idx.y + thread_idx.y
     var col = block_dim.x * block_idx.x + thread_idx.x
     var local_row = thread_idx.y
     var local_col = thread_idx.x
-    var a_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB, TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var b_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB, TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var a_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB, TPB]())
+    var b_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB, TPB]())
 
     if row < size and col < size:
         a_shared[local_row, local_col] = a[row, col]
@@ -69,7 +66,7 @@ def single_block_matmul[
     barrier()
 
     if row < size and col < size:
-        var acc: output.element_type = 0
+        var acc: output.ElementType = 0
 
         comptime for k in range(size):
             acc += a_shared[local_row, k] * b_shared[k, local_col]
@@ -83,36 +80,31 @@
 comptime SIZE_TILED = 9
 comptime BLOCKS_PER_GRID_TILED = (3, 3)  # each block covers 3x3 elements
 comptime THREADS_PER_BLOCK_TILED = (TPB, TPB)
-comptime layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)
+comptime layout_tiled = row_major[SIZE_TILED, SIZE_TILED]()
+comptime LayoutTiledType = type_of(layout_tiled)
 
 
 # ANCHOR: matmul_tiled_solution
 def matmul_tiled[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
-    a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutTiledType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
 ):
     var local_row = thread_idx.y
     var local_col = thread_idx.x
     var tiled_row = block_idx.y * TPB + local_row
     var tiled_col = block_idx.x * TPB + local_col
-    var a_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB, TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var b_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB, TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-
-    var acc: output.element_type = 0
+    var a_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB, TPB]())
+    var b_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB, TPB]())
+
+    var acc: output.ElementType = 0
 
     # Iterate over tiles to compute matrix product
     comptime for tile in range((size + TPB - 1) // TPB):
@@ -147,17 +139,18 @@
 # ANCHOR: matmul_idiomatic_tiled_solution
 from std.gpu.memory import async_copy_wait_all
 from layout.layout_tensor import copy_dram_to_sram_async
+from layout import Layout as IntTupleLayout
 
 comptime NUM_THREADS = TPB * TPB
 comptime BLOCK_DIM_COUNT = 2
 
 
 def matmul_idiomatic_tiled[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
-    a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutTiledType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
 ):
     var local_row = thread_idx.y
     var local_col = thread_idx.x
@@ -166,23 +159,21 @@
     # Get the tile of the output matrix that this thread block is responsible for
     var out_tile = output.tile[TPB, TPB](block_idx.y, block_idx.x)
 
-    var a_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB, TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var b_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB, TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-
-    var acc: output.element_type = 0
-
-    comptime load_a_layout = Layout.row_major(1, TPB)  # Coalesced loading
-    comptime load_b_layout = Layout.row_major(1, TPB)  # Coalesced loading
+    var a_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB, TPB]())
+    var b_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB, TPB]())
+
+    var acc: output.ElementType = 0
+
+    comptime load_a_layout = IntTupleLayout.row_major(
+        1, TPB
+    )  # Coalesced loading
+    comptime load_b_layout = IntTupleLayout.row_major(
+        1, TPB
+    )  # Coalesced loading
 
     # Note: Both matrices stored in same orientation for correct matrix multiplication
     # Transposed loading would be useful if B were pre-transposed in global memory
@@ -198,12 +189,12 @@
             thread_layout=load_a_layout,
             num_threads=NUM_THREADS,
             block_dim_count=BLOCK_DIM_COUNT,
-        ](a_shared, a_tile)
+        ](a_shared.to_layout_tensor(), a_tile.to_layout_tensor())
         copy_dram_to_sram_async[
             thread_layout=load_b_layout,
             num_threads=NUM_THREADS,
             block_dim_count=BLOCK_DIM_COUNT,
-        ](b_shared, b_tile)
+        ](b_shared.to_layout_tensor(), b_tile.to_layout_tensor())
 
         # Wait for all async copies to complete
         async_copy_wait_all()
@@ -254,12 +245,12 @@ def main() raises:
                         inp1_host[i * size + k] * inp2_host[k * size + j]
                     )
 
-    var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-    var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp1)
-    var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp2)
+    var out_tensor = TileTensor(out, layout)
+    var a_tensor = TileTensor[mut=False, dtype, LayoutType](inp1, layout)
+    var b_tensor = TileTensor[mut=False, dtype, LayoutType](inp2, layout)
 
     if argv()[1] == "--naive":
-        comptime kernel = naive_matmul[layout, SIZE]
+        comptime kernel = naive_matmul[SIZE]
         ctx.enqueue_function[kernel, kernel](
             out_tensor,
             a_tensor,
@@ -268,7 +259,7 @@
             block_dim=THREADS_PER_BLOCK,
         )
    elif argv()[1] == "--single-block":
-        comptime kernel = single_block_matmul[layout, SIZE]
+        comptime kernel = single_block_matmul[SIZE]
         ctx.enqueue_function[kernel, kernel](
             out_tensor,
             a_tensor,
@@ -278,17 +269,15 @@
         )
     elif argv()[1] == "--tiled":
         # Need to update the layout of the tensors to the tiled layout
-        out_tensor_tiled = LayoutTensor[dtype, layout_tiled, MutAnyOrigin](
-            out
-        )
-        a_tensor_tiled = LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin](
-            inp1
+        out_tensor_tiled = TileTensor(out, layout_tiled)
+        a_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+            inp1, layout_tiled
         )
-        b_tensor_tiled = LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin](
-            inp2
+        b_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+            inp2, layout_tiled
         )
-        comptime kernel = matmul_tiled[layout_tiled, SIZE_TILED]
+        comptime kernel = matmul_tiled[SIZE_TILED]
         ctx.enqueue_function[kernel, kernel](
             out_tensor_tiled,
             a_tensor_tiled,
@@ -297,17 +286,15 @@
         )
     elif argv()[1] == "--idiomatic-tiled":
-        out_tensor_tiled = LayoutTensor[dtype, layout_tiled, MutAnyOrigin](
-            out
-        )
-        a_tensor_tiled = LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin](
-            inp1
+        out_tensor_tiled = TileTensor(out, layout_tiled)
+        a_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+            inp1, layout_tiled
        )
-        b_tensor_tiled = LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin](
-            inp2
+        b_tensor_tiled = TileTensor[mut=False, dtype, LayoutTiledType](
+            inp2, layout_tiled
         )
-        comptime kernel = matmul_idiomatic_tiled[layout_tiled, SIZE_TILED]
+        comptime kernel = matmul_idiomatic_tiled[SIZE_TILED]
         ctx.enqueue_function[kernel, kernel](
             out_tensor_tiled,
             a_tensor_tiled,
diff --git a/solutions/p17/op/conv1d.mojo b/solutions/p17/op/conv1d.mojo
index 77517480..b83f4a74 100644
--- a/solutions/p17/op/conv1d.mojo
+++ b/solutions/p17/op/conv1d.mojo
@@ -1,7 +1,10 @@
 from std.gpu import thread_idx, block_idx, block_dim, barrier
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
+from std.utils import Index
 from std.sys import argv
 from std.testing import assert_equal
 
@@ -10,59 +13,57 @@
 comptime BLOCKS_PER_GRID = (2, 1)
 
 
 def conv1d_kernel[
-    in_layout: Layout,
-    out_layout: Layout,
-    conv_layout: Layout,
     input_size: Int,
     conv_size: Int,
+    OutLayout: TensorLayout,
+    InLayout: TensorLayout,
+    ConvLayout: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, in_layout, MutAnyOrigin],
-    kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    input: TileTensor[mut=True, dtype, InLayout, MutAnyOrigin],
+    kernel: TileTensor[mut=True, dtype, ConvLayout, MutAnyOrigin],
 ):
     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
+    # Convert generic TileTensors to LayoutTensor for indexing (flat_rank proof required)
+    var input_lt = input.to_layout_tensor()
+    var kernel_lt = kernel.to_layout_tensor()
+    var output_lt = output.to_layout_tensor()
 
     # first: need to account for padding
-    var shared_a = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB + conv_size - 1),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var shared_b = LayoutTensor[
-        dtype,
-        Layout.row_major(conv_size),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared_a = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB + conv_size - 1]())
+    var shared_b = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[conv_size]())
     if global_i < input_size:
-        shared_a[local_i] = input[global_i]
+        shared_a[local_i] = rebind[Scalar[dtype]](input_lt[global_i])
 
     # second: load elements needed for convolution at block boundary
     if local_i < conv_size - 1:
         # indices from next block
         var next_idx = global_i + TPB
         if next_idx < input_size:
-            shared_a[TPB + local_i] = input[next_idx]
+            shared_a[TPB + local_i] = rebind[Scalar[dtype]](input_lt[next_idx])
         else:
             # Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory
             # which is an undefined behavior
             shared_a[TPB + local_i] = 0
 
     if local_i < conv_size:
-        shared_b[local_i] = kernel[local_i]
+        shared_b[local_i] = rebind[Scalar[dtype]](kernel_lt[local_i])
 
     barrier()
 
     if global_i < input_size:
-        var local_sum: output.element_type = 0
+        var local_sum: Scalar[dtype] = 0
 
         comptime for j in range(conv_size):
            if local_i + j < TPB + conv_size - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]
 
-        output[global_i] = local_sum
+        output_lt.store[1](Index(global_i), local_sum)
 
 
 import compiler
@@ -82,18 +83,26 @@ struct Conv1DCustomOp:
         conv_size: Int,
         dtype: DType = DType.float32,
     ](
-        output: OutputTensor[rank=1, static_spec=_],
-        input: InputTensor[rank=output.rank, static_spec=_],
-        kernel: InputTensor[rank=output.rank, static_spec=_],
+        output: OutputTensor[dtype=dtype, rank=1, static_spec=_],
+        input: InputTensor[dtype=dtype, rank=output.rank, static_spec=_],
+        kernel: InputTensor[dtype=dtype, rank=output.rank, static_spec=_],
         # the context is needed for some GPU calls
         ctx: DeviceContextPtr,
     ) raises:
-        var output_tensor = output.to_layout_tensor()
-        var input_tensor = input.to_layout_tensor()
-        var kernel_tensor = kernel.to_layout_tensor()
-        comptime in_layout = input_tensor.layout
-        comptime out_layout = output_tensor.layout
-        comptime conv_layout = kernel_tensor.layout
+        comptime out_layout_val = row_major[input_size]()
+        comptime OutLayout = type_of(out_layout_val)
+        comptime conv_layout_val = row_major[conv_size]()
+        comptime ConvLayout = type_of(conv_layout_val)
+
+        var output_tensor = TileTensor[
+            mut=True, dtype, OutLayout, MutAnyOrigin
+        ](output.unsafe_ptr(), out_layout_val)
+        var input_tensor = TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin](
+            input.unsafe_ptr(), out_layout_val
+        )
+        var kernel_tensor = TileTensor[
+            mut=True, dtype, ConvLayout, MutAnyOrigin
+        ](kernel.unsafe_ptr(), conv_layout_val)
 
         comptime if target == "gpu":
             var gpu_ctx = ctx.get_device_context()
@@ -101,7 +110,7 @@ struct Conv1DCustomOp:
             gpu_ctx.enqueue_memset(
                 DeviceBuffer[output_tensor.dtype](
                     gpu_ctx,
-                    output_tensor.ptr,
+                    output.unsafe_ptr(),
                     input_size,
                     owning=False,
                 ),
@@ -109,7 +118,7 @@
                 0,
             )
             # ANCHOR: conv1d_custom_op_solution
             comptime kernel = conv1d_kernel[
-                in_layout, out_layout, conv_layout, input_size, conv_size
+                input_size, conv_size, OutLayout, OutLayout, ConvLayout
             ]
             gpu_ctx.enqueue_function[kernel, kernel](
                 output_tensor,
diff --git a/solutions/p18/op/softmax.mojo b/solutions/p18/op/softmax.mojo
index d9674e7a..eea5d67c 100644
--- a/solutions/p18/op/softmax.mojo
+++ b/solutions/p18/op/softmax.mojo
@@ -2,14 +2,17 @@
 from std.memory import UnsafePointer
 from std.gpu import thread_idx, block_idx, block_dim, barrier
 from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from std.math import exp
 from std.bit import log2_ceil
 from std.utils.numerics import max_finite, min_finite
 
 comptime SIZE = 128  # This must be equal to INPUT_SIZE in p18.py
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
 comptime GRID_DIM_X = 1
 # Tree-based reduction require the number of threads to be the next power of two >= SIZE for correctness.
 comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)
@@ -17,28 +20,21 @@ comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)
 
 # ANCHOR: softmax_gpu_kernel_solution
 def softmax_gpu_kernel[
-    layout: Layout,
     input_size: Int,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
 ):
     comptime assert (
         dtype.is_floating_point()
     ), "dtype must be a floating-point type"
 
-    var shared_max = LayoutTensor[
-        dtype,
-        Layout.row_major(BLOCK_DIM_X),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var shared_sum = LayoutTensor[
-        dtype,
-        Layout.row_major(BLOCK_DIM_X),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared_max = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[BLOCK_DIM_X]())
+    var shared_sum = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[BLOCK_DIM_X]())
 
     var global_i = thread_idx.x
 
     # Initialize out-of-bounds (shared_max[local_i], global_i >= input_size) shared memory addresses to the minimum
@@ -92,12 +88,11 @@
 
 # ANCHOR: softmax_cpu_kernel_solution
 def softmax_cpu_kernel[
-    layout: Layout,
     input_size: Int,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
 ):
     comptime assert (
         dtype.is_floating_point()
@@ -131,32 +126,31 @@ struct SoftmaxCustomOp:
         input_size: Int,
         dtype: DType = DType.float32,
     ](
-        output: OutputTensor[rank=1, static_spec=_],
-        input: InputTensor[rank=output.rank, static_spec=_],
+        output: OutputTensor[dtype=dtype, rank=1, static_spec=_],
+        input: InputTensor[dtype=dtype, rank=output.rank, static_spec=_],
         ctx: DeviceContextPtr,
     ) raises:
-        # Note: rebind is necessary now but it shouldn't be!
-        var output_tensor = rebind[LayoutTensor[dtype, layout, MutAnyOrigin]](
-            output.to_layout_tensor()
-        )
-        var input_tensor = rebind[LayoutTensor[dtype, layout, ImmutAnyOrigin]](
-            input.to_layout_tensor()
-        )
+        var output_tensor = TileTensor[
+            mut=True, dtype, LayoutType, MutAnyOrigin
+        ](output.unsafe_ptr(), layout)
+        var input_tensor = TileTensor[
+            mut=True, dtype, LayoutType, MutAnyOrigin
+        ](input.unsafe_ptr(), layout)
 
         comptime if target == "gpu":
             var gpu_ctx = ctx.get_device_context()
             # making sure the output tensor is zeroed out before the kernel is called
             gpu_ctx.enqueue_memset(
-                DeviceBuffer[output_tensor.dtype](
+                DeviceBuffer[dtype](
                     gpu_ctx,
-                    output_tensor.ptr,
+                    output.unsafe_ptr(),
                     input_size,
                     owning=False,
                 ),
                 0,
             )
-            comptime kernel = softmax_gpu_kernel[layout, input_size, dtype]
+            comptime kernel = softmax_gpu_kernel[input_size, dtype]
             gpu_ctx.enqueue_function[kernel, kernel](
                 output_tensor,
                 input_tensor,
@@ -165,8 +159,6 @@
             )
 
         elif target == "cpu":
-            softmax_cpu_kernel[layout, input_size, dtype](
-                output_tensor, input_tensor
-            )
+            softmax_cpu_kernel[input_size, dtype](output_tensor, input_tensor)
         else:
             raise Error("Unsupported target: " + target)
diff --git a/solutions/p18/test/test_softmax.mojo b/solutions/p18/test/test_softmax.mojo
index 70b25871..c506ae79 100644
--- a/solutions/p18/test/test_softmax.mojo
+++ b/solutions/p18/test/test_softmax.mojo
@@ -1,12 +1,14 @@
 from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
 from std.testing import assert_almost_equal
 from std.bit import log2_ceil
 
 from op import softmax_gpu_kernel, softmax_cpu_kernel
 
 comptime SIZE = 128
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)
 comptime GRID_DIM_X = 1
 comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)
 comptime dtype = DType.float32
@@ -21,10 +23,11 @@ def test_softmax() raises:
         # for CPU testing
         var expected = ctx.enqueue_create_host_buffer[DType.float32](SIZE)
         expected.enqueue_fill(0)
-        var expected_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-            expected
-        )
-        # Initialize input with more reasonable values
+        var expected_tensor = TileTensor[
+            mut=True, dtype, LayoutType, MutAnyOrigin
+        ](expected, layout)
+
+        # Initialize input and compute expected (CPU) inside map_to_host block
         with inp.map_to_host() as inp_host:
             for i in range(SIZE):
                 inp_host[i] = Scalar[dtype](i)
@@ -33,22 +36,21 @@ def test_softmax() raises:
             for i in range(SIZE):
                 print(inp_host[i], end=" ")
             print()
-            # Create layout tensors for CPU calculation
-            input_host_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-                inp_host
-            )
+            # Create layout tensor for CPU calculation (must stay inside with block)
+            var input_host_tensor = TileTensor[
+                mut=True, dtype, LayoutType, MutAnyOrigin
+            ](inp_host, layout)
+            # Compute expected results using our CPU kernel while inp_host is valid
+            softmax_cpu_kernel[SIZE, dtype](expected_tensor, input_host_tensor)
 
         # for GPU testing
-        var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-        var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
-
-        # Compute expected results using our CPU kernel
-        softmax_cpu_kernel[layout, SIZE, dtype](
-            expected_tensor, input_host_tensor
-        )
+        var output_tensor = TileTensor(out, layout)
+        var input_tensor = TileTensor[
+            mut=True, dtype, LayoutType, MutAnyOrigin
+        ](inp, layout)
 
         # Run GPU kernel
-        comptime kernel = softmax_gpu_kernel[layout, SIZE, dtype]
+        comptime kernel = softmax_gpu_kernel[SIZE, dtype]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
diff --git a/solutions/p19/op/attention.mojo b/solutions/p19/op/attention.mojo
index 600b5ebe..66962e7f 100644
--- a/solutions/p19/op/attention.mojo
+++ b/solutions/p19/op/attention.mojo
@@ -1,9 +1,10 @@
 from std.memory import UnsafePointer
 from std.gpu import thread_idx, block_idx, block_dim, barrier
 from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer
-from std.gpu.memory import AddressSpace, async_copy_wait_all
-from layout import Layout, LayoutTensor
-from layout.layout_tensor import copy_dram_to_sram_async
+from std.gpu.memory import AddressSpace
+from layout import TileTensor
+from layout.tile_layout import row_major, TensorLayout
+from layout.tile_tensor import stack_allocation
 from std.math import exp
 from std.bit import log2_ceil
 from std.utils.numerics import max_finite, min_finite
@@ -22,24 +23,24 @@
 comptime SOFTMAX_BLOCK_DIM_X = 1 << log2_ceil(SEQ_LEN)
 
 
 # Tiled matrix multiplication (from p16), updated to:
-# 1) Support different layouts for input (a, b) and output LayoutTensors.
+# 1) Support different layouts for input (a, b) and output TileTensors.
 # 2) Handle cases where the inner dimension is not a multiple of MATMUL_BLOCK_DIM_XY.
 # 3) Explicitly check for out-of-bounds elements.
-# The approach still tiles all three LayoutTensors (a, b, and output) into identical square tiles
+# The approach still tiles all three TileTensors (a, b, and output) into identical square tiles
 # of size (MATMUL_BLOCK_DIM_XY x MATMUL_BLOCK_DIM_XY) with each thread loading one element
 # from a and b, and writing one element to output.
 def matmul_idiomatic_tiled[
-    a_layout: Layout,
-    b_layout: Layout,
-    out_layout: Layout,
     rows: Int,
     cols: Int,
     inner: Int,
+    OutLayout: TensorLayout,
+    ALayout: TensorLayout,
+    BLayout: TensorLayout,
     dtype: DType = DType.float32,
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, a_layout, MutAnyOrigin],
-    b: LayoutTensor[dtype, b_layout, MutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    a: TileTensor[mut=True, dtype, ALayout, MutAnyOrigin],
+    b: TileTensor[mut=True, dtype, BLayout, MutAnyOrigin],
 ):
     """Updated idiomatic tiled matrix multiplication from p16."""
     var local_row = thread_idx.y
@@ -51,89 +52,84 @@
     var out_tile = output.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
         block_idx.y, block_idx.x
     )
-    var a_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var b_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var acc: output.element_type = 0
-
-    comptime load_a_layout = Layout.row_major(
+    comptime shared_layout = row_major[
         MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
-    )  # Coalesced loading
-    comptime load_b_layout = Layout.row_major(
-        MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
-    )  # Coalesced loading
+    ]()
+    var a_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](shared_layout)
+    var b_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](shared_layout)
+    var acc: output.ElementType = 0
+
+    var a_lt = a.to_layout_tensor()
+    var b_lt = b.to_layout_tensor()
+    var out_tile_lt = out_tile.to_layout_tensor()
+    var a_shared_lt = a_shared.to_layout_tensor()
+    var b_shared_lt = b_shared.to_layout_tensor()
 
     comptime for idx in range(
         (inner + MATMUL_BLOCK_DIM_XY - 1) // MATMUL_BLOCK_DIM_XY
     ):
         # Get tiles from A and B matrices
-        var a_tile = a.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
-            block_idx.y, idx
-        )
-        var b_tile = b.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
-            idx, block_idx.x
-        )
+        var a_tile_row_start = block_idx.y * MATMUL_BLOCK_DIM_XY
+        var a_tile_col_start = idx * MATMUL_BLOCK_DIM_XY
+        var b_tile_row_start = idx * MATMUL_BLOCK_DIM_XY
+        var b_tile_col_start = block_idx.x * MATMUL_BLOCK_DIM_XY
+
+        # Synchronously load tiles to shared memory - each thread loads one element
+        var a_global_row = a_tile_row_start + local_row
+        var a_global_col = a_tile_col_start + local_col
+        if a_global_row < rows and a_global_col < inner:
+            a_shared_lt[local_row, local_col] = a_lt[a_global_row, a_global_col]
+        else:
+            a_shared_lt[local_row, local_col] = 0
+
+        var b_global_row = b_tile_row_start + local_row
+        var b_global_col = b_tile_col_start + local_col
+        if b_global_row < inner and b_global_col < cols:
+            b_shared_lt[local_row, local_col] = b_lt[b_global_row, b_global_col]
+        else:
+            b_shared_lt[local_row, local_col] = 0
 
-        # Asynchronously copy tiles to shared memory with consistent orientation
-        copy_dram_to_sram_async[
-            thread_layout=load_a_layout,
-            num_threads=MATMUL_NUM_THREADS,
-            block_dim_count=MATMUL_BLOCK_DIM_COUNT,
-        ](a_shared, a_tile)
-        copy_dram_to_sram_async[
-            thread_layout=load_b_layout,
-            num_threads=MATMUL_NUM_THREADS,
-            block_dim_count=MATMUL_BLOCK_DIM_COUNT,
-        ](b_shared, b_tile)
-
-        # Wait for all async copies to complete
-        async_copy_wait_all()
         barrier()
 
         # Compute partial matrix multiplication for this tile
-        comptime for k in range(MATMUL_BLOCK_DIM_XY):
-            if (
-                tiled_row < rows and tiled_col < cols
-            ):  # Only perform calculation for valid outputs
-                if k < a_tile.dim(
-                    1
-                ):  # Only perform calculation on valid inputs
-                    acc += a_shared[local_row, k] * b_shared[k, local_col]
+        comptime k_max = min(
+            MATMUL_BLOCK_DIM_XY, inner - idx * MATMUL_BLOCK_DIM_XY
+        )
+        comptime for k in range(k_max):
+            if
tiled_row < rows and tiled_col < cols: + acc += rebind[Scalar[dtype]]( + a_shared_lt[local_row, k] + ) * rebind[Scalar[dtype]](b_shared_lt[k, local_col]) barrier() # Write final result with bounds checking (needed for attention's variable sizes) if tiled_row < rows and tiled_col < cols: - out_tile[local_row, local_col] = acc + out_tile_lt[local_row, local_col] = acc # ANCHOR: transpose_kernel_solution def transpose_kernel[ - layout_in: Layout, # Layout for input matrix (seq_len, d) - layout_out: Layout, # Layout for output matrix (d, seq_len) rows: Int, cols: Int, + OutLayout: TensorLayout, + InLayout: TensorLayout, dtype: DType = DType.float32, ]( - output: LayoutTensor[dtype, layout_out, MutAnyOrigin], - inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + inp: TileTensor[mut=True, dtype, InLayout, MutAnyOrigin], ): """Transpose matrix using shared memory tiling for coalesced access.""" - var shared_tile = LayoutTensor[ - dtype, - Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + comptime shared_layout = row_major[ + TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY + ]() + var shared_tile = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](shared_layout) var local_row = thread_idx.y var local_col = thread_idx.x @@ -141,8 +137,12 @@ def transpose_kernel[ var global_row = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_row var global_col = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_col + var inp_lt = inp.to_layout_tensor() + var output_lt = output.to_layout_tensor() + var shared_tile_lt = shared_tile.to_layout_tensor() + if global_row < rows and global_col < cols: - shared_tile[local_row, local_col] = inp[global_row, global_col] + shared_tile_lt[local_row, local_col] = inp_lt[global_row, global_col] barrier() @@ -152,7 +152,7 @@ def transpose_kernel[ # Store data from shared memory to 
global memory (coalesced write) # Note: we transpose the shared memory access pattern if out_row < cols and out_col < rows: - output[out_row, out_col] = shared_tile[local_col, local_row] + output_lt[out_row, out_col] = shared_tile_lt[local_col, local_row] # ANCHOR_END: transpose_kernel_solution @@ -160,36 +160,33 @@ def transpose_kernel[ # Apply softmax to attention scores taken from p16 def softmax_gpu_kernel[ - layout: Layout, input_size: Int, + LayoutType: TensorLayout, dtype: DType = DType.float32, ]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - input: LayoutTensor[dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + input: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], ): comptime assert ( dtype.is_floating_point() ), "dtype must be a floating-point type" - var shared_max = LayoutTensor[ - dtype, - Layout.row_major(SOFTMAX_BLOCK_DIM_X), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - var shared_sum = LayoutTensor[ - dtype, - Layout.row_major(SOFTMAX_BLOCK_DIM_X), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + comptime softmax_layout = row_major[SOFTMAX_BLOCK_DIM_X]() + var shared_max = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](softmax_layout) + var shared_sum = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](softmax_layout) var global_i = thread_idx.x + var input_lt = input.to_layout_tensor() + var output_lt = output.to_layout_tensor() # Initialize out-of-bounds (shared_max[local_i], global_i >= input_size) shared memory addresses to the minimum # finite value for dtype, ensuring that if these elements are accessed in the parallel max reduction below they # do not influence the result (max(min_finite, x) == x for any x). 
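The identity-element padding described in the comment above is the standard trick for fixed-width reductions. A plain-Python sketch of the numerically stable block softmax this kernel computes (hypothetical `BLOCK` stands in for `SOFTMAX_BLOCK_DIM_X`; `-inf` plays the role of `min_finite[dtype]()`, and the sequential `max`/`sum` stand in for the parallel reductions):

```python
import math

BLOCK = 8  # stand-in for SOFTMAX_BLOCK_DIM_X

def block_softmax(x):
    n = len(x)
    # out-of-bounds lanes hold the reduction identity for max
    vals = [x[i] if i < n else -math.inf for i in range(BLOCK)]
    m = max(vals)  # parallel max reduction on the GPU
    # out-of-bounds lanes hold the reduction identity for sum (0.0)
    exps = [math.exp(v - m) if i < n else 0.0 for i, v in enumerate(vals)]
    s = sum(exps)  # parallel sum reduction on the GPU
    return [e / s for e in exps[:n]]
```

Because padded lanes carry the identity of each reduction, they can participate freely without affecting the result, exactly as the comment argues for `max(min_finite, x) == x`.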
var val: Scalar[dtype] = min_finite[dtype]() if global_i < input_size: - val = rebind[Scalar[dtype]](input[global_i]) + val = rebind[Scalar[dtype]](input_lt[global_i]) shared_max[global_i] = val barrier() @@ -227,25 +224,29 @@ def softmax_gpu_kernel[ # Normalize by sum if global_i < input_size: - output[global_i] = exp_val / block_sum + output_lt[global_i] = exp_val / block_sum # CPU implementation for vector attention def attention_cpu_kernel[ - layout_q: Layout, - layout_k: Layout, - layout_v: Layout, - layout_out: Layout, seq_len: Int, d: Int, + OutLayout: TensorLayout, + QLayout: TensorLayout, + KLayout: TensorLayout, + VLayout: TensorLayout, dtype: DType = DType.float32, ]( - output: LayoutTensor[dtype, layout_out, MutAnyOrigin], - q: LayoutTensor[dtype, layout_q, MutAnyOrigin], - k: LayoutTensor[dtype, layout_k, ImmutAnyOrigin], - v: LayoutTensor[dtype, layout_v, MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + q: TileTensor[mut=True, dtype, QLayout, MutAnyOrigin], + k: TileTensor[mut=True, dtype, KLayout, MutAnyOrigin], + v: TileTensor[mut=True, dtype, VLayout, MutAnyOrigin], ): """CPU implementation of vector attention.""" + var output_lt = output.to_layout_tensor() + var q_lt = q.to_layout_tensor() + var k_lt = k.to_layout_tensor() + var v_lt = v.to_layout_tensor() var scores = List[Float32]() var weights = List[Float32]() for _ in range(seq_len): @@ -256,7 +257,9 @@ def attention_cpu_kernel[ for i in range(seq_len): var score: Float32 = 0.0 for dim in range(d): - score = score + rebind[Float32](q[dim]) * rebind[Float32](k[i, dim]) + score = score + rebind[Float32](q_lt[dim]) * rebind[Float32]( + k_lt[i, dim] + ) scores[i] = score var max_score: Float32 = scores[0] @@ -276,9 +279,9 @@ def attention_cpu_kernel[ var weighted_sum: Float32 = 0.0 for i in range(seq_len): weighted_sum = weighted_sum + weights[i] * rebind[Float32]( - v[i, dim] + v_lt[i, dim] ) - output[dim] = rebind[Scalar[dtype]](weighted_sum) + output_lt[dim] = 
rebind[Scalar[dtype]](weighted_sum) @compiler.register("attention") @@ -290,31 +293,42 @@ struct AttentionCustomOp: d: Int, dtype: DType = DType.float32, ]( - output: OutputTensor[rank=1, static_spec=_], # Output vector (d,) - q: InputTensor[rank=1, static_spec=_], # Query vector (d,) - k: InputTensor[rank=2, static_spec=_], # Key matrix (seq_len, d) - v: InputTensor[rank=2, static_spec=_], # Value matrix (seq_len, d) + output: OutputTensor[ + dtype=dtype, rank=1, static_spec=_ + ], # Output vector (d,) + q: InputTensor[dtype=dtype, rank=1, static_spec=_], # Query vector (d,) + k: InputTensor[ + dtype=dtype, rank=2, static_spec=_ + ], # Key matrix (seq_len, d) + v: InputTensor[ + dtype=dtype, rank=2, static_spec=_ + ], # Value matrix (seq_len, d) ctx: DeviceContextPtr, ) raises: # Define layouts - comptime layout_q = Layout.row_major(d) - comptime layout_k = Layout.row_major(seq_len, d) - comptime layout_v = Layout.row_major(seq_len, d) - comptime layout_out = Layout.row_major(d) - comptime layout_scores = Layout.row_major(seq_len) + comptime layout_q = row_major[d]() + comptime layout_k = row_major[seq_len, d]() + comptime layout_v = row_major[seq_len, d]() + comptime layout_out = row_major[d]() + comptime layout_scores = row_major[seq_len]() + comptime QLayout = type_of(layout_q) + comptime KLayout = type_of(layout_k) + comptime VLayout = type_of(layout_v) + comptime OutLayout = type_of(layout_out) + comptime ScoresLayout = type_of(layout_scores) # Convert to layout tensors - var output_tensor = rebind[ - LayoutTensor[dtype, layout_out, MutAnyOrigin] - ](output.to_layout_tensor()) - var q_tensor = rebind[LayoutTensor[dtype, layout_q, MutAnyOrigin]]( - q.to_layout_tensor() + var output_tensor = TileTensor[ + mut=True, dtype, OutLayout, MutAnyOrigin + ](output.unsafe_ptr(), layout_out) + var q_tensor = TileTensor[mut=True, dtype, QLayout, MutAnyOrigin]( + q.unsafe_ptr(), layout_q ) - var k_tensor = rebind[LayoutTensor[dtype, layout_k, ImmutAnyOrigin]]( - 
k.to_layout_tensor() + var k_tensor = TileTensor[mut=True, dtype, KLayout, MutAnyOrigin]( + k.unsafe_ptr(), layout_k ) - var v_tensor = rebind[LayoutTensor[dtype, layout_v, MutAnyOrigin]]( - v.to_layout_tensor() + var v_tensor = TileTensor[mut=True, dtype, VLayout, MutAnyOrigin]( + v.unsafe_ptr(), layout_v ) comptime if target == "gpu": @@ -322,15 +336,20 @@ struct AttentionCustomOp: # Define layouts for matrix multiplication # Q reshaped to (1, d) - comptime layout_q_2d = Layout.row_major(1, d) + comptime layout_q_2d = row_major[1, d]() + comptime Q2DLayout = type_of(layout_q_2d) # K^T is (d, seq_len) - comptime layout_k_t = Layout.row_major(d, seq_len) + comptime layout_k_t = row_major[d, seq_len]() + comptime KTLayout = type_of(layout_k_t) # Scores as (1, seq_len) - comptime layout_scores_2d = Layout.row_major(1, seq_len) + comptime layout_scores_2d = row_major[1, seq_len]() + comptime Scores2DLayout = type_of(layout_scores_2d) # Weights as (1, seq_len) - comptime layout_weights_2d = Layout.row_major(1, seq_len) + comptime layout_weights_2d = row_major[1, seq_len]() + comptime Weights2DLayout = type_of(layout_weights_2d) # Result as (1, d) - comptime layout_result_2d = Layout.row_major(1, d) + comptime layout_result_2d = row_major[1, d]() + comptime Result2DLayout = type_of(layout_result_2d) # Transpose implementation limited to square (TRANSPOSE_BLOCK_DIM_XY x TRANSPOSE_BLOCK_DIM_XY) thread blocks comptime transpose_threads_per_block = ( @@ -367,16 +386,16 @@ struct AttentionCustomOp: seq_len ) # Reused for scores and weights - var k_t = LayoutTensor[dtype, layout_k_t, MutAnyOrigin](k_t_buf) + var k_t = TileTensor(k_t_buf, layout_k_t) # ANCHOR: attention_orchestration_solution # Step 1: Reshape Q from (d,) to (1, d) - no buffer needed - var q_2d = q_tensor.reshape[layout_q_2d]() + var q_2d = q_tensor.reshape(layout_q_2d) # Step 2: Transpose K from (seq_len, d) to K^T (d, seq_len)\ comptime kernel = transpose_kernel[ - layout_k, layout_k_t, seq_len, d, dtype + 
seq_len, d, KTLayout, KLayout, dtype ] gpu_ctx.enqueue_function[kernel, kernel]( k_t, @@ -388,16 +407,14 @@ struct AttentionCustomOp: # Step 3: Compute attention scores using matmul: Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len) # This computes Q ยท K^T[i] = Q ยท K[i] for each column i of K^T (which is row i of K) # Reuse scores_weights_buf as (1, seq_len) for scores - var scores_2d = LayoutTensor[dtype, layout_scores_2d, MutAnyOrigin]( - scores_weights_buf - ) + var scores_2d = TileTensor(scores_weights_buf, layout_scores_2d) comptime kernel2 = matmul_idiomatic_tiled[ - layout_q_2d, - layout_k_t, - layout_scores_2d, 1, seq_len, d, + Scores2DLayout, + Q2DLayout, + KTLayout, dtype, ] gpu_ctx.enqueue_function[kernel2, kernel2]( @@ -409,30 +426,38 @@ struct AttentionCustomOp: ) # Step 4: Reshape scores from (1, seq_len) to (seq_len,) for softmax - var weights = scores_2d.reshape[layout_scores]() - - # Step 5: Apply softmax to get attention weights - comptime kernel3 = softmax_gpu_kernel[layout_scores, seq_len, dtype] + var weights = scores_2d.reshape(layout_scores) + + # Step 5: Apply softmax to get attention weights (in-place) + comptime ScoresLayout = type_of(layout_scores) + comptime kernel3 = softmax_gpu_kernel[seq_len, ScoresLayout, dtype] + # Create two TileTensor views from the underlying buffer to avoid aliasing error + var weights_out = TileTensor[ + mut=True, dtype, ScoresLayout, MutAnyOrigin + ](scores_weights_buf, layout_scores) + var weights_in = TileTensor[ + mut=True, dtype, ScoresLayout, MutAnyOrigin + ](scores_weights_buf, layout_scores) gpu_ctx.enqueue_function[kernel3, kernel3]( - weights, - weights, + weights_out, + weights_in, grid_dim=softmax_blocks_per_grid, block_dim=softmax_threads, ) # Step 6: Reshape weights from (seq_len,) to (1, seq_len) for final matmul - var weights_2d = weights.reshape[layout_weights_2d]() + var weights_2d = weights.reshape(layout_weights_2d) # Step 7: Compute final result using matmul: weights @ V = (1, seq_len) @ 
(seq_len, d) -> (1, d) # Reuse out_tensor reshaped as (1, d) for result - var result_2d = output_tensor.reshape[layout_result_2d]() + var result_2d = output_tensor.reshape(layout_result_2d) comptime kernel4 = matmul_idiomatic_tiled[ - layout_weights_2d, - layout_v, - layout_result_2d, 1, d, seq_len, + Result2DLayout, + Weights2DLayout, + VLayout, dtype, ] gpu_ctx.enqueue_function[kernel4, kernel4]( @@ -447,7 +472,7 @@ struct AttentionCustomOp: elif target == "cpu": attention_cpu_kernel[ - layout_q, layout_k, layout_v, layout_out, seq_len, d, dtype + seq_len, d, OutLayout, QLayout, KLayout, VLayout, dtype ](output_tensor, q_tensor, k_tensor, v_tensor) else: diff --git a/solutions/p20/op/conv1d.mojo b/solutions/p20/op/conv1d.mojo index eb9d86af..63ea3ee4 100644 --- a/solutions/p20/op/conv1d.mojo +++ b/solutions/p20/op/conv1d.mojo @@ -2,7 +2,10 @@ from std.gpu import thread_idx, block_idx, block_dim, barrier from std.gpu.host import DeviceContext from std.gpu.memory import AddressSpace -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major, TensorLayout +from layout.tile_tensor import stack_allocation +from std.utils import Index from std.sys import argv from std.testing import assert_equal @@ -11,59 +14,56 @@ comptime BLOCKS_PER_GRID = (2, 1) def conv1d_kernel[ - in_layout: Layout, - out_layout: Layout, - conv_layout: Layout, input_size: Int, conv_size: Int, + OutLayout: TensorLayout, + InLayout: TensorLayout, + ConvLayout: TensorLayout, dtype: DType = DType.float32, ]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - input: LayoutTensor[dtype, in_layout, MutAnyOrigin], - kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + input: TileTensor[mut=True, dtype, InLayout, MutAnyOrigin], + kernel: TileTensor[mut=True, dtype, ConvLayout, MutAnyOrigin], ): var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x 
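For reference, the halo-loading scheme `conv1d_kernel` uses — each block stages `TPB` elements plus a `conv_size - 1` halo from the next block, zero-padding past the end of the input — can be sketched in plain Python (hypothetical `TPB`; the real kernel runs the per-element work across threads in parallel):

```python
TPB = 4  # stand-in for the kernel's threads-per-block constant

def conv1d_block(inp, kern, block_idx):
    conv_size = len(kern)
    base = block_idx * TPB
    # shared_a analogue: TPB elements plus the halo, zero-padded out of bounds
    shared = [inp[base + i] if base + i < len(inp) else 0.0
              for i in range(TPB + conv_size - 1)]
    out = []
    for local_i in range(TPB):
        if base + local_i < len(inp):  # one output element per "thread"
            out.append(sum(shared[local_i + j] * kern[j]
                           for j in range(conv_size)))
    return out

def conv1d(inp, kern):
    n_blocks = (len(inp) + TPB - 1) // TPB
    out = []
    for b in range(n_blocks):
        out.extend(conv1d_block(inp, kern, b))
    return out
```

The zero-initialized halo is what the diff's comment about avoiding reads from uninitialized memory refers to: without it, the last block's convolution window would read undefined values.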
+ var input_lt = input.to_layout_tensor() + var kernel_lt = kernel.to_layout_tensor() + var output_lt = output.to_layout_tensor() # first: need to account for padding - var shared_a = LayoutTensor[ - dtype, - Layout.row_major(TPB + conv_size - 1), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - var shared_b = LayoutTensor[ - dtype, - Layout.row_major(conv_size), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + var shared_a = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[TPB + conv_size - 1]()) + var shared_b = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[conv_size]()) if global_i < input_size: - shared_a[local_i] = input[global_i] + shared_a[local_i] = rebind[Scalar[dtype]](input_lt[global_i]) # second: load elements needed for convolution at block boundary if local_i < conv_size - 1: # indices from next block var next_idx = global_i + TPB if next_idx < input_size: - shared_a[TPB + local_i] = input[next_idx] + shared_a[TPB + local_i] = rebind[Scalar[dtype]](input_lt[next_idx]) else: # Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory # which is an undefined behavior shared_a[TPB + local_i] = 0 if local_i < conv_size: - shared_b[local_i] = kernel[local_i] + shared_b[local_i] = rebind[Scalar[dtype]](kernel_lt[local_i]) barrier() if global_i < input_size: - var local_sum: output.element_type = 0 + var local_sum: Scalar[dtype] = 0 comptime for j in range(conv_size): if local_i + j < TPB + conv_size - 1: local_sum += shared_a[local_i + j] * shared_b[j] - output[global_i] = local_sum + output_lt.store[1](Index(global_i), local_sum) import compiler @@ -89,12 +89,20 @@ struct Conv1DCustomOp: # the context is needed for some GPU calls ctx: DeviceContextPtr, ) raises: - var out_tensor = output.to_layout_tensor() - var input_tensor = input.to_layout_tensor() - var kernel_tensor = kernel.to_layout_tensor() - comptime 
in_layout = input_tensor.layout - comptime out_layout = out_tensor.layout - comptime conv_layout = kernel_tensor.layout + comptime out_layout_val = row_major[input_size]() + comptime OutLayout = type_of(out_layout_val) + comptime conv_layout_val = row_major[conv_size]() + comptime ConvLayout = type_of(conv_layout_val) + + var out_tensor = TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin]( + output.unsafe_ptr(), out_layout_val + ) + var input_tensor = TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin]( + input.unsafe_ptr(), out_layout_val + ) + var kernel_tensor = TileTensor[ + mut=True, dtype, ConvLayout, MutAnyOrigin + ](kernel.unsafe_ptr(), conv_layout_val) comptime if target == "gpu": var gpu_ctx = ctx.get_device_context() @@ -102,7 +110,7 @@ struct Conv1DCustomOp: gpu_ctx.enqueue_memset( DeviceBuffer[output.dtype]( gpu_ctx, - out_tensor.ptr, + output.unsafe_ptr(), input_size, owning=False, ), @@ -110,7 +118,7 @@ struct Conv1DCustomOp: ) # ANCHOR: conv1d_custom_op_solution comptime kernel = conv1d_kernel[ - in_layout, out_layout, conv_layout, input_size, conv_size + input_size, conv_size, OutLayout, OutLayout, ConvLayout ] gpu_ctx.enqueue_function[kernel, kernel]( out_tensor, diff --git a/solutions/p21/op/embedding.mojo b/solutions/p21/op/embedding.mojo index 10487336..6f8a47be 100644 --- a/solutions/p21/op/embedding.mojo +++ b/solutions/p21/op/embedding.mojo @@ -1,7 +1,8 @@ from std.math import ceildiv from std.gpu import thread_idx, block_idx, block_dim, grid_dim, barrier from std.gpu.host import DeviceContext -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major, TensorLayout from std.sys import argv from std.testing import assert_equal @@ -10,18 +11,18 @@ comptime THREADS_PER_BLOCK = 256 # ANCHOR: embedding_kernel_coalesced_solution def embedding_kernel_coalesced[ - indices_layout: Layout, - weights_layout: Layout, - out_layout: Layout, batch_size: Int, seq_len: Int, vocab_size: Int, embed_dim: 
Int, + OutLayout: TensorLayout, + IndicesLayout: TensorLayout, + WeightsLayout: TensorLayout, dtype: DType = DType.float32, ]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - indices: LayoutTensor[DType.int32, indices_layout, MutAnyOrigin], - weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + indices: TileTensor[mut=True, DType.int32, IndicesLayout, MutAnyOrigin], + weights: TileTensor[mut=True, dtype, WeightsLayout, MutAnyOrigin], ): """ Memory-coalescing focused embedding kernel. @@ -39,6 +40,10 @@ def embedding_kernel_coalesced[ if global_idx >= total_elements: return + var output_lt = output.to_layout_tensor() + var indices_lt = indices.to_layout_tensor() + var weights_lt = weights.to_layout_tensor() + # Convert to (batch, seq, embed) coordinates var batch_idx = global_idx // (seq_len * embed_dim) var remaining = global_idx % (seq_len * embed_dim) @@ -46,15 +51,15 @@ def embedding_kernel_coalesced[ var embed_idx = remaining % embed_dim # Get token index - var token_idx_val = Int(indices[batch_idx, seq_idx]) + var token_idx_val = Int(indices_lt[batch_idx, seq_idx]) # Simple, correct assignment if token_idx_val >= 0 and token_idx_val < vocab_size: - output[batch_idx, seq_idx, embed_idx] = weights[ + output_lt[batch_idx, seq_idx, embed_idx] = weights_lt[ token_idx_val, embed_idx ] else: - output[batch_idx, seq_idx, embed_idx] = 0 + output_lt[batch_idx, seq_idx, embed_idx] = 0 # ANCHOR_END: embedding_kernel_coalesced_solution @@ -62,18 +67,18 @@ def embedding_kernel_coalesced[ # ANCHOR: embedding_kernel_2d_solution def embedding_kernel_2d[ - indices_layout: Layout, - weights_layout: Layout, - out_layout: Layout, batch_size: Int, seq_len: Int, vocab_size: Int, embed_dim: Int, + OutLayout: TensorLayout, + IndicesLayout: TensorLayout, + WeightsLayout: TensorLayout, dtype: DType = DType.float32, ]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - indices: 
LayoutTensor[DType.int32, indices_layout, MutAnyOrigin], - weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + indices: TileTensor[mut=True, DType.int32, IndicesLayout, MutAnyOrigin], + weights: TileTensor[mut=True, dtype, WeightsLayout, MutAnyOrigin], ): """ 2D grid non-coalesced embedding kernel. @@ -94,20 +99,24 @@ def embedding_kernel_2d[ if batch_seq_idx >= total_positions or embed_idx >= embed_dim: return + var output_lt = output.to_layout_tensor() + var indices_lt = indices.to_layout_tensor() + var weights_lt = weights.to_layout_tensor() + # Convert to (batch, seq) coordinates var batch_idx = batch_seq_idx // seq_len var seq_idx = batch_seq_idx % seq_len # Get token index - var token_idx_val = Int(indices[batch_idx, seq_idx]) + var token_idx_val = Int(indices_lt[batch_idx, seq_idx]) # Assignment with 2D grid pattern if token_idx_val >= 0 and token_idx_val < vocab_size: - output[batch_idx, seq_idx, embed_idx] = weights[ + output_lt[batch_idx, seq_idx, embed_idx] = weights_lt[ token_idx_val, embed_idx ] else: - output[batch_idx, seq_idx, embed_idx] = 0 + output_lt[batch_idx, seq_idx, embed_idx] = 0 # ANCHOR_END: embedding_kernel_2d_solution @@ -141,13 +150,22 @@ struct EmbeddingCustomOp: ], # [vocab_size, embed_dim] ctx: DeviceContextPtr, ) raises: - var output_tensor = output.to_layout_tensor() - var indices_tensor = indices.to_layout_tensor() - var weights_tensor = weights.to_layout_tensor() - - comptime indices_layout = indices_tensor.layout - comptime weights_layout = weights_tensor.layout - comptime out_layout = output_tensor.layout + comptime out_layout_val = row_major[batch_size, seq_len, embed_dim]() + comptime OutLayout = type_of(out_layout_val) + comptime indices_layout_val = row_major[batch_size, seq_len]() + comptime IndicesLayout = type_of(indices_layout_val) + comptime weights_layout_val = row_major[vocab_size, embed_dim]() + comptime WeightsLayout = 
type_of(weights_layout_val) + + var output_tensor = TileTensor[ + mut=True, output.dtype, OutLayout, MutAnyOrigin + ](output.unsafe_ptr(), out_layout_val) + var indices_tensor = TileTensor[ + mut=True, DType.int32, IndicesLayout, MutAnyOrigin + ](indices.unsafe_ptr(), indices_layout_val) + var weights_tensor = TileTensor[ + mut=True, output.dtype, WeightsLayout, MutAnyOrigin + ](weights.unsafe_ptr(), weights_layout_val) comptime if target == "gpu": var gpu_ctx = ctx.get_device_context() @@ -156,7 +174,7 @@ struct EmbeddingCustomOp: gpu_ctx.enqueue_memset( DeviceBuffer[output.dtype]( gpu_ctx, - output_tensor.ptr, + output.unsafe_ptr(), batch_size * seq_len * embed_dim, owning=False, ), @@ -169,13 +187,13 @@ struct EmbeddingCustomOp: # Compile and launch optimized kernel comptime kernel = embedding_kernel_coalesced[ - indices_layout, - weights_layout, - out_layout, batch_size, seq_len, vocab_size, embed_dim, + OutLayout, + IndicesLayout, + WeightsLayout, output.dtype, ] var compiled_kernel = gpu_ctx.compile_function[kernel, kernel]() @@ -227,13 +245,22 @@ struct Embedding2DCustomOp: ], # [vocab_size, embed_dim] ctx: DeviceContextPtr, ) raises: - var output_tensor = output.to_layout_tensor() - var indices_tensor = indices.to_layout_tensor() - var weights_tensor = weights.to_layout_tensor() - - comptime indices_layout = indices_tensor.layout - comptime weights_layout = weights_tensor.layout - comptime out_layout = output_tensor.layout + comptime out_layout_val = row_major[batch_size, seq_len, embed_dim]() + comptime OutLayout = type_of(out_layout_val) + comptime indices_layout_val = row_major[batch_size, seq_len]() + comptime IndicesLayout = type_of(indices_layout_val) + comptime weights_layout_val = row_major[vocab_size, embed_dim]() + comptime WeightsLayout = type_of(weights_layout_val) + + var output_tensor = TileTensor[ + mut=True, output.dtype, OutLayout, MutAnyOrigin + ](output.unsafe_ptr(), out_layout_val) + var indices_tensor = TileTensor[ + mut=True, 
DType.int32, IndicesLayout, MutAnyOrigin + ](indices.unsafe_ptr(), indices_layout_val) + var weights_tensor = TileTensor[ + mut=True, output.dtype, WeightsLayout, MutAnyOrigin + ](weights.unsafe_ptr(), weights_layout_val) comptime if target == "gpu": var gpu_ctx = ctx.get_device_context() @@ -242,7 +269,7 @@ struct Embedding2DCustomOp: gpu_ctx.enqueue_memset( DeviceBuffer[output.dtype]( gpu_ctx, - output_tensor.ptr, + output.unsafe_ptr(), batch_size * seq_len * embed_dim, owning=False, ), @@ -258,13 +285,13 @@ struct Embedding2DCustomOp: # Compile and launch 2D kernel comptime kernel = embedding_kernel_2d[ - indices_layout, - weights_layout, - out_layout, batch_size, seq_len, vocab_size, embed_dim, + OutLayout, + IndicesLayout, + WeightsLayout, output.dtype, ] diff --git a/solutions/p22/op/layernorm_linear.mojo b/solutions/p22/op/layernorm_linear.mojo index 3e5a4153..fe9fea3e 100644 --- a/solutions/p22/op/layernorm_linear.mojo +++ b/solutions/p22/op/layernorm_linear.mojo @@ -1,9 +1,10 @@ from std.math import sqrt from std.gpu import thread_idx, block_idx, block_dim, barrier -from std.gpu.memory import async_copy_wait_all, AddressSpace +from std.gpu.memory import AddressSpace from std.os.atomic import Atomic -from layout import Layout, LayoutTensor -from layout.layout_tensor import copy_dram_to_sram_async +from layout import TileTensor +from layout.tile_layout import row_major, TensorLayout +from layout.tile_tensor import stack_allocation import compiler from std.runtime.asyncrt import DeviceContextPtr from tensor import InputTensor, OutputTensor @@ -18,17 +19,17 @@ comptime TRANSPOSE_BLOCK_DIM_XY = 16 # Square blocks for input and output # ANCHOR: matmul_idiomatic_tiled # Idiomatic tiled matmul from p19.mojo def matmul_idiomatic_tiled[ - a_layout: Layout, - b_layout: Layout, - out_layout: Layout, rows: Int, cols: Int, inner: Int, + OutLayout: TensorLayout, + ALayout: TensorLayout, + BLayout: TensorLayout, dtype: DType = DType.float32, ]( - output: 
LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, a_layout, MutAnyOrigin], - b: LayoutTensor[dtype, b_layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + a: TileTensor[mut=True, dtype, ALayout, MutAnyOrigin], + b: TileTensor[mut=True, dtype, BLayout, MutAnyOrigin], ): """Idiomatic tiled matrix multiplication from p19.""" var local_row = thread_idx.y @@ -40,69 +41,63 @@ def matmul_idiomatic_tiled[ var out_tile = output.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY]( block_idx.y, block_idx.x ) - var a_shared = LayoutTensor[ - dtype, - Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - var b_shared = LayoutTensor[ - dtype, - Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() - var acc: output.element_type = 0 - - comptime load_a_layout = Layout.row_major( + comptime shared_layout = row_major[ MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY - ) # Coalesced loading - comptime load_b_layout = Layout.row_major( - MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY - ) # Coalesced loading + ]() + var a_shared = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](shared_layout) + var b_shared = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](shared_layout) + var acc: output.ElementType = 0 + + var a_lt = a.to_layout_tensor() + var b_lt = b.to_layout_tensor() + var out_tile_lt = out_tile.to_layout_tensor() + var a_shared_lt = a_shared.to_layout_tensor() + var b_shared_lt = b_shared.to_layout_tensor() comptime for idx in range( (inner + MATMUL_BLOCK_DIM_XY - 1) // MATMUL_BLOCK_DIM_XY ): - # Get tiles from A and B matrices - var a_tile = a.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY]( - block_idx.y, idx - ) - var b_tile = b.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY]( - idx, block_idx.x - ) + # Synchronously load tiles to 
shared memory - each thread loads one element + var a_tile_row_start = block_idx.y * MATMUL_BLOCK_DIM_XY + var a_tile_col_start = idx * MATMUL_BLOCK_DIM_XY + var b_tile_row_start = idx * MATMUL_BLOCK_DIM_XY + var b_tile_col_start = block_idx.x * MATMUL_BLOCK_DIM_XY + + var a_global_row = a_tile_row_start + local_row + var a_global_col = a_tile_col_start + local_col + if a_global_row < rows and a_global_col < inner: + a_shared_lt[local_row, local_col] = a_lt[a_global_row, a_global_col] + else: + a_shared_lt[local_row, local_col] = 0 + + var b_global_row = b_tile_row_start + local_row + var b_global_col = b_tile_col_start + local_col + if b_global_row < inner and b_global_col < cols: + b_shared_lt[local_row, local_col] = b_lt[b_global_row, b_global_col] + else: + b_shared_lt[local_row, local_col] = 0 - # Asynchronously copy tiles to shared memory with consistent orientation - copy_dram_to_sram_async[ - thread_layout=load_a_layout, - num_threads=MATMUL_NUM_THREADS, - block_dim_count=MATMUL_BLOCK_DIM_COUNT, - ](a_shared, a_tile) - copy_dram_to_sram_async[ - thread_layout=load_b_layout, - num_threads=MATMUL_NUM_THREADS, - block_dim_count=MATMUL_BLOCK_DIM_COUNT, - ](b_shared, b_tile) - - # Wait for all async copies to complete - async_copy_wait_all() barrier() # Compute partial matrix multiplication for this tile - comptime for k in range(MATMUL_BLOCK_DIM_XY): - if ( - tiled_row < rows and tiled_col < cols - ): # Only perform calculation for valid outputs - if k < a_tile.dim( - 1 - ): # Only perform calculation on valid inputs - acc += a_shared[local_row, k] * b_shared[k, local_col] + comptime k_max = min( + MATMUL_BLOCK_DIM_XY, inner - idx * MATMUL_BLOCK_DIM_XY + ) + comptime for k in range(k_max): + if tiled_row < rows and tiled_col < cols: + acc += rebind[Scalar[dtype]]( + a_shared_lt[local_row, k] + ) * rebind[Scalar[dtype]](b_shared_lt[k, local_col]) barrier() # Write final result with bounds checking (needed for variable matrix sizes) if tiled_row < rows and 
tiled_col < cols: - out_tile[local_row, local_col] = acc + out_tile_lt[local_row, local_col] = acc # ANCHOR_END: matmul_idiomatic_tiled @@ -110,18 +105,18 @@ def matmul_idiomatic_tiled[ # ANCHOR: layernorm_kernel_solution def layernorm_kernel[ - input_layout: Layout, - ln_params_layout: Layout, - output_layout: Layout, batch_size: Int, seq_len: Int, hidden_dim: Int, + OutputLayout: TensorLayout, + InputLayout: TensorLayout, + LnParamsLayout: TensorLayout, dtype: DType = DType.float32, ]( - output: LayoutTensor[dtype, output_layout, MutAnyOrigin], - input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin], - ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin], - ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin], + input: TileTensor[mut=True, dtype, InputLayout, MutAnyOrigin], + ln_weight: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin], + ln_bias: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin], ): var batch_idx = block_idx.x var seq_idx = block_idx.y @@ -134,12 +129,17 @@ def layernorm_kernel[ ): return + var output_lt = output.to_layout_tensor() + var input_lt = input.to_layout_tensor() + var ln_weight_lt = ln_weight.to_layout_tensor() + var ln_bias_lt = ln_bias.to_layout_tensor() + # Compute statistics for this sequence position (redundant but simple) var sum_val: Scalar[dtype] = 0 var sq_sum: Scalar[dtype] = 0 comptime for h in range(hidden_dim): - var val = input[batch_idx, seq_idx, h] + var val = input_lt[batch_idx, seq_idx, h] sum_val += rebind[Scalar[dtype]](val) sq_sum += rebind[Scalar[dtype]](val * val) @@ -148,11 +148,11 @@ def layernorm_kernel[ var inv_std = 1.0 / sqrt(var_val + 1e-5) # Apply LayerNorm to this element - var input_val = input[batch_idx, seq_idx, hidden_idx] + var input_val = input_lt[batch_idx, seq_idx, hidden_idx] var normalized = (input_val - mean_val) * inv_std * rebind[Scalar[dtype]]( - ln_weight[hidden_idx] - ) + 
rebind[Scalar[dtype]](ln_bias[hidden_idx]) - output[batch_idx, seq_idx, hidden_idx] = normalized + ln_weight_lt[hidden_idx] + ) + rebind[Scalar[dtype]](ln_bias_lt[hidden_idx]) + output_lt[batch_idx, seq_idx, hidden_idx] = normalized # ANCHOR_END: layernorm_kernel_solution @@ -160,33 +160,37 @@ def layernorm_kernel[ # ANCHOR: transpose_kernel_solution def transpose_kernel[ - layout_in: Layout, - layout_out: Layout, rows: Int, cols: Int, + OutLayout: TensorLayout, + InLayout: TensorLayout, dtype: DType = DType.float32, ]( - output: LayoutTensor[dtype, layout_out, MutAnyOrigin], - inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin], + inp: TileTensor[mut=True, dtype, InLayout, MutAnyOrigin], ): """Transpose matrix using shared memory tiling for coalesced access. We will learn more about coalesced access in the next part. """ - var shared_tile = LayoutTensor[ - dtype, - Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + comptime shared_layout = row_major[ + TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY + ]() + var shared_tile = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](shared_layout) var local_row = thread_idx.y var local_col = thread_idx.x + var inp_lt = inp.to_layout_tensor() + var output_lt = output.to_layout_tensor() + var shared_tile_lt = shared_tile.to_layout_tensor() + var global_row = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_row var global_col = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_col if global_row < rows and global_col < cols: - shared_tile[local_row, local_col] = inp[global_row, global_col] + shared_tile_lt[local_row, local_col] = inp_lt[global_row, global_col] barrier() @@ -196,7 +200,7 @@ def transpose_kernel[ # Store data from shared memory to global memory (coalesced write) # Note: we transpose the shared memory access pattern if out_row < cols and out_col < 
rows: - output[out_row, out_col] = shared_tile[local_col, local_row] + output_lt[out_row, out_col] = shared_tile_lt[local_col, local_row] # ANCHOR_END: transpose_kernel_solution @@ -204,17 +208,17 @@ def transpose_kernel[ # ANCHOR: add_bias_kernel def add_bias_kernel[ - input_layout: Layout, - bias_layout: Layout, - output_layout: Layout, batch_size: Int, seq_len: Int, output_dim: Int, + OutputLayout: TensorLayout, + InputLayout: TensorLayout, + BiasLayout: TensorLayout, dtype: DType = DType.float32, ]( - output: LayoutTensor[dtype, output_layout, MutAnyOrigin], - input: LayoutTensor[dtype, input_layout, MutAnyOrigin], - bias: LayoutTensor[dtype, bias_layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin], + input: TileTensor[mut=True, dtype, InputLayout, MutAnyOrigin], + bias: TileTensor[mut=True, dtype, BiasLayout, MutAnyOrigin], ): """Simple bias addition.""" var batch_idx = block_idx.x @@ -224,9 +228,13 @@ def add_bias_kernel[ if batch_idx >= batch_size or seq_idx >= seq_len or out_idx >= output_dim: return - output[batch_idx, seq_idx, out_idx] = input[ + var output_lt = output.to_layout_tensor() + var input_lt = input.to_layout_tensor() + var bias_lt = bias.to_layout_tensor() + + output_lt[batch_idx, seq_idx, out_idx] = input_lt[ batch_idx, seq_idx, out_idx - ] + rebind[Scalar[dtype]](bias[out_idx]) + ] + rebind[Scalar[dtype]](bias_lt[out_idx]) # ANCHOR_END: add_bias_kernel @@ -234,23 +242,23 @@ def add_bias_kernel[ # ANCHOR: minimal_fused_forward_kernel_solution def minimal_fused_kernel[ - input_layout: Layout, - ln_params_layout: Layout, - weight_layout: Layout, - bias_layout: Layout, - output_layout: Layout, batch_size: Int, seq_len: Int, hidden_dim: Int, output_dim: Int, + OutputLayout: TensorLayout, + InputLayout: TensorLayout, + LnParamsLayout: TensorLayout, + WeightLayout: TensorLayout, + BiasLayout: TensorLayout, dtype: DType = DType.float32, ]( - output: LayoutTensor[dtype, output_layout, MutAnyOrigin], - input: 
LayoutTensor[dtype, input_layout, ImmutAnyOrigin], - ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin], - ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin], - linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin], - linear_bias: LayoutTensor[dtype, bias_layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin], + input: TileTensor[mut=True, dtype, InputLayout, MutAnyOrigin], + ln_weight: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin], + ln_bias: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin], + linear_weight: TileTensor[mut=True, dtype, WeightLayout, MutAnyOrigin], + linear_bias: TileTensor[mut=True, dtype, BiasLayout, MutAnyOrigin], ): """Minimal fused kernel - one thread per sequence position to avoid redundancy. """ @@ -262,12 +270,19 @@ def minimal_fused_kernel[ if batch_idx >= batch_size or seq_idx >= seq_len: return + var output_lt = output.to_layout_tensor() + var input_lt = input.to_layout_tensor() + var ln_weight_lt = ln_weight.to_layout_tensor() + var ln_bias_lt = ln_bias.to_layout_tensor() + var linear_weight_lt = linear_weight.to_layout_tensor() + var linear_bias_lt = linear_bias.to_layout_tensor() + # Step 1: Compute LayerNorm statistics once per sequence position var sum_val: Scalar[dtype] = 0 var sq_sum: Scalar[dtype] = 0 comptime for h in range(hidden_dim): - var val = input[batch_idx, seq_idx, h] + var val = input_lt[batch_idx, seq_idx, h] sum_val += rebind[Scalar[dtype]](val) sq_sum += rebind[Scalar[dtype]](val * val) @@ -280,14 +295,16 @@ def minimal_fused_kernel[ var acc: Scalar[dtype] = 0 comptime for h in range(hidden_dim): - var input_val = input[batch_idx, seq_idx, h] + var input_val = input_lt[batch_idx, seq_idx, h] var normalized = (input_val - mean_val) * inv_std * rebind[ Scalar[dtype] - ](ln_weight[h]) + rebind[Scalar[dtype]](ln_bias[h]) - acc += rebind[Scalar[dtype]](normalized * linear_weight[out_idx, h]) + ](ln_weight_lt[h]) + 
rebind[Scalar[dtype]](ln_bias_lt[h]) + acc += rebind[Scalar[dtype]]( + normalized * linear_weight_lt[out_idx, h] + ) - output[batch_idx, seq_idx, out_idx] = acc + rebind[Scalar[dtype]]( - linear_bias[out_idx] + output_lt[batch_idx, seq_idx, out_idx] = acc + rebind[Scalar[dtype]]( + linear_bias_lt[out_idx] ) @@ -296,31 +313,33 @@ def minimal_fused_kernel[ # ANCHOR: minimal_fused_backward_kernel_solution def minimal_fused_kernel_backward[ - grad_output_layout: Layout, - input_layout: Layout, - ln_params_layout: Layout, - weight_layout: Layout, - grad_input_layout: Layout, - grad_ln_weight_layout: Layout, - grad_ln_bias_layout: Layout, - grad_weight_layout: Layout, - grad_bias_layout: Layout, batch_size: Int, seq_len: Int, hidden_dim: Int, output_dim: Int, + GradInputLayout: TensorLayout, + GradLnWeightLayout: TensorLayout, + GradLnBiasLayout: TensorLayout, + GradWeightLayout: TensorLayout, + GradBiasLayout: TensorLayout, + GradOutputLayout: TensorLayout, + InputLayout: TensorLayout, + LnParamsLayout: TensorLayout, + WeightLayout: TensorLayout, dtype: DType = DType.float32, ]( - grad_input: LayoutTensor[dtype, grad_input_layout, MutAnyOrigin], - grad_ln_weight: LayoutTensor[dtype, grad_ln_weight_layout, MutAnyOrigin], - grad_ln_bias: LayoutTensor[dtype, grad_ln_bias_layout, MutAnyOrigin], - grad_weight: LayoutTensor[dtype, grad_weight_layout, MutAnyOrigin], - grad_bias: LayoutTensor[dtype, grad_bias_layout, MutAnyOrigin], - grad_output: LayoutTensor[dtype, grad_output_layout, ImmutAnyOrigin], - input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin], - ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin], - ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin], - linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin], + grad_input: TileTensor[mut=True, dtype, GradInputLayout, MutAnyOrigin], + grad_ln_weight: TileTensor[ + mut=True, dtype, GradLnWeightLayout, MutAnyOrigin + ], + grad_ln_bias: TileTensor[mut=True, dtype, 
GradLnBiasLayout, MutAnyOrigin], + grad_weight: TileTensor[mut=True, dtype, GradWeightLayout, MutAnyOrigin], + grad_bias: TileTensor[mut=True, dtype, GradBiasLayout, MutAnyOrigin], + grad_output: TileTensor[mut=True, dtype, GradOutputLayout, MutAnyOrigin], + input: TileTensor[mut=True, dtype, InputLayout, MutAnyOrigin], + ln_weight: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin], + ln_bias: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin], + linear_weight: TileTensor[mut=True, dtype, WeightLayout, MutAnyOrigin], ): """Fused backward kernel using atomic operations for safe gradient accumulation. """ @@ -332,6 +351,17 @@ def minimal_fused_kernel_backward[ if batch_idx >= batch_size or seq_idx >= seq_len: return + var grad_input_lt = grad_input.to_layout_tensor() + var grad_ln_weight_lt = grad_ln_weight.to_layout_tensor() + var grad_ln_bias_lt = grad_ln_bias.to_layout_tensor() + var grad_weight_lt = grad_weight.to_layout_tensor() + var grad_bias_lt = grad_bias.to_layout_tensor() + var grad_output_lt = grad_output.to_layout_tensor() + var input_lt = input.to_layout_tensor() + var ln_weight_lt = ln_weight.to_layout_tensor() + var ln_bias_lt = ln_bias.to_layout_tensor() + var linear_weight_lt = linear_weight.to_layout_tensor() + # Initialize gradient tensors to zero (block 0,0 only to avoid UB with atomic ops) if batch_idx == 0 and seq_idx == 0: # Initialize grad_ln_weight and grad_ln_bias @@ -356,7 +386,7 @@ def minimal_fused_kernel_backward[ var sq_sum: Scalar[dtype] = 0 comptime for h in range(hidden_dim): - var val = input[batch_idx, seq_idx, h] + var val = input_lt[batch_idx, seq_idx, h] sum_val += rebind[Scalar[dtype]](val) sq_sum += rebind[Scalar[dtype]](val * val) @@ -369,28 +399,28 @@ def minimal_fused_kernel_backward[ var grad_bias_ptr = grad_bias.ptr + out_idx _ = Atomic[dtype].fetch_add( grad_bias_ptr, - rebind[Scalar[dtype]](grad_output[batch_idx, seq_idx, out_idx]), + rebind[Scalar[dtype]](grad_output_lt[batch_idx, seq_idx, out_idx]), ) 
# Step 3: Atomically accumulate gradients w.r.t. linear weight comptime for out_idx in range(output_dim): comptime for h in range(hidden_dim): - var input_val = input[batch_idx, seq_idx, h] + var input_val = input_lt[batch_idx, seq_idx, h] var normalized = (input_val - mean_val) * inv_std var ln_output_val = normalized * rebind[Scalar[dtype]]( - ln_weight[h] - ) + rebind[Scalar[dtype]](ln_bias[h]) + ln_weight_lt[h] + ) + rebind[Scalar[dtype]](ln_bias_lt[h]) # Atomic gradient accumulation for linear weight var grad_w = ( - grad_output[batch_idx, seq_idx, out_idx] * ln_output_val + grad_output_lt[batch_idx, seq_idx, out_idx] * ln_output_val ) var grad_weight_ptr = grad_weight.ptr + out_idx * hidden_dim + h _ = Atomic.fetch_add(grad_weight_ptr, rebind[Scalar[dtype]](grad_w)) # Step 4: Atomically accumulate gradients w.r.t. LayerNorm parameters comptime for h in range(hidden_dim): - input_val = input[batch_idx, seq_idx, h] + input_val = input_lt[batch_idx, seq_idx, h] normalized = (input_val - mean_val) * inv_std # Compute gradient w.r.t. 
LayerNorm output for this h @@ -398,8 +428,8 @@ def minimal_fused_kernel_backward[ comptime for out_idx in range(output_dim): grad_ln_out = grad_ln_out + rebind[Scalar[dtype]]( - grad_output[batch_idx, seq_idx, out_idx] - * linear_weight[out_idx, h] + grad_output_lt[batch_idx, seq_idx, out_idx] + * linear_weight_lt[out_idx, h] ) # Atomic accumulation of LayerNorm parameter gradients @@ -418,18 +448,18 @@ def minimal_fused_kernel_backward[ var sum_grad_normalized_times_normalized: Scalar[dtype] = 0 comptime for h in range(hidden_dim): - h_input_val = input[batch_idx, seq_idx, h] + h_input_val = input_lt[batch_idx, seq_idx, h] h_normalized = (h_input_val - mean_val) * inv_std var h_grad_ln_out: Scalar[dtype] = 0 comptime for out_idx in range(output_dim): h_grad_ln_out = h_grad_ln_out + rebind[Scalar[dtype]]( - grad_output[batch_idx, seq_idx, out_idx] - * linear_weight[out_idx, h] + grad_output_lt[batch_idx, seq_idx, out_idx] + * linear_weight_lt[out_idx, h] ) - h_grad_norm = h_grad_ln_out * rebind[Scalar[dtype]](ln_weight[h]) + h_grad_norm = h_grad_ln_out * rebind[Scalar[dtype]](ln_weight_lt[h]) sum_grad_normalized = sum_grad_normalized + rebind[Scalar[dtype]]( h_grad_norm ) @@ -440,19 +470,19 @@ def minimal_fused_kernel_backward[ # Compute actual input gradients (no race conditions here - each thread writes to different positions) comptime for h in range(hidden_dim): - h_input_val = input[batch_idx, seq_idx, h] + h_input_val = input_lt[batch_idx, seq_idx, h] h_normalized = (h_input_val - mean_val) * inv_std var h_grad_ln_out: Scalar[dtype] = 0 comptime for out_idx in range(output_dim): h_grad_ln_out = h_grad_ln_out + rebind[Scalar[dtype]]( - grad_output[batch_idx, seq_idx, out_idx] - * linear_weight[out_idx, h] + grad_output_lt[batch_idx, seq_idx, out_idx] + * linear_weight_lt[out_idx, h] ) - h_grad_norm = h_grad_ln_out * rebind[Scalar[dtype]](ln_weight[h]) - grad_input[batch_idx, seq_idx, h] = inv_std * ( + h_grad_norm = h_grad_ln_out * 
rebind[Scalar[dtype]](ln_weight_lt[h]) + grad_input_lt[batch_idx, seq_idx, h] = inv_std * ( h_grad_norm - (sum_grad_normalized / hidden_dim) - (h_normalized * sum_grad_normalized_times_normalized / hidden_dim) @@ -482,31 +512,37 @@ struct LayerNormLinearCustomOp: linear_bias: InputTensor[dtype=dtype, rank=1, static_spec=_], ctx: DeviceContextPtr, ) raises: - comptime input_layout = input.static_spec.to_layout() - comptime ln_params_layout = ln_weight.static_spec.to_layout() - comptime weight_layout = linear_weight.static_spec.to_layout() - comptime bias_layout = linear_bias.static_spec.to_layout() - comptime output_layout = output.static_spec.to_layout() - - # Note: rebind is necessary now but it shouldn't be! - var output_tensor = rebind[ - LayoutTensor[dtype, output_layout, MutAnyOrigin] - ](output.to_layout_tensor()) - var input_tensor = rebind[ - LayoutTensor[dtype, input_layout, ImmutAnyOrigin] - ](input.to_layout_tensor()) - var ln_weight_tensor = rebind[ - LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin] - ](ln_weight.to_layout_tensor()) - var ln_bias_tensor = rebind[ - LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin] - ](ln_bias.to_layout_tensor()) - var linear_weight_tensor = rebind[ - LayoutTensor[dtype, weight_layout, ImmutAnyOrigin] - ](linear_weight.to_layout_tensor()) - var linear_bias_tensor = rebind[ - LayoutTensor[dtype, bias_layout, ImmutAnyOrigin] - ](linear_bias.to_layout_tensor()) + comptime input_layout_val = row_major[batch_size, seq_len, hidden_dim]() + comptime ln_params_layout_val = row_major[hidden_dim]() + comptime weight_layout_val = row_major[output_dim, hidden_dim]() + comptime bias_layout_val = row_major[output_dim]() + comptime output_layout_val = row_major[ + batch_size, seq_len, output_dim + ]() + comptime InputLayout = type_of(input_layout_val) + comptime LnParamsLayout = type_of(ln_params_layout_val) + comptime WeightLayout = type_of(weight_layout_val) + comptime BiasLayout = type_of(bias_layout_val) + comptime 
OutputLayout = type_of(output_layout_val) + + var output_tensor = TileTensor[ + mut=True, dtype, OutputLayout, MutAnyOrigin + ](output.unsafe_ptr(), output_layout_val) + var input_tensor = TileTensor[ + mut=True, dtype, InputLayout, MutAnyOrigin + ](input.unsafe_ptr(), input_layout_val) + var ln_weight_tensor = TileTensor[ + mut=True, dtype, LnParamsLayout, MutAnyOrigin + ](ln_weight.unsafe_ptr(), ln_params_layout_val) + var ln_bias_tensor = TileTensor[ + mut=True, dtype, LnParamsLayout, MutAnyOrigin + ](ln_bias.unsafe_ptr(), ln_params_layout_val) + var linear_weight_tensor = TileTensor[ + mut=True, dtype, WeightLayout, MutAnyOrigin + ](linear_weight.unsafe_ptr(), weight_layout_val) + var linear_bias_tensor = TileTensor[ + mut=True, dtype, BiasLayout, MutAnyOrigin + ](linear_bias.unsafe_ptr(), bias_layout_val) comptime if target == "gpu": var gpu_ctx = ctx.get_device_context() @@ -515,15 +551,15 @@ struct LayerNormLinearCustomOp: comptime if algorithm == "fused": # fused case - one thread per sequence position comptime kernel = minimal_fused_kernel[ - input_layout, - ln_params_layout, - weight_layout, - bias_layout, - output_layout, batch_size, seq_len, hidden_dim, output_dim, + OutputLayout, + InputLayout, + LnParamsLayout, + WeightLayout, + BiasLayout, ] gpu_ctx.enqueue_function[kernel, kernel]( output_tensor, @@ -541,18 +577,18 @@ struct LayerNormLinearCustomOp: var normalized_buffer = gpu_ctx.enqueue_create_buffer[dtype]( batch_size * seq_len * hidden_dim ) - var normalized_tensor = LayoutTensor[ - dtype, input_layout, MutAnyOrigin - ](normalized_buffer) + var normalized_tensor = TileTensor[ + mut=True, dtype, InputLayout, MutAnyOrigin + ](normalized_buffer, input_layout_val) # Step 1: LayerNorm kernel comptime kernel = layernorm_kernel[ - input_layout, - ln_params_layout, - input_layout, batch_size, seq_len, hidden_dim, + InputLayout, + InputLayout, + LnParamsLayout, ] gpu_ctx.enqueue_function[kernel, kernel]( normalized_tensor, @@ -577,19 +613,26 @@ struct 
LayerNormLinearCustomOp: var matmul_buffer = gpu_ctx.enqueue_create_buffer[dtype]( batch_size * seq_len * output_dim ) - var matmul_tensor = LayoutTensor[ - dtype, output_layout, MutAnyOrigin - ](matmul_buffer) + var matmul_tensor = TileTensor[ + mut=True, dtype, OutputLayout, MutAnyOrigin + ](matmul_buffer, output_layout_val) # Create transposed weight matrix: [output_dim, hidden_dim] -> [hidden_dim, output_dim] var transposed_weight_buffer = gpu_ctx.enqueue_create_buffer[ dtype ](hidden_dim * output_dim) - var transposed_weight_tensor = LayoutTensor[ + comptime transposed_weight_layout = row_major[ + hidden_dim, output_dim + ]() + comptime TransposedWeightLayout = type_of( + transposed_weight_layout + ) + var transposed_weight_tensor = TileTensor[ + mut=True, dtype, - Layout.row_major(hidden_dim, output_dim), + TransposedWeightLayout, MutAnyOrigin, - ](transposed_weight_buffer) + ](transposed_weight_buffer, transposed_weight_layout) # Transpose the weight matrix var transpose_blocks_x = ( @@ -599,10 +642,10 @@ struct LayerNormLinearCustomOp: output_dim + TRANSPOSE_BLOCK_DIM_XY - 1 ) // TRANSPOSE_BLOCK_DIM_XY comptime kernel2 = transpose_kernel[ - weight_layout, - transposed_weight_tensor.layout, output_dim, hidden_dim, + TransposedWeightLayout, + WeightLayout, ] gpu_ctx.enqueue_function[kernel2, kernel2]( transposed_weight_tensor, @@ -612,20 +655,26 @@ struct LayerNormLinearCustomOp: ) # Reshape tensors for matmul: [batch*seq, hidden] @ [hidden, output] -> [batch*seq, output] - var flat_normalized = normalized_tensor.reshape[ - Layout.row_major(batch_size * seq_len, hidden_dim) + comptime flat_normalized_layout = row_major[ + batch_size * seq_len, hidden_dim ]() - var flat_matmul = matmul_tensor.reshape[ - Layout.row_major(batch_size * seq_len, output_dim) + comptime FlatNormalizedLayout = type_of(flat_normalized_layout) + comptime flat_matmul_layout = row_major[ + batch_size * seq_len, output_dim ]() + comptime FlatMatmulLayout = type_of(flat_matmul_layout) + 
var flat_normalized = normalized_tensor.reshape( + flat_normalized_layout + ) + var flat_matmul = matmul_tensor.reshape(flat_matmul_layout) comptime kernel3 = matmul_idiomatic_tiled[ - flat_normalized.layout, - transposed_weight_tensor.layout, - flat_matmul.layout, batch_size * seq_len, output_dim, hidden_dim, + FlatMatmulLayout, + FlatNormalizedLayout, + TransposedWeightLayout, ] gpu_ctx.enqueue_function[kernel3, kernel3]( flat_matmul, @@ -636,17 +685,21 @@ struct LayerNormLinearCustomOp: ) # Step 3: Add bias - reshape matmul result back to 3D for bias addition - var reshaped_matmul = matmul_tensor.reshape[ - Layout.row_major(batch_size, seq_len, output_dim) + comptime reshaped_matmul_layout = row_major[ + batch_size, seq_len, output_dim ]() + comptime ReshapedMatmulLayout = type_of(reshaped_matmul_layout) + var reshaped_matmul = matmul_tensor.reshape( + reshaped_matmul_layout + ) comptime kernel4 = add_bias_kernel[ - reshaped_matmul.layout, - bias_layout, - output_layout, batch_size, seq_len, output_dim, + OutputLayout, + ReshapedMatmulLayout, + BiasLayout, ] gpu_ctx.enqueue_function[kernel4, kernel4]( output_tensor, @@ -722,65 +775,78 @@ struct LayerNormLinearBackwardCustomOp: linear_weight: InputTensor[dtype=dtype, rank=2, static_spec=_], ctx: DeviceContextPtr, ) raises: - comptime grad_output_layout = grad_output.static_spec.to_layout() - comptime input_layout = input.static_spec.to_layout() - comptime ln_params_layout = ln_weight.static_spec.to_layout() - comptime weight_layout = linear_weight.static_spec.to_layout() - comptime grad_input_layout = grad_input.static_spec.to_layout() - comptime grad_ln_weight_layout = grad_ln_weight.static_spec.to_layout() - comptime grad_ln_bias_layout = grad_ln_bias.static_spec.to_layout() - comptime grad_weight_layout = grad_weight.static_spec.to_layout() - comptime grad_bias_layout = grad_bias.static_spec.to_layout() - - var grad_input_tensor = rebind[ - LayoutTensor[dtype, grad_input_layout, MutAnyOrigin] - 
](grad_input.to_layout_tensor()) - var grad_ln_weight_tensor = rebind[ - LayoutTensor[dtype, grad_ln_weight_layout, MutAnyOrigin] - ](grad_ln_weight.to_layout_tensor()) - var grad_ln_bias_tensor = rebind[ - LayoutTensor[dtype, grad_ln_bias_layout, MutAnyOrigin] - ](grad_ln_bias.to_layout_tensor()) - var grad_weight_tensor = rebind[ - LayoutTensor[dtype, grad_weight_layout, MutAnyOrigin] - ](grad_weight.to_layout_tensor()) - var grad_bias_tensor = rebind[ - LayoutTensor[dtype, grad_bias_layout, MutAnyOrigin] - ](grad_bias.to_layout_tensor()) - var grad_output_tensor = rebind[ - LayoutTensor[dtype, grad_output_layout, ImmutAnyOrigin] - ](grad_output.to_layout_tensor()) - var input_tensor = rebind[ - LayoutTensor[dtype, input_layout, ImmutAnyOrigin] - ](input.to_layout_tensor()) - var ln_weight_tensor = rebind[ - LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin] - ](ln_weight.to_layout_tensor()) - var ln_bias_tensor = rebind[ - LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin] - ](ln_bias.to_layout_tensor()) - var linear_weight_tensor = rebind[ - LayoutTensor[dtype, weight_layout, ImmutAnyOrigin] - ](linear_weight.to_layout_tensor()) + comptime input_layout_val = row_major[batch_size, seq_len, hidden_dim]() + comptime ln_params_layout_val = row_major[hidden_dim]() + comptime weight_layout_val = row_major[output_dim, hidden_dim]() + comptime grad_input_layout_val = row_major[ + batch_size, seq_len, hidden_dim + ]() + comptime grad_ln_weight_layout_val = row_major[hidden_dim]() + comptime grad_ln_bias_layout_val = row_major[hidden_dim]() + comptime grad_weight_layout_val = row_major[output_dim, hidden_dim]() + comptime grad_bias_layout_val = row_major[output_dim]() + comptime grad_output_layout_val = row_major[ + batch_size, seq_len, output_dim + ]() + comptime GradOutputLayout = type_of(grad_output_layout_val) + comptime InputLayout = type_of(input_layout_val) + comptime LnParamsLayout = type_of(ln_params_layout_val) + comptime WeightLayout = 
type_of(weight_layout_val) + comptime GradInputLayout = type_of(grad_input_layout_val) + comptime GradLnWeightLayout = type_of(grad_ln_weight_layout_val) + comptime GradLnBiasLayout = type_of(grad_ln_bias_layout_val) + comptime GradWeightLayout = type_of(grad_weight_layout_val) + comptime GradBiasLayout = type_of(grad_bias_layout_val) + + var grad_input_tensor = TileTensor[ + mut=True, dtype, GradInputLayout, MutAnyOrigin + ](grad_input.unsafe_ptr(), grad_input_layout_val) + var grad_ln_weight_tensor = TileTensor[ + mut=True, dtype, GradLnWeightLayout, MutAnyOrigin + ](grad_ln_weight.unsafe_ptr(), grad_ln_weight_layout_val) + var grad_ln_bias_tensor = TileTensor[ + mut=True, dtype, GradLnBiasLayout, MutAnyOrigin + ](grad_ln_bias.unsafe_ptr(), grad_ln_bias_layout_val) + var grad_weight_tensor = TileTensor[ + mut=True, dtype, GradWeightLayout, MutAnyOrigin + ](grad_weight.unsafe_ptr(), grad_weight_layout_val) + var grad_bias_tensor = TileTensor[ + mut=True, dtype, GradBiasLayout, MutAnyOrigin + ](grad_bias.unsafe_ptr(), grad_bias_layout_val) + var grad_output_tensor = TileTensor[ + mut=True, dtype, GradOutputLayout, MutAnyOrigin + ](grad_output.unsafe_ptr(), grad_output_layout_val) + var input_tensor = TileTensor[ + mut=True, dtype, InputLayout, MutAnyOrigin + ](input.unsafe_ptr(), input_layout_val) + var ln_weight_tensor = TileTensor[ + mut=True, dtype, LnParamsLayout, MutAnyOrigin + ](ln_weight.unsafe_ptr(), ln_params_layout_val) + var ln_bias_tensor = TileTensor[ + mut=True, dtype, LnParamsLayout, MutAnyOrigin + ](ln_bias.unsafe_ptr(), ln_params_layout_val) + var linear_weight_tensor = TileTensor[ + mut=True, dtype, WeightLayout, MutAnyOrigin + ](linear_weight.unsafe_ptr(), weight_layout_val) comptime if target == "gpu": var gpu_ctx = ctx.get_device_context() # Launch backward kernel comptime kernel = minimal_fused_kernel_backward[ - grad_output_layout, - input_layout, - ln_params_layout, - weight_layout, - grad_input_layout, - grad_ln_weight_layout, - 
grad_ln_bias_layout, - grad_weight_layout, - grad_bias_layout, batch_size, seq_len, hidden_dim, output_dim, + GradInputLayout, + GradLnWeightLayout, + GradLnBiasLayout, + GradWeightLayout, + GradBiasLayout, + GradOutputLayout, + InputLayout, + LnParamsLayout, + WeightLayout, ] gpu_ctx.enqueue_function[kernel, kernel]( grad_input_tensor, diff --git a/solutions/p23/p23.mojo b/solutions/p23/p23.mojo index 49a1c04a..9ab446a6 100644 --- a/solutions/p23/p23.mojo +++ b/solutions/p23/p23.mojo @@ -1,7 +1,9 @@ from std.gpu import thread_idx, block_dim, block_idx, barrier from std.gpu.host import DeviceContext from std.gpu.host.compile import get_gpu_target -from layout import Layout, LayoutTensor +from layout import TileTensor, LayoutTensor +from layout.tile_layout import row_major, TensorLayout +from layout.tile_tensor import stack_allocation from std.utils import Index, IndexList from std.math import log2 from std.algorithm.functional import elementwise, vectorize @@ -11,18 +13,19 @@ from std.benchmark import Bench, BenchConfig, Bencher, BenchId, keep comptime SIZE = 1024 comptime rank = 1 -comptime layout = Layout.row_major(SIZE) +comptime layout = row_major[SIZE]() +comptime LayoutType = type_of(layout) comptime dtype = DType.float32 comptime SIMD_WIDTH = simd_width_of[dtype, target=get_gpu_target()]() # ANCHOR: elementwise_add_solution def elementwise_add[ - layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int + LayoutT: TensorLayout, dtype: DType, simd_width: Int, rank: Int, size: Int ]( - output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin], - a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], - b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], ctx: DeviceContext, ) raises: @parameter @@ -31,18 +34,19 @@ def elementwise_add[ simd_width: Int, rank: Int, 
alignment: Int = align_of[dtype]() ](indices: IndexList[rank]) capturing -> None: var idx = indices[0] + # Convert inside GPU kernel to avoid host-captured LayoutTensor issues + var a_lt = a.to_layout_tensor() + var b_lt = b.to_layout_tensor() + var out_lt = output.to_layout_tensor() # Note: This is thread-local SIMD - each thread processes its own vector of data # we'll later better see this hierarchy in Mojo: # SIMD within threads, warp across threads, block across warps - var a_simd = a.aligned_load[width=simd_width](Index(idx)) - var b_simd = b.aligned_load[width=simd_width](Index(idx)) + var a_simd = a_lt.aligned_load[width=simd_width](Index(idx)) + var b_simd = b_lt.aligned_load[width=simd_width](Index(idx)) var ret = a_simd + b_simd - # print( - # "idx:", idx, ", a_simd:", a_simd, ", b_simd:", b_simd, " sum:", ret - # ) - output.store[simd_width](Index(idx), ret) + out_lt.store[simd_width](Index(idx), ret) - elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx) + elementwise[add, SIMD_WIDTH, target="gpu"](size, ctx) # ANCHOR_END: elementwise_add_solution @@ -53,16 +57,16 @@ comptime TILE_SIZE = 32 def tiled_elementwise_add[ - layout: Layout, + LayoutT: TensorLayout, dtype: DType, simd_width: Int, rank: Int, size: Int, tile_size: Int, ]( - output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin], - a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], - b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], ctx: DeviceContext, ) raises: @parameter @@ -72,13 +76,13 @@ def tiled_elementwise_add[ ](indices: IndexList[rank]) capturing -> None: var tile_id = indices[0] - var output_tile = output.tile[tile_size](tile_id) - var a_tile = a.tile[tile_size](tile_id) - var b_tile = b.tile[tile_size](tile_id) + var output_tile = output.tile[tile_size](tile_id).to_layout_tensor() + var 
a_tile = a.tile[tile_size](tile_id).to_layout_tensor() + var b_tile = b.tile[tile_size](tile_id).to_layout_tensor() comptime for i in range(tile_size): - var a_vec = a_tile.load[simd_width](Index(i)) - var b_vec = b_tile.load[simd_width](Index(i)) + var a_vec = a_tile.aligned_load[width=simd_width](Index(i)) + var b_vec = b_tile.aligned_load[width=simd_width](Index(i)) var ret = a_vec + b_vec output_tile.store[simd_width](Index(i), ret) @@ -91,7 +95,7 @@ def tiled_elementwise_add[ # ANCHOR: manual_vectorized_tiled_elementwise_add_solution def manual_vectorized_tiled_elementwise_add[ - layout: Layout, + LayoutT: TensorLayout, dtype: DType, simd_width: Int, num_threads_per_tile: Int, @@ -99,9 +103,9 @@ def manual_vectorized_tiled_elementwise_add[ size: Int, tile_size: Int, ]( - output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin], - a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], - b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], ctx: DeviceContext, ) raises: # Each tile contains tile_size groups of simd_width elements @@ -113,20 +117,18 @@ def manual_vectorized_tiled_elementwise_add[ num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]() ](indices: IndexList[rank]) capturing -> None: var tile_id = indices[0] - - var output_tile = output.tile[chunk_size](tile_id) - var a_tile = a.tile[chunk_size](tile_id) - var b_tile = b.tile[chunk_size](tile_id) + # Convert inside GPU kernel to avoid host-captured LayoutTensor issues + var a_lt = a.to_layout_tensor() + var b_lt = b.to_layout_tensor() + var out_lt = output.to_layout_tensor() comptime for i in range(tile_size): var global_start = tile_id * chunk_size + i * simd_width - var a_vec = a.aligned_load[simd_width](Index(global_start)) - var b_vec = b.aligned_load[simd_width](Index(global_start)) + var a_vec 
= a_lt.aligned_load[width=simd_width](Index(global_start)) + var b_vec = b_lt.aligned_load[width=simd_width](Index(global_start)) var ret = a_vec + b_vec - # print("tile:", tile_id, "simd_group:", i, "global_start:", global_start, "a_vec:", a_vec, "b_vec:", b_vec, "result:", ret) - - output.store[simd_width](Index(global_start), ret) + out_lt.store[simd_width](Index(global_start), ret) # Number of tiles needed: each tile processes chunk_size elements var num_tiles = (size + chunk_size - 1) // chunk_size @@ -140,7 +142,7 @@ def manual_vectorized_tiled_elementwise_add[ # ANCHOR: vectorize_within_tiles_elementwise_add_solution def vectorize_within_tiles_elementwise_add[ - layout: Layout, + LayoutT: TensorLayout, dtype: DType, simd_width: Int, num_threads_per_tile: Int, @@ -148,9 +150,9 @@ def vectorize_within_tiles_elementwise_add[ size: Int, tile_size: Int, ]( - output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin], - a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], - b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin], + a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], + b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin], ctx: DeviceContext, ) raises: # Each tile contains tile_size elements (not SIMD groups) @@ -163,16 +165,20 @@ def vectorize_within_tiles_elementwise_add[ var tile_start = tile_id * tile_size var tile_end = min(tile_start + tile_size, size) var actual_tile_size = tile_end - tile_start + # Convert inside GPU kernel to avoid host-captured LayoutTensor issues + var a_lt = a.to_layout_tensor() + var b_lt = b.to_layout_tensor() + var out_lt = output.to_layout_tensor() def vectorized_add[ width: Int - ](i: Int) unified {read tile_start, read a, read b, mut output}: + ](i: Int) unified {read tile_start, read a_lt, read b_lt, mut out_lt}: var global_idx = tile_start + i if global_idx + width <= size: - var a_vec = a.aligned_load[width](Index(global_idx)) - var b_vec = 
b.aligned_load[width](Index(global_idx)) + var a_vec = a_lt.aligned_load[width](Index(global_idx)) + var b_vec = b_lt.aligned_load[width](Index(global_idx)) var result = a_vec + b_vec - output.store[width](Index(global_idx), result) + out_lt.store[width](Index(global_idx), result) # Use vectorize within each tile vectorize[simd_width](actual_tile_size, vectorized_add) @@ -192,7 +198,8 @@ def benchmark_elementwise_parameterized[ test_size: Int, tile_size: Int ](mut b: Bencher) raises: var bench_ctx = DeviceContext() - comptime layout = Layout.row_major(test_size) + comptime bench_layout = row_major[test_size]() + comptime BenchLayoutType = type_of(bench_layout) var out = bench_ctx.enqueue_create_buffer[dtype](test_size) out.enqueue_fill(0) var a = bench_ctx.enqueue_create_buffer[dtype](test_size) @@ -205,20 +212,20 @@ def benchmark_elementwise_parameterized[ a_host[i] = Scalar[dtype](2 * i) b_host[i] = Scalar[dtype](2 * i + 1) - var a_tensor = LayoutTensor[mut=False, dtype, layout, MutAnyOrigin]( - a.unsafe_ptr() - ) - var b_tensor = LayoutTensor[mut=False, dtype, layout, MutAnyOrigin]( - b_buf.unsafe_ptr() - ) - var out_tensor = LayoutTensor[mut=True, dtype, layout, MutAnyOrigin]( - out.unsafe_ptr() + var a_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](a, bench_layout) + var b_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](b_buf, bench_layout) + var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin]( + out, bench_layout ) @parameter @always_inline def elementwise_workflow(ctx: DeviceContext) raises: - elementwise_add[layout, dtype, SIMD_WIDTH, rank, test_size]( + elementwise_add[BenchLayoutType, dtype, SIMD_WIDTH, rank, test_size]( out_tensor, a_tensor, b_tensor, ctx ) @@ -233,7 +240,8 @@ def benchmark_tiled_parameterized[ test_size: Int, tile_size: Int ](mut b: Bencher) raises: var bench_ctx = DeviceContext() - comptime layout = Layout.row_major(test_size) + comptime bench_layout = 
row_major[test_size]() + comptime BenchLayoutType = type_of(bench_layout) var out = bench_ctx.enqueue_create_buffer[dtype](test_size) out.enqueue_fill(0) var a = bench_ctx.enqueue_create_buffer[dtype](test_size) @@ -246,15 +254,21 @@ def benchmark_tiled_parameterized[ a_host[i] = Scalar[dtype](2 * i) b_host[i] = Scalar[dtype](2 * i + 1) - var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr()) - var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr()) - var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) + var a_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](a, bench_layout) + var b_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](b_buf, bench_layout) + var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin]( + out, bench_layout + ) @parameter @always_inline def tiled_workflow(ctx: DeviceContext) raises: tiled_elementwise_add[ - layout, dtype, SIMD_WIDTH, rank, test_size, tile_size + BenchLayoutType, dtype, SIMD_WIDTH, rank, test_size, tile_size ](out_tensor, a_tensor, b_tensor, ctx) b.iter_custom[tiled_workflow](bench_ctx) @@ -268,7 +282,8 @@ def benchmark_manual_vectorized_parameterized[ test_size: Int, tile_size: Int ](mut b: Bencher) raises: var bench_ctx = DeviceContext() - comptime layout = Layout.row_major(test_size) + comptime bench_layout = row_major[test_size]() + comptime BenchLayoutType = type_of(bench_layout) var out = bench_ctx.enqueue_create_buffer[dtype](test_size) out.enqueue_fill(0) var a = bench_ctx.enqueue_create_buffer[dtype](test_size) @@ -281,15 +296,21 @@ def benchmark_manual_vectorized_parameterized[ a_host[i] = Scalar[dtype](2 * i) b_host[i] = Scalar[dtype](2 * i + 1) - var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr()) - var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr()) - var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) + var a_tensor = 
TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](a, bench_layout) + var b_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](b_buf, bench_layout) + var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin]( + out, bench_layout + ) @parameter @always_inline def manual_vectorized_workflow(ctx: DeviceContext) raises: manual_vectorized_tiled_elementwise_add[ - layout, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size + BenchLayoutType, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size ](out_tensor, a_tensor, b_tensor, ctx) b.iter_custom[manual_vectorized_workflow](bench_ctx) @@ -303,7 +324,8 @@ def benchmark_vectorized_parameterized[ test_size: Int, tile_size: Int ](mut b: Bencher) raises: var bench_ctx = DeviceContext() - comptime layout = Layout.row_major(test_size) + comptime bench_layout = row_major[test_size]() + comptime BenchLayoutType = type_of(bench_layout) var out = bench_ctx.enqueue_create_buffer[dtype](test_size) out.enqueue_fill(0) var a = bench_ctx.enqueue_create_buffer[dtype](test_size) @@ -316,15 +338,21 @@ def benchmark_vectorized_parameterized[ a_host[i] = Scalar[dtype](2 * i) b_host[i] = Scalar[dtype](2 * i + 1) - var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr()) - var b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr()) - var out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) + var a_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](a, bench_layout) + var b_tensor = TileTensor[ + mut=False, dtype, BenchLayoutType, ImmutAnyOrigin + ](b_buf, bench_layout) + var out_tensor = TileTensor[mut=True, dtype, BenchLayoutType, MutAnyOrigin]( + out, bench_layout + ) @parameter @always_inline def vectorized_workflow(ctx: DeviceContext) raises: vectorize_within_tiles_elementwise_add[ - layout, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size + BenchLayoutType, dtype, SIMD_WIDTH, 1, rank, test_size, tile_size ](out_tensor, 
a_tensor, b_tensor, ctx) b.iter_custom[vectorized_workflow](bench_ctx) @@ -349,8 +377,12 @@ def main() raises: b_host[i] = Scalar[dtype](2 * i + 1) expected[i] = a_host[i] + b_host[i] - var a_tensor = LayoutTensor[mut=False, dtype, layout](a.unsafe_ptr()) - var b_tensor = LayoutTensor[mut=False, dtype, layout](b.unsafe_ptr()) + var a_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin]( + a, layout + ) + var b_tensor = TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin]( + b, layout + ) ctx.synchronize() @@ -358,8 +390,10 @@ def main() raises: print("simd_width:", SIMD_WIDTH) if argv()[1] == "--elementwise": - out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) - elementwise_add[layout, dtype, SIMD_WIDTH, rank, SIZE]( + var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin]( + out, layout + ) + elementwise_add[LayoutType, dtype, SIMD_WIDTH, rank, SIZE]( out_tensor, a_tensor, b_tensor, ctx ) @@ -371,11 +405,13 @@ def main() raises: print("Puzzle 23 complete โœ…") elif argv()[1] == "--tiled": - out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) - print("tile size:", TILE_SIZE) - tiled_elementwise_add[layout, dtype, SIMD_WIDTH, rank, SIZE, TILE_SIZE]( - out_tensor, a_tensor, b_tensor, ctx + var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin]( + out, layout ) + print("tile size:", TILE_SIZE) + tiled_elementwise_add[ + LayoutType, dtype, SIMD_WIDTH, rank, SIZE, TILE_SIZE + ](out_tensor, a_tensor, b_tensor, ctx) with out.map_to_host() as out_host: print("out:", out_host) @@ -385,10 +421,12 @@ def main() raises: print("Puzzle 23 complete โœ…") elif argv()[1] == "--manual-vectorized": - out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) + var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin]( + out, layout + ) print("tile size:", TILE_SIZE) manual_vectorized_tiled_elementwise_add[ - layout, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE + LayoutType, 
dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE ](out_tensor, a_tensor, b_tensor, ctx) with out.map_to_host() as out_host: @@ -399,10 +437,12 @@ def main() raises: print("Puzzle 23 complete โœ…") elif argv()[1] == "--vectorized": - out_tensor = LayoutTensor[mut=True, dtype, layout](out.unsafe_ptr()) + var out_tensor = TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin]( + out, layout + ) print("tile size:", TILE_SIZE) vectorize_within_tiles_elementwise_add[ - layout, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE + LayoutType, dtype, SIMD_WIDTH, 1, rank, SIZE, TILE_SIZE ](out_tensor, a_tensor, b_tensor, ctx) with out.map_to_host() as out_host: diff --git a/solutions/p24/p24.mojo b/solutions/p24/p24.mojo index c2f77764..55c3efc8 100644 --- a/solutions/p24/p24.mojo +++ b/solutions/p24/p24.mojo @@ -4,7 +4,9 @@ from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer from std.gpu.primitives.warp import sum as warp_sum, WARP_SIZE from std.gpu.memory import AddressSpace from std.algorithm.functional import elementwise -from layout import Layout, LayoutTensor +from layout import TileTensor, LayoutTensor +from layout.tile_layout import row_major, TensorLayout +from layout.tile_tensor import stack_allocation from std.utils import Index, IndexList from std.sys import argv, simd_width_of, align_of from std.testing import assert_equal @@ -26,32 +28,36 @@ comptime BLOCKS_PER_GRID = (1, 1) comptime THREADS_PER_BLOCK = (WARP_SIZE, 1) # optimal choice for warp kernel comptime dtype = DType.float32 comptime SIMD_WIDTH = simd_width_of[dtype]() -comptime in_layout = Layout.row_major(SIZE) -comptime out_layout = Layout.row_major(1) +comptime in_layout = row_major[SIZE]() +comptime out_layout = row_major[1]() +comptime InLayout = type_of(in_layout) +comptime OutLayout = type_of(out_layout) # ANCHOR: traditional_approach_from_p12 def traditional_dot_product_p12_style[ - in_layout: Layout, out_layout: Layout, size: Int + InLayoutT: TensorLayout, OutLayoutT: TensorLayout, size: Int ]( 
- output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayoutT, MutAnyOrigin], + a: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin], + b: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin], ): """ This is the complex approach from p12_layout_tensor.mojo - kept for comparison. """ - var shared = LayoutTensor[ - dtype, - Layout.row_major(WARP_SIZE), - MutAnyOrigin, - address_space=AddressSpace.SHARED, - ].stack_allocation() + var a_lt = a.to_layout_tensor() + var b_lt = b.to_layout_tensor() + var out_lt = output.to_layout_tensor() + var shared = stack_allocation[ + dtype=dtype, address_space=AddressSpace.SHARED + ](row_major[WARP_SIZE]()) var global_i = block_dim.x * block_idx.x + thread_idx.x var local_i = thread_idx.x if global_i < size: - shared[local_i] = (a[global_i] * b[global_i]).reduce_add() + shared[local_i] = rebind[Scalar[dtype]](a_lt[global_i]) * rebind[ + Scalar[dtype] + ](b_lt[global_i]) else: shared[local_i] = 0.0 @@ -65,7 +71,7 @@ def traditional_dot_product_p12_style[ stride //= 2 if local_i == 0: - output[global_i // WARP_SIZE] = shared[0] + out_lt.store[1](Index(global_i // WARP_SIZE), shared[0]) # ANCHOR_END: traditional_approach_from_p12 @@ -73,25 +79,30 @@ def traditional_dot_product_p12_style[ # ANCHOR: simple_warp_kernel_solution def simple_warp_dot_product[ - in_layout: Layout, out_layout: Layout, size: Int + InLayoutT: TensorLayout, OutLayoutT: TensorLayout, size: Int ]( - output: LayoutTensor[dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], - b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayoutT, MutAnyOrigin], + a: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin], + b: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin], ): + var a_lt = a.to_layout_tensor() + var b_lt = b.to_layout_tensor() + 
var out_lt = output.to_layout_tensor() var global_i = block_dim.x * block_idx.x + thread_idx.x # Each thread computes one partial product using vectorized approach as values in Mojo are SIMD based var partial_product: Scalar[dtype] = 0 if global_i < size: - partial_product = (a[global_i] * b[global_i]).reduce_add() + partial_product = rebind[Scalar[dtype]](a_lt[global_i]) * rebind[ + Scalar[dtype] + ](b_lt[global_i]) # warp_sum() replaces all the shared memory + barriers + tree reduction var total = warp_sum(partial_product) # Only lane 0 writes the result (all lanes have the same total) if lane_id() == 0: - output[global_i // WARP_SIZE] = total + out_lt.store[1](Index(global_i // WARP_SIZE), total) # ANCHOR_END: simple_warp_kernel_solution @@ -99,16 +110,16 @@ def simple_warp_dot_product[ # ANCHOR: functional_warp_approach_solution def functional_warp_dot_product[ - layout: Layout, - out_layout: Layout, + InLayoutT: TensorLayout, + OutLayoutT: TensorLayout, dtype: DType, simd_width: Int, rank: Int, size: Int, ]( - output: LayoutTensor[mut=True, dtype, out_layout, MutAnyOrigin], - a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], - b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin], + output: TileTensor[mut=True, dtype, OutLayoutT, MutAnyOrigin], + a: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin], + b: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin], ctx: DeviceContext, ) raises: @parameter @@ -117,13 +128,19 @@ def functional_warp_dot_product[ simd_width: Int, rank: Int, alignment: Int = align_of[dtype]() ](indices: IndexList[rank]) capturing -> None: var idx = indices[0] + # Convert inside GPU kernel to avoid host-captured LayoutTensor issues + var a_lt = a.to_layout_tensor() + var b_lt = b.to_layout_tensor() + var out_lt = output.to_layout_tensor() # Each thread computes one partial product var partial_product: Scalar[dtype] = 0.0 if idx < size: - var a_val = a.load[1](Index(idx)) - var b_val = b.load[1](Index(idx)) - partial_product = 
a_val * b_val + var a_val = a_lt.load[1](Index(idx)) + var b_val = b_lt.load[1](Index(idx)) + partial_product = rebind[Scalar[dtype]](a_val) * rebind[ + Scalar[dtype] + ](b_val) else: partial_product = 0.0 @@ -132,7 +149,7 @@ def functional_warp_dot_product[ # Only lane 0 writes the result (all lanes have the same total) if lane_id() == 0: - output.store[1](Index(idx // WARP_SIZE), total) + out_lt.store[1](Index(idx // WARP_SIZE), total) # Launch exactly size == WARP_SIZE threads (one warp) to process all elements elementwise[compute_dot_product, 1, target="gpu"](size, ctx) @@ -187,8 +204,10 @@ def benchmark_simple_warp_parameterized[ test_size: Int ](mut bencher: Bencher) raises: comptime n_warps = test_size // WARP_SIZE - comptime in_layout = Layout.row_major(test_size) - comptime out_layout = Layout.row_major(n_warps) + comptime bench_in_layout = row_major[test_size]() + comptime bench_out_layout = row_major[n_warps]() + comptime BenchInLayout = type_of(bench_in_layout) + comptime BenchOutLayout = type_of(bench_out_layout) comptime n_threads = WARP_SIZE comptime n_blocks = (ceildiv(test_size, n_threads), 1) @@ -207,15 +226,21 @@ def benchmark_simple_warp_parameterized[ rand_int[dtype, test_size](b) expected_output[dtype, n_warps](expected, a, b) - var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out) + var a_tensor = TileTensor[mut=False, dtype, BenchInLayout]( + a, bench_in_layout + ) + var b_tensor = TileTensor[mut=False, dtype, BenchInLayout]( + b, bench_in_layout + ) + var out_tensor = TileTensor[mut=True, dtype, BenchOutLayout]( + out, bench_out_layout + ) @parameter @always_inline def traditional_workflow(ctx: DeviceContext) raises: comptime kernel = simple_warp_dot_product[ - in_layout, out_layout, test_size + BenchInLayout, BenchOutLayout, test_size ] ctx.enqueue_function[kernel, kernel]( out_tensor, @@ 
-239,8 +264,10 @@ def benchmark_functional_warp_parameterized[ test_size: Int ](mut bencher: Bencher) raises: comptime n_warps = test_size // WARP_SIZE - comptime in_layout = Layout.row_major(test_size) - comptime out_layout = Layout.row_major(n_warps) + comptime bench_in_layout = row_major[test_size]() + comptime bench_out_layout = row_major[n_warps]() + comptime BenchInLayout = type_of(bench_in_layout) + comptime BenchOutLayout = type_of(bench_out_layout) var bench_ctx = DeviceContext() @@ -257,15 +284,21 @@ def benchmark_functional_warp_parameterized[ rand_int[dtype, test_size](b) expected_output[dtype, n_warps](expected, a, b) - var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out) + var a_tensor = rebind[ + TileTensor[mut=False, dtype, BenchInLayout, ImmutAnyOrigin] + ](TileTensor[mut=False, dtype, BenchInLayout](a, bench_in_layout)) + var b_tensor = rebind[ + TileTensor[mut=False, dtype, BenchInLayout, ImmutAnyOrigin] + ](TileTensor[mut=False, dtype, BenchInLayout](b, bench_in_layout)) + var out_tensor = rebind[ + TileTensor[mut=True, dtype, BenchOutLayout, MutAnyOrigin] + ](TileTensor[mut=True, dtype, BenchOutLayout](out, bench_out_layout)) @parameter @always_inline def functional_warp_workflow(ctx: DeviceContext) raises: functional_warp_dot_product[ - in_layout, out_layout, dtype, SIMD_WIDTH, 1, test_size + BenchInLayout, BenchOutLayout, dtype, SIMD_WIDTH, 1, test_size ](out_tensor, a_tensor, b_tensor, ctx) bencher.iter_custom[functional_warp_workflow](bench_ctx) @@ -282,8 +315,10 @@ def benchmark_traditional_parameterized[ test_size: Int ](mut bencher: Bencher) raises: comptime n_warps = test_size // WARP_SIZE - comptime in_layout = Layout.row_major(test_size) - comptime out_layout = Layout.row_major(n_warps) + comptime bench_in_layout = row_major[test_size]() + comptime bench_out_layout = row_major[n_warps]() 
+ comptime BenchInLayout = type_of(bench_in_layout) + comptime BenchOutLayout = type_of(bench_out_layout) comptime n_blocks = (ceildiv(test_size, WARP_SIZE), 1) var bench_ctx = DeviceContext() @@ -301,16 +336,26 @@ def benchmark_traditional_parameterized[ rand_int[dtype, test_size](b) expected_output[dtype, n_warps](expected, a, b) - var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out) + var a_tensor = TileTensor[mut=False, dtype, BenchInLayout]( + a, bench_in_layout + ) + var b_tensor = TileTensor[mut=False, dtype, BenchInLayout]( + b, bench_in_layout + ) + var out_tensor = TileTensor[mut=True, dtype, BenchOutLayout]( + out, bench_out_layout + ) @parameter @always_inline def traditional_workflow(ctx: DeviceContext) raises: ctx.enqueue_function[ - traditional_dot_product_p12_style[in_layout, out_layout, test_size], - traditional_dot_product_p12_style[in_layout, out_layout, test_size], + traditional_dot_product_p12_style[ + BenchInLayout, BenchOutLayout, test_size + ], + traditional_dot_product_p12_style[ + BenchInLayout, BenchOutLayout, test_size + ], ]( out_tensor, a_tensor, @@ -333,6 +378,8 @@ def main() raises: print("WARP_SIZE:", WARP_SIZE) print("SIMD_WIDTH:", SIMD_WIDTH) comptime n_warps = SIZE // WARP_SIZE + comptime main_out_layout = row_major[n_warps]() + comptime MainOutLayout = type_of(main_out_layout) with DeviceContext() as ctx: var out = ctx.enqueue_create_buffer[dtype](n_warps) out.enqueue_fill(0) @@ -343,9 +390,15 @@ def main() raises: var expected = ctx.enqueue_create_host_buffer[dtype](n_warps) expected.enqueue_fill(0) - var out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out) - var a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a) - var b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b) + var out_tensor = rebind[ + TileTensor[mut=True, dtype, MainOutLayout, 
MutAnyOrigin] + ](TileTensor[mut=True, dtype, MainOutLayout](out, main_out_layout)) + var a_tensor = rebind[ + TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin] + ](TileTensor[mut=False, dtype, InLayout](a, in_layout)) + var b_tensor = rebind[ + TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin] + ](TileTensor[mut=False, dtype, InLayout](b, in_layout)) with a.map_to_host() as a_host, b.map_to_host() as b_host: for i in range(SIZE): @@ -355,10 +408,10 @@ def main() raises: if argv()[1] == "--traditional": ctx.enqueue_function[ traditional_dot_product_p12_style[ - in_layout, out_layout, SIZE + InLayout, MainOutLayout, SIZE ], traditional_dot_product_p12_style[ - in_layout, out_layout, SIZE + InLayout, MainOutLayout, SIZE ], ]( out_tensor, @@ -369,8 +422,8 @@ def main() raises: ) elif argv()[1] == "--kernel": ctx.enqueue_function[ - simple_warp_dot_product[in_layout, out_layout, SIZE], - simple_warp_dot_product[in_layout, out_layout, SIZE], + simple_warp_dot_product[InLayout, MainOutLayout, SIZE], + simple_warp_dot_product[InLayout, MainOutLayout, SIZE], ]( out_tensor, a_tensor, @@ -380,7 +433,7 @@ def main() raises: ) elif argv()[1] == "--functional": functional_warp_dot_product[ - in_layout, out_layout, dtype, SIMD_WIDTH, 1, SIZE + InLayout, MainOutLayout, dtype, SIMD_WIDTH, 1, SIZE ](out_tensor, a_tensor, b_tensor, ctx) expected_output[dtype, n_warps](expected, a, b) check_result[dtype, n_warps, True](out, expected) diff --git a/solutions/p25/p25.mojo b/solutions/p25/p25.mojo index 36130776..c33da7f8 100644 --- a/solutions/p25/p25.mojo +++ b/solutions/p25/p25.mojo @@ -1,7 +1,8 @@ from std.gpu import thread_idx, block_idx, block_dim, lane_id from std.gpu.host import DeviceContext from std.gpu.primitives.warp import shuffle_down, broadcast, WARP_SIZE -from layout import Layout, LayoutTensor +from layout import TileTensor +from layout.tile_layout import row_major, TensorLayout from std.sys import argv from std.testing import assert_equal, assert_almost_equal 
@@ -10,15 +11,16 @@ comptime SIZE = WARP_SIZE comptime BLOCKS_PER_GRID = (1, 1) comptime THREADS_PER_BLOCK = (WARP_SIZE, 1) comptime dtype = DType.float32 -comptime layout = Layout.row_major(SIZE) +comptime layout = row_major[SIZE]() +comptime LayoutType = type_of(layout) # ANCHOR: neighbor_difference_solution def neighbor_difference[ - layout: Layout, size: Int + size: Int ]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - input: LayoutTensor[dtype, layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin], ): """ Compute finite differences: output[i] = input[i+1] - input[i] @@ -52,15 +54,16 @@ def neighbor_difference[ comptime SIZE_2 = 64 comptime BLOCKS_PER_GRID_2 = (2, 1) comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1) -comptime layout_2 = Layout.row_major(SIZE_2) +comptime layout_2 = row_major[SIZE_2]() +comptime Layout2Type = type_of(layout_2) # ANCHOR: moving_average_3_solution def moving_average_3[ - layout: Layout, size: Int + size: Int ]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - input: LayoutTensor[dtype, layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, Layout2Type, MutAnyOrigin], + input: TileTensor[mut=False, dtype, Layout2Type, MutAnyOrigin], ): """ Compute 3-point moving average: output[i] = (input[i] + input[i+1] + input[i+2]) / 3 @@ -92,10 +95,10 @@ def moving_average_3[ # ANCHOR: broadcast_shuffle_coordination_solution def broadcast_shuffle_coordination[ - layout: Layout, size: Int + size: Int ]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - input: LayoutTensor[dtype, layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin], ): """ Combine broadcast() and shuffle_down() for advanced warp coordination. 
@@ -107,11 +110,11 @@ def broadcast_shuffle_coordination[ if global_i < size: # Step 1: Lane 0 computes block-local scaling factor - var scale_factor: output.element_type = 0.0 + var scale_factor: output.ElementType = 0.0 if lane == 0: # Compute average of first 4 elements in this block's data var block_start = block_idx.x * block_dim.x - var sum: output.element_type = 0.0 + var sum: output.ElementType = 0.0 for i in range(4): if block_start + i < size: sum += input[block_start + i] @@ -138,10 +141,10 @@ def broadcast_shuffle_coordination[ # ANCHOR: basic_broadcast_solution def basic_broadcast[ - layout: Layout, size: Int + size: Int ]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - input: LayoutTensor[dtype, layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin], ): """ Basic broadcast: Lane 0 computes a block-local value, broadcasts it to all lanes. @@ -152,10 +155,10 @@ def basic_broadcast[ if global_i < size: # Step 1: Lane 0 computes special value (sum of first 4 elements in this block) - var broadcast_value: output.element_type = 0.0 + var broadcast_value: output.ElementType = 0.0 if lane == 0: var block_start = block_idx.x * block_dim.x - var sum: output.element_type = 0.0 + var sum: output.ElementType = 0.0 for i in range(4): if block_start + i < size: sum += input[block_start + i] @@ -173,10 +176,10 @@ def basic_broadcast[ # ANCHOR: conditional_broadcast_solution def conditional_broadcast[ - layout: Layout, size: Int + size: Int ]( - output: LayoutTensor[dtype, layout, MutAnyOrigin], - input: LayoutTensor[dtype, layout, ImmutAnyOrigin], + output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin], + input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin], ): """ Conditional broadcast: Lane 0 makes a decision based on block-local data, broadcasts it to all lanes. 
@@ -187,7 +190,7 @@ def conditional_broadcast[ if global_i < size: # Step 1: Lane 0 analyzes block-local data and makes decision (find max of first 8 in block) - var decision_value: output.element_type = 0.0 + var decision_value: output.ElementType = 0.0 if lane == 0: var block_start = block_idx.x * block_dim.x decision_value = input[block_start] if block_start < size else 0.0 @@ -224,14 +227,14 @@ def test_neighbor_difference() raises: for i in range(SIZE): input_host[i] = Scalar[dtype](i * i) - var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin]( - input_buf + var input_tensor = TileTensor[mut=False, dtype, LayoutType]( + input_buf, layout ) - var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin]( - output_buf + var output_tensor = TileTensor[mut=True, dtype, LayoutType]( + output_buf, layout ) - comptime kernel = neighbor_difference[layout, SIZE] + comptime kernel = neighbor_difference[SIZE] ctx.enqueue_function[kernel, kernel]( output_tensor, input_tensor, @@ -273,14 +276,14 @@ def test_moving_average() raises: for i in range(1, SIZE_2): input_host[i] = input_host[i - 1] + Scalar[dtype](i + 1) - var input_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin]( - input_buf + var input_tensor = TileTensor[mut=False, dtype, Layout2Type]( + input_buf, layout_2 ) - var output_tensor = LayoutTensor[dtype, layout_2, MutAnyOrigin]( - output_buf + var output_tensor = TileTensor[mut=True, dtype, Layout2Type]( + output_buf, layout_2 ) - comptime kernel = moving_average_3[layout_2, SIZE_2] + comptime kernel = moving_average_3[SIZE_2] ctx.enqueue_function[kernel, kernel]( output_tensor, input_tensor, @@ -344,14 +347,14 @@ def test_broadcast_shuffle_coordination() raises: else: input_host[i] = Scalar[dtype](((i - 4) % 4) * 2 + 1) - var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin]( - input_buf + var input_tensor = TileTensor[mut=False, dtype, LayoutType]( + input_buf, layout ) - var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin]( - 
-        output_buf
+    var output_tensor = TileTensor[mut=True, dtype, LayoutType](
+        output_buf, layout
     )

-    comptime kernel = broadcast_shuffle_coordination[layout, SIZE]
+    comptime kernel = broadcast_shuffle_coordination[SIZE]
     ctx.enqueue_function[kernel, kernel](
         output_tensor,
         input_tensor,
@@ -399,14 +402,14 @@ def test_basic_broadcast() raises:
     for i in range(SIZE):
         input_host[i] = Scalar[dtype](i + 1)

-    var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-        input_buf
+    var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+        input_buf, layout
     )
-    var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-        output_buf
+    var output_tensor = TileTensor[mut=True, dtype, LayoutType](
+        output_buf, layout
     )

-    comptime kernel = basic_broadcast[layout, SIZE]
+    comptime kernel = basic_broadcast[SIZE]
     ctx.enqueue_function[kernel, kernel](
         output_tensor,
         input_tensor,
@@ -460,14 +463,14 @@ def test_conditional_broadcast() raises:
     for i in range(SIZE):
         input_host[i] = test_values[i % len(test_values)]

-    var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-        input_buf
+    var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+        input_buf, layout
     )
-    var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-        output_buf
+    var output_tensor = TileTensor[mut=True, dtype, LayoutType](
+        output_buf, layout
     )

-    comptime kernel = conditional_broadcast[layout, SIZE]
+    comptime kernel = conditional_broadcast[SIZE]
     ctx.enqueue_function[kernel, kernel](
         output_tensor,
         input_tensor,
diff --git a/solutions/p26/p26.mojo b/solutions/p26/p26.mojo
index ebf102b6..f3a68270 100644
--- a/solutions/p26/p26.mojo
+++ b/solutions/p26/p26.mojo
@@ -1,7 +1,8 @@
 from std.gpu import thread_idx, block_idx, block_dim, lane_id
 from std.gpu.host import DeviceContext
 from std.gpu.primitives.warp import shuffle_xor, prefix_sum, WARP_SIZE
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
 from std.sys import argv
 from std.testing import assert_equal, assert_almost_equal
@@ -10,15 +11,16 @@ comptime SIZE = WARP_SIZE
 comptime BLOCKS_PER_GRID = (1, 1)
 comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
 comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)


 # ANCHOR: butterfly_pair_swap_solution
 def butterfly_pair_swap[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     """
     Basic butterfly pair swap: Exchange values between adjacent pairs using XOR pattern.
@@ -45,10 +47,10 @@ def butterfly_pair_swap[

 # ANCHOR: butterfly_parallel_max_solution
 def butterfly_parallel_max[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     """
     Parallel maximum reduction using butterfly pattern.
@@ -78,15 +80,16 @@ def butterfly_parallel_max[
 comptime SIZE_2 = 64
 comptime BLOCKS_PER_GRID_2 = (2, 1)
 comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
-comptime layout_2 = Layout.row_major(SIZE_2)
+comptime layout_2 = row_major[SIZE_2]()
+comptime Layout2Type = type_of(layout_2)


 # ANCHOR: butterfly_conditional_max_solution
 def butterfly_conditional_max[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, Layout2Type, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, Layout2Type, ImmutAnyOrigin],
 ):
     """
     Conditional butterfly maximum: Perform butterfly max reduction, but only store result
@@ -123,10 +126,10 @@

 # ANCHOR: warp_inclusive_prefix_sum_solution
 def warp_inclusive_prefix_sum[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
 ):
     """
     Inclusive prefix sum using warp primitive: Each thread gets sum of all elements up to and including its position.
@@ -166,10 +169,10 @@

 # ANCHOR: warp_partition_solution
 def warp_partition[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
     pivot: Float32,
 ):
     """
@@ -237,14 +240,12 @@ def test_butterfly_pair_swap() raises:
     for i in range(SIZE):
         input_host[i] = Scalar[dtype](i)

-    var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-        input_buf
-    )
-    var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-        output_buf
+    var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+        input_buf, layout
     )
+    var output_tensor = TileTensor(output_buf, layout)

-    comptime kernel = butterfly_pair_swap[layout, SIZE]
+    comptime kernel = butterfly_pair_swap[SIZE]
     ctx.enqueue_function[kernel, kernel](
         output_tensor,
         input_tensor,
@@ -288,14 +289,12 @@ def test_butterfly_parallel_max() raises:
     # Make sure we have a clear maximum
     input_host[SIZE - 1] = 1000.0

-    var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-        input_buf
-    )
-    var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-        output_buf
+    var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+        input_buf, layout
     )
+    var output_tensor = TileTensor(output_buf, layout)

-    comptime kernel = butterfly_parallel_max[layout, SIZE]
+    comptime kernel = butterfly_parallel_max[SIZE]
     ctx.enqueue_function[kernel, kernel](
         output_tensor,
         input_tensor,
@@ -334,14 +333,12 @@ def test_butterfly_conditional_max() raises:
         else:
             input_host[i] = Scalar[dtype](i % 10)

-    var input_tensor = LayoutTensor[dtype, layout_2, ImmutAnyOrigin](
-        input_buf
-    )
-    var output_tensor = LayoutTensor[dtype, layout_2, MutAnyOrigin](
-        output_buf
+    var input_tensor = TileTensor[mut=False, dtype, Layout2Type](
+        input_buf, layout_2
     )
+    var output_tensor = TileTensor(output_buf, layout_2)

-    comptime kernel = butterfly_conditional_max[layout_2, SIZE_2]
+    comptime kernel = butterfly_conditional_max[SIZE_2]
     ctx.enqueue_function[kernel, kernel](
         output_tensor,
         input_tensor,
@@ -394,14 +391,12 @@ def test_warp_inclusive_prefix_sum() raises:
     for i in range(SIZE):
         input_host[i] = Scalar[dtype](i + 1)

-    var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-        input_buf
-    )
-    var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-        output_buf
+    var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+        input_buf, layout
     )
+    var output_tensor = TileTensor(output_buf, layout)

-    comptime kernel = warp_inclusive_prefix_sum[layout, SIZE]
+    comptime kernel = warp_inclusive_prefix_sum[SIZE]
     ctx.enqueue_function[kernel, kernel](
         output_tensor,
         input_tensor,
@@ -461,14 +456,12 @@ def test_warp_partition() raises:
     for i in range(SIZE):
         input_host[i] = Scalar[dtype](test_values[i % len(test_values)])

-    var input_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](
-        input_buf
-    )
-    var output_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](
-        output_buf
+    var input_tensor = TileTensor[mut=False, dtype, LayoutType](
+        input_buf, layout
     )
+    var output_tensor = TileTensor(output_buf, layout)

-    comptime kernel = warp_partition[layout, SIZE]
+    comptime kernel = warp_partition[SIZE]
     ctx.enqueue_function[kernel, kernel](
         output_tensor,
         input_tensor,
diff --git a/solutions/p27/p27.mojo b/solutions/p27/p27.mojo
index 7d2c6206..2f095125 100644
--- a/solutions/p27/p27.mojo
+++ b/solutions/p27/p27.mojo
@@ -4,7 +4,9 @@ from std.gpu.primitives.warp import WARP_SIZE
 from std.gpu.primitives import block
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from std.sys import argv
 from std.testing import assert_equal
 from std.math import floor
@@ -12,18 +14,20 @@
 comptime SIZE = 128
 comptime TPB = 128
 comptime NUM_BINS = 8
-comptime in_layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(1)
+comptime in_layout = row_major[SIZE]()
+comptime out_layout = row_major[1]()
 comptime dtype = DType.float32
+comptime InLayout = type_of(in_layout)
+comptime OutLayout = type_of(out_layout)


 # ANCHOR: block_sum_dot_product_solution
 def block_sum_dot_product[
-    in_layout: Layout, out_layout: Layout, tpb: Int
+    tpb: Int
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
     size: Int,
 ):
     """Dot product using block.sum() - convenience function like warp.sum()!
@@ -35,7 +39,7 @@
     # Each thread computes partial product
     var partial_product: Scalar[dtype] = 0.0
     if global_i < size:
-        # LayoutTensor indexing `[0]` returns the underlying SIMD value
+        # TileTensor indexing `[0]` returns the underlying SIMD value
         partial_product = a[global_i][0] * b[global_i][0]

     # The magic: block.sum() replaces 15+ lines of manual reduction!
@@ -54,22 +58,19 @@

 # ANCHOR: traditional_dot_product_solution
 def traditional_dot_product[
-    in_layout: Layout, out_layout: Layout, tpb: Int
+    tpb: Int
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+    b: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
     size: Int,
 ):
     """Traditional dot product using shared memory + barriers + tree reduction.

     Educational but complex - shows the manual coordination needed."""
-    var shared = LayoutTensor[
-        dtype,
-        Layout.row_major(tpb),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[tpb]())

     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
@@ -96,16 +97,17 @@
 # ANCHOR_END: traditional_dot_product_solution


-comptime bin_layout = Layout.row_major(SIZE)  # Max SIZE elements per bin
+comptime bin_layout = row_major[SIZE]()  # Max SIZE elements per bin
+comptime BinLayout = type_of(bin_layout)


 # ANCHOR: block_histogram_solution
 def block_histogram_bin_extract[
-    in_layout: Layout, bin_layout: Layout, out_layout: Layout, tpb: Int
+    tpb: Int
 ](
-    input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
-    bin_output: LayoutTensor[dtype, bin_layout, MutAnyOrigin],
-    count_output: LayoutTensor[DType.int32, out_layout, MutAnyOrigin],
+    input_data: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+    bin_output: TileTensor[mut=True, dtype, BinLayout, MutAnyOrigin],
+    count_output: TileTensor[mut=True, DType.int32, OutLayout, MutAnyOrigin],
     size: Int,
     target_bin: Int,
     num_bins: Int,
@@ -160,23 +162,24 @@
 # ANCHOR_END: block_histogram_solution


-comptime vector_layout = Layout.row_major(SIZE)  # For full vector output
+comptime vector_layout = row_major[SIZE]()  # For full vector output
+comptime VectorLayout = type_of(vector_layout)


 # ANCHOR: block_normalize_solution
 def block_normalize_vector[
-    in_layout: Layout, out_layout: Layout, tpb: Int
+    tpb: Int
 ](
-    input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
-    output_data: LayoutTensor[dtype, out_layout, MutAnyOrigin],
+    input_data: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
+    output_data: TileTensor[mut=True, dtype, VectorLayout, MutAnyOrigin],
     size: Int,
 ):
     """Vector mean normalization using block.sum() + block.broadcast() combination.

     This demonstrates the complete block operations workflow:
-    1. Use block.sum() to compute sum of all elements (all → one)
+    1. Use block.sum() to compute sum of all elements (all -> one)
     2. Thread 0 computes mean = sum / size
-    3. Use block.broadcast() to share mean to all threads (one → all)
+    3. Use block.broadcast() to share mean to all threads (one -> all)
     4. Each thread normalizes: output[i] = input[i] / mean
     """
@@ -242,20 +245,18 @@ def main() raises:
             print("TPB:", TPB)
             print("Expected result:", expected)

-            a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
-            b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b_buf)
-            out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+            a_tensor = TileTensor[mut=False, dtype, InLayout](a, in_layout)
+            b_tensor = TileTensor[mut=False, dtype, InLayout](b_buf, in_layout)
+            out_tensor = TileTensor(out, out_layout)

             # Traditional approach: works perfectly when size == TPB
-            comptime kernel = traditional_dot_product[
-                in_layout, out_layout, TPB
-            ]
+            comptime kernel = traditional_dot_product[TPB]
             ctx.enqueue_function[kernel, kernel](
                 out_tensor,
                 a_tensor,
                 b_tensor,
                 SIZE,
-                grid_dim=(1, 1),  # ✅ Single block works when size == TPB
+                grid_dim=(1, 1),
                 block_dim=(TPB, 1),
             )
@@ -287,12 +288,12 @@
             print("TPB:", TPB)
             print("Expected result:", expected)

-            a_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](a)
-            b_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](b_buf)
-            out_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](out)
+            a_tensor = TileTensor[mut=False, dtype, InLayout](a, in_layout)
+            b_tensor = TileTensor[mut=False, dtype, InLayout](b_buf, in_layout)
+            out_tensor = TileTensor(out, out_layout)

             # Block.sum(): Same result with dramatically simpler code!
-            comptime kernel = block_sum_dot_product[in_layout, out_layout, TPB]
+            comptime kernel = block_sum_dot_product[TPB]
             ctx.enqueue_function[kernel, kernel](
                 out_tensor,
                 a_tensor,
@@ -341,8 +342,8 @@
                 print("...")
             print()

-            input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
-                input_buf
+            input_tensor = TileTensor[mut=False, dtype, InLayout](
+                input_buf, in_layout
             )

             # Demonstrate histogram for each bin using block.prefix_sum()
@@ -363,17 +364,11 @@
                 var bin_count = ctx.enqueue_create_buffer[DType.int32](1)
                 bin_count.enqueue_fill(0)

-                var bin_tensor = LayoutTensor[dtype, bin_layout, MutAnyOrigin](
-                    bin_data
-                )
-                var count_tensor = LayoutTensor[
-                    DType.int32, out_layout, MutAnyOrigin
-                ](bin_count)
+                var bin_tensor = TileTensor(bin_data, bin_layout)
+                var count_tensor = TileTensor(bin_count, out_layout)

                 # Execute histogram kernel for this specific bin
-                comptime kernel = block_histogram_bin_extract[
-                    in_layout, bin_layout, out_layout, TPB
-                ]
+                comptime kernel = block_histogram_bin_extract[TPB]
                 ctx.enqueue_function[kernel, kernel](
                     input_tensor,
                     bin_tensor,
@@ -439,17 +434,13 @@
             print("Mean value:", mean_value)
             print()

-            input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
-                input_buf
+            input_tensor = TileTensor[mut=False, dtype, InLayout](
+                input_buf, in_layout
             )
-            var output_tensor = LayoutTensor[
-                dtype, vector_layout, MutAnyOrigin
-            ](output_buf)
+            var output_tensor = TileTensor(output_buf, vector_layout)

             # Execute vector normalization kernel
-            comptime kernel = block_normalize_vector[
-                in_layout, vector_layout, TPB
-            ]
+            comptime kernel = block_normalize_vector[TPB]
             ctx.enqueue_function[kernel, kernel](
                 input_tensor,
                 output_tensor,
diff --git a/solutions/p28/p28.mojo b/solutions/p28/p28.mojo
index 047e8b94..a6438d95 100644
--- a/solutions/p28/p28.mojo
+++ b/solutions/p28/p28.mojo
@@ -1,7 +1,9 @@
 from std.gpu import thread_idx, block_idx, block_dim, grid_dim, barrier
 from std.gpu.host import DeviceContext
 from std.gpu.memory import async_copy_wait_all, AddressSpace
-from layout import Layout, LayoutTensor
+from layout import Layout, LayoutTensor, TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from layout.layout_tensor import copy_dram_to_sram_async
 from std.sys import argv, info
 from std.testing import assert_equal, assert_almost_equal
@@ -17,16 +19,18 @@ comptime BLOCKS_PER_GRID_ASYNC = (
 ) // CONV_TILE_SIZE
 comptime THREADS_PER_BLOCK_ASYNC = 256
 comptime dtype = DType.float32
-comptime layout_async = Layout.row_major(VECTOR_SIZE)
+comptime layout_async = row_major[VECTOR_SIZE]()
+comptime AsyncLayoutType = type_of(layout_async)
+comptime kernel_layout = Layout.row_major(KERNEL_SIZE)


 # ANCHOR: async_copy_overlap_convolution_solution
 def async_copy_overlap_convolution[
-    dtype: DType, layout: Layout
+    dtype: DType
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
-    kernel: LayoutTensor[dtype, Layout.row_major(KERNEL_SIZE), ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, AsyncLayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, AsyncLayoutType, MutAnyOrigin],
+    kernel: LayoutTensor[dtype, kernel_layout, ImmutAnyOrigin],
 ):
     """Demonstrates async copy operations building on p14 patterns.
@@ -52,7 +56,7 @@
     # Phase 1: Launch async copy for input tile
     # Note: tile() does NOT perform bounds checking - ensure valid tile bounds
-    var input_tile = input.tile[CONV_TILE_SIZE](block_idx.x)
+    var input_tile = input.tile[CONV_TILE_SIZE](block_idx.x).to_layout_tensor()

     # Use async copy with thread layout matching p14 pattern
     comptime load_layout = Layout.row_major(THREADS_PER_BLOCK_ASYNC)
@@ -68,8 +72,8 @@
     # Phase 4: Compute convolution
     var global_i = block_idx.x * CONV_TILE_SIZE + local_i
-    if local_i < CONV_TILE_SIZE and global_i < output.shape[0]():
-        var result: output.element_type = 0
+    if local_i < CONV_TILE_SIZE and global_i < Int(output.dim[0]()):
+        var result: output.ElementType = 0

         # Simple convolution avoiding boundary issues
         if local_i >= HALO_SIZE and local_i < CONV_TILE_SIZE - HALO_SIZE:
@@ -77,10 +81,12 @@
             for k in range(KERNEL_SIZE):
                 var input_idx = local_i + k - HALO_SIZE
                 if input_idx >= 0 and input_idx < CONV_TILE_SIZE:
-                    result += input_shared[input_idx] * kernel_shared[k]
+                    result += rebind[Scalar[dtype]](
+                        input_shared[input_idx]
+                    ) * rebind[Scalar[dtype]](kernel_shared[k])
         else:
             # For boundary elements, just copy input (no convolution)
-            result = input_shared[local_i]
+            result = rebind[Scalar[dtype]](input_shared[local_i])

         output[global_i] = result
@@ -108,17 +114,17 @@
     for i in range(KERNEL_SIZE):
         kernel_host[i] = Scalar[dtype](i + 1)

-    var input_tensor = LayoutTensor[dtype, layout_async, ImmutAnyOrigin](
-        input_buf
+    var input_tensor = TileTensor[mut=False, dtype, AsyncLayoutType](
+        input_buf, layout_async
     )
-    var output_tensor = LayoutTensor[dtype, layout_async, MutAnyOrigin](
-        output_buf
+    var output_tensor = TileTensor[mut=True, dtype, AsyncLayoutType](
+        output_buf, layout_async
+    )
+    var kernel_tensor = LayoutTensor[dtype, kernel_layout, ImmutAnyOrigin](
+        kernel_buf
     )
-    var kernel_tensor = LayoutTensor[
-        mut=False, dtype, Layout.row_major(KERNEL_SIZE)
-    ](kernel_buf)

-    comptime kernel = async_copy_overlap_convolution[dtype, layout_async]
+    comptime kernel = async_copy_overlap_convolution[dtype]
     ctx.enqueue_function[kernel, kernel](
         output_tensor,
         input_tensor,
diff --git a/solutions/p29/p29.mojo b/solutions/p29/p29.mojo
index 31a9d858..9d92a2eb 100644
--- a/solutions/p29/p29.mojo
+++ b/solutions/p29/p29.mojo
@@ -6,7 +6,9 @@ from std.gpu.sync import (
 )
 from std.gpu.host import DeviceContext
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from layout.layout_tensor import copy_dram_to_sram_async
 from std.sys import argv, info
 from std.testing import assert_true, assert_almost_equal
@@ -16,7 +18,8 @@ comptime SIZE = 1024  # Image size (1D for simplicity)
 comptime BLOCKS_PER_GRID = (4, 1)
 comptime THREADS_PER_BLOCK = (TPB, 1)
 comptime dtype = DType.float32
-comptime layout = Layout.row_major(SIZE)
+comptime layout = row_major[SIZE]()
+comptime LayoutType = type_of(layout)

 # Multi-stage processing configuration
 comptime STAGE1_THREADS = TPB // 2
@@ -25,11 +28,9 @@


 # ANCHOR: multi_stage_pipeline_solution
-def multi_stage_image_blur_pipeline[
-    layout: Layout
-](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def multi_stage_image_blur_pipeline(
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
     size: Int,
 ):
     """Multi-stage image blur pipeline with barrier coordination.
@@ -40,18 +41,12 @@
     """
     # Shared memory buffers for pipeline stages
-    var input_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var blur_shared = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var input_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB]())
+    var blur_shared = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB]())

     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
@@ -133,11 +128,9 @@ comptime BUFFER_COUNT = 2


 # ANCHOR: double_buffered_stencil_solution
-def double_buffered_stencil_computation[
-    layout: Layout
-](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+def double_buffered_stencil_computation(
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
     size: Int,
 ):
     """Double-buffered stencil computation with memory barrier coordination.
@@ -147,38 +140,23 @@
     """
     # Double-buffering: Two shared memory buffers
-    var buffer_A = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var buffer_B = LayoutTensor[
-        dtype,
-        Layout.row_major(TPB),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var buffer_A = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB]())
+    var buffer_B = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[TPB]())

     # Memory barriers for coordinating buffer swaps
-    var init_barrier = LayoutTensor[
-        DType.uint64,
-        Layout.row_major(1),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var iter_barrier = LayoutTensor[
-        DType.uint64,
-        Layout.row_major(1),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
-    var final_barrier = LayoutTensor[
-        DType.uint64,
-        Layout.row_major(1),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var init_barrier = stack_allocation[
+        dtype=DType.uint64, address_space=AddressSpace.SHARED
+    ](row_major[1]())
+    var iter_barrier = stack_allocation[
+        dtype=DType.uint64, address_space=AddressSpace.SHARED
+    ](row_major[1]())
+    var final_barrier = stack_allocation[
+        dtype=DType.uint64, address_space=AddressSpace.SHARED
+    ](row_major[1]())

     var global_i = block_dim.x * block_idx.x + thread_idx.x
     var local_i = thread_idx.x
@@ -284,11 +262,11 @@
         # Create a simple wave pattern for blurring
         inp_host[i] = Scalar[dtype](i % 10) + Scalar[dtype](i) / 100.0

-    # Create LayoutTensors
-    var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-    var inp_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
+    # Create TileTensors
+    var out_tensor = TileTensor[mut=True, dtype, LayoutType](out, layout)
+    var inp_tensor = TileTensor[mut=False, dtype, LayoutType](inp, layout)

-    comptime kernel = multi_stage_image_blur_pipeline[layout]
+    comptime kernel = multi_stage_image_blur_pipeline
     ctx.enqueue_function[kernel, kernel](
         out_tensor,
         inp_tensor,
@@ -346,11 +324,11 @@
         # Create a step pattern that will be smoothed by stencil
         inp_host[i] = Scalar[dtype](1.0 if i % 20 < 10 else 0.0)

-    # Create LayoutTensors for Puzzle 26B
-    var out_tensor = LayoutTensor[dtype, layout, MutAnyOrigin](out)
-    var inp_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp)
+    # Create TileTensors for Puzzle 26B
+    var out_tensor = TileTensor[mut=True, dtype, LayoutType](out, layout)
+    var inp_tensor = TileTensor[mut=False, dtype, LayoutType](inp, layout)

-    comptime kernel = double_buffered_stencil_computation[layout]
+    comptime kernel = double_buffered_stencil_computation
     ctx.enqueue_function[kernel, kernel](
         out_tensor,
         inp_tensor,
diff --git a/solutions/p33/p33.mojo b/solutions/p33/p33.mojo
index 0e57b8fd..6d6c4a10 100644
--- a/solutions/p33/p33.mojo
+++ b/solutions/p33/p33.mojo
@@ -1,6 +1,7 @@
 from std.gpu import thread_idx, block_idx, block_dim, barrier, WARP_SIZE
 from std.gpu.host import DeviceContext
-from layout import Layout, LayoutTensor
+from layout import Layout, LayoutTensor, TileTensor
+from layout.tile_layout import row_major
 from layout.tensor_core import TensorCore
 from layout.layout_tensor import copy_dram_to_sram_async
 from std.gpu.memory import async_copy_wait_all, AddressSpace
@@ -10,7 +11,8 @@
 comptime dtype = DType.float32
 comptime SIZE = 1024
-comptime layout = Layout.row_major(SIZE, SIZE)
+comptime layout = row_major[SIZE, SIZE]()
+comptime LayoutType = type_of(layout)
 comptime BLOCK_DIM_COUNT = 2
 comptime TILE_SIZE = 32
@@ -23,11 +25,11 @@

 # ANCHOR: matmul_idiomatic_tiled_solution
 def matmul_idiomatic_tiled[
-    layout: Layout, size: Int
+    size: Int
 ](
-    output: LayoutTensor[dtype, layout, MutAnyOrigin],
-    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
-    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
+    a: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
+    b: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
 ):
     # Use block_dim to get actual tile size dynamically
     var tile_size_x = block_dim.x
@@ -53,7 +55,7 @@
         address_space=AddressSpace.SHARED,
     ].stack_allocation()

-    var acc: output.element_type = 0
+    var acc: output.ElementType = 0

     comptime load_a_layout = Layout.row_major(1, TILE_SIZE)  # Coalesced loading
     comptime load_b_layout = Layout.row_major(1, TILE_SIZE)  # Coalesced loading
@@ -62,8 +64,12 @@
     for idx in range(size // TILE_SIZE):  # Iterate over K tiles
         # Get tiles from A and B matrices
-        var a_tile = a.tile[TILE_SIZE, TILE_SIZE](block_idx.y, idx)
-        var b_tile = b.tile[TILE_SIZE, TILE_SIZE](idx, block_idx.x)
+        var a_tile = a.tile[TILE_SIZE, TILE_SIZE](
+            block_idx.y, idx
+        ).to_layout_tensor()
+        var b_tile = b.tile[TILE_SIZE, TILE_SIZE](
+            idx, block_idx.x
+        ).to_layout_tensor()

         # Asynchronously copy tiles to shared memory with consistent orientation
         copy_dram_to_sram_async[
@@ -87,7 +93,9 @@
             and local_col < TILE_SIZE
             and k < TILE_SIZE
         ):
-            acc += a_shared[local_row, k] * b_shared[k, local_col]
+            acc += rebind[Scalar[dtype]](a_shared[local_row, k]) * rebind[
+                Scalar[dtype]
+            ](b_shared[k, local_col])

         barrier()
@@ -289,19 +297,29 @@ def main() raises:
                     inp1_host[i * SIZE + k] * inp2_host[k * SIZE + j]
                 )

     # Create layout tensors
-    var out_tensor_core_layout = LayoutTensor[dtype, layout](
+    comptime old_layout = Layout.row_major(SIZE, SIZE)
+    var out_tensor_core_layout = LayoutTensor[dtype, old_layout](
         out_tensor_core.unsafe_ptr()
     )
-    var a_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp1)
-    var b_tensor = LayoutTensor[dtype, layout, ImmutAnyOrigin](inp2)
+    var a_tensor = LayoutTensor[dtype, old_layout, ImmutAnyOrigin](inp1)
+    var b_tensor = LayoutTensor[dtype, old_layout, ImmutAnyOrigin](inp2)
+
+    # Create TileTensors for the tiled kernel
+    var out_tile_tensor = TileTensor(out_tensor_core, layout)
+    var a_tile_tensor = TileTensor[mut=False, dtype, LayoutType](
+        inp1, layout
+    )
+    var b_tile_tensor = TileTensor[mut=False, dtype, LayoutType](
+        inp2, layout
+    )

     if mode == "--tensor-core":
         print("\n=== Running ACTUAL Tensor Core Matrix Multiplication ===")
         comptime kernel = tensor_core_matrix_multiplication[
             dtype,
-            layout,
-            layout,
-            layout,
+            old_layout,
+            old_layout,
+            old_layout,
             BM,
             BN,
             BK,
@@ -328,16 +346,14 @@
         # Create separate buffer for tiled result
         out_tiled = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
         out_tiled.enqueue_fill(0)
-        out_tiled_layout = LayoutTensor[dtype, layout](
-            out_tiled.unsafe_ptr()
-        )
+        out_tiled_layout = TileTensor(out_tiled, layout)

         # Run idiomatic tiled version with proper 2D block configuration
-        comptime kernel = matmul_idiomatic_tiled[layout, SIZE]
+        comptime kernel = matmul_idiomatic_tiled[SIZE]
         ctx.enqueue_function[kernel, kernel](
             out_tiled_layout,
-            a_tensor,
-            b_tensor,
+            a_tile_tensor,
+            b_tile_tensor,
             grid_dim=BLOCK_PER_GRID_TILED,
             block_dim=THREADS_PER_BLOCK_TILED,
         )
@@ -356,9 +372,9 @@
         print("\n--- Test 1: Tensor Core vs CPU Reference ---")
         comptime kernel = tensor_core_matrix_multiplication[
             dtype,
-            layout,
-            layout,
-            layout,
+            old_layout,
+            old_layout,
+            old_layout,
             BM,
             BN,
             BK,
@@ -435,15 +451,13 @@
         print("\n--- Test 2: Idiomatic Tiled vs CPU Reference ---")
         out_tiled = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
         out_tiled.enqueue_fill(0)
-        out_tiled_layout = LayoutTensor[dtype, layout](
-            out_tiled.unsafe_ptr()
-        )
+        out_tiled_layout = TileTensor(out_tiled, layout)

-        comptime kernel2 = matmul_idiomatic_tiled[layout, SIZE]
+        comptime kernel2 = matmul_idiomatic_tiled[SIZE]
         ctx.enqueue_function[kernel2, kernel2](
             out_tiled_layout,
-            a_tensor,
-            b_tensor,
+            a_tile_tensor,
+            b_tile_tensor,
             grid_dim=BLOCK_PER_GRID_TILED,
             block_dim=THREADS_PER_BLOCK_TILED,
         )
diff --git a/solutions/p34/p34.mojo b/solutions/p34/p34.mojo
index e2a62a55..4aa02af4 100644
--- a/solutions/p34/p34.mojo
+++ b/solutions/p34/p34.mojo
@@ -8,7 +8,9 @@ from std.gpu.primitives.cluster import (
     elect_one_sync,
 )
 from std.gpu.memory import AddressSpace
-from layout import Layout, LayoutTensor
+from layout import TileTensor
+from layout.tile_layout import row_major
+from layout.tile_tensor import stack_allocation
 from std.sys import argv
 from std.testing import assert_equal, assert_almost_equal, assert_true
@@ -16,16 +18,20 @@ comptime SIZE = 1024
 comptime TPB = 256
 comptime CLUSTER_SIZE = 4
 comptime dtype = DType.float32
-comptime in_layout = Layout.row_major(SIZE)
-comptime out_layout = Layout.row_major(1)
+comptime in_layout = row_major[SIZE]()
+comptime out_layout = row_major[1]()
+comptime InLayout = type_of(in_layout)
+comptime OutLayout = type_of(out_layout)
+comptime cluster_layout = row_major[CLUSTER_SIZE]()
+comptime ClusterLayout = type_of(cluster_layout)


 # ANCHOR: cluster_coordination_basics_solution
 def cluster_coordination_basics[
-    in_layout: Layout, out_layout: Layout, tpb: Int
+    tpb: Int
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, ClusterLayout, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, InLayout, MutAnyOrigin],
     size: Int,
 ):
     """Real cluster coordination using SM90+ cluster APIs."""
@@ -36,12 +42,9 @@
     var my_block_rank = Int(block_rank_in_cluster())
     var block_id = block_idx.x

-    var shared_data = LayoutTensor[
-        dtype,
-        Layout.row_major(tpb),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared_data = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[tpb]())

     # FIX: Use block_idx.x for data distribution instead of cluster rank
     # Each block should process different portions of the data
@@ -77,13 +80,11 @@

 # ANCHOR: cluster_collective_operations_solution
 def cluster_collective_operations[
-    in_layout: Layout, out_layout: Layout, tpb: Int
+    tpb: Int
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
-    temp_storage: LayoutTensor[
-        dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
-    ],
+    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, InLayout, MutAnyOrigin],
+    temp_storage: TileTensor[mut=True, dtype, ClusterLayout, MutAnyOrigin],
     size: Int,
 ):
     """Cluster-wide collective operations using real cluster APIs."""
@@ -98,12 +99,9 @@
         my_value = input[global_i][0]

     # Block-level reduction using shared memory
-    var shared_mem = LayoutTensor[
-        dtype,
-        Layout.row_major(tpb),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared_mem = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[tpb]())

     shared_mem[local_i] = my_value
     barrier()
@@ -135,10 +133,10 @@

 # ANCHOR: advanced_cluster_patterns_solution
 def advanced_cluster_patterns[
-    in_layout: Layout, out_layout: Layout, tpb: Int
+    tpb: Int
 ](
-    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
-    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
+    output: TileTensor[mut=True, dtype, ClusterLayout, MutAnyOrigin],
+    input: TileTensor[mut=False, dtype, InLayout, MutAnyOrigin],
     size: Int,
 ):
     """Advanced cluster programming using cluster masks and relaxed synchronization.
@@ -148,12 +146,9 @@
     var my_block_rank = Int(block_rank_in_cluster())
     var block_id = block_idx.x

-    var shared_data = LayoutTensor[
-        dtype,
-        Layout.row_major(tpb),
-        MutAnyOrigin,
-        address_space=AddressSpace.SHARED,
-    ].stack_allocation()
+    var shared_data = stack_allocation[
+        dtype=dtype, address_space=AddressSpace.SHARED
+    ](row_major[tpb]())

     # Compute cluster mask for advanced coordination
     # base_mask = cluster_mask_base()  # Requires cluster_shape parameter
@@ -216,16 +211,14 @@ def main() raises:
         for i in range(SIZE):
             input_host[i] = Scalar[dtype](i % 10) * 0.1

-        input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
-            input_buf
+        input_tensor = TileTensor[mut=False, dtype, InLayout](
+            input_buf, in_layout
+        )
+        output_tensor = TileTensor[mut=True, dtype, ClusterLayout](
+            output_buf, cluster_layout
         )
-        output_tensor = LayoutTensor[
-            dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
-        ](output_buf)

-        comptime kernel = cluster_coordination_basics[
-            in_layout, Layout.row_major(CLUSTER_SIZE), TPB
-        ]
+        comptime kernel = cluster_coordination_basics[TPB]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
@@ -280,19 +273,17 @@
         print("Expected sum:", expected_sum)

-        input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
-            input_buf
+        input_tensor = TileTensor[mut=False, dtype, InLayout](
+            input_buf, in_layout
+        )
+        var output_tensor = TileTensor[mut=True, dtype, OutLayout](
+            output_buf, out_layout
         )
-        var output_tensor = LayoutTensor[dtype, out_layout, MutAnyOrigin](
-            output_buf
+        var temp_tensor = TileTensor[mut=True, dtype, ClusterLayout](
+            temp_buf, cluster_layout
         )
-        var temp_tensor = LayoutTensor[
-            dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
-        ](temp_buf)

-        comptime kernel = cluster_collective_operations[
-            in_layout, out_layout, TPB
-        ]
+        comptime kernel = cluster_collective_operations[TPB]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,
@@ -332,16 +323,14 @@ def main() raises:
                 Scalar[dtype](i % 50) * 0.02
             )  # Pattern for testing

-        input_tensor = LayoutTensor[dtype, in_layout, ImmutAnyOrigin](
-            input_buf
+        input_tensor = TileTensor[mut=False, dtype, InLayout](
+            input_buf, in_layout
+        )
+        output_tensor = TileTensor[mut=True, dtype, ClusterLayout](
+            output_buf, cluster_layout
         )
-        output_tensor = LayoutTensor[
-            dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
-        ](output_buf)

-        comptime kernel = advanced_cluster_patterns[
-            in_layout, Layout.row_major(CLUSTER_SIZE), TPB
-        ]
+        comptime kernel = advanced_cluster_patterns[TPB]
         ctx.enqueue_function[kernel, kernel](
             output_tensor,
             input_tensor,