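[Content erratum] Vol. 1 "Cache mechanisms and the memory hierarchy": the example code, reference output, and subsequent explanation appear to be inconsistent #67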

### Related page or file

https://github.com/Awesome-Embedded-Learning-Studio/Tutorial_AwesomeModernCPP/blob/8bae16a6be92f9252cb536339ccef1a12df3a484/documents/vol1-fundamentals/c_tutorials/advanced_feature/02-cache-and-memory-hierarchy.md

### Issue type

Inaccurate code explanation

### Current content

https://github.com/Awesome-Embedded-Learning-Studio/Tutorial_AwesomeModernCPP/blob/8bae16a6be92f9252cb536339ccef1a12df3a484/documents/vol1-fundamentals/c_tutorials/advanced_feature/02-cache-and-memory-hierarchy.md?plain=1#L69-L128

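### Problem description

In this code, the number of loop iterations shrinks in proportion as `stride` grows. Because the program reports the total elapsed time (`end - start`) rather than the average time per access, the total should normally fall as `stride` increases and fewer accesses are made. The code therefore does not match the article's explanation of the cache effect it is meant to demonstrate.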

On `x86-64 Linux` with `gcc 11.4.0`, compiling with `-O2 -std=c11` produces the following output:
```
stride=    1  time=13.813 ms
stride=    2  time=7.066 ms
stride=    4  time=3.457 ms
stride=    8  time=1.740 ms
stride=   16  time=0.863 ms
stride=   32  time=0.428 ms
stride=   64  time=0.224 ms
stride=  128  time=0.103 ms
stride=  256  time=0.056 ms
stride=  512  time=0.034 ms
stride= 1024  time=0.014 ms
stride= 2048  time=0.007 ms
stride= 4096  time=0.004 ms
```

### Suggested change

_No response_

### Supplementary material

_No response_

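	We now know that data moves between the cache and main memory not byte by byte but in units of cache lines. On x86 a cache line is typically 64 bytes; some ARM parts use 32 bytes (though modern ARM64 has largely converged on 64 bytes as well). This means that even if you read a single `int` (4 bytes), the cache controller pulls the entire cache line (64 bytes) containing that `int` up from main memory.

	The motivation for this design is intuitive: since programs exhibit spatial locality, it pays to fetch a bit more at once, in case the next access is to adjacent data. Most programs' access patterns do have fairly good spatial locality, so statistically this strategy comes out ahead.

	We can write a short C program to get a feel for cache lines. It walks the same array with different strides and reports how the elapsed time changes: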

	```c
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>

	#define kArraySize (64 * 1024 * 1024) // 64M ints

	int main(void)
	{
	    int* arr = (int*)malloc(kArraySize * sizeof(int));
	    // Warm up first so the data is in the cache
	    for (int i = 0; i < kArraySize; i++) {
	        arr[i] = i;
	    }

	    // Walk the array with different strides, reading only
	    for (int stride = 1; stride <= 4096; stride *= 2) {
	        clock_t start = clock();
	        int sum = 0;
	        for (int i = 0; i < kArraySize; i += stride) {
	            sum += arr[i];
	        }
	        clock_t end = clock();
	        printf("stride=%5d time=%.3f ms\n",
	               stride,
	               (double)(end - start) / CLOCKS_PER_SEC * 1000);
	    }

	    free(arr);
	    return 0;
	}
	```

	After compiling and running it, you will see an interesting pattern:

	```text
	$ gcc -O2 -std=c11 stride_test.c -o stride_test && ./stride_test
	stride= 1 time=68.245 ms
	stride= 2 time=68.891 ms
	stride= 4 time=69.012 ms
	stride= 8 time=69.453 ms
	stride= 16 time=70.102 ms
	stride= 32 time=132.567 ms
	stride= 64 time=201.345 ms
	stride= 128 time=215.789 ms
	stride= 256 time=218.901 ms
	stride= 512 time=220.134 ms
	stride= 1024 time=221.567 ms
	stride= 2048 time=222.890 ms
	stride= 4096 time=223.456 ms
	```

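	As the stride grows from 1 to 16 (16 ints = 64 bytes, exactly one cache line), the elapsed time barely changes: whether you access every element or only every few elements, once a cache line has been fetched all of its data is already in the cache. But once the stride exceeds 16 (crossing a cache line boundary), every access triggers a new cache line load and the elapsed time rises noticeably. This small experiment nicely demonstrates the cache line as the smallest unit of transfer.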
