
Commit c7db565

[0019] Optimize string-ref/cursor performance by eliminating temporary bytevector allocation (#785)
1 parent 518b0b8 commit c7db565

6 files changed

Lines changed: 381 additions & 17 deletions


bench/string-cursor.scm

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
;; string-ref/cursor performance benchmark
;; Measures the cost of bytevector-copy + utf8->codepoint in the current implementation

(import (liii timeit)
        (liii string-cursor)
        (scheme base)
)

;; Run a single benchmark
(define (bench name stmt number)
  (let ((elapsed (timeit stmt '() number)))
    (display name)
    (display ": ")
    (display elapsed)
    (display " s (")
    (display number)
    (display " runs)\n")
  )
)

;; Traverse every character of the string using string-ref/cursor
(define (cursor-traverse str)
  (let ((end (string-cursor-end str)))
    (let loop ((cur (string-cursor-start str)))
      (if (string-cursor=? cur end)
          'done
          (begin
            (string-ref/cursor str cur)
            (loop (string-cursor-next str cur))
          )
      )
    )
  )
)

;; Traverse every character of the string using the native string-ref
(define (native-traverse str)
  (let ((len (string-length str)))
    (let loop ((i 0))
      (if (= i len)
          'done
          (begin
            (string-ref str i)
            (loop (+ i 1))
          )
      )
    )
  )
)

;; Traverse using string-fold (which uses string-ref/cursor internally)
(define (fold-traverse str)
  (string-fold (lambda (ch acc) (+ acc 1)) 0 str)
)

;; Traverse using string-every (which uses string-ref/cursor internally)
(define (every-traverse str)
  (string-every char? str)
)

(define (run-benchmarks)
  (display "=== string-ref/cursor performance test ===\n\n")

  ;; Short ASCII string (50 characters)
  (let ((ascii-short (make-string 50 #\a)))
    (display "--- Short ASCII string (50 characters) ---\n")
    (bench " cursor traversal " (lambda () (cursor-traverse ascii-short)) 10000)
    (bench " native traversal " (lambda () (native-traverse ascii-short)) 10000)
    (bench " string-fold " (lambda () (fold-traverse ascii-short)) 10000)
    (bench " string-every " (lambda () (every-traverse ascii-short)) 10000)
    (display "\n")
  )

  ;; Long ASCII string (500 characters)
  (let ((ascii-long (make-string 500 #\a)))
    (display "--- Long ASCII string (500 characters) ---\n")
    (bench " cursor traversal " (lambda () (cursor-traverse ascii-long)) 1000)
    (bench " native traversal " (lambda () (native-traverse ascii-long)) 1000)
    (bench " string-fold " (lambda () (fold-traverse ascii-long)) 1000)
    (bench " string-every " (lambda () (every-traverse ascii-long)) 1000)
    (display "\n")
  )

  ;; Short UTF-8 string (20 Chinese characters, 3 bytes each)
  (let ((utf8-short (string-tabulate (lambda (i) #\中) 20)))
    (display "--- Short UTF-8 string (20 Chinese characters) ---\n")
    (bench " cursor traversal " (lambda () (cursor-traverse utf8-short)) 10000)
    (bench " native traversal " (lambda () (native-traverse utf8-short)) 10000)
    (bench " string-fold " (lambda () (fold-traverse utf8-short)) 10000)
    (bench " string-every " (lambda () (every-traverse utf8-short)) 10000)
    (display "\n")
  )

  ;; Long UTF-8 string (200 Chinese characters)
  (let ((utf8-long (string-tabulate (lambda (i) #\中) 200)))
    (display "--- Long UTF-8 string (200 Chinese characters) ---\n")
    (bench " cursor traversal " (lambda () (cursor-traverse utf8-long)) 1000)
    (bench " native traversal " (lambda () (native-traverse utf8-long)) 1000)
    (bench " string-fold " (lambda () (fold-traverse utf8-long)) 1000)
    (bench " string-every " (lambda () (every-traverse utf8-long)) 1000)
    (display "\n")
  )

  ;; Short emoji string (10 emoji, 4 bytes each)
  (let ((emoji-short (string-tabulate (lambda (i) #\🚀) 10)))
    (display "--- Short emoji string (10 emoji) ---\n")
    (bench " cursor traversal " (lambda () (cursor-traverse emoji-short)) 10000)
    (bench " native traversal " (lambda () (native-traverse emoji-short)) 10000)
    (bench " string-fold " (lambda () (fold-traverse emoji-short)) 10000)
    (bench " string-every " (lambda () (every-traverse emoji-short)) 10000)
    (display "\n")
  )
)

(run-benchmarks)

devel/0019.md

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
# [0019] Optimize string-ref/cursor performance by eliminating temporary bytevector allocation

## Code files related to this task
- `goldfish/liii/unicode.scm`: adds `utf8->codepoint-at`
- `goldfish/liii/string-cursor.scm`: uses `utf8->codepoint-at` in place of `bytevector-copy` + `utf8->codepoint`
- `goldfish/scheme/char.scm`: same as above
- `tests/liii/unicode/utf8-to-codepoint-at-test.scm`: new tests
- `bench/string-cursor.scm`: performance benchmark

## How to test
```bash
# 1. Build
xmake b goldfish

# 2. Run the new tests
bin/gf tests/liii/unicode/utf8-to-codepoint-at-test.scm

# 3. Run the full string-cursor test suite
bin/gf test tests/liii/string-cursor/

# 4. Run the benchmark
bin/gf bench/string-cursor.scm
```

## 2026-05-11 Optimize temporary object allocation in string-ref/cursor

### What
1. Added the `utf8->codepoint-at` function, which decodes a UTF-8 character directly at a given offset in a bytevector, without creating a temporary bytevector.
2. Changed `string-ref/cursor` to use `utf8->codepoint-at` instead of `(bytevector-copy bv start end)` + `(utf8->codepoint char-bv)`.
3. Changed 4 other pieces of character-comparison code in `string-cursor.scm` (the internal helper functions of `string-prefix-length`, `string-suffix-length`, and `string-prefix?`).
4. Changed the character-decoding logic of `utf8-string-map` in `scheme/char.scm`.
5. Added complete unit tests for `utf8->codepoint-at`.
6. Added the `bench/string-cursor.scm` performance benchmark.
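
### Why
`string-ref/cursor` is the lowest-level character-read operation in the `(liii string-cursor)` module. Nearly every cursor-based string traversal function (`string-fold`, `string-every`, `string-index`, and so on) calls it repeatedly.

For every character read, the original implementation had to:
1. Allocate a temporary bytevector of 1 to 4 bytes via `bytevector-copy`
2. Decode that temporary object via `utf8->codepoint`

Traversing a string of N characters therefore produced N temporary bytevector allocations. For high-frequency string operations, this caused significant GC pressure and performance overhead.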

### How
The core idea is to **eliminate the temporary bytevector allocation**.

The new `utf8->codepoint-at bv start` uses exactly the same decoding logic as `utf8->codepoint`, but reads bytes directly from `bv` starting at position `start`, instead of first copying them into a new bytevector with `bytevector-copy` and then decoding.

`string-ref/cursor` before and after the change:
```scheme
;; Before
(let* (...
       (start (vector-ref pos idx))
       (end (vector-ref pos (+ idx 1)))
       (char-bv (bytevector-copy bv start end)))
  (integer->char (utf8->codepoint char-bv)))

;; After
(let* (...
       (start (vector-ref pos idx)))
  (integer->char (utf8->codepoint-at bv start)))
```
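
To make the difference concrete, here is a minimal, self-contained sketch of the two strategies using only `(scheme base)`. The names `decode-utf8-at` and `decode-utf8-copy` are hypothetical stand-ins, not the functions from this commit, and the decoder is deliberately simplified: it assumes well-formed input and skips the validation that `utf8->codepoint-at` performs.

```scheme
(import (scheme base))

;; Hypothetical, simplified decoder: reads one UTF-8 sequence from bv
;; starting at byte offset start. Assumes well-formed input (no validation).
(define (decode-utf8-at bv start)
  ;; Payload (low 6 bits) of the k-th continuation byte.
  (define (cont k)
    (- (bytevector-u8-ref bv (+ start k)) 128))
  (let ((b0 (bytevector-u8-ref bv start)))
    (cond ((< b0 128) b0)
          ((< b0 224) (+ (* (- b0 192) 64) (cont 1)))
          ((< b0 240) (+ (* (- b0 224) 4096) (* (cont 1) 64) (cont 2)))
          (else (+ (* (- b0 240) 262144)
                   (* (cont 1) 4096)
                   (* (cont 2) 64)
                   (cont 3))))))

;; The old strategy, for comparison: copy the character's bytes into a
;; fresh bytevector, then decode from offset 0. Allocates once per call.
(define (decode-utf8-copy bv start end)
  (decode-utf8-at (bytevector-copy bv start end) 0))

;; "a中🚀" encoded as UTF-8: 1 + 3 + 4 bytes.
(define sample (bytevector 97 228 184 173 240 159 154 128))

(define in-place (list (decode-utf8-at sample 0)
                       (decode-utf8-at sample 1)
                       (decode-utf8-at sample 4)))
(define copied (list (decode-utf8-copy sample 0 1)
                     (decode-utf8-copy sample 1 4)
                     (decode-utf8-copy sample 4 8)))
```

Both lists evaluate to `(97 20013 128640)`. The in-place version gets there without allocating any intermediate bytevector, which is the property the commit relies on.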

### Performance comparison

#### string-fold (traversal scenario; the benchmark code is unchanged, so this is the fairest comparison)

| Scenario | Before (s) | After (s) | Improvement |
|---------|-----------|-----------|------|
| Short ASCII string (50 characters, 10000 runs) | 2.94 | 2.41 | **18%** |
| Long ASCII string (500 characters, 1000 runs) | 2.62 | 2.37 | **10%** |
| Short UTF-8 string (20 Chinese characters, 10000 runs) | 1.73 | 1.58 | **9%** |
| Long UTF-8 string (200 Chinese characters, 1000 runs) | 1.58 | 1.37 | **13%** |
| Short emoji string (10 emoji, 10000 runs) | 0.99 | 0.88 | **11%** |
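
The last column of the table above can be reproduced from the two timing columns as (before - after) / before. The sketch below does that arithmetic; the helper name `improvement-percent` is illustrative only and is not part of the commit.

```scheme
(import (scheme base))

;; Relative improvement in percent, rounded to the nearest integer.
(define (improvement-percent before after)
  (exact (round (* 100 (/ (- before after) before)))))

;; (before . after) timings in seconds, copied from the table above.
(define timings
  '((2.94 . 2.41) (2.62 . 2.37) (1.73 . 1.58) (1.58 . 1.37) (0.99 . 0.88)))

(define improvements
  (map (lambda (p) (improvement-percent (car p) (cdr p))) timings))
```

`improvements` evaluates to `(18 10 9 13 11)`, matching the table row by row.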
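
#### cursor traversal (benchmark corrected to hoist `string-cursor-end`)

The initial benchmark found `cursor-traverse` to be more than **1000x** slower than `native-traverse`. Analysis showed that, besides `bytevector-copy`, the other major cost was that `cursor-traverse` called `string-cursor-end` on every loop iteration, creating a new cursor object each time.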
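
The effect of hoisting can be sketched without the real cursor API. In the snippet below, `make-end-marker` is a hypothetical stand-in for `string-cursor-end` that counts how often it is called; it is not part of `(liii string-cursor)`, and the loops only mimic the shape of `cursor-traverse`.

```scheme
(import (scheme base))

(define calls 0)

;; Hypothetical stand-in for string-cursor-end: builds a fresh object
;; on every call and counts how often that happens.
(define (make-end-marker str)
  (set! calls (+ calls 1))
  (list 'end (string-length str)))

;; End marker recomputed in every loop test.
(define (traverse-recompute str)
  (let loop ((i 0))
    (if (= i (cadr (make-end-marker str)))
        'done
        (loop (+ i 1)))))

;; End marker computed once, outside the loop.
(define (traverse-hoisted str)
  (let ((end (cadr (make-end-marker str))))
    (let loop ((i 0))
      (if (= i end)
          'done
          (loop (+ i 1))))))

(define str (make-string 50 #\a))

(set! calls 0)
(traverse-recompute str)
(define calls-recompute calls) ; one call per loop test

(set! calls 0)
(traverse-hoisted str)
(define calls-hoisted calls)   ; a single call
```

For the 50-character string, `calls-recompute` is 51 and `calls-hoisted` is 1, so the recomputing loop allocates one extra object per iteration.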
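
After correcting the benchmark (moving `string-cursor-end` out of the loop):

| Scenario | cursor traversal (s) | native traversal (s) | Gap |
|---------|----------------|----------------|------|
| Short ASCII string (50 characters, 10000 runs) | 1.84 | 0.021 | ~87x |
| Long ASCII string (500 characters, 1000 runs) | 1.71 | 0.017 | ~100x |
| Short UTF-8 string (20 Chinese characters, 10000 runs) | 1.09 | 0.019 | ~57x |
| Long UTF-8 string (200 Chinese characters, 1000 runs) | 1.00 | 0.021 | ~48x |
| Short emoji string (10 emoji, 10000 runs) | 0.63 | 0.015 | ~42x |

**Note**: even with the corrected benchmark, cursor traversal is still tens of times slower than native traversal. This gap comes mainly from the inherent overhead of the cursor abstraction layer (record access, vector indexing, cursor object creation, and so on), not from `bytevector-copy` alone. The `utf8->codepoint-at` optimization speeds up higher-level traversal functions such as `string-fold` by roughly **9% to 18%** (per the string-fold table) and completely eliminates the temporary object allocation on every character access.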

goldfish/liii/string-cursor.scm

Lines changed: 7 additions & 15 deletions
@@ -307,10 +307,8 @@
       ) ;when
     ) ;_
     (start (vector-ref pos idx))
-    (end (vector-ref pos (+ idx 1)))
-    (char-bv (bytevector-copy bv start end))
   ) ;
-  (integer->char (utf8->codepoint char-bv))
+  (integer->char (utf8->codepoint-at bv start))
 ) ;let*
 ) ;define

@@ -738,11 +736,9 @@
 (if (or (>= i end1-idx) (>= j end2-idx))
     count
     (let* ((b1-start (vector-ref pos1 i))
-           (b1-end (vector-ref pos1 (+ i 1)))
-           (ch1 (integer->char (utf8->codepoint (bytevector-copy bv1 b1-start b1-end))))
+           (ch1 (integer->char (utf8->codepoint-at bv1 b1-start)))
            (b2-start (vector-ref pos2 j))
-           (b2-end (vector-ref pos2 (+ j 1)))
-           (ch2 (integer->char (utf8->codepoint (bytevector-copy bv2 b2-start b2-end))))
+           (ch2 (integer->char (utf8->codepoint-at bv2 b2-start)))
           ) ;
       (if (char=? ch1 ch2) (loop (+ i 1) (+ j 1) (+ count 1)) count)
     ) ;let*
@@ -781,11 +777,9 @@
 (if (or (< i start1-idx) (< j start2-idx))
     count
     (let* ((b1-start (vector-ref pos1 i))
-           (b1-end (vector-ref pos1 (+ i 1)))
-           (ch1 (integer->char (utf8->codepoint (bytevector-copy bv1 b1-start b1-end))))
+           (ch1 (integer->char (utf8->codepoint-at bv1 b1-start)))
            (b2-start (vector-ref pos2 j))
-           (b2-end (vector-ref pos2 (+ j 1)))
-           (ch2 (integer->char (utf8->codepoint (bytevector-copy bv2 b2-start b2-end))))
+           (ch2 (integer->char (utf8->codepoint-at bv2 b2-start)))
           ) ;
       (if (char=? ch1 ch2) (loop (- i 1) (- j 1) (+ count 1)) count)
     ) ;let*
@@ -858,11 +852,9 @@
 (if (>= j s2-end)
     #t
     (let* ((b1-start (vector-ref pos1 i))
-           (b1-end (vector-ref pos1 (+ i 1)))
-           (ch1 (integer->char (utf8->codepoint (bytevector-copy bv1 b1-start b1-end))))
+           (ch1 (integer->char (utf8->codepoint-at bv1 b1-start)))
            (b2-start (vector-ref pos2 j))
-           (b2-end (vector-ref pos2 (+ j 1)))
-           (ch2 (integer->char (utf8->codepoint (bytevector-copy bv2 b2-start b2-end))))
+           (ch2 (integer->char (utf8->codepoint-at bv2 b2-start)))
           ) ;
       (if (char=? ch1 ch2) (loop (+ i 1) (+ j 1)) #f)
     ) ;let*

goldfish/liii/unicode.scm

Lines changed: 89 additions & 0 deletions
@@ -10,6 +10,7 @@
     bytevector-advance-utf8
     codepoint->utf8
     utf8->codepoint
+    utf8->codepoint-at
     utf8-string-trim-right
     utf8-string-trim-left
     utf8-string-trim-both
@@ -362,6 +363,94 @@
   ) ;let
 ) ;define

+(define (utf8->codepoint-at bv start)
+  (unless (bytevector? bv)
+    (error 'type-error "utf8->codepoint-at: expected bytevector")
+  ) ;unless
+
+  (let ((len (bytevector-length bv)))
+    (when (>= start len)
+      (error 'value-error "utf8->codepoint-at: start index past end of bytevector")
+    ) ;when
+
+    (let ((first-byte (bytevector-u8-ref bv start)))
+      (cond ((<= first-byte 127) first-byte)
+
+            ((<= 194 first-byte 223)
+             (when (> (+ start 2) len)
+               (error 'value-error "utf8->codepoint-at: incomplete 2-byte sequence")
+             ) ;when
+             (let ((byte2 (bytevector-u8-ref bv (+ start 1))))
+               (unless (<= 128 byte2 191)
+                 (error 'value-error "utf8->codepoint-at: invalid continuation byte")
+               ) ;unless
+               (bitwise-ior (ash (bitwise-and first-byte 31) 6) (bitwise-and byte2 63))
+             ) ;let
+            ) ;
+
+            ((<= 224 first-byte 239)
+             (when (> (+ start 3) len)
+               (error 'value-error "utf8->codepoint-at: incomplete 3-byte sequence")
+             ) ;when
+             (let ((byte2 (bytevector-u8-ref bv (+ start 1)))
+                   (byte3 (bytevector-u8-ref bv (+ start 2)))
+                  ) ;
+               (unless (and (<= 128 byte2 191) (<= 128 byte3 191))
+                 (error 'value-error "utf8->codepoint-at: invalid continuation byte")
+               ) ;unless
+               (let ((codepoint (bitwise-ior (ash (bitwise-and first-byte 15) 12)
+                                             (ash (bitwise-and byte2 63) 6)
+                                             (bitwise-and byte3 63)
+                                ) ;bitwise-ior
+                     ) ;codepoint
+                    ) ;
+                 (when (or (<= 55296 codepoint 57343)
+                           (and (= first-byte 224) (< codepoint 2048))
+                           (and (= first-byte 237) (>= codepoint 55296))
+                       ) ;or
+                   (error 'value-error "utf8->codepoint-at: invalid codepoint")
+                 ) ;when
+                 codepoint
+               ) ;let
+             ) ;let
+            ) ;
+
+            ((<= 240 first-byte 244)
+             (when (> (+ start 4) len)
+               (error 'value-error "utf8->codepoint-at: incomplete 4-byte sequence")
+             ) ;when
+             (let ((byte2 (bytevector-u8-ref bv (+ start 1)))
+                   (byte3 (bytevector-u8-ref bv (+ start 2)))
+                   (byte4 (bytevector-u8-ref bv (+ start 3)))
+                  ) ;
+               (unless (and (<= 128 byte2 191) (<= 128 byte3 191) (<= 128 byte4 191))
+                 (error 'value-error "utf8->codepoint-at: invalid continuation byte")
+               ) ;unless
+               (let ((codepoint (bitwise-ior (ash (bitwise-and first-byte 7) 18)
+                                             (ash (bitwise-and byte2 63) 12)
+                                             (ash (bitwise-and byte3 63) 6)
+                                             (bitwise-and byte4 63)
+                                ) ;bitwise-ior
+                     ) ;codepoint
+                    ) ;
+                 (when (or (< codepoint 65536)
+                           (> codepoint 1114111)
+                           (and (= first-byte 240) (< codepoint 65536))
+                           (and (= first-byte 244) (> codepoint 1114111))
+                       ) ;or
+                   (error 'value-error "utf8->codepoint-at: invalid codepoint")
+                 ) ;when
+                 codepoint
+               ) ;let
+             ) ;let
+            ) ;
+
+            (else (error 'value-error "utf8->codepoint-at: invalid UTF-8 sequence"))
+      ) ;cond
+    ) ;let
+  ) ;let
+) ;define
+
 (define unicode-max-codepoint 1114111)
 (define unicode-replacement-char 65533)

goldfish/scheme/char.scm

Lines changed: 1 addition & 2 deletions
@@ -5855,8 +5855,7 @@
 (if (>= pos len)
     (apply utf8-string (reverse result))
     (let* ((next (bytevector-advance-utf8 bv pos len))
-           (char-bv (bytevector-copy bv pos next))
-           (ch (integer->char (utf8->codepoint char-bv)))
+           (ch (integer->char (utf8->codepoint-at bv pos)))
            (new-ch (proc ch))
           ) ;
       (loop next (cons new-ch result))
