TextMate (TM) grammars used for syntax highlighting can individually contain thousands of regexes and are, collectively, quite large. Their regexes can be minified, and the same pass can also make them faster. An existing library, oniguruma-parser's Optimizer module, is made specifically for this. It minifies Oniguruma regexes (the regex flavor used by TM grammars) and, when possible, applies performance improvements to them, all without changing what the regexes match.
These size reductions and performance improvements can be substantial.
Example: C++
The C++ TM grammar is very large and complex, so I'll use it as an example here.
- Size: The optimizer shaves more than 40,000 characters off the regexes alone in the C++ grammar, even though that grammar's regexes contain no insignificant whitespace or comments (some grammars do include them, which leaves even more room for minification).
- Performance: The optimizer improves the C++ grammar's performance by ~30%! It does so by making changes that significantly reduce the amount of backtracking needed by some very large, complex, and slow regexes.
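To make "minification without any change to what regexes match" concrete, here is a minimal sketch of the general kind of rewrite a regex minifier can make. The pattern pairs are illustrative assumptions, not output captured from oniguruma-parser, and the check uses JavaScript's built-in `RegExp` rather than Oniguruma, so it only demonstrates the idea.

```typescript
// Each pair is [verbose pattern, shorter equivalent]. These are hand-written
// illustrations, not actual optimizer output.
const pairs: Array<[string, string]> = [
  ["(?:a|b|c)", "[abc]"],            // single-character alternation -> character class
  ["(?:x)(?:y)", "xy"],              // groups that add nothing are unwrapped
  ["foo_bar|foo_baz", "foo_ba[rz]"], // shared prefix is written once
];

// Strings used to spot-check that both forms accept and reject the same inputs.
const inputs = ["", "a", "b", "c", "d", "ab", "x", "xy", "foo_bar", "foo_baz", "foo_ba"];

// Anchor both patterns so that whole-string matches are compared.
function agreeOn(verbose: string, compact: string, samples: string[]): boolean {
  const a = new RegExp(`^(?:${verbose})$`);
  const b = new RegExp(`^(?:${compact})$`);
  return samples.every((s) => a.test(s) === b.test(s));
}

for (const [verbose, compact] of pairs) {
  const saved = verbose.length - compact.length;
  console.log(`${verbose} -> ${compact}: saves ${saved} chars, same results: ${agreeOn(verbose, compact, inputs)}`);
}
```

A spot check like this cannot prove equivalence. It only shows that a shorter pattern can behave identically on the inputs tried.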
The following are rough timings for running Shiki with the same C++ grammar file used by VS Code, with and without pre-optimization by oniguruma-parser/optimizer. For the test, I loaded an otherwise-empty webpage that used Shiki 3.2.1 and its vscode-oniguruma/vscode-textmate based Oniguruma engine to highlight a C++ sample in Chrome 134 on my local system.
With a short C++ sample
Source: https://github.com/shikijs/textmate-grammars-themes/blob/main/samples/cpp.sample
- Unoptimized: 300-330ms
- Optimized: 230-250ms (23% less, comparing the low end of each)
With a medium-length C++ sample
Source: https://github.com/fmtlib/fmt/blob/master/src/os.cc
- Unoptimized: 1,490-1,600ms
- Optimized: 1,030-1,110ms (31% less, comparing the low end of each)
With a long C++ sample
Source: Same as the medium-length sample, but repeated 20 times.
- Unoptimized: 5,350-5,790ms
- Optimized: 3,810-3,920ms (29% less, comparing the low end of each)
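For reference, the "% less" figures can be reproduced from the low end of each range. The sketch below only redoes that arithmetic with the timings listed above; the type and function names are made up for illustration.

```typescript
// Timing ranges in milliseconds, copied from the measurements above.
interface TimingRange { low: number; high: number }
interface SampleTimings { name: string; unoptimized: TimingRange; optimized: TimingRange }

const samples: SampleTimings[] = [
  { name: "short", unoptimized: { low: 300, high: 330 }, optimized: { low: 230, high: 250 } },
  { name: "medium", unoptimized: { low: 1490, high: 1600 }, optimized: { low: 1030, high: 1110 } },
  { name: "long", unoptimized: { low: 5350, high: 5790 }, optimized: { low: 3810, high: 3920 } },
];

// Percentage reduction relative to the unoptimized time, rounded to a whole percent.
function percentLess(before: number, after: number): number {
  return Math.round(((before - after) / before) * 100);
}

for (const s of samples) {
  const pct = percentLess(s.unoptimized.low, s.optimized.low);
  console.log(`${s.name}: ${pct}% less (low end of each range)`);
}
// Prints 23%, 31%, and 29% for the short, medium, and long samples.
```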
Robustness
oniguruma-parser's optimizer has been battle-tested by the popular Shiki library, which runs all of its more than 220 included TM grammars through it. Shiki does so in tm-grammars, and tests that syntax highlighting results are identical for all grammars before and after optimization. You can see how Shiki applies the optimizer to TM grammars here.
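As a rough idea of what running a grammar through an optimizer involves, the sketch below walks a TM grammar object and applies a transform to every string stored under a `match`, `begin`, `end`, or `while` key. It is not Shiki's actual build code. `mapGrammarRegexes`, `totalRegexLength`, and the toy grammar are hypothetical names and data for illustration, and the transform is an identity stand-in so that the block runs without any dependency.

```typescript
type RegexTransform = (pattern: string) => string;

// Rule keys that hold regex source strings in a TM grammar.
// (Top-level fields such as firstLineMatch are ignored in this sketch.)
const REGEX_KEYS = new Set(["match", "begin", "end", "while"]);

// Returns a deep copy of `node` with `transform` applied to each regex string.
function mapGrammarRegexes(node: unknown, transform: RegexTransform): unknown {
  if (Array.isArray(node)) {
    return node.map((item) => mapGrammarRegexes(item, transform));
  }
  if (node !== null && typeof node === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(node as Record<string, unknown>)) {
      out[key] =
        REGEX_KEYS.has(key) && typeof value === "string"
          ? transform(value)
          : mapGrammarRegexes(value, transform);
    }
    return out;
  }
  return node; // strings, numbers, booleans, null are copied as-is
}

// Sums the length of every regex in a grammar, e.g. to measure size savings.
function totalRegexLength(grammar: unknown): number {
  let total = 0;
  mapGrammarRegexes(grammar, (p) => {
    total += p.length;
    return p;
  });
  return total;
}

// Tiny hypothetical grammar, just enough to exercise the walker.
const toyGrammar = {
  scopeName: "source.toy",
  patterns: [{ include: "#string" }, { name: "keyword.toy", match: "\\b(?:if|else)\\b" }],
  repository: {
    string: { begin: "\"", end: "\"", patterns: [{ match: "\\\\." }] },
  },
};

// Identity stand-in. In real use this would wrap the optimizer, possibly as
// (p) => optimize(p).pattern; that call shape is an assumption, so check the
// library's documentation before relying on it.
const standIn: RegexTransform = (p) => p;

const processed = mapGrammarRegexes(toyGrammar, standIn);
console.log(`regex chars before: ${totalRegexLength(toyGrammar)}, after: ${totalRegexLength(processed)}`);
```

Passing the transform in as a parameter keeps the traversal separate from the optimizer, which makes it straightforward to compare highlighting output before and after, as the Shiki tests described above do.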