CDATA codegen updates, part 3 by silentbicycle · Pull Request #33 · fastly/libfsm

silentbicycle · 2024-12-09T17:56:31Z

Some further updates, for a new Device Detection canary. This still isn't quite ready to go upstream, but it's getting closer.

Changes:

- Move the .end and .endid_offset fields from the state struct table to a separate table. Those fields aren't used until the end of input, so moving them out makes the state table denser and improves locality.
- Intern individual words from each state's 256-bitsets. These tend to be duplicated across several states, so if DFAs store them via four uint8_t or uint16_t offsets into a shared word table, the overall binary ends up considerably smaller. The words are sorted by use (descending), so the most frequently used ones are likely to stay in cache.
- Fix a memory leak. The fix is already merged upstream but has not been synced here: "Fix a memory leak during fsm_determinise_with_config's early exit" (katef/libfsm#504).
- Buffer eager_outputs until a successful match, rather than immediately writing them into the caller's buffer.

These fields are only used at the end of input, and moving them into a different struct will make the per-state data accessed during DFA execution more compact.
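As a rough illustration of the split described above, the following sketch contrasts a combined per-state struct with a hot table and a parallel end-of-input table. Only `.end` and `.endid_offset` come from the description; every other struct and field name here is hypothetical and does not reflect the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical combined layout: end-of-input fields sit next to the
 * fields that are read for every input byte. */
struct state_combined {
	uint8_t  label_offsets[4];   /* hypothetical: read per input byte */
	uint32_t dst_table_offset;   /* hypothetical: read per input byte */
	uint8_t  end;                /* read only at end of input */
	uint32_t endid_offset;       /* read only at end of input */
};

/* After the split, the hot table holds only what the inner loop reads. */
struct state_hot {
	uint8_t  label_offsets[4];
	uint32_t dst_table_offset;
};

/* The end-of-input fields live in a parallel table, indexed by state ID. */
struct state_end {
	uint8_t  end;
	uint32_t endid_offset;
};

/* Check whether a state is an end state, using the separate table. */
static int
is_end_state(const struct state_end *end_table, size_t state_id)
{
	return end_table[state_id].end != 0;
}
```

With a layout like this, more hot entries fit in each cache line, and `is_end_state` only touches the second table once the input has been consumed.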

Each state has two 256-bitsets, each stored as a uint64_t[4], but the individual words in those have a lot of duplication. Add a table with every unique word, sorted descending by frequency, and replace the per-state labels and label_group_starts arrays with arrays of offsets into the label_word table. Typically these offsets will fit in a uint8_t (though the code generation will switch to a uint16_t when necessary), making the per-state data much smaller. The label_word table's most commonly used entries are all grouped together and should stay in cache.
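The interning step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code generator itself: `struct word_count`, `intern_words`, and `word_offset` are hypothetical names, and the linear scans would likely be replaced by something faster in practice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry in the shared word table. */
struct word_count {
	uint64_t word;
	size_t count;
};

/* Sort by descending use count; break ties by word value for determinism. */
static int
cmp_word_count(const void *pa, const void *pb)
{
	const struct word_count *a = pa, *b = pb;
	if (a->count != b->count) { return a->count < b->count ? 1 : -1; }
	if (a->word != b->word) { return a->word < b->word ? -1 : 1; }
	return 0;
}

/* Build the table of unique words from `n` words of bitset data.
 * `table` must have room for n entries. Returns the unique word count. */
static size_t
intern_words(const uint64_t *words, size_t n, struct word_count *table)
{
	size_t used = 0;
	for (size_t i = 0; i < n; i++) {
		size_t j;
		for (j = 0; j < used; j++) {   /* linear scan: fine for a sketch */
			if (table[j].word == words[i]) { break; }
		}
		if (j == used) {               /* first time this word is seen */
			table[used].word = words[i];
			table[used].count = 0;
			used++;
		}
		table[j].count++;
	}
	qsort(table, used, sizeof(table[0]), cmp_word_count);
	return used;
}

/* Find the offset of `word` in the sorted table, or `used` if absent. */
static size_t
word_offset(const struct word_count *table, size_t used, uint64_t word)
{
	for (size_t i = 0; i < used; i++) {
		if (table[i].word == word) { return i; }
	}
	return used;
}
```

Each state would then store four small offsets per bitset instead of four 64-bit words. When the table has at most 256 entries the offsets fit in a uint8_t; otherwise a uint16_t is needed, as noted above.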

The edge sets leak when halting with FSM_DETERMINISE_WITH_CONFIG_STATE_LIMIT_REACHED.

Previously, the CDATA codegen wrote eager_outputs directly into the caller's match bit buffer as they were encountered. Instead, set them in a stack-allocated buffer, and then copy them to the caller's buffer if the DFA match succeeds overall.

To avoid repeatedly checking whether an eager_output has already been set in the buffer, this collects the set of all distinct eager_output IDs and then remaps the array of eager output IDs to offsets into the unique set. This condenses the (sparse) set into a dense series 0..n that can be represented by flags in a stack-allocated bit vector (with a size known at compile time), and redundant eager_outputs harmlessly set flag bits that are already set. If the overall match succeeds, that bit vector is matched up with the unique ID array and the sparse values are written into the caller's buffer. Because the unique ID array is sorted, the relative ordering of the sparse and dense IDs is preserved (and 0 stays 0), so using non-ascending values as terminators still works.
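A minimal sketch of the dense remapping and deferred copy, assuming the caller's buffer is an array of uint64_t bit flags indexed by sparse ID. The ID values, `unique_ids`, `set_dense`, `flush_to_caller`, and `run_match` are all hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example data: the sorted set of distinct eager output IDs.
 * Dense index i stands for sparse ID unique_ids[i]. */
#define UNIQUE_ID_COUNT 4
static const uint32_t unique_ids[UNIQUE_ID_COUNT] = { 0, 17, 300, 4096 };

#define DENSE_WORDS ((UNIQUE_ID_COUNT + 63) / 64)

/* Record a dense eager output locally. Setting an already-set bit is a
 * no-op, so no "was this already seen?" check is needed. */
static void
set_dense(uint64_t *dense, size_t dense_id)
{
	dense[dense_id / 64] |= (uint64_t)1 << (dense_id % 64);
}

/* Translate dense flags back to sparse IDs and set the corresponding
 * bits in the caller's buffer. */
static void
flush_to_caller(const uint64_t *dense, uint64_t *caller_bits)
{
	for (size_t i = 0; i < UNIQUE_ID_COUNT; i++) {
		if (dense[i / 64] & ((uint64_t)1 << (i % 64))) {
			const uint32_t id = unique_ids[i];
			caller_bits[id / 64] |= (uint64_t)1 << (id % 64);
		}
	}
}

/* Simulate a run that encounters the given dense IDs and then either
 * matches or fails. The caller's buffer is only touched on success. */
static void
run_match(const size_t *seen, size_t seen_count, int matched,
    uint64_t *caller_bits)
{
	uint64_t dense[DENSE_WORDS] = { 0 };   /* size known at compile time */
	for (size_t i = 0; i < seen_count; i++) {
		set_dense(dense, seen[i]);
	}
	if (matched) {
		flush_to_caller(dense, caller_bits);
	}
}
```

In this sketch the caller's buffer must be large enough for the largest sparse ID (4096 here, so at least 65 words), while the local buffer needs only one word.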

deg4uss3r

Admittedly, I am coming into the code base pretty cold, and C isn't my most preferred language, but I do not see anything here that would prevent this from being approved.

Because of the for loop's init condition, this was checking bits `1 << 1..64` in every word, so it was unintentionally ignoring the least significant bit. It should check `1 << 0..64` (half-open ranges, that is, bits 0 through 63).
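The off-by-one can be illustrated with a pair of small loops. This is not the CDATA interpreter's code, only a hypothetical reduction of the same mistake: starting the loop index at 1 instead of 0.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy form: starting at bit 1 skips the least significant bit. */
static int
count_set_bits_buggy(uint64_t word)
{
	int count = 0;
	for (int bit = 1; bit < 64; bit++) {
		if (word & ((uint64_t)1 << bit)) { count++; }
	}
	return count;
}

/* Fixed form: starting at bit 0 covers all 64 bits of the word. */
static int
count_set_bits_fixed(uint64_t word)
{
	int count = 0;
	for (int bit = 0; bit < 64; bit++) {
		if (word & ((uint64_t)1 << bit)) { count++; }
	}
	return count;
}
```

For a word whose only set bit is bit 0, the buggy loop sees nothing at all, which is how a label in the lowest position of any word could be missed.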

Bubble up the allocation failure errors. Use `f_calloc` rather than `calloc`.
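The general pattern of returning -1 instead of asserting might look like the sketch below. It deliberately does not use libfsm's `f_calloc`; `alloc_fn`, `build_table`, `print_sketch`, and `failing_alloc` are hypothetical stand-ins used only to show an allocation failure propagating to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical allocator hook, standing in for a library allocation
 * wrapper. It has the same shape as calloc. */
typedef void *alloc_fn(size_t n, size_t size);

/* Inner step: returns 0 on success, -1 on allocation failure,
 * rather than asserting that the allocation succeeded. */
static int
build_table(alloc_fn *alloc, size_t n, uint32_t **out)
{
	uint32_t *t = alloc(n, sizeof(*t));
	if (t == NULL) { return -1; }
	*out = t;
	return 0;
}

/* Outer entry point: passes the failure up instead of aborting. */
static int
print_sketch(alloc_fn *alloc, size_t state_count)
{
	uint32_t *table = NULL;
	if (build_table(alloc, state_count, &table) != 0) { return -1; }
	/* ... code generation would use the table here ... */
	free(table);   /* assumes the hook's memory is compatible with free */
	return 0;
}

/* An allocator that always fails, for exercising the error path. */
static void *
failing_alloc(size_t n, size_t size)
{
	(void)n;
	(void)size;
	return NULL;
}
```

Calling `print_sketch(calloc, 4)` takes the success path, while `print_sketch(failing_alloc, 4)` returns -1 without crashing.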

These test combining `[""]` and `["", ""]`.

Previously this didn't handle mixed anchoring correctly, potentially leading to false positives in the case represented by eager_output_alt_mixing_anchored_and_unanchored.c. See comments in fsm_union_repeated_pattern_group for details. Fuzzing did not turn up any new issues. Another commit after this will make a few small interface changes and update callers.

- Instead of taking an array of `struct fsm_union_array *` pointers, this now takes an array of `struct fsm *` pointers. The other fields on `fsm_union_array` are no longer used, so the extra struct layer has been removed.
- This now takes an extra argument, id_base, because each nfa[i] will get end IDs and/or output IDs (i + id_base) set on them. Previously these were set by the caller.
- Rename parameters, to emphasize that the FSMs must be NFAs.
- Update the test code for the interface changes.
- Remove flags from the test code that are no longer used.

There are two bugs captured in eager_output_unanchored_end_plus.c:
- Regexes ending in '+' weren't combining correctly, because analysis wasn't properly handling the construction for matching but optionally repeating the last character.
- Eager matching after consuming a single character from the start state wasn't linked correctly to the global_unanchored_start_loop, so while the labeled edges were copied the eager output was lost.

The other test files are focused on variants of that -- the + and start cases individually, and when + precedes a `()` subtree with more than one character.

Fuzzing has produced inputs that cause this to fail, but they all depend on embedded '\0' characters. I wasn't able to reproduce the failure without those present, but I will investigate further later. For now, adding a TODO.

`(^|wax-)((?:banana|^apple))` is an example of a regex that needs multiple anchored_start states linked in order to combine correctly.

Something strange is happening in CI, and these may be involved.

silentbicycle · 2025-06-20T13:03:56Z

Force pushing to drop several temporary commits investigating CI failures (ultimately due to https://www.githubstatus.com/incidents/9qcwpy3ckdrf)

This reverts commit 6078cdf. This is not the right place to do this, and this wasn't the actual root cause of the CI failures.

silentbicycle added 4 commits December 9, 2024 12:12

cdata: Move .end and (optional) .endid_offset into a separate array.

ed2e4d0

Fix memory leak.

af0ae65

silentbicycle requested review from cxreg and katef December 9, 2024 17:56

deg4uss3r approved these changes Dec 11, 2024

Fix a bug in the CDATA interpreter's bitset checking.

8cd564b

katef reviewed Jun 16, 2025

Comment thread src/libfsm/print/cdata.c Outdated

katef reviewed Jun 16, 2025

Comment thread src/libfsm/print/cdata.c Outdated

katef reviewed Jun 16, 2025

Comment thread Makefile Outdated

cdata: Instead of asserting, the print interface returns -1 on error.

a99b913

silentbicycle force-pushed the sv/cdata-codegen-updates-part-3 branch from 5df68c8 to a99b913 Compare June 16, 2025 19:35

silentbicycle and others added 17 commits June 16, 2025 15:38

cdata: Bubble up other allocation errors.

0509696

Rename test to _single, add _multiple variant.

c783ce8

Fix memory leak in fsm_eager_output_compact, found while fuzzing.

0b601de

fsm_union_repeated_pattern_group: fix linkage for mixed start anchoring.

971bac6

Interface change: Add 'const'.

b7fd1bc

update CI to lock to ubuntu 22.04 for clang 14.0.0

b79100a

keep matrix.os so we can easily add more os variations in the future

9040973

fuzz/target.c: fsm_union_repeated_pattern_group interface changes.

64ba903

union: Fix trivial memory leak.

e80d3c5

union.c: Add comments for assertions.

e91d25b

Switch to collecting an anchored_start state set, not just one state.

7a434af

retest: extra error checking for retest's fileno, isatty

a72fc08

cdata: change return types to void, since these functions can't fail.

f565b93

cdata: Error check.

8f1838b

katef approved these changes Jun 17, 2025

silentbicycle and others added 5 commits June 17, 2025 16:57

Merge pull request #34 from fastly/sv/fix-anchoring-for-fsm_union_rep…

aa17202

…eated_pattern_group Fix anchoring for `fsm_union_repeated_pattern_group`

cdata: Fix NULL check. This looks like a merge error.

285f444

github workflows: ensure ${env.wc} exists.

6078cdf

Merge branch 'main' into sv/cdata-codegen-updates-part-3

b7a3ffc

cdata: Add a few comments about internal structs.

43c851e

silentbicycle force-pushed the sv/cdata-codegen-updates-part-3 branch from 57773a2 to 43c851e Compare June 20, 2025 13:03

Revert "github workflows: ensure ${env.wc} exists."

3827dc5

katef approved these changes Jun 20, 2025

silentbicycle merged commit b8e2e9f into main Jun 20, 2025
349 checks passed

silentbicycle deleted the sv/cdata-codegen-updates-part-3 branch June 20, 2025 13:43

silentbicycle merged 29 commits into main from sv/cdata-codegen-updates-part-3
