Skip to content

CDATA codegen updates, part 3#33

Merged
silentbicycle merged 29 commits into
mainfrom
sv/cdata-codegen-updates-part-3
Jun 20, 2025
Merged

CDATA codegen updates, part 3#33
silentbicycle merged 29 commits into
mainfrom
sv/cdata-codegen-updates-part-3

Conversation

@silentbicycle
Copy link
Copy Markdown

Some further updates, for a new Device Detection canary. This still isn't quite ready to go upstream, but it's getting closer.

Changes:

  • Move .end and .endid_offset fields from the state struct table to a separate table. Those fields aren't used until the end of input, moving them out makes the state table more dense and improves locality.
  • Intern individual words from each state's 256-bitsets: These tend to be duplicated across several states, so if DFAs store them via four uint8_t or uint16_t offsets into a shared word table the overall binary ends up considerably smaller. They are sorted by use (descending), so the most frequently used ones are likely to stay in cache.
  • Fix a memory leak -- already merged upstream, but has not been synced here: Fix a memory leak during fsm_determinise_with_config's early exit katef/libfsm#504
  • Buffer eager_outputs until a successful match, rather than immediately writing them into the caller's buffer.

These fields are only used at the end of input, and moving them into a
different struct will make the per-state data accessed during DFA
execution more compact.
Each state has two 256-bitsets, stored as a uint64_t[4], but the
individual words in those have a lot of duplication. Add a table
with every unique word, sorted descending by frequency, and replace
the per-state labels and label_group_starts arrays with an array of
offsets into the label_word table. Typically these offsets will
fit in a uint8_t (though the code generation will switch to
a uint16_t when necessary), making the per-state data much smaller.
The label_word table's most commonly used entries are all grouped
together and should stay in cache.
The edge sets leak when halting with FSM_DETERMINISE_WITH_CONFIG_STATE_LIMIT_REACHED.
Previously, the CDATA codegen wrote eager_outputs directly into the
caller's match bit buffer as they were encountered. Instead, set them in
a stack-allocated buffer, and then copy them to the caller's if the DFA
match succeeds overall.

In order to avoid repeatedly checking for whether an eager_output has
already been set (in the buffer), this collects the set of all distinct
eager_output IDs and then remaps the array with eager output IDs to
offsets into the unique set. This condenses the (sparse) set
into a dense series 0..n that can be represented by flags in a
stack-allocated bit vector (with a size known at compile time), and
redundant eager_outputs harmlessly set flag bits that are already set.
If the overall match succeeds, that bit vector is matched up with the
unique ID array and the sparse values are written into the caller's
buffer.

Because the unique ID array is sorted, the relative ordering of the
sparse and dense IDs is preserved (and 0 stays 0), so using
non-ascending values as terminators still works.
@silentbicycle silentbicycle requested review from cxreg and katef December 9, 2024 17:56
Copy link
Copy Markdown
Member

@deg4uss3r deg4uss3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admittedly I am coming into the code base pretty cold and C isn't my most preferred language but I do not see anything here that would prevent this from being approved.

Because of the for loop init condition, this was checking bits
`1 << 1..64` in every word -- it was unintentionally ignoring
the least significant bit. It should check `1 << 0..64`.
Comment thread src/libfsm/print/cdata.c Outdated
Comment thread src/libfsm/print/cdata.c Outdated
Comment thread Makefile Outdated
Bubble up the allocation failure errors.

Use `f_calloc` rather than `calloc`.
@silentbicycle silentbicycle force-pushed the sv/cdata-codegen-updates-part-3 branch from 5df68c8 to a99b913 Compare June 16, 2025 19:35
silentbicycle and others added 17 commits June 16, 2025 15:38
These test combining `[""]` and `["", ""]`.
Previously this didn't handle mixed anchoring correctly, potentially
leading to false positives the case represented by
eager_output_alt_mixing_anchored_and_unanchored.c.

See comments in fsm_union_repeated_pattern_group for details.

Fuzzing did not turn up any new issues.

Another commit after this will make a few small interface changes
and update callers.
- Instead of taking an array of `struct fsm_union_array *` pointers,
  this now takes an array of `struct fsm *` pointers. The other
  fields on `fsm_union_array` are no longer used, so the extra
  struct layer has been removed.

- This now takes an extra argument, id_base, because each
  nfa[i] will get end IDs and/or output IDs (i + id_base) set on them.
  Previously these were set by the caller.

- Rename parameters, to emphasize that the FSMs must be NFAs.

- Update the test code for the interface changes.

- Remove flags from the test code that are no longer used.
There are two bugs captured in eager_output_unanchored_end_plus.c:

- Regexes ending in '+' weren't combining correctly, because analysis
  wasn't properly handling the construction for matching but optionally
  repeating the last character.

- Eager matching after consuming a single character from the start state
  wasn't linked correctly to the global_unanchored_start_loop, so while
  the labeled edges were copied the eager output was lost.

The other test files are focused on variants of that -- the + and
start cases individually, and when + precedes a `()` subtree with
more than one character.
Fuzzing has produced inputs that cause this to fail, but they
all depend on embedded '\0' characters. I wasn't able to reproduce
the failure without those present, but I will investigate further
later. For now, adding a TODO.
`(^|wax-)((?:banana|^apple))` is an example of a regex that needs
multiple anchored_start states linked in order to combine correctly.
Something strange is happening in CI, and these may be involved.
@silentbicycle silentbicycle force-pushed the sv/cdata-codegen-updates-part-3 branch from 57773a2 to 43c851e Compare June 20, 2025 13:03
@silentbicycle
Copy link
Copy Markdown
Author

Force pushing to drop several temporary commits investigating CI failures (ultimately due to https://www.githubstatus.com/incidents/9qcwpy3ckdrf)

This reverts commit 6078cdf.

This is not the right place to do this, and this wasn't the actual
root cause of the CI failures.
@silentbicycle silentbicycle merged commit b8e2e9f into main Jun 20, 2025
349 checks passed
@silentbicycle silentbicycle deleted the sv/cdata-codegen-updates-part-3 branch June 20, 2025 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants