CDATA codegen updates, part 3#33
Merged
Merged
Conversation
These fields are only used at the end of input, and moving them into a different struct will make the per-state data accessed during DFA execution more compact.
Each state has two 256-bitsets, stored as a uint64_t[4], but the individual words in those have a lot of duplication. Add a table with every unique word, sorted descending by frequency, and replace the per-state labels and label_group_starts arrays with an array of offsets into the label_word table. Typically these offsets will fit in a uint8_t (though the code generation will switch to a uint16_t when necessary), making the per-state data much smaller. The label_word table's most commonly used entries are all grouped together and should stay in cache.
The edge sets leak when halting with FSM_DETERMINISE_WITH_CONFIG_STATE_LIMIT_REACHED.
Previously, the CDATA codegen wrote eager_outputs directly into the caller's match bit buffer as they were encountered. Instead, set them in a stack-allocated buffer, and then copy them to the caller's if the DFA match succeeds overall. In order to avoid repeatedly checking for whether an eager_output has already been set (in the buffer), this collects the set of all distinct eager_output IDs and then remaps the array with eager output IDs to offsets into the unique set. This condenses the (sparse) set into a dense series 0..n that can be represented by flags in a stack-allocated bit vector (with a size known at compile time), and redundant eager_outputs harmlessly set flag bits that are already set. If the overall match succeeds, that bit vector is matched up with the unique ID array and the sparse values are written into the caller's buffer. Because the unique ID array is sorted, the relative ordering of the sparse and dense IDs is preserved (and 0 stays 0), so using non-ascending values as terminators still works.
deg4uss3r
approved these changes
Dec 11, 2024
Member
deg4uss3r
left a comment
There was a problem hiding this comment.
Admittedly I am coming into the code base pretty cold and C isn't my most preferred language but I do not see anything here that would prevent this from being approved.
Because of the for loop init condition, this was checking bits `1 << 1..64` in every word -- it was unintentionally ignoring the least significant bit. It should check `1 << 0..64`.
katef
reviewed
Jun 16, 2025
katef
reviewed
Jun 16, 2025
katef
reviewed
Jun 16, 2025
Bubble up the allocation failure errors. Use `f_calloc` rather than `calloc`.
5df68c8 to
a99b913
Compare
These test combining `[""]` and `["", ""]`.
Previously this didn't handle mixed anchoring correctly, potentially leading to false positives the case represented by eager_output_alt_mixing_anchored_and_unanchored.c. See comments in fsm_union_repeated_pattern_group for details. Fuzzing did not turn up any new issues. Another commit after this will make a few small interface changes and update callers.
- Instead of taking an array of `struct fsm_union_array *` pointers, this now takes an array of `struct fsm *` pointers. The other fields on `fsm_union_array` are no longer used, so the extra struct layer has been removed. - This now takes an extra argument, id_base, because each nfa[i] will get end IDs and/or output IDs (i + id_base) set on them. Previously these were set by the caller. - Rename parameters, to emphasize that the FSMs must be NFAs. - Update the test code for the interface changes. - Remove flags from the test code that are no longer used.
There are two bugs captured in eager_output_unanchored_end_plus.c: - Regexes ending in '+' weren't combining correctly, because analysis wasn't properly handling the construction for matching but optionally repeating the last character. - Eager matching after consuming a single character from the start state wasn't linked correctly to the global_unanchored_start_loop, so while the labeled edges were copied the eager output was lost. The other test files are focused on variants of that -- the + and start cases individually, and when + precedes a `()` subtree with more than one character.
Fuzzing has produced inputs that cause this to fail, but they all depend on embedded '\0' characters. I wasn't able to reproduce the failure without those present, but I will investigate further later. For now, adding a TODO.
`(^|wax-)((?:banana|^apple))` is an example of a regex that needs multiple anchored_start states linked in order to combine correctly.
Something strange is happening in CI, and these may be involved.
katef
approved these changes
Jun 17, 2025
…eated_pattern_group Fix anchoring for `fsm_union_repeated_pattern_group`
57773a2 to
43c851e
Compare
Author
|
Force pushing to drop several temporary commits investigating CI failures (ultimately due to https://www.githubstatus.com/incidents/9qcwpy3ckdrf) |
This reverts commit 6078cdf. This is not the right place to do this, and this wasn't the actual root cause of the CI failures.
katef
approved these changes
Jun 20, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Some further updates, for a new Device Detection canary. This still isn't quite ready to go upstream, but it's getting closer.
Changes:
.endand.endid_offsetfields from the state struct table to a separate table. Those fields aren't used until the end of input, moving them out makes the state table more dense and improves locality.uint8_toruint16_toffsets into a shared word table the overall binary ends up considerably smaller. They are sorted by use (descending), so the most frequently used ones are likely to stay in cache.fsm_determinise_with_config's early exit katef/libfsm#504