feat: Generate all possible wildcard subqueries for non-capture schema-based search, preventing unnecessary archive decompression.#1313
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a new clp::SchemaSearcher for schema-driven query expansion, removes GrepCore's lexer-based token-analysis overload and private SearchToken, consolidates token scanning into a single string-based implementation, updates process_raw_query to optionally call SchemaSearcher, and adds tests, mocks, utilities, and CMake updates. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant GrepCore as GrepCore
participant SchemaSearcher as SchemaSearcher
participant Lexer as Lexer
participant LogTypeDict as LogTypeDict
participant VarDict as VarDict
Client->>GrepCore: process_raw_query(search_string, use_heuristic?)
alt use_heuristic = false
GrepCore->>SchemaSearcher: search(search_string, lexer, LogTypeDict, VarDict, ignore_case)
SchemaSearcher->>Lexer: tokenize(search_string)
Lexer-->>SchemaSearcher: tokens
SchemaSearcher->>SchemaSearcher: normalize_interpretations()
loop per interpretation
SchemaSearcher->>SchemaSearcher: get_wildcard_encodable_positions()
SchemaSearcher->>SchemaSearcher: enumerate mask combinations
loop per mask
SchemaSearcher->>SchemaSearcher: generate_logtype_string(mask)
SchemaSearcher->>LogTypeDict: get matching logtypes(logtype_string)
LogTypeDict-->>SchemaSearcher: matching entries
SchemaSearcher->>VarDict: resolve variable tokens
VarDict-->>SchemaSearcher: variable matches
end
SchemaSearcher->>SchemaSearcher: assemble SubQuery
end
SchemaSearcher-->>GrepCore: vector<SubQuery>
else use_heuristic = true
GrepCore->>GrepCore: heuristic token scan (get_bounds_of_next_potential_var)
GrepCore->>GrepCore: build sub_queries heuristically
end
GrepCore-->>Client: sub_queries
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (6)
components/core/src/clp/GrepCore.cpp(2 hunks)components/core/src/clp/GrepCore.hpp(5 hunks)components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp(1 hunks)components/core/src/clp/ffi/ir_stream/search/test/utils.cpp(1 hunks)components/core/tests/test-GrepCore.cpp(1 hunks)taskfiles/deps/main.yaml(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cppcomponents/core/src/clp/ffi/ir_stream/search/test/utils.cppcomponents/core/src/clp/GrepCore.cppcomponents/core/tests/test-GrepCore.cppcomponents/core/src/clp/GrepCore.hpp
🧠 Learnings (3)
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
PR: y-scope/clp#557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.
Applied to files:
components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
PR: y-scope/clp#570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.
Applied to files:
components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp/ffi/ir_stream/search/test/utils.cpp
🧬 Code graph analysis (3)
components/core/src/clp/GrepCore.cpp (2)
components/core/src/clp/GrepCore.hpp (2)
interpretation(193-195)interpretation(207-210)components/core/src/clp/EncodedVariableInterpreter.hpp (5)
add_dict_var(258-263)add_dict_var(533-546)add_dict_var(533-538)encoded_var(52-52)encoded_var(111-111)
components/core/tests/test-GrepCore.cpp (1)
components/core/src/clp/GrepCore.cpp (2)
get_bounds_of_next_potential_var(28-146)get_bounds_of_next_potential_var(28-33)
components/core/src/clp/GrepCore.hpp (4)
components/core/src/clp/QueryToken.cpp (2)
is_var(137-145)is_var(137-137)components/core/src/clp/EncodedVariableInterpreter.hpp (13)
logtype(58-60)logtype(58-58)logtype(66-68)logtype(66-66)logtype(74-76)logtype(74-74)logtype(82-84)logtype(82-82)wildcard_search_dictionary_and_get_encoded_matches(218-223)wildcard_search_dictionary_and_get_encoded_matches(486-511)wildcard_search_dictionary_and_get_encoded_matches(486-491)encoded_var(52-52)encoded_var(111-111)components/core/src/clp/GrepCore.cpp (4)
get_wildcard_encodable_positions(148-164)get_wildcard_encodable_positions(148-149)generate_logtype_string(166-233)generate_logtype_string(166-169)components/core/src/glt/Query.hpp (1)
logtype_entries(103-105)
🔇 Additional comments (13)
components/core/src/clp/ffi/ir_stream/search/test/utils.cpp (1)
15-15: LGTM! Required header for fmt::join usage.The addition of
<fmt/ranges.h>is necessary to enable thefmt::joinfunction used on line 64. This aligns with the fmt 11.x upgrade.components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp (1)
15-15: LGTM! Consistent header addition for fmt::join.The addition of
<fmt/ranges.h>enablesfmt::joinfunctionality on line 237, consistent with the fmt 11.x upgrade.components/core/tests/test-GrepCore.cpp (2)
12-14: Good use of brace initialization for consistency.The change to brace initialization enhances code consistency and prevents narrowing conversions.
79-99: Wildcard handling changes are intentional.GrepCore documents a fallback to "decompression + wildcard matching" for tokens with greedy wildcards (components/core/src/clp/GrepCore.hpp ~lines 472–476), and the updated tests in components/core/tests/test-GrepCore.cpp (wildcard case starting ~line 75) reflect the DFA-based behaviour: escaped '' now delimits "1" while unescaped '' is preserved in "abc*123".
components/core/src/clp/GrepCore.hpp (3)
130-182: Well-documented schema sub-query generation logic.The documentation clearly explains the complex logic for handling encodable wildcard variables and the binary choices for interpretation. The example effectively illustrates the 2^k possible combinations.
336-346: Verify thread-safety of static variables in GrepCore.hppinterpretations_generated and interpretations (components/core/src/clp/GrepCore.hpp: lines 336–346) are static and can race under concurrent access. The verification script produced no output, so concurrency of this code path or existing synchronization could not be confirmed. Confirm whether this runs on multiple threads; if so, replace static state with per-call/per-archive state, use thread_local, or guard access with a mutex/atomic.
613-614: Verify SymbolId underlying type and make casts consistent.SymbolId is defined in external log_surgeon — confirm its underlying integer type and align casts: GrepCore.hpp:613–614 and GrepCore.cpp:156–157,196–197 use static_cast<uint32_t>(TokenInt/TokenFloat) while Utils.cpp (≈140–156) uses static_cast(log_surgeon::SymbolId::TokenInt/TokenFloat). Match the cast to the enum's underlying type (or change var_type) to avoid subtle mismatches.
taskfiles/deps/main.yaml (3)
443-447: Approved — spdlog v1.15.3 is compatible with fmt v11.2.0.
Release notes show spdlog 1.15.3 bumped the bundled {fmt} to 11.2.0, so no changes required.
253-254: Verify fmt 11.2.0 compatibility across the codebase.
- Static scan shows multiple fmt call sites; representative hits: components/core/src/clp_s/Utils.cpp:81, components/core/src/clp_s/kv_ir_search.cpp:175, components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:31,58, components/core/tests/test-clp_s-search.cpp:117,215, components/core/src/clp_s/JsonConstructor.cpp:27.
- ast-grep returned no exact matches for custom formatters (no
template<> struct fmt::formatter<...>found).- Action: build the project with fmt 11.2.0 and run all unit/integration tests; if you get compiler or test failures, update affected call sites or add/update any custom formatters (also search for partial specializations or ADL-based formatters) and re-run tests.
320-321: Confirm log_surgeon update (a722d07) includes QueryInterpretation.
a722d07 tarball contains src/log_surgeon/wildcard_query_parser/QueryInterpretation.hpp and QueryInterpretation.cpp (also Query.hpp/Query.cpp and tests/test-query-interpretation.cpp), so the log_surgeon::wildcard_query_parser::QueryInterpretation referenced by components/core/src/clp/GrepCore.* is present.components/core/src/clp/GrepCore.cpp (3)
18-26: LGTM: token/type aliases are clear and scopedThe using-declarations improve readability without polluting too broadly. No issues spotted.
213-230: LGTM: representable-int/float fast-path with dictionary fallbackThe detection path is clear and orders checks correctly; encoded_var capture is fine as an out-param.
199-211: C++20 is already enforced for the target that builds this TU — no action required.The clp target sets target_compile_features(... PRIVATE cxx_std_20) (components/core/src/clp/clp/CMakeLists.txt:172) and other CLP targets likewise use cxx_std_20, so std::unordered_map::contains is supported.
There was a problem hiding this comment.
Actionable comments posted: 7
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/core/src/clp/GrepCore.hpp(5 hunks)components/core/tests/test-GrepCore.cpp(1 hunks)taskfiles/deps/main.yaml(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/tests/test-GrepCore.cppcomponents/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (2)
components/core/tests/test-GrepCore.cpp (1)
components/core/src/clp/GrepCore.cpp (2)
get_bounds_of_next_potential_var(28-146)get_bounds_of_next_potential_var(28-33)
components/core/src/clp/GrepCore.hpp (3)
components/core/src/clp/DictionaryReader.hpp (1)
search_string(94-94)components/core/src/clp/EncodedVariableInterpreter.hpp (13)
logtype(63-65)logtype(63-63)logtype(71-73)logtype(71-71)logtype(79-81)logtype(79-79)logtype(87-89)logtype(87-87)wildcard_search_dictionary_and_get_encoded_matches(222-227)wildcard_search_dictionary_and_get_encoded_matches(497-522)wildcard_search_dictionary_and_get_encoded_matches(497-502)encoded_var(57-57)encoded_var(116-116)components/core/src/clp/GrepCore.cpp (4)
get_wildcard_encodable_positions(148-164)get_wildcard_encodable_positions(148-149)generate_logtype_string(166-233)generate_logtype_string(166-169)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (7)
components/core/tests/test-GrepCore.cpp (2)
13-16: LGTM: test initialisation simplified.Brace‑initialising begin/end/is_var improves clarity; signature update aligns with the header.
21-101: LGTM: expectations preserved after signature change.The call sites correctly drop the lexer argument and still validate tokenisation semantics.
If CI allows, add one integration test covering the new interpretation‑driven path in process_raw_query to ensure schema‑based subquery generation remains sound.
components/core/src/clp/GrepCore.hpp (2)
184-195: Fix typos in documentation.“indicies” → “indices”; also remove the duplicated spacing.
- * Scans the interpretation and returns the indicies of all encodable * wildcard variables. + * Scans the interpretation and returns the indices of all encodable wildcard variables. @@ - * @return A vector of positions of encodabe wildcard variables. + * @return A vector of positions of encodable wildcard variables.
540-542: Replace magic number 32 with a named constant.Carry forward the earlier feedback and make the limit self-documenting.
- if (wildcard_encodable_positions.size() > 32) { + if (wildcard_encodable_positions.size() > cMaxEncodableWildcardVariables) { throw std::runtime_error("Too many encodable variables."); }Add near the class’ private section (example placement shown):
// Place inside class GrepCore, private: section static constexpr size_t cMaxEncodableWildcardVariables{32};taskfiles/deps/main.yaml (3)
443-448: spdlog–fmt ABI compatibility — verify in CIv1.15.3 bundles fmt 11.2.0 and is compatible with fmt 11.x; run a minimal CI build compiling with -DSPDLOG_FMT_EXTERNAL=ON against fmt 11.2.0 (or the latest 11.x) to confirm no platform/toolchain-specific issues.
253-255: SHA256 verified — run CI to confirm build compatibility
- taskfiles/deps/main.yaml (lines 253–255): TARBALL_SHA256 matches the downloaded fmt 11.2.0 tarball: bc23066d87ab3168f27cef3e97d545fa63314f5c79df5ea444d41d56f962c6af.
- Repository shows widespread fmt usage (find_package(fmt REQUIRED), fmt::fmt target, many fmt::format/format_to/formatter/includes).
- Cannot compile here; run CI or a local full build to ensure all consumers compile cleanly with fmt 11.2.0.
320-322: APIs present in commit a722d07 — OK to pin.Query::get_all_multi_token_interpretations is in src/log_surgeon/wildcard_query_parser/Query.hpp; QueryInterpretation, VariableQueryToken and StaticQueryToken headers are present under src/log_surgeon/wildcard_query_parser/; SymbolId enum is in src/log_surgeon/Constants.hpp. No missing headers or symbols found.
There was a problem hiding this comment.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/core/src/clp/GrepCore.cpp(2 hunks)components/core/src/clp/GrepCore.hpp(6 hunks)components/core/src/clp/Query.hpp(1 hunks)components/core/tests/test-GrepCore.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/Query.hppcomponents/core/src/clp/GrepCore.hppcomponents/core/tests/test-GrepCore.cppcomponents/core/src/clp/GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.
Applied to files:
components/core/src/clp/GrepCore.cpp
🧬 Code graph analysis (3)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
interpretation(48-51)interpretation(48-49)interpretation(53-58)interpretation(53-56)logtype(207-216)logtype(207-208)logtype(218-228)logtype(218-222)generate_schema_sub_queries(81-94)generate_schema_sub_queries(81-86)components/core/src/clp/EncodedVariableInterpreter.hpp (13)
logtype(63-65)logtype(63-63)logtype(71-73)logtype(71-71)logtype(79-81)logtype(79-79)logtype(87-89)logtype(87-87)wildcard_search_dictionary_and_get_encoded_matches(222-227)wildcard_search_dictionary_and_get_encoded_matches(497-522)wildcard_search_dictionary_and_get_encoded_matches(497-502)encoded_var(57-57)encoded_var(116-116)components/core/src/clp/GrepCore.cpp (4)
get_wildcard_encodable_positions(148-164)get_wildcard_encodable_positions(148-149)generate_logtype_string(166-233)generate_logtype_string(166-169)
components/core/tests/test-GrepCore.cpp (4)
components/core/src/clp/GrepCore.hpp (9)
interpretation(197-199)interpretation(211-214)process_schema_var_token(235-241)process_schema_var_token(606-668)process_schema_var_token(606-612)generate_schema_sub_queries(178-185)generate_schema_sub_queries(535-603)generate_schema_sub_queries(535-541)value(75-80)components/core/src/clp/GrepCore.cpp (6)
get_wildcard_encodable_positions(148-164)get_wildcard_encodable_positions(148-149)generate_logtype_string(166-233)generate_logtype_string(166-169)get_bounds_of_next_potential_var(28-146)get_bounds_of_next_potential_var(28-33)components/core/src/clp/EncodedVariableInterpreter.hpp (14)
id(56-56)value(97-100)value(107-110)add_dict_var(266-271)add_dict_var(548-561)add_dict_var(548-553)logtype(63-65)logtype(63-63)logtype(71-73)logtype(71-71)logtype(79-81)logtype(79-79)logtype(87-89)logtype(87-87)components/core/src/clp/Query.cpp (2)
add_dict_var(102-105)add_dict_var(103-103)
components/core/src/clp/GrepCore.cpp (2)
components/core/src/clp/GrepCore.hpp (2)
interpretation(197-199)interpretation(211-214)components/core/tests/test-GrepCore.cpp (7)
interpretation(48-51)interpretation(48-49)interpretation(53-58)interpretation(53-56)add_int_var(175-175)add_float_var(177-177)add_dict_var(179-179)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (15)
components/core/src/clp/Query.hpp (1)
158-159: Deterministic default for wildcard flag — good.Default-initializing m_wildcard_match_required to false prevents UB from uninitialized reads. Please confirm SubQuery::clear() also resets this flag to false to maintain invariants across reuse.
components/core/tests/test-GrepCore.cpp (1)
590-718: Solid coverage for schema subquery generation.Good end-to-end assertions on wildcard-encodable combinations, variable shapes, and logtype IDs.
components/core/src/clp/GrepCore.hpp (8)
9-14: Add missing include for std::set.This header declares std::set in public APIs; include to avoid relying on transitive includes.
Apply:
#include <string_view> +#include <set> #include <unordered_map> #include <unordered_set> #include <vector>
188-196: Typos in documentation.Fix “indicies” → “indices” and “encodabe” → “encodable”.
- * Scans the interpretation and returns the indicies of all encodable wildcard variables. + * Scans the interpretation and returns the indices of all encodable wildcard variables. ... - * @return A vector of positions of encodabe wildcard variables. + * @return A vector of positions of encodable wildcard variables.
207-213: Typo in documentation.“interpetation” → “interpretation”.
- * @param interpretation The interpetation to convert to a logtype string. + * @param interpretation The interpretation to convert to a logtype string.
288-294: Follow project style: usefalse == exprinstead of!expr.Aligns with coding guidelines.
As per coding guidelines
- if (!query_token.has_greedy_wildcard_in_middle() && query_token.is_ambiguous_token()) { + if (false == query_token.has_greedy_wildcard_in_middle() + && query_token.is_ambiguous_token()) + {
342-352: Remove thread-unsafe statics; compute interpretations per call (or add a proper cache).Function-local statics here are shared across calls and archives, race-prone, and nullified by cExecuteForEveryArchive. Construct Query and get interpretations per call.
- static bool interpretations_generated{false}; - static std::set<log_surgeon::wildcard_query_parser::QueryInterpretation> interpretations; - - // TODO: This needs to be done for every archive until we have per schema logic. - constexpr bool cExecuteForEveryArchive{true}; - if (cExecuteForEveryArchive || false == interpretations_generated) { - log_surgeon::wildcard_query_parser::Query const query(search_string); - interpretations.clear(); - interpretations = query.get_all_multi_token_interpretations(lexer); - interpretations_generated = true; - } + log_surgeon::wildcard_query_parser::Query const query(search_string); + auto interpretations = query.get_all_multi_token_interpretations(lexer);If caching is required, introduce a schema-keyed cache with synchronisation.
353-356: Fix comment typo “Transfrom”.- // Transfrom log-surgeon interpretations into CLP sub-queries. + // Transform log-surgeon interpretations into CLP sub-queries.
543-551: Avoid undefined/implementation-defined shifts; extract magic number.
- size_t shifting by variable bit width is UB on 32-bit; use uint64_t and guard.
- Replace magic 32 with a named constant.
- if (wildcard_encodable_positions.size() > 32) { - throw std::runtime_error("Too many encodable variables."); - } - size_t const num_combos{static_cast<size_t>(1) << wildcard_encodable_positions.size()}; - for (size_t mask{0}; mask < num_combos; ++mask) { + constexpr size_t cMaxEncodableWildcardVariables{32}; + auto const k = wildcard_encodable_positions.size(); + if (k > cMaxEncodableWildcardVariables) { + throw std::runtime_error("Too many encodable variables."); + } + if (k >= 64) { + throw std::runtime_error("Wildcard encodable variable count exceeds mask width."); + } + uint64_t const num_combos{1ULL << k}; + for (uint64_t mask{0}; mask < num_combos; ++mask) {
549-552: Reduce per-iteration allocations.Building an unordered_map per mask is expensive. Use a flat container (e.g., vector<uint8_t>) keyed by local index, or reuse a pre-sized map.
Example approach:
- Precompute vector<size_t> positions = wildcard_encodable_positions.
- For each mask, fill vector<uint8_t> encoded(k) with (mask >> i) & 1 and consult positions[i] inside generate_logtype_string via a small adapter.
components/core/src/clp/GrepCore.cpp (5)
6-7: Include for std::holds_alternative/std::get.Avoid relying on transitive includes.
#include <string_view> +#include <variant> #include <unordered_map> #include <vector>
155-158: Prefer strongly typed enum comparisons.Compare as SymbolId, not via uint32_t.
- auto const var_type{variable_token.get_variable_type()}; - bool const is_int{static_cast<uint32_t>(TokenInt) == var_type}; - bool const is_float{static_cast<uint32_t>(TokenFloat) == var_type}; + auto const var_symbol = static_cast<decltype(TokenInt)>(variable_token.get_variable_type()); + bool const is_int = (var_symbol == TokenInt); + bool const is_float = (var_symbol == TokenFloat);
172-181: Reserve logic assumes one byte per placeholder.If EncodedVariableInterpreter ever appends more than one byte, this under-reserves. Consider exposing/using a constexpr placeholder size or asserting the invariant.
184-185: Fix comment typo.“Generete” → “Generate”.
- // Generete `logtype_string`. + // Generate `logtype_string`.
196-198: Prefer strongly typed enum comparisons.Repeat: compare as SymbolId, not via uint32_t.
- bool const is_int{static_cast<uint32_t>(TokenInt) == var_type}; - bool const is_float{static_cast<uint32_t>(TokenFloat) == var_type}; + bool const is_int = (var_type == TokenInt); + bool const is_float = (var_type == TokenFloat);
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp/GrepCore.cpp(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.
Applied to files:
components/core/src/clp/GrepCore.cpp
🧬 Code graph analysis (1)
components/core/src/clp/GrepCore.cpp (3)
components/core/src/clp/GrepCore.hpp (2)
interpretation(197-199)interpretation(211-214)components/core/tests/test-GrepCore.cpp (7)
interpretation(48-51)interpretation(48-49)interpretation(53-58)interpretation(53-56)add_int_var(175-175)add_float_var(177-177)add_dict_var(179-179)components/core/src/clp/EncodedVariableInterpreter.hpp (5)
add_dict_var(266-271)add_dict_var(548-561)add_dict_var(548-553)encoded_var(57-57)encoded_var(116-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: package-image
🔇 Additional comments (5)
components/core/src/clp/GrepCore.cpp (5)
19-23: LGTM: explicit SymbolId enumerators.Bringing TokenInt/TokenFloat into scope is clear and consistent with usage below.
25-26: LGTM: local aliases for STL types.Keeps signatures succinct without hiding ownership semantics.
173-183: Preallocation assumes add_*_var appends one character.If add_int_var/add_float_var/add_dict_var ever append more than one byte, this under-reserves. Either expose constexpr sizes or assert the invariant.
Can you confirm each add_*_var appends exactly one byte today?
185-185: Fix typo in comment.“Generete” -> “Generate”.
- // Generete `logtype_string`. + // Generate `logtype_string`.
167-232: Cacheinterpretation.get_logtype(), iterate by const& and replacecontains+atwith a singlefind()
Removes double map lookup and drops the C++20unordered_map::containsdependency; verify your CMakeLists.txt C++ standard to ensure C++17 compatibility.
There was a problem hiding this comment.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/GrepCore.cpp(2 hunks)components/core/src/clp/GrepCore.hpp(6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/GrepCore.hppcomponents/core/src/clp/GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.
Applied to files:
components/core/src/clp/GrepCore.hppcomponents/core/src/clp/GrepCore.cpp
🧬 Code graph analysis (2)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
interpretation(48-51)interpretation(48-49)interpretation(53-58)interpretation(53-56)logtype(207-216)logtype(207-208)logtype(218-228)logtype(218-222)generate_schema_sub_queries(81-94)generate_schema_sub_queries(81-86)components/core/src/clp/EncodedVariableInterpreter.hpp (13)
logtype(63-65)logtype(63-63)logtype(71-73)logtype(71-71)logtype(79-81)logtype(79-79)logtype(87-89)logtype(87-87)wildcard_search_dictionary_and_get_encoded_matches(222-227)wildcard_search_dictionary_and_get_encoded_matches(497-522)wildcard_search_dictionary_and_get_encoded_matches(497-502)encoded_var(57-57)encoded_var(116-116)components/core/src/clp/GrepCore.cpp (4)
get_wildcard_encodable_positions(149-165)get_wildcard_encodable_positions(149-150)generate_logtype_string(167-232)generate_logtype_string(167-170)
components/core/src/clp/GrepCore.cpp (3)
components/core/src/clp/GrepCore.hpp (2)
interpretation(198-200)interpretation(212-215)components/core/tests/test-GrepCore.cpp (5)
interpretation(48-51)interpretation(48-49)interpretation(53-58)interpretation(53-56)add_dict_var(179-179)components/core/src/clp/EncodedVariableInterpreter.hpp (5)
add_dict_var(266-271)add_dict_var(548-561)add_dict_var(548-553)encoded_var(57-57)encoded_var(116-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (13)
components/core/src/clp/GrepCore.hpp (10)
7-7: Good fix: add missing include.Prevents reliance on transitive includes.
212-215: LGTM: clear API for logtype-string generation.
235-243: LGTM: helper API surface reads well.
293-297: LGTM: project style honoured (false == …).
624-627: Confirm intent: encoded-wildcard path adds no var constraint.Early-return marks wildcard_match_required without adding an encoded var. If that’s intentional (accept any encoded value and defer to message-level matching), fine; otherwise, add a non-dict var placeholder for clarity/consistency.
28-32: Gate test-only friend behind a build flag.Expose the friend only in test builds to avoid leaking test details in public headers.
- class GrepCoreTest; + #ifdef CLP_ENABLE_TESTS + class GrepCoreTest; + #endif ... class GrepCore { - friend class GrepCoreTest; + #ifdef CLP_ENABLE_TESTS + friend class GrepCoreTest; + #endif
345-355: Remove unsafe function-local statics; compute interpretations per call.Statics here are thread-unsafe and nullified by cExecuteForEveryArchive{true}. Build the interpretations locally.
- static bool interpretations_generated{false}; - static std::set<log_surgeon::wildcard_query_parser::QueryInterpretation> interpretations; - - // TODO: This needs to be done for every archive until we have per schema logic. - constexpr bool cExecuteForEveryArchive{true}; - if (cExecuteForEveryArchive || false == interpretations_generated) { - log_surgeon::wildcard_query_parser::Query const query(search_string); - interpretations.clear(); - interpretations = query.get_all_multi_token_interpretations(lexer); - interpretations_generated = true; - } + log_surgeon::wildcard_query_parser::Query const query(search_string); + auto interpretations = query.get_all_multi_token_interpretations(lexer);
356-356: Fix typo: “Transfrom” → “Transform”.- // Transfrom log-surgeon interpretations into CLP sub-queries. + // Transform log-surgeon interpretations into CLP sub-queries.
545-556: Fix undefined/implementation-defined shifts; use 64-bit masks and guard k.Shifting size_t by its bit width is UB on 32-bit. Also 2^32 combos is infeasible.
- size_t const num_combos{static_cast<size_t>(1) << wildcard_encodable_positions.size()}; - for (size_t mask{0}; mask < num_combos; ++mask) { + auto const k = wildcard_encodable_positions.size(); + if (k >= 64) { + throw std::runtime_error("Too many encodable variables for mask enumeration (k >= 64)."); + } + uint64_t const num_combos{1ULL << k}; + for (uint64_t mask{0}; mask < num_combos; ++mask) { std::unordered_map<size_t, bool> wildcard_mask_map; - for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) { - wildcard_mask_map[wildcard_encodable_positions[i]] = mask >> i & 1; + for (size_t i{0}; i < k; ++i) { + wildcard_mask_map[wildcard_encodable_positions[i]] = (mask >> i) & 1ULL; }Additionally, consider lowering the practical cap from 32 to something manageable (e.g., 20) to avoid 2^k explosion.
609-623: Use strongly typed enum comparisons for SymbolId.Avoid uint32_t-based comparisons; compare as SymbolId.
- auto const var_type{variable_token.get_variable_type()}; - bool const is_int{static_cast<uint32_t>(log_surgeon::SymbolId::TokenInt) == var_type}; - bool const is_float{static_cast<uint32_t>(log_surgeon::SymbolId::TokenFloat) == var_type}; + auto const var_symbol = static_cast<log_surgeon::SymbolId>(variable_token.get_variable_type()); + bool const is_int = (var_symbol == log_surgeon::SymbolId::TokenInt); + bool const is_float = (var_symbol == log_surgeon::SymbolId::TokenFloat);components/core/src/clp/GrepCore.cpp (3)
6-8: LGTM: explicit STL includes (unordered_map, variant, vector).
19-23: LGTM: using-declarations improve readability and locality.
149-165: Avoid variant copies and repeated get_logtype() calls.Cache the logtype and iterate by const reference; reserve capacity.
-auto GrepCore::get_wildcard_encodable_positions(QueryInterpretation const& interpretation) - -> vector<size_t> { - vector<size_t> wildcard_encodable_positions; - for (size_t i{0}; i < interpretation.get_logtype().size(); ++i) { - auto const token{interpretation.get_logtype()[i]}; +auto GrepCore::get_wildcard_encodable_positions(QueryInterpretation const& interpretation) + -> vector<size_t> { + auto const& logtype = interpretation.get_logtype(); + vector<size_t> wildcard_encodable_positions; + wildcard_encodable_positions.reserve(logtype.size()); + for (size_t i{0}; i < logtype.size(); ++i) { + auto const& token = logtype[i];
…esting code for DP algo; Add final unit-test for testing PR end-to-end; Still lots of edge cases and sub-steps that could be tested more rigorously.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (6)
components/core/src/clp/GrepCore.hpp (5)
191-197: Fix typos in docs.“than” → “that”; “encodabe” → “encodable”.
- * An encodable variable is a variable token than: + * An encodable variable is a variable token that: @@ - * @return A vector of positions of encodabe wildcard variables. + * @return A vector of positions of encodable wildcard variables.
28-32: Gate test‑only friend with a build flag.Avoid leaking test details to consumers; wrap the friend in CLP_BUILD_TESTING (or your project’s test macro).
-class GrepCoreTest; + #if defined(CLP_BUILD_TESTING) + class GrepCoreTest; + #endif class GrepCore { - friend class GrepCoreTest; + #if defined(CLP_BUILD_TESTING) + friend class GrepCoreTest; + #endif
544-547: Avoid per‑iteration unordered_map allocations for the mask.This map is rebuilt for every mask; switch to a flat mask (vector<uint8_t>/bitset) or add an overload of generate_logtype_string that takes (positions, mask).
- std::unordered_map<size_t, bool> wildcard_mask_map; - for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) { - wildcard_mask_map[wildcard_encodable_positions[i]] = mask >> i & 1ULL; - } + // Lightweight mask: local index -> bit + std::vector<uint8_t> mask_bits(wildcard_encodable_positions.size()); + for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) { + mask_bits[i] = static_cast<uint8_t>((mask >> i) & 1ULL); + } + // Prefer an overload: generate_logtype_string(interpretation, wildcard_encodable_positions, mask_bits)If changing signature isn’t feasible now, at least reserve:
+ wildcard_mask_map.reserve(wildcard_encodable_positions.size());
536-543: Cap combinatorial explosion (2^k) to a practical limit.k=32 allows 4,294,967,296 combinations; this can hang a search. Lower the cap (e.g., 16) and fail fast.
- constexpr size_t cMaxEncodableWildcardVariables{32}; + constexpr size_t cMaxEncodableWildcardVariables{16}; // practical cap to bound 2^k work @@ - uint64_t const num_combos{1ULL << wildcard_encodable_positions.size()}; + auto const k = wildcard_encodable_positions.size(); + uint64_t const num_combos{1ULL << k};Optionally add telemetry to surface when the cap is hit.
612-614: Use strongly‑typed enum comparisons for symbol IDs.Avoid underlying‑integer compares; cast once to SymbolId and compare enums.
- auto const var_type{variable_token.get_variable_type()}; - - bool const is_int{static_cast<uint32_t>(log_surgeon::SymbolId::TokenInt) == var_type}; - bool const is_float{static_cast<uint32_t>(log_surgeon::SymbolId::TokenFloat) == var_type}; + auto const var_symbol{ + static_cast<log_surgeon::SymbolId>(variable_token.get_variable_type()) + }; + bool const is_int = (var_symbol == log_surgeon::SymbolId::TokenInt); + bool const is_float= (var_symbol == log_surgeon::SymbolId::TokenFloat);components/core/tests/test-GrepCore.cpp (1)
493-500: Fix mask width and bit extraction.Shift on size_t with literal 1 risks UB on 32‑bit; also mask bit should use 1ULL.
- size_t const num_combos{static_cast<size_t>(1) << wildcard_encodable_positions.size()}; + uint64_t const num_combos{1ULL << wildcard_encodable_positions.size()}; REQUIRE(num_combos == 4); unordered_set<string> logtype_strings; - for (size_t mask{0}; mask < num_combos; ++mask) { + for (uint64_t mask{0}; mask < num_combos; ++mask) { unordered_map<size_t, bool> wildcard_mask_map; for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) { - wildcard_mask_map[wildcard_encodable_positions[i]] = mask >> i & 1; + wildcard_mask_map[wildcard_encodable_positions[i]] = (mask >> i) & 1ULL; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/core/src/clp/GrepCore.hpp(6 hunks)components/core/src/clp/string_utils/string_utils.cpp(3 hunks)components/core/tests/test-GrepCore.cpp(1 hunks)components/core/tests/test-string_utils.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/tests/test-string_utils.cppcomponents/core/src/clp/GrepCore.hppcomponents/core/src/clp/string_utils/string_utils.cppcomponents/core/tests/test-GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (3)
components/core/tests/test-string_utils.cpp (2)
components/core/src/clp/string_utils/string_utils.cpp (2)
wildcard_match_unsafe_case_sensitive(238-342)wildcard_match_unsafe_case_sensitive(238-238)components/core/src/clp/string_utils/string_utils.hpp (1)
wildcard_match_unsafe_case_sensitive(142-142)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
interpretation(55-58)interpretation(55-56)interpretation(60-65)interpretation(60-63)logtype(220-230)logtype(221-222)logtype(232-242)logtype(232-236)generate_schema_sub_queries(88-101)generate_schema_sub_queries(88-93)components/core/src/clp/EncodedVariableInterpreter.hpp (13)
logtype(63-65)logtype(63-63)logtype(71-73)logtype(71-71)logtype(79-81)logtype(79-79)logtype(87-89)logtype(87-87)wildcard_search_dictionary_and_get_encoded_matches(222-227)wildcard_search_dictionary_and_get_encoded_matches(497-522)wildcard_search_dictionary_and_get_encoded_matches(497-502)encoded_var(57-57)encoded_var(116-116)components/core/src/clp/GrepCore.cpp (4)
get_wildcard_encodable_positions(149-165)get_wildcard_encodable_positions(149-150)generate_logtype_string(167-232)generate_logtype_string(167-170)
components/core/tests/test-GrepCore.cpp (6)
components/core/src/clp/string_utils/string_utils.cpp (2)
wildcard_match_unsafe_case_sensitive(238-342)wildcard_match_unsafe_case_sensitive(238-238)components/core/src/clp/GrepCore.hpp (12)
interpretation(198-200)interpretation(212-215)process_schema_var_token(236-242)process_schema_var_token(601-663)process_schema_var_token(601-607)generate_schema_sub_queries(179-186)generate_schema_sub_queries(529-598)generate_schema_sub_queries(529-535)value(76-81)process_raw_query(56-65)process_raw_query(248-368)process_raw_query(248-257)components/core/src/clp/GrepCore.cpp (6)
get_wildcard_encodable_positions(149-165)get_wildcard_encodable_positions(149-150)generate_logtype_string(167-232)generate_logtype_string(167-170)get_bounds_of_next_potential_var(29-147)get_bounds_of_next_potential_var(29-34)components/core/src/clp/EncodedVariableInterpreter.hpp (14)
id(56-56)value(97-100)value(107-110)add_dict_var(266-271)add_dict_var(548-561)add_dict_var(548-553)logtype(63-65)logtype(63-63)logtype(71-73)logtype(71-71)logtype(79-81)logtype(79-79)logtype(87-89)logtype(87-87)components/core/src/clp/Query.cpp (2)
add_dict_var(102-105)add_dict_var(103-103)components/core/src/clp/QueryToken.cpp (2)
is_var(137-145)is_var(137-137)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: package-image
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: rust-checks (macos-15)
- GitHub Check: build-macos (macos-15, false)
🔇 Additional comments (4)
components/core/src/clp/string_utils/string_utils.cpp (3)
17-18: LGTM! Clear precondition documentation.The added comment clarifies the assumption that
wild_currenthas no duplicate greedy wildcards, which aligns with the implementation on line 41-42.
263-269: LGTM! Improved consecutive wildcard handling.The while loop correctly consumes all consecutive '*' characters and efficiently handles the trailing wildcard case with an early return.
316-319: LGTM! Correct trailing wildcard handling.The loop correctly consumes all trailing '*' characters in wild when tame ends, and the final comparison properly determines if the match is successful.
components/core/tests/test-string_utils.cpp (1)
148-151: Good additions: double-asterisk coverage is solid.These cases mirror single-asterisk variants and match the matcher’s semantics for consecutive '*'. No changes requested.
Also applies to: 158-161, 168-171, 178-181, 188-191, 198-201
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
components/core/src/clp/GrepCore.hpp (2)
616-618: Compare SymbolId values stronglyThis still relies on
uint32_tcasts; please castvar_typeback toSymbolIdand compare strongly typed, matching the.cppimplementation. Doing so removes the magic numbers and avoids future breakage if the enum underlying type changes.- auto const var_type{variable_token.get_variable_type()}; - - bool const is_int{static_cast<uint32_t>(log_surgeon::SymbolId::TokenInt) == var_type}; - bool const is_float{static_cast<uint32_t>(log_surgeon::SymbolId::TokenFloat) == var_type}; + auto const var_type{ + static_cast<log_surgeon::SymbolId>(variable_token.get_variable_type()) + }; + + bool const is_int{log_surgeon::SymbolId::TokenInt == var_type}; + bool const is_float{log_surgeon::SymbolId::TokenFloat == var_type};
195-201: Fix typos in the new DoxygenPlease correct “than” → “that” and “encodabe” → “encodable” so the comment reads cleanly.
- * An encodable variable is a variable token than: + * An encodable variable is a variable token that: @@ - * @return A vector of positions of encodabe wildcard variables. + * @return A vector of positions of encodable wildcard variables.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/CMakeLists.txt(1 hunks)components/core/src/clp/GrepCore.hpp(6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/GrepCore.hpp
🧠 Learnings (1)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (1)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
interpretation(55-58)interpretation(55-56)interpretation(60-65)interpretation(60-63)logtype(220-230)logtype(221-222)logtype(232-242)logtype(232-236)generate_schema_sub_queries(88-101)generate_schema_sub_queries(88-93)components/core/src/clp/EncodedVariableInterpreter.hpp (13)
logtype(63-65)logtype(63-63)logtype(71-73)logtype(71-71)logtype(79-81)logtype(79-79)logtype(87-89)logtype(87-87)wildcard_search_dictionary_and_get_encoded_matches(222-227)wildcard_search_dictionary_and_get_encoded_matches(497-522)wildcard_search_dictionary_and_get_encoded_matches(497-502)encoded_var(57-57)encoded_var(116-116)components/core/src/clp/GrepCore.cpp (4)
get_wildcard_encodable_positions(149-165)get_wildcard_encodable_positions(149-150)generate_logtype_string(167-232)generate_logtype_string(167-170)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: rust-checks (macos-15)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
components/core/CMakeLists.txt (1)
774-776: Scoped test macro wiring looks goodDefining
CLP_BUILD_TESTINGonly for theunitTesttarget is exactly what the header guard expects, so production builds stay untouched. Nicely done.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@components/core/src/clp/Utils.cpp`:
- Around line 156-161: The call to lexer.add_rule uses
std::move(make_unique<RegexASTLiteral<ByteNfaState>>(RegexASTLiteral<ByteNfaState>('\n')))
which creates an unnecessary temporary and moves a prvalue; replace it by
calling make_unique<RegexASTLiteral<ByteNfaState>>('\n') directly (no explicit
move and no intermediate RegexASTLiteral temporary) so the constructor argument
is forwarded into the unique_ptr creation; update the call sites involving
lexer.add_rule / lexer.m_symbol_id["newLine"] accordingly.
d837950 to
47705f7
Compare
Reference
This PR does not provide support for searching schema's with capture-groups as this requires TNFA intersection support in Log Surgeon, which is currently unavailable. See, y-scope/log-surgeon#211.
Description
This PR updates schema-based search in CLP to leverage the new
log_surgeon::Queryclass to improve accuracy on schema compressed archives. This class improves accuracy by performing DFA intersections between a user's query and the variables in the schema. When generating subqueries using this class the following must be done:QueryInterpretationmatches the logtype and variable dictionaries.QueryInterpretationinto a set ofSubQuerys if it matches the dictionaries.To compare against the dictionaries, CLP must consider the various combinations of encodable variable types a single
QueryInterpretationcan have. For example:Key changes:
process_raw_query: updated to use thelog_surgeon::Queryclass for DFA-based matching.generate_schema_sub_queries: iterate over all encodable combinations of interpretations and add a sub query if it matches the dictionaries.generate_logtype_string: generates a logtype string for a given interpretation and encodable variable maskget_wildcard_encodable_positions: converts mask for encodable variables to array positions.process_schema_var_token: adds variable to the sub query.get_bounds_of_next_potential_varoverload (schema case) andSearchToken.Validation performed
Summary by CodeRabbit
New Features
Improvements
Testing
Chores