Skip to content

feat: Generate all possible wildcard subqueries for non-capture schema-based search, preventing unnecessary archive decompression.#1313

Merged
davidlion merged 148 commits into
y-scope:mainfrom
SharafMohamed:dfa-search-using-ls-port
Feb 20, 2026
Merged

feat: Generate all possible wildcard subqueries for non-capture schema-based search, preventing unnecessary archive decompression.#1313
davidlion merged 148 commits into
y-scope:mainfrom
SharafMohamed:dfa-search-using-ls-port

Conversation

@SharafMohamed
Copy link
Copy Markdown
Contributor

@SharafMohamed SharafMohamed commented Sep 19, 2025

Reference

This PR does not provide support for searching schema's with capture-groups as this requires TNFA intersection support in Log Surgeon, which is currently unavailable. See, y-scope/log-surgeon#211.

Description

This PR updates schema-based search in CLP to leverage the new log_surgeon::Query class to improve accuracy on schema compressed archives. This class improves accuracy by performing DFA intersections between a user's query and the variables in the schema. When generating subqueries using this class the following must be done:

  • Check if each QueryInterpretation matches the logtype and variable dictionaries.
  • Transform the QueryInterpretation into a set of SubQuerys if it matches the dictionaries.

To compare against the dictionaries, CLP must consider the various combinations of encodable variable types a single QueryInterpretation can have. For example:

  • Search query: "a *1 *2 b",
  • QueryInterpretation (one of many): "a (*1) (*2) b"
  • Possible logtypes (for the above interpretation):
    • mask 00 -> "a \d \d b"
    • mask 01 ->"a \d \f b"
    • mask 10 ->"a \i \d b"
    • mask 11 ->"a \i \f b"

Key changes:

  • process_raw_query: updated to use the log_surgeon::Query class for DFA-based matching.
  • Added methods to support sub query generation:
    • generate_schema_sub_queries: iterate over all encodable combinations of interpretations and add a sub query if it matches the dictionaries.
    • generate_logtype_string: generates a logtype string for a given interpretation and encodable variable mask
    • get_wildcard_encodable_positions: converts mask for encodable variables to array positions.
    • process_schema_var_token: adds variable to the sub query.
  • Removes unused code: get_bounds_of_next_potential_var overload (schema case) and SearchToken.

Validation performed

  • Unit-tests covering each new method.

Summary by CodeRabbit

  • New Features

    • Added a schema-driven search API that returns richer, deduplicated sub-queries.
  • Improvements

    • Unified and simplified variable/token detection; removed alternate lexer-based matching path.
    • Centralized sub-query generation with explicit matchability handling and early-exit behavior.
    • Defaulted equality for query types and initialized wildcard flags.
    • Schema parsing now tolerates a known header/timestamp capture pattern.
  • Testing

    • Expanded unit tests, mocks and utilities covering schema search and encoding paths.
  • Chores

    • Updated test build configuration to include new test sources.

@SharafMohamed SharafMohamed requested a review from a team as a code owner September 19, 2025 16:09
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 19, 2025

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a new clp::SchemaSearcher for schema-driven query expansion, removes GrepCore's lexer-based token-analysis overload and private SearchToken, consolidates token scanning into a single string-based implementation, updates process_raw_query to optionally call SchemaSearcher, and adds tests, mocks, utilities, and CMake updates.

Changes

Cohort / File(s) Summary
GrepCore refactor
components/core/src/clp/GrepCore.cpp, components/core/src/clp/GrepCore.hpp
Removed the lexer-based overload of get_bounds_of_next_potential_var and private SearchToken; consolidated to a single string-scanning implementation and heuristic classification. process_raw_query now invokes SchemaSearcher::search(...) for the non-heuristic path and retains a reworked heuristic path.
SchemaSearcher implementation
components/core/src/clp/SchemaSearcher.hpp, components/core/src/clp/SchemaSearcher.cpp
New public clp::SchemaSearcher with templated search() API. Implements normalization of interpretations, wildcard-encodable position detection, logtype-string generation, combinatorial encoding masks, schema-driven sub-query generation, variable processing, deduplication, and an OperationFailed exception.
Query types
components/core/src/clp/Query.hpp
Added defaulted operator== for QueryVar and SubQuery; default-initialized SubQuery::m_wildcard_match_required{false}.
Tests — GrepCore updates
components/core/tests/test-GrepCore.cpp
Replaced file-backed lexer usage with in-test lexer construction (make_test_lexer); removed lexer-parameter overload usage; updated calls to the simplified get_bounds_of_next_potential_var; added a process_raw_query test integrating lexer and mock dictionaries.
Tests — SchemaSearcher suite & test wrappers
components/core/tests/test-SchemaSearcher.cpp, components/core/tests/SchemaSearcherTest.hpp
Added comprehensive tests for SchemaSearcher behaviors and a test-only wrapper (SchemaSearcherTest) exposing internal helpers (normalize, generate_logtype_string, get_wildcard_encodable_positions, schema sub-query generation, token processing).
Test utilities & mocks
components/core/tests/search_test_utils.hpp, components/core/tests/search_test_utils.cpp, components/core/tests/MockLogTypeDictionary.hpp, components/core/tests/MockVariableDictionary.hpp
Added helpers to construct mock dictionaries, generate encoded logtype strings, and validate SubQuery structures; introduced in-memory mock logtype and variable dictionaries for unit tests.
Build system updates
components/core/CMakeLists.txt, components/core/src/clp/clg/CMakeLists.txt, components/core/src/clp/clo/CMakeLists.txt, components/core/src/clp_s/CMakeLists.txt
Added SchemaSearcher sources to multiple targets, exposed project root include directories where needed, and extended unit-test sources and compile definitions for new tests and mocks.
Test helpers/header additions
components/core/tests/SchemaSearcherTest.hpp, other new headers
Added test helper headers and types (e.g., VarInfo) to surface and validate internal SchemaSearcher logic in unit tests.
Utilities & runtime tweaks
components/core/src/clp/Utils.cpp, components/core/src/clp/clp/run.cpp
Simplified includes and namespace scoping in Utils.cpp; added token header mapping and tightened header/timestamp capture handling. run.cpp now allows a specific header capture pattern (single timestamp) instead of rejecting all captures.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant GrepCore as GrepCore
    participant SchemaSearcher as SchemaSearcher
    participant Lexer as Lexer
    participant LogTypeDict as LogTypeDict
    participant VarDict as VarDict

    Client->>GrepCore: process_raw_query(search_string, use_heuristic?)
    alt use_heuristic = false
        GrepCore->>SchemaSearcher: search(search_string, lexer, LogTypeDict, VarDict, ignore_case)
        SchemaSearcher->>Lexer: tokenize(search_string)
        Lexer-->>SchemaSearcher: tokens
        SchemaSearcher->>SchemaSearcher: normalize_interpretations()
        loop per interpretation
            SchemaSearcher->>SchemaSearcher: get_wildcard_encodable_positions()
            SchemaSearcher->>SchemaSearcher: enumerate mask combinations
            loop per mask
                SchemaSearcher->>SchemaSearcher: generate_logtype_string(mask)
                SchemaSearcher->>LogTypeDict: get matching logtypes(logtype_string)
                LogTypeDict-->>SchemaSearcher: matching entries
                SchemaSearcher->>VarDict: resolve variable tokens
                VarDict-->>SchemaSearcher: variable matches
            end
            SchemaSearcher->>SchemaSearcher: assemble SubQuery
        end
        SchemaSearcher-->>GrepCore: vector<SubQuery>
    else use_heuristic = true
        GrepCore->>GrepCore: heuristic token scan (get_bounds_of_next_potential_var)
        GrepCore->>GrepCore: build sub_queries heuristically
    end
    GrepCore-->>Client: sub_queries
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.77% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main objective of the PR: implementing schema-based search with wildcard subquery generation to optimize archive decompression.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b83e3cf and add11a2.

📒 Files selected for processing (6)
  • components/core/src/clp/GrepCore.cpp (2 hunks)
  • components/core/src/clp/GrepCore.hpp (5 hunks)
  • components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp (1 hunks)
  • components/core/src/clp/ffi/ir_stream/search/test/utils.cpp (1 hunks)
  • components/core/tests/test-GrepCore.cpp (1 hunks)
  • taskfiles/deps/main.yaml (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp
  • components/core/src/clp/ffi/ir_stream/search/test/utils.cpp
  • components/core/src/clp/GrepCore.cpp
  • components/core/tests/test-GrepCore.cpp
  • components/core/src/clp/GrepCore.hpp
🧠 Learnings (3)
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
PR: y-scope/clp#557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
PR: y-scope/clp#570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp/ffi/ir_stream/search/test/utils.cpp
🧬 Code graph analysis (3)
components/core/src/clp/GrepCore.cpp (2)
components/core/src/clp/GrepCore.hpp (2)
  • interpretation (193-195)
  • interpretation (207-210)
components/core/src/clp/EncodedVariableInterpreter.hpp (5)
  • add_dict_var (258-263)
  • add_dict_var (533-546)
  • add_dict_var (533-538)
  • encoded_var (52-52)
  • encoded_var (111-111)
components/core/tests/test-GrepCore.cpp (1)
components/core/src/clp/GrepCore.cpp (2)
  • get_bounds_of_next_potential_var (28-146)
  • get_bounds_of_next_potential_var (28-33)
components/core/src/clp/GrepCore.hpp (4)
components/core/src/clp/QueryToken.cpp (2)
  • is_var (137-145)
  • is_var (137-137)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (58-60)
  • logtype (58-58)
  • logtype (66-68)
  • logtype (66-66)
  • logtype (74-76)
  • logtype (74-74)
  • logtype (82-84)
  • logtype (82-82)
  • wildcard_search_dictionary_and_get_encoded_matches (218-223)
  • wildcard_search_dictionary_and_get_encoded_matches (486-511)
  • wildcard_search_dictionary_and_get_encoded_matches (486-491)
  • encoded_var (52-52)
  • encoded_var (111-111)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (148-164)
  • get_wildcard_encodable_positions (148-149)
  • generate_logtype_string (166-233)
  • generate_logtype_string (166-169)
components/core/src/glt/Query.hpp (1)
  • logtype_entries (103-105)
🔇 Additional comments (13)
components/core/src/clp/ffi/ir_stream/search/test/utils.cpp (1)

15-15: LGTM! Required header for fmt::join usage.

The addition of <fmt/ranges.h> is necessary to enable the fmt::join function used on line 64. This aligns with the fmt 11.x upgrade.

components/core/src/clp/ffi/ir_stream/search/test/test_QueryHandlerImpl.cpp (1)

15-15: LGTM! Consistent header addition for fmt::join.

The addition of <fmt/ranges.h> enables fmt::join functionality on line 237, consistent with the fmt 11.x upgrade.

components/core/tests/test-GrepCore.cpp (2)

12-14: Good use of brace initialization for consistency.

The change to brace initialization enhances code consistency and prevents narrowing conversions.


79-99: Wildcard handling changes are intentional.

GrepCore documents a fallback to "decompression + wildcard matching" for tokens with greedy wildcards (components/core/src/clp/GrepCore.hpp ~lines 472–476), and the updated tests in components/core/tests/test-GrepCore.cpp (wildcard case starting ~line 75) reflect the DFA-based behaviour: escaped '' now delimits "1" while unescaped '' is preserved in "abc*123".

components/core/src/clp/GrepCore.hpp (3)

130-182: Well-documented schema sub-query generation logic.

The documentation clearly explains the complex logic for handling encodable wildcard variables and the binary choices for interpretation. The example effectively illustrates the 2^k possible combinations.


336-346: Verify thread-safety of static variables in GrepCore.hpp

interpretations_generated and interpretations (components/core/src/clp/GrepCore.hpp: lines 336–346) are static and can race under concurrent access. The verification script produced no output, so concurrency of this code path or existing synchronization could not be confirmed. Confirm whether this runs on multiple threads; if so, replace static state with per-call/per-archive state, use thread_local, or guard access with a mutex/atomic.


613-614: Verify SymbolId underlying type and make casts consistent.

SymbolId is defined in external log_surgeon — confirm its underlying integer type and align casts: GrepCore.hpp:613–614 and GrepCore.cpp:156–157,196–197 use static_cast<uint32_t>(TokenInt/TokenFloat) while Utils.cpp (≈140–156) uses static_cast(log_surgeon::SymbolId::TokenInt/TokenFloat). Match the cast to the enum's underlying type (or change var_type) to avoid subtle mismatches.

taskfiles/deps/main.yaml (3)

443-447: Approved — spdlog v1.15.3 is compatible with fmt v11.2.0.
Release notes show spdlog 1.15.3 bumped the bundled {fmt} to 11.2.0, so no changes required.


253-254: Verify fmt 11.2.0 compatibility across the codebase.

  • Static scan shows multiple fmt call sites; representative hits: components/core/src/clp_s/Utils.cpp:81, components/core/src/clp_s/kv_ir_search.cpp:175, components/core/src/clp_s/indexer/MySQLIndexStorage.cpp:31,58, components/core/tests/test-clp_s-search.cpp:117,215, components/core/src/clp_s/JsonConstructor.cpp:27.
  • ast-grep returned no exact matches for custom formatters (no template<> struct fmt::formatter<...> found).
  • Action: build the project with fmt 11.2.0 and run all unit/integration tests; if you get compiler or test failures, update affected call sites or add/update any custom formatters (also search for partial specializations or ADL-based formatters) and re-run tests.

320-321: Confirm log_surgeon update (a722d07) includes QueryInterpretation.
a722d07 tarball contains src/log_surgeon/wildcard_query_parser/QueryInterpretation.hpp and QueryInterpretation.cpp (also Query.hpp/Query.cpp and tests/test-query-interpretation.cpp), so the log_surgeon::wildcard_query_parser::QueryInterpretation referenced by components/core/src/clp/GrepCore.* is present.

components/core/src/clp/GrepCore.cpp (3)

18-26: LGTM: token/type aliases are clear and scoped

The using-declarations improve readability without polluting too broadly. No issues spotted.


213-230: LGTM: representable-int/float fast-path with dictionary fallback

The detection path is clear and orders checks correctly; encoded_var capture is fine as an out-param.


199-211: C++20 is already enforced for the target that builds this TU — no action required.

The clp target sets target_compile_features(... PRIVATE cxx_std_20) (components/core/src/clp/clp/CMakeLists.txt:172) and other CLP targets likewise use cxx_std_20, so std::unordered_map::contains is supported.

Comment thread components/core/src/clp/GrepCore.cpp Outdated
Comment thread components/core/src/clp/GrepCore.cpp Outdated
Comment thread components/core/src/clp/GrepCore.cpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between add11a2 and c82dbab.

📒 Files selected for processing (3)
  • components/core/src/clp/GrepCore.hpp (5 hunks)
  • components/core/tests/test-GrepCore.cpp (1 hunks)
  • taskfiles/deps/main.yaml (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-GrepCore.cpp
  • components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (2)
components/core/tests/test-GrepCore.cpp (1)
components/core/src/clp/GrepCore.cpp (2)
  • get_bounds_of_next_potential_var (28-146)
  • get_bounds_of_next_potential_var (28-33)
components/core/src/clp/GrepCore.hpp (3)
components/core/src/clp/DictionaryReader.hpp (1)
  • search_string (94-94)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (222-227)
  • wildcard_search_dictionary_and_get_encoded_matches (497-522)
  • wildcard_search_dictionary_and_get_encoded_matches (497-502)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (148-164)
  • get_wildcard_encodable_positions (148-149)
  • generate_logtype_string (166-233)
  • generate_logtype_string (166-169)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (7)
components/core/tests/test-GrepCore.cpp (2)

13-16: LGTM: test initialisation simplified.

Brace‑initialising begin/end/is_var improves clarity; signature update aligns with the header.


21-101: LGTM: expectations preserved after signature change.

The call sites correctly drop the lexer argument and still validate tokenisation semantics.

If CI allows, add one integration test covering the new interpretation‑driven path in process_raw_query to ensure schema‑based subquery generation remains sound.

components/core/src/clp/GrepCore.hpp (2)

184-195: Fix typos in documentation.

“indicies” → “indices”; also remove the duplicated spacing.

-     * Scans the interpretation and returns the indicies of all encodable     * wildcard variables.
+     * Scans the interpretation and returns the indices of all encodable wildcard variables.
@@
-     * @return A vector of positions of encodabe wildcard variables.
+     * @return A vector of positions of encodable wildcard variables.

540-542: Replace magic number 32 with a named constant.

Carry forward the earlier feedback and make the limit self-documenting.

-        if (wildcard_encodable_positions.size() > 32) {
+        if (wildcard_encodable_positions.size() > cMaxEncodableWildcardVariables) {
             throw std::runtime_error("Too many encodable variables.");
         }

Add near the class’ private section (example placement shown):

// Place inside class GrepCore, private: section
static constexpr size_t cMaxEncodableWildcardVariables{32};
taskfiles/deps/main.yaml (3)

443-448: spdlog–fmt ABI compatibility — verify in CI

v1.15.3 bundles fmt 11.2.0 and is compatible with fmt 11.x; run a minimal CI build compiling with -DSPDLOG_FMT_EXTERNAL=ON against fmt 11.2.0 (or the latest 11.x) to confirm no platform/toolchain-specific issues.


253-255: SHA256 verified — run CI to confirm build compatibility

  • taskfiles/deps/main.yaml (lines 253–255): TARBALL_SHA256 matches the downloaded fmt 11.2.0 tarball: bc23066d87ab3168f27cef3e97d545fa63314f5c79df5ea444d41d56f962c6af.
  • Repository shows widespread fmt usage (find_package(fmt REQUIRED), fmt::fmt target, many fmt::format/format_to/formatter/includes).
  • Cannot compile here; run CI or a local full build to ensure all consumers compile cleanly with fmt 11.2.0.

320-322: APIs present in commit a722d07 — OK to pin.

Query::get_all_multi_token_interpretations is in src/log_surgeon/wildcard_query_parser/Query.hpp; QueryInterpretation, VariableQueryToken and StaticQueryToken headers are present under src/log_surgeon/wildcard_query_parser/; SymbolId enum is in src/log_surgeon/Constants.hpp. No missing headers or symbols found.

Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread taskfiles/deps/main.yaml Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c82dbab and 9d3e152.

📒 Files selected for processing (4)
  • components/core/src/clp/GrepCore.cpp (2 hunks)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
  • components/core/src/clp/Query.hpp (1 hunks)
  • components/core/tests/test-GrepCore.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/Query.hpp
  • components/core/src/clp/GrepCore.hpp
  • components/core/tests/test-GrepCore.cpp
  • components/core/src/clp/GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.

Applied to files:

  • components/core/src/clp/GrepCore.cpp
🧬 Code graph analysis (3)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
  • interpretation (48-51)
  • interpretation (48-49)
  • interpretation (53-58)
  • interpretation (53-56)
  • logtype (207-216)
  • logtype (207-208)
  • logtype (218-228)
  • logtype (218-222)
  • generate_schema_sub_queries (81-94)
  • generate_schema_sub_queries (81-86)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (222-227)
  • wildcard_search_dictionary_and_get_encoded_matches (497-522)
  • wildcard_search_dictionary_and_get_encoded_matches (497-502)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (148-164)
  • get_wildcard_encodable_positions (148-149)
  • generate_logtype_string (166-233)
  • generate_logtype_string (166-169)
components/core/tests/test-GrepCore.cpp (4)
components/core/src/clp/GrepCore.hpp (9)
  • interpretation (197-199)
  • interpretation (211-214)
  • process_schema_var_token (235-241)
  • process_schema_var_token (606-668)
  • process_schema_var_token (606-612)
  • generate_schema_sub_queries (178-185)
  • generate_schema_sub_queries (535-603)
  • generate_schema_sub_queries (535-541)
  • value (75-80)
components/core/src/clp/GrepCore.cpp (6)
  • get_wildcard_encodable_positions (148-164)
  • get_wildcard_encodable_positions (148-149)
  • generate_logtype_string (166-233)
  • generate_logtype_string (166-169)
  • get_bounds_of_next_potential_var (28-146)
  • get_bounds_of_next_potential_var (28-33)
components/core/src/clp/EncodedVariableInterpreter.hpp (14)
  • id (56-56)
  • value (97-100)
  • value (107-110)
  • add_dict_var (266-271)
  • add_dict_var (548-561)
  • add_dict_var (548-553)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
components/core/src/clp/Query.cpp (2)
  • add_dict_var (102-105)
  • add_dict_var (103-103)
components/core/src/clp/GrepCore.cpp (2)
components/core/src/clp/GrepCore.hpp (2)
  • interpretation (197-199)
  • interpretation (211-214)
components/core/tests/test-GrepCore.cpp (7)
  • interpretation (48-51)
  • interpretation (48-49)
  • interpretation (53-58)
  • interpretation (53-56)
  • add_int_var (175-175)
  • add_float_var (177-177)
  • add_dict_var (179-179)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (15)
components/core/src/clp/Query.hpp (1)

158-159: Deterministic default for wildcard flag — good.

Default-initializing m_wildcard_match_required to false prevents UB from uninitialized reads. Please confirm SubQuery::clear() also resets this flag to false to maintain invariants across reuse.

components/core/tests/test-GrepCore.cpp (1)

590-718: Solid coverage for schema subquery generation.

Good end-to-end assertions on wildcard-encodable combinations, variable shapes, and logtype IDs.

components/core/src/clp/GrepCore.hpp (8)

9-14: Add missing include for std::set.

This header declares std::set in public APIs; include to avoid relying on transitive includes.

Apply:

 #include <string_view>
+#include <set>
 #include <unordered_map>
 #include <unordered_set>
 #include <vector>

188-196: Typos in documentation.

Fix “indicies” → “indices” and “encodabe” → “encodable”.

-     * Scans the interpretation and returns the indicies of all encodable wildcard variables.
+     * Scans the interpretation and returns the indices of all encodable wildcard variables.
...
-     * @return A vector of positions of encodabe wildcard variables.
+     * @return A vector of positions of encodable wildcard variables.

207-213: Typo in documentation.

“interpetation” → “interpretation”.

-     * @param interpretation The interpetation to convert to a logtype string.
+     * @param interpretation The interpretation to convert to a logtype string.

288-294: Follow project style: use false == expr instead of !expr.

Aligns with coding guidelines.

As per coding guidelines

-            if (!query_token.has_greedy_wildcard_in_middle() && query_token.is_ambiguous_token()) {
+            if (false == query_token.has_greedy_wildcard_in_middle()
+                && query_token.is_ambiguous_token())
+            {

342-352: Remove thread-unsafe statics; compute interpretations per call (or add a proper cache).

Function-local statics here are shared across calls and archives, race-prone, and nullified by cExecuteForEveryArchive. Construct Query and get interpretations per call.

-        static bool interpretations_generated{false};
-        static std::set<log_surgeon::wildcard_query_parser::QueryInterpretation> interpretations;
-
-        // TODO: This needs to be done for every archive until we have per schema logic.
-        constexpr bool cExecuteForEveryArchive{true};
-        if (cExecuteForEveryArchive || false == interpretations_generated) {
-            log_surgeon::wildcard_query_parser::Query const query(search_string);
-            interpretations.clear();
-            interpretations = query.get_all_multi_token_interpretations(lexer);
-            interpretations_generated = true;
-        }
+        log_surgeon::wildcard_query_parser::Query const query(search_string);
+        auto interpretations = query.get_all_multi_token_interpretations(lexer);

If caching is required, introduce a schema-keyed cache with synchronisation.


353-356: Fix comment typo “Transfrom”.

-        // Transfrom log-surgeon interpretations into CLP sub-queries.
+        // Transform log-surgeon interpretations into CLP sub-queries.

543-551: Avoid undefined/implementation-defined shifts; extract magic number.

  • size_t shifting by variable bit width is UB on 32-bit; use uint64_t and guard.
  • Replace magic 32 with a named constant.
-        if (wildcard_encodable_positions.size() > 32) {
-            throw std::runtime_error("Too many encodable variables.");
-        }
-        size_t const num_combos{static_cast<size_t>(1) << wildcard_encodable_positions.size()};
-        for (size_t mask{0}; mask < num_combos; ++mask) {
+        constexpr size_t cMaxEncodableWildcardVariables{32};
+        auto const k = wildcard_encodable_positions.size();
+        if (k > cMaxEncodableWildcardVariables) {
+            throw std::runtime_error("Too many encodable variables.");
+        }
+        if (k >= 64) {
+            throw std::runtime_error("Wildcard encodable variable count exceeds mask width.");
+        }
+        uint64_t const num_combos{1ULL << k};
+        for (uint64_t mask{0}; mask < num_combos; ++mask) {

549-552: Reduce per-iteration allocations.

Building an unordered_map per mask is expensive. Use a flat container (e.g., vector<uint8_t>) keyed by local index, or reuse a pre-sized map.

Example approach:

  • Precompute vector<size_t> positions = wildcard_encodable_positions.
  • For each mask, fill vector<uint8_t> encoded(k) with (mask >> i) & 1 and consult positions[i] inside generate_logtype_string via a small adapter.
components/core/src/clp/GrepCore.cpp (5)

6-7: Include for std::holds_alternative/std::get.

Avoid relying on transitive includes.

 #include <string_view>
+#include <variant>
 #include <unordered_map>
 #include <vector>

155-158: Prefer strongly typed enum comparisons.

Compare as SymbolId, not via uint32_t.

-    auto const var_type{variable_token.get_variable_type()};
-    bool const is_int{static_cast<uint32_t>(TokenInt) == var_type};
-    bool const is_float{static_cast<uint32_t>(TokenFloat) == var_type};
+    auto const var_symbol = static_cast<decltype(TokenInt)>(variable_token.get_variable_type());
+    bool const is_int   = (var_symbol == TokenInt);
+    bool const is_float = (var_symbol == TokenFloat);

172-181: Reserve logic assumes one byte per placeholder.

If EncodedVariableInterpreter ever appends more than one byte, this under-reserves. Consider exposing/using a constexpr placeholder size or asserting the invariant.


184-185: Fix comment typo.

“Generete” → “Generate”.

-    // Generete `logtype_string`.
+    // Generate `logtype_string`.

196-198: Prefer strongly typed enum comparisons.

Repeat: compare as SymbolId, not via uint32_t.

-    bool const is_int{static_cast<uint32_t>(TokenInt) == var_type};
-    bool const is_float{static_cast<uint32_t>(TokenFloat) == var_type};
+    bool const is_int   = (var_type == TokenInt);
+    bool const is_float = (var_type == TokenFloat);

Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread components/core/tests/test-GrepCore.cpp Outdated
Comment thread components/core/tests/test-GrepCore.cpp Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d3e152 and b587085.

📒 Files selected for processing (1)
  • components/core/src/clp/GrepCore.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.

Applied to files:

  • components/core/src/clp/GrepCore.cpp
🧬 Code graph analysis (1)
components/core/src/clp/GrepCore.cpp (3)
components/core/src/clp/GrepCore.hpp (2)
  • interpretation (197-199)
  • interpretation (211-214)
components/core/tests/test-GrepCore.cpp (7)
  • interpretation (48-51)
  • interpretation (48-49)
  • interpretation (53-58)
  • interpretation (53-56)
  • add_int_var (175-175)
  • add_float_var (177-177)
  • add_dict_var (179-179)
components/core/src/clp/EncodedVariableInterpreter.hpp (5)
  • add_dict_var (266-271)
  • add_dict_var (548-561)
  • add_dict_var (548-553)
  • encoded_var (57-57)
  • encoded_var (116-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: package-image
🔇 Additional comments (5)
components/core/src/clp/GrepCore.cpp (5)

19-23: LGTM: explicit SymbolId enumerators.

Bringing TokenInt/TokenFloat into scope is clear and consistent with usage below.


25-26: LGTM: local aliases for STL types.

Keeps signatures succinct without hiding ownership semantics.


173-183: Preallocation assumes add_*_var appends one character.

If add_int_var/add_float_var/add_dict_var ever append more than one byte, this under-reserves. Either expose constexpr sizes or assert the invariant.

Can you confirm each add_*_var appends exactly one byte today?


185-185: Fix typo in comment.

“Generete” -> “Generate”.

-    // Generete `logtype_string`.
+    // Generate `logtype_string`.

167-232: Cache interpretation.get_logtype(), iterate by const& and replace contains+at with a single find()
Removes double map lookup and drops the C++20 unordered_map::contains dependency; verify your CMakeLists.txt C++ standard to ensure C++17 compatibility.

Comment thread components/core/src/clp/GrepCore.cpp
Comment thread components/core/src/clp/GrepCore.cpp Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b587085 and d7cfb0d.

📒 Files selected for processing (2)
  • components/core/src/clp/GrepCore.cpp (2 hunks)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/GrepCore.hpp
  • components/core/src/clp/GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.

Applied to files:

  • components/core/src/clp/GrepCore.hpp
  • components/core/src/clp/GrepCore.cpp
🧬 Code graph analysis (2)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
  • interpretation (48-51)
  • interpretation (48-49)
  • interpretation (53-58)
  • interpretation (53-56)
  • logtype (207-216)
  • logtype (207-208)
  • logtype (218-228)
  • logtype (218-222)
  • generate_schema_sub_queries (81-94)
  • generate_schema_sub_queries (81-86)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (222-227)
  • wildcard_search_dictionary_and_get_encoded_matches (497-522)
  • wildcard_search_dictionary_and_get_encoded_matches (497-502)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (149-165)
  • get_wildcard_encodable_positions (149-150)
  • generate_logtype_string (167-232)
  • generate_logtype_string (167-170)
components/core/src/clp/GrepCore.cpp (3)
components/core/src/clp/GrepCore.hpp (2)
  • interpretation (198-200)
  • interpretation (212-215)
components/core/tests/test-GrepCore.cpp (5)
  • interpretation (48-51)
  • interpretation (48-49)
  • interpretation (53-58)
  • interpretation (53-56)
  • add_dict_var (179-179)
components/core/src/clp/EncodedVariableInterpreter.hpp (5)
  • add_dict_var (266-271)
  • add_dict_var (548-561)
  • add_dict_var (548-553)
  • encoded_var (57-57)
  • encoded_var (116-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (13)
components/core/src/clp/GrepCore.hpp (10)

7-7: Good fix: add missing include.

Prevents reliance on transitive includes.


212-215: LGTM: clear API for logtype-string generation.


235-243: LGTM: helper API surface reads well.


293-297: LGTM: project style honoured (false == …).


624-627: Confirm intent: encoded-wildcard path adds no var constraint.

Early-return marks wildcard_match_required without adding an encoded var. If that’s intentional (accept any encoded value and defer to message-level matching), fine; otherwise, add a non-dict var placeholder for clarity/consistency.


28-32: Gate test-only friend behind a build flag.

Expose the friend only in test builds to avoid leaking test details in public headers.

- class GrepCoreTest;
+ #ifdef CLP_ENABLE_TESTS
+ class GrepCoreTest;
+ #endif
...
 class GrepCore {
-    friend class GrepCoreTest;
+  #ifdef CLP_ENABLE_TESTS
+    friend class GrepCoreTest;
+  #endif

345-355: Remove unsafe function-local statics; compute interpretations per call.

Statics here are thread-unsafe and nullified by cExecuteForEveryArchive{true}. Build the interpretations locally.

-        static bool interpretations_generated{false};
-        static std::set<log_surgeon::wildcard_query_parser::QueryInterpretation> interpretations;
-
-        // TODO: This needs to be done for every archive until we have per schema logic.
-        constexpr bool cExecuteForEveryArchive{true};
-        if (cExecuteForEveryArchive || false == interpretations_generated) {
-            log_surgeon::wildcard_query_parser::Query const query(search_string);
-            interpretations.clear();
-            interpretations = query.get_all_multi_token_interpretations(lexer);
-            interpretations_generated = true;
-        }
+        log_surgeon::wildcard_query_parser::Query const query(search_string);
+        auto interpretations = query.get_all_multi_token_interpretations(lexer);

356-356: Fix typo: “Transfrom” → “Transform”.

-        // Transfrom log-surgeon interpretations into CLP sub-queries.
+        // Transform log-surgeon interpretations into CLP sub-queries.

545-556: Fix undefined/implementation-defined shifts; use 64-bit masks and guard k.

Shifting size_t by its bit width is UB on 32-bit. Also 2^32 combos is infeasible.

-        size_t const num_combos{static_cast<size_t>(1) << wildcard_encodable_positions.size()};
-        for (size_t mask{0}; mask < num_combos; ++mask) {
+        auto const k = wildcard_encodable_positions.size();
+        if (k >= 64) {
+            throw std::runtime_error("Too many encodable variables for mask enumeration (k >= 64).");
+        }
+        uint64_t const num_combos{1ULL << k};
+        for (uint64_t mask{0}; mask < num_combos; ++mask) {
             std::unordered_map<size_t, bool> wildcard_mask_map;
-            for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) {
-                wildcard_mask_map[wildcard_encodable_positions[i]] = mask >> i & 1;
+            for (size_t i{0}; i < k; ++i) {
+                wildcard_mask_map[wildcard_encodable_positions[i]] = (mask >> i) & 1ULL;
             }

Additionally, consider lowering the practical cap from 32 to something manageable (e.g., 20) to avoid 2^k explosion.


609-623: Use strongly typed enum comparisons for SymbolId.

Avoid uint32_t-based comparisons; compare as SymbolId.

-    auto const var_type{variable_token.get_variable_type()};
-    bool const is_int{static_cast<uint32_t>(log_surgeon::SymbolId::TokenInt) == var_type};
-    bool const is_float{static_cast<uint32_t>(log_surgeon::SymbolId::TokenFloat) == var_type};
+    auto const var_symbol = static_cast<log_surgeon::SymbolId>(variable_token.get_variable_type());
+    bool const is_int   = (var_symbol == log_surgeon::SymbolId::TokenInt);
+    bool const is_float = (var_symbol == log_surgeon::SymbolId::TokenFloat);
components/core/src/clp/GrepCore.cpp (3)

6-8: LGTM: explicit STL includes (unordered_map, variant, vector).


19-23: LGTM: using-declarations improve readability and locality.


149-165: Avoid variant copies and repeated get_logtype() calls.

Cache the logtype and iterate by const reference; reserve capacity.

-auto GrepCore::get_wildcard_encodable_positions(QueryInterpretation const& interpretation)
-        -> vector<size_t> {
-    vector<size_t> wildcard_encodable_positions;
-    for (size_t i{0}; i < interpretation.get_logtype().size(); ++i) {
-        auto const token{interpretation.get_logtype()[i]};
+auto GrepCore::get_wildcard_encodable_positions(QueryInterpretation const& interpretation)
+        -> vector<size_t> {
+    auto const& logtype = interpretation.get_logtype();
+    vector<size_t> wildcard_encodable_positions;
+    wildcard_encodable_positions.reserve(logtype.size());
+    for (size_t i{0}; i < logtype.size(); ++i) {
+        auto const& token = logtype[i];

Comment thread components/core/src/clp/GrepCore.cpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
Comment thread components/core/src/clp/GrepCore.hpp Outdated
…esting code for DP algo; Add final unit-test for testing PR end-to-end; Still lots of edge cases and sub-steps that could be tested more rigorously.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (6)
components/core/src/clp/GrepCore.hpp (5)

191-197: Fix typos in docs.

“than” → “that”; “encodabe” → “encodable”.

-     * An encodable variable is a variable token than:
+     * An encodable variable is a variable token that:
@@
-     * @return A vector of positions of encodabe wildcard variables.
+     * @return A vector of positions of encodable wildcard variables.

28-32: Gate test‑only friend with a build flag.

Avoid leaking test details to consumers; wrap the friend in CLP_BUILD_TESTING (or your project’s test macro).

-class GrepCoreTest;
+ #if defined(CLP_BUILD_TESTING)
+ class GrepCoreTest;
+ #endif

 class GrepCore {
-    friend class GrepCoreTest;
+  #if defined(CLP_BUILD_TESTING)
+    friend class GrepCoreTest;
+  #endif

544-547: Avoid per‑iteration unordered_map allocations for the mask.

This map is rebuilt for every mask; switch to a flat mask (vector<uint8_t>/bitset) or add an overload of generate_logtype_string that takes (positions, mask).

-            std::unordered_map<size_t, bool> wildcard_mask_map;
-            for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) {
-                wildcard_mask_map[wildcard_encodable_positions[i]] = mask >> i & 1ULL;
-            }
+            // Lightweight mask: local index -> bit
+            std::vector<uint8_t> mask_bits(wildcard_encodable_positions.size());
+            for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) {
+                mask_bits[i] = static_cast<uint8_t>((mask >> i) & 1ULL);
+            }
+            // Prefer an overload: generate_logtype_string(interpretation, wildcard_encodable_positions, mask_bits)

If changing signature isn’t feasible now, at least reserve:

+            wildcard_mask_map.reserve(wildcard_encodable_positions.size());

536-543: Cap combinatorial explosion (2^k) to a practical limit.

k=32 allows 4,294,967,296 combinations; this can hang a search. Lower the cap (e.g., 16) and fail fast.

-    constexpr size_t cMaxEncodableWildcardVariables{32};
+    constexpr size_t cMaxEncodableWildcardVariables{16}; // practical cap to bound 2^k work
@@
-        uint64_t const num_combos{1ULL << wildcard_encodable_positions.size()};
+        auto const k = wildcard_encodable_positions.size();
+        uint64_t const num_combos{1ULL << k};

Optionally add telemetry to surface when the cap is hit.


612-614: Use strongly‑typed enum comparisons for symbol IDs.

Avoid underlying‑integer compares; cast once to SymbolId and compare enums.

-    auto const var_type{variable_token.get_variable_type()};
-
-    bool const is_int{static_cast<uint32_t>(log_surgeon::SymbolId::TokenInt) == var_type};
-    bool const is_float{static_cast<uint32_t>(log_surgeon::SymbolId::TokenFloat) == var_type};
+    auto const var_symbol{
+        static_cast<log_surgeon::SymbolId>(variable_token.get_variable_type())
+    };
+    bool const is_int  = (var_symbol == log_surgeon::SymbolId::TokenInt);
+    bool const is_float= (var_symbol == log_surgeon::SymbolId::TokenFloat);
components/core/tests/test-GrepCore.cpp (1)

493-500: Fix mask width and bit extraction.

Shift on size_t with literal 1 risks UB on 32‑bit; also mask bit should use 1ULL.

-    size_t const num_combos{static_cast<size_t>(1) << wildcard_encodable_positions.size()};
+    uint64_t const num_combos{1ULL << wildcard_encodable_positions.size()};
     REQUIRE(num_combos == 4);
     unordered_set<string> logtype_strings;
-    for (size_t mask{0}; mask < num_combos; ++mask) {
+    for (uint64_t mask{0}; mask < num_combos; ++mask) {
         unordered_map<size_t, bool> wildcard_mask_map;
         for (size_t i{0}; i < wildcard_encodable_positions.size(); ++i) {
-            wildcard_mask_map[wildcard_encodable_positions[i]] = mask >> i & 1;
+            wildcard_mask_map[wildcard_encodable_positions[i]] = (mask >> i) & 1ULL;
         }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d7cfb0d and b633980.

📒 Files selected for processing (4)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
  • components/core/src/clp/string_utils/string_utils.cpp (3 hunks)
  • components/core/tests/test-GrepCore.cpp (1 hunks)
  • components/core/tests/test-string_utils.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-string_utils.cpp
  • components/core/src/clp/GrepCore.hpp
  • components/core/src/clp/string_utils/string_utils.cpp
  • components/core/tests/test-GrepCore.cpp
🧠 Learnings (1)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (3)
components/core/tests/test-string_utils.cpp (2)
components/core/src/clp/string_utils/string_utils.cpp (2)
  • wildcard_match_unsafe_case_sensitive (238-342)
  • wildcard_match_unsafe_case_sensitive (238-238)
components/core/src/clp/string_utils/string_utils.hpp (1)
  • wildcard_match_unsafe_case_sensitive (142-142)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
  • interpretation (55-58)
  • interpretation (55-56)
  • interpretation (60-65)
  • interpretation (60-63)
  • logtype (220-230)
  • logtype (221-222)
  • logtype (232-242)
  • logtype (232-236)
  • generate_schema_sub_queries (88-101)
  • generate_schema_sub_queries (88-93)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (222-227)
  • wildcard_search_dictionary_and_get_encoded_matches (497-522)
  • wildcard_search_dictionary_and_get_encoded_matches (497-502)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (149-165)
  • get_wildcard_encodable_positions (149-150)
  • generate_logtype_string (167-232)
  • generate_logtype_string (167-170)
components/core/tests/test-GrepCore.cpp (6)
components/core/src/clp/string_utils/string_utils.cpp (2)
  • wildcard_match_unsafe_case_sensitive (238-342)
  • wildcard_match_unsafe_case_sensitive (238-238)
components/core/src/clp/GrepCore.hpp (12)
  • interpretation (198-200)
  • interpretation (212-215)
  • process_schema_var_token (236-242)
  • process_schema_var_token (601-663)
  • process_schema_var_token (601-607)
  • generate_schema_sub_queries (179-186)
  • generate_schema_sub_queries (529-598)
  • generate_schema_sub_queries (529-535)
  • value (76-81)
  • process_raw_query (56-65)
  • process_raw_query (248-368)
  • process_raw_query (248-257)
components/core/src/clp/GrepCore.cpp (6)
  • get_wildcard_encodable_positions (149-165)
  • get_wildcard_encodable_positions (149-150)
  • generate_logtype_string (167-232)
  • generate_logtype_string (167-170)
  • get_bounds_of_next_potential_var (29-147)
  • get_bounds_of_next_potential_var (29-34)
components/core/src/clp/EncodedVariableInterpreter.hpp (14)
  • id (56-56)
  • value (97-100)
  • value (107-110)
  • add_dict_var (266-271)
  • add_dict_var (548-561)
  • add_dict_var (548-553)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
components/core/src/clp/Query.cpp (2)
  • add_dict_var (102-105)
  • add_dict_var (103-103)
components/core/src/clp/QueryToken.cpp (2)
  • is_var (137-145)
  • is_var (137-137)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: build-macos (macos-15, false)
🔇 Additional comments (4)
components/core/src/clp/string_utils/string_utils.cpp (3)

17-18: LGTM! Clear precondition documentation.

The added comment clarifies the assumption that wild_current has no duplicate greedy wildcards, which aligns with the implementation on line 41-42.


263-269: LGTM! Improved consecutive wildcard handling.

The while loop correctly consumes all consecutive '*' characters and efficiently handles the trailing wildcard case with an early return.


316-319: LGTM! Correct trailing wildcard handling.

The loop correctly consumes all trailing '*' characters in wild when tame ends, and the final comparison properly determines if the match is successful.

components/core/tests/test-string_utils.cpp (1)

148-151: Good additions: double-asterisk coverage is solid.

These cases mirror single-asterisk variants and match the matcher’s semantics for consecutive '*'. No changes requested.

Also applies to: 158-161, 168-171, 178-181, 188-191, 198-201

Comment thread components/core/tests/test-GrepCore.cpp Outdated
Comment thread components/core/tests/test-GrepCore.cpp Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
components/core/src/clp/GrepCore.hpp (2)

616-618: Compare SymbolId values strongly

This still relies on uint32_t casts; please cast var_type back to SymbolId and compare strongly typed, matching the .cpp implementation. Doing so removes the magic numbers and avoids future breakage if the enum underlying type changes.

-    auto const var_type{variable_token.get_variable_type()};
-
-    bool const is_int{static_cast<uint32_t>(log_surgeon::SymbolId::TokenInt) == var_type};
-    bool const is_float{static_cast<uint32_t>(log_surgeon::SymbolId::TokenFloat) == var_type};
+    auto const var_type{
+            static_cast<log_surgeon::SymbolId>(variable_token.get_variable_type())
+    };
+
+    bool const is_int{log_surgeon::SymbolId::TokenInt == var_type};
+    bool const is_float{log_surgeon::SymbolId::TokenFloat == var_type};

195-201: Fix typos in the new Doxygen

Please correct “than” → “that” and “encodabe” → “encodable” so the comment reads cleanly.

-     * An encodable variable is a variable token than:
+     * An encodable variable is a variable token that:
@@
-     * @return A vector of positions of encodabe wildcard variables.
+     * @return A vector of positions of encodable wildcard variables.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b633980 and 62f23b0.

📒 Files selected for processing (2)
  • components/core/CMakeLists.txt (1 hunks)
  • components/core/src/clp/GrepCore.hpp (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/GrepCore.hpp
🧠 Learnings (1)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/src/clp/GrepCore.hpp
🧬 Code graph analysis (1)
components/core/src/clp/GrepCore.hpp (3)
components/core/tests/test-GrepCore.cpp (10)
  • interpretation (55-58)
  • interpretation (55-56)
  • interpretation (60-65)
  • interpretation (60-63)
  • logtype (220-230)
  • logtype (221-222)
  • logtype (232-242)
  • logtype (232-236)
  • generate_schema_sub_queries (88-101)
  • generate_schema_sub_queries (88-93)
components/core/src/clp/EncodedVariableInterpreter.hpp (13)
  • logtype (63-65)
  • logtype (63-63)
  • logtype (71-73)
  • logtype (71-71)
  • logtype (79-81)
  • logtype (79-79)
  • logtype (87-89)
  • logtype (87-87)
  • wildcard_search_dictionary_and_get_encoded_matches (222-227)
  • wildcard_search_dictionary_and_get_encoded_matches (497-522)
  • wildcard_search_dictionary_and_get_encoded_matches (497-502)
  • encoded_var (57-57)
  • encoded_var (116-116)
components/core/src/clp/GrepCore.cpp (4)
  • get_wildcard_encodable_positions (149-165)
  • get_wildcard_encodable_positions (149-150)
  • generate_logtype_string (167-232)
  • generate_logtype_string (167-170)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: rust-checks (ubuntu-22.04)
  • GitHub Check: rust-checks (ubuntu-24.04)
  • GitHub Check: rust-checks (macos-15)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
components/core/CMakeLists.txt (1)

774-776: Scoped test macro wiring looks good

Defining CLP_BUILD_TESTING only for the unitTest target is exactly what the header guard expects, so production builds stay untouched. Nicely done.

Comment thread components/core/src/clp/GrepCore.hpp Outdated
@SharafMohamed SharafMohamed changed the title feat: Generate all possible wildcard subqueries for schema-based search, preventing unnecessary archive decompression. feat: Generate all possible wildcard subqueries for non-capture schema-based search, preventing unnecessary archive decompression. Feb 9, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@components/core/src/clp/Utils.cpp`:
- Around line 156-161: The call to lexer.add_rule uses
std::move(make_unique<RegexASTLiteral<ByteNfaState>>(RegexASTLiteral<ByteNfaState>('\n')))
which creates an unnecessary temporary and moves a prvalue; replace it by
calling make_unique<RegexASTLiteral<ByteNfaState>>('\n') directly (no explicit
move and no intermediate RegexASTLiteral temporary) so the constructor argument
is forwarded into the unique_ptr creation; update the call sites involving
lexer.add_rule / lexer.m_symbol_id["newLine"] accordingly.

Comment thread components/core/src/clp/Utils.cpp Outdated
Copy link
Copy Markdown
Member

@davidlion davidlion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides waiting on #1972, lets make CLP_ENABLE_TESTS_IN_CPP the same as the cmake name as discussed offline. Otherwise, lgtm.

@SharafMohamed SharafMohamed force-pushed the dfa-search-using-ls-port branch from d837950 to 47705f7 Compare February 19, 2026 17:51
@davidlion davidlion merged commit c87f897 into y-scope:main Feb 20, 2026
27 checks passed
@junhaoliao junhaoliao modified the milestones: Backlog, February 2026 Mar 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants