[ASIM] Add File and Parser Validation Workflow #14450

Closed
yummyblabla wants to merge 37 commits into master from derricklee/asim-file-parser-validation

@yummyblabla yummyblabla commented Jun 9, 2026

Collaborator

Do not merge until the Azure organization has access to GitHub Models (not the free tier).

@yummyblabla yummyblabla requested review from a team as code owners June 9, 2026 23:00
@contentautomationbot


ASIM parsers have been changed. ARM templates were regenerated from the updated KQL function YAML files.
To find the new ARM templates, pull your branch.

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

ASIM File Validation Failed

The following validation errors were found:

  • Expected new CHANGELOG file not found: Parsers/ASimAuthentication/CHANGELOG/ASimAuthenticationTestProduct.md

This comment was generated automatically by the ASIM File and Parser Validation workflow.
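
The check behind this error can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: the function names, the `changes` mapping of path to git status letter, and the rule that a parser at `Parsers/<Schema>/Parsers/<Name>.yaml` pairs with `Parsers/<Schema>/CHANGELOG/<Name>.md` are all assumptions inferred from the paths in the bot comments.

```python
from pathlib import PurePosixPath


def expected_changelog(parser_yaml: str) -> str:
    """Assumed mapping: Parsers/<Schema>/Parsers/<Name>.yaml
    -> Parsers/<Schema>/CHANGELOG/<Name>.md"""
    p = PurePosixPath(parser_yaml)
    schema_dir = p.parent.parent  # e.g. Parsers/ASimAuthentication
    return str(schema_dir / "CHANGELOG" / f"{p.stem}.md")


def check_changelogs(changes: dict) -> list:
    """`changes` maps a repo path to its git status ('A' added, 'M' modified).

    Hypothetical rule: an added parser needs an added CHANGELOG, and a
    modified parser needs a modified CHANGELOG.
    """
    errors = []
    for path, status in changes.items():
        p = PurePosixPath(path)
        # Only parser YAML files directly under a 'Parsers' folder are checked.
        if p.suffix != ".yaml" or p.parent.name != "Parsers":
            continue
        if status not in ("A", "M"):
            continue
        changelog = expected_changelog(path)
        if changes.get(changelog) != status:
            kind = "new" if status == "A" else "modified"
            errors.append(f"Expected {kind} CHANGELOG file not found: {changelog}")
    return errors
```

Passing only the added `ASimAuthenticationTestProduct.yaml` parser, with no matching CHANGELOG entry, yields the same message the bot posted above.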

@github-actions

Contributor

ASIM File Validation Failed

The following validation errors were found:

  • Expected modified CHANGELOG file not found: Parsers/ASimAuthentication/CHANGELOG/ASimAuthentication.md

This comment was generated automatically by the ASIM File and Parser Validation workflow.

@yummyblabla yummyblabla added the SafeToRun label (This is used only for ASim parsers Fork PR Pipeline run.) Jun 10, 2026
@github-actions

Contributor

ASIM Parser KQL Review

Readiness Rating: 6/10

| # | Priority | Issue | Suggestion |
|---|---|---|---|
| 1 | 🔴 High | Multiple full scans of Syslog due to per-branch reads before union | Build a single, filtered, projected base dataset and materialize it once, then reuse in all branches. Example: let Base = materialize(Syslog |
| 2 | 🔴 High | project-away used repeatedly (sub-queries and final stage). This is not schema-safe and can break on source schema changes | Replace all project-away uses with a single final project that explicitly lists the ASIM output schema (and any required intermediate fields). Keep only the desired ASIM fields and aliases in the final projection. Remove intermediate project-away statements entirely. |
| 3 | 🟡 Medium | Parsing after only minimal filtering in branches; some branches could filter more aggressively before parse | When using a materialized Base, keep the message prefilter in Base (has_any list) and retain branch-specific predicates (e.g., startswith "Accepted", "Invalid user", etc.) before parse. This reduces parse workload and memory. |
| 4 | 🟡 Medium | parse used where parse-where can both parse and filter, reducing unnecessary rows earlier | Switch to parse-where for deterministic patterns already constrained by startswith. Examples: Accepted/Invalid user/Timeout branches: replace parse with parse-where (e.g., |
| 5 | 🟢 Low | Redundant extend Type = 'Syslog' overwrites a native column and adds work without value | Remove extend Type = 'Syslog'. The table is already Syslog; overriding is unnecessary and can be confusing. Also drop Type from the SyslogProjects/Base projection since it’s not used. |
| 6 | 🟢 Low | Extra alias column Rule duplicates RuleName, increasing payload size | Unless Rule is required by downstream content, drop extend Rule = RuleName. If needed, derive at the very end via project-rename or keep just one of them. |
| 7 | 🟢 Low | Repeated per-branch lookup is fine, but minor cleanup possible | After lookup LogonMethodLookup, immediately avoid carrying Method by relying on the final project (see #2). No need for intermediate project-away. |

Notes:

  • AdditionalFields/pack: Not applicable; the parser does not use AdditionalFields, so no action needed.
  • Parsing operators: Good use of parse/split; no regex present. Keep it that way.
  • union isfuzzy=false: Good for schema stability and performance; keep.

This review was generated automatically by the ASIM File and Parser Validation workflow using GitHub Models.

@github-actions

Contributor

ASIM Parser KQL Review

Readiness Rating: 6/10

| # | Priority | Issue | Suggestion |
|---|---|---|---|
| 1 | 🔴 High | Uses project-away in multiple places (sub-parsers and after union). This violates ASIM guidance and risks breaking on source schema changes; also may keep more columns than needed until late in the pipeline. | Replace each project-away with an explicit project of the required columns. Do this within each sub-parser and at the final stage. Example for a sub-parser: replace “... |
| 2 | 🟡 Medium | Re-reading the Syslog table across multiple branches. Each sub-parser references SyslogProjects (which expands to Syslog) and applies its own filters, potentially causing multiple scans. | Create a shared, prefiltered base for sshd events to reduce scans: let SSHDBase = Syslog |
| 3 | 🟡 Medium | SSHDFailed parses every candidate row twice (once for the “Failed …” body and again for “message repeated … times”). This adds unnecessary compute. | Split SSHDFailed into two branches and union them: one branch for “message repeated … times” using parse-where to extract EventCount, and a second branch for single “Failed …” messages. Example: let FailedRepeated = SSHDBase |
| 4 | 🟢 Low | Using startswith() string function repeatedly on large text fields; the operator version is slightly more efficient. | Prefer hasprefix for prefix filters and has for substring filters when using literals. Example: replace SyslogMessage startswith "Failed" with SyslogMessage hasprefix "Failed". |
| 5 | 🟢 Low | Redundant reassignment of Type column: extend Type = 'Syslog'. This is already the source table name/value and adds no value, slightly increasing compute. | Remove extend Type = 'Syslog'. Keep the original Type from Syslog if needed for downstream filters, or project/rename to a different field if a normalized field is required. |
| 6 | 🟢 Low | Minor overhead from int(1) casts in multiple extends. | Replace int(1) with literal 1 (Kusto will type it as int in numeric contexts). |
| 7 | 🟢 Low | DNS/FQDN resolution invoked for all parsed rows in the “break-in attempt” sub-parsers; some rows may already have only IPs, making resolution unnecessary. | Add a light prefilter before invoking _ASIM_ResolveSrcFQDN to reduce calls: e.g., |

Notes:

  • Filter → Parse → Map pattern is largely correct: each sub-parser filters on ProcessName and message pattern before parsing, and mapping happens post-parse. Good use of parse/parse-where and datatable lookup without regex.
  • No AdditionalFields used; therefore no pack parameter is required.

This review was generated automatically by the ASIM File and Parser Validation workflow using GitHub Models.

@github-actions

Contributor

ASIM Parser KQL Review

Readiness Rating: 6/10

| # | Priority | Issue | Suggestion |
|---|---|---|---|
| 1 | 🔴 High | Multiple scans of Syslog due to each branch re-reading the table | Create a single prefiltered/materialized base for sshd events and reuse it across branches. Example: let Base = materialize(Syslog |
| 2 | 🔴 High | Uses project-away in multiple places; not resilient to schema changes and adds work | Replace all project-away usage with a single final project that enumerates the ASIM normalized schema fields to output. Avoid intermediate pruning; rely on final projection to define the parser’s contract and protect against upstream schema drift. |
| 3 | 🟡 Medium | Redundant gating with where not(disabled) in every branch (applied after reading SyslogProjects) | Move not(disabled) to the base dataset (e.g., in Base above) so it’s applied once and can be pushed down by the engine. Remove per-branch where not(disabled) to reduce redundant filters. |
| 4 | 🟡 Medium | Minor join overhead using lookup to map three fixed LogonMethod values | Replace the lookup to the small datatable with a direct case mapping (e.g., extend LogonMethod = case(Method == "password","Username & password", Method == "publickey","PKI", Method == "keyboard-interactive/pam","PAM", SyslogMessage has "key RSA","PKI","Other")). This avoids the join operator entirely and is faster at scale. |
| 5 | 🟡 Medium | Per-branch filter before parse is good, but parse and filter could be combined | In branches already filtering on a literal prefix (e.g., Accepted, Failed, Timeout), consider parse-where SyslogMessage with ... to combine filter+parse in one operator where it simplifies the pipeline. Keep the early ProcessName filter as is. |
| 6 | 🟢 Low | Overwriting the native Type column with a constant string | Avoid overriding Type or exclude it from the initial projection. If needed, add a new normalized field instead of mutating the native column. This reduces confusion and prevents potential conflicts with upstream changes. |
| 7 | 🟢 Low | Minor duplication of parsing work in SSHDInvalidUser (two parse statements plus split) | Consider splitting into two lightweight branches (with and without username) or use parse-where for each pattern. This avoids split on nulls and prevents unnecessary parse attempts on non-matching rows. |
| 8 | 🟢 Low | Final schema selection occurs after several extends and project-rename with intermediate prune steps | After union and enrichment, perform a single project with the final ASIM fields (including EventUid, DvcId, DvcIpAddr, etc.). This simplifies the pipeline and ensures a stable output schema. |

Notes:

  • AdditionalFields/pack: Not applicable here since AdditionalFields is not used. If you add AdditionalFields in the future, include a pack:bool=false parameter and only pack when pack == true.
  • The use of isfuzzy=false in union and avoidance of regex in parsing are good practices.

This review was generated automatically by the ASIM File and Parser Validation workflow using GitHub Models.

@github-actions

Contributor

ASIM vim Parser KQL Review

Readiness Rating: 7/10

| # | Priority | Issue | Suggestion |
|---|---|---|---|
| 1 | 🔴 High | srchostname_has_any is effectively ignored. The prefilter forces this condition to always-true and there is no post-filter applying it anywhere, leading to user-provided src hostname filters having no effect. | Either remove the parameter or implement it: resolve source host for all branches (e.g., via _ASIM_ResolveSrcFQDN for Accepted/Failed/Invalid/Timeout as well) and add a post-filter like: and (array_length(srchostname_has_any) == 0 or SrcHostname has_any (srchostname_has_any)). If resolution is too expensive, at least apply it conditionally only when the parameter is provided. |
| 2 | 🟡 Medium | targetappname_has_any prefilter is a constant check ('sshd' in~ list) and does not use any row data. There’s also no post-filter on TargetAppName, so semantics rely solely on that constant condition. | Apply the filter to native/app fields: in prefilter use ProcessName in~ (targetappname_has_any), and/or after extend TargetAppName='sshd' add a post-filter: and (array_length(targetappname_has_any)==0 or TargetAppName in~ (targetappname_has_any)). Optionally short-circuit early: if array_length(targetappname_has_any)>0 and not('sshd' in~ targetappname_has_any) then return an empty result to avoid scanning. |
| 3 | 🟡 Medium | No branch-level gating using eventtype_in and eventresult. All branches run (including parses/lookups) even when the parameters exclude their outcomes. | Gate branches early to avoid unnecessary computation. Examples: only run SSHDTimeout when ('Logoff' in eventtype_in or eventtype_in empty); only run SSHDAccepted when (eventresult is '*' or eventresult =~ 'Success'); only run SSHDFailed/Invalid/Break-in branches when eventresult includes 'Failure'; if eventresultdetails_in is supplied, run only branches that can produce those details (e.g., 'No such user' => SSHDInvalidUser, 'Logon violates policy' => break-in branches). Implement as simple where conditions at top of each branch or by conditionally including branches in the union. |
| 4 | 🟢 Low | eventresult filter is case-sensitive (EventResult == eventresult). Users may pass 'success'/'failure' and get no matches. | Make it case-insensitive: (eventresult == '' or tolower(EventResult) == tolower(eventresult)) or (eventresult == '' or EventResult =~ eventresult). |
| 5 | 🟡 Medium | Prefilter does not leverage native columns for target app and cannot be extended easily due to the typed signature (T: (SyslogMessage, TimeGenerated)). | Broaden the prefilter function signature (or omit explicit column typing) so you can use ProcessName (and potentially Computer/_ResourceId) inside prefilter. Then apply native-column filters there, e.g., ProcessName in~ targetappname_has_any. |
| 6 | 🟡 Medium | Missing common scoping parameters (e.g., Computer/Dvc, _ResourceId, _SubscriptionId) that can drastically reduce scanned rows early. | Add parameters like dvc_has_any (Computer), dvcid_has_any (_ResourceId), dvcscopeid_has_any (_SubscriptionId) and apply them in prefilter using native columns. This is a common and effective way to scope Syslog queries in large workspaces. |
| 7 | 🟢 Low | No parameter to filter by source port (SrcPortNumber), which is parsed for most branches and often used operationally. | Add srcport_in (dynamic) and apply: pre-parse filtering isn’t possible, but add a post-filter after parsing: and (array_length(srcport_in) == 0 or SrcPortNumber in (srcport_in)). |
| 8 | 🟢 Low | Lookup (_ASIM_ResolveSrcFQDN and LogonMethodLookup) runs before parameter-based pruning at the union level, incurring work on rows that may be filtered out later. | Combine with issue #3: after adding branch gating for eventresult/eventtype/eventresultdetails, these lookups will run on smaller sets. Additionally, you can move lookups after any trivial where filters that depend on parameters within the branch. |

This review was generated automatically by the ASIM File and Parser Validation workflow using GitHub Models.

@github-actions

Contributor

ASIM Parser Validation Failed

The following validation errors were found:

  • _ASim_Authentication_TestProduct from Parsers/ASimAuthentication/Parsers/ASimAuthenticationTestProduct.yaml is not listed in Parsers/ASimAuthentication/Parsers/ASimAuthentication.yaml Parsers list
  • EquivalentBuiltInParser in Parsers/ASimAuthentication/Parsers/vimAuthenticationTestProduct.yaml must follow the format Im_Authentication, but found: Im_Authentication_TestProduct
  • ParserName in Parsers/ASimAuthentication/Parsers/vimAuthenticationTestProduct.yaml must follow the format vimAuthentication, but found: vimNetworkSessionTestProduct
  • ParserName vimNetworkSessionTestProduct from Parsers/ASimAuthentication/Parsers/vimAuthenticationTestProduct.yaml is not referenced in Parsers/ASimAuthentication/Parsers/imAuthentication.yaml ParserQuery
  • Im_Authentication_TestProduct from Parsers/ASimAuthentication/Parsers/vimAuthenticationTestProduct.yaml is not listed in Parsers/ASimAuthentication/Parsers/imAuthentication.yaml Parsers list

This comment was generated automatically by the ASIM File and Parser Validation workflow.
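
The naming and listing errors above suggest checks along these lines. This is a hedged sketch only: the function names are hypothetical, and the patterns `vim<Schema><Product>` for ParserName and `_Im_<Schema>_<Product>` for EquivalentBuiltInParser are assumptions about the intended convention, since the expected formats in the bot message appear to have lost part of their text.

```python
import re


def check_vim_names(schema: str, parser_name: str, builtin_name: str) -> list:
    """Validate a vim parser's names against the assumed conventions.

    Assumed patterns (not confirmed by the source):
      ParserName              -> vim<Schema><Product>
      EquivalentBuiltInParser -> _Im_<Schema>_<Product>
    """
    errors = []
    s = re.escape(schema)
    if not re.fullmatch(rf"vim{s}\w+", parser_name):
        errors.append(
            f"ParserName must start with vim{schema}, but found: {parser_name}"
        )
    if not re.fullmatch(rf"_Im_{s}_\w+", builtin_name):
        errors.append(
            f"EquivalentBuiltInParser must start with _Im_{schema}_, "
            f"but found: {builtin_name}"
        )
    return errors


def check_listed(entry: str, union_file: str, parsers_list: list) -> list:
    """Report a parser that the schema's union parser does not list."""
    if entry in parsers_list:
        return []
    return [f"{entry} is not listed in {union_file} Parsers list"]
```

With schema `Authentication`, the names reported above (`vimNetworkSessionTestProduct` and `Im_Authentication_TestProduct`) fail both checks under these assumed patterns, while `vimAuthenticationTestProduct` and `_Im_Authentication_TestProduct` pass.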

@github-actions

Contributor

ASIM Parser KQL Review

Readiness Rating: 7/10

| # | Priority | Issue | Suggestion |
|---|---|---|---|
| 1 | 🔴 High | Use of project-away in multiple places (per-branch and post-union) | Replace all project-away with a single, explicit project at the end that enumerates the ASIM Authentication schema fields and required aliases. This protects against source schema drift and reduces unnecessary column traffic. Example: after union and field mapping, do a final project listing all output fields (e.g., TimeGenerated, EventStartTime, EventEndTime, EventResult, EventSeverity, EventType, TargetUsername, SrcIpAddr, SrcPortNumber, LogonMethod, DvcId, DvcIpAddr, DvcScopeId, EventUid, DvcFQDN, DvcHostname, DvcDomain, DvcDomainType, DvcAction, RuleName, etc.), instead of project-away in each branch. |
| 2 | 🟡 Medium | Repeated full scans of Syslog due to independent sub-queries in union | Introduce a common prefilter for ProcessName and high-selectivity message tokens before branching. For example, define a base let with Syslog |
| 3 | 🟡 Medium | lookup to a small datatable for LogonMethod mapping adds a join per row | Replace lookup with a case() expression to avoid the join. Example: extend LogonMethod = case(Method == "password", "Username & password", Method == "publickey" or SyslogMessage has "key RSA", "PKI", Method == "keyboard-interactive/pam", "PAM", "Other"). This removes the overhead of a lookup and simplifies the flow. |
| 4 | 🟢 Low | Direct field aliasing using extend where project-rename or removal would suffice (e.g., extend Rule = RuleName) | If only Rule is needed, use project-rename Rule = RuleName and drop RuleName, or keep one of them but avoid redundant duplication. Reserve extend for calculated/normalized fields. |
| 5 | 🟢 Low | Type is projected from source (SyslogProjects) and then overwritten with a constant | Remove Type from SyslogProjects and only set it once via extend Type = "Syslog" in the mapping phase. This avoids carrying an unused column and potential confusion. |
| 6 | 🟢 Low | Repeated where ProcessName == "sshd" across branches | Push ProcessName == "sshd" into SyslogProjects (or the suggested common base) to centralize and push down filtering. While Kusto inlines lets, this improves maintainability and helps ensure early filtering. |
| 7 | 🟢 Low | startswith on SyslogMessage may be less index-friendly than token operators | Where semantics permit, prefer hasprefix() or has_any() over startswith(), especially for tokenized words at the beginning of the message (e.g., "Accepted", "Failed", "Timeout"). This can leverage the text index and improve scan performance. |

Notes:

  • Filter → Parse → Map: The parser generally follows the correct flow. Filters on ProcessName and message patterns are applied before parsing, parsing uses parse/parse-where, and mapping is done via extend and project-rename. The main deviation is the use of project-away, which should be replaced with a final explicit project.
  • AdditionalFields/pack: Not applicable; the parser does not emit AdditionalFields. No change needed.

This review was generated automatically by the ASIM File and Parser Validation workflow using GitHub Models.

@github-actions

Contributor

ASIM vim Parser KQL Review

Readiness Rating: 7/10

| # | Priority | Issue | Suggestion |
|---|---|---|---|
| 1 | 🔴 High | srcipaddr_has_any_prefix prefilter uses has_any_ipv4_prefix(SyslogMessage, …), which expects a single IP string, not a free-text message. When the parameter is non-empty, this will often evaluate to false and drop valid rows, or at best yield no early benefit. | Remove the SrcIpAddr prefix filter from prefilter (keep only array_length check there) and rely on the accurate post-filter has_any_ipv4_prefix(SrcIpAddr, …) after parsing. If early pruning is required, extract the IP cheaply (e.g., parse-where for the specific patterns in each branch) and apply the prefix check on the extracted IP before the heavier parsing. |
| 2 | 🟡 Medium | eventresult and eventtype_in are only applied post-union. Since each branch has a constant EventResult and EventType, the parser processes branches that can’t match the filter. | Gate branches earlier based on parameters: for example, if eventresult == 'Success', don’t evaluate SSHDFailed and break-in attempt branches; if eventresult == 'Failure', skip SSHDAccepted/SSHDTimeout. Similarly, prune SSHDTimeout when eventtype_in excludes 'Logoff', and prune other branches when eventtype_in excludes 'Logon'. This reduces rows and parsing work early. |
| 3 | 🟡 Medium | targetappname_has_any is evaluated per-row in prefilter via a constant check ('sshd' in~ targetappname_has_any). This is redundant with the per-branch ProcessName == 'sshd' filter and needlessly evaluated for every row. | Short-circuit once at the top: if array_length(targetappname_has_any) > 0 and not('sshd' in~ targetappname_has_any) then return an empty result (or add a single global where false). Otherwise, omit the per-row prefilter check. This avoids unnecessary per-row evaluation. |
| 4 | 🟢 Low | Prefilter is invoked after message-type predicates (startswith/has). Time and coarse user/IP filters could be applied even earlier to reduce the number of rows checked for message patterns. | Reorder to apply prefilter before message-pattern filters in each branch: SyslogProjects |
| 5 | 🟢 Low | Parameter pack is defined in ParserParams but is completely absent from the query (not in the parser signature or logic). | Add pack to the parser function signature and handle it consistently with ASIM VIM guidance (even if it doesn’t affect filtering), or remove it from ParserParams if not supported for this product. |
| 6 | 🟢 Low | eventresultdetails_in only applied post-parse. Some branches have determinable details (e.g., 'No such user', 'Logon violates policy', or inferring 'Incorrect key' via 'publickey' in message), allowing earlier pruning. | In the relevant branches, add optional early checks tied to eventresultdetails_in (e.g., in SSHDFailed, use presence/absence of 'publickey' to prune when the requested details cannot match). Keep them simple to avoid overcomplication. |

This review was generated automatically by the ASIM File and Parser Validation workflow using GitHub Models.

@github-actions

Contributor

ASIM File Validation Failed

The following validation errors were found:

  • Expected exactly 2 new YAML files in Parsers/ASim/Parsers/, found 0.

This comment was generated automatically by the ASIM File and Parser Validation workflow.
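
A count check like the one reported here might look like the sketch below. The function name and the `changes` mapping of path to git status letter are hypothetical. The expected count of 2 is taken from the message itself; it is an assumption here that it corresponds to one ASim parser plus one vim parser per product.

```python
def check_new_parser_files(changes: dict, parsers_dir: str, expected: int = 2) -> list:
    """Count YAML files added ('A') directly under `parsers_dir`.

    `changes` maps a repo path to its git status letter. Returns a list
    with one error message when the count differs from `expected`.
    """
    prefix = parsers_dir if parsers_dir.endswith("/") else parsers_dir + "/"
    new_yaml = [
        path
        for path, status in changes.items()
        if status == "A"
        and path.startswith(prefix)
        and path.endswith(".yaml")
        and "/" not in path[len(prefix):]  # ignore nested folders
    ]
    if len(new_yaml) != expected:
        return [
            f"Expected exactly {expected} new YAML files in {prefix}, "
            f"found {len(new_yaml)}."
        ]
    return []
```

Called with no added files and the directory `Parsers/ASim/Parsers/`, it reproduces the message above. That directory has no schema name between `ASim` and `/Parsers/`, unlike the `Parsers/ASimAuthentication/Parsers/` paths in the other comments, so the check may have been run against a path that does not exist in the branch.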

@yummyblabla yummyblabla changed the title from "[ASIM] Test file validation workflow" to "[ASIM] Add File and Parser Validation Workflow" Jun 11, 2026
@yummyblabla

Collaborator Author

GitHub Models is not activated in the Azure organization, so adding LLM analysis to a workflow is not a feasible option.

Labels

ASIM, SafeToRun (This is used only for ASim parsers Fork PR Pipeline run.)

3 participants