[OCSF] Zeek pipeline#23712
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29e5c076c3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
This PR does not modify any files shipped with the agent. To help streamline the release process, please consider adding the |
Add OCSF v1.5.0 normalization for Zeek/Corelight logs, covering 7 log types across 5 OCSF classes (Detection Finding, Network Activity, HTTP Activity, DNS Activity, File Hosting Activity). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5e74b0b to
6c59086
Compare
Resolve 36 validation errors flagged by the datadog-assets validator: - Add missing `overrideOnConflict: false` to 3 attribute-remappers - Fix 2 schema-remapper names to backtick individual fields - Rename 25 facets to match validator's canonical names and add `type: integer`/`facetType: range` where required - Remove 6 facets with unresolvable path conflicts (validator demanded unique paths with no canonical definition available) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Notice events emit `severity.name` capitalized ("High", "Medium", etc.),
so the lowercase `@severity.name:informational` filters never matched
and the fallback assigned `ocsf.severity_id: 99` while preserving the
capitalized name as `ocsf.severity`. Switch the schema-category-mapper
to filter on the numeric `severity.id` (1-5) which Corelight reliably
emits, and update the notice fixture's expected `severity_id` from 99
to 4 to reflect the corrected mapping.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each schema-category-mapper that defines a fallback must also have a catch-all filter category at the end matching the fallback's values. Six mappers were missing the trailing catch-all: notice/alert severity_id (2004), http activity_id/status_id (4002), dns rcode_id, and dns status_id (4003). Append `query: "*"` -> Other/99 to each. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| - type: string-builder-processor | ||
| name: Stringify tx_hosts | ||
| enabled: true | ||
| template: "%{tx_hosts}" | ||
| target: _tx_hosts_str | ||
| replaceMissing: false | ||
| - type: string-builder-processor | ||
| name: Stringify rx_hosts | ||
| enabled: true | ||
| template: "%{rx_hosts}" | ||
| target: _rx_hosts_str | ||
| replaceMissing: false | ||
| - type: grok-parser | ||
| name: Extract first IP from tx_hosts | ||
| enabled: true | ||
| source: _tx_hosts_str | ||
| samples: | ||
| - '["10.104.10.60"]' | ||
| grok: | ||
| supportRules: "" | ||
| matchRules: 'g \[?"?%{ip:_tx_host}"?' | ||
| - type: grok-parser | ||
| name: Extract first IP from rx_hosts | ||
| enabled: true | ||
| source: _rx_hosts_str | ||
| samples: | ||
| - '["10.104.10.65"]' | ||
| grok: | ||
| supportRules: "" | ||
| matchRules: 'g \[?"?%{ip:_rx_host}"?' |
There was a problem hiding this comment.
This flow is overly complicated - just use an array-processor to get the first array out of the hosts. The current implementation is inefficient and generates a bunch of extra intermediate fields
There was a problem hiding this comment.
Tried this but grok-parser requires a string source - when I pointed it directly at the array tx_hosts/rx_hosts, the IP didn't extract. array-processor type: select requires filter+valueToExtract which only work on object arrays, not primitive string arrays. Kept one stringify intermediate (_tx_hosts_str/_rx_hosts_str) but eliminated the second one - grok now targets ocsf.{src,dst}_endpoint.ip directly. Open to a cleaner pattern if there is one.
Direct mappings, dead-code removal, correctness fixes, and OCSF validator
cleanups across notice, suricata, conn, ssl, weird, http, dns, and file
hosting sub-pipelines:
- Map directly to OCSF targets where intermediates were unnecessary
(ocsf.time, ocsf.duration, ocsf.traffic.packets, JA3/JA3S algorithm_id,
weird protocol_name).
- Drop dead/auto-generated mappers: notice/suricata category_uid (set by
schema-processor), self-maps of finding_info.uid, event_code, file.hashes
(when unbuilt upstream), suricata community_id correlation_uid, HTTP
version-as-protocol_ver, DNS direction derivation, and the DNS rcode_id
catch-all/fallback (recommended-not-required).
- Convert suricata alert.signature_id event_code from string-builder to
schema-remapper.
- Combine domain/query into single ocsf.query.hostname schema-remapper.
- Fix DNS Activity filters: use rcode_name presence to discriminate
Response/Query instead of dns.answer.name (handles NXDOMAIN responses).
- DNS status_id catch-all renamed Other/99 -> Unknown/0 to satisfy the
OCSF validator's suspicious-Other check.
- File Hosting tx_hosts/rx_hosts: drop the second intermediate field;
grok targets ocsf.{src,dst}_endpoint.ip directly off a single stringify.
- Switch fallback source fields per Jonah's suggestions:
severity -> severity.name, alert.severity -> alert_severity,
http status -> status_msg, dns rcode/status -> rcode_name.
- Notice fixture: use id.orig_h/id.resp_h connection fields instead of
the suricata-style src.
Regenerated zeek_tests.yaml with the OCSF validator (--check-all --write).
All 14 logs pass validation with no errors or warnings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Use two array-processors to wrap each Zeek `answers` string into a dns_answer object and append to ocsf.answers: the first selects the first array element into ocsf.answer.rdata, the second appends ocsf.answer onto ocsf.answers. Only the first answer is captured (the pipeline DSL has no per-element iteration), but that covers the common single-A-record case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous array-processor type:select required operation.filter and operation.valueToExtract per the asset validator, but those only apply to object arrays - Zeek's `answers` is a primitive string array. Switch to string-builder + grok-parser to extract the first answer string into ocsf.answer.rdata, then keep the array-processor append to wrap it into ocsf.answers as a dns_answer object. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 501511a3ed
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Include `files_red` in the File Hosting [6006] sub-pipeline filter so redacted file events get OCSF class_uid/activity_id/file fields, not just the pre-transform metadata. - Prefer `filename` over `fuid` when populating `ocsf.file.name`; fall back to `fuid` only when `filename` is absent. The `fuid` mapping to `ocsf.file.uid` is unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jbfeldman-dd
left a comment
There was a problem hiding this comment.
A few outstanding issues around intermediate fields and grok parsers
| name: Stringify answers | ||
| enabled: true | ||
| template: "%{answers}" | ||
| target: _answers_str |
There was a problem hiding this comment.
This still leaves _answers_str a leftover field in the output. This can be fixed by setting target here to ocsf.answer, and then having that be the source in the grok-parser
| matchRules: 'g \[?"?%{ip:ocsf.src_endpoint.ip}"?' | ||
| - type: grok-parser | ||
| name: Extract first IP from rx_hosts | ||
| enabled: true | ||
| source: _rx_hosts_str | ||
| samples: | ||
| - '["10.104.10.65"]' | ||
| grok: | ||
| supportRules: "" | ||
| matchRules: 'g \[?"?%{ip:ocsf.dst_endpoint.ip}"?' |
There was a problem hiding this comment.
These match rules won't actually correctly parse multiple IPs. Use the grok parser g %{ip:ocsf.src_endpoint.ip}(,%{data})? instead
| name: Set is_alert to boolean true | ||
| enabled: true | ||
| template: "true" | ||
| target: _is_alert_str |
There was a problem hiding this comment.
Set directly to ocsf.is_alert to prevent creation of extra fields
- is_alert (notice 2004, suricata 2004): string-builder writes directly
to `ocsf.is_alert`; grok-parser converts in place. Drops the
`_is_alert_str` intermediate.
- DNS answers: stringify directly into `ocsf.answer`; grok extracts
`ocsf.answer.rdata` via `a %{data:ocsf.answer.rdata}(,%{data})?` so
the comma-separated multi-IP form parses correctly. Drops the
`_answers_str` intermediate.
- File Hosting tx/rx hosts: stringify directly into
`ocsf.{src,dst}_endpoint`; grok extracts `.ip` via
`g %{ip:ocsf.{src,dst}_endpoint.ip}(,%{data})?` for multi-IP. Drops
the `_tx_hosts_str`/`_rx_hosts_str` intermediates.
- Connection 4001: arithmetic-processor writes total bytes directly to
`ocsf.traffic.bytes`; the schema-processor remapper becomes a
self-map. Drops the `_total_bytes` intermediate (matches the
earlier _total_packets/_duration_ms cleanup).
- Restore `ocsf.file.hashes`: build `tmp_md5`/`tmp_sha1`/`tmp_sha256`
fingerprint objects (algorithm name, integer algorithm_id, value),
array-processor append each into `ocsf.file.hashes`, and self-map
the array inside the 6006 schema-processor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validation ReportAll 20 validations passed. Show details
|
|
/merge |
|
View all feedbacks in Devflow UI.
This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
This pull request was merged directly. |
What does this PR do?
Adds OCSF v1.5.0 normalization for Zeek/Corelight logs, covering 7 Zeek log types across 5 OCSF classes.
JIRA: SCI2-5871
OCSF classes covered
_pathfilter@_path:noticeATTACK::Discovery,SSH::Password_Guessing). Network context goes intoevidences[].src_endpoint/dst_endpoint.@_path:suricata_corelightfinding_info.analytic,finding_info.uid_alt(suri_id),confidence_idderived fromalert.metadata, severity fromalert.severity.@_path:(conn OR conn_long OR conn_red)conn_statevalues mapped (SF/S0/S1/S2/S3/REJ/RSTO/RSTR/RSTOS0/RSTRH/SH/SHR/OTH). Traffic counters via arithmetic (orig_bytes + resp_bytes).connection_info.direction_id/boundary_idfromlocal_orig/local_resp.@_path:(ssl OR ssl_red)tls.version,tls.cipher,tls.sni, JA3/JA3S hashes).tls.certificate.*intentionally not mapped — Zeekssl.loglacks the OCSF-requiredserial_number.@_path:weird_redname→metadata.event_code,source→connection_info.protocol_name.@_path:(http OR http_red)http_request.*/http_response.*+ URL parsing. activity_id from HTTP method; severity from response code range.@_path:(dns OR dns_red)rcode_idenum (0-23) mapped.query.*andanswers.*from raw Zeek fields.@_path:(files OR files_red)is_orig. (Class chosen over deprecated File System Activity [1001] and Network File Activity [4010].)Implementation notes
OCSFgroup.ocsf.metadata.product.name/vendor_nameand extractsocsf.timefromts(epoch ms) once for all OCSF-eligible logs.class_uid, mappers inside each schema-processor alphabetized by target.id.orig_h,proto,query, etc.) rather than DD-normalized fields (network.client.ip,zeek.proto,dns.question.name). Upstream attribute-remappers flipped topreserveSource: trueso the raw fields remain available to the OCSF mappers.tmp_<algo>+array-processor appendpattern.ocsf.evidencethen array-appended toocsf.evidences.Motivation
Brings Zeek/Corelight into the OCSF normalization rollout (SIEM use cases, cross-vendor analytics, Datadog SIEM detection rules) alongside other recent integrations (Palo Alto NGFW, Cisco Duo, Zscaler, etc.).
Review checklist (to be filled by reviewers)