
[OCSF] Zeek pipeline #23712

Merged
cepolation-datadog merged 10 commits into master from andy.anske/ocsf-transformations-zeek
May 18, 2026

@cepolation-datadog cepolation-datadog commented May 14, 2026

Contributor

What does this PR do?

Adds OCSF v1.5.0 normalization for Zeek/Corelight logs, covering 7 Zeek log types across 5 OCSF classes.

JIRA: SCI2-5871

OCSF classes covered

| OCSF class | Zeek _path filter | Notes |
|---|---|---|
| Detection Finding [2004] - Notice | `@_path:notice` | Notice events (e.g., ATTACK::Discovery, SSH::Password_Guessing). Network context goes into evidences[].src_endpoint/dst_endpoint. |
| Detection Finding [2004] - Suricata | `@_path:suricata_corelight` | Suricata alerts. Includes finding_info.analytic, finding_info.uid_alt (suri_id), confidence_id derived from alert.metadata, severity from alert.severity. |
| Network Activity [4001] - conn | `@_path:(conn OR conn_long OR conn_red)` | All 13 Zeek conn_state values mapped (SF/S0/S1/S2/S3/REJ/RSTO/RSTR/RSTOS0/RSTRH/SH/SHR/OTH). Traffic counters via arithmetic (orig_bytes + resp_bytes). connection_info.direction_id/boundary_id from local_orig/local_resp. |
| Network Activity [4001] - ssl | `@_path:(ssl OR ssl_red)` | TLS metadata (tls.version, tls.cipher, tls.sni, JA3/JA3S hashes). tls.certificate.* intentionally not mapped because Zeek ssl.log lacks the OCSF-required serial_number. |
| Network Activity [4001] - weird | `@_path:weird_red` | Protocol-anomaly events; name -> metadata.event_code, source -> connection_info.protocol_name. |
| HTTP Activity [4002] | `@_path:(http OR http_red)` | Full http_request.*/http_response.* + URL parsing. activity_id from HTTP method; severity from response code range. |
| DNS Activity [4003] | `@_path:(dns OR dns_red)` | Full rcode_id enum (0-23) mapped. query.* and answers.* from raw Zeek fields. |
| File Hosting Activity [6006] | `@_path:(files OR files_red)` | File transfers over HTTP/FTP/SMB/etc. activity_id from is_orig. (Class chosen over deprecated File System Activity [1001] and Network File Activity [4010].) |
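
The conn row's traffic counters are plain sums of the originator and responder values. The following is a minimal Python sketch of that arithmetic, not the pipeline's actual arithmetic-processor. The function name `total_traffic` is hypothetical, and treating an absent counter as zero is an assumption the PR does not state. `orig_pkts` and `resp_pkts` are the usual Zeek conn.log packet counters.

```python
def total_traffic(conn: dict) -> dict:
    """Sum originator and responder counters from a raw Zeek conn record."""

    def counter(field: str) -> int:
        # Assumption: an absent or null counter contributes 0 to the total.
        return int(conn.get(field) or 0)

    return {
        # Would populate ocsf.traffic.bytes
        "bytes": counter("orig_bytes") + counter("resp_bytes"),
        # Would populate ocsf.traffic.packets
        "packets": counter("orig_pkts") + counter("resp_pkts"),
    }


sample = {"orig_bytes": 1200, "resp_bytes": 3400, "orig_pkts": 10, "resp_pkts": 12}
traffic = total_traffic(sample)
```
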

Implementation notes

  • OCSF facets added for all OCSF target paths (~80 facets) under the OCSF group.
  • Pre-transformations pipeline sets ocsf.metadata.product.name/vendor_name and extracts ocsf.time from ts (epoch ms) once for all OCSF-eligible logs.
  • Sub-pipelines ordered by ascending class_uid, mappers inside each schema-processor alphabetized by target.
  • Sources are raw Zeek JSON fields (id.orig_h, proto, query, etc.) rather than DD-normalized fields (network.client.ip, zeek.proto, dns.question.name). Upstream attribute-remappers flipped to preserveSource: true so the raw fields remain available to the OCSF mappers.
  • Hash arrays for file fingerprints built via the standard tmp_<algo> + array-processor append pattern.
  • Detection Finding evidence assembled in singular ocsf.evidence then array-appended to ocsf.evidences.
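
The `tmp_<algo>` + append pattern from the notes above can be modeled roughly as follows. This is a Python sketch of the data flow only, not the pipeline definition: `build_file_hashes` is a hypothetical name, the `md5`/`sha1`/`sha256` source fields are assumed to be the raw Zeek files.log hash fields, and the `algorithm_id` values are taken from the OCSF Fingerprint object as I understand it and should be checked against the schema.

```python
# Assumed OCSF Fingerprint algorithm_id values for the hashes Zeek can emit.
ALGORITHMS = {"md5": (1, "MD5"), "sha1": (2, "SHA-1"), "sha256": (3, "SHA-256")}


def build_file_hashes(record: dict) -> list:
    """Build one fingerprint object per available hash and append it to the array."""
    hashes = []
    for field, (algorithm_id, algorithm) in ALGORITHMS.items():
        value = record.get(field)
        if not value:
            continue  # no tmp_<algo> object is built when the hash is absent
        tmp = {"algorithm": algorithm, "algorithm_id": algorithm_id, "value": value}
        hashes.append(tmp)  # stands in for the array-processor append
    return hashes


file_event = {"md5": "d41d8cd98f00b204e9800998ecf8427e", "sha256": None}
file_hashes = build_file_hashes(file_event)  # would populate ocsf.file.hashes
```

The same shape applies to Detection Finding evidence: a single object is assembled first and then appended to the target array.
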

Motivation

Brings Zeek/Corelight into the OCSF normalization rollout (SIEM use cases, cross-vendor analytics, Datadog SIEM detection rules) alongside other recent integrations (Palo Alto NGFW, Cisco Duo, Zscaler, etc.).

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the `qa/skip-qa` label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the `backport/` label to the PR and it will automatically open a backport PR once this one is merged

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29e5c076c3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread zeek/assets/logs/zeek.yaml Outdated
@cepolation-datadog cepolation-datadog changed the title from "[OCSF] Zeek/Corelight pipeline" to "[OCSF] Zeekpipeline" May 14, 2026
@cepolation-datadog cepolation-datadog changed the title from "[OCSF] Zeekpipeline" to "[OCSF] Zeek pipeline" May 14, 2026
@dd-octo-sts

dd-octo-sts Bot commented May 14, 2026

Contributor

⚠️ Recommendation: Add qa/skip-qa label

This PR does not modify any files shipped with the agent.

To help streamline the release process, please consider adding the qa/skip-qa label if these changes do not require QA testing.

Add OCSF v1.5.0 normalization for Zeek/Corelight logs, covering 7 log
types across 5 OCSF classes (Detection Finding, Network Activity, HTTP
Activity, DNS Activity, File Hosting Activity).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cepolation-datadog cepolation-datadog force-pushed the andy.anske/ocsf-transformations-zeek branch from 5e74b0b to 6c59086 May 14, 2026 21:44
cepolation-datadog and others added 3 commits May 14, 2026 17:26
Resolve 36 validation errors flagged by the datadog-assets validator:
- Add missing `overrideOnConflict: false` to 3 attribute-remappers
- Fix 2 schema-remapper names to backtick individual fields
- Rename 25 facets to match validator's canonical names and add
  `type: integer`/`facetType: range` where required
- Remove 6 facets with unresolvable path conflicts (validator demanded
  unique paths with no canonical definition available)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Notice events emit `severity.name` capitalized ("High", "Medium", etc.),
so the lowercase `@severity.name:informational` filters never matched
and the fallback assigned `ocsf.severity_id: 99` while preserving the
capitalized name as `ocsf.severity`. Switch the schema-category-mapper
to filter on the numeric `severity.id` (1-5) which Corelight reliably
emits, and update the notice fixture's expected `severity_id` from 99
to 4 to reflect the corrected mapping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
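
The severity bug described in the commit above can be reproduced in a few lines. This Python sketch is illustrative only: the lookup tables and function names are hypothetical, and the one-to-one alignment of `severity.id` 1-5 with OCSF `severity_id` 1-5 is an assumption consistent with the fixture change from 99 to 4 but not spelled out in the PR.

```python
FALLBACK = 99  # OCSF "Other", used when no category matches

# Before the fix: categories keyed on lowercase names, compared case-sensitively.
BY_NAME = {"informational": 1, "low": 2, "medium": 3, "high": 4, "critical": 5}


def severity_id_by_name(event: dict) -> int:
    """Old behaviour: the capitalized name never matches a lowercase key."""
    return BY_NAME.get(event.get("severity", {}).get("name"), FALLBACK)


def severity_id_by_id(event: dict) -> int:
    """New behaviour: match on the numeric severity.id instead."""
    # Assumption: severity.id 1-5 maps one-to-one onto OCSF severity_id 1-5.
    sev = event.get("severity", {}).get("id")
    return sev if sev in (1, 2, 3, 4, 5) else FALLBACK


notice = {"severity": {"name": "High", "id": 4}}
old_result = severity_id_by_name(notice)
new_result = severity_id_by_id(notice)
```
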
Each schema-category-mapper that defines a fallback must also have a
catch-all filter category at the end matching the fallback's values.
Six mappers were missing the trailing catch-all: notice/alert
severity_id (2004), http activity_id/status_id (4002), dns rcode_id,
and dns status_id (4003). Append `query: "*"` -> Other/99 to each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
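
One way to picture the catch-all rule from the commit above is a first-match category list whose last entry always matches and carries the same values as the fallback. The Python sketch below models that idea under the assumption that categories are evaluated in order; `categorize` and the category tuples are hypothetical, and the HTTP `activity_id` values (Get = 3, Post = 6) are my reading of the OCSF HTTP Activity enum.

```python
FALLBACK = ("Other", 99)


def categorize(event: dict, categories: list) -> tuple:
    """Return (name, id) of the first category whose predicate matches."""
    for predicate, name, value in categories:
        if predicate(event):
            return name, value
    return FALLBACK


HTTP_ACTIVITY = [
    (lambda e: e.get("method") == "GET", "Get", 3),
    (lambda e: e.get("method") == "POST", "Post", 6),
    # Trailing catch-all, the equivalent of query "*": same values as FALLBACK.
    (lambda e: True, "Other", 99),
]

get_result = categorize({"method": "GET"}, HTTP_ACTIVITY)
unknown_result = categorize({"method": "PROPFIND"}, HTTP_ACTIVITY)
```
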

@jbfeldman-dd jbfeldman-dd left a comment

Contributor

Left comments

Comment thread zeek/assets/logs/zeek.yaml Outdated
Comment thread zeek/assets/logs/zeek.yaml Outdated
Comment thread zeek/assets/logs/zeek.yaml
Comment thread zeek/assets/logs/zeek.yaml Outdated
Comment thread zeek/assets/logs/zeek.yaml Outdated
Comment thread zeek/assets/logs/zeek.yaml Outdated
Comment thread zeek/assets/logs/zeek.yaml
Comment thread zeek/assets/logs/zeek.yaml Outdated
Comment on lines +3477 to +3506
- type: string-builder-processor
  name: Stringify tx_hosts
  enabled: true
  template: "%{tx_hosts}"
  target: _tx_hosts_str
  replaceMissing: false
- type: string-builder-processor
  name: Stringify rx_hosts
  enabled: true
  template: "%{rx_hosts}"
  target: _rx_hosts_str
  replaceMissing: false
- type: grok-parser
  name: Extract first IP from tx_hosts
  enabled: true
  source: _tx_hosts_str
  samples:
    - '["10.104.10.60"]'
  grok:
    supportRules: ""
    matchRules: 'g \[?"?%{ip:_tx_host}"?'
- type: grok-parser
  name: Extract first IP from rx_hosts
  enabled: true
  source: _rx_hosts_str
  samples:
    - '["10.104.10.65"]'
  grok:
    supportRules: ""
    matchRules: 'g \[?"?%{ip:_rx_host}"?'

Contributor

This flow is overly complicated. Just use an array-processor to get the first element out of the hosts array. The current implementation is inefficient and generates a bunch of extra intermediate fields.

Contributor Author

Tried this but grok-parser requires a string source - when I pointed it directly at the array tx_hosts/rx_hosts, the IP didn't extract. array-processor type: select requires filter+valueToExtract which only work on object arrays, not primitive string arrays. Kept one stringify intermediate (_tx_hosts_str/_rx_hosts_str) but eliminated the second one - grok now targets ocsf.{src,dst}_endpoint.ip directly. Open to a cleaner pattern if there is one.

Comment thread zeek/assets/logs/zeek.yaml Outdated
Comment thread zeek/assets/logs/zeek.yaml
cepolation-datadog and others added 4 commits May 15, 2026 11:27
Direct mappings, dead-code removal, correctness fixes, and OCSF validator
cleanups across notice, suricata, conn, ssl, weird, http, dns, and file
hosting sub-pipelines:

- Map directly to OCSF targets where intermediates were unnecessary
  (ocsf.time, ocsf.duration, ocsf.traffic.packets, JA3/JA3S algorithm_id,
  weird protocol_name).
- Drop dead/auto-generated mappers: notice/suricata category_uid (set by
  schema-processor), self-maps of finding_info.uid, event_code, file.hashes
  (when unbuilt upstream), suricata community_id correlation_uid, HTTP
  version-as-protocol_ver, DNS direction derivation, and the DNS rcode_id
  catch-all/fallback (recommended-not-required).
- Convert suricata alert.signature_id event_code from string-builder to
  schema-remapper.
- Combine domain/query into single ocsf.query.hostname schema-remapper.
- Fix DNS Activity filters: use rcode_name presence to discriminate
  Response/Query instead of dns.answer.name (handles NXDOMAIN responses).
- DNS status_id catch-all renamed Other/99 -> Unknown/0 to satisfy the
  OCSF validator's suspicious-Other check.
- File Hosting tx_hosts/rx_hosts: drop the second intermediate field;
  grok targets ocsf.{src,dst}_endpoint.ip directly off a single stringify.
- Switch fallback source fields per Jonah's suggestions:
  severity -> severity.name, alert.severity -> alert_severity,
  http status -> status_msg, dns rcode/status -> rcode_name.
- Notice fixture: use id.orig_h/id.resp_h connection fields instead of
  the suricata-style src.

Regenerated zeek_tests.yaml with the OCSF validator (--check-all --write).
All 14 logs pass validation with no errors or warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Use two array-processors to wrap each Zeek `answers` string into a
dns_answer object and append to ocsf.answers: the first selects the
first array element into ocsf.answer.rdata, the second appends
ocsf.answer onto ocsf.answers. Only the first answer is captured (the
pipeline DSL has no per-element iteration), but that covers the common
single-A-record case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous array-processor type:select required operation.filter and
operation.valueToExtract per the asset validator, but those only apply
to object arrays - Zeek's `answers` is a primitive string array. Switch
to string-builder + grok-parser to extract the first answer string into
ocsf.answer.rdata, then keep the array-processor append to wrap it into
ocsf.answers as a dns_answer object.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
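
The net effect of the two DNS answers commits above, regardless of which processors do the extraction, is that only the first entry of Zeek's `answers` array ends up wrapped in a single-element `ocsf.answers` array. A rough Python model of that outcome follows; `wrap_first_answer` is a hypothetical helper, and returning an empty list when there are no answers is an assumption.

```python
def wrap_first_answer(record: dict) -> list:
    """Wrap the first Zeek `answers` entry in a dns_answer-style object."""
    answers = record.get("answers") or []
    if not answers:
        return []  # assumption: nothing is appended when there are no answers
    answer = {"rdata": answers[0]}  # stands in for ocsf.answer
    return [answer]  # stands in for the append onto ocsf.answers


dns_event = {"query": "example.com", "answers": ["93.184.216.34", "93.184.216.35"]}
ocsf_answers = wrap_first_answer(dns_event)
```

Later answers are dropped, which matches the limitation the commit message calls out.
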
@cepolation-datadog

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 501511a3ed


Comment thread zeek/assets/logs/zeek.yaml Outdated
Comment thread zeek/assets/logs/zeek.yaml Outdated
- Include `files_red` in the File Hosting [6006] sub-pipeline filter so
  redacted file events get OCSF class_uid/activity_id/file fields, not
  just the pre-transform metadata.
- Prefer `filename` over `fuid` when populating `ocsf.file.name`; fall
  back to `fuid` only when `filename` is absent. The `fuid` mapping to
  `ocsf.file.uid` is unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@jbfeldman-dd jbfeldman-dd left a comment

Contributor

A few outstanding issues around intermediate fields and grok parsers

Comment thread zeek/assets/logs/zeek.yaml Outdated
name: Stringify answers
enabled: true
template: "%{answers}"
target: _answers_str

Contributor

This still leaves _answers_str as a leftover field in the output. This can be fixed by setting target here to ocsf.answer and then using that as the source in the grok-parser.

Comment thread zeek/assets/logs/zeek.yaml Outdated
Comment thread zeek/assets/logs/zeek.yaml
Comment thread zeek/assets/logs/zeek.yaml
Comment thread zeek/assets/logs/zeek.yaml Outdated
Comment on lines +3434 to +3443
    matchRules: 'g \[?"?%{ip:ocsf.src_endpoint.ip}"?'
- type: grok-parser
  name: Extract first IP from rx_hosts
  enabled: true
  source: _rx_hosts_str
  samples:
    - '["10.104.10.65"]'
  grok:
    supportRules: ""
    matchRules: 'g \[?"?%{ip:ocsf.dst_endpoint.ip}"?'

Contributor

These match rules won't correctly parse multiple IPs. Use the grok rule g %{ip:ocsf.src_endpoint.ip}(,%{data})? instead.
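
The difference between the two rules can be approximated with Python regular expressions. This is a sketch under stated assumptions, not a reproduction of the grok engine: it assumes a rule has to consume the whole input (modeled with `fullmatch`), that a multi-valued host array is stringified as a comma-separated list, and it uses a simplified IPv4-only pattern in place of the grok `ip` matcher.

```python
import re

# Simplified stand-in for the grok ip matcher (IPv4 only, no range checks).
IP = r"(?:\d{1,3}\.){3}\d{1,3}"

# Approximation of the original rule: \[?"?%{ip:...}"?
ORIGINAL = re.compile(rf'\[?"?(?P<ip>{IP})"?')
# Approximation of the suggested rule: %{ip:...}(,%{data})?
SUGGESTED = re.compile(rf"(?P<ip>{IP})(,.*)?")


def first_ip(pattern: re.Pattern, text: str):
    """Return the captured IP, or None when the rule does not cover the input."""
    match = pattern.fullmatch(text)  # assumption: the rule must match end to end
    return match.group("ip") if match else None


single = "10.104.10.60"
multiple = "10.104.10.60,10.104.10.61"

original_on_multiple = first_ip(ORIGINAL, multiple)
suggested_on_multiple = first_ip(SUGGESTED, multiple)
```

Under these assumptions the original rule handles a single host but yields nothing once a second host is present, while the suggested rule captures the first host in both cases.
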

Comment thread zeek/assets/logs/zeek.yaml
Comment thread zeek/assets/logs/zeek.yaml Outdated
name: Set is_alert to boolean true
enabled: true
template: "true"
target: _is_alert_str

Contributor

Set directly to ocsf.is_alert to prevent creation of extra fields

- is_alert (notice 2004, suricata 2004): string-builder writes directly
  to `ocsf.is_alert`; grok-parser converts in place. Drops the
  `_is_alert_str` intermediate.
- DNS answers: stringify directly into `ocsf.answer`; grok extracts
  `ocsf.answer.rdata` via `a %{data:ocsf.answer.rdata}(,%{data})?` so
  the comma-separated multi-IP form parses correctly. Drops the
  `_answers_str` intermediate.
- File Hosting tx/rx hosts: stringify directly into
  `ocsf.{src,dst}_endpoint`; grok extracts `.ip` via
  `g %{ip:ocsf.{src,dst}_endpoint.ip}(,%{data})?` for multi-IP. Drops
  the `_tx_hosts_str`/`_rx_hosts_str` intermediates.
- Connection 4001: arithmetic-processor writes total bytes directly to
  `ocsf.traffic.bytes`; the schema-processor remapper becomes a
  self-map. Drops the `_total_bytes` intermediate (matches the
  earlier _total_packets/_duration_ms cleanup).
- Restore `ocsf.file.hashes`: build `tmp_md5`/`tmp_sha1`/`tmp_sha256`
  fingerprint objects (algorithm name, integer algorithm_id, value),
  array-processor append each into `ocsf.file.hashes`, and self-map
  the array inside the 6006 schema-processor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dd-octo-sts

dd-octo-sts Bot commented May 18, 2026

Contributor

Validation Report

All 20 validations passed.

| Validation | Description | Status |
|---|---|---|
| agent-reqs | Verify check versions match the Agent requirements file | Passed |
| ci | Validate CI configuration and Codecov settings | Passed |
| codeowners | Validate every integration has a CODEOWNERS entry | Passed |
| config | Validate default configuration files against spec.yaml | Passed |
| dep | Verify dependency pins are consistent and Agent-compatible | Passed |
| http | Validate integrations use the HTTP wrapper correctly | Passed |
| imports | Validate check imports do not use deprecated modules | Passed |
| integration-style | Validate check code style conventions | Passed |
| jmx-metrics | Validate JMX metrics definition files and config | Passed |
| labeler | Validate PR labeler config matches integration directories | Passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | Passed |
| license-headers | Validate Python files have proper license headers | Passed |
| licenses | Validate third-party license attribution list | Passed |
| metadata | Validate metadata.csv metric definitions | Passed |
| models | Validate configuration data models match spec.yaml | Passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | Passed |
| package | Validate Python package metadata and naming | Passed |
| readmes | Validate README files have required sections | Passed |
| saved-views | Validate saved view JSON file structure and fields | Passed |
| version | Validate version consistency between package and changelog | Passed |

@cepolation-datadog

Contributor Author

/merge

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 Bot commented May 18, 2026


View all feedbacks in Devflow UI.

2026-05-18 17:55:41 UTC ℹ️ Start processing command /merge


2026-05-18 17:55:48 UTC ℹ️ MergeQueue: waiting for PR to be ready

This pull request is not mergeable according to GitHub. Common reasons include pending required checks, missing approvals, or merge conflicts — but it could also be blocked by other repository rules or settings.
It will be added to the queue as soon as checks pass and/or get approvals. View in MergeQueue UI.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.


2026-05-18 17:58:52 UTC ℹ️ MergeQueue: This merge request was already merged

This pull request was merged directly.

@cepolation-datadog cepolation-datadog added this pull request to the merge queue May 18, 2026
Merged via the queue into master with commit b4f366a May 18, 2026
58 checks passed
@cepolation-datadog cepolation-datadog deleted the andy.anske/ocsf-transformations-zeek branch May 18, 2026 17:58
@dd-octo-sts dd-octo-sts Bot added this to the 7.81.0 milestone May 18, 2026