pipelines: drop 16 transform_ocsf entries with first-party ingestion paths #63

Merged
nate-smalls-s1 merged 3 commits into Sentinel-One:main from natesmalley:transform-ocsf-platform-overlap-cleanup
Apr 27, 2026

@natesmalley
Contributor

Summary

Scope refinement for pipelines/community/transform_ocsf/. Removes 16 directories whose vendors are typically ingested into AI SIEM via first-party or vendor-native paths in supported deployments rather than via community-contributed Observo transforms. The community pipelines directory is intended for vendors that require contributor-authored parsing and OCSF mapping.

Removed entries

  • aws_guardduty_logs/
  • aws_waf/
  • azure_ad/ (legacy name for Microsoft Entra ID; removed alongside microsoft_entra_logs/ to avoid leaving the same product under two paths)
  • azure_platform/
  • cisco_duo/
  • darktrace_darktrace_logs/
  • microsoft_defender_for_cloud/
  • microsoft_entra_logs/
  • microsoft_eventhub_azure_signin_logs/
  • microsoft_eventhub_defender_email_logs/
  • microsoft_eventhub_defender_emailforcloud_logs/
  • netskope/
  • proofpoint/
  • snyk/
  • tenable_vulnerability_management_audit_logging/
  • wiz_cloud_security_logs/

Why these specifically

Each removed entry was previously signed_off and functional — this is a scope refinement, not a quality fix. The criterion is "is there a typical first-party / vendor-native ingestion path users rely on for this vendor's data?" rather than any defect in the transforms themselves.

This is distinct from the prior cleanup PRs (#60 and #62), which removed entries that were broken (F-grade, analyzer_limit, no OCSF class produced). Those entries were dropped on quality grounds; the entries in this PR are dropped on scope grounds.

What is NOT in this PR (intentional)

  • No serializer logic changed in any surviving entry.
  • No surviving entry's metadata, sample, or pipeline JSON changed.
  • No directory renames or migrations — that work is the next PR (migration of remaining transform_ocsf/ entries into the push/pull/<mode>/<vendor>/<product>/ taxonomy).
  • m365_audit_logs/ and microsoft_365_mgmt_api_logs/ are retained — they cover Microsoft 365 audit/management API surfaces that are not first-party ingestion paths in supported deployments.
  • azure_logs/, azure_nsg_flow_logs/, iis_w3c/, microsoft_activedirectory_logs/, windows_event_log_logs/ are retained — these are general Azure Monitor, NSG flow exports to storage, on-prem IIS, on-prem Active Directory, and Windows Event Log ingestion paths that customers configure manually.

Recovery

Each removed entry can be located with git log --diff-filter=D --name-only and restored from git history if a deployment specifically requires the community transform.
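
As a rough illustration of the lookup step, the sketch below groups deleted paths by entry directory. It assumes the output of `git log --diff-filter=D --name-only --format=` has already been captured as text; the helper name `deleted_entries` and the sample paths are hypothetical and not part of the repository.

```python
from collections import defaultdict
from typing import Dict, List

PREFIX = "pipelines/community/transform_ocsf/"


def deleted_entries(name_only_output: str, prefix: str = PREFIX) -> Dict[str, List[str]]:
    """Group deleted file paths by their entry directory under `prefix`."""
    entries: Dict[str, List[str]] = defaultdict(list)
    for line in name_only_output.splitlines():
        path = line.strip()
        if not path.startswith(prefix):
            continue  # skip blank lines and paths outside the community directory
        entry, sep, filename = path[len(prefix):].partition("/")
        if sep and filename:
            entries[entry].append(filename)
    return dict(entries)


# Illustrative input only; real input would come from the git command above.
sample = """\
pipelines/community/transform_ocsf/snyk/metadata.yaml
pipelines/community/transform_ocsf/snyk/serializer.lua

some/other/path.txt
pipelines/community/transform_ocsf/netskope/sample.json
"""
found = deleted_entries(sample)
print(sorted(found))  # ['netskope', 'snyk']
```

Once the deleting commit is known, `git checkout <commit>^ -- <path>` restores a directory from that commit's parent; `<commit>` stands for whichever hash `git log` reports.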

Test plan

  • CI passes (CodeQL, secret scanning, contributor automation)
  • git log --stat shows exactly 64 file deletions across 16 directories (metadata.yaml + <name>.json + sample.json + serializer.lua per directory)
  • No content outside the 16 listed directories and CHANGELOG.md is modified
  • No other repo content references the removed paths
  • Surviving transform_ocsf/ entries continue to render cleanly on github.com
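
The deletion-count item in the test plan can be checked mechanically. The following is a minimal sketch, assuming the list of deleted paths is already available as strings; `check_deletions` and `expected_files` are hypothetical helpers, and the paths in the demo are synthesized from the list of removed entries rather than read from git.

```python
from typing import Dict, Iterable, List, Set

PREFIX = "pipelines/community/transform_ocsf/"

# The 16 entries listed under "Removed entries".
REMOVED = [
    "aws_guardduty_logs", "aws_waf", "azure_ad", "azure_platform",
    "cisco_duo", "darktrace_darktrace_logs", "microsoft_defender_for_cloud",
    "microsoft_entra_logs", "microsoft_eventhub_azure_signin_logs",
    "microsoft_eventhub_defender_email_logs",
    "microsoft_eventhub_defender_emailforcloud_logs", "netskope",
    "proofpoint", "snyk", "tenable_vulnerability_management_audit_logging",
    "wiz_cloud_security_logs",
]


def expected_files(entry: str) -> Set[str]:
    # Per the test plan: metadata.yaml + <name>.json + sample.json + serializer.lua
    return {"metadata.yaml", f"{entry}.json", "sample.json", "serializer.lua"}


def check_deletions(paths: Iterable[str], entries: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the deletions match."""
    wanted = set(entries)
    seen: Dict[str, Set[str]] = {}
    problems: List[str] = []
    for path in paths:
        if not path.startswith(PREFIX):
            problems.append(f"outside scope: {path}")
            continue
        entry, _, filename = path[len(PREFIX):].partition("/")
        if entry not in wanted:
            problems.append(f"unexpected entry: {path}")
            continue
        seen.setdefault(entry, set()).add(filename)
    for entry in sorted(wanted):
        got = seen.get(entry, set())
        if got != expected_files(entry):
            problems.append(f"{entry}: got {sorted(got)}")
    return problems


# Synthesized stand-in for the real deletion list: 16 directories x 4 files.
paths = [f"{PREFIX}{e}/{f}" for e in REMOVED for f in expected_files(e)]
print(len(paths), check_deletions(paths, REMOVED))  # 64 []
```

Returning a list of problems rather than a boolean keeps the check useful when something is off, since each message names the offending path or entry.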

Nate Smalley and others added 2 commits April 26, 2026 22:07
…paths

Removes 16 directories from pipelines/community/transform_ocsf/ for vendors
whose log streams are typically delivered to AI SIEM via first-party or
vendor-native ingestion paths in supported deployments, rather than via
community-contributed Observo transforms.

Removed:

  aws_guardduty_logs/
  aws_waf/
  azure_ad/
  azure_platform/
  cisco_duo/
  darktrace_darktrace_logs/
  microsoft_defender_for_cloud/
  microsoft_entra_logs/
  microsoft_eventhub_azure_signin_logs/
  microsoft_eventhub_defender_email_logs/
  microsoft_eventhub_defender_emailforcloud_logs/
  netskope/
  proofpoint/
  snyk/
  tenable_vulnerability_management_audit_logging/
  wiz_cloud_security_logs/

(azure_ad/ is the legacy name for Microsoft Entra ID and is removed alongside
microsoft_entra_logs/ to avoid leaving the same product under two paths.)

Each removed entry was previously signed_off and functional, so removal is a
scope refinement rather than a quality fix. The community pipelines directory
is intended for vendors that require contributor-authored parsing and OCSF
mapping; entries where users typically rely on a vendor-native or first-party
ingestion path are out of scope. Anyone who specifically needs a community
transform for one of these vendors can recover it from git history.

No serializer logic, no other metadata, and no surviving entries are modified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cope)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nate-smalls-s1 merged commit 97eb4eb into Sentinel-One:main on Apr 27, 2026
2 checks passed
nate-smalls-s1 pushed a commit that referenced this pull request Apr 27, 2026
Moves 91 community pipeline directories from
pipelines/community/transform_ocsf/<name>/ into the ingest-mode-first
taxonomy introduced in #59:

  pipelines/push/syslog/<vendor>/<product>/      57 entries
  pipelines/pull/api/<vendor>/<product>/         29 entries
  pipelines/pull/object_store/<vendor>/<product>/  5 entries

The mode bucket is determined by each entry's ingest_mode field (backfilled
in #61). The vendor and product split is derived per entry from the
upstream parser binding and vendor/product convention; collisions across
the cluster (Cisco Meraki, Fortinet, Cloudflare, Zscaler, Microsoft, etc.)
are disambiguated with explicit product-name overrides documented in
.reorg-prep/inventory/transform_ocsf_migration_plan.tsv.
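
The path derivation described above might look roughly like the sketch below. The bucket paths come from this message; the ingest_mode strings used as keys, the `target_dir` helper, and the example vendor and product names are assumptions made for illustration, not values confirmed from metadata.yaml or the migration plan TSV.

```python
from typing import Dict, Optional

# Bucket directories named in the commit message. The keys are assumed
# ingest_mode values; the real field may use different strings.
MODE_BUCKETS: Dict[str, str] = {
    "syslog": "pipelines/push/syslog",
    "hec": "pipelines/push/hec",
    "api": "pipelines/pull/api",
    "object_store": "pipelines/pull/object_store",
}

# Entry counts per bucket as stated in the commit message.
COUNTS = {"syslog": 57, "api": 29, "object_store": 5}


def target_dir(ingest_mode: str, vendor: str, product: str,
               product_override: Optional[str] = None) -> str:
    """Build the destination directory for one entry.

    `product_override` models the explicit product-name overrides used to
    disambiguate collisions; it replaces `product` when given.
    """
    bucket = MODE_BUCKETS[ingest_mode]  # raises KeyError on an unknown mode
    return f"{bucket}/{vendor}/{product_override or product}/"


print(target_dir("syslog", "examplevendor", "firewall"))
# pipelines/push/syslog/examplevendor/firewall/
print(sum(COUNTS.values()))  # 91
```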

History is preserved on every entry (git mv).

What stays in pipelines/community/transform_ocsf/ (15 entries):
  - Generic / template / unknown-vendor entries: agent_metrics_logs,
    generic_access_logs, inngate_gateway_logs, json_generic_logs,
    json_nested_kv_logs, leef_template_logs, log4shell_detection_logs,
    mail_server_logs, microservice_tracing_logs, sample_test_logs,
    spam_detection_logs, sql_database_logs, syslog_space_delimited_logs,
    vpc_logs, jruby_application_logs.

What is NOT in this PR (intentional):
  - 23 entries scheduled for removal in #62 (broken-legacy, 7) and #63
    (first-party ingestion paths, 16) are NOT moved; they remain in
    transform_ocsf/ until those PRs merge. This PR has no overlap or
    conflict with #62/#63 -- merge order does not matter.
  - No serializer logic, no metadata.yaml content, and no pipeline JSON
    content was modified. Every change is a directory rename.
  - No naming-consistency cleanup (e.g., paloalto_* -> palo_alto/*) is
    applied yet; that is a separate follow-up.

The pipelines/push/{syslog,hec}/ and pipelines/pull/{api,object_store}/
directories are now populated -- the empty scaffolding from #59 finally
has content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>