Skip to content

Fix graph orphans for omitted event types#3038

Open
liquidsec wants to merge 4 commits into
devfrom
fix-graph-orphan
Open

Fix graph orphans for omitted event types#3038
liquidsec wants to merge 4 commits into
devfrom
fix-graph-orphan

Conversation

@liquidsec
Copy link
Copy Markdown
Contributor

Summary

Closes #1151

_is_graph_important() was rejecting events with _omit=True even when they were marked as _graph_important=True. This caused graph orphans in JSON/neo4j output when an omitted event type (e.g. a JS URL_UNVERIFIED) was the parent of a non-omitted event (e.g. DNS_NAME).

The _omit check in _is_graph_important() was redundant — output modules without _preserve_graph (stdout, txt, csv) already return False from _is_graph_important() because self.preserve_graph is False. Only modules with _preserve_graph=True (json, neo4j) use graph importance, and those are exactly the modules that need to accept omitted parents to prevent orphans.

Fix

Remove and not getattr(event, "_omit", False) from _is_graph_important().

Test

Added test verifying that a _preserve_graph output module accepts graph-important omitted events, while a non-_preserve_graph module still rejects them.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 14, 2026

📊 Performance Benchmark Report

Comparing dev (baseline) vs fix-graph-orphan (current)

📈 Detailed Results (All Benchmarks)

📋 Complete results for all benchmarks - includes both significant and insignificant changes

🧪 Test Name 📏 Base 📏 Current 📈 Change 🎯 Status
Bloom Filter Dns Mutation Tracking Performance 4.22ms 4.20ms -0.4%
Bloom Filter Large Scale Dns Brute Force 17.37ms 17.39ms +0.1%
Large Closest Match Lookup 356.65ms 343.89ms -3.6%
Realistic Closest Match Workload 188.22ms 191.08ms +1.5%
Event Memory Medium Scan 1784 B/event 1779 B/event -0.3%
Event Memory Large Scan 1768 B/event 1766 B/event -0.1%
Event Validation Full Scan Startup Small Batch 410.38ms 420.44ms +2.5%
Event Validation Full Scan Startup Large Batch 571.95ms 570.77ms -0.2%
Make Event Autodetection Small 30.85ms 30.38ms -1.5%
Make Event Autodetection Large 317.25ms 312.97ms -1.4%
Make Event Explicit Types 13.88ms 13.68ms -1.4%
Excavate Single Thread Small 3.997s 3.967s -0.7%
Excavate Single Thread Large 9.649s 9.530s -1.2%
Excavate Parallel Tasks Small 4.161s 4.156s -0.1%
Excavate Parallel Tasks Large 6.618s 6.595s -0.4%
Is Ip Performance 3.18ms 3.15ms -1.1%
Make Ip Type Performance 11.74ms 11.57ms -1.4%
Mixed Ip Operations 4.50ms 4.52ms +0.3%
Memory Use Web Crawl 684.4 MB 632.7 MB -7.6%
Memory Use Subdomain Enum 33.3 MB 33.3 MB +0.0%
Memory Use Deep Chain 7.8 MB 7.8 MB +0.0%
Memory Use Parallel Chains 23.0 MB 22.0 MB -4.5%
Scan Throughput 100 4.150s 4.049s -2.4%
Scan Throughput 1000 31.810s 31.935s +0.4%
Typical Queue Shuffle 62.82µs 64.09µs +2.0%
Priority Queue Shuffle 725.27µs 715.46µs -1.4%

🎯 Performance Summary

No significant performance changes detected (all changes <10%)


🐍 Python Version 3.11.15

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 95.23810% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 90%. Comparing base (e895bc0) to head (a191047).

Files with missing lines Patch % Lines
.../test/test_step_2/module_tests/test_module_json.py 96% 1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #3038   +/-   ##
=====================================
- Coverage     90%     90%   -0%     
=====================================
  Files        446     446           
  Lines      38377   38398   +21     
=====================================
+ Hits       34319   34336   +17     
- Misses      4058    4062    +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@TheTechromancer
Copy link
Copy Markdown
Collaborator

_is_graph_important() was rejecting events with _omit=True even when they were marked as _graph_important=True

This was intended functionality. graph importance is meant to override internal events, but not omitted ones. omitted ones should always be excluded from output, and that should never create orphans because the omitted event is skipped, preserving the parent chain.

The mystery is why JS URLs specifically seem to be slipping through this system.

@liquidsec liquidsec changed the base branch from 3.0 to dev May 12, 2026 20:44
@liquidsec
Copy link
Copy Markdown
Contributor Author

Need to add additional tests to try to replicate the original bug to understand if its actually been fixed or not

@liquidsec
Copy link
Copy Markdown
Contributor Author

_is_graph_important() was rejecting events with _omit=True even when they were marked as _graph_important=True

This was intended functionality. graph importance is meant to override internal events, but not omitted ones. omitted ones should always be excluded from output, and that should never create orphans because the omitted event is skipped, preserving the parent chain.

The mystery is why JS URLs specifically seem to be slipping through this system.

_is_graph_important() was rejecting events with _omit=True even when they were marked as _graph_important=True

This was intended functionality. graph importance is meant to override internal events, but not omitted ones. omitted ones should always be excluded from output, and that should never create orphans because the omitted event is skipped, preserving the parent chain.

The mystery is why JS URLs specifically seem to be slipping through this system.

So it turns out the "fix" i added never mattered... The _omit=True AND _graph_important=True combination the fix targets is unreachable through normal scan flow.

I think this was fixed somewhere else since the issue was opened.

I reverted the fix but added a regression test to make sure we don't enter this state again.

The original change in this PR allowed _omit=True events into output if they
were also _graph_important. After tracing every assignment, that combination
is structurally unreachable through normal scan flow — _graph_important is
only ever set in manager.py:254 against event.get_parent(), which by design
skips _omit parents. The black-box reproducer (TestJSONGraphOrphanJSURL)
walks a real JS URL chain end-to-end and confirms no parent_uuid orphans
appear in output.json with or without the change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants