feat(ScanPptx): Add comprehensive CLSID database and active content detection #149

Enhance ScanPptx with 199-entry CLSID database for ActiveX/OLE object identification and expanded threat detection capabilities.

Changes:
- Add 199 CLSIDs (137 oletools + 62 MaxScript)
- Add ActiveX control detection from ppt/activeX/ directory
- Add hover action detection (in addition to click actions)
- Add ppaction:// URI parsing (program, macro, ole verbs)
- Add remote template detection
- Add presentation/slide-level relationship tracking
- Add OLE object metadata extraction with risk classification
- Add 8 new detection flags (has_ppaction_program, has_remote_template, etc.)

Security:
- No code execution (read-only analysis)
- XXE protected (lxml 4.9.1+ default protection)
- Comprehensive exception handling
- Scanner timeout protection
    # Extract actions (click and hover)
    try:
        shape_actions = _extract_shape_actions(shape, slide_num)
I know that extracting these shape actions also extracts URLs, but are these the same URLs that will be in click_actions, or do we need to update urls with whatever this shape action function extracts?
Merge URLs from all extraction methods into urls[] array:
- Click action URLs (existing python-pptx API method)
- Hover action and ppaction:// URLs (from actions array)
- Text hyperlink URLs (from relationships array)

This addresses PR feedback to ensure all extracted URLs are available in the urls[] field for backward compatibility and comprehensive coverage. Verified with test scans showing successful URL extraction from all sources.
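A minimal sketch of the merge this commit describes, assuming each extraction path yields a list of URL strings. The helper `merge_urls` and the three sample lists are hypothetical and are not taken from the PR.

```python
from typing import Iterable, List


def merge_urls(*sources: Iterable[str]) -> List[str]:
    """Merge URL lists, dropping empty values and duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for source in sources:
        for url in source:
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical results from the three extraction paths named in the commit.
click_urls = ["https://example.com/report"]
action_urls = ["https://example.com/report", "ppaction://program?file=calc.exe"]
relationship_urls = ["https://example.com/template"]

urls = merge_urls(click_urls, action_urls, relationship_urls)
```

Keeping first-seen order means consumers that already read click action URLs from `urls` see them in the same position as before.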
def _parse_ppaction(action_url):
    """
    Parse ppaction:// URLs to extract verb and query parameters.

    Args:
        action_url: String like "ppaction://program?file=malware.exe"

    Returns:
        dict with 'ppaction_url', 'verb', and 'fields'
    """
    if not action_url or not action_url.startswith("ppaction://"):
        return {"ppaction_url": None, "verb": None, "fields": {}}

    parsed = urlparse(action_url)
    return {
        "ppaction_url": action_url,
        "verb": (parsed.netloc or None),
        "fields": {
            k: (v[0] if len(v) == 1 else v)
            for k, v in parse_qs(parsed.query).items()
        },
    }
I wonder if this is better handled as part of the URL parsing that happens outside of Strelka?
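For reference, the split that `_parse_ppaction` relies on can be checked in isolation. The function body below follows the diff hunk; the import line and the sample inputs are assumptions added for illustration.

```python
from urllib.parse import parse_qs, urlparse


def _parse_ppaction(action_url):
    """Split a ppaction:// URL into its verb and query parameters."""
    if not action_url or not action_url.startswith("ppaction://"):
        return {"ppaction_url": None, "verb": None, "fields": {}}

    # urlparse treats the text after "//" as the network location,
    # which for ppaction:// URLs is the verb (program, macro, ole, ...).
    parsed = urlparse(action_url)
    return {
        "ppaction_url": action_url,
        "verb": (parsed.netloc or None),
        "fields": {
            # parse_qs returns lists; unwrap single-valued parameters.
            k: (v[0] if len(v) == 1 else v)
            for k, v in parse_qs(parsed.query).items()
        },
    }


# Illustrative inputs only.
program = _parse_ppaction("ppaction://program?file=malware.exe")
jump = _parse_ppaction("ppaction://hlinksldjump")
other = _parse_ppaction("https://example.com/")
```

Since the parsing uses only the standard library, the same split could also be done by a downstream URL parser if the scanner exposed the raw `ppaction://` string.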
    ]
    is_dangerous = any(indicator in desc for indicator in dangerous_indicators)

    return (desc, is_dangerous)
I think this type of logic belongs in MQL. Expose the CLSIDs and write MQL for the "dangerous" ones as an ASR rule.
This ensures we aren't placing an opinion in the backend systems; the opinion lives in the detection logic instead.
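One way to read this suggestion: the scanner only maps a CLSID to a description and leaves any risk judgment to detection rules. The sketch below assumes a plain dict lookup; `CLSID_DESCRIPTIONS` and `lookup_clsid` are hypothetical names, and the single entry is the one shown in this PR's sample output.

```python
from typing import Optional

# Hypothetical one-entry table; the PR describes a 199-entry database.
CLSID_DESCRIPTIONS = {
    "8BD21D10-EC42-11CE-9E0D-00AA006002F3": "Microsoft Forms 2.0 TextBox",
}


def lookup_clsid(clsid: str) -> Optional[str]:
    """Return a description for a CLSID, or None if it is unknown.

    No risk verdict is returned; that decision is left to detection rules.
    """
    # Normalize braces, surrounding whitespace, and case before the lookup.
    key = clsid.strip().strip("{}").upper()
    return CLSID_DESCRIPTIONS.get(key)


known = lookup_clsid("{8bd21d10-ec42-11ce-9e0d-00aa006002f3}")
unknown = lookup_clsid("{00000000-0000-0000-0000-000000000000}")
```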
HIGH_RISK_PROGIDS = {
    "Shell.Explorer",  # Web browser control
    "WScript.Shell",  # Shell execution
    "Shell.Application",  # Shell application
    "WScript.Network",  # Network access
    "Scripting.FileSystemObject",  # File system access
}
While I don't disagree that these are high risk, I think the general approach should be to expose them in the returned object and write MQL as the detection element that applies the opinion.
    self.event["has_ppaction_ole"] = any(
        a["verb"] == "ole" for a in all_actions
    )
If these "questions" (does this pptx have a ppaction program?, etc.) can be answered by looking through self.event['actions'] (which gets returned to MQL), do the flags need to be part of the response?
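To illustrate the question: if every entry in `actions` carries a `verb`, each boolean flag is a one-line derivation that a consumer could compute itself. The helper `has_verb` and the sample `actions` list below are hypothetical.

```python
from typing import Any, Dict, List


def has_verb(actions: List[Dict[str, Any]], verb: str) -> bool:
    """Return True if any action dict carries the given ppaction verb."""
    return any(action.get("verb") == verb for action in actions)


# Hypothetical actions array as it might appear in the scanner event.
actions = [
    {"verb": "program", "fields": {"file": "calc.exe"}},
    {"verb": None, "fields": {}},
]

has_program = has_verb(actions, "program")
has_ole = has_verb(actions, "ole")
```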
Three pytest.mark.integration tests verify the full Redis data-flow: single-file two-scanner event assembly, child-file FIN guard, and timed-out scanner result injection. Tests skip automatically when Redis is unavailable.
…native stream extraction
- Remove HIGH_RISK_PROGIDS and is_high_risk opinion logic (belongs in MQL)
- Remove has_high_risk_ole event flag
- _lookup_clsid now returns str|None instead of (str, bool) tuple
- Rename action dict 'fields' -> 'params' for ppaction parsed parameters
- Add action_type field ('ppaction' or 'hyperlink') to all action dicts
- Add _normalize_pptx_bytes(): coerces .ppsx/.ppsm/.potx/.potm content
types so python-pptx can parse them; strips malformed <Relationship>
elements missing Target attribute with malformed_relationships_N flag
- Add _extract_ole_native_info(): surfaces filename, src_path, temp_path,
actual_size, is_link from Ole10Native stream in embedded OLE Package objects
- Add 31 unit tests covering ppaction parsing, CLSID lookup, content type
normalization, malformed rel handling, and OLE native stream extraction
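A rough sketch of the content type coercion described in this commit, assuming it amounts to rewriting the main-part content type in `[Content_Types].xml` before handing the bytes to python-pptx. This is not the PR's `_normalize_pptx_bytes()`: the function name `normalize_content_types` and the mapping table are assumptions, and the malformed `<Relationship>` stripping is not shown.

```python
import io
import zipfile

# Assumed mapping from slideshow/template main-part content types to the
# plain presentation types.
_PRESENTATION = (
    "application/vnd.openxmlformats-officedocument."
    "presentationml.presentation.main+xml"
)
_PRESENTATION_MACRO = (
    "application/vnd.ms-powerpoint.presentation.macroEnabled.main+xml"
)
_CONTENT_TYPE_MAP = {
    "application/vnd.openxmlformats-officedocument."
    "presentationml.slideshow.main+xml": _PRESENTATION,
    "application/vnd.openxmlformats-officedocument."
    "presentationml.template.main+xml": _PRESENTATION,
    "application/vnd.ms-powerpoint.slideshow.macroEnabled.main+xml":
        _PRESENTATION_MACRO,
    "application/vnd.ms-powerpoint.template.macroEnabled.main+xml":
        _PRESENTATION_MACRO,
}


def normalize_content_types(data: bytes) -> bytes:
    """Return a copy of the package with variant content types rewritten."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, zipfile.ZipFile(
        out, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for item in src.infolist():
            payload = src.read(item.filename)
            if item.filename == "[Content_Types].xml":
                text = payload.decode("utf-8")
                for old, new in _CONTENT_TYPE_MAP.items():
                    text = text.replace(old, new)
                payload = text.encode("utf-8")
            # All other parts are copied through unchanged.
            dst.writestr(item.filename, payload)
    return out.getvalue()
```

Rewriting in memory keeps the scan read-only with respect to the submitted file; only the copy passed to the parser is changed.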
Describe the change
This PR transforms ScanPptx from a basic metadata extractor (119 lines) into a comprehensive PowerPoint threat detection scanner (895 lines, +776 lines added).
Summary of Changes
Original Scanner (PR #147, "New ScanPptx scanner"):
Enhanced Scanner:
Dependencies
Issues Fixed
Motivation
PowerPoint files are a common malware delivery vector (CVE-2017-0199, CVE-2017-11882). The original scanner only extracted metadata. This enhancement adds comprehensive threat detection including:
Describe testing procedures
Test Dataset
Test Results
Reproduction Steps
Test 1: ActiveX Detection

    # Scan a file with Microsoft Forms 2.0 controls
    ./strelka-fileshot -c fileshot.yaml test_activex.pptx
    # Expected: Detects TextBox, Label, CommandButton with CLSID descriptions

Test 2: CLSID Identification

    # Before enhancement: Unknown CLSID returned null
    # After enhancement: All CLSIDs identified with descriptions
    # Verify CLSID database coverage
    grep -c "': '" src/python/strelka/scanners/scan_pptx.py
    # Returns: 199

Test 3: Threat Detection

    # Test ppaction:// URI parsing
    # File with ppaction://program?file=calc.exe
    # Expected output: has_ppaction_program: true

Test 4: Large File Handling

    # Test 38MB PPTX file
    ./strelka-fileshot -c fileshot.yaml large_presentation.pptx
    # Expected: Completes within scanner timeout (150s), no crashes
Security Testing
Test Files Available
Can provide sample files upon request for maintainer testing.
Sample output
Before (Original Scanner)
{
"scan": {
"scan_pptx": {
"author": "John Doe",
"title": "Quarterly Report",
"slide_count": 15,
"word_count": 842,
"urls": ["https://example.com/report"]
}
}
}
After (Enhanced Scanner)
{
"scan": {
"scan_pptx": {
"author": "John Doe",
"title": "Quarterly Report",
"slide_count": 15,
"word_count": 842,
}
}
Real-World Example: CLSID Detection Improvement
File: d56fd2074f6b327ea2917de9e4260c8cd03baf6e002239f1f486d603e3416b42.ppt
BEFORE:
"classid": "{8BD21D10-EC42-11CE-9E0D-00AA006002F3}",
"clsid_desc": null
AFTER:
"classid": "{8BD21D10-EC42-11CE-9E0D-00AA006002F3}",
"clsid_desc": "Microsoft Forms 2.0 TextBox"
Checklist