feat(ScanPptx): Add comprehensive CLSID database and active content d… by wmetcalf · Pull Request #149 · sublime-security/strelka

wmetcalf · 2026-01-19T23:50:03Z

…etection

Enhance ScanPptx with 199-entry CLSID database for ActiveX/OLE object identification and expanded threat detection capabilities.

Changes:

Add 199 CLSIDs (137 oletools + 62 MaxScript)
Add ActiveX control detection from ppt/activeX/ directory
Add hover action detection (in addition to click actions)
Add ppaction:// URI parsing (program, macro, ole verbs)
Add remote template detection
Add presentation/slide-level relationship tracking
Add OLE object metadata extraction with risk classification
Add 8 new detection flags (has_ppaction_program, has_remote_template, etc.)

Describe the change

This PR transforms ScanPptx from a basic metadata extractor (119 lines) into a comprehensive PowerPoint threat detection scanner (895 lines, +776 lines added).

Summary of Changes

Original Scanner (PR #147, New ScanPptx scanner):
- Basic metadata extraction (author, title, keywords)
- Simple click action URL collection
- 119 lines total
Enhanced Scanner:
- 199-entry CLSID database for ActiveX/OLE object identification
- Advanced action detection: Click + hover actions, ppaction:// URI parsing
- Relationship tracking: Remote templates, external resources, presentation/slide-level relationships
- OLE/ActiveX metadata extraction with automatic risk classification
- 8 detection flags for instant threat assessment
- Comprehensive exception handling for robust error recovery
- 895 lines total (+652% increase)
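To make the relationship-tracking item more concrete, here is a minimal sketch of how externally targeted relationships could be pulled from a `.rels` part. The function name `external_relationships` and the `location` argument are assumptions for illustration, and the `type` value is simply the last path segment of the relationship type URI, which is a simplification of whatever classification the scanner actually applies.

```python
import xml.etree.ElementTree as ET

# Namespace of OPC relationship parts (*.rels files inside the package).
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

SAMPLE_RELS = (
    f'<Relationships xmlns="{REL_NS}">'
    '<Relationship Id="rId12" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" '
    'Target="https://example.com/report" TargetMode="External"/>'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slideLayout" '
    'Target="../slideLayouts/slideLayout1.xml"/>'
    '</Relationships>'
)


def external_relationships(rels_xml, location):
    """Return one dict per relationship whose TargetMode is External."""
    found = []
    root = ET.fromstring(rels_xml)
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("TargetMode") != "External":
            continue  # internal package references are not interesting here
        found.append({
            # Simplification: keep only the last segment of the type URI.
            "type": rel.get("Type", "").rsplit("/", 1)[-1],
            "target": rel.get("Target"),
            "is_external": True,
            "rid": rel.get("Id"),
            "location": location,
        })
    return found
```

For the sample above, `external_relationships(SAMPLE_RELS, "slide_3")` returns only the `rId12` hyperlink entry; the slide layout relationship is skipped because it has no external target mode.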
Dependencies
- lxml (already in requirements) - For ActiveX XML parsing
- urllib.parse (stdlib) - For ppaction:// URI parsing
Issues Fixed
- Resolves unknown CLSID detection (previously returned null)
- Addresses missing threat indicators in PowerPoint files
- Enhances PR #147 (New ScanPptx scanner)
Motivation

PowerPoint files are a common malware delivery vector (CVE-2017-0199, CVE-2017-11882). The original scanner only extracted metadata. This enhancement adds comprehensive threat detection including:
- Remote template injection detection
- Embedded ActiveX control identification
- Program/macro execution URI detection
- Hover actions (no user click required)
Describe testing procedures

Test Dataset
- 106 PowerPoint files from production environment
- Formats: PPT (Office 97-2003), PPTX (Office 2007+), PPAM (Add-ins)
- File sizes: 100KB to 38MB
- Sources: Real-world presentations with various content types
Test Results
- Success rate: 105/106 files (99.1%)
- Failures: 1 processing_error (acceptable for severely malformed file)
- Error handling: 88 files with value_error flag (expected for old PPT format, non-blocking)
Reproduction Steps

Test 1: ActiveX Detection

Scan a file with Microsoft Forms 2.0 controls

./strelka-fileshot -c fileshot.yaml test_activex.pptx

Expected: Detects TextBox, Label, CommandButton with CLSID descriptions

Test 2: CLSID Identification

Before enhancement: Unknown CLSID returned null

After enhancement: All CLSIDs identified with descriptions

Verify CLSID database coverage

grep -c "': '" src/python/strelka/scanners/scan_pptx.py

Returns: 199

Test 3: Threat Detection

Test ppaction:// URI parsing

File with ppaction://program?file=calc.exe

Expected output: has_ppaction_program: true
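For readers unfamiliar with how a `ppaction://` URI decomposes, the sketch below splits the test URI with `urllib.parse`, following the same approach as the `_parse_ppaction` helper in this PR's diff. The function name `split_ppaction` is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def split_ppaction(action_url):
    """Split a ppaction:// URI into its verb and query parameters."""
    parsed = urlparse(action_url)
    # For "ppaction://program?file=calc.exe" the verb lands in netloc.
    verb = parsed.netloc or None
    fields = {
        key: (values[0] if len(values) == 1 else values)
        for key, values in parse_qs(parsed.query).items()
    }
    return verb, fields


verb, fields = split_ppaction("ppaction://program?file=calc.exe")
print(verb, fields)  # program {'file': 'calc.exe'}
```

A verb of `program` is what would drive the `has_ppaction_program` flag in this test.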

Test 4: Large File Handling

Test 38MB PPTX file

./strelka-fileshot -c fileshot.yaml large_presentation.pptx

Expected: Completes within scanner timeout (150s), no crashes

Security Testing
- XXE Attack Test: Confirmed lxml 4.9.1 blocks external entities
- Malformed File Test: Scanner handles bad_zip, value_error gracefully
- Resource Leak Test: No memory leaks after scanning 106 files
- Timeout Test: Large files complete within configured timeout
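The malformed-file behaviour can be pictured with a small sketch that records a flag instead of raising when the container cannot be opened. This is an assumption about the general pattern, not the scanner's code: `open_package` is a hypothetical helper, and only the flag names `bad_zip` and `value_error` come from this description.

```python
import io
import zipfile


def open_package(data):
    """Return (member_names, flags) and never raise on a bad container."""
    names = []
    flags = []
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = zf.namelist()
    except zipfile.BadZipFile:
        # Input is not a ZIP container at all, or its directory is corrupt.
        flags.append("bad_zip")
    except ValueError:
        # Other structural problems reported by the ZIP reader.
        flags.append("value_error")
    return names, flags
```

Passing arbitrary bytes such as `b"not a zip"` returns an empty name list with the `bad_zip` flag set, so the caller can keep scanning other files.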
Test Files Available

Can provide sample files upon request for maintainer testing.

Sample output

Before (Original Scanner)

{
  "scan": {
    "scan_pptx": {
      "author": "John Doe",
      "title": "Quarterly Report",
      "slide_count": 15,
      "word_count": 842,
      "urls": ["https://example.com/report"]
    }
  }
}

After (Enhanced Scanner)

```
{
  "scan": {
    "scan_pptx": {
      "author": "John Doe",
      "title": "Quarterly Report",
      "slide_count": 15,
      "word_count": 842,

      "ole_objects": [
        {
          "type": "activex_control",
          "slide": 5,
          "classid": "{8BD21D10-EC42-11CE-9E0D-00AA006002F3}",
          "clsid_desc": "Microsoft Forms 2.0 TextBox",
          "persistence": "persistPropertyBag",
          "properties": {
            "VariousPropertyBits": "746604571",
            "DisplayStyle": "1"
          },
          "is_activex": true,
          "is_high_risk": false,
          "source_file": "ppt/activeX/activeX1.xml"
        },
        {
          "type": "activex_control",
          "slide": 5,
          "classid": "{D7053240-CE69-11CD-A777-00DD01143C57}",
          "clsid_desc": "Microsoft Forms 2.0 CommandButton",
          "persistence": "persistPropertyBag",
          "properties": {"Caption": "Submit"},
          "is_activex": true,
          "is_high_risk": false,
          "source_file": "ppt/activeX/activeX2.xml"
        }
      ],

      "actions": [
        {
          "slide": 3,
          "shape": "Rectangle 7",
          "trigger": "click",
          "verb": null,
          "ppaction_url": null,
          "fields": {},
          "rid": "rId12",
          "target": "https://example.com/report",
          "is_external": true
        },
        {
          "slide": 10,
          "shape": "Text Box 3",
          "trigger": "hover",
          "verb": "hlinksldjump",
          "ppaction_url": "ppaction://hlinksldjump?slideid=5",
          "fields": {"slideid": "5"},
          "rid": null,
          "target": null,
          "is_external": null
        }
      ],

      "relationships": [
        {
          "type": "hyperlink",
          "target": "https://example.com/report",
          "is_external": true,
          "rid": "rId12",
          "location": "slide_3"
        },
        {
          "type": "external_image",
          "target": "https://cdn.example.com/logo.png",
          "is_external": true,
          "rid": "rId8",
          "location": "slide_1"
        }
      ],

      "urls": ["https://example.com/report"],

      "has_hover_actions": true,
      "has_ppaction_program": false,
      "has_ppaction_macro": false,
      "has_ppaction_ole": false,
      "has_external_relationships": true,
      "has_remote_template": false,
      "has_activex_controls": true,
      "has_high_risk_ole": false
    }
  }
}
```

Real-World Example: CLSID Detection Improvement

File: d56fd2074f6b327ea2917de9e4260c8cd03baf6e002239f1f486d603e3416b42.ppt

BEFORE:
"classid": "{8BD21D10-EC42-11CE-9E0D-00AA006002F3}",
"clsid_desc": null

AFTER:
"classid": "{8BD21D10-EC42-11CE-9E0D-00AA006002F3}",
"clsid_desc": "Microsoft Forms 2.0 TextBox"
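The before/after difference comes down to a dictionary lookup on a normalised CLSID. The sketch below shows one plausible shape for that lookup; the table name `CLSID_DESCRIPTIONS`, the function name `lookup_clsid`, and the brace and case normalisation are assumptions, and only the two entries that appear in the sample output are included rather than the full 199-entry database.

```python
# Two entries taken from the sample output in this description;
# the real database in the PR is much larger.
CLSID_DESCRIPTIONS = {
    "8BD21D10-EC42-11CE-9E0D-00AA006002F3": "Microsoft Forms 2.0 TextBox",
    "D7053240-CE69-11CD-A777-00DD01143C57": "Microsoft Forms 2.0 CommandButton",
}


def lookup_clsid(classid):
    """Return a description for a CLSID, or None when it is unknown."""
    if not classid:
        return None
    # Normalise so "{8bd2...}" and "8BD2..." hit the same key.
    key = classid.strip().strip("{}").upper()
    return CLSID_DESCRIPTIONS.get(key)
```

With this shape, an unknown CLSID still yields `None`, so `clsid_desc` stays `null` in the output for anything outside the table.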

Checklist
- My code follows the style guidelines of this project
  - Follows existing Strelka scanner patterns
  - Uses snake_case for JSON keys per CONTRIBUTING.md
  - Function naming matches project conventions
  - Comprehensive docstrings included
- I have performed a self-review of and tested my code
  - Tested with 106 production PowerPoint files
  - Success rate: 99.1% (105/106)
  - Security audit completed (no critical/high issues)
  - All error cases handled gracefully
- I have commented my code, particularly in hard-to-understand areas
  - All helper functions documented with docstrings
  - CLSID database sources cited with URLs
  - Complex logic (ppaction parsing, relationship classification) commented
  - High-risk indicators explained
- I have made corresponding changes to the documentation
  - Created SCANPPTX_OUTPUT_SCHEMA.md (19KB) - Complete field reference
  - Created SCANPPTX_SECURITY_AUDIT.md (18KB) - Security analysis
  - Created SECURITY_AUDIT_SUMMARY.txt (8KB) - Executive summary
  - Can provide these docs in follow-up PR if desired
- My changes generate no new warnings
  - Docker backend rebuilt with --no-cache (clean build)
  - No linter warnings
  - No deprecation warnings
  - Exception handling prevents runtime warnings

…etection

Enhance ScanPptx with 199-entry CLSID database for ActiveX/OLE object identification and expanded threat detection capabilities.

Changes:
- Add 199 CLSIDs (137 oletools + 62 MaxScript)
- Add ActiveX control detection from ppt/activeX/ directory
- Add hover action detection (in addition to click actions)
- Add ppaction:// URI parsing (program, macro, ole verbs)
- Add remote template detection
- Add presentation/slide-level relationship tracking
- Add OLE object metadata extraction with risk classification
- Add 8 new detection flags (has_ppaction_program, has_remote_template, etc.)

Security:
- No code execution (read-only analysis)
- XXE protected (lxml 4.9.1+ default protection)
- Comprehensive exception handling
- Scanner timeout protection

MSAdministrator · 2026-01-20T15:12:53Z

+
+                        # Extract actions (click and hover)
+                        try:
+                            shape_actions = _extract_shape_actions(shape, slide_num)


By extracting these shape actions, I know we are extracting the URLs, but are these the same URLs that will be in click_actions, or do we need to update urls with whatever is extracted from this shape action function?

fixed in 5e03588

Merge URLs from all extraction methods into urls[] array:
- Click action URLs (existing python-pptx API method)
- Hover action and ppaction:// URLs (from actions array)
- Text hyperlink URLs (from relationships array)

This addresses PR feedback to ensure all extracted URLs are available in the urls[] field for backward compatibility and comprehensive coverage. Verified with test scans showing successful URL extraction from all sources.
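The merge described in this commit message could look roughly like the sketch below. It is an assumption about the shape of the logic, not the committed code: `merge_urls` is a hypothetical name, the dictionary keys follow the sample output earlier in the PR description, and first-seen order is preserved while duplicates are dropped.

```python
def merge_urls(click_urls, actions, relationships):
    """Combine URLs from all three sources, de-duplicated in first-seen order."""
    candidates = list(click_urls)
    for action in actions:
        # An action may carry a relationship target, a ppaction:// URL, or both.
        candidates.extend([action.get("target"), action.get("ppaction_url")])
    for rel in relationships:
        if rel.get("type") == "hyperlink":
            candidates.append(rel.get("target"))
    merged = []
    for url in candidates:
        if url and url not in merged:
            merged.append(url)
    return merged


example = merge_urls(
    ["https://example.com/report"],
    [
        {"target": "https://example.com/report", "ppaction_url": None},
        {"target": None, "ppaction_url": "ppaction://hlinksldjump?slideid=5"},
    ],
    [
        {"type": "hyperlink", "target": "https://example.com/report"},
        {"type": "external_image", "target": "https://cdn.example.com/logo.png"},
    ],
)
```

Here `example` contains the report URL once, followed by the `ppaction://` URL; the external image target is left out because only hyperlink relationships are merged in this sketch.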

zoomequipd · 2026-03-03T15:57:18Z

+def _parse_ppaction(action_url):
+    """
+    Parse ppaction:// URLs to extract verb and query parameters.
+
+    Args:
+        action_url: String like "ppaction://program?file=malware.exe"
+
+    Returns:
+        dict with 'ppaction_url', 'verb', and 'fields'
+    """
+    if not action_url or not action_url.startswith("ppaction://"):
+        return {"ppaction_url": None, "verb": None, "fields": {}}
+
+    parsed = urlparse(action_url)
+    return {
+        "ppaction_url": action_url,
+        "verb": (parsed.netloc or None),
+        "fields": {
+            k: (v[0] if len(v) == 1 else v)
+            for k, v in parse_qs(parsed.query).items()
+        },
+    }


I wonder if this is better handled as part of the URL parsing that happens outside of Strelka?

zoomequipd · 2026-03-03T15:57:21Z

+        ]
+        is_dangerous = any(indicator in desc for indicator in dangerous_indicators)
+
+    return (desc, is_dangerous)


I think this type of logic belongs in MQL. Expose the CLSIDs and write MQL for the "dangerous" ones as an ASR rule.

This ensures we aren't placing an opinion within backend systems, but placing opinion within detection logic.

zoomequipd · 2026-03-03T16:04:43Z

+HIGH_RISK_PROGIDS = {
+    "Shell.Explorer",      # Web browser control
+    "WScript.Shell",       # Shell execution
+    "Shell.Application",   # Shell application
+    "WScript.Network",     # Network access
+    "Scripting.FileSystemObject",  # File system access
+}


While I don't disagree these are high risk, I think the general approach should be exposing them in the returned object and writing MQL as the detection element that applies the opinion.

zoomequipd · 2026-03-03T16:16:20Z

+                self.event["has_ppaction_ole"] = any(
+                    a["verb"] == "ole" for a in all_actions
+                )


If these "questions" (does this pptx have a ppaction program?, etc.) can be answered by looking through self.event['actions'] (which gets returned to the MQL), do the flags need to be part of the response?
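To illustrate the question, the action-related flags can be recomputed from the actions array alone, as in the sketch below. `derive_flags` is a hypothetical helper written for this comment thread, and it assumes each action dict carries the `verb` and `trigger` keys shown in the sample output.

```python
def derive_flags(actions):
    """Recompute the action-related boolean flags from the actions list."""
    verbs = {action.get("verb") for action in actions}
    return {
        "has_hover_actions": any(a.get("trigger") == "hover" for a in actions),
        "has_ppaction_program": "program" in verbs,
        "has_ppaction_macro": "macro" in verbs,
        "has_ppaction_ole": "ole" in verbs,
    }


sample_actions = [
    {"trigger": "click", "verb": None},
    {"trigger": "hover", "verb": "hlinksldjump"},
    {"trigger": "click", "verb": "program"},
]
flags = derive_flags(sample_actions)
```

Since nothing beyond the returned actions is needed, the same checks could in principle be expressed in detection logic rather than in the scanner.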

…native stream extraction
- Remove HIGH_RISK_PROGIDS and is_high_risk opinion logic (belongs in MQL)
- Remove has_high_risk_ole event flag
- _lookup_clsid now returns str|None instead of (str, bool) tuple
- Rename action dict 'fields' -> 'params' for ppaction parsed parameters
- Add action_type field ('ppaction' or 'hyperlink') to all action dicts
- Add _normalize_pptx_bytes(): coerces .ppsx/.ppsm/.potx/.potm content types so python-pptx can parse them; strips malformed <Relationship> elements missing Target attribute with malformed_relationships_N flag
- Add _extract_ole_native_info(): surfaces filename, src_path, temp_path, actual_size, is_link from Ole10Native stream in embedded OLE Package objects
- Add 31 unit tests covering ppaction parsing, CLSID lookup, content type normalization, malformed rel handling, and OLE native stream extraction
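The malformed-relationship handling mentioned in this commit might work along the lines of the sketch below, which drops `<Relationship>` elements that lack a `Target` attribute and reports how many were removed. This is a guess at the idea, not the code of `_normalize_pptx_bytes()`: the function name `strip_malformed_relationships` is hypothetical and the standard-library XML parser stands in for whatever the scanner uses.

```python
import xml.etree.ElementTree as ET

# Namespace of OPC relationship parts (*.rels files inside the package).
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def strip_malformed_relationships(rels_xml):
    """Remove Relationship elements without a Target; return (xml, removed_count)."""
    # Keep the relationships namespace as the default one when serialising.
    ET.register_namespace("", REL_NS)
    root = ET.fromstring(rels_xml)
    bad = [
        rel for rel in root.findall(f"{{{REL_NS}}}Relationship")
        if rel.get("Target") is None
    ]
    for rel in bad:
        root.remove(rel)
    return ET.tostring(root, encoding="unicode"), len(bad)


broken = (
    f'<Relationships xmlns="{REL_NS}">'
    '<Relationship Id="rId1" Type="t" Target="slides/slide1.xml"/>'
    '<Relationship Id="rId2" Type="t"/>'
    '</Relationships>'
)
cleaned, removed = strip_malformed_relationships(broken)
```

The returned count could then feed a flag of the form `malformed_relationships_N`, with `N` being the number of stripped elements.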

MSAdministrator reviewed Jan 20, 2026

zoomequipd reviewed Mar 3, 2026

wmetcalf added 2 commits March 4, 2026 09:22

test: add integration tests for distributed scanner pipeline

4cb7ab5

Three pytest.mark.integration tests verify the full Redis data-flow: single-file two-scanner event assembly, child-file FIN guard, and timed-out scanner result injection. Tests skip automatically when Redis is unavailable.

wmetcalf wants to merge 4 commits into sublime-security:main from wmetcalf:feat/scanpptx-part-deux-the-search-for-more-evil-bits


node5-sublime Mar 3, 2026


4 participants
