
feat(ScanPptx): Add comprehensive CLSID database and active content d…#149

Open
wmetcalf wants to merge 4 commits into
sublime-security:mainfrom
wmetcalf:feat/scanpptx-part-deux-the-search-for-more-evil-bits


@wmetcalf


…etection

Enhance ScanPptx with 199-entry CLSID database for ActiveX/OLE object identification and expanded threat detection capabilities.

Changes:

  • Add 199 CLSIDs (137 oletools + 62 MaxScript)

  • Add ActiveX control detection from ppt/activeX/ directory

  • Add hover action detection (in addition to click actions)

  • Add ppaction:// URI parsing (program, macro, ole verbs)

  • Add remote template detection

  • Add presentation/slide-level relationship tracking

  • Add OLE object metadata extraction with risk classification

  • Add 8 new detection flags (has_ppaction_program, has_remote_template, etc.)

    Describe the change

    This PR transforms ScanPptx from a basic metadata extractor (119 lines) into a comprehensive PowerPoint threat detection scanner (895 lines, a net addition of 776).

    Summary of Changes

    Original Scanner (PR #147, New ScanPptx scanner):

    • Basic metadata extraction (author, title, keywords)
    • Simple click action URL collection
    • 119 lines total

    Enhanced Scanner:

    • 199-entry CLSID database for ActiveX/OLE object identification
    • Advanced action detection: Click + hover actions, ppaction:// URI parsing
    • Relationship tracking: Remote templates, external resources, presentation/slide-level relationships
    • OLE/ActiveX metadata extraction with automatic risk classification
    • 8 detection flags for instant threat assessment
    • Comprehensive exception handling for robust error recovery
    • 895 lines total (+652% increase over 119)

    Dependencies

    • lxml (already in requirements) - For ActiveX XML parsing
    • urllib.parse (stdlib) - For ppaction:// URI parsing

    Issues Fixed

    • Resolves unknown CLSID detection (previously returned null)
    • Addresses missing threat indicators in PowerPoint files
    • Enhances PR #147 (New ScanPptx scanner)

    Motivation

    PowerPoint files are a common malware delivery vector (CVE-2017-0199, CVE-2017-11882). The original scanner only extracted metadata. This enhancement adds comprehensive threat detection including:

    • Remote template injection detection
    • Embedded ActiveX control identification
    • Program/macro execution URI detection
    • Hover actions (no user click required)

    Describe testing procedures

    Test Dataset

    • 106 PowerPoint files from production environment
    • Formats: PPT (Office 97-2003), PPTX (Office 2007+), PPAM (Add-ins)
    • File sizes: 100KB to 38MB
    • Sources: Real-world presentations with various content types

    Test Results

    • Success rate: 105/106 files (99.1%)
    • Failures: 1 processing_error (acceptable for severely malformed file)
    • Error handling: 88 files with value_error flag (expected for old PPT format, non-blocking)

    Reproduction Steps

    Test 1: ActiveX Detection

    Scan a file with Microsoft Forms 2.0 controls

    ./strelka-fileshot -c fileshot.yaml test_activex.pptx

    Expected: Detects TextBox, Label, CommandButton with CLSID descriptions
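
As a rough illustration of what Test 1 exercises, the sketch below reads one part from ppt/activeX/ and pulls out the class ID, persistence mode, and property bag. It uses the standard-library parser rather than lxml so that it runs without dependencies. The function name `parse_activex_part`, the sample XML, and the assumption that the part is a single `ax:ocx` element with `ax:ocxPr` children are illustrative and are not taken from the PR's code.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for ActiveX parts in Office Open XML packages.
AX_NS = "http://schemas.microsoft.com/office/2006/activeX"


def parse_activex_part(xml_bytes):
    """Return classid, persistence, and property bag from one activeX part."""
    root = ET.fromstring(xml_bytes)
    # Each ax:ocxPr child contributes one name/value pair.
    props = {
        pr.get(f"{{{AX_NS}}}name"): pr.get(f"{{{AX_NS}}}value")
        for pr in root.findall(f"{{{AX_NS}}}ocxPr")
    }
    return {
        "classid": root.get(f"{{{AX_NS}}}classid"),
        "persistence": root.get(f"{{{AX_NS}}}persistence"),
        "properties": props,
    }


# Hypothetical part content modelled on the CommandButton in the sample output.
SAMPLE = b"""<ax:ocx xmlns:ax="http://schemas.microsoft.com/office/2006/activeX"
  ax:classid="{D7053240-CE69-11CD-A777-00DD01143C57}"
  ax:persistence="persistPropertyBag">
  <ax:ocxPr ax:name="Caption" ax:value="Submit"/>
</ax:ocx>"""

info = parse_activex_part(SAMPLE)
print(info)
```

The returned class ID would then be passed to the CLSID table to obtain a description.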

    Test 2: CLSID Identification

    Before enhancement: Unknown CLSID returned null

    After enhancement: All CLSIDs identified with descriptions

    Verify CLSID database coverage

    grep -c "': '" src/python/strelka/scanners/scan_pptx.py

    Returns: 199
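
The lookup behind Test 2 can be pictured as a dictionary keyed by normalized CLSID. The sketch below is a minimal version under stated assumptions: the table holds only the two entries that appear in the sample output, and the names `CLSID_DESCRIPTIONS` and `lookup_clsid` are hypothetical, not the PR's identifiers.

```python
# Two entries copied from the sample output; the PR's table has 199.
CLSID_DESCRIPTIONS = {
    "8BD21D10-EC42-11CE-9E0D-00AA006002F3": "Microsoft Forms 2.0 TextBox",
    "D7053240-CE69-11CD-A777-00DD01143C57": "Microsoft Forms 2.0 CommandButton",
}


def lookup_clsid(clsid):
    """Return a description for a CLSID, or None when it is not in the table."""
    if not clsid:
        return None
    # Normalize so that braces and letter case do not affect the match.
    key = clsid.strip().strip("{}").upper()
    return CLSID_DESCRIPTIONS.get(key)


print(lookup_clsid("{8bd21d10-ec42-11ce-9e0d-00aa006002f3}"))
print(lookup_clsid("{00000000-0000-0000-0000-000000000000}"))
```

An unknown CLSID still yields None, so a null `clsid_desc` remains possible for any class ID outside the table.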

    Test 3: Threat Detection

    Test ppaction:// URI parsing

    File with ppaction://program?file=calc.exe

    Expected output: has_ppaction_program: true
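
To make the expectation in Test 3 concrete, here is a small sketch of how such a URI decomposes with `urllib.parse`: the verb lands in the position a hostname would normally occupy, and the query string carries the parameters. The helper name `split_ppaction` is hypothetical and the flag is computed inline purely for illustration.

```python
from urllib.parse import parse_qs, urlparse


def split_ppaction(uri):
    """Split a ppaction:// URI into (verb, params); (None, {}) otherwise."""
    if not uri or not uri.startswith("ppaction://"):
        return None, {}
    parsed = urlparse(uri)
    # Single-valued parameters are unwrapped; repeated keys stay as lists.
    params = {
        k: (v[0] if len(v) == 1 else v)
        for k, v in parse_qs(parsed.query).items()
    }
    return parsed.netloc or None, params


verb, params = split_ppaction("ppaction://program?file=calc.exe")
has_ppaction_program = verb == "program"
print(verb, params, has_ppaction_program)  # program {'file': 'calc.exe'} True
```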

    Test 4: Large File Handling

    Test 38MB PPTX file

    ./strelka-fileshot -c fileshot.yaml large_presentation.pptx

    Expected: Completes within scanner timeout (150s), no crashes

    Security Testing

    • XXE Attack Test: Confirmed lxml 4.9.1 blocks external entities
    • Malformed File Test: Scanner handles bad_zip, value_error gracefully
    • Resource Leak Test: No memory leaks after scanning 106 files
    • Timeout Test: Large files complete within configured timeout

    Test Files Available

    Can provide sample files upon request for maintainer testing.


    Sample output

    Before (Original Scanner)

    {
      "scan": {
        "scan_pptx": {
          "author": "John Doe",
          "title": "Quarterly Report",
          "slide_count": 15,
          "word_count": 842,
          "urls": ["https://example.com/report"]
        }
      }
    }

    After (Enhanced Scanner)

    {
      "scan": {
        "scan_pptx": {
          "author": "John Doe",
          "title": "Quarterly Report",
          "slide_count": 15,
          "word_count": 842,

          "ole_objects": [
            {
              "type": "activex_control",
              "slide": 5,
              "classid": "{8BD21D10-EC42-11CE-9E0D-00AA006002F3}",
              "clsid_desc": "Microsoft Forms 2.0 TextBox",
              "persistence": "persistPropertyBag",
              "properties": {
                "VariousPropertyBits": "746604571",
                "DisplayStyle": "1"
              },
              "is_activex": true,
              "is_high_risk": false,
              "source_file": "ppt/activeX/activeX1.xml"
            },
            {
              "type": "activex_control",
              "slide": 5,
              "classid": "{D7053240-CE69-11CD-A777-00DD01143C57}",
              "clsid_desc": "Microsoft Forms 2.0 CommandButton",
              "persistence": "persistPropertyBag",
              "properties": {"Caption": "Submit"},
              "is_activex": true,
              "is_high_risk": false,
              "source_file": "ppt/activeX/activeX2.xml"
            }
          ],

          "actions": [
            {
              "slide": 3,
              "shape": "Rectangle 7",
              "trigger": "click",
              "verb": null,
              "ppaction_url": null,
              "fields": {},
              "rid": "rId12",
              "target": "https://example.com/report",
              "is_external": true
            },
            {
              "slide": 10,
              "shape": "Text Box 3",
              "trigger": "hover",
              "verb": "hlinksldjump",
              "ppaction_url": "ppaction://hlinksldjump?slideid=5",
              "fields": {"slideid": "5"},
              "rid": null,
              "target": null,
              "is_external": null
            }
          ],

          "relationships": [
            {
              "type": "hyperlink",
              "target": "https://example.com/report",
              "is_external": true,
              "rid": "rId12",
              "location": "slide_3"
            },
            {
              "type": "external_image",
              "target": "https://cdn.example.com/logo.png",
              "is_external": true,
              "rid": "rId8",
              "location": "slide_1"
            }
          ],

          "urls": ["https://example.com/report"],

          "has_hover_actions": true,
          "has_ppaction_program": false,
          "has_ppaction_macro": false,
          "has_ppaction_ole": false,
          "has_external_relationships": true,
          "has_remote_template": false,
          "has_activex_controls": true,
          "has_high_risk_ole": false
        }
      }
    }

    Real-World Example: CLSID Detection Improvement

    File: d56fd2074f6b327ea2917de9e4260c8cd03baf6e002239f1f486d603e3416b42.ppt

    BEFORE:
    "classid": "{8BD21D10-EC42-11CE-9E0D-00AA006002F3}",
    "clsid_desc": null

    AFTER:
    "classid": "{8BD21D10-EC42-11CE-9E0D-00AA006002F3}",
    "clsid_desc": "Microsoft Forms 2.0 TextBox"


    Checklist

    • My code follows the style guidelines of this project
      • Follows existing Strelka scanner patterns
      • Uses snake_case for JSON keys per CONTRIBUTING.md
      • Function naming matches project conventions
      • Comprehensive docstrings included
    • I have performed a self-review of and tested my code
      • Tested with 106 production PowerPoint files
      • Success rate: 99.1% (105/106)
      • Security audit completed (no critical/high issues)
      • All error cases handled gracefully
    • I have commented my code, particularly in hard-to-understand areas
      • All helper functions documented with docstrings
      • CLSID database sources cited with URLs
      • Complex logic (ppaction parsing, relationship classification) commented
      • High-risk indicators explained
    • I have made corresponding changes to the documentation
      • Created SCANPPTX_OUTPUT_SCHEMA.md (19KB) - Complete field reference
      • Created SCANPPTX_SECURITY_AUDIT.md (18KB) - Security analysis
      • Created SECURITY_AUDIT_SUMMARY.txt (8KB) - Executive summary
      • Can provide these docs in follow-up PR if desired
    • My changes generate no new warnings
      • Docker backend rebuilt with --no-cache (clean build)
      • No linter warnings
      • No deprecation warnings
      • Exception handling prevents runtime warnings

…etection

Enhance ScanPptx with 199-entry CLSID database for ActiveX/OLE object
identification and expanded threat detection capabilities.

Changes:
- Add 199 CLSIDs (137 oletools + 62 MaxScript)
- Add ActiveX control detection from ppt/activeX/ directory
- Add hover action detection (in addition to click actions)
- Add ppaction:// URI parsing (program, macro, ole verbs)
- Add remote template detection
- Add presentation/slide-level relationship tracking
- Add OLE object metadata extraction with risk classification
- Add 8 new detection flags (has_ppaction_program, has_remote_template, etc.)

Security:
- No code execution (read-only analysis)
- XXE protected (lxml 4.9.1+ default protection)
- Comprehensive exception handling
- Scanner timeout protection

# Extract actions (click and hover)
try:
    shape_actions = _extract_shape_actions(shape, slide_num)

Member

I know that extracting these shape actions also extracts URLs, but are these the same URLs that end up in click_actions, or do we need to update urls with whatever this shape action function extracts?


fixed in 5e03588

Merge URLs from all extraction methods into urls[] array:
- Click action URLs (existing python-pptx API method)
- Hover action and ppaction:// URLs (from actions array)
- Text hyperlink URLs (from relationships array)

This addresses PR feedback to ensure all extracted URLs are available
in the urls[] field for backward compatibility and comprehensive coverage.

Verified with test scans showing successful URL extraction from all sources.
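
The merge described in this commit message could look roughly like the sketch below. It is not the code from 5e03588: the function name `merge_urls`, the restriction to relationships of type "hyperlink", and the order-preserving de-duplication are assumptions made for illustration, using the field names from the sample output.

```python
def merge_urls(click_urls, actions, relationships):
    """Combine URLs from all three sources, keeping first-seen order."""
    candidates = list(click_urls)
    for action in actions:
        # An action may carry a ppaction:// URL, a resolved target, or both.
        candidates += [action.get("ppaction_url"), action.get("target")]
    candidates += [
        rel.get("target")
        for rel in relationships
        if rel.get("type") == "hyperlink"
    ]
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(u for u in candidates if u))


urls = merge_urls(
    ["https://example.com/report"],
    [
        {"ppaction_url": None, "target": "https://example.com/report"},
        {"ppaction_url": "ppaction://hlinksldjump?slideid=5", "target": None},
    ],
    [
        {"type": "hyperlink", "target": "https://example.com/docs"},
        {"type": "external_image", "target": "https://cdn.example.com/logo.png"},
    ],
)
print(urls)
```

Keeping the click-action URLs first means existing consumers of urls[] see the same leading entries as before.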
Comment on lines +344 to +365
def _parse_ppaction(action_url):
    """
    Parse ppaction:// URLs to extract verb and query parameters.

    Args:
        action_url: String like "ppaction://program?file=malware.exe"

    Returns:
        dict with 'ppaction_url', 'verb', and 'fields'
    """
    if not action_url or not action_url.startswith("ppaction://"):
        return {"ppaction_url": None, "verb": None, "fields": {}}

    parsed = urlparse(action_url)
    return {
        "ppaction_url": action_url,
        "verb": (parsed.netloc or None),
        "fields": {
            k: (v[0] if len(v) == 1 else v)
            for k, v in parse_qs(parsed.query).items()
        },
    }

Member

I wonder if this would be better handled as part of the URL parsing that happens outside of Strelka?

]
is_dangerous = any(indicator in desc for indicator in dangerous_indicators)

return (desc, is_dangerous)

Member

I think this type of logic belongs in MQL: expose the CLSIDs and write MQL for the "dangerous" ones as an ASR rule.

This keeps opinion out of the backend systems and places it in the detection logic instead.

Comment on lines +37 to +43
HIGH_RISK_PROGIDS = {
    "Shell.Explorer",  # Web browser control
    "WScript.Shell",  # Shell execution
    "Shell.Application",  # Shell application
    "WScript.Network",  # Network access
    "Scripting.FileSystemObject",  # File system access
}

Member

While I don't disagree that these are high risk, I think the general approach should be to expose them in the returned object and write MQL as the detection element that applies the opinion.

Comment on lines +848 to +850
self.event["has_ppaction_ole"] = any(
    a["verb"] == "ole" for a in all_actions
)

Member

If these "questions" (does this pptx have a ppaction program?, etc.) can be answered by looking through self.event['actions'] (which gets returned to MQL), do the flags need to be part of the response?
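
To illustrate the point being raised, the sketch below recomputes the four action-related flags from an actions list alone. It is written in Python rather than MQL, the helper name `derive_flags` is hypothetical, and the verb names "program", "macro", and "ole" are taken from the PR description. The input mirrors the two actions in the sample output.

```python
actions = [
    {"trigger": "click", "verb": None, "target": "https://example.com/report"},
    {"trigger": "hover", "verb": "hlinksldjump", "target": None},
]


def derive_flags(actions):
    """Recompute the action-related flags from the actions list alone."""
    verbs = {a.get("verb") for a in actions}
    return {
        "has_hover_actions": any(a.get("trigger") == "hover" for a in actions),
        "has_ppaction_program": "program" in verbs,
        "has_ppaction_macro": "macro" in verbs,
        "has_ppaction_ole": "ole" in verbs,
    }


flags = derive_flags(actions)
print(flags)
```

Every flag here is a single pass over data that is already in the response, which is the redundancy the question points at.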

wmetcalf added 2 commits March 4, 2026 09:22
Three pytest.mark.integration tests verify the full Redis data-flow:
single-file two-scanner event assembly, child-file FIN guard, and
timed-out scanner result injection.  Tests skip automatically when
Redis is unavailable.
…native stream extraction

- Remove HIGH_RISK_PROGIDS and is_high_risk opinion logic (belongs in MQL)
- Remove has_high_risk_ole event flag
- _lookup_clsid now returns str|None instead of (str, bool) tuple
- Rename action dict 'fields' -> 'params' for ppaction parsed parameters
- Add action_type field ('ppaction' or 'hyperlink') to all action dicts
- Add _normalize_pptx_bytes(): coerces .ppsx/.ppsm/.potx/.potm content
  types so python-pptx can parse them; strips malformed <Relationship>
  elements missing Target attribute with malformed_relationships_N flag
- Add _extract_ole_native_info(): surfaces filename, src_path, temp_path,
  actual_size, is_link from Ole10Native stream in embedded OLE Package objects
- Add 31 unit tests covering ppaction parsing, CLSID lookup, content type
  normalization, malformed rel handling, and OLE native stream extraction
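
As a rough picture of the malformed-relationship handling mentioned above, the sketch below removes `<Relationship>` elements that lack a Target attribute from a .rels part and reports how many were dropped. It is not the PR's `_normalize_pptx_bytes()`: the function name `strip_targetless_relationships`, the sample XML, and the use of the standard-library parser are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of Open Packaging Conventions relationship parts.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def strip_targetless_relationships(rels_xml):
    """Drop <Relationship> elements lacking Target; return (xml_bytes, removed)."""
    # Serialize with the relationships namespace as the default (global setting).
    ET.register_namespace("", REL_NS)
    root = ET.fromstring(rels_xml)
    bad = [
        rel
        for rel in root.findall(f"{{{REL_NS}}}Relationship")
        if rel.get("Target") is None
    ]
    for rel in bad:
        root.remove(rel)
    return ET.tostring(root), len(bad)


# Hypothetical .rels content: rId2 has no Target attribute.
SAMPLE = b"""<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/slide" Target="slides/slide1.xml"/>
  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"/>
</Relationships>"""

cleaned, removed = strip_targetless_relationships(SAMPLE)
print(removed)  # 1
```

The removed count is the kind of value that could feed a flag such as malformed_relationships_N.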

4 participants