perf(scanner): lazily materialize imported modules#630

Open
king-tero wants to merge 1 commit into VirusTotal:main from king-tero:perf/lazy-module-materialization

king-tero commented Apr 20, 2026

## Summary

Add an opt-in lazy module mode for scans.

With ScanOptions::lazy_modules(true), imported modules are no longer
materialized eagerly at scan start. A module is executed only when rule
evaluation actually reads one of its fields or calls one of its functions.

This avoids unnecessary PE, Mach-O, and .NET parsing work for rules that
short-circuit before touching the module, and for rule files that import a
module but never use it at all.

## Motivation

Today, importing a module eagerly executes its parser before condition
evaluation.

For example:

import "pe"

rule t {
  strings:
    $a = "rare marker"
  condition:
    $a and pe.is_pe
}

Even when $a is not present, the scan still pays the full PE parsing cost.

With lazy module materialization enabled, the module is not executed in that
case.

More importantly, YARA imports are file-scoped. If a rule file imports pe,
the current scanner eagerly materializes the PE module for the scan even when:

- a given rule short-circuits before touching pe
- only some rules in the file use pe
- the file imports pe but never references pe.* at all

That last case is not hypothetical.

## Evidence from public rule repositories

I surveyed 8 public YARA rule repositories, covering 3,860 rule files:

- Neo23x0/signature-base
- Yara-Rules/rules
- bartblaze/Yara-rules
- mikesxrs/Open-Source-YARA-rules
- ReversingLabs/reversinglabs-yara-rules
- airbnb/binaryalert
- eset/malware-ioc
- citizenlab/malware-signatures

### PE imports

Across those repositories:

- 504 files import pe
- 0 files use only pe.is_pe
- 2 files use pe.is_pe together with other pe features
- 359 files use other pe features without pe.is_pe
- 143 files import pe but do not reference pe.* anywhere in the file

So while the exact pe.is_pe-only pattern does not occur in these repositories,
eager module work is still wasted in 143 of the 504 importing files (about 28%)
because the import exists but the module is never actually used.

Of those 143 unused pe imports, 29 files also contained manual PE gating
checks such as uint16(0) == 0x5A4D or $mz = "MZ".
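
The buckets above can be approximated with a simple text scan. The following is a minimal sketch of such a classifier, not the script used for the survey: the `PeUsage` enum and `classify_pe_usage` function are hypothetical names, and the scan is deliberately naive (it does not skip comments or string literals).

```rust
/// Hypothetical classification of how a rule file uses the `pe` module.
#[derive(Debug, PartialEq, Eq)]
pub enum PeUsage {
    NotImported,
    ImportedUnused,
    OnlyIsPe,
    IsPeAndOther,
    OtherOnly,
}

/// Naive text scan: looks for an `import "pe"` line, then for `pe.<field>` references.
pub fn classify_pe_usage(source: &str) -> PeUsage {
    let imported = source
        .lines()
        .any(|line| line.trim().starts_with("import \"pe\""));
    if !imported {
        return PeUsage::NotImported;
    }

    let mut uses_is_pe = false;
    let mut uses_other = false;
    for (idx, _) in source.match_indices("pe.") {
        // Skip matches that are the tail of a longer identifier, e.g. `filetype.`.
        let prev = source[..idx].chars().next_back();
        if prev.is_some_and(|c| c.is_alphanumeric() || c == '_') {
            continue;
        }
        // Collect the identifier that follows `pe.`.
        let field: String = source[idx + 3..]
            .chars()
            .take_while(|c| c.is_alphanumeric() || *c == '_')
            .collect();
        if field == "is_pe" {
            uses_is_pe = true;
        } else if !field.is_empty() {
            uses_other = true;
        }
    }

    match (uses_is_pe, uses_other) {
        (false, false) => PeUsage::ImportedUnused,
        (true, false) => PeUsage::OnlyIsPe,
        (true, true) => PeUsage::IsPeAndOther,
        (false, true) => PeUsage::OtherOnly,
    }
}
```

A file that imports pe but gates manually with uint16(0) == 0x5A4D would land in `ImportedUnused` under this sketch, which is the bucket that benefits most from lazy mode.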

### Other modules

The same pattern appears outside pe as well:

- dotnet: 11 files import it, including simple gating cases like
  dotnet.is_dotnet and ...
- elf: 6 files import it, including simple structural use such as
  elf.entry_point

This means the optimization is relevant beyond a synthetic PE-only scenario.

## What changed

- added ScanOptions::lazy_modules(true)
- deferred module execution until first root-field access
- deferred module execution until first module function call
- cached the materialized module output for the rest of the scan
- preserved default behavior unless lazy mode is explicitly enabled
- documented that module_output() / module_outputs() omit never-used
  imported modules when lazy mode is enabled
- added coverage for set_module_output() in lazy mode
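
To illustrate the deferral and caching behavior listed above, here is a minimal, self-contained sketch using only the standard library. It is not the code in this PR: `LazyModule`, `ModuleOutput`, and `eval_condition` are hypothetical names, and the "parser" is a placeholder that only checks the MZ magic.

```rust
use std::cell::{Cell, OnceCell};

/// Hypothetical stand-in for a parsed module's output.
#[derive(Debug)]
pub struct ModuleOutput {
    pub is_pe: bool,
}

/// Hypothetical wrapper that defers an expensive parse until first access.
pub struct LazyModule<'a> {
    data: &'a [u8],
    output: OnceCell<ModuleOutput>,
    parse_count: Cell<u32>,
}

impl<'a> LazyModule<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { data, output: OnceCell::new(), parse_count: Cell::new(0) }
    }

    /// Runs the parser on the first call, then returns the cached output.
    pub fn get(&self) -> &ModuleOutput {
        self.output.get_or_init(|| {
            self.parse_count.set(self.parse_count.get() + 1);
            // Placeholder for real parsing: only checks the "MZ" magic.
            ModuleOutput { is_pe: self.data.starts_with(b"MZ") }
        })
    }

    /// Number of times the parser ran (0 or 1).
    pub fn parse_count(&self) -> u32 {
        self.parse_count.get()
    }

    /// `None` if the module was never accessed during evaluation.
    pub fn output_if_materialized(&self) -> Option<&ModuleOutput> {
        self.output.get()
    }
}

/// Evaluates `$a and pe.is_pe`; `&&` short-circuits, so the module
/// is left untouched when the string did not match.
pub fn eval_condition(string_matched: bool, pe: &LazyModule<'_>) -> bool {
    string_matched && pe.get().is_pe
}
```

In this sketch, calling `eval_condition(false, &pe)` leaves `parse_count()` at 0 and `output_if_materialized()` at `None`, which mirrors how never-used modules are omitted from the module outputs in lazy mode. Repeated calls with `true` parse once and reuse the cached result.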

## Results

Measured on real PE / Mach-O / .NET samples.

### Imported module, condition short-circuits before using it

| Workload | Baseline | Lazy | Time reduction |
|---|---:|---:|---:|
| PE, $a and pe.is_pe | 7.062 ms | 1.115 ms | 84.2% |
| Mach-O, $a and macho.magic == ... | 21.053 ms | 3.678 ms | 82.5% |
| .NET, $a and dotnet.is_dotnet | 9.294 ms | 0.429 ms | 95.4% |
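
The percentages in the last column are consistent with relative time saved, (baseline - lazy) / baseline, rather than a speedup ratio. A small sketch of both calculations follows; the function names are illustrative only.

```rust
/// Relative time saved, in percent: (baseline - lazy) / baseline * 100.
pub fn time_reduction_percent(baseline_ms: f64, lazy_ms: f64) -> f64 {
    (baseline_ms - lazy_ms) / baseline_ms * 100.0
}

/// Speedup expressed as a ratio: baseline / lazy.
pub fn speedup_factor(baseline_ms: f64, lazy_ms: f64) -> f64 {
    baseline_ms / lazy_ms
}
```

Applied to the rows above, the ratio form works out to roughly 6.3x for PE, 5.7x for Mach-O, and 21.7x for .NET.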

### Imported module, condition never uses it

| Workload | Baseline | Lazy |
|---|---:|---:|
| PE, import "pe"; condition: true | 10.421 ms | ~0 ms |
| Mach-O, import "macho"; condition: true | 19.526 ms | ~0 ms |
| .NET, import "dotnet"; condition: true | 8.882 ms | ~0 ms |

### Imported module, condition does use it

These cases stay roughly neutral (lazy mode is within about 2% of baseline,
and slightly slower in both measurements), which is why this remains opt-in.

| Workload | Baseline | Lazy |
|---|---:|---:|
| Mach-O, macho.filetype >= 0 | 19.930 ms | 20.234 ms |
| .NET, dotnet.is_dotnet | 9.109 ms | 9.192 ms |

## Validation

Ran:

- cargo +1.91.0 fmt --all --check
- cargo +1.91.0 check -p yara-x
- cargo +1.91.0 test -p yara-x --lib lazy_ -- --nocapture
- cargo +1.91.0 test -p yara-x --lib module_output -- --nocapture

## Notes

Default scan behavior is unchanged.

When lazy mode is enabled, imported modules that were never accessed during
condition evaluation do not appear in ScanResults::module_output() or
ScanResults::module_outputs().

king-tero force-pushed the perf/lazy-module-materialization branch from a4e0237 to f8b8dc5 on April 21, 2026 17:47