This document is highly technical and not for inexperienced users. It is also not written to keep you engaged with rhetorical questions, fancy formatting, and the rest of the blog toolkit; I don't like blogs. Read it if you simply love the topics: CPU microarchitecture, kernel internals, and compilers.
It describes how a single-threaded, fully synchronous scanner can reach full PCIe Gen5 speed and run near your hardware's limit.
The scanner might be optimized even further in the future, as this is only half of the story. It took me days just to document this and produce these Mermaid renders.
Straight to the point: here are some of the optimizations applied.
When designing the file ingestion engine for ntstrings, three distinct paradigms were evaluated:
- Memory-Mapped Files to map data directly into the virtual address space.
- Asynchronous Overlapped I/O (Double-Buffering): for example, NtReadFile with an EVENT or IOCP to fetch chunk $N+1$ while the CPU processes chunk $N$.
- Sequential NtReadFile calls blocking until the chunk is delivered.
Which one is the fastest here? At first glance nobody would believe the answer is 3, because it seems completely counterintuitive, but I chose it based on a reverse-engineering phase and experimental tests. This path involves multiple kernel functions that were reversed but are not shown here, such as CcScheduleReadAhead, CcSetAdditionalCacheAttributes, etc.
When you map a file into memory, you aren't actually loading the file into RAM. You are simply asking the kernel to create Virtual Address Descriptors (VADs).
__int64 __fastcall NtCreateSection(
__int64 *a1, __int64 a2, int a3, __int64 *a4, int a5, int a6,
__int64 a7, volatile void *Address, int a9, char a10, char a11)
{
// .setup and flag validation removed for brevity
if ( (a6 & 0xC000000) == 0xC000000 ) return 3221225716LL;
result = sub_140940BF8(Address, (__int64)&v30);
if ( (int)result >= 0 )
{
CurrentThread = KeGetCurrentThread();
v19 = *((_QWORD *)CurrentThread + 23); // Get current process
ProcessSessionId = PsGetProcessSessionIdEx(v19);
// attempt to create the section object and back it with the FileObject
while ( 1 )
{
v22 = sub_140941160((unsigned int)&Object, a3, v27, a5, a6, v17, (__int64)v12, v20, a7, 0LL, a11, ProcessSessionId, (__int64)&v30);
if ( v22 != -1073741740 ) break;
KeDelayExecutionThread(0, 0, (PLARGE_INTEGER)&stru_140014BE8);
}
if ( v22 < 0 ) return (unsigned int)v22;
// attach to the File Object (sub_1402D4800 / sub_1402464D0 handle the Object Manager structures)
v23 = sub_1402D4800(Object);
if ( v23 && *(_QWORD *)(v23 + 64) ) {
v25 = (struct _FILE_OBJECT *)sub_1402464D0(v23);
sub_14044C774(v25);
sub_14042C500(v24, v25);
}
// ...
}
return result;
}

Then we actually map those pages into the CPL3 address space:
__int64 __fastcall NtMapViewOfSection(
__int64 a1, __int64 a2, _QWORD *a3, ULONGLONG a4, int a5,
_QWORD *a6, _QWORD *a7, int a8, int a9, int a10)
{
// ...
result = sub_140899F20(a2, a1, 0LL, a3, a7, a6, a10, v10, v39, &v40);
if ( (int)result >= 0 )
{
v15 = sub_1402D4800(Object[0]);
// set up VAD properties and bounds checking
if ( (*(_DWORD *)(v15 + 56) & 0x20) == 0 && (v16 & 1) != 0 )
v18 = *((_QWORD *)Object[1] + 109) - 1LL;
// limits applied ...
// actually map the view (sub_14089A1A0 is likely MiMapViewOfSection)
v19 = sub_14089A1A0(Object[0], (unsigned int)&v47, (unsigned int)&v40, a5, (__int64)&v41, a8, 0);
if ( v19 >= 0 ) {
// Fire ETW telemetry events indicating a new memory map
if ( EtwProviderEnabled(qword_140EFED00, 0, 0xF00uLL) ) {
// ... tracing logic stuff
EtwWrite(qword_140EFED00, v25, 0LL, v32 + 1, v57);
}
*a7 = *((_QWORD *)&v48 + 1); // Return ViewSize
*a3 = v40; // Return BaseAddress
goto LABEL_20;
}
}
// ...
LABEL_20:
ObfDereferenceObjectWithTag(Object[0], 0x77566D4Du);
ObfDereferenceObjectWithTag(Object[1], 0x77566D4Du);
return (unsigned int)v19;
}

Notice that at no point does the kernel issue a hardware read. Data is brought in via demand paging: touching an unmapped page triggers a hardware #PF that halts your thread, builds a Paging IRP, and fetches the data.
Worse, if the disk encounters an error during this demand-paged read, the kernel raises EXCEPTION_IN_PAGE_ERROR through the SEH path, and you end up taking exception traps and page faults at every page boundary.
If we decide to explicitly manage our buffers, we might open the handle with FILE_FLAG_OVERLAPPED and FILE_FLAG_NO_BUFFERING. This instructs the kernel to bypass the cache and force the NVMe drive to DMA data directly into our user-mode buffer.
While it sounds fast, it forces the system to do direct-I/O bookkeeping on the critical path. Before the NVMe can DMA into user space, the kernel must call MmProbeAndLockPages to pin the memory into physical RAM, and later MmUnlockPages. This introduces lock contention in the Memory Manager's PFN database.
Furthermore, there's a common misconception about asynchronous completion. Standard overlapped I/O completes via IoCompleteRequest, which signals an Event or an IOCP. It does not queue an Asynchronous Procedure Call to your thread (APCs are reserved for the alertable ReadFileEx paths). However, even without APC interruptions, managing IRPs, polling events, and wrestling with strict sector-alignment limits adds serious overhead compared to a straight memory copy.
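For contrast, here is roughly what the rejected option 2 looks like. A minimal sketch assuming the standard ntdll prototypes (NtCreateEvent, NtReadFile, NtWaitForSingleObject), a handle opened without the synchronous flags, and hypothetical buf/CHUNK/process_chunk names; error and EOF handling omitted:

HANDLE ev[2];
IO_STATUS_BLOCK iosb[2];
LARGE_INTEGER off = { .QuadPart = 0 };
NtCreateEvent(&ev[0], EVENT_ALL_ACCESS, NULL, SynchronizationEvent, FALSE);
NtCreateEvent(&ev[1], EVENT_ALL_ACCESS, NULL, SynchronizationEvent, FALSE);
// kick off chunk 0 into buf[0]
NtReadFile(h_file, ev[0], NULL, NULL, &iosb[0], buf[0], CHUNK, &off, NULL);
for (int cur = 0;; cur ^= 1) {
    NtWaitForSingleObject(ev[cur], FALSE, NULL);        // wait until chunk N lands
    off.QuadPart += CHUNK;                              // queue chunk N+1 into the other buffer
    NtReadFile(h_file, ev[cur ^ 1], NULL, NULL, &iosb[cur ^ 1], buf[cur ^ 1], CHUNK, &off, NULL);
    process_chunk(buf[cur], iosb[cur].Information);     // scan N while N+1 is DMA'd in
}

That is the pattern in a nutshell: IRP juggling, pinned pages, and event bookkeeping on every single iteration.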
The goal is to hit the Fast I/O path and bypass IRP creation entirely.
The gatekeeper for Fast I/O is FastIoCheckIfPossible. Before the kernel lets you read directly from the cache, the file system must verify there are no conflicting byte-range locks using FsRtlFastCheckLockForRead:
BOOLEAN __stdcall FsRtlFastCheckLockForRead(
PFILE_LOCK FileLock, PLARGE_INTEGER StartingByte,
PLARGE_INTEGER Length, ULONG Key, PFILE_OBJECT FileObject, PVOID ProcessId)
{
// ...
LockInformation = (unsigned __int64 *)FileLock->LockInformation;
if ( !LockInformation || !LockInformation[5] || !Length->QuadPart )
return 1; // no locks exist, fast path approved
QuadPart = StartingByte->QuadPart;
v10 = StartingByte->QuadPart + Length->QuadPart - 1;
// acquire a DPC-level spinlock to safely read the lock state
v11 = KeAcquireSpinLockRaiseToDpc(LockInformation + 3);
// quick check against the FileObject's LastLock cache
LastLock = (unsigned __int64 *)FileObject->LastLock;
if ( !LastLock || QuadPart < *LastLock || v10 > LastLock[5] || *((_DWORD *)LastLock + 5) != Key ) {
// slow path check if LastLock cache missed
v14 = sub_1402479FC((int)LockInformation + 24, (unsigned int)&v17, (unsigned int)&v19, Key, (__int64)FileObject, (__int64)v13);
KeReleaseSpinLock(LockInformation + 3, v11);
return v14;
}
// ...
return 1;
}

Once the gate is passed, the system uses routines like CcCopyReadEx or CcFastCopyRead to execute a blistering-fast memcpy from the SHARED_CACHE_MAP into your user buffer. For drivers that need to map or pin the cache directly, the kernel provides tools like CcMapData and CcPinRead:
BOOLEAN __stdcall CcMapData(
PFILE_OBJECT FileObject, PLARGE_INTEGER FileOffset, ULONG Length,
ULONG Flags, PVOID *Bcb, PVOID *Buffer)
{
// setup
SharedCacheMap = FileObject->SectionObjectPointer->SharedCacheMap;
if ( Flags & 1 ) {
*Buffer = (PVOID)sub_140273EF0((_DWORD)SharedCacheMap, FileOffset->QuadPart, (unsigned int)&v16, (unsigned int)&v15, (Flags >> 6) & 1, 0);
} else {
// Pin the data into memory
result = sub_140272D30(v8, (_DWORD)FileOffset, Length, 1, 0, Flags, (__int64)&v16, (__int64)Buffer, (__int64)v17);
if ( !result ) return result; // Missed cache
}
*Bcb = (PVOID)(v16 + 1); // Return the Buffer Control Block
return 1;
}
BOOLEAN __stdcall CcPinRead(
PFILE_OBJECT FileObject, PLARGE_INTEGER FileOffset, ULONG Length,
ULONG Flags, PVOID *Bcb, PVOID *Buffer)
{
// ...
SharedCacheMap = FileObject->SectionObjectPointer->SharedCacheMap;
do {
// attempt to pin the cached pages, if it fails (sub_140272D30 returns 0),
// the data is not in RAM and a Hard Fault/IRP is necessary.
v14 = SharedCacheMap[38] >> 9;
LOBYTE(v14) = (SharedCacheMap[38] & 0x200) == 0;
if ( !(unsigned __int8)sub_140272D30(v8, (unsigned int)&QuadPart, v11, v14, 0, Flags, (__int64)v13, (__int64)&v19, (__int64)&v20) ) {
v16 = 0;
goto LABEL_12;
}
v12 = v20;
} while ( v20 - QuadPart < v11 );
// ...
}

Sometimes the kernel generates an MDL representing the cached pages using CcMdlRead. Notice how the kernel checks the PrivateCacheMap for the 0x20000 flag, triggering the lazy read-ahead mechanism (CcScheduleReadAheadEx):
void __stdcall CcMdlRead(
PFILE_OBJECT FileObject, PLARGE_INTEGER FileOffset, ULONG Length,
PMDL *MdlChain, PIO_STATUS_BLOCK IoStatus)
{
SharedCacheMap = FileObject->SectionObjectPointer->SharedCacheMap;
PrivateCacheMap = FileObject->PrivateCacheMap;
// read-ahead trigger
if ( (*PrivateCacheMap & 0x20000) != 0 )
CcScheduleReadAheadEx(FileObject, FileOffset, Length, 0LL);
while ( v11 ) {
VirtualAddress = (PVOID)sub_140273EF0((_DWORD)SharedCacheMap, QuadPart, ...);
// ...
// Allocate a memory descriptor list for the cache virtual address
Mdl = IoAllocateMdl(VirtualAddress, v13, 0, 0, 0LL);
MmProbeAndLockPages(Mdl, 0, IoReadAccess); // Lock the SYSTEM cache pages, not the user buffer
// ...
}
// If we missed the early check, schedule read ahead now
if ( (*PrivateCacheMap & 0x20000) == 0 && v19[0] )
CcScheduleReadAheadEx(FileObject, FileOffset, Length, 0LL);
IoStatus->Status = 0;
}
void __stdcall CcMdlReadComplete(PFILE_OBJECT FileObject, PMDL MdlChain)
{
// Cleanup the MDL generated by CcMdlRead
RelatedDeviceObject = IoGetRelatedDeviceObject(FileObject);
FastIoDispatch = RelatedDeviceObject->DriverObject->FastIoDispatch;
if ( !FastIoDispatch || !FastIoDispatch->MdlReadComplete ||
!(unsigned __int8)sub_1406B3DF0(FileObject, MdlChain, RelatedDeviceObject) )
{
sub_140A3C878(RelatedDeviceObject, MdlChain);
}
}

If you step into CcScheduleReadAhead (which is internally called by CcFastCopyRead via sub_1404992C0), you see the kernel's cache manager analyzing the sequential stride:
// v5+32 and v5+16 are previous read offsets (FileOffset1 / FileOffset2)
v31 = *(_QWORD *)(v5 + 32) - *(_QWORD *)(v5 + 16);
// If the current read offset (v51) perfectly matches the expected next sequence:
if ( *(_QWORD *)v51 - *(_QWORD *)(v5 + 32) == v31 )
{
v19 = 3;
LODWORD(v32) = v52;
// extrapolate the next future offset to read ahead
v46 = 2LL * *(_QWORD *)v51 - *(_QWORD *)(v5 + 32);
*(_QWORD *)(v5 + 56) = v46; // Save Read-Ahead offset
// page align the length of the read-ahead buffer to a 4KB boundary
v33 = ((_DWORD)v32 + 4095) & 0xFFFFF000;
// allocate an I/O work item and dispatch it
if ( (int)sub_1402A7D3C(v44, v47, v6, &v45) >= 0 )
{
// added to ExWorkerQueue for a background system thread to process
sub_1403C6140(CurrentThread, v40 + 24);
sub_1402A7488(v40, v6 + 88);
return;
}
}

The background system worker thread is the one that absorbs the cost of IRP creation, MmProbeAndLockPages, and waiting for IoCompleteRequest. As long as the cache is warm and FsRtlFastCheckLockForRead passes, the foreground application stays on the Fast I/O path.
You retrieve your data via a fast memory copy without ever constructing an I/O request packet.
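For completeness, the winning option 3 is almost embarrassingly simple. A minimal sketch (standard NtReadFile prototype; error and EOF handling trimmed; scan_chunk is a stand-in for the SIMD engines):

IO_STATUS_BLOCK iosb;
for (;;) {
    // synchronous handle: ByteOffset == NULL reads from the file object's current position
    NTSTATUS st = NtReadFile(h_file, NULL, NULL, NULL, &iosb,
                             buffer, CHUNK_SIZE, NULL, NULL);
    if (st != STATUS_SUCCESS || iosb.Information == 0)
        break;                                   // EOF or error
    scan_chunk(buffer, iosb.Information);        // AVX-512/AVX2 engines run here
}

Each call rides the Fast I/O path described above while the kernel's worker thread prefetches the next chunk in the background.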
To understand why explicit user-mode double-buffering was ditched, we must understand how NVMe drivers (stornvme.sys) interact with the PCIe bus and the NT Cache Manager.
When an application issues an I/O request, an IRP is generated. In a user-mode double-buffering scheme, the CPU is running AVX-512 instructions on Buffer A polluting L1/L2/L3 caches with target data while the NVMe controller is simultaneously performing DMA over PCIe into the physical memory backing Buffer B.
MESI tells us that when the PCIe root complex writes to physical RAM (Buffer B) via DMA, it issues invalidate signals to the CPU's Last Level Cache (the L3). If the CPU is simultaneously and aggressively fetching from Buffer A, the memory controller becomes heavily contended, the L3 cache thrashes, and the SIMD execution units stall waiting on memory fetches.
sequenceDiagram
participant App as ntstrings (User-Mode)
participant IO as NT I/O Manager
participant Cc as Cache Manager (Cc)
participant NVMe as NVMe Controller (Hardware)
participant RAM as Physical RAM
App->>IO: NtReadFile (16MB Chunk 1)
IO->>Cc: Request Data
Cc->>NVMe: Issue DMA Read (Chunk 1 + Chunk 2 Read-Ahead)
NVMe-->>RAM: DMA Write (PCIe Gen4)
RAM-->>Cc: Data in Standby Page List
Cc-->>IO: Map/Copy to User Buffer
IO-->>App: Return STATUS_SUCCESS
Note right of App: CPU begins AVX-512 processing.
Note left of NVMe: NVMe goes idle or prefetches deeper, avoiding PCI/RAM contention with SIMD.
App->>IO: NtReadFile (16MB Chunk 2)
IO->>Cc: Request Data
Note right of Cc: Chunk 2 is already in RAM (Standby List)
Cc-->>App: Instant Memory Copy (No Hardware Delay)
By passing FILE_SEQUENTIAL_ONLY | FILE_SYNCHRONOUS_IO_NONALERT to NtCreateFile, we instruct the kernel to perform the read-ahead mentioned before.
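A minimal sketch of how such a handle is opened, assuming the standard ntdll/winternl prototypes (the path is illustrative, error handling omitted):

UNICODE_STRING path;
OBJECT_ATTRIBUTES oa;
IO_STATUS_BLOCK iosb;
HANDLE h_file;
RtlInitUnicodeString(&path, L"\\??\\C:\\dump.bin");      // hypothetical target
InitializeObjectAttributes(&oa, &path, OBJ_CASE_INSENSITIVE, NULL, NULL);
NtCreateFile(&h_file,
             FILE_READ_DATA | SYNCHRONIZE,               // SYNCHRONIZE is mandatory for synchronous I/O
             &oa, &iosb,
             NULL,                                       // AllocationSize
             FILE_ATTRIBUTE_NORMAL,
             FILE_SHARE_READ,
             FILE_OPEN,                                  // open existing file
             FILE_SEQUENTIAL_ONLY | FILE_SYNCHRONOUS_IO_NONALERT,
             NULL, 0);                                   // no EAs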
CcScheduleReadAhead is the routine that actually performs read-ahead, and CcSetReadAheadGranularity controls the size of that prefetch.
void __stdcall CcSetReadAheadGranularity(PFILE_OBJECT FileObject, ULONG Granularity)
{
*((_DWORD *)FileObject->PrivateCacheMap + 1) = Granularity - 1;
}
void __fastcall CcSetReadAheadGranularityEx(__int64 a1, _DWORD *a2)
{
_DWORD *v2; // r8
_DWORD *v3; // rax
int v4; // ecx
int v5; // eax
v2 = *(_DWORD **)(a1 + 48);
if ( a2 )
{
v2[1] = a2[1] - 1;
v3 = v2 + 22;
v4 = a2[2];
if ( v4 )
{
*v2 |= 0x200000u;
*v3 = v4;
}
if ( *v3 > 0x800000u )
*v3 = 0x800000;
v5 = 50;
if ( a2[3] )
v5 = a2[3];
v2[23] = v5;
}
}

This is undocumented, but if you ever trace it with a debugger you'll find that when you pass FILE_SEQUENTIAL_ONLY to NtCreateFile, the kernel translates it into the FO_SEQUENTIAL_ONLY (0x20) bit inside FileObject->Flags.
If you open CcScheduleReadAhead in IDA, you will quickly run into accesses through Object, which is the FILE_OBJECT. In a 64-bit Windows kernel, the Flags field is located at offset 0x50 (80 in decimal).
For read-ahead operations, it checks the bit in two places.
When creating a new Shared Cache Map
// Object + 80 is FileObject->Flags. 0x20 is FO_SEQUENTIAL_ONLY.
if ( (*(_DWORD *)(Object + 80) & 0x20) != 0 )
v11[38] |= 0x40u; // Sets an internal "Sequential Optimization" flag in the SHARED_CACHE_MAP

Here, v11 is a newly allocated SHARED_CACHE_MAP. Because the file was opened for sequential access, the kernel sets a specific internal flag (0x40 at offset 152, i.e. v11[38]) that tells the system to fetch much larger blocks of data in the background.
When attaching to an EXISTING Shared Cache Map:
KeAcquireInStackQueuedSpinLockAtDpcLevel((PKSPIN_LOCK)(v14 + 768), &v71);
// If a new handle opens the file WITHOUT the sequential flag,
// strip the optimization flag so we don't fetch unnecessary data.
if ( (*(_DWORD *)(Object + 80) & 0x20) == 0 )
v20[38] &= ~0x40u;

If another program already opened this file, the Cache Manager reuses the existing SHARED_CACHE_MAP (v20). However, if your new handle does not specify FILE_SEQUENTIAL_ONLY, the kernel disables the aggressive read-ahead (v20[38] &= ~0x40u) because you might perform random reads.
You will also notice there are no direct checks for the synchronous I/O flag (FO_SYNCHRONOUS_IO, which is 0x02).
This is because FILE_SYNCHRONOUS_IO_NONALERT does not change how the Cache Map is initialized in CcInitializeCacheMap. Instead, it alters the behavior of the I/O Manager.
Is Option 3 a universal guarantee of absolute maximum speed? No. Microsoft introduced the BypassIO infrastructure in Windows 11 (the foundation of DirectStorage) precisely because the Cache Manager and Fast I/O paths do still impose CPU overhead (such as the memcpy inside CcCopyReadEx).
BypassIO allows supported NVMe storage and NTFS to strip away filter manager overhead and DMA directly to memory. For a completely cold sequential read on supported hardware, it will often outperform standard cached sequential reads because it removes the CPU from the data transfer path entirely. However, for a warm workload where data might already reside in the system cache, Option 3 will always beat BypassIO, as Fast I/O hits the RAM directly and completely avoids asking the NVMe drive for data.
The cache-performance counters explicitly list entries like SyncFastReads, SyncCopyReads, and ReadAheads as cache-based sequential-read mechanisms that reduce overhead and return control immediately when the data is already in memory. DirectStorage also exposes DisableBypassIO and ForceFileBuffering, which confirms that user-mode runtimes can explicitly opt in or out of the optimization.
This alone was enough to outweigh the performance benefits described for the other methods. Still, my tool supports DeviceIoControl calls to integrate the BypassIO mechanism on supported stacks after detecting a cold-cache run (not currently integrated), by simply doing:
if (_fs_control_file) {
FS_BPIO_INPUT bpio_in = { 0 };
FS_BPIO_OUTPUT bpio_out = { 0 };
bpio_in.Operation = FS_BPIO_OP_ENABLE;
bpio_in.InFlags = FSBPIO_INFL_None;
bpio_in.Reserved1 = 0;
bpio_in.Reserved2 = 0;
bpio_out.Operation = FS_BPIO_OP_ENABLE;
bpio_out.OutFlags = 0;
bpio_out.Reserved1 = 0;
bpio_out.Reserved2 = 0;
_fs_control_file(h_file, NULL, NULL, NULL, &iosb, FSCTL_MANAGE_BYPASS_IO,
&bpio_in, sizeof(bpio_in), &bpio_out, sizeof(bpio_out));
}

The most critical optimization in process_single_file occurs during memory allocation:
SIZE_T large_page_size = GetLargePageMinimum();
if (large_page_size > 0) {
alloc_size = (raw_alloc_size + large_page_size - 1) & ~(large_page_size - 1);
}
const ULONG alloc_type_large = MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES;
status = _allocate_virtual_memory(current_process, &buffer, 0, &alloc_size, alloc_type_large, PAGE_READWRITE);

A modern x86-64 CPU does not access physical memory directly; it relies on the Memory Management Unit to translate virtual addresses to physical addresses. I chose a 16-megabyte buffer; with the default 4-kilobyte (4KB) Windows pages, covering it requires exactly 4,096 page translations.
Because page table walks require reading from RAM, the CPU caches recent translations in the Translation Lookaside Buffer. A typical Intel core has an L1 D-TLB of roughly 64 entries and an L2 TLB of ~1536 entries. When this program blasts through a 16MB buffer using AVX-512 (fetching 64 bytes per clock cycle), it traverses 4KB in fractions of a microsecond. The CPU rapidly exhausts the TLB.
Once the TLB is full, every new 4KB boundary crossing incurs a TLB miss. The CPU must halt execution and perform a page-table walk.
A standard 4KB page table walk requires 4 sequential memory reads:
- CR3 Register points to the PML4 (Page Map Level 4).
- PML4 entry points to the PDPT (Page Directory Pointer Table).
- PDPT entry points to the PD (Page Directory).
- PD entry points to the PT (Page Table).
- PT entry contains the final Physical Address.
If the page tables are not in the CPU cache, a TLB miss can cost upwards of 10-100 nanoseconds per page, severely starving the SIMD engines I've implemented.
ntstrings instead asks the virtual memory manager to allocate contiguous 2-megabyte (2MB) pages. On x86-64 hardware, the 2MB mapping stops at the Page Directory (PD) level.
The PS bit in the PD entry is set to 1, so it bypasses the final PT lookup.
graph TD
VA[Virtual Address] --> |Bits 39-47| PML4[PML4 Table]
PML4 --> |Bits 30-38| PDPT[PDP Table]
PDPT --> |Bits 21-29| PD[Page Directory]
subgraph Standard 4KB Paging
PD --> |PS=0| PT[Page Table]
PT --> |Bits 12-20| PA_4K[4KB Physical Frame]
end
subgraph Large Page Optimization
PD --> |PS=1 <br> Bypass PT| PA_2M[2MB Physical Frame]
end
style PA_2M stroke:#0f0,stroke-width:4px
style PT stroke:#f00,stroke-width:2px,stroke-dasharray: 5 5
A 16MB buffer now requires only 8 TLB entries instead of 4,096, so the entire buffer comfortably fits inside the CPU's L1 TLB. TLB misses drop effectively to zero, so the AVX vectorizer can sustain peak IPC.
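One caveat worth a sketch: MEM_LARGE_PAGES (and NtLockVirtualMemory below) require SeLockMemoryPrivilege, so the allocation should fall back to regular 4KB pages when the privilege is missing. A hedged example using the documented Win32 equivalents (not necessarily how ntstrings does it):

// Enable SeLockMemoryPrivilege on the current process token, then try large pages.
HANDLE tok;
TOKEN_PRIVILEGES tp = { .PrivilegeCount = 1 };
OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &tok);
LookupPrivilegeValueW(NULL, L"SeLockMemoryPrivilege", &tp.Privileges[0].Luid);
tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
AdjustTokenPrivileges(tok, FALSE, &tp, 0, NULL, NULL);
CloseHandle(tok);

void* buffer = VirtualAlloc(NULL, alloc_size,
                            MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES,
                            PAGE_READWRITE);
if (!buffer)  // privilege missing or physical memory too fragmented
    buffer = VirtualAlloc(NULL, alloc_size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);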
If system memory pressure rises, the kernel proactively trims pages from process working sets, unmapping the PTEs/PDEs and moving the physical frames to the Standby list.
If the kernel trims parts of our 16MB buffer while ntstrings is scanning it, the CPU throws a Page Fault (#PF). The kernel catches the fault, executes its handler, rebuilds the PTE, and resumes the thread.
if (_lock_virtual_memory) {
PVOID lock_addr = buffer;
SIZE_T lock_sz = alloc_size;
_lock_virtual_memory(current_process, &lock_addr, &lock_sz, 1);
}

Due to this, the program interfaces directly with the kernel's working-set list via NtLockVirtualMemory.
Once the chunk is securely placed in locked, large-page memory, the execution transitions to the CPU execution units via SIMD vectoring.
if (g_cpu_caps.has_avx512) {
__avx512_buf(p, end, f_ctx, out_ctx, tail);
}

The design defers to the widest possible vector instructions.
The synchronous linear loop (while (search_start < end)) is intentionally branchless where possible. Branch-predictor misses (pipeline flushes) cost ~15-20 cycles. _mm512_cmpeq_epi8_mask is used to determine the presence of valid strings mathematically rather than through control flow.
A primary challenge of chunked memory processing is the "split string" problem. If the word PASSWORD exists exactly across the boundary of the 16MB chunk (e.g., PASS at byte 16,777,212 and WORD at byte 0 of the next read), a naive scanner will miss it.
if (tail->len > 0) {
char* scan = p;
while (scan < end && is_printable_ascii((unsigned char)*scan)) {
scan++;
}
size_t extra = scan - p;
// ... memcpy to tail ...
check_and_output_with_search(tail->buf, tail->len, f_ctx, out_ctx);
tail->len = 0;
p += extra;
}

- At the end of chunk $N$, if the last bytes are printable, they are reverse-scanned to find the start of the contiguous string.
- This "tail" is copied to a separate tail_buffer.
- When chunk $N+1$ is read, ntstrings immediately checks whether the start of the new chunk is a continuation of a printable string.
- If so, it scans forward (scan++) until the string terminates, copies the extra bytes into the tail_buffer, processes the merged buffer, and then increments the main pointer p past the split string.
flowchart TD
subgraph Chunk N Execution
S1[Scan 16MB Buffer] --> C1{Ends with Printable Data?}
C1 -- Yes --> R1[Reverse scan to find start]
R1 --> T1[Copy 'PASS' to Tail Buffer]
C1 -- No --> FinishN[Acknowledge completion]
end
subgraph Chunk N+1 Execution
StartN1[Load Next 16MB Buffer] --> C2{Tail Buffer Has Data?}
C2 -- Yes --> F1[Forward scan start of new chunk]
F1 --> Merge[Extract 'WORD', Merge to Tail]
Merge --> ProcTail[Process 'PASSWORD' via SIMD]
ProcTail --> AdvP[Advance Main Pointer past 'WORD']
AdvP --> S2[Resume standard SIMD scan of Chunk N+1]
C2 -- No --> S2
end
T1 --> StartN1
With the I/O and memory hierarchy effectively silenced, the bottleneck shifts entirely to the CPU execution units. Scanning a 16MB buffer for contiguous printable ASCII strings sounds like a trivial while loop, but if you've ever profiled scalar code over gigabytes of memory, you know the branch predictor will eat you alive.
A scalar loop (if (c >= 0x20 && c <= 0x7E)) contains a conditional branch for every single byte. When a file contains binary noise mixed with text, the branch history table becomes utterly chaotic and the pipeline flushes constantly.
To fix this, we process the buffer in 32-byte (AVX2) or 64-byte (AVX-512) strides using pure mathematics... No branches are evaluated until we already know exactly where a string begins and ends.
We need to determine if 32 distinct characters fall within the printable ASCII range (0x20 to 0x7E).
If you dive into the Intel intrinsics guide, you'll notice there is no native unsigned less-than-or-equal comparison for 8-bit integers.
If I use something like _mm256_cmpgt_epi8, bytes with the high bit set (like 0x80, which are not ASCII) would be evaluated as negative numbers, requiring us to perform multiple masking and range-checking steps. This eats up execution ports.
Instead, the algorithm implements a wrap-around math trick.
The range of printable ASCII is 0x5E characters wide (0x7E - 0x20 = 0x5E).
If we subtract 0x20 from every byte in the 32-byte vector, printable characters shift down to the 0x00 - 0x5E range. What happens to bytes that were less than 0x20 (like null bytes or carriage returns)? Because we are doing unsigned subtraction, they underflow and wrap around to massive values (0xFF, 0xFE, etc.).
So, our problem reduces to: Are the resulting unsigned bytes $\le$ 0x5E?
An earlier revision used a different formulation here, but macros turned out best. Since AVX2 lacks cmple_epu8, we abuse the maximum function (yes, you can actually do this in C):
#define AVX2_CHECK_PRINTABLE(v_data, v_min, v_range, v_result) { \
__m256i v_diff = _mm256_sub_epi8(v_data, v_min); \
v_result = _mm256_cmpeq_epi8(_mm256_max_epu8(v_diff, v_range), v_range); \
}

If a byte is less than or equal to 0x5E, calculating MAX(byte, 0x5E) will always equal 0x5E. If it is greater than 0x5E, the MAX function returns the larger value, i.e. the byte itself.
Therefore, if MAX(v_diff, 0x5E) == 0x5E, the character is printable. This condenses the entire bounds check into three micro-ops (psubb, pmaxub, pcmpeqb) that pipeline across execution ports 0, 1, and 5.
graph TD
subgraph Vector Register [YMM0 - 32 Bytes]
B1[0x41 'A'] --> |sub 0x20| D1[0x21]
B2[0x0A '\\n'] --> |sub 0x20| D2[0xEA underflow]
B3[0x7A 'z'] --> |sub 0x20| D3[0x5A]
end
subgraph max_epu8 Operation [Against 0x5E]
D1 --> |Max| M1[0x5E]
D2 --> |Max| M2[0xEA]
D3 --> |Max| M3[0x5E]
end
subgraph cmpeq_epi8[Equals 0x5E?]
M1 --> |== 0x5E| R1[0xFF Valid]
M2 --> |!= 0x5E| R2[0x00 Invalid]
M3 --> |== 0x5E| R3[0xFF Valid]
end
style R1 fill:#0a0,stroke:#0f0,stroke-width:2px
style R3 fill:#0a0,stroke:#0f0,stroke-width:2px
style R2 fill:#a00,stroke:#f00,stroke-width:2px
Once we have this resulting vector (0xFF for printable, 0x00 for non-printable), we extract the MSB of each byte into a standard 32-bit integer scalar using _mm256_movemask_epi8. We now have a standard uint32_t bitmask where 1 represents a valid string character.
AVX-512 is fundamentally different from AVX2. It utilizes the EVEX prefix, which introduces dedicated hardware opmask registers (k0 through k7). Instead of generating a 512-bit vector of 0xFF and 0x00 and wasting cycles pulling it out into a GPR via movemask, AVX-512 comparison instructions write directly to a 64-bit mask register.
Furthermore, AVX-512 finally implements the unsigned comparison instruction we were missing. (So our macro becomes blissfully simple):
#define AVX512_CHECK_PRINTABLE(v_data, v_min, v_range) \
_mm512_cmple_epu8_mask(_mm512_sub_epi8(v_data, v_min), v_range)

This directly returns a __mmask64 (a uint64_t). We process 64 bytes in precisely two instructions.
Early AVX-512 implementations (like Skylake-X) gained a bad reputation for aggressively downclocking the CPU ring bus and cores when executing 512-bit instructions (the "AVX-512 offset"). However, because our payload consists strictly of integer byte operations (vpcmpub, vpsubb), it utilizes the "Light" AVX-512 power license. The CPU stays pinned at near-max boost clocks, so the throughput will scale linearly.
Having a 32-bit or 64-bit mask is only half the battle. We must locate the start and end indices of contiguous 1s (our strings) without writing loops that would induce branch-prediction failures.
Let's dissect the mask parsing logic used in __avx512_buf:
uint64_t s_idx = tzcnt_u64_custom(m);
uint64_t not_m = ~m & ~((1ULL << s_idx) - 1);

If our 64-byte chunk looks like [binary garbage]HELLO[binary garbage], our uint64_t mask m will have a sequence of five 1s embedded somewhere inside it.
I used the BMI1 instruction tzcnt, because unlike the legacy bsf which has undefined behavior if the input is zero, tzcnt is a single-micro-op instruction that executes on Port 1. It returns the exact bit index of the lowest set bit. This s_idx is the start of our string.
At the time, this was the single best optimization yet.
We need to find the first 0 after s_idx. A naive approach would be a while loop checking bits sequentially. That is slow.
Instead, we invert the mask (~m). Now, all the 0s (non-printable boundaries) become 1s.
However, there are also 1s before our string started (the original binary garbage). We must clear them.
((1ULL << s_idx) - 1) generates a mask of all 1s up to s_idx. We bitwise-NOT this to create a mask that zeros out everything before our string.
Bitwise-ANDing this with ~m (not_m = ~m & ...) yields a value where the lowest set bit is guaranteed to be the exact boundary where our string ends.
sequenceDiagram
participant M as Original Mask (m)
participant TZ as tzcnt(m)
participant Inv as Inverted (~m)
participant Clr as Cleared (not_m)
participant EndIdx as tzcnt(not_m)
Note over M: 0b0011 1000 (String at index 3, len 3)
M->>TZ: tzcnt(0b0011 1000)
TZ-->>M: s_idx = 3
M->>Inv: Bitwise NOT
Note over Inv: 0b1100 0111
Inv->>Clr: Apply Clearance Mask <br/> ~((1<<3) - 1)
Note over Clr: 0b1100 0111 & 0b1111 1000 <br/> = 0b1100 0000
Clr->>EndIdx: tzcnt(0b1100 0000)
EndIdx-->>Clr: e_idx = 6
Note over EndIdx: String length = 6 - 3 = 3.
If not_m is zero, it means the string runs off the edge of our 64-byte vector and into the next chunk. We simply pass the start pointer down to the tail-buffer mechanism (explained previously), advance curr by 64 bytes, and continue the SIMD loop.
If you're scanning Windows memory dumps or NT kernel files, you are dealing with WCHAR. In memory, the string "Hello" is represented as 48 00 65 00 6C 00 6C 00 6F 00.
From a microarchitectural standpoint, scanning UTF-16 is a nightmare. A scalar loop would have to read a byte, check if it's printable, advance one byte, check if it's a null byte (0x00), and repeat. The branch predictor will completely fall apart on this.
Instead of writing a separate, complex SIMD scanner for UTF-16, I use a vectorization trick that dynamically "swizzles" (compresses) 32 bytes of UTF-16 down to 16 bytes of pure ASCII in roughly 4 clock cycles, and then feeds it right back into my ASCII engine.
Let's look at fast_unicode_convert_and_emit:
const __m256i v_high_mask = _mm256_set1_epi16((short)0xFF00);
__m256i chunk = _mm256_loadu_si256((const __m256i*)(src + i * 2));
if (_mm256_testz_si256(chunk, v_high_mask)) {
__m256i packed = _mm256_packus_epi16(chunk, chunk);
__m256i perm = _mm256_permute4x64_epi64(packed, 0x08);
_mm_storeu_si128((__m128i*)(dst + i), _mm256_castsi256_si128(perm));
}

First, we load 32 bytes (16 wide characters) into a YMM register. We use _mm256_testz_si256 against 0xFF00, which instantly tells us whether all 16 characters have 0x00 in their high byte (meaning they are standard English/ASCII characters padded with nulls).
If true, we need to strip the nulls. We do this using _mm256_packus_epi16. This instruction treats the register as containing 16-bit integers and attempts to pack them down into 8-bit integers by removing the upper byte.
However, there's a massive hardware catch with AVX2: It operates across two separate 128-bit lanes that cannot naturally talk to each other.
If you just pack the register, you don't get 16 contiguous bytes. You get 8 bytes from Lane A, followed by 8 bytes of garbage (since we packed chunk against itself), then 8 bytes from Lane B, followed by more garbage.
To fix this, I noticed I could perform a cross-lane permutation using _mm256_permute4x64_epi64(packed, 0x08). The magic constant 0x08 is a control byte (00_00_10_00 in binary). It instructs the CPU's shuffle unit to reach across the 128-bit boundary, grab the valid 8-byte chunk from Lane B, and slide it directly next to the 8-byte chunk from Lane A.
graph TD
subgraph Original UTF-16 Register [YMM - 32 Bytes]
L1[Lane A: 8 WCHARs <br> 'H.\e.\l.\l.\o.\..\..\..']
L2[Lane B: 8 WCHARs <br> 'W.\o.\r.\l.\d.\..\..\..']
end
subgraph Pack Instruction [_mm256_packus_epi16]
P1[Lane A: 8 chars 'Hello...' + 8 chars Copy]
P2[Lane B: 8 chars 'World...' + 8 chars Copy]
end
subgraph Cross-Lane Permute[_mm256_permute4x64 0x08]
F1[Bottom 16 Bytes: <br> 'Hello...World...']
F2[Top 16 Bytes: <br> Ignored/Garbage]
end
L1 --> P1
L2 --> P2
P1 -->|Shuffle| F1
P2 -->|Shuffle across 128b boundary| F1
Finally, we cast the 256-bit register down to a 128-bit register (_mm256_castsi256_si128) and execute a single 16-byte store (_mm_storeu_si128).
So... we just converted 32 bytes of Unicode to 16 bytes of ASCII without evaluating a single if-statement, taking this path from milliseconds down to roughly 4 nanoseconds.
When a user passes the -f flag to search for a specific string (e.g., -f "password"), doing a standard file dump is no longer sufficient. We have to pattern match.
Doing an exact match is easy (_mm256_cmpeq_epi8), but what if the user passes the -i (case-insensitive) flag? The target file might contain "PASSWORD", "Password", or "pAsSwOrD".
A traditional standard-library strcasestr will run tolower() on the buffer character by character. This involves branching (if (c >= 'A' && c <= 'Z')). In a 100GB dump, that adds hours to the scan time, and it is why most string scanners out there are painfully slow on large inputs.
Look at find_avx2_nocase:
__m256i v_case_mask = _mm256_set1_epi8(0x20);
// Inside the loop:
__m256i l0 = _mm256_or_si256(v0, v_case_mask);
__m256i c0 = _mm256_cmpeq_epi8(l0, v_first);

The core of BMH (Boyer-Moore-Horspool) relies on a precomputed "bad character" shift table. Let $m$ be the needle length and $c$ the haystack byte currently aligned with the needle's last position. The shift value for $c$ is $m$ when $c$ does not occur in the first $m-1$ needle characters, and $m - 1 - i$ otherwise, where $i$ is the last index of $c$ in that prefix. If the character under the window's last position never appears in the needle, the whole window can skip forward by the full needle length.
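For reference, a minimal sketch of the classic scalar bad-character table (the textbook Horspool version, not necessarily the exact table ntstrings builds):

size_t skip[256];
for (int i = 0; i < 256; i++)
    skip[i] = needle_len;                          // absent chars: jump a full needle length
for (size_t i = 0; i + 1 < needle_len; i++)        // every needle char except the last
    skip[(unsigned char)needle[i]] = needle_len - 1 - i;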
I fuse the mathematical concept of BMH with 256-bit SIMD registers. Standard BMH shifts a single scalar pointer. The AVX2 implementation checks 32 independent BMH positions simultaneously.
__m256i v_first = _mm256_set1_epi8(first);
__m256i v_last = _mm256_set1_epi8(last);
// Load 32 bytes at offset 0 (Checking the first character of 32 parallel needles)
__m256i v0 = _mm256_load_si256((const __m256i*)p);
// Load 32 bytes at offset 'last_idx' (Checking the BMH last character of 32 needles)
__m256i vl0 = _mm256_loadu_si256((const __m256i*)(p + last_idx));
__m256i c0 = _mm256_cmpeq_epi8(v0, v_first);
c0 = _mm256_and_si256(c0, _mm256_cmpeq_epi8(vl0, v_last));

By aligning vl0 exactly at p + last_idx, we are evaluating the BMH "last character" heuristic across 32 potential string starts in a single clock cycle. If both the first and last characters don't match, c0 becomes zero, and we skip 32 bytes instantly.
Back to the -i (case-insensitive) flag: the file might contain "PASSWORD", "Password", or "pAsSwOrD", and we have already ruled out running tolower() byte by byte.
In the ASCII table, the difference between any uppercase letter and its lowercase equivalent is exactly 32 (or 0x20).
- 'A' is 0x41 (binary 0100 0001)
- 'a' is 0x61 (binary 0110 0001)
The only difference is bit 5 (value 0x20). If we take any byte in our YMM register and perform a bitwise OR with 0x20, uppercase letters are instantly transformed into lowercase letters.
__m256i v_case_mask = _mm256_set1_epi8(0x20);
__m256i l0 = _mm256_or_si256(v0, v_case_mask);
__m256i c0 = _mm256_cmpeq_epi8(l0, v_first);

Wait, doesn't this corrupt other characters?
Yes. If we OR the symbol [ (0x5B) with 0x20, it turns into { (0x7B).
But we don't care. We pre-lowercase our target "needle" during initialization, and we are only checking whether the corrupted buffer matches our needle's first/last letter. No numeric or symbol ASCII character will mathematically evaluate to a lowercase letter when OR'd with 0x20. This lets us perform the case-insensitive BMH anchor checks completely branch-free.
graph TD
subgraph Sub1 ["32-Byte Vector 0 (Offset 0)"]
V0["[p] [x] [a] [p] ... [x]"]
end
subgraph Sub2 ["32-Byte Vector 1 (Offset + 7)"]
V1["[x] [d] [x] [d] ... [x]"]
end
subgraph Sub3 ["Compare == 'p'"]
C0["[1] [0] [0] [1] ... [0]"]
end
subgraph Sub4 ["Compare == 'd'"]
C1["[0] [1] [0] [1] ... [0]"]
end
subgraph Sub5 ["Bitwise AND (c0 & c1)"]
AND["[0] [0] [0][1] ... [0]"]
end
V0 --> C0
V1 --> C1
C0 --> AND
C1 --> AND
style AND fill:#0a0,stroke:#0f0,stroke-width:2px
If you examine the AVX2 scan loop, you will notice that I don't just load one 32-byte vector and process it. We explicitly unroll the loop four times, creating four independent instruction streams (v0 through v3).
__m256i v0 = _mm256_load_si256((const __m256i*)p);
__m256i v1 = _mm256_load_si256((const __m256i*)(p + 32));
__m256i v2 = _mm256_load_si256((const __m256i*)(p + 64));
__m256i v3 = _mm256_load_si256((const __m256i*)(p + 96));
__m256i c0 = _mm256_cmpeq_epi8(v0, v_first);
__m256i c1 = _mm256_cmpeq_epi8(v1, v_first);
// ...
__m256i combined = _mm256_or_si256(_mm256_or_si256(c0, c1), _mm256_or_si256(c2, c3));

A vector load instruction (vmovdqu) fetching data from the L1 data cache takes roughly 3 to 4 clock cycles of latency to complete. If we wrote a standard loop (Load -> Compare -> Test -> Advance), the CPU's execution units would stall for ~3 cycles waiting for the data to arrive.
However, Intel and AMD cores have multiple Load/Store ports (typically 2 or 3) and a pipelined architecture. Their throughput for vector loads is 2 per cycle.
By writing four sequential loads, we allow the CPU's register renamer to map v0, v1, v2, and v3 to distinct physical registers. The instruction scheduler places all four loads into the reservation station simultaneously.
Because they have no data dependencies on each other, the CPU executes them (or at least it should) out-of-order, fetching 128 bytes of data from the L1 cache at the exact same time.
Once the data is loaded and compared (c0 through c3), we must evaluate if any of the 128 bytes matched. We reduce the four masks into a single combined mask.
Notice the specific nesting of the bitwise OR operations: _mm256_or_si256(c0, c1) and _mm256_or_si256(c2, c3).
Instead of a linear chain (c0 | c1 | c2 | c3), which would require 3 sequential clock cycles (latency of 1 cycle per vpor), I structure it as a binary tree.
graph TD
subgraph Sub1 ["Cycle 1: Parallel Evaluation"]
C0["c0"]
C1["c1"]
C2["c2"]
C3["c3"]
end
subgraph Sub2 ["Cycle 2: First-Level OR (Parallel)"]
OR1["c0 OR c1"]
OR2["c2 OR c3"]
end
subgraph Sub3 ["Cycle 3: Final Reduction"]
FINAL["Final OR"]
end
C0 --> OR1
C1 --> OR1
C2 --> OR2
C3 --> OR2
OR1 --> FINAL
OR2 --> FINAL
c0|c1 and c2|c3 are evaluated simultaneously on two different ALU execution ports. The final reduction takes one more cycle. I compress a 12-cycle sequential latency chain down to effectively 5 cycles of parallel execution.
When the AVX2 engine identifies a potential BMH anchor match, it must drop out of the SIMD loop and verify the full string. Normally, a programmer would call memcmp or use a for loop to check the remaining characters.
If we use a for loop, the branch predictor will attempt to guess if the inner characters match. As we concluded before, in random binary data, false positive anchors are common, meaning the loop will fail unpredictably.
We bypass loop branching entirely in eq_fast_scalar by treating memory not as an array of characters, but as a mathematical integer. I didn't rely on memcmp because it's roughly 2x slower here.
uint64_t chunk;
memcpy(&chunk, p, sizeof(chunk)); // compiles to a single: mov r64, [p]
if ((chunk & ctx->needle_mask) != ctx->needle_u64) {
return 0;
}

During initialization (in main.c), the needle is padded into a 64-bit integer (needle_u64), and a needle_mask is generated (for example, if the search string is 5 bytes long, the mask contains 1s for the first 5 bytes and 0s for the remaining 3).
During initialization, if the user searches for the word password, we calculate its 64-bit integer representation. On a Little-Endian machine, the string 'p', 'a', 's', 's', 'w', 'o', 'r', 'd' becomes the hexadecimal integer 0x64726F7773736170. We store this in ctx->needle_u64.
If the search string is shorter than 8 bytes (e.g., admin), we create a bitmask (ctx->needle_mask) that contains 1s for the valid bytes and 0s for the unused bytes.
When a match is suspected at pointer p, we load the next 8 bytes directly into a CPU GPR via a 64-bit load. We apply our bitmask using a single bitwise AND, and compare it to our precomputed target using a single CMP.
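The initialization side of this is tiny. A sketch of how needle_u64 and needle_mask can be derived (names taken from the text; the exact main.c code may differ):

uint64_t needle_u64 = 0, needle_mask;
memcpy(&needle_u64, needle, needle_len < 8 ? needle_len : 8); // little-endian pack: "admin" -> 0x0000006E696D6461
needle_mask = (needle_len >= 8) ? ~0ULL
                                : ((1ULL << (needle_len * 8)) - 1); // 1s over the valid low bytes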
flowchart LR
subgraph Sub1 ["Memory (Pointer P)"]
M["[ a | d | m | i | n | ? | ? | ? ]"]
end
subgraph Sub2 ["CPU ALU Operations"]
L["Load into QWORD"] --> AND["Bitwise AND with Mask"]
AND -->|"0x0000006E696D6461"| CMP["Compare == needle_u64"]
end
M --> L
style AND fill:#2d2,stroke:#0f0,stroke-width:2px
What used to be 8 individual byte loads, 8 comparisons, and 8 conditional branches is mathematically reduced to exactly three assembly instructions (mov, and, cmp).
If the string is longer than 8 bytes, we only then fall back to checking the remaining body.
Hardware manufacturers include automatic stream prefetchers that try to guess your access patterns and load data into the cache ahead of time. However, the scanner was easily outpacing the hardware prefetcher.
To seize control of the memory subsystem, ntstrings injects explicit software prefetch instructions directly into the SIMD unrolled loops:
const char* unroll_end = end - 128;
while (p <= unroll_end) {
_mm_prefetch(p + 256, _MM_HINT_T0);
__m256i v0 = _mm256_load_si256((const __m256i*)p);
// ... vector execution ...
}

The _mm_prefetch intrinsic translates to the prefetcht0 assembly instruction. This instruction is unique because it is entirely asynchronous and "non-blocking". If the memory address is invalid, it doesn't throw a segmentation fault; the CPU just ignores it.
So _mm_prefetch(p + 256, _MM_HINT_T0) essentially says: "I am currently processing the data at pointer p. I guarantee that in exactly two loop iterations I will need the memory at p + 256. Go fetch it right now while I'm busy."
The _MM_HINT_T0 flag basically tells the CPU to pull the 64-byte cache line from RAM, through the L3 cache, through the L2 cache, and place it directly into the L1d. By the time our pointer p advances to p + 256, the data is already physically sitting millimeters away from the execution units.
sequenceDiagram
participant ALU as AVX Execution Units
participant L1 as L1 Data Cache
participant MemCtr as Memory Controller
participant RAM as Physical RAM
ALU->>MemCtr: prefetcht0 (p + 256) (Non-blocking)
Note right of ALU: ALU immediately processes data at `p`
MemCtr->>RAM: Request Address (p + 256)
RAM-->>MemCtr: Return 64-byte cache line
MemCtr->>L1: Store line in L1d
Note right of ALU: Loop advances. Pointer reaches `p + 256`
ALU->>L1: vmovdqu (p + 256)
L1-->>ALU: Instant 4-cycle response
If you look closely at the beginning of any of the SIMD engines in ntstrings (e.g., find_avx2_long or scan_buffer_avx2), you'll notice a strange scalar loop right before the main vector loop begins:
while (((uintptr_t)p & 31) && p < end) {
if (*p == first && eq_fast_scalar(p, ctx)) {
return p;
}
p++;
}

Why fall back to a scalar loop after I spent this whole document calling byte-by-byte processing garbage? To align the pointer to a 32-byte boundary.
If we ask the CPU to perform a 32-byte AVX2 load (vmovdqu) at an address like 0x1000, the data spans from 0x1000 to 0x101F. This fits inside a single 64-byte hardware cache line.
But what if our pointer p happens to start at 0x1030?
Our 32-byte read will span from 0x1030 to 0x104F.
The first cache line holds data from 0x1000 to 0x103F. The next cache line holds data from 0x1040 to 0x107F.
Because our AVX2 read crosses the 0x103F boundary, the CPU suffers a cache-line split. The load unit must now access two separate L1 cache lines, merge the data in a temporary hardware buffer, and then push it to the YMM register. If the split crosses a 4KB page boundary, it's even worse, because it requires two TLB lookups.
graph TD
subgraph Sub1 ["Memory Access Types"]
subgraph Sub2 ["Aligned Pointer (p & 31 == 0)"]
CL1_A["Cache Line N: Bytes 0 - 63"]
READ_A["AVX2 Read: Bytes 0 - 31"]
CL1_A -->|"Fit: 1 L1 Access"| READ_A
end
subgraph Sub3 ["Unaligned Pointer (Crosses Boundary)"]
CL1_U["Cache Line N: Bytes 0 - 63"]
CL2_U["Cache Line N+1: Bytes 64 - 127"]
READ_U["AVX2 Read: Bytes 48 - 79"]
CL1_U -->|"Fetch bytes 48-63"| MergeUnit["CPU Hardware Merge"]
CL2_U -->|"Fetch bytes 64-79"| MergeUnit
MergeUnit -->|"Stall / Penalty"| READ_U
end
end
style READ_A fill:#2d2,stroke:#0f0,stroke-width:2px
style MergeUnit fill:#d22,stroke:#f00,stroke-width:2px
By adding ((uintptr_t)p & 31), we perform a bitwise check on the raw memory address. If the lowest 5 bits are not zero, the address is not a multiple of 32. We process bytes manually (p++) until the pointer snaps perfectly to a 32-byte boundary. From that moment on, every subsequent p += 32 in the SIMD loop is mathematically guaranteed to never cross a cache line.
In our vector buffers, once we identify that a chunk of memory contains valid strings, we get a 32-bit or 64-bit integer mask representing the exact indices. We discussed how we use trailing zero count to find the first index. But how do we iterate to the next hit in the mask without doing a scalar for loop over all 64 bits?
With the BLSR instruction.
static _force_inline uint64_t blsr_u64_custom(uint64_t mask) {
return mask & (mask - 1);
}
while (m0) {
uint64_t off = tzcnt_u64_custom(m0);
if (eq_fast_scalar(p + off, ctx)) {
return p + off;
}
m0 = blsr_u64_custom(m0);
}

The expression mask & (mask - 1) is a classic piece of bit twiddling. Let's trace it mathematically.
Suppose we have a mask where bits 3 and 6 are set (representing string hits at indices 3 and 6).
m0 = 0b0100 1000 (Decimal 72)
- We call tzcnt(m0). It returns 3. We process the string at index 3.
- We need to clear bit 3 so our next tzcnt finds bit 6.
- We calculate m0 - 1. 72 - 1 = 71. In binary, 71 is 0b0100 0111. Notice what happened: the lowest set bit (bit 3) turned into a 0, and all the 0s to the right of it turned into 1s (due to binary borrowing).
- We perform the bitwise AND: m0 & (m0 - 1). 0b0100 1000 AND 0b0100 0111 = 0b0100 0000.
The lowest set bit was instantly erased, and all other bits remained untouched.
flowchart TD
M1["m0: 0100 1000"]
M2["m0 - 1: 0100 0111"]
AND{"Bitwise AND"}
RES["Result: 0100 0000"]
M1 --> AND
M2 --> AND
AND --> RES
style M1 fill:#333,stroke:#fff
style M2 fill:#333,stroke:#fff
style RES fill:#0a0,stroke:#0f0,stroke-width:2px
Practically all compilers nowadays map the C expression m & (m - 1) directly to the hardware blsr assembly instruction. By doing this, our while(m0) loop executes exactly as many times as there are hits in the mask. If there are 2 hits in a 64-byte vector, the loop runs twice. It completely bypasses checking the other 62 empty bytes.
The problem with hand-tuned SIMD and inline assembly is that you don't know what CPU the user has, so you can't exploit its capabilities unless you inspect the CPU at runtime.
typedef const char* (_vec_call* search_func_t)(const char*, size_t, const filter_context_t*);
static search_func_t g_search_strategy = NULL;
const char* configure_scan_engine(int case_insensitive) {
if (g_cpu_caps.has_avx512) {
g_search_strategy = find_dispatcher_avx512;
} else if (g_cpu_caps.has_avx2) {
g_search_strategy = find_dispatcher_avx2;
}
}

When perform_search is called, it just blindly jumps to the pointer address stored in g_search_strategy. This functions identically to a C++ virtual method table but without the overhead of object instantiation (this-pointer passing). The branch predictor should memorize this static jump.
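The g_cpu_caps bits come from CPUID. A simplified detection sketch (leaf 7, sub-leaf 0: EBX bit 5 = AVX2, EBX bit 16 = AVX-512F; a production version must also verify OSXSAVE/XGETBV so the OS actually saves the wide register state):

#include <intrin.h>
int r[4]; // EAX, EBX, ECX, EDX
__cpuidex(r, 7, 0);                     // structured extended feature flags
g_cpu_caps.has_avx2   = (r[1] >> 5)  & 1;  // leaf 7, EBX bit 5
g_cpu_caps.has_avx512 = (r[1] >> 16) & 1;  // leaf 7, EBX bit 16 (AVX-512F)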
Some scalar checks are unavoidable. For instance, the fallback check: if (haystack_len < ctx->filter_len).
At the assembly level, a conditional if creates a branching path in the instruction cache. The CPU tries to fetch instructions sequentially. If the hot path requires jumping to a different memory address, it causes an I-Cache miss.
To help the compiler lay out the assembly optimally, structural branch hints like __builtin_expect(!!(x), 1) are used, although, if used incorrectly, they often make the program slower.
#if defined(__clang__) || defined(__GNUC__)
#define _likely(x) __builtin_expect(!!(x), 1)
#define _unlikely(x) __builtin_expect(!!(x), 0)
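// Illustrative use (hypothetical cold-path check, matching the example above):
//   if (_unlikely(haystack_len < ctx->filter_len))
//       return NULL;  /* the compiler lays this branch out off the hot path */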
#endif

If you don't understand the physical layout of the CPU's execution engine, your 256-bit instructions will simply queue up behind each other.
µops are dumped into the Reorder Buffer (ROB) and then dispatched to the Reservation Station (RS), which assigns them to physical execution ports.
Let's see the AVX2 BMH case-insensitive loop to understand why the exact sequence of intrinsics prevents port monopolization:
__m256i v0 = _mm256_load_si256((const __m256i*)p);
__m256i l0 = _mm256_or_si256(v0, v_case_mask);
__m256i c0 = _mm256_cmpeq_epi8(l0, v_first);
uint32_t m0 = _mm256_movemask_epi8(c0);

Not all execution ports can handle all instructions.
- _mm256_load_si256 (vmovdqu): memory loads are handled by the load/store units. Non-dinosaur cores have dedicated load ports (typically Port 2 and Port 3).
- _mm256_or_si256 (vpor): bitwise vector operations are more flexible; the CPU can route them to Port 0, Port 1, or Port 5.
- _mm256_cmpeq_epi8 (vpcmpeqb): vector comparisons are slightly more restricted; they can only execute on Port 0 and Port 1.
- _mm256_movemask_epi8 (vpmovmskb): this is the bottleneck. Extracting a vector mask into a GPR requires cross-domain wiring (moving data from the SIMD unit over to the integer unit), so this instruction is normally hardwired exclusively to Port 0.
If I wrote the code without unrolling, evaluating one 32-byte chunk at a time, the CPU would execute the vpcmpeqb on Port 0 or 1, and then immediately demand Port 0 for the vpmovmskb. Because of the data dependency, Port 0 becomes congested.
By unrolling the loop by a factor of 4, I decouple the data dependencies and flood the reorder buffer with a mix of instructions to saturate the execution ports.
graph TD
subgraph Reorder Buffer & Instruction Scheduler
RS[Reservation Station]
end
subgraph Execution Ports
P0[Port 0 <br> Vector ALU / Movemask]
P1[Port 1 <br> Vector ALU]
P2[Port 2 <br> Load Address/Data]
P3[Port 3 <br> Load Address/Data]
P5[Port 5 <br> Vector ALU]
end
RS -- "`vmovdqu v0`" --> P2
RS -- "`vmovdqu v1`" --> P3
RS -- "`vmovdqu v2`" --> P2
RS -- "`vpor l0`" --> P5
RS -- "`vpcmpeqb c0`" --> P1
RS -- "`vpcmpeqb c1`" --> P0
RS -- "`vpor l1`" --> P5
style P0 fill:#a00,stroke:#f00,stroke-width:2px
style P1 fill:#2d2,stroke:#0f0,stroke-width:2px
style P2 fill:#00a,stroke:#00f,stroke-width:2px
style P3 fill:#00a,stroke:#00f,stroke-width:2px
style P5 fill:#2d2,stroke:#0f0,stroke-width:2px
While Port 0 is busy executing the bottleneck vpmovmskb from the previous loop iteration, the scheduler routes the vpor instructions for v1 and v2 to Port 5, and the vpcmpeqb instructions to Port 1. Simultaneously, the next memory blocks are being pulled through Ports 2 and 3.
By grouping the logic as a binary reduction tree (_mm256_or_si256 combinations) I explained before, we delay the dreaded vpmovmskb instruction until the very end. We only execute it once for every 128 bytes, effectively freeing up Port 0 to handle the vector comparisons and maximizing IPC.
When a compiler looks at a function taking two pointers, it has to assume the worst-case scenario: that modifying memory through one pointer might secretly overwrite the data pointed to by the other. This is known as Pointer Aliasing.
Taking eq_fast_scalar as an example.
static _force_inline int eq_fast_scalar(__restrict_ptr p, const filter_context_t* ctx)

__restrict_ptr is mapped to const char* __restrict.
Without it, the compiler's optimizer would be severely handicapped. Inside the string comparison loop, if the compiler thinks p and ctx->filter might point to overlapping memory regions, it is legally forbidden from caching ctx->filter values in CPU registers. It would be forced to emit a mov instruction to read ctx->filter from RAM on every single iteration of the comparison, just in case checking p somehow altered ctx.
Freed from this paranoia, the compiler hoists the loads for ctx->filter entirely outside the loop, storing them in general-purpose registers. The inner loop is reduced to pure register-to-register or register-to-memory comparisons.
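To see what this buys, consider a toy function (hypothetical, not the shipped code; the struct shape is assumed) where a write could otherwise alias the context:

typedef struct { const char* filter; size_t filter_len; } toy_ctx_t; // assumed shape
static void mark_matches(char* __restrict out, const char* __restrict p,
                         size_t n, const toy_ctx_t* ctx) {
    const char first = ctx->filter[0]; // hoisted once: out provably can't alias it
    for (size_t i = 0; i < n; i++)
        out[i] = (p[i] == first);      // pure register work inside the loop
}

Without __restrict on out, the compiler must assume out[i] might overwrite ctx->filter and reload ctx->filter[0] on every iteration.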
This scanner is technically single-threaded for the core file execution, but there is a background thread spinning up to provide UI feedback: progress_thread.
This thread reads bytes_processed every 1000 milliseconds to calculate throughput speeds. Seems harmless, right? Wrong.
That's the reason why I padded some structures:
typedef struct _align_cache {
volatile uint64_t bytes_processed;
uint64_t total_size;
ULONGLONG start_tick;
HANDLE h_exit_event;
char _pad[64 - sizeof(uint64_t) * 2 - sizeof(ULONGLONG) - sizeof(HANDLE)];
} progress_context_t;

For multi-core processors, we need to take cache coherency into account.
If the main thread (pinned to the last physical core, first SMT index, which usually sees fewer DPCs and kernel interrupts, but let's imagine it runs on Core 0) is constantly updating bytes_processed (e.g., p_ctx->bytes_processed += bytes_read), Core 0 marks that 64-byte cache line as Modified (M).
When the background thread (running on Core 1) attempts to read bytes_processed, Core 1's L1 cache misses. It sends an RFO or snoop request over the CPU ring bus. Core 0 is forced to stall its execution, flush the modified cache line down to the L3 cache (or main memory), and transition the line to Shared (S).
Now, imagine if progress_context_t was allocated right next to filter_context_t in memory. They would likely share the same 64-byte cache line.
Our AVX-512 engine is reading f_ctx->filter on every loop iteration. If the background thread reads bytes_processed and they share a cache line, Core 1 will invalidate Core 0's cache line containing the filter data, and Core 0 will suffer an L1 cache miss just to read its own search needle. This ping-ponging of cache-line ownership is called False Sharing, and it can degrade performance by over 500% (it actually did).
graph TD
subgraph Sub1 ["Core 0 L1 Cache (Main SIMD Thread)"]
CL0["Cache Line 0x1000: <br/> bytes_processed AND filter_context"]
end
subgraph Sub2 ["Core 1 L1 Cache (Background Thread)"]
CL1["Cache Line 0x1000 <br/> Wants to read bytes_processed"]
end
CL1 -- "Snoop Request (Invalidate)" --> CL0
CL0 -- "Flush to L3" --> L3_Cache["L3 Cache"]
L3_Cache -- "Fetch" --> CL1
style CL0 fill:#d22,stroke:#f00,stroke-width:2px
style CL1 fill:#2d2,stroke:#0f0,stroke-width:2px
To prevent this, the traditional fix is __declspec(align(64)), in my case applied via my custom _align_cache macro, plus an explicit padding array (char _pad[64 - ...]) for compilers that ignore the alignment attribute. This guarantees that progress_context_t occupies its own dedicated 64-byte cache line. Core 1 can invalidate the progress data all it wants; it will never touch the cache line holding our AVX-512 search variables.
ntstrings supports full regular expressions via the -r flag. However, the core SIMD engine is written in pure C. Standard C++ (std::regex, std::vector, std::string) introduces massive overheads: exception-handling frames (.pdata / .xdata), dynamic allocations (new/delete), and heavyweight template instantiations.
If we include <regex> in the SIMD loop's translation unit, the optimizer suffers. It starts inserting exception-unwinding prologues and epilogues into our hot path, destroying the tight assembly generation, or the file stops compiling at all because we'd be mixing C and C++.
To solve this, I built a bridge to the C++ code. The C++ code is quarantined in its own compilation unit. The C engine only interacts with it via extern "C" endpoints.
extern "C" int cpp_check_string(const char* str, size_t len) {
std::string_view sv(str, len);
for (const auto& re : g_regexes) {
if (std::regex_search(str, str + len, re)) {
return 1;
}
}
return 0;
}

Notice exactly when this bridge is called. Inside emit_string (the very final stage before output):
static _force_inline void emit_string(const char* str, size_t len, filter_context_t* f_ctx, output_context_t* out_ctx) {
if (f_ctx->use_regex_bridge && !cpp_check_string(str, len)) {
return;
}
// ...
}

We do not run the regex engine on the raw memory buffer. That would be suicidal for performance, so I do it only after validating with the AVX-512/AVX2 engines.
The C++ bridge utilizes std::string_view to wrap the C-pointer without allocating any heap memory.
Let's circle back to the chunking mechanism in process_single_file. Memory is scanned in 16MB blocks. But what happens to a 20-character string that starts exactly 5 characters before the chunk ends?
Most conventional scanners use overlapping reads (e.g., read 16MB, but rewind the file pointer by 4096 bytes for the next read). Overlapping reads break the NT Cache Manager's strict FILE_SEQUENTIAL_ONLY read-ahead patterns. If you seek backwards, the kernel assumes random access and can't use its background prefetch buffers.
This means we need to create the illusion of an infinite, forward-only read stream.
When the SIMD engine finishes a 16MB chunk, it does not just stop at end.
if (tail && end > p) {
const char* t = end - 1;
while (t >= p && is_printable_ascii((unsigned char)*t)) {
t--;
}
size_t len = (end - 1) - t;
if (len > 0 && len < tail->cap) {
memcpy(tail->buf, t + 1, len);
tail->len = len;
}
}

This is a micro-state machine:
- It walks backward from the end of the 16MB chunk to see if the chunk terminates in the middle of a valid printable string.
- If it does, it copies that fragment into `tail->buf`.
- When the next 16MB chunk is loaded, it immediately looks at the beginning of the new memory block.
// stitch strings across chunk boundaries if we cut one in half
if (tail->len > 0) {
char* scan = p;
while (scan < end && is_printable_ascii((unsigned char)*scan)) {
scan++;
}
size_t extra = scan - p;
memcpy_s(tail->buf + tail->len, tail->cap - tail->len, p, extra);
// ... evaluate merged tail ...
p += extra; // advance SIMD pointer past the stitched string
}

It walks forward into the new chunk to find the rest of the string, appends it to `tail->buf`, evaluates the complete merged string via the standard filtering algorithms, and then forcefully advances the `p` pointer past the stitched segment so the AVX-512 engine doesn't process it twice.
This creates the illusion of an infinite, contiguous memory haystack for the SIMD engine.
A common pitfall in high-performance string scanners is attempting to write a monolithic SIMD state machine that handles ingestion, pattern matching, bounds-checking, and regex validation simultaneously. I abandoned that monolithic approach.
The idea behind this is to process memory through an asymmetric execution pipeline: a progressively narrowing sequence of distinct hardware filters. Each stage is designed to reject binary noise using the cheapest possible CPU instructions, only advancing surviving data to more complex execution units.
The execution flow consists of four distinct phases. The SIMD engine is purposely kept "dumb": it does not know how to extract strings or evaluate regex. It only knows how to generate binary opmasks representing probable match positions.
flowchart TD
subgraph S1 ["Phase 1: SIMD Spatial Rejector (AVX2 / AVX-512)"]
A["16MB Chunk (Raw Memory)"] --> B["Parallel Anchor Check (First/Last Chars)"]
B --> C["Case-Insensitive Bitwise Fold"]
C --> D{"Opmask != 0?"}
end
subgraph S2 ["Phase 2: Arithmetic Validator (ALU)"]
D -- "Yes (Anchor Hit)" --> E["tzcnt Extraction"]
E --> F["64-bit Integer Mask & CMP"]
F --> G{"QWORD == Needle?"}
end
subgraph S3 ["Phase 3: Context Expander (Scalar)"]
G -- "Yes (Exact Match)" --> H["Walk backward to string start"]
H --> I["Walk forward to string end (NULL/Binary boundary)"]
end
subgraph S4 ["Phase 4: C++ Bridge (Out-of-Band)"]
I --> J["std::string_view creation"]
J --> K["std::regex Execution"]
K --> L["Output Queue / Sort Buffer"]
end
D -- "No" --> A
G -- "No" --> E
To understand why multiple SIMD functions and scalar fallbacks are used in conjunction, we must look at the data reduction rate. The pipeline acts as a datapath funnel.
- The SIMD Domain (L1d Cache Speed): Processes 16MB chunks at maximum PCIe Gen4/Gen5 bandwidth. It has zero semantic awareness; it only performs vectorized spatial and arithmetic intersections to generate a mathematical "probability pointer."
- The Integer ALU Domain (Sub-nanosecond): Processes only the surviving pointers. It treats memory as QWORD integers to confirm exact byte sequences.
- The Scalar Domain: Resolves the contextual boundaries of the confirmed hit.
- The C++ ABI Domain: Executes regex state machines only on the mathematically proven, fully isolated string bounds, if specified of course.
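To put rough numbers on it, using the survival rates from the diagram below: a 16MB chunk surviving Phase 1 at ~0.1% leaves on the order of 16KB of anchor candidates; if the ALU then rejects ~99% of those, only a few hundred sites ever reach scalar expansion, and fewer still reach the regex engine.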
flowchart TD
subgraph STAGE_1 ["Vector SIMD Domain (Filtering 99.9% of data)"]
IN["16MB Locked Memory Chunk"] --> SIMD_OPS["Parallel execution of unrolled AVX2 / AVX-512 vectors"]
SIMD_OPS --> MASK_GEN["Opmask Generation (vpmovmskb / k-registers)"]
end
subgraph STAGE_2 ["Integer ALU Domain (Filtering 99.0% of remaining hits)"]
MASK_GEN --> BMI1["BMI1 Extraction (tzcnt & blsr)"]
BMI1 --> QWORD_CMP["64-bit GPR AND/CMP Validation"]
end
subgraph STAGE_3 ["Scalar Expansion Domain (Context Resolution)"]
QWORD_CMP -- "Valid Anchor Hit" --> SCALAR_EXP["Forward/Backward Byte Walk"]
SCALAR_EXP --> BOUNDS["Determine Absolute String Boundaries"]
end
subgraph STAGE_4 ["Out-of-Band C++ Domain (Semantic Validation)"]
BOUNDS --> CPP_BRIDGE["std::string_view Initialization"]
CPP_BRIDGE --> REGEX["std::regex_search execution"]
REGEX -- "Match Success" --> OUT["Push to Sort Buffer / IO Queue"]
end
MASK_GEN -. "Zero Mask (Skip 128 Bytes)" .-> IN
QWORD_CMP -. "False Positive" .-> BMI1
The conjunction of these methods is necessary because SIMD vector instructions cannot efficiently perform unbounded memory walks or complex stateful evaluations (like regex backtracking).
When the scanner is searching for a string, the execution flow transitions rapidly between CPU domains. The transition points are precisely engineered to minimize cross-domain latency.
stateDiagram-v2
state "Vector Execution State (Unrolled AVX-512/AVX2)" as VEC
state "Mask Traversal State (BMI1 loop)" as MASK
state "GPR Validation State (64-bit integer math)" as GPR
state "Scalar Context State (String Expansion)" as SCALAR
state "C++ Bridge (Regex)" as CPP
[*] --> VEC: Load Aligned Chunk
VEC --> VEC: Mask == 0 (Fast Path)
VEC --> MASK: Mask != 0 (Hits detected)
MASK --> GPR: Extract Index (tzcnt)
GPR --> MASK: Mismatch (False Positive)
GPR --> SCALAR: Exact Needle Match
SCALAR --> CPP: Bounds Isolated
CPP --> MASK: Return to Mask Loop
MASK --> VEC: Mask Exhausted (blsr == 0)
VEC --> [*]: Chunk Complete
The full parallel execution flow of identifying, filtering, and matching a string without monolithic branching looks like this:
sequenceDiagram
participant Mem as L1d Cache
participant SIMD as Vector Execution Units
participant ALU as Integer ALU / BMI1
participant Scalar as Context Expander
participant CPP as C++ Regex Engine
Note over Mem, SIMD: 1. Vector Space (Processing 128 Bytes per cycle)
SIMD->>Mem: Software Prefetch (p + 256)
Mem-->>SIMD: Loads (v0, v1, v2, v3)
SIMD->>SIMD: Apply Case-Folding Mask (vpor)
SIMD->>SIMD: Spatial BMH Intersection (vpcmpeqb + vpand)
SIMD->>SIMD: Binary OR Reduction Tree
SIMD->>ALU: Extract Mask (vpmovmskb)
Note over ALU: 2. Integer Space (Validating SIMD false-positives)
loop While Mask != 0
ALU->>ALU: tzcnt (Find Hit Index)
ALU->>Mem: mov r64, [p + index]
Mem-->>ALU: 8-byte QWORD chunk
ALU->>ALU: bitwise AND (needle_mask)
ALU->>ALU: cmp (needle_u64)
opt If QWORD matches
Note over Scalar: 3. Scalar Space (Resolving overarching boundaries)
ALU->>Scalar: Pass pointer to exact hit
Scalar->>Mem: Read backwards until binary/null
Scalar->>Mem: Read forwards until binary/null
Scalar->>CPP: Pass localized start/end pointers
Note over CPP: 4. C++ Space (Out-of-band regex validation)
CPP->>CPP: Cast to std::string_view
CPP->>CPP: std::regex_search
CPP->>ALU: Return Validation Boolean
end
ALU->>ALU: blsr (Clear lowest bit, iterate next hit)
end
Note over SIMD: 5. Advance pointer by 128 bytes. Repeat.
In summary, and omitting many of the points already covered:
- Vector Space: The L1d cache is pounded with parallel vector loads. The SIMD units execute the case-folding and spatial BMH checks. The binary reduction tree compresses the results, crossing from the vector domain into the integer ALU via `vpmovmskb`.
- Integer Space: The BMI1 loop (`tzcnt` and `blsr`) iterates over the mask. For each hit, the CPU fetches exactly 8 bytes, masks out the excess length, and compares it against the precomputed 64-bit needle.
- Scalar Space: If the ALU confirms the needle is real, the scalar expansion logic probes the adjacent memory to find where the surrounding string actually begins and ends, neutralizing the "SIMD rigid boundary" problem.
- C++ Space: The now human-readable string is handed across the C-to-C++ ABI bridge. The regex state machine runs entirely in the L1 cache on a bounded memory view, so that no cycles are wasted running regex logic on raw binary files.
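A minimal sketch of that Integer Space walk (assuming a 32-bit mask from `vpmovmskb`; `walk_mask` and `expand_and_emit` are illustrative names, and chunk-boundary checks are omitted):

```c
#include <immintrin.h>   /* _tzcnt_u32, _blsr_u32 (BMI1) */
#include <stdint.h>
#include <string.h>

static inline void walk_mask(uint32_t mask, const char* p,
                             uint64_t needle_u64, uint64_t needle_mask)
{
    while (mask) {
        uint32_t idx = _tzcnt_u32(mask);        /* index of lowest anchor hit */
        uint64_t qword;
        memcpy(&qword, p + idx, sizeof(qword)); /* fetch exactly 8 bytes */
        if ((qword & needle_mask) == needle_u64) {
            /* confirmed hit: hand the pointer off to the scalar expander */
            /* expand_and_emit(p + idx); */
        }
        mask = _blsr_u32(mask);                 /* clear lowest set bit, next hit */
    }
}
```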
User-mode timer APIs like QueryPerformanceCounter or GetSystemTimeAsFileTime are far too heavy: their call overhead spans hundreds of clock cycles. We could instead read the InterruptCount field from the KUSER_SHARED_DATA page directly, but it only updates at 64Hz (every 15.625ms); the kernel's read-write copy of that page is kept in 100ns units. We must read the CPU's fastest available clock directly (normally the TSC; someone must kill HPET, the PIT and the ACPI PM timer already). However, on superscalar CPUs, if you place rdtsc before a SIMD block, the out-of-order core (ROB) might execute the rdtsc after the SIMD instructions have already begun, messing up the delta, even if at the compiler level you place an atomic_signal_fence/memory clobber/_ReadWriteBarrier().
To achieve exact cycle counting, the pipeline must be serialized; and to avoid triggering malicious hypervisors with conditionally/unconditionally exiting instructions that would TSC-offset our rdtscp deltas (this is a joke), we can serialize with _mm_lfence(); instead of cpuid.
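A minimal sketch of such a serialized measurement (MSVC intrinsics; the harness shape is my assumption):

```c
#include <intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

static unsigned __int64 timed_run(void (*hot_path)(void))
{
    unsigned int aux;
    _mm_lfence();                         /* older instructions can't leak past */
    unsigned __int64 t0 = __rdtsc();
    _mm_lfence();                         /* hot path can't start before the read */
    hot_path();
    unsigned __int64 t1 = __rdtscp(&aux); /* waits for prior instructions to finish */
    _mm_lfence();                         /* later code can't move into the window */
    return t1 - t0;
}
```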
We must take care of SMT logical siblings and, moreover, of NMI/SMI interrupts, which can still fire even after raising our IRQL (at least DPCs are kept away at DISPATCH_LEVEL). Of course, the thread must not be pinned to a physical core handling most of the ndis.sys network stack or storport.sys disk interrupts.
Even with maximum user-mode priority (REALTIME_PRIORITY_CLASS / THREAD_PRIORITY_TIME_CRITICAL), it would not be 100% accurate. To execute a micro-benchmark, a custom kernel driver (a benchmarking harness) is required to elevate the processor to DISPATCH_LEVEL via KeRaiseIrql. At DISPATCH_LEVEL, however, page faults are strictly illegal: if the memory backing the 16MB buffer is paged out, touching it will immediately trigger an IRQL_NOT_LESS_OR_EQUAL BugCheck... So before raising the IRQL, the memory must be permanently wired into physical RAM using an MDL, via IoAllocateMdl and MmProbeAndLockPages.
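Sketched in WDK terms (error paths trimmed; `buffer`/`length` are assumed to describe the 16MB scan buffer):

```c
#include <ntddk.h>

NTSTATUS lock_and_bench(PVOID buffer, SIZE_T length)
{
    PMDL mdl = IoAllocateMdl(buffer, (ULONG)length, FALSE, FALSE, NULL);
    if (!mdl)
        return STATUS_INSUFFICIENT_RESOURCES;

    __try {
        /* fault every page in now and pin it; no page fault may occur later */
        MmProbeAndLockPages(mdl, KernelMode, IoReadAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        return GetExceptionCode();
    }

    KIRQL oldIrql;
    KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);  /* paging is now illegal, and that's fine */
    /* ... run the serialized rdtsc measurement over the wired buffer ... */
    KeLowerIrql(oldIrql);

    MmUnlockPages(mdl);
    IoFreeMdl(mdl);
    return STATUS_SUCCESS;
}
```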
Before running the benchmarks, the driver flushes the Windows standby list (using NtSetSystemInformation with SystemMemoryListInformation), executes wbinvd, and tests the worst-case scenario: the latency of the PCIe bus, the NVMe controller, and the CcScheduleReadAhead path.
To force the clock into its highest P-states and warm up the caches and TLB, a dummy run of the function over the target memory buffer is of course performed first, and EcoQoS is disabled entirely via the current process's PROCESS_POWER_THROTTLING_STATE. Then rdtsc (or a direct MSR read) is executed before and after the hot path.
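The EcoQoS opt-out is a documented user-mode call; a sketch:

```c
#include <windows.h>

/* Opt the current process out of EcoQoS execution-speed throttling so the
   warm-up run actually reaches the high P-states. */
static void disable_eco_qos(void)
{
    PROCESS_POWER_THROTTLING_STATE state = {0};
    state.Version     = PROCESS_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = PROCESS_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask   = 0;   /* 0 = do not throttle execution speed */
    SetProcessInformation(GetCurrentProcess(), ProcessPowerThrottling,
                          &state, sizeof(state));
}
```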
After testing on a large file, this was the average speed obtained (the ceiling for my system is around 6500 MT/s, the maximum memory transfer rate my CPU supports):
Here is a full video benchmark scanning a 16GB file with both binary and normal data, testing across different compilers. Take into account that the mere fact of recording (240FPS/QuickSync AV1) affects the speed of data retrieval from RAM.
Modern software engineering often prioritizes developer velocity and hardware abstraction layers. The result is software completely disconnected from the silicon it executes on, which is why most string scanners out there are extremely slow on the files I normally want to scan daily, such as dumps.
Your Pentium processor is not the problem. Hardware is not slow, software is.
I made this scanner to finally have a decent tool for scanning patterns in any file quickly, and I hope the community finds the project useful, even if it's not at its full speed.
If you reached here by reading everything, congrats, you're probably the only one.