Split the monolithic evp/hdf5_reader.h (629 lines, single header) into a modular
structure under a new evp/hdf5/ subdirectory. Zero behavioral changes.
- Single 629-line header mixing IO helpers, format parsers, recursion, and public API.
parse_ohdr_v2broken forflags=0x20(timestamps) — message boundaries misaligned.- All inline in one header, hard to test individual parsers.
The HDF5 format has multiple versioned components. Our reader must handle each variant:
- v0 (only version we support): 8-byte offsets, 8-byte lengths, root STE inline.
- Higher versions (1–9+) exist but are out of scope.
- v1: First byte is
1, followed byreserved(1),n_msgs(2),obj_ref_count(4),header_size(4),reserved(4). Messages are fixed 8-byte aligned withtype(2) + size(2) + flags(1) + reserved(3). - v2: Starts with
"OHDr"signature, thenversion(1),flags(1), optional timestamps (flags>>5 & 1), optional phase-change (flags>>4 & 1), chunk0 size (variable width perflags & 0x3), then messages withtype(2) + size(2) + flags(1) + [cre_order(2)].
- v1/v2:
dimensionality(1),layout_class(1),reserved(5),data_addr(8).layout_class=1→ contiguous, address only (no length — compute from dataspace).
- v3/v4:
layout_class(1), then if1(contiguous):data_addr(8),data_len(8). Explicit byte length included.
- v1:
ds_ver(1),rank(1),flags(1),reserved(5), thenrank×uint64dims. Ifflags & 0x01(H5S_VALID_MAX): additionalrank×uint64max dims follow. - v2+:
ds_ver(1),rank(1),flags(1),type(1)(H5S_SIMPLE=1, etc.), thenrank×uint64dims. Ifflags & 0x01: additionalrank×uint64max dims follow.
class_and_ver(1),bitfields(3),size_in_bytes(4).
btree_root_rel(8),heap_rel(8). Present only on groups.
- v1 (only version we support):
TREEsig,node_type(1),node_level(1),entries_used(2),left_sibling(8),right_sibling(8), thenkey(8), child(8)× N.
SNODsig,version(1),reserved(1),count(2), thencount× STE entries.- Each STE:
name_off(8),ohdr_rel(8),cache_type(4),reserved(4),scratch_btree(8),scratch_heap(8).
HEAPsig,version(1),reserved(3),data_size(8),free_list_off(8),data_rel(8).
C:\Data\ANN\sisap2026\small\
benchmark-dev-wikipedia-bge-m3-small.h5
allknn.ivecs
train.fvecs
train.hvecs
glove/
Note: itest.ivecs is NOT on disk — the test ReadItestKnns_MatchesIvecs skips.
- Root B-Tree: 1 leaf SNOD (NOT 4). HDF5 packs all root-level entries into a single SNOD node regardless of how many groups/datasets exist.
- Root SNOD: Contains entries for
allknn,itest,otest,train. - Root heap: Stores group/dataset names as null-terminated strings, accessible
via
name_offin SNOD entries. Names are at arbitrary offsets within the heap data block — never assume offset 0, and never usestrstron raw heap data (padding and free-list entries can confuse naive string search).
- Uses OHDr v1 (first byte =
1) — parsed correctly. - Sub-B-Tree: 1 leaf SNOD with 3 entries.
- Group heap: contains "knns", "dists", "queries" as null-terminated strings.
- Datasets:
itest/knns— 10000 × 1000 int32 (40 MB)itest/dists— 10000 × 1000 float32 (40 MB)itest/queries— 10000 × 1024 float32 (40.96 MB)
- Uses OHDr v2 with
flags=0x20(timestamps enabled:flags>>5 & 1). - Sub-B-Tree: 1 leaf SNOD with 2 entries (
dists,knns). - Sub-datasets:
allknn/dists(200000 × 32 float32, 25600000 bytes),allknn/knns(200000 × 32 int32, 25600000 bytes) allknn/knnsdata is 1-indexed (indices 1..200000). Verified 100% match againstallknn.ivecsground truth.
- Uses OHDr v2 with
flags=0x20(timestamps) fordists/knns, andflags=0x00forqueries. - Sub-datasets:
otest/dists— 10000 × 1000 float32 (40000000 bytes)otest/knns— 10000 × 1000 int32 (40000000 bytes)otest/queries— 10000 × 1024 fp16 (20480000 bytes)
- No ground truth files on disk. Sanity checks pass: indices in [1, 200000], distances >= 0, query norms ≈ 1.0.
- Uses OHDr v2 with
flags=0x00(no timestamps). - 200000 × 1024 fp16 (409600000 bytes)
- Verified 100% match against
train.hvecsground truth (all 204,800,000 FP16 values).
Same structure as small, but with larger train/allknn datasets:
train: 6350000 × 1024 fp16 (13,004,800,000 bytes) — verified vs train.hvecs ✅allknn/dists: 6350000 × 32 float32 (812,800,000 bytes)allknn/knns: 6350000 × 32 int32 (812,800,000 bytes) — verified vs allknn.ivecs ✅itest/*,otest/*: same dimensions as small (10000 queries)
OHDr v2 parsing has been fixed. The issues were:
- Message type field is 1 byte (not 2) in v2.
- No 8-byte message alignment between messages in v2.
- 4-byte CRC32 checksum at end of chunk0 must be skipped.
- Dataspace v2 has a
typebyte afterflagsthat must be skipped before reading dimensions. - Dataspace
flags & 0x01(H5S_VALID_MAX) means max dimensions follow current dimensions. All datasets (train, allknn/, otest/) are now correctly discovered with proper dimensions.
- SISAP
knnsdatasets use mixed indexing — do not assume uniform behavior:- 1-indexed:
itest/knns(int32, indices 1..200000),allknn/knns(int32, indices 1..200000),test/knnsin llama-dev.h5 (INT64, indices 1..256921) - 0-indexed:
otest/knns(int32, indices 0..199999) — the only exception
- 1-indexed:
allknn.ivecsis also 1-indexed. Comparison is direct:hdf5_data[i] == ivecs_data[i](no -1 offset needed).itest.ivecs(if it existed) may also be 1-indexed — the existing test subtracts 1 as a precaution, but this depends on how the file was generated.
- SISAP 2026 benchmark dataset.
- Groups:
itest,otest,allknn,train. allknn/knnsverified 100% match againstallknn.ivecsground truth.
- Same structure as small, but with larger train/allknn datasets.
allknn/knnsverified vs allknn.ivecs ✅
- Disk File:
C:\Data\ANN\sisap2026\llama-dev\llama-dev.h5 - SISAP-generated file — same Python conversion pipeline as other SISAP datasets.
- Root Level:
traindataset (FP32, 256921 x 128)testgroup
- Group
test:- Contains
queries(FP32, 1000 x 128),dists(FP64 / double, 1000 x 100), andknns(INT64, 1000 x 100). - Uses HDF5 Object Header Continuation Messages (Type 16) and Link Messages (Type 6).
- Stale B-Tree/SNOD structures only contained 2 entries (missing
knns). The Link Messages (Type 6) represent the up-to-date group structure. - Reader updated to follow recursive continuation blocks and parse Link Messages directly, bypassing stale B-Tree/SNOD nodes when
linksare found.
- Contains
evp/
hdf5_reader.h # shim: includes all below (backward compat)
evp_common.h # now_ms(), hvecs_read() (legacy .hvecs support)
deglib_evp_test.cpp # main benchmark (HDF5 + legacy .hvecs/.ivecs)
deglib_evp_microbench.cpp # SIMD microbenchmark
hdf5/
structure.md # this file
hdf5_types.h # DatasetInfo, OhdrInfo, SteEntry, Superblock, UNDEF64, SIG_*
hdf5_io.h # r16/r32/r64, frd8/16/32/64, fskip, fseek, heap_str
hdf5_superblock.h # parse_superblock() — superblock v0 only
hdf5_heap.h # parse_heap()
hdf5_btree.h # traverse_btree() — btree v1 only
hdf5_snod.h # parse_snod()
hdf5_ohdr.h # apply_msg(), parse_ohdr_v1(), parse_ohdr_v2(), parse_ohdr()
hdf5_scan.h # collect_datasets(), scan_datasets(), print_datasets() (supports recursive continuation & links)
hdf5_readers.h # read_fp16_vectors(), read_int32_flat(), read_matrix_int64(), read_matrix_fp64()
test/
test_main.cpp # gtest entry point
test_types.cpp # type struct defaults, constants
test_io.cpp # r16/r32/r64 endian, heap_str
test_structural.cpp # heap, btree, snod against real file
test_ohdr.cpp # OHDr v1/v2 parsing
test_integration.cpp # end-to-end scan + readers + ground truth (small)
test_dataset_dims.cpp # per-dataset dimension verification (small)
test_sanity.cpp # data integrity: indices, distances, norms (small)
test_integration_large.cpp # end-to-end scan + readers + ground truth (large)
test_dataset_dims_large.cpp # per-dataset dimension verification (large)
test_sanity_large.cpp # data integrity: indices, distances, norms (large)
test_llama_dev.cpp # end-to-end structure and values verification for llama-dev (uses Link & Continuation msgs)
1. hdf5_types.h (no deps)
2. hdf5_io.h (deps: types)
3. hdf5_superblock.h (deps: types, io)
4. hdf5_heap.h (deps: types, io)
5. hdf5_btree.h (deps: types, io)
6. hdf5_snod.h (deps: types, io)
7. hdf5_ohdr.h (deps: types, io)
8. hdf5_scan.h (deps: types, io, superblock, heap, btree, snod, ohdr)
9. hdf5_readers.h (deps: types, io)
- All functions remain inline (header-only, no new compilation units)
- All code stays in namespace hdf5_reader (or hdf5_reader::detail)
- Existing
#include "hdf5_reader.h"in all .cpp files works unchanged - Each sub-header includes needed standard headers + prerequisite hdf5/ headers
- Version-specific parsing logic must be clearly separated within each parser
- No dataset-specific functions in generic headers —
hdf5_readers.honly contains generic readers (read_fp16_vectors, read_int32_flat). Dataset-specific loading (e.g. "load train", "load allknn/knns") belongs in the consumer code (deglib_evp_test.cpp), never in the library headers.
- Never use
strstron raw heap data to verify name presence. Heap data may contain padding, free-list entries, or internal structures that break naive string search. Always useheap_str(heap, name_off)with SNOD entry offsets. - Root B-Tree may have only 1 leaf SNOD even when the root group contains many entries. HDF5 packs multiple entries into a single SNOD node. Do not assume 1 SNOD leaf per group entry.
- Root heap does not necessarily start with names at offset 0. The heap
data portion (returned by
parse_heap) is a raw memory block with strings at arbitrary offsets. Use SNODname_offvalues for access. - OHDr v2 with timestamps (
flags & 0x20) has been fixed. Message type is 1 byte, no 8-byte alignment, CRC32 at end of chunk0, and dataspace v2 type byte are all handled. itest.ivecsis not on disk — ground truth comparison test skips.- FP16 datasets:
trainandotest/querieshave element_size=2, both small and large files. - Large file (14.2 GB): all 9 datasets verified. train/allknn scaled to 6.35M rows.
- 63 small tests + 33 large tests + 5 llama-dev tests = 101 total (2 skipped, 99 passed, 0 failures).
- Test include paths:
#include "hdf5_types.h"(nothdf5/hdf5_types.h) whenevp/hdf5/is set as the include directory in CMake (add_hdf5_testmacro). heap_strtest vectors must null-terminate — astd::vector<uint8_t>initialized from a braced list does not pad with zeros, soheap_strwill read past the intended string into uninitialized memory. Always append0as the last element.
- Create evp/hdf5/ directory ✅
- Write each of the 9 files by cutting code from current hdf5_reader.h ✅
- Write hdf5_reader.h shim that includes all 9 files in order ✅
- Build all targets to verify zero regressions ✅
- Add unit tests under evp/hdf5/test/ ✅
- Fix test assumptions to match actual file structure ✅
- Fix OHDr v2 + dataspace v2 parsing (type byte, max dims, timestamps) ✅
- Add large file tests (14.2 GB, 6.35M vectors) ✅
- 101 tests total: 99 passed, 2 skipped (itest.ivecs missing x2), 0 failed ✅
- Add
itest.ivecsground truth file to enableReadItestKnns_MatchesIvecstest (both small and large). - Ground truth for
otest/*is not available on disk — sanity checks are the best we can do.