Skip to content

Latest commit

 

History

History
420 lines (330 loc) · 17.7 KB

File metadata and controls

420 lines (330 loc) · 17.7 KB

System Architecture

OpenNANDLab v2.0

This document is the authoritative reference for how OpenNANDLab is structured internally. Read it before contributing code.


1. Design Philosophy

OpenNANDLab follows three principles:

Correctness before performance. The simulator must produce outputs that match theory (BCH can correct exactly t errors, WAF ≥ 1.0, RBER grows monotonically with P/E count). Optimizations come second.

Layered abstraction. Each layer communicates with its neighbours through a well-defined interface. The FTL does not know whether ECC is BCH or LDPC. The analytics engine does not know which GC policy was used. Swapping algorithms requires only a config change.

Measurable results. Every subsystem exposes telemetry. If a benchmark cannot produce a number, it is not a benchmark.


2. High-Level Architecture

╔═══════════════════════════════════════════════════════════╗
║                  Host Interface Layer                     ║
║   CLI (Click)  │  Python API  │  Streamlit Dashboard      ║
╚══════════════════════════╤════════════════════════════════╝
                           │  LogicalRequest(op, lba, size, data)
                           ▼
╔══════════════════════════════════════════════════════════╗
║                 Workload Engine                          ║
║   Sequential  │  Random  │  Mixed  │  Trace Replay       ║
╚══════════════════════════╤═══════════════════════════════╝
                           │  stream of LBA requests
                           ▼
╔══════════════════════════════════════════════════════════╗
║             Flash Translation Layer (FTL)                ║
║  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐   ║
║  │  L2P Map    │  │ Write Buffer │  │  Free-Block    │   ║
║  │ (flat array)│  │  (64 pages)  │  │  Pool (deque)  │   ║
║  └─────────────┘  └──────────────┘  └────────────────┘   ║
║  ┌───────────────────────────────────────────────────┐   ║
║  │       Garbage Collector                           │   ║
║  │  Greedy │ Cost-Benefit │ Age-Threshold            │   ║
║  └───────────────────────────────────────────────────┘   ║
║  ┌───────────────────────────────────────────────────┐   ║
║  │       Wear Leveling Engine (min-heap)             │   ║
║  │  Dynamic │ Static │ Hybrid                        │   ║
║  └───────────────────────────────────────────────────┘   ║
╚══════════════════════════╤═══════════════════════════════╝
                           │  PhysicalPage(block_id, page_id, data)
                           ▼
╔══════════════════════════════════════════════════════════╗
║                   ECC + Compression Layer                ║
║  ┌──────────────┐  ┌────────────────┐  ┌─────────────┐   ║
║  │  Compressor  │  │   ECCHandler   │  │  Scrambler  │   ║
║  │  LZ4 / Zstd  │  │  BCH or LDPC   │  │  (optional) │   ║
║  └──────────────┘  └────────────────┘  └─────────────┘   ║
╚══════════════════════════╤═══════════════════════════════╝
                           │  codeword bytes
                           ▼
╔══════════════════════════════════════════════════════════╗
║                  NAND Device Model                       ║
║  Channel → Die → Plane → Block → Page                    ║
║  ┌────────────────┐  ┌─────────────────────────────────┐ ║
║  │  Timing Model  │  │   Reliability Model             │ ║
║  │  tR/tPROG/tBERS│  │   RBER(P/E), Retention, Disturb │ ║
║  └────────────────┘  └─────────────────────────────────┘ ║
╚══════════════════════════╤═══════════════════════════════╝
                           │  telemetry events
                           ▼
╔══════════════════════════════════════════════════════════╗
║                  Analytics Engine                        ║
║  WAF │ IOPS │ Latency Percentiles │ ECC Rate │ Lifetime  ║
║  Wear Heatmap │ RBER Curve │ GC Timeline │ HTML Report   ║
╚══════════════════════════════════════════════════════════╝

3. Module Reference

3.1 nand/device.py — Physical NAND model

NANDDevice
 └── channels: list[NANDChannel]
      └── dies: list[NANDDie]
           └── planes: list[NANDPlane]
                └── blocks: list[NANDBlock]
                     └── pages: list[NANDPage]

NANDPage fields:

  • state: PageStateFREE | VALID | INVALID
  • data: bytes — raw codeword bytes (post-ECC)
  • write_count: int — number of times this page has been written

NANDBlock fields:

  • pe_count: int — erase count
  • is_bad: bool
  • rber: float — computed from reliability.rber_model(pe_count)

NANDDevice.read_page(block_id, page_id) -> bytes — returns raw codeword or raises NANDReadError.
NANDDevice.write_page(block_id, page_id, data: bytes) -> None — writes codeword; increments write counter.
NANDDevice.erase_block(block_id) -> None — sets all pages to FREE; increments pe_count; recomputes RBER.


3.2 nand/reliability.py — Endurance & error models

RBER model (Weibull variant):

def rber_model(pe_count: int, cfg: NANDConfig) -> float:
    """
    RBER increases from rber_floor toward rber_ceil as erase count grows.
    λ controls the characteristic lifetime (50% of rber_ceil at pe_count=λ).
    """
    fraction = 1.0 - math.exp(-pe_count / cfg.rber_lambda)
    return cfg.rber_floor + (cfg.rber_ceil - cfg.rber_floor) * fraction

Bit-error injection (used by simulator, not ECC):

def inject_errors(data: bytes, rber: float, rng: random.Random) -> bytes:
    """Flip each bit independently with probability rber."""
    ...

Retention loss (optional, P1):

def retention_rber(base_rber: float, hours_since_write: float) -> float:
    return base_rber * math.log1p(hours_since_write / 24.0)

3.3 ftl/page_ftl.py — Page-level Flash Translation Layer

Invariants (enforced by Hypothesis):

  1. Every mapped LPN points to exactly one physical page.
  2. No physical page is pointed to by more than one LPN.
  3. After GC, free pages ≥ gc_trigger threshold.
  4. WAF ≥ 1.0 at all times.

Core data structures:

class PageFTL:
    l2p: array.array            # logical → physical page index; -1 = unmapped
    p2l: array.array            # physical → logical page index; -1 = free/invalid
    page_state: bytearray       # 2 bits per physical page: FREE=0, VALID=1, INVALID=2
    free_pool: deque[int]       # deque of free physical page indices
    write_buffer: WriteBuffer   # in-memory buffer before NAND commit
    _host_writes: int           # for WAF numerator
    _nand_writes: int           # for WAF denominator

Write path:

FTL.write(lbn, data)
  → compress(data) if enabled
  → write_buffer.add(lbn, compressed)
  → if buffer full:
      flush_buffer()
        → for each (lbn, payload) in buffer:
            old_ppn = l2p[lbn]
            new_ppn = free_pool.popleft()
            ecc_data = ecc_handler.encode(payload)
            nand.write_page(ppn_to_block(new_ppn), ppn_to_page(new_ppn), ecc_data)
            l2p[lbn] = new_ppn
            p2l[new_ppn] = lbn
            if old_ppn != -1:
                page_state[old_ppn] = INVALID
                p2l[old_ppn] = -1
            page_state[new_ppn] = VALID
            nand_writes += 1
        if free_pool size < gc_trigger:
            gc.run()

Read path:

FTL.read(lbn) -> bytes
  → if lbn in write_buffer: return write_buffer[lbn]
  → ppn = l2p[lbn]
  → if ppn == -1: raise UnmappedLBNError
  → raw = nand.read_page(ppn_to_block(ppn), ppn_to_page(ppn))
  → decoded = ecc_handler.decode(raw)    # raises UncorrectableECCError if needed
  → return decompress(decoded) if compression flag set

3.4 ftl/gc.py — Garbage Collector

GreedyGC:

class GreedyGC:
    def select_victim(self, ftl: PageFTL) -> int:
        """Return block_id with the most INVALID pages. O(B) scan, called infrequently."""
        ...

    def run(self, ftl: PageFTL, nand: NANDDevice) -> GCStats:
        block_id = self.select_victim(ftl)
        # Copy all VALID pages in block to free pages
        for page_id in range(pages_per_block):
            ppn = block_to_ppn(block_id, page_id)
            if page_state[ppn] == VALID:
                lbn = p2l[ppn]
                payload = nand.read_page(block_id, page_id)
                ftl.write_to_free_page(lbn, payload)      # internal, bypasses buffer
                nand_writes += 1                           # counted in WAF
        nand.erase_block(block_id)                         # pe_count += 1
        free_pool.extend(block_pages(block_id))
        return GCStats(pages_moved=valid_count, erases=1)

CostBenefitGC (P1): score = (1 - utilization) * age / (2 * utilization * age_weight + 1e-9)


3.5 defect/wear_leveling.py — Wear Leveling Engine

import heapq

class WearLevelingEngine:
    """
    Tracks per-block erase counts in a min-heap.
    Triggers wear leveling when the range (max - min) exceeds threshold.
    """
    def __init__(self, num_blocks: int, threshold: int):
        self._heap: list[tuple[int, int]] = [(0, i) for i in range(num_blocks)]
        heapq.heapify(self._heap)
        self._counts = [0] * num_blocks
        self.threshold = threshold

    def record_erase(self, block_id: int) -> None:
        self._counts[block_id] += 1
        heapq.heappush(self._heap, (self._counts[block_id], block_id))

    def least_worn_block(self) -> int:
        """O(1) peek of the min-heap."""
        return self._heap[0][1]

    def should_level(self) -> bool:
        max_count = max(self._counts)
        min_count = self._counts[self.least_worn_block()]
        return (max_count - min_count) > self.threshold

3.6 ecc/bch.py — BCH encoder/decoder

The BCH implementation must cover all four steps:

Step Algorithm Output
Encode Generator polynomial multiplication Codeword with appended parity
Syndrome computation Evaluate codeword at primitive roots Syndrome vector
Error-locator polynomial Berlekamp–Massey σ(x)
Error location Chien search Error positions
Error value (non-binary) Forney's algorithm Error magnitudes

Note on Forney: Forney's algorithm is required for m > 1 (GF(2^m) with non-binary symbols). Without it, the decoder returns incorrect data silently when m > 1 and t > 0. Binary BCH (m=1) does not need Forney.


3.7 ecc/ldpc.py — LDPC encoder/decoder

Parity-check matrix generation: Progressive Edge-Growth (PEG) algorithm. The matrix is cached to disk as an .npz file keyed by (n, d_v, d_c) to avoid regeneration overhead.

Soft-decision belief propagation:

def decode(self, llr: np.ndarray, max_iter: int = 50) -> np.ndarray:
    """
    llr: log-likelihood ratios from Gaussian Vth model.
         llr[i] = log(P(bit=0|channel) / P(bit=1|channel))
    Returns decoded bits (hard decision after convergence).
    """

Gaussian threshold-voltage model (for LLR generation):

def compute_llr(raw_bits: bytes, rber: float, sigma: float = 1.0) -> np.ndarray:
    """
    Model each cell's threshold voltage as Gaussian(μ=0 or 1, σ).
    Returns LLR vector for the BP decoder.
    """

3.8 analytics/metrics.py — Telemetry

All telemetry is captured via an event bus pattern: subsystems emit events; the analytics engine subscribes:

@dataclass
class WriteEvent:
    timestamp_ns: int
    lbn: int
    ppn: int
    is_gc: bool

@dataclass
class EraseEvent:
    timestamp_ns: int
    block_id: int
    pe_count: int

@dataclass
class ECCEvent:
    timestamp_ns: int
    errors_corrected: int
    uncorrectable: bool

Derived metrics:

Metric Formula
WAF nand_writes / host_writes
IOPS host_ops / elapsed_seconds
P99 latency 99th percentile of latency_samples
ECC correction rate ecc_corrections / total_reads
UBER uncorrectable_reads / total_reads
Estimated lifetime max_pe_cycles / current_pe_rate_per_day

4. Critical Bug Fix: write_page (v1.1.0 → v2.0)

Problem (identified by Deepseek): In v1.1.0, NANDController.write_page ends with a comment # Perform error correction coding / # ... (comment only). ECC encoding is never called. The physical write is never called. Every "write" in v1.1.0 is a no-op.

Fix:

def write_page(self, logical_block: int, page: int, data: bytes) -> None:
    # 1. Validate inputs
    if not self._initialized:
        raise RuntimeError("Controller not initialized")

    # 2. Check bad block
    physical_block = self._bbm.get_replacement(logical_block)
    if physical_block is None:
        raise BadBlockError(f"No replacement for block {logical_block}")

    # 3. Compress
    payload = data
    compressed = False
    if self.compression_enabled:
        c = self._compressor.compress(data)
        if len(c) < len(data):
            payload, compressed = c, True

    # 4. ECC encode  ← THIS WAS MISSING
    codeword = self._ecc_handler.encode(payload)

    # 5. Optional scramble  ← THIS WAS MISSING
    if self.data_scrambling:
        codeword = self._scramble_data(codeword, physical_block, page)

    # 6. Physical write  ← THIS WAS MISSING
    self._nand_interface.write_page(physical_block, page, codeword)

    # 7. Update cache with original data
    if self.cache_enabled:
        self._cache.put((physical_block, page), data)

    # 8. Telemetry
    self._metrics.record_write(logical_block, physical_block, compressed)
    self._wear_leveler.record_write(physical_block)

5. Key Design Decisions

Decision Choice Rationale
FTL default Page-level mapping Simplest correct semantics; easiest to verify
Config Pydantic BaseModel Type-safe, validated at startup, generates JSON schema
Heap for WL heapq (min-heap) O(log n) insert, O(1) min lookup — far better than linear scan
LRU cache OrderedDict Python stdlib, O(1) get/put/evict, no extra dependency
GC default Greedy Easiest to verify correct; cost-benefit is a config switch
GUI Streamlit Headless-CI-friendly, no Tkinter install pain, produces HTML reports
Parity matrix storage .npz cache PEG matrix generation is expensive; cache it after first run
Test framework pytest + Hypothesis Property-based tests catch corner cases unit tests miss

6. Thread Safety Note

The v2.0 simulator is single-threaded by design. The event loop is synchronous; parallel_access.py exists for future work. Do not add locking to core data structures until the parallel model is designed.


7. Performance Budget (CPython 3.12)

Operation Target Notes
FTL write (no GC) < 5 µs Dominated by array indexing
BCH encode (4096 B, t=4) < 500 µs GF polynomial arithmetic
LZ4 compress (4096 B) < 10 µs C extension via lz4 package
GC cycle (1 block, 256 pages) < 50 ms Acceptable tail latency spike
1M simulated writes < 60 s End-to-end benchmark target

For algorithm references, see docs/references.md. For configuration, see docs/configuration.md. For contributing, see CONTRIBUTING.md.