# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project

Elastickv is an experimental, cloud-oriented distributed key-value store written in Go (module `github.com/bootjp/elastickv`, Go 1.25.0 with `toolchain go1.26.2`). It exposes multiple wire protocols (gRPC RawKV/Transactional, Redis, DynamoDB-compatible HTTP, S3-compatible HTTP, SQS-compatible HTTP) on top of a Raft-replicated, MVCC/OCC storage engine. **Not production-ready.**

## Common Commands

```bash
make test    # go test -v -race ./...
make lint    # golangci-lint --config=.golangci.yaml run --fix
make run     # go run cmd/server/demo.go (built-in 3-node single-process demo)
make client  # go run cmd/client/client.go
make gen     # regenerate protobufs (cd proto && make gen)
```

Run a single test or package:

```bash
go test -run TestName ./store/...
go test -race ./kv/...
```

If `$GOCACHE` is sandbox-blocked (macOS), create the cache dirs first (Go errors out if `GOTMPDIR` does not exist), then prefix the command:

```bash
mkdir -p "$(pwd)/.cache/tmp" "$(pwd)/.golangci-cache"
GOCACHE=$(pwd)/.cache GOTMPDIR=$(pwd)/.cache/tmp go test ./...
GOCACHE=$(pwd)/.cache GOLANGCI_LINT_CACHE=$(pwd)/.golangci-cache golangci-lint run
```

Single-node server (etcd/raft is the default backend):

```bash
go run . --address "127.0.0.1:50051" --redisAddress "127.0.0.1:6379" --raftId "n1" --raftBootstrap
```

The local Jepsen runner (builds, starts a 3-node cluster on `5005{1,2,3}` / `6379{1,2,3}` / `6380{1,2,3}` / `6390{1,2,3}`, runs DynamoDB workloads):

```bash
./scripts/run-jepsen-local.sh                            # full cycle
./scripts/run-jepsen-local.sh --no-rebuild --no-cluster  # reuse running cluster
```

Direct Jepsen invocation requires isolating Leiningen state from `$HOME`:

```bash
cd jepsen && HOME=$(pwd)/tmp-home LEIN_HOME=$(pwd)/.lein \
  LEIN_JVM_OPTS="-Duser.home=$(pwd)/tmp-home" /tmp/lein test
# Same pattern under jepsen/redis/ with HOME=$(pwd)/../tmp-home etc.
```

Protobuf regeneration is version-pinned and will fail unless the toolchain matches: `libprotoc 29.3`, `protoc-gen-go v1.36.11`, `protoc-gen-go-grpc 1.6.1` (see `proto/Makefile`).

The pre-commit hook (runs `make lint`) is opt-in: `git config --local core.hooksPath .githooks`.
## Architecture

The full diagrams live in `docs/architecture_overview.md` — read it before non-trivial changes touching coordination, replication, or routing. Big picture:

- **Adapters (`adapter/`)** — Per-protocol ingress: `redis.go`, `dynamodb.go`, `grpc.go`, `s3.go`, `sqs.go` (with `sqs_auth.go` / `sqs_catalog.go` / `sqs_keys.go` / `sqs_messages.go`), `distribution_server.go` (operator/control plane). The S3 and SQS adapters share the SigV4 path (`sigv4.go`, `s3_auth.go`, `sqs_auth.go`) and static-credentials loader. `redis_proxy.go` and the standalone `cmd/redis-proxy/` implement a phased Redis-to-Elastickv migration proxy with dual-write/shadow-read modes (see `proxy/`).
- **Data plane (`kv/`)** — `ShardedCoordinator` (`sharded_coordinator.go`) is the entry point all adapters dispatch into. It resolves keys via `ShardRouter` (`shard_router.go`) against the in-memory `RouteEngine` cache, then drives `ShardStore` (`shard_store.go`) per Raft group. Transactions live in `transaction.go` / `txn_codec.go`; OCC and lock resolution in `lock_resolver.go`. Leader-only reads go through `lease_state.go`.
- **Replication (`internal/raftengine/`, `kv/fsm.go`)** — Only backend is `etcd/raft` under `internal/raftengine/etcd` (the hashicorp backend was dropped in `a35245a`; the `--raftEngine` flag still advertises `hashicorp` in `main.go` but `newRaftFactory` rejects anything other than `etcd`). Each Raft data dir contains a `raft-engine` marker so the process refuses to reopen a dir under a different backend. Note: README and `docs/etcd_raft_migration_operations.md` still reference `go run ./cmd/etcd-raft-migrate`, but that directory was deleted in `a35245a` — the migrator is no longer in-tree. The KV FSM (`kv/fsm.go`) applies committed entries to the storage layer and to the HLC ceiling.
- **Storage (`store/`)** — MVCC over Pebble (`mvcc_store.go`, `lsm_store.go`); OCC, TTL/expiry, snapshots (`snapshot_pebble.go`), and per-type helpers for Redis collections (`hash_helpers.go`, `list_helpers.go`, `set_helpers.go`, `zset_helpers.go`, `stream_helpers.go`).
- **Control plane (`distribution/`)** — Durable route catalog persisted in reserved keys of the **default Raft group**. `engine.go` is the read-side cache; `watcher.go` polls the catalog and applies versioned snapshots into the engine; `catalog.go` is the storage layer. Operator RPCs (`ListRoutes`, `SplitRange` — same-group split only) are on `proto.Distribution`. **All routing decisions read from the cached `RouteEngine`, not from the catalog directly.**
- **Timestamp Oracle (`kv/hlc.go`, `kv/hlc_wall.go`)** — All HLC timestamps are **issued exclusively by the Raft leader** via `ShardedCoordinator` / `Coordinator` — followers never call `HLC.Next()` for persistence. The 64-bit value splits into an upper 48-bit **physical** half (Unix ms) and a lower 16-bit **logical** counter, and the two halves take very different paths:
  - **Physical (upper 48 bits) — Raft-agreed.** The leader periodically (`hlcRenewalInterval = 1s`, window `hlcPhysicalWindowMs = 3s`) proposes a ceiling entry through the default Raft group; FSM apply on every node calls `SetPhysicalCeiling`. `Next()` clamps the physical half to `max(wall_ms, ceiling_ms)`, so a newly elected leader can never issue a timestamp inside the previous leader's lease window.
  - **Logical (lower 16 bits) — in-memory only.** Advanced by atomic CAS on each `Next()` call; **no Raft round-trip and no consensus per timestamp**. This is what keeps issuance in the nanosecond range.
  - The coordinator and FSM **must share the same `*HLC`** instance (wired via `WithHLC` / `NewKvFSMWithHLC`) so the in-memory counter and the replicated ceiling stay coupled.
- **Process entrypoints** — `main.go` is the multi-binary server (gRPC + Redis + DynamoDB + S3 + SQS + admin + metrics + pprof). Per-protocol bootstrapping is split into `main_s3.go` and `main_sqs.go`; SigV4 static credentials load via `main_sigv4_creds.go`. SQS exposure is opt-in via `--sqsAddress` (with `--sqsRegion` and `--sqsCredentialsFile`); leave `--sqsAddress` empty to disable. `cmd/server/demo.go` is a single-process 3-node demo. `cmd/client/`, `cmd/redis-proxy/`, `cmd/elastickv-admin/`, and `cmd/raftadmin/` are standalone tools. `multiraft_runtime.go` and `shard_config.go` wire shard groups to addresses for multi-group deployments (`--raftRedisMap`, `--raftDynamoMap`, `--raftS3Map`, `--raftSqsMap`).

## Conventions

- `gofmt` + the linters in `.golangci.yaml` (`gocritic`, `gocyclo`, `gosec`, `wrapcheck`, `errorlint`, `mnd`, etc.) are enforced. Avoid `//nolint` — refactor instead.
- Errors: wrap with `github.com/cockroachdb/errors` (the `wrapcheck` linter enforces wrapping at boundaries).
- Logging: structured `slog` with stable keys (`key`, `commit_ts`, `route_id`, …).
- Test files are co-located (`*_test.go`); prefer table-driven tests. `pgregory.net/rapid` is available for property tests (`store/mvcc_store_prop_test.go`, `adapter/redis_transcoder_prop_test.go`, `adapter/grpc_transcoder_prop_test.go`).
- After changes to replication, MVCC, OCC, or the Redis adapter, run the relevant Jepsen suite — these are the integration-level safety net.
- When code review surfaces a defect (incorrect behavior, regression, edge case), **first add a failing test that reproduces the issue, then make it pass with the fix**. Push the test and the fix together (one commit or two adjacent commits) so the regression is locked down. Do not respond to a review-identified defect with a fix-only change.
- HLC: do **not** issue persistence timestamps from non-leader nodes; OCC decisions assume leader-issued ts. **Never use the local wall clock (`time.Now()` / `hlc_wall.go` directly) for snapshot reads, MVCC visibility checks, OCC validation, lease/expiry decisions, or any other ordering-sensitive read** — always go through `HLC.Next()` (writes/commits) or the leader-issued read timestamp pipeline. Local wall clocks are only valid for diagnostics/metrics and as the input that bounds the physical ceiling. Keep wall clocks reasonably synchronized across nodes.
- Route catalog mutations must go through `SplitRange` (or future control-plane RPCs) so the catalog version bumps and watchers fan out — never write catalog keys directly.
- Commits: short imperative summary, optional scope prefix matching the touched area (`store:`, `adapter:`, `kv:`, `docs:`, …). PR descriptions should call out behavior change, risk, and the test evidence (`go test`, `make lint`, relevant Jepsen suite).
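
The table-driven shape mentioned above looks roughly like this. The function under test (`packTS`) and the cases are hypothetical, and a real `*_test.go` file would call `t.Run` / `t.Errorf` on a `*testing.T` instead of printing:

```go
package main

import "fmt"

// packTS packs a physical ms and a 16-bit logical counter the way the HLC
// section describes; a hypothetical stand-in for a function under test.
func packTS(physMs, logical uint64) uint64 {
	return physMs<<16 | logical&0xFFFF
}

func main() {
	// One anonymous-struct row per case; the loop body is the single
	// assertion site, so new cases are one line each.
	tests := []struct {
		name          string
		phys, logical uint64
		want          uint64
	}{
		{"zero", 0, 0, 0},
		{"logical only", 0, 5, 5},
		{"packed", 1, 1, 1<<16 | 1},
		{"logical masked to 16 bits", 0, 1 << 16, 0},
	}
	for _, tt := range tests {
		if got := packTS(tt.phys, tt.logical); got != tt.want {
			fmt.Printf("%s: got %d, want %d\n", tt.name, got, tt.want)
			return
		}
	}
	fmt.Println("all cases pass")
}
```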

## Self-review of code changes

After every code change, run **five independent review passes** — one lens at a time, do not collapse them. Each lens has a different failure mode and merging them tends to skip cases. Record the result of each pass (even a one-line "no issues") in the PR description.

1. **Data loss** — Can any committed write be lost or silently overwritten? Check Raft propose/apply ordering, FSM idempotency, snapshot/restore round-trips, Pebble sync semantics (`lsm_store_sync_mode_*`), TTL/expiry deletes, retention/compaction (`store/mvcc_store_retention_test.go`, `kv/compactor.go`), and crash-restart paths. New failure modes (`return nil` after an error, swallowed `Apply` errors, missing `WAL.Sync`) are the usual culprits.
2. **Concurrency / distributed failures** — Race conditions, lock ordering, deadlocks, leader change mid-operation, follower forwarding while leadership flips, partial Raft membership changes, partition + heal, slow follower, snapshot-during-apply, OCC conflict resolver paths (`kv/lock_resolver.go`), and the lease-read window (`kv/lease_state.go`). Run the relevant `go test -race` and the matching Jepsen suite.
3. **Performance** — Hot-path allocations, lock contention, fan-out across shards, extra Raft round-trips per request (especially anything that would force consensus on a per-`Next()` HLC tick), N+1 reads against Pebble, Lua/transcoder churn (`adapter/redis_lua_pool.go`, `adapter/grpc_transcoder.go`), and metric cardinality. Check existing benchmarks (`*_benchmark_test.go`) and add one if a hot path changed.
4. **Data consistency** — MVCC visibility, OCC commit-ts ordering, HLC physical-ceiling invariant, snapshot read isolation, route-catalog versioning + watcher fan-out, cross-shard transaction atomicity (`kv/transaction.go`, `kv/txn_codec.go`), DynamoDB/Redis adapter semantics versus the upstream contract, and the lease-read freshness bound. Reads that bypass `HLC.Next()` or the leader-issued read pipeline are bugs.
5. **Test coverage** — New/changed branches must have unit tests (table-driven, co-located `*_test.go`); property tests via `pgregory.net/rapid` for codecs/transcoders; OCC/HLC/MVCC behavior changes need targeted tests under `kv/` and `store/`; replication/Redis/MVCC changes need the corresponding Jepsen workload. If a reviewer found the defect, the regression test (per the convention above) must be in the same PR.

## Design Documents

`docs/design/` holds dated proposals and as-implemented records. Filenames carry one of three lifecycle markers:

- `*_proposed_*.md` — Design accepted, no implementation yet (or implementation just started).
- `*_partial_*.md` — Some milestones / phases of the design have shipped, but the full proposal is not yet complete. The doc tracks which milestones have landed and what remains. Example: `2026_02_18_partial_hotspot_shard_split.md` (Milestone 1 of the hotspot-split design has shipped; later milestones are still open).
- `*_implemented_*.md` — All milestones of the proposal have shipped; the doc is preserved as the as-built record.

Check this directory before designing anything new — there is likely a recent precedent (HLC lease, FSM compaction, S3 adapter, lease reads, Lua commit batching, TTL inline value, centralized TSO proposal, hotspot shard split, etc.). `docs/design/README.md` indexes them.

**Design-doc-first workflow.** For any change that goes beyond a single-file edit — new feature, new adapter, new control-plane RPC, schema/wire-format change, or any modification touching replication / MVCC / OCC / HLC / routing — **write a `*_proposed_*.md` design doc first and land it before the implementation**. Do not start implementation until the proposal has been reviewed and accepted. The PR may carry both the doc and the implementation (in that order: doc commit first, implementation commits after) as long as the doc is reviewable on its own. Lifecycle transitions: rename `*_proposed_*.md` → `*_partial_*.md` once the first milestone ships (and update the doc to record what landed and what is still open); rename `*_partial_*.md` → `*_implemented_*.md` once the final milestone ships. Use `git mv` so the history follows the rename.