Commit 93484e2

imbajin and Ethereal-O authored
docs: add AGENTS.md and update docs & gitignore (#344)
* docs: add AGENTS.md and update docs & gitignore

  Add AI-assistant guidance files (AGENTS.md) at repository root and under vermeer, and expand documentation across the project: significantly update top-level README.md, computer/README.md, and vermeer/README.md with architecture, quick-starts, build/test instructions, and examples. Also update CI badge link in README and add AI-assistant-specific ignore patterns to .gitignore and vermeer/.gitignore to avoid tracking assistant artifacts.

* Add vermeer-focused .devin/wiki.json

  Introduce .devin/wiki.json with repository notes directing contributors to focus exclusively on the vermeer directory: document its architecture, implementation, and APIs; exclude content from the computer module/directory; and prioritize vermeer-specific functionality and code examples.

* Update READMEs: PageRank params and Vermeer configs

  Clarify algorithm parameters and configuration guidance across computer/README.md and vermeer/README.md.

  In computer/README.md PageRank options were renamed and documented (page_rank.alpha, bsp.max_superstep, pagerank.l1DiffThreshold) and a pointer to the full PageRank implementation was added to avoid confusion from the simplified example.

  In vermeer/README.md example Docker volume mounts now recommend a dedicated config directory (~/vermeer-config) and include a security note about avoiding mounting the whole home directory. The master.ini/worker.ini sample blocks were reworked to use revised keys (http_peer, grpc_peer, master_peer, run_mode, task_parallel_num, etc.) and a note clarifies that HugeGraph connection details are supplied via the graph load API.

  Additional notes direct readers to the real WorkerComputer/MasterComputer interfaces and existing algorithm examples; minor performance-tuning guidance was also adjusted to reflect the new task_parallel_num setting.

* Update README.md

* doc: fix some mistakes in docs about vermeer (#345)

---------

Co-authored-by: Jingkai Yang <m15635418665@163.com>
1 parent cec0f80 commit 93484e2

8 files changed: +1721 -92 lines changed

.devin/wiki.json

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
{
  "repo_notes": [
    "Focus exclusively on the vermeer directory and its components",
    "Document vermeer's architecture, implementation, and APIs",
    "Exclude all content from the computer module/directory",
    "Prioritize vermeer-specific functionality and code examples"
  ]
}

.gitignore

Lines changed: 9 additions & 0 deletions
@@ -55,6 +55,15 @@ build/
 *.log
 *.pyc
 
+# AI assistant specific files (we only maintain AGENTS.md)
+CLAUDE.md
+GEMINI.md
+CURSOR.md
+COPILOT.md
+.cursorrules
+.cursor/
+.github/copilot-instructions.md
+
 # maven ignore
 
 apache-hugegraph-*-incubating-*/

AGENTS.md

Lines changed: 240 additions & 0 deletions
@@ -0,0 +1,240 @@
# AGENTS.md

This file provides guidance to AI coding assistants when working with code in this repository.

## Repository Overview

This is the Apache HugeGraph-Computer repository containing two distinct graph computing systems:

1. **computer** (Java/Maven): A distributed BSP/Pregel-style graph processing framework that runs on Kubernetes or YARN
2. **vermeer** (Go): A high-performance in-memory graph computing platform with master-worker architecture

Both integrate with HugeGraph for graph data input/output.

## Build & Test Commands

### Computer (Java)

**Prerequisites:**
- JDK 11 for building/running
- JDK 8 for HDFS dependencies
- Maven 3.5+
- For K8s module: run `mvn clean install` first to generate CRD classes under computer-k8s

**Build:**
```bash
cd computer
mvn clean compile -Dmaven.javadoc.skip=true
```

**Tests:**
```bash
# Unit tests
mvn test -P unit-test

# Integration tests
mvn test -P integrate-test
```

**Run single test:**
```bash
# Run specific test class
mvn test -P unit-test -Dtest=ClassName

# Run specific test method
mvn test -P unit-test -Dtest=ClassName#methodName
```

**License check:**
```bash
mvn apache-rat:check
```

**Package:**
```bash
mvn clean package -DskipTests
```

### Vermeer (Go)

**Prerequisites:**
- Go 1.23+
- `curl` and `unzip` (for downloading binary dependencies)

**First-time setup:**
```bash
cd vermeer
make init # Downloads supervisord and protoc binaries, installs Go deps
```

**Build:**
```bash
make # Build for current platform
make build-linux-amd64
make build-linux-arm64
```

**Development build with hot-reload UI:**
```bash
go build -tags=dev
```

**Clean:**
```bash
make clean # Remove built binaries and generated assets
make clean-all # Also remove downloaded tools
```

**Run:**
```bash
# Using binary directly
./vermeer --env=master
./vermeer --env=worker

# Using script (configure in vermeer.sh)
./vermeer.sh start master
./vermeer.sh start worker
```

**Regenerate protobuf (if proto files changed):**
```bash
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.28.0
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@v1.2.0

# Generate (adjust protoc path for your platform)
# Note: remove the license header, if any
vermeer/tools/protoc/linux64/protoc vermeer/apps/protos/*.proto \
  --go-grpc_out=vermeer/apps/protos/. \
  --go_out=vermeer/apps/protos/.
```

## Architecture

### Computer (Java) - BSP/Pregel Framework

**Module Structure:**
- `computer-api`: Public interfaces for graph processing (Computation, Vertex, Edge, Aggregator, Combiner, GraphFactory)
- `computer-core`: Runtime implementation (WorkerService, MasterService, messaging, BSP coordination, managers)
- `computer-algorithm`: Built-in algorithms (PageRank, LPA, WCC, SSSP, TriangleCount, etc.)
- `computer-driver`: Job submission and driver-side coordination
- `computer-k8s`: Kubernetes deployment integration
- `computer-yarn`: YARN deployment integration
- `computer-k8s-operator`: Kubernetes operator for job management
- `computer-dist`: Distribution packaging
- `computer-test`: Integration and unit tests

**Key Design Patterns:**

1. **API/Implementation Separation**: Algorithms depend only on `computer-api` interfaces; `computer-core` provides runtime implementation. Algorithms are dynamically loaded via config.

2. **Manager Pattern**: `WorkerService` composes multiple managers (MessageSendManager, MessageRecvManager, WorkerAggrManager, DataServerManager, SortManagers, SnapshotManager, etc.) with lifecycle hooks: `initAll()`, `beforeSuperstep()`, `afterSuperstep()`, `closeAll()`.

3. **BSP Coordination**: Explicit barrier synchronization via etcd (EtcdBspClient). Each superstep follows:
   - `workerStepPrepareDone` → `waitMasterStepPrepareDone`
   - Local compute (vertices process messages)
   - `workerStepComputeDone` → `waitMasterStepComputeDone`
   - Aggregators/snapshots
   - `workerStepDone` → `waitMasterStepDone` (master returns SuperstepStat)

4. **Computation Contract**: Algorithms implement `Computation<M extends Value>`:
   - `compute0(context, vertex)`: Initialize at superstep 0
   - `compute(context, vertex, messages)`: Process messages in subsequent supersteps
   - Access to aggregators, combiners, and message sending via `ComputationContext`

**Important Files:**
- Algorithm contract: `computer/computer-api/src/main/java/org/apache/hugegraph/computer/core/worker/Computation.java`
- Runtime orchestration: `computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/worker/WorkerService.java`
- BSP coordination: `computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/bsp/Bsp4Worker.java`
- Example algorithm: `computer/computer-algorithm/src/main/java/org/apache/hugegraph/computer/algorithm/centrality/pagerank/PageRank.java`

### Vermeer (Go) - In-Memory Computing Engine

**Directory Structure:**
- `algorithms/`: Go algorithm implementations (pagerank.go, sssp.go, louvain.go, etc.)
- `apps/`:
  - `bsp/`: BSP coordination helpers
  - `graphio/`: HugeGraph I/O adapters (reads via gRPC to store/pd, writes via HTTP REST)
  - `master/`: Master scheduling, HTTP endpoints, worker management
  - `compute/`: Worker-side compute logic
  - `protos/`: Generated protobuf/gRPC definitions
  - `common/`: Utilities, logging, metrics
- `client/`: Client libraries
- `tools/`: Binary dependencies (supervisord, protoc)
- `ui/`: Web UI assets

**Key Patterns:**

1. **Maker/Registry Pattern**: Graph loaders/writers register themselves via `init()` (e.g., `LoadMakers[LoadTypeHugegraph] = &HugegraphMaker{}`). Master selects loader by type.

2. **HugeGraph Integration**:
   - `hugegraph.go` implements HugegraphMaker, HugegraphLoader, HugegraphWriter
   - Queries PD via gRPC for partition metadata
   - Streams vertex/edge data via gRPC from store (ScanPartition)
   - Writes results back via HugeGraph HTTP REST API

3. **Master-Worker**: Master schedules LoadPartition tasks to workers, manages worker lifecycle via WorkerManager/WorkerClient, exposes HTTP admin endpoints.

**Important Files:**
- HugeGraph integration: `vermeer/apps/graphio/hugegraph.go`
- Master scheduling: `vermeer/apps/master/tasks/tasks.go`
- Worker management: `vermeer/apps/master/workers/workers.go`
- HTTP endpoints: `vermeer/apps/master/services/http_master.go`
- Scheduler: `vermeer/apps/master/bl/scheduler_bl.go`

## Integration with HugeGraph

**Computer (Java):**
- `WorkerInputManager` reads vertices/edges from HugeGraph via `GraphFactory` abstraction
- Graph data is partitioned and distributed to workers via input splits

**Vermeer (Go):**
- Directly queries HugeGraph PD (metadata service) for partition information
- Uses gRPC to stream graph data from HugeGraph store
- Writes computed results back via HugeGraph HTTP REST API (adds properties to vertices)

## Development Workflow

**Adding a New Algorithm (Computer):**
1. Create class in `computer-algorithm` implementing `Computation<MessageType>`
2. Implement `compute0()` for initialization and `compute()` for message processing
3. Use `context.sendMessage()` or `context.sendMessageToAllEdges()` for message passing
4. Register aggregators in `beforeSuperstep()`, read/write in `compute()`
5. Configure algorithm class name in job config

**K8s-Operator Development:**
- CRD classes are auto-generated; run `mvn clean install` in `computer-k8s-operator` first
- Generated classes appear in `computer-k8s/target/generated-sources/`
- CRD generation script: `computer-k8s-operator/crd-generate/Makefile`

**Vermeer Asset Updates:**
- Web UI assets must be regenerated after changes: `cd asset && go generate`
- Or use `make generate-assets` from vermeer root
- For dev mode with hot-reload: `go build -tags=dev`

## Testing Notes

**Computer:**
- Integration tests require etcd, HDFS, HugeGraph, and Kubernetes (see `.github/workflows/computer-ci.yml`)
- Test environment setup scripts in `computer-dist/src/assembly/travis/`
- Unit tests run in isolation without external dependencies

**Vermeer:**
- Test scripts in `vermeer/test/`, with `vermeer_test.go` and `vermeer_test.sh`
- Configuration files in `vermeer/config/` (master.ini, worker.ini templates)

## CI/CD

CI pipeline (`.github/workflows/computer-ci.yml`) runs:
1. License check (Apache RAT)
2. Setup HDFS (Hadoop 3.3.2)
3. Setup Minikube/Kubernetes
4. Load test data into HugeGraph
5. Compile with Java 11
6. Run integration tests (`-P integrate-test`)
7. Run unit tests (`-P unit-test`)
8. Upload coverage to Codecov

## Important Notes

- **Computer K8s module**: Must run `mvn clean install` before editing to generate CRD classes
- **Java version**: Build requires JDK 11; HDFS dependencies require JDK 8
- **Vermeer binary deps**: First-time builds need `make init` to download supervisord/protoc
- **BSP coordination**: Computer uses etcd for barrier synchronization (configure via `BSP_ETCD_URL`)
- **Memory management**: Both systems auto-manage memory by spilling to disk when needed
