diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000000..5fefc3821a
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,44 @@
+# AGENTS.md
+
+## Purpose and scope
+- This repository is a multi-module Java + C++ codebase for the Pixels columnar engine, deployment daemons, and serverless query acceleration.
+- `PIXELS_HOME` is a hard requirement for most workflows (`install.sh`, IntelliJ run configs, runtime scripts).
+- Existing AI guidance files were not found; conventions here are inferred from project READMEs and build scripts.
+
+## Big-picture architecture (read these first)
+- Core format/runtime: `pixels-core`, shared APIs/types: `pixels-common`, cache service: `pixels-cache`.
+- Control plane services run as stateless daemons (`pixels-daemon`): metadata, transaction, cache coordination, index endpoints.
+- External API gateway: `pixels-server` (REST on `18891`, RPC on `18892`).
+- Query pipeline: SQL parse (`pixels-parser`) -> physical planning (`pixels-planner`) -> operator execution (`pixels-executor`).
+- Turbo/serverless path: `pixels-turbo/*` modules (`pixels-invoker-*`, `pixels-worker-*`, `pixels-scaling-*`) integrate Trino with Lambda/vHive/EC2 autoscaling.
+- Retina CDC path: `pixels-retina` + `cpp/pixels-retina` replay log-based changes with MVCC; index backends are pluggable under `pixels-index/*`.
+
+## Cross-component contracts and data flow
+- RPC/data contracts are centralized in `proto/*.proto`; row batch schema is in `flatbuffers/rowBatch.fbs`.
+- Storage backends are split by module (`pixels-storage/pixels-storage-{s3,hdfs,gcs,redis,http,localfs,...}`) and selected by table/storage settings.
+- Typical operational flow: Trino connector reads Pixels metadata/files -> optional Turbo pushdown builds sub-plans -> workers execute and write intermediate/output storage.
+- Example runtime switch for Turbo lives in Trino catalog config: `cloud.function.switch=off|on|auto` (`pixels-turbo/README.md`).
+
+## Developer workflows (project-specific)
+- Full build from repo root (default used by project):
+  - `mvn -T 3 clean install`
+- Install local runnable layout (`bin/`, `sbin/`, `etc/`) into `PIXELS_HOME`:
+  - `./install.sh`
+- Start core services after install:
+  - `./sbin/start-pixels.sh` (from `PIXELS_HOME`)
+- Run CLI tooling (load/compact/stat/eval):
+  - `java -jar ./sbin/pixels-cli-*-full.jar`
+- C++/DuckDB path is separate (`cpp/README.md`):
+  - `cd cpp && make pull && make -j`
+
+## Test/build caveats to avoid wasted cycles
+- Root `pom.xml` sets `maven-surefire-plugin` with `true`; tests do not run unless explicitly enabled.
+- Some JUnit tests need lower JDK internals access (README notes JDK 8 for those tests), while integrations like Trino may require newer JDKs.
+- Prefer module-scoped iterations when changing one area (for example `mvn -pl pixels-core -am ...`) to avoid rebuilding all modules.
+
+## Conventions agents should follow in this repo
+- Keep runtime/config edits aligned with `PIXELS_HOME/etc/pixels.properties` and scripts under `scripts/{bin,sbin,etc}`.
+- When documenting/evaluating performance flows, mirror project examples in `docs/TPC-H.md` and `docs/CLICKBENCH.md` (for example `LOAD`, `COMPACT`, `STAT`).
+- For storage- or index-related changes, update the matching pluggable module instead of hard-coding backend-specific logic in shared modules.
+- For serverless changes, ensure planner/invoker/worker settings remain consistent (input/intermediate/output storage schemes in `pixels-turbo/README.md`).
+ diff --git a/README.md b/README.md index 62e26aa588..d39b44266c 100644 --- a/README.md +++ b/README.md @@ -2,12 +2,6 @@ Pixels ======= [![Pixels Daily Build](https://github.com/pixelsdb/pixels/actions/workflows/daily-build.yml/badge.svg)](https://github.com/pixelsdb/pixels/releases/tag/daily-latest) ![GitHub commits](https://img.shields.io/github/commit-activity/m/pixelsdb/pixels/master) -![GitHub Issues](https://img.shields.io/github/issues-closed/pixelsdb/pixels) -![GitHub Pull Requests](https://img.shields.io/github/issues-pr-closed/pixelsdb/pixels) -[![Visitors](https://api.visitorbadge.io/api/combined?path=https%3A%2F%2Fgithub.com%2Fpixelsdb%2Fpixels&label=visitors&countColor=%23ff8a65&style=flat)](https://visitorbadge.io/status?path=https%3A%2F%2Fgithub.com%2Fpixelsdb%2Fpixels) -![GitHub Created At](https://img.shields.io/github/created-at/pixelsdb/pixels) -![GitHub code size](https://img.shields.io/github/languages/code-size/pixelsdb/pixels) -![GitHub repo size](https://img.shields.io/github/repo-size/pixelsdb/pixels) [![GitHub License](https://img.shields.io/github/license/pixelsdb/pixels)](https://github.com/pixelsdb/pixels/blob/master/LICENSE) [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/pixelsdb/pixels) @@ -27,7 +21,7 @@ The other integrations are opensourced in separate repositories: Pixels also has its own query engine [Pixels-Turbo](pixels-turbo). It prioritizes processing queries in an autoscaling MPP cluster (currently based on Trino) and exploits serverless functions -(e.g, [AWS Lambda](https://aws.amazon.com/lambda/) or [vHive / Knative](https://github.com/vhive-serverless/vHive)) +(e.g, [AWS Lambda](https://aws.amazon.com/lambda/), [vHive / Knative](https://github.com/vhive-serverless/vHive), and [Spike](https://github.com/pixelsdb/pixels-spike)) to accelerate the processing of workload spikes. 
With `Pixels-Turbo`, we can achieve better performance and cost-efficiency for continuous workloads while not compromising elasticity for workload spikes. @@ -37,6 +31,11 @@ service levels in query urgency. It allows users to select whether to execute th Pixels-Turbo can apply different resource scheduling and query execution policies for Different levels of query urgency, which will result in different monetary costs on resources. +Furthermore, Pixels has a real-time data synchronization framework namely [Pixels-Retina](pixels-retina). +It replays data-change operations from log-based CDC sources as mirror transactions on the columnar table data, +using a lightweight MVCC mechanism to support concurrent analytical queries with 10-ms-level data freshness, significantly +outperforming the batch-granular merge-on-read approach used by existing lakehouses such as Apache Iceberg and Paimon. + ## Build Pixels Pixels is mainly implemented in both Java (with some JNI hooks of system calls and C/C++ libs) and C++. diff --git a/cpp/README.md b/cpp/README.md index 11dda0bd0f..b4807583fd 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -202,7 +202,7 @@ localfs.enable.async.io=true ## Common issues -### 1. How to fetch the lastest pixels reader and duckdb? +### 1. How to fetch the latest pixels reader and duckdb? `pixels reader` and `duckdb` will be updated frequently in the next few months, so please keep the two submodules updated. 
diff --git a/docker/Pixels Docker Deployment Guide.md b/docs/INSTALL-DOCKER.md similarity index 100% rename from docker/Pixels Docker Deployment Guide.md rename to docs/INSTALL-DOCKER.md diff --git a/docs/INSTALL.md b/docs/INSTALL.md index dd76fb8176..2df9d89ea4 100644 --- a/docs/INSTALL.md +++ b/docs/INSTALL.md @@ -12,6 +12,7 @@ Here, we only install and configure the essential components for query processin To use the following optional components, follow the instructions in the corresponding README.md after the basic installation: * [Pixels Cache](../pixels-cache/README.md): The distributed columnar cache to accelerate query processing. * [Pixels Turbo](../pixels-turbo/README.md): The hybrid query engine that invokes serverless resources to help process unpredictable workload spikes. +* [Pixels Retina](../pixels-retina/README.md): The transactional data synchronization framework that replays the data changes from the CDC (Change-Data-Capture) stream. * [Pixels Amphi](../pixels-amphi/README.md): The adaptive query scheduler that enables cost-efficient query processing in both on-perm and in-cloud environments. In AWS EC2, create an Ubuntu 22.04 instance with x86 arch and at least 4GB memory and 20GB root volume. diff --git a/pixels-cli/README.md b/pixels-cli/README.md index ea197d97fb..678cf64f84 100644 --- a/pixels-cli/README.md +++ b/pixels-cli/README.md @@ -4,8 +4,6 @@ This is the command-line tool for benchmark evaluations (e.g., TPC-H). We can use it to load data from csv files into Pixels, copy the data, compact the small files, collect data statistics, and run the benchmark queries. -It was previously named `pixels-load`. - ## Usage [TPC-H Evaluation](../docs/TPC-H.md) provides an example of using `pixels-cli`. 
\ No newline at end of file diff --git a/pixels-daemon/README.md b/pixels-daemon/README.md index 4a0c5e4ea8..b2a8e129a5 100644 --- a/pixels-daemon/README.md +++ b/pixels-daemon/README.md @@ -7,3 +7,7 @@ While the `Workers` are deployed together with the query engine workers and stor Either `Coordinator` or `Worker` process is started as a **stateless** daemon process in the server. Whenever the daemon process is crashed or killed, it only needs a restart to recover. + +## Usage + +[TPC-H Evaluation](../docs/TPC-H.md) provides an example of using `pixels-daemon`. diff --git a/pixels-retina/README.md b/pixels-retina/README.md index 8e3d2b475b..c0536fa56d 100644 --- a/pixels-retina/README.md +++ b/pixels-retina/README.md @@ -14,7 +14,7 @@ replay throughput, without compromising query performance or resource cost-effic significantly outperforming state-of-the-art lakehouses, Iceberg and Paimon, which provides minute-level data freshness and one order of magnitude lower data-change throughput. -## Retina Components +## Components The components related to Retina are: diff --git a/pixels-storage/README.md b/pixels-storage/README.md index 21d47a79fb..604705d18e 100644 --- a/pixels-storage/README.md +++ b/pixels-storage/README.md @@ -14,6 +14,7 @@ Queries will load and call the providers to get access to the underlying storage - `pixels-storage-s3qs` provides a storage based on AWS SQS and S3 for intermediate data shuffle. ## Usage + Storage provider can be used in either of the following ways: 1. Put the compiled jar and its dependencies in the CLASSPATH of you program. 2. If your program is build by maven, you can also add the storage provider as dependency. diff --git a/pixels-turbo/README.md b/pixels-turbo/README.md index 9944942dfb..311b838678 100644 --- a/pixels-turbo/README.md +++ b/pixels-turbo/README.md @@ -7,6 +7,7 @@ and automatically invokes cloud functions to process unpredictable workload spik enough resources. 
## Components + Currently, `Pixels-Turbo` uses Trino as the query entry-point and the query processor in the MPP cluster, and implements the serverless query accelerator from scratch. `Pixels-Turbo` is composed of the following components: