| 1 | +# DataMate Architecture |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +DataMate is a microservices-based data management platform for model fine-tuning and RAG retrieval. It follows a polyglot architecture with a Java backend, a Python runtime, and a React frontend. |
| 6 | + |
| 7 | +## High-Level Architecture |
| 8 | + |
| 9 | +``` |
| 10 | +┌─────────────────────────────────────────────────────────────────┐ |
| 11 | +│ Frontend (React) │ |
| 12 | +│ localhost:5173 │ |
| 13 | +└────────────────────────┬────────────────────────────────────────┘ |
| 14 | + │ HTTP/REST |
| 15 | + ▼ |
| 16 | +┌─────────────────────────────────────────────────────────────────┐ |
| 17 | +│ API Gateway │ |
| 18 | +│ (Spring Cloud) │ |
| 19 | +│ localhost:8080 │ |
| 20 | +│ ┌──────────────────────────────────────────────────────────┐ │ |
| 21 | +│ │ Authentication (JWT) │ │ |
| 22 | +│ │ Route Forwarding │ │ |
| 23 | +│ │ Rate Limiting │ │ |
| 24 | +│ └──────────────────────────────────────────────────────────┘ │ |
| 25 | +└────────────────┬────────────────────────────────────────────────┘ |
| 26 | + │ |
| 27 | + ├─────────────────┬─────────────────┐ |
| 28 | + ▼ ▼ ▼ |
| 29 | +┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐ |
| 30 | +│ Main Application │ │ Data Management │ │ RAG Indexer │ |
| 31 | +│ (Spring Boot) │ │ Service │ │ Service │ |
| 32 | +│ - Data Cleaning │ │ - Dataset Mgmt │ │ - Knowledge Base │ |
| 33 | +│ - Operator Market │ │ - File Operations │ │ - Vector Search │ |
| 34 | +│ - Data Collection │ │ - Tag Management │ │ - Milvus Integration │ |
| 35 | +└─────────┬───────────┘ └─────────┬───────────┘ └─────────┬───────────┘ |
| 36 | + │ │ │ |
| 37 | + │ │ │ |
| 38 | + ▼ ▼ ▼ |
| 39 | +┌─────────────────────────────────────────────────────────────────────────────────┐ |
| 40 | +│ PostgreSQL (Metadata) │ |
| 41 | +│ Redis (Cache) │ |
| 42 | +│ Milvus (Vectors) │ |
| 43 | +│ MinIO (Files) │ |
| 44 | +└─────────────────────────────────────────────────────────────────────────────────┘ |
| 45 | + │ |
| 46 | + ▼ |
| 47 | +┌─────────────────────────────────────────────────────────────────┐ |
| 48 | +│ Python Runtime (FastAPI) │ |
| 49 | +│ localhost:18000 │ |
| 50 | +│ ┌──────────────────────────────────────────────────────────┐ │ |
| 51 | +│ │ Data Synthesis │ │ |
| 52 | +│ │ Data Annotation (Label Studio) │ │ |
| 53 | +│ │ Data Evaluation │ │ |
| 54 | +│ │ RAG Indexing │ │ |
| 55 | +│ └──────────────────────────────────────────────────────────┘ │ |
| 56 | +└────────────────┬────────────────────────────────────────────────┘ |
| 57 | + │ |
| 58 | + ▼ |
| 59 | +┌─────────────────────────────────────────────────────────────────┐ |
| 60 | +│ Ray Executor (Distributed) │ |
| 61 | +│ ┌──────────────────────────────────────────────────────────┐ │ |
| 62 | +│ │ Operator Execution │ │ |
| 63 | +│ │ Task Scheduling │ │ |
| 64 | +│ │ Distributed Computing │ │ |
| 65 | +│ └──────────────────────────────────────────────────────────┘ │ |
| 66 | +└─────────────────────────────────────────────────────────────────┘ |
| 67 | +``` |
| 68 | + |
| 69 | +## Components |
| 70 | + |
| 71 | +### Frontend Layer |
| 72 | +- **Framework**: React 18 + TypeScript + Vite |
| 73 | +- **UI Library**: Ant Design |
| 74 | +- **Styling**: TailwindCSS v4 |
| 75 | +- **State Management**: Redux Toolkit |
| 76 | +- **Routing**: React Router v7 |
| 77 | + |
| 78 | +### Backend Layer (Java) |
| 79 | +- **API Gateway**: Spring Cloud Gateway |
| 80 | + - Route forwarding |
| 81 | + - JWT authentication |
| 82 | + - Rate limiting |
| 83 | + |
| 84 | +- **Main Application**: Spring Boot 3.5 |
| 85 | + - Data cleaning pipeline |
| 86 | + - Operator marketplace |
| 87 | + - Data collection tasks |
| 88 | + |
| 89 | +- **Data Management Service**: Spring Boot 3.5 |
| 90 | + - Dataset CRUD |
| 91 | + - File operations |
| 92 | + - Tag management |
| 93 | + |
| 94 | +- **RAG Indexer Service**: Spring Boot 3.5 |
| 95 | + - Knowledge base management |
| 96 | + - Vector search |
| 97 | + - Milvus integration |
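The gateway's rate limiting can be pictured with a token bucket. This document does not say which algorithm the gateway actually uses, so the sketch below is an assumption for illustration only; `TokenBucket`, `rate`, and `capacity` are hypothetical names, and the code is written in Python for brevity rather than in the gateway's Java.

```python
class TokenBucket:
    """Illustrative token bucket; not taken from the DataMate code base."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.updated_at = now       # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if a request arriving at time `now` may pass."""
        if now > self.updated_at:
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self.updated_at
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Passing the clock in as `now` keeps the sketch deterministic; a real limiter would read a monotonic clock and keep one bucket per client or route.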
| 98 | + |
| 99 | +### Runtime Layer (Python) |
| 100 | +- **FastAPI Backend**: Port 18000 |
| 101 | + - Data synthesis (QA generation) |
| 102 | + - Data annotation (Label Studio integration) |
| 103 | + - Model evaluation |
| 104 | + - RAG indexing |
| 105 | + |
| 106 | +- **Ray Executor**: Distributed execution |
| 107 | + - Operator execution |
| 108 | + - Task scheduling |
| 109 | + - Multi-node parallelism |
| 110 | + |
| 111 | +### Operator Ecosystem |
| 112 | +- **filter**: Data filtering (duplicates, sensitive content, quality) |
| 113 | +- **mapper**: Data transformation (cleaning, normalization) |
| 114 | +- **slicer**: Data segmentation (text splitting, slide extraction) |
| 115 | +- **formatter**: Format conversion (PDF → text, slide → JSON) |
| 116 | +- **llms**: LLM-based operators (quality evaluation, condition checking) |
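To make the operator categories above concrete, here is a minimal sketch of a mapper, a filter, and a slicer chained into one pipeline. The record shape (`{"text": ...}`) and all function names are hypothetical; the real operator interface may look quite different.

```python
from typing import Iterable, Iterator

# Hypothetical record shape for illustration: {"text": "..."}
Record = dict


def normalize_whitespace(record: Record) -> Record:
    """Mapper: collapse runs of whitespace in the text field."""
    return {**record, "text": " ".join(record["text"].split())}


def drop_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Filter: keep only the first record seen for each distinct text."""
    seen = set()
    for record in records:
        if record["text"] not in seen:
            seen.add(record["text"])
            yield record


def slice_by_length(record: Record, max_chars: int) -> list:
    """Slicer: split one record into chunks of at most max_chars characters."""
    text = record["text"]
    return [
        {**record, "text": text[start:start + max_chars]}
        for start in range(0, len(text), max_chars)
    ]


def run_pipeline(records: Iterable[Record], max_chars: int = 20) -> list:
    """Chain mapper -> filter -> slicer over an iterable of records."""
    mapped = (normalize_whitespace(r) for r in records)
    unique = drop_duplicates(mapped)
    return [chunk for r in unique for chunk in slice_by_length(r, max_chars)]
```

The mapper runs before the filter so that records differing only in whitespace count as duplicates.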
| 117 | + |
| 118 | +## Data Flow |
| 119 | + |
| 120 | +### 1. Data Ingestion |
| 121 | +``` |
| 122 | +User Upload → Frontend → API Gateway → Data Management Service → PostgreSQL/MinIO |
| 123 | +``` |
| 124 | + |
| 125 | +### 2. Data Processing |
| 126 | +``` |
| 127 | +Dataset → Frontend → API Gateway → Main Application → Python Runtime |
| 128 | +→ Ray Executor → Operators → Processed Data → PostgreSQL/MinIO |
| 129 | +``` |
| 130 | + |
| 131 | +### 3. RAG Indexing |
| 132 | +``` |
| 133 | +Processed Data → Python Runtime → RAG Indexer Service → Milvus (Vectors) |
| 134 | +``` |
| 135 | + |
| 136 | +### 4. RAG Retrieval |
| 137 | +``` |
| 138 | +Query → Frontend → API Gateway → RAG Indexer Service → Milvus → Results |
| 139 | +``` |
| 140 | + |
| 141 | +## Technology Stack |
| 142 | + |
| 143 | +| Layer | Technology | |
| 144 | +|--------|-----------| |
| 145 | +| **Frontend** | React 18, TypeScript, Vite, Ant Design, TailwindCSS | |
| 146 | +| **Backend** | Spring Boot 3.5, Java 21, MyBatis-Plus, PostgreSQL | |
| 147 | +| **Runtime** | FastAPI, Python 3.12, Ray, SQLAlchemy | |
| 148 | +| **Vector DB** | Milvus | |
| 149 | +| **Cache** | Redis | |
| 150 | +| **Object Storage** | MinIO | |
| 151 | +| **Deployment** | Docker Compose, Kubernetes/Helm | |
| 152 | + |
| 153 | +## Communication Patterns |
| 154 | + |
| 155 | +### Service-to-Service |
| 156 | +- **REST API**: HTTP/JSON between frontend and backend |
| 157 | +- **gRPC**: (if any) between backend services |
| 158 | +- **Message Queue**: (if any) for async tasks |
| 159 | + |
| 160 | +### Backend-to-Runtime |
| 161 | +- **HTTP/REST**: Java backend calls Python runtime APIs |
| 162 | +- **Ray**: Python runtime submits tasks to Ray executor |
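The submit-then-gather pattern behind the Ray hand-off can be sketched without a cluster. The example below uses the standard library's `ThreadPoolExecutor` as a stand-in for Ray so that it runs anywhere. With Ray, `run_operator` would be decorated with `@ray.remote`, submitted via `run_operator.remote(batch)`, and collected with `ray.get`. The operator itself is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_operator(batch: list) -> list:
    """Hypothetical operator: trim and lower-case each text in a batch."""
    return [text.strip().lower() for text in batch]


def submit_batches(batches: list, max_workers: int = 4) -> list:
    """Submit one task per batch, then gather results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_operator, batch) for batch in batches]
        return [future.result() for future in futures]
```

Gathering in submission order keeps output batches aligned with input batches even when tasks finish out of order.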
| 163 | + |
| 164 | +## Security |
| 165 | + |
| 166 | +### Authentication |
| 167 | +- **JWT**: Token-based authentication via API Gateway |
| 168 | +- **Session**: (if any) session management |
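For readers unfamiliar with what token validation at the gateway involves, here is a minimal HS256 sign-and-verify sketch using only the Python standard library. It assumes a shared-secret HS256 token with an optional `exp` claim; the signing algorithm and claims DataMate actually uses are not stated in this document, and production code should rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_token(claims: dict, secret: bytes) -> str:
    """Build a compact HS256 token: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_b64url_encode(_sign(signing_input, secret))}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if the signature and expiry check out, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" in claims and current >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

Pinning the accepted algorithm to `HS256` rejects tokens that declare `alg: none` or switch to an unexpected scheme.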
| 169 | + |
| 170 | +### Authorization |
| 171 | +- **Role-based**: (if any) RBAC |
| 172 | +- **Resource-based**: (if any) resource-level access control |
| 173 | + |
| 174 | +## Scalability |
| 175 | + |
| 176 | +### Horizontal Scaling |
| 177 | +- **Backend Services**: Kubernetes pod scaling via Helm |
| 178 | +- **Ray Executor**: Multi-node Ray cluster |
| 179 | +- **Frontend**: Static asset serving + CDN |
| 180 | + |
| 181 | +### Data Layer Scaling |
| 182 | +- **Database**: PostgreSQL connection pooling |
| 183 | +- **Cache**: Redis clustering |
| 184 | +- **Vector DB**: Milvus cluster |
| 185 | + |
| 186 | +## Deployment |
| 187 | + |
| 188 | +### Docker Compose |
| 189 | +```bash |
| 190 | +make install INSTALLER=docker |
| 191 | +``` |
| 192 | + |
| 193 | +### Kubernetes/Helm |
| 194 | +```bash |
| 195 | +make install INSTALLER=k8s |
| 196 | +``` |
| 197 | + |
| 198 | +## Monitoring |
| 199 | + |
| 200 | +### Metrics |
| 201 | +- **Spring Boot Actuator**: `/actuator/metrics` |
| 202 | +- **Prometheus**: (if configured) metrics collection |
| 203 | +- **Ray**: Ray dashboard for executor monitoring |
| 204 | + |
| 205 | +### Logging |
| 206 | +- **Java**: Log4j2 |
| 207 | +- **Python**: Executor logs are viewable in the Ray dashboard |
| 208 | + |
| 209 | +## Architecture Decisions |
| 210 | + |
| 211 | +### Why Polyglot? |
| 212 | +- **Java Backend**: Enterprise-grade, mature ecosystem, strong typing |
| 213 | +- **Python Runtime**: Rich ML/AI ecosystem, flexible, fast prototyping |
| 214 | +- **React Frontend**: Modern UI, component-based, large ecosystem |
| 215 | + |
| 216 | +### Why Microservices? |
| 217 | +- **Scalability**: Independent scaling of services |
| 218 | +- **Maintainability**: Clear service boundaries |
| 219 | +- **Technology Diversity**: Use best tool for each job |
| 220 | + |
| 221 | +### Why Ray? |
| 222 | +- **Distributed Computing**: Seamless multi-node execution |
| 223 | +- **Fault Tolerance**: Automatic task retry and recovery |
| 224 | +- **Resource Management**: Dynamic resource allocation |
| 225 | + |
| 226 | +## Future Enhancements |
| 227 | + |
| 228 | +- [ ] Service Mesh (Istio/Linkerd) |
| 229 | +- [ ] Event Bus (Kafka/Pulsar) |
| 230 | +- [ ] GraphQL API |
| 231 | +- [ ] Real-time updates (WebSocket) |
| 232 | +- [ ] Advanced Monitoring (Grafana, Loki) |
| 233 | + |
| 234 | +## References |
| 235 | + |
| 236 | +- [Backend Architecture](./backend/README.md) |
| 237 | +- [Runtime Architecture](./runtime/README.md) |
| 238 | +- [Frontend Architecture](./frontend/README.md) |
| 239 | +- [AGENTS.md](./AGENTS.md) |