Skip to content

Commit 4078c23

Browse files
Update export format to version 3: switch to ZIP archive with manifest and per-entity JSON files
1 parent 9cc68a5 commit 4078c23

10 files changed

Lines changed: 950 additions & 360 deletions

File tree

extensions/saas/sources/migration/ARCHITECTURE.md

Lines changed: 152 additions & 79 deletions
Original file line numberDiff line numberDiff line change
@@ -48,7 +48,7 @@ The Account Migration Module enables full lifecycle management of tenant (Accoun
4848
┌──────────────┐ ┌────────────────┐
4949
│ ExportPipeline│ │ ImportPipeline │
5050
│ (streaming │ │ (streaming │
51-
JSON write)│ │ JSON read)
51+
ZIP write) │ │ ZIP / legacy)
5252
└──────┬───────┘ └───────┬────────┘
5353
│ │
5454
│ uses │ uses
@@ -97,26 +97,61 @@ Account → Customer → Order → OrderItem
9797

9898
---
9999

100-
## 4. Export Pipeline
100+
## 4. Export Pipeline (format v3)
101+
102+
### Output: ZIP Archive
103+
104+
Every export produces a single ZIP file containing one JSON entry per entity type, plus a manifest:
105+
106+
```
107+
Account101_20260616T100500.zip
108+
├── manifest.json ← always first entry
109+
├── Account101_Account.json ← empty rows; Account data lives in manifest
110+
├── Account101_Customer.json
111+
├── Account101_Product.json
112+
└── Account101_Order.json
113+
```
114+
115+
ZIP entries use `DEFLATE` at `BEST_SPEED` level. Typical JSON compresses 5–10×.
101116

102117
### Streaming Strategy
103118

104119
```
105-
OutputStream (raw or GZIPOutputStream)
106-
└── JsonGenerator (Jackson streaming)
107-
├── Write header: {version, exportedAt, sourceAccountId, identityStrategy}
108-
├── Write "account": AccountDTO object
109-
└── Write "entities": {
110-
For each entityClass in topological order:
111-
Write entityClass.getName(): {
112-
"fields": ["id", "name", "category_ref_id", ...],
113-
"rows": [
114-
[1, "John", 3],
115-
[2, "Cindy", null],
116-
...
117-
]
118-
}
119-
}
120+
ZipOutputStream (wraps caller's OutputStream, 256 KB buffer)
121+
122+
├── Entry: manifest.json
123+
│ JsonGenerator → {version, exportedAt, sourceAccountId,
124+
│ identityStrategy, account: AccountDTO,
125+
│ entities: [{file, entityClass}, ...]}
126+
127+
└── For each entityClass in topological order:
128+
Entry: Account{id}_{SimpleName}.json
129+
JsonGenerator → {
130+
"entityClass": "com.example.Customer",
131+
"fields": ["id", "name", "category_ref_id", ...],
132+
"rows": [
133+
[1, "John", 3],
134+
[2, "Cindy", null],
135+
...
136+
]
137+
}
138+
```
139+
140+
Each `JsonGenerator` is wrapped with a `NoCloseOutputStream` so that `gen.close()` flushes
141+
without closing the underlying `ZipOutputStream`.
142+
143+
### Keyset Pagination
144+
145+
Entity rows are fetched in pages using keyset pagination (`id > lastId`) to avoid
146+
`OFFSET`-based degradation on large tables:
147+
148+
```
149+
lastId = 0
150+
loop:
151+
SELECT ... WHERE accountId = ? AND id > lastId ORDER BY id LIMIT chunkSize
152+
→ stream rows into current ZIP entry
153+
lastId = last row's id
154+
until page.size() < chunkSize
120155
```
121156

122157
### Serialization of a Single Entity (Columnar Format)
@@ -131,32 +166,55 @@ For each SingularAttribute (excluding id):
131166
```
132167

133168
Fields annotated with `@ExportIgnore` are skipped. Missing or inaccessible fields write `null`.
169+
`Account` entity rows are intentionally empty — account data is in `manifest.json`.
134170

135171
---
136172

137173
## 5. Import Pipeline
138174

139-
### Streaming Strategy
175+
### Format Detection (magic bytes)
140176

141177
```
142-
InputStream (auto-detected: raw or GZIPInputStream)
143-
└── JsonParser (Jackson streaming)
144-
├── Read header → validate version, note sourceAccountId
145-
├── Read "account" → AccountDTO (optionally create new Account)
146-
└── Read "entities" → {
147-
For each entityClassName:
148-
resolve class → Class.forName(entityClassName)
149-
Read "fields": [...] → ordered column names
150-
For each row array (chunked):
151-
reconstruct JsonNode from fields + row values
152-
deserialize → entity instance
153-
resolve _ref_id references via idMappings
154-
set accountId = targetAccountId
155-
persist entity
156-
record originalId → newId in idMappings
157-
}
178+
BufferedInputStream.mark(4) → peek 2 bytes → reset
179+
0x50 0x4B ("PK") → ZIP archive → importFromZip()
180+
0x1F 0x8B → GZIP stream → importLegacy() with GZIPInputStream wrapper
181+
anything else → plain JSON → importLegacy()
158182
```
159183

184+
### ZIP Import Flow (v3)
185+
186+
```
187+
ZipInputStream (sequential — entries arrive in topological order)
188+
189+
├── Entry: manifest.json
190+
│ Read version + sourceAccountId for logging
191+
│ (entity list not required — each entity file is self-describing)
192+
193+
└── For each *.json entry:
194+
JsonParser (via NoCloseInputStream wrapper)
195+
→ read "entityClass" → Class.forName()
196+
→ read "fields" → ordered column names
197+
→ read "rows" → chunked into List<JsonNode>
198+
persistChunk() [REQUIRES_NEW transaction per chunk]
199+
→ deserialize entity
200+
→ resolve _ref_id references via idMappings
201+
→ set accountId = targetAccountId
202+
→ persist + flush (REGENERATE_IDS) or set ID (KEEP_IDS)
203+
→ record originalId → newId in idMappings
204+
zipIn.closeEntry() — advances stream to next entry
205+
```
206+
207+
`NoCloseInputStream` prevents `parser.close()` from closing the `ZipInputStream` between entries.
208+
`ZipInputStream.read()` reports EOF at the end of each entry, so Jackson's parser stops
209+
naturally without reading into the next entry.
210+
211+
### Legacy Import (v1 / v2)
212+
213+
For files produced by format v1 (per-row objects) or v2 (single JSON/GZIP):
214+
- GZIP-wrapped streams are unwrapped automatically.
215+
- The single JSON document is parsed using the original streaming algorithm.
216+
- This path is maintained for backward compatibility and will not be removed.
217+
160218
### ID Resolution (Identity Mapper)
161219

162220
```
@@ -172,7 +230,20 @@ REGENERATE_IDS (default for clone):
172230

173231
---
174232

175-
## 6. Worker Lifecycle
233+
## 6. Clone Operation
234+
235+
Clone uses a temporary file (not an in-memory buffer) to safely handle tenants with
236+
hundreds of megabytes of data:
237+
238+
```
239+
Phase 1: ExportPipeline.export(source, tempFile) → Account{src}_timestamp.zip
240+
Phase 2: ImportPipeline.importTenant(tempFile, importOptions)
241+
Phase 3: Files.deleteIfExists(tempFile) [in finally block]
242+
```
243+
244+
---
245+
246+
## 7. Worker Lifecycle
176247

177248
```
178249
POST /export/{accountId}
@@ -184,7 +255,7 @@ AccountMigrationJobService.createExportJob(accountId, options)
184255
├── 2. SchedulerUtil.runWithResult(new ExportWorker(jobId, accountId, options))
185256
│ └── Virtual Thread starts
186257
│ ├── Update job status → RUNNING
187-
│ ├── Call ExportPipeline.export(...)
258+
│ ├── Call ExportPipeline.export(...) → writes Account{id}_{ts}.zip
188259
│ │ └── MigrationProgressListener updates job.progress periodically
189260
│ ├── On success: update job status → COMPLETED, set resultPath
190261
│ └── On failure: update job status → FAILED, set errorMessage
@@ -200,58 +271,57 @@ Cancellation:
200271

201272
---
202273

203-
## 7. Export File Format
274+
## 8. Export File Format (v3)
204275

205276
```
206-
saas_export_42_20260614T100500.json[.gz]
277+
Account42_20260614T100500.zip
207278
```
208279

280+
**manifest.json** (first ZIP entry):
209281
```json
210282
{
211-
"version": "2",
283+
"version": "3",
212284
"exportedAt": "2026-06-14T10:05:00",
213285
"sourceAccountId": 42,
214286
"identityStrategy": "KEEP_IDS",
215287
"account": {
216288
"id": 42,
217289
"name": "Acme Corp",
218290
"subdomain": "acme",
219-
"email": "admin@acme.com",
220-
...
291+
"email": "admin@acme.com"
221292
},
222-
"entities": {
223-
"tools.dynamia.modules.saas.jpa.AccountParameter": {
224-
"fields": ["id", "accountId", "name", "value"],
225-
"rows": [
226-
[1, 42, "theme", "dark"]
227-
]
228-
},
229-
"com.example.Customer": {
230-
"fields": ["id", "accountId", "name", "category_ref_id"],
231-
"rows": [
232-
[10, 42, "John", 3]
233-
]
234-
},
235-
"com.example.Order": {
236-
"fields": ["id", "accountId", "total", "customer_ref_id"],
237-
"rows": [
238-
[100, 42, 99.99, 10]
239-
]
240-
}
241-
}
293+
"entities": [
294+
{ "file": "Account42_AccountParameter.json", "entityClass": "tools.dynamia.modules.saas.jpa.AccountParameter" },
295+
{ "file": "Account42_Customer.json", "entityClass": "com.example.Customer" },
296+
{ "file": "Account42_Order.json", "entityClass": "com.example.Order" }
297+
]
298+
}
299+
```
300+
301+
**Account42_Customer.json** (ZIP entry per entity):
302+
```json
303+
{
304+
"entityClass": "com.example.Customer",
305+
"fields": ["id", "accountId", "name", "category_ref_id"],
306+
"rows": [
307+
[10, 42, "John", 3],
308+
[11, 42, "Cindy", null]
309+
]
242310
}
243311
```
244312

245313
**Key conventions:**
246-
- Each entity section is an object with `fields` (column names) and `rows` (value arrays).
314+
- Each entity file is a standalone JSON document with `entityClass`, `fields`, and `rows`.
247315
- `{fieldName}_ref_id` encodes a `@ManyToOne` / `@OneToOne` reference by its primary key.
248-
- The `account` section makes the package self-describing.
249-
- Entities appear in topological order (parents before children).
250-
- Format version `"2"` uses the columnar layout; version `"1"` (legacy) used per-row objects.
316+
- Entities appear in topological order (parents before children) both in the manifest and as ZIP entries.
317+
- The `account` section in the manifest makes the archive self-describing.
318+
- Format version `"3"` uses the ZIP multi-file layout.
319+
- Format version `"2"` (legacy) uses a single columnar JSON/GZIP document.
320+
- Format version `"1"` (legacy) uses a single JSON document with per-row objects.
251321

252322
---
253323

254-
## 8. Identity Mapping SPI
324+
## 9. Identity Mapping SPI
255325

256326
```java
257327
public interface IdentityMapper {
@@ -280,7 +350,7 @@ Register a custom implementation as a Spring bean (`@Component` / `@Service`) to
280350
281351
---
282352

283-
## 9. Annotaions
353+
## 10. Annotations
284354

285355
### `@AccountExportIgnore`
286356

@@ -304,7 +374,7 @@ Apply to entity fields that should be skipped during serialization (computed val
304374

305375
---
306376

307-
## 10. Database Schema
377+
## 11. Database Schema
308378

309379
The module adds one table:
310380

@@ -329,25 +399,28 @@ CREATE TABLE saas_migration_jobs (
329399

330400
---
331401

332-
## 11. Scalability Notes
402+
## 12. Scalability Notes
333403

334404
| Concern | Mitigation |
335405
|---------|------------|
336-
| Large tables | Chunked pagination (configurable, default 500 rows/page) |
337-
| Memory | Jackson streaming API: never load all rows |
338-
| Network / disk | Optional GZIP compression |
406+
| Large tables | Keyset pagination (`id > lastId`, configurable chunk size, default 500 rows/page) |
407+
| Memory | Jackson streaming API + ZipOutputStream: never load all rows; one ZIP entry at a time |
408+
| Disk I/O | 256 KB write buffer on ZipOutputStream; DEFLATE BEST_SPEED compression |
409+
| Network / disk | ZIP always produced — typically 5–10× smaller than raw JSON |
339410
| Long-running jobs | Virtual threads, cooperative cancellation via `CancellationToken` |
340-
| DB load | Read-only paginated queries; imports batched per chunk |
411+
| DB load | Read-only keyset-paginated queries; imports batched per chunk in isolated transactions |
341412
| Concurrent jobs | In-memory job registry + DB-backed state; configurable max concurrent |
413+
| Large clone | Export → temp file → import (avoids OOM from in-memory buffers) |
342414

343415
---
344416

345-
## 12. Implementation Roadmap
417+
## 13. Implementation Roadmap
346418

347419
| Phase | Scope |
348420
|-------|-------|
349-
| **v1 (current)** | EXPORT, IMPORT, CLONE, BACKUP, RESTORE. `KEEP_IDS` + `REGENERATE_IDS`. REST API. Progress tracking. Cancellation. |
350-
| **v2** | Cross-environment MIGRATE (HTTP push to remote endpoint). Resume after failure (checkpoint in DB). |
351-
| **v3** | `UUID7` identity strategy. Partial export (subset of entities). Schema validation on import. Diff/merge strategy. |
352-
| **v4** | Multi-region database migration. S3/GCS file storage backend. Event-driven progress via SSE/WebSocket. |
353-
421+
| **v1 (legacy)** | EXPORT, IMPORT as single JSON. `KEEP_IDS` + `REGENERATE_IDS`. |
422+
| **v2 (legacy)** | Columnar format (fields + rows arrays). Optional GZIP. |
423+
| **v3 (current)** | ZIP multi-file: one JSON per entity. Always compressed. Keyset pagination. Backward-compatible import. Clone via temp file. |
424+
| **v4** | Cross-environment MIGRATE (HTTP push to remote endpoint). Resume after failure (checkpoint in DB). |
425+
| **v5** | `UUID7` identity strategy. Partial export (subset of entities). Schema validation on import. |
426+
| **v6** | Multi-region database migration. S3/GCS file storage backend. Event-driven progress via SSE/WebSocket. |

0 commit comments

Comments
 (0)