.github/copilot-instructions.md
Lines changed: 57 additions & 13 deletions
@@ -8,7 +8,7 @@
**What this is:** An AI-assisted, modular ETL (Extract, Transform, Load) platform where each data operation is an independent Flask microservice. Pipelines are orchestrated via Apache Airflow DAGs or via the AI agent (natural language → YAML → execution).
-
**Primary use case:** HR / People Analytics — the platform ships with a production-ready pipeline for the IBM HR Attrition dataset (extract → quality check → drop columns → outlier detection → clean NaN → load).
+
**Primary use case:** HR / People Analytics and E-commerce — the platform ships with production-ready pipelines for the IBM HR Attrition dataset and e-commerce order analytics, plus a weather API demo. Bundled demo datasets in `data/demo/` allow out-of-the-box testing.
**Core differentiator:** Composable, dynamically-assembled pipelines where an AI agent can translate natural language into validated YAML pipeline definitions and execute them. Each microservice is independently deployable, observable, and scalable.
|`ai_agent/pipeline_agent.py`|`PipelineAgent`: builds system prompt from `service_registry.json`, calls LLM to generate YAML, validates structure + services + params + dependencies. Can be instantiated without LLM (via `__new__()`) for validation-only use (e.g., Streamlit UI). |
-
|`ai_agent/pipeline_compiler.py`|`PipelineCompiler`: executes validated pipeline definition step-by-step via Preparator SDK, returns `PipelineResult` with per-step metrics + `correlation_id`. Supports `join_datasets` (2 `depends_on` entries as input datasets). Exposes `last_step_outputs` dict for UI data preview. |
+
|`ai_agent/pipeline_agent.py`|`PipelineAgent`: builds system prompt from `service_registry.json`, calls LLM to generate YAML, validates structure + services + params + dependencies. Standalone `validate_pipeline()` module-level function enables validation-only use without instantiating the agent (e.g., Streamlit UI). |
+
|`ai_agent/pipeline_compiler.py`|`PipelineCompiler`: executes validated pipeline definitions via Preparator SDK with **parallel execution** of independent steps (topological layering via Kahn’s algorithm + `ThreadPoolExecutor`). Uses a **dispatch registry** (`_build_dispatch_registry()`) for extensibility—add new services via `register_service()` without if/elif chains. Returns `PipelineResult` with per-step metrics + `correlation_id`. Supports `join_datasets` (2 `depends_on` entries). Exposes `last_step_outputs` dict for UI data preview. |
|`schemas/service_registry.json`| Complete metadata for all 11 services: name, type, description, endpoint, input/output formats, params with types/required/defaults/enums |
|`schemas/pipeline_schema.json`| JSON Schema for pipeline definitions |
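
The parallel execution and dispatch registry described for `PipelineCompiler` in the table above might look roughly like the following. This is a minimal sketch under stated assumptions, not the project's code: the `register_service(name, handler)` signature, the `DISPATCH` dict, the handler shape `(params, inputs)`, and the pipeline dict layout are all hypothetical. Only the technique (Kahn's algorithm for layering, `ThreadPoolExecutor` for running each layer) comes from the description.

```python
from concurrent.futures import ThreadPoolExecutor


def topological_layers(steps):
    """Group step names into layers with Kahn's algorithm.

    `steps` maps a step name to the list of step names it depends on.
    Every step in a layer depends only on steps in earlier layers.
    """
    indegree = {name: len(deps) for name, deps in steps.items()}
    dependents = {name: [] for name in steps}
    for name, deps in steps.items():
        for dep in deps:
            if dep not in steps:
                raise ValueError(f"unknown dependency {dep!r} in step {name!r}")
            dependents[dep].append(name)

    layers = []
    ready = sorted(name for name, degree in indegree.items() if degree == 0)
    while ready:
        layers.append(ready)
        next_ready = []
        for name in ready:
            for child in dependents[name]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = sorted(next_ready)

    # Steps that never reach indegree 0 are part of a cycle.
    if sum(len(layer) for layer in layers) != len(steps):
        raise ValueError("cycle detected in pipeline dependencies")
    return layers


# Hypothetical registry: service name -> callable(params, inputs) -> output.
DISPATCH = {}


def register_service(name, handler):
    """Register a handler so new services need no if/elif branch."""
    DISPATCH[name] = handler


def run_pipeline(pipeline, max_workers=4):
    """Run each layer's steps concurrently; layers run in order."""
    deps = {name: step.get("depends_on", []) for name, step in pipeline.items()}
    outputs = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for layer in topological_layers(deps):
            futures = {}
            for name in layer:
                step = pipeline[name]
                handler = DISPATCH[step["service"]]
                inputs = [outputs[dep] for dep in deps[name]]
                futures[name] = pool.submit(handler, step.get("params", {}), inputs)
            for name, future in futures.items():
                outputs[name] = future.result()  # re-raises handler errors
    return outputs


# Toy handlers standing in for real microservice calls.
register_service("extract", lambda params, inputs: list(params["rows"]))
register_service("double", lambda params, inputs: [x * 2 for x in inputs[0]])
register_service("join", lambda params, inputs: inputs[0] + inputs[1])

demo_pipeline = {
    "a": {"service": "extract", "params": {"rows": [1, 2]}},
    "b": {"service": "double", "depends_on": ["a"]},
    "c": {"service": "extract", "params": {"rows": [9]}},
    "d": {"service": "join", "depends_on": ["b", "c"]},
}
print(run_pipeline(demo_pipeline)["d"])  # [2, 4, 9]
```

In the toy pipeline, steps `a` and `c` share the first layer and run concurrently, while `d` waits for both of its `depends_on` inputs, mirroring the two-input `join_datasets` case.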
@@ -356,6 +389,8 @@ Files stored at `/app/data/<dataset_name>/xcom/<step>_<timestamp>_<uuid>.arrow`.
- `services_config` — test service URL configuration fixture
### Important: sys.path Constraint
@@ -475,7 +517,9 @@ make lint # ruff linter
## 10. Adding a New Microservice (Step-by-Step)
-
1. **Create directory:** `services/<name>/` with `Dockerfile`, `requirements.txt`, `run.py`, `app/__init__.py`, `app/routes.py`, `app/<logic>.py`
+
A ready-to-copy scaffolding template is available in `templates/new_service/` with placeholder syntax. A comprehensive developer guide is in `docs/extending.md`.
+
+
1. **Create directory:** Copy `templates/new_service/` to `services/<name>/` and replace placeholders (`{{SERVICE_NAME}}`, `{{SERVICE_PORT}}`, etc.).
2. **Logic module:** Pure function: takes `pyarrow.Table` + params, returns `pyarrow.Table`. No Flask imports.
3. **Routes:** Follow the endpoint pattern (section 5). Include REQUEST/SUCCESS/ERROR Prometheus counters. Include `/health` endpoint.
4. **Dockerfile:** Copy `common/`, install deps, set `PYTHONPATH=/app/services`, use gunicorn CMD with HEALTHCHECK.
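
The copy-and-replace step for `templates/new_service/` could be automated along these lines. This is a rough sketch, assuming placeholders use the literal `{{KEY}}` form shown above; the `scaffold_service` helper, the stand-in template file, and the port value are hypothetical and not part of the repository.

```python
import shutil
import tempfile
from pathlib import Path


def scaffold_service(template_dir, target_dir, values):
    """Copy a template tree and substitute {{KEY}} placeholders in text files.

    Placeholders without an entry in `values` are left untouched.
    """
    template_dir, target_dir = Path(template_dir), Path(target_dir)
    if target_dir.exists():
        raise FileExistsError(f"{target_dir} already exists")
    shutil.copytree(template_dir, target_dir)
    for path in sorted(target_dir.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # leave binary files as copied
        for key, value in values.items():
            text = text.replace("{{" + key + "}}", str(value))
        path.write_text(text, encoding="utf-8")
    return target_dir


# Demo against a throwaway stand-in template (not the real template layout).
with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "templates" / "new_service"
    template.mkdir(parents=True)
    (template / "run.py").write_text(
        'SERVICE = "{{SERVICE_NAME}}"\nPORT = {{SERVICE_PORT}}\n', encoding="utf-8"
    )
    created = scaffold_service(
        template,
        Path(tmp) / "services" / "dedupe",
        {"SERVICE_NAME": "dedupe", "SERVICE_PORT": 5012},
    )
    print((created / "run.py").read_text(encoding="utf-8"))
```

Refusing to overwrite an existing `services/<name>/` directory keeps an accidental rerun from clobbering a service that has already been edited.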
@@ -563,11 +607,11 @@ These are hard-won insights from building and debugging the platform. They shoul
### Observability Lessons
-
15. **Centralized utilities eliminate boilerplate.** The `common/service_utils.py` module extracted ~800 lines of duplicated code (Prometheus counters, /health, /metrics, X-Params parsing, metadata writing) from 11 services into shared functions. This reduced each `routes.py` by ~60–80 lines and ensures consistent behavior across all services.
+
16. **Centralized utilities eliminate boilerplate.** The `common/service_utils.py` module extracted duplicated code (Prometheus counters, /health, /metrics, X-Params parsing, metadata writing) from 11 services into shared functions. This ensures consistent behavior across all services.
-
16. **Correlation ID must be generated at the edge.** The Preparator SDK generates a UUID `correlation_id` on construction and includes it in every HTTP request (`X-Correlation-ID` header). Services read or generate it via `get_correlation_id()`. This enables end-to-end tracing of a pipeline request across all service logs.
+
17. **Correlation ID must be generated at the edge.** The Preparator SDK generates a UUID `correlation_id` on construction and includes it in every HTTP request (`X-Correlation-ID` header). Services read or generate it via `get_correlation_id()`. This enables end-to-end tracing of a pipeline request across all service logs.
-
17. **Structured logging must be opt-in at startup.** Using `configure_service_logging()` in `create_app()` (not at module level) prevents duplicate handlers when Flask reloads. The `JSONFormatter` outputs single-line JSON with timestamp, level, service, message, correlation_id, and dataset_name.
+
18. **Structured logging must be opt-in at startup.** Using `configure_service_logging()` in `create_app()` (not at module level) prevents duplicate handlers when Flask reloads. The `JSONFormatter` outputs single-line JSON with timestamp, level, service, message, correlation_id, and dataset_name.
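
The correlation ID and structured logging lessons above can be illustrated with the standard library alone. The names `JSONFormatter`, `configure_service_logging()`, and `get_correlation_id()` come from the text, but the signatures and bodies below are assumptions made for this sketch. In particular, `get_correlation_id` here takes a plain headers mapping instead of a Flask request, and the plain `dict` lookup is case-sensitive, unlike real HTTP headers.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    """Render each log record as one line of JSON."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
            # Optional fields arrive via logger.info(..., extra={...}).
            "correlation_id": getattr(record, "correlation_id", None),
            "dataset_name": getattr(record, "dataset_name", None),
        }
        return json.dumps(payload)


def configure_service_logging(service, logger_name="etl"):
    """Attach the JSON handler once, even if called on every app reload."""
    logger = logging.getLogger(logger_name)
    if not any(isinstance(h.formatter, JSONFormatter) for h in logger.handlers):
        handler = logging.StreamHandler()
        handler.setFormatter(JSONFormatter(service))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid a second copy via the root logger
    return logger


def get_correlation_id(headers):
    """Reuse the caller's ID when present, otherwise generate a new UUID."""
    return headers.get("X-Correlation-ID") or str(uuid.uuid4())


log = configure_service_logging("clean_nan")
log = configure_service_logging("clean_nan")  # second call adds no handler
cid = get_correlation_id({"X-Correlation-ID": "demo-123"})
log.info("step finished", extra={"correlation_id": cid, "dataset_name": "hr"})
```

Passing the ID through `extra` keeps the formatter stateless; a service that handles concurrent requests would need some per-request mechanism to supply it, which this sketch leaves out.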