Commit c64cd35

Merge branch 'codex/fix-remaining-issues-and-raise-pr' into codex/implement-kubeflow-for-ml-pipeline-orchestration
2 parents 69b1d52 + 8a2fb3a commit c64cd35

10 files changed

Lines changed: 232 additions & 0 deletions

backend/README.md

Lines changed: 11 additions & 0 deletions
@@ -14,6 +14,17 @@ pip install -r backend/requirements.txt
 
 ## Running the Backend
 
+### Core API Layer
+
+- **FastAPI** - High-performance Python API framework
+- **Uvicorn** - ASGI server for FastAPI
+- **Gunicorn** - Production process manager
+- **NGINX** - Reverse proxy, load balancing, and rate limiting
+
+### Optional Alternative
+
+- **Django** - Use when a full admin experience and enterprise-grade auth system are required
+
 ### Option 1: CLI Testing (Phase 1)
 
 Test pipeline execution from command line:

pipeline/README.md

Lines changed: 30 additions & 0 deletions
@@ -8,6 +8,9 @@ A production-ready pipeline automation system built with:
 - **PostgreSQL** - Persistence
 - **AI Safety Module** - Failure prediction & anomaly handling
 - **BentoML + Feast + Kubeflow** - End-to-end model infrastructure
+- **Prometheus + Grafana** - Metrics collection and dashboards
+- **ELK Stack (Elasticsearch, Logstash, Kibana)** - Centralized logging
+- **Sentry** - Error monitoring and tracing
 
 ## Architecture Overview
 
@@ -115,6 +118,10 @@ docker-compose up -d
 | FastAPI Docs | http://localhost:8000/api/docs | - |
 | PostgreSQL | localhost:5432 | airflow / airflow |
 | Redis | localhost:6379 | - |
+| Prometheus | http://localhost:9090 | - |
+| Grafana | http://localhost:3000 | admin / admin |
+| Kibana | http://localhost:5601 | - |
+| Elasticsearch | http://localhost:9200 | - |
 
 ### 4. Create Your First Pipeline
 
@@ -193,6 +200,26 @@ curl -X POST http://localhost:8000/api/executions/pipeline-xxx/execute
 | GET | `/metrics` | Dashboard metrics |
 | GET | `/insights` | AI insights |
 
+
+## Observability & Monitoring
+
+### Metrics (Prometheus)
+- Backend exposes Prometheus metrics at `/metrics` via `prometheus-fastapi-instrumentator`.
+- Prometheus scrapes `backend:8000/metrics` every 15 seconds.
+
+### Dashboards (Grafana)
+- Grafana runs on port `3000` and can connect to Prometheus (`http://prometheus:9090`) as a data source.
+- Default credentials are `admin/admin` (change in production).
+
+### Centralized Logging (ELK Stack)
+- Elasticsearch stores indexed logs.
+- Logstash listens on `5000` (TCP JSON) and `5044` (beats) and forwards to Elasticsearch.
+- Kibana provides visualization for indices like `flexiroaster-backend-*`.
+
+### Error Monitoring (Sentry)
+- Configure `SENTRY_DSN` to enable Sentry for FastAPI exception capture and tracing.
+- Optional tuning: `SENTRY_ENVIRONMENT`, `SENTRY_TRACES_SAMPLE_RATE`, and `SENTRY_PROFILES_SAMPLE_RATE`.
+
 ## Configuration
 
 ### Environment Variables
@@ -205,6 +232,9 @@ curl -X POST http://localhost:8000/api/executions/pipeline-xxx/execute
 | `EXECUTOR_STAGE_TIMEOUT` | `120` | Stage timeout in seconds |
 | `AI_BLOCK_HIGH_RISK` | `false` | Block high-risk executions |
 | `AI_RISK_THRESHOLD_HIGH` | `0.7` | High risk threshold |
+| `SENTRY_DSN` | `""` | Enables Sentry when set |
+| `SENTRY_ENVIRONMENT` | `development` | Sentry environment label |
+| `SENTRY_TRACES_SAMPLE_RATE` | `0.1` | Fraction of traced requests |
 
 ### Airflow Variables

pipeline/backend/config.py

Lines changed: 8 additions & 0 deletions
@@ -96,6 +96,14 @@ class Settings(BaseSettings):
     LOG_FORMAT: str = "json"  # "json" or "text"
     LOG_FILE: Optional[str] = None
 
+    # ===================
+    # Observability Settings
+    # ===================
+    SENTRY_DSN: str = ""
+    SENTRY_ENVIRONMENT: str = "development"
+    SENTRY_TRACES_SAMPLE_RATE: float = 0.1
+    SENTRY_PROFILES_SAMPLE_RATE: float = 0.1
+
     # ===================
     # Airflow Integration
     # ===================
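
Both Sentry sample rates are fractions between 0.0 and 1.0, and in the Compose setup they arrive as environment strings. The sketch below shows one way such a value could be validated before it is used. `parse_sample_rate` is a hypothetical helper for illustration and is not part of this commit; the fallback of `0.1` simply mirrors the defaults in the settings above.

```python
import math


def parse_sample_rate(raw: str, default: float = 0.1) -> float:
    """Parse a sample-rate string and clamp it to the range [0.0, 1.0].

    Hypothetical helper: falls back to `default` when the value is empty,
    not numeric, or not a finite number.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default
    if not math.isfinite(value):
        return default
    # Clamp out-of-range values instead of rejecting them.
    return min(1.0, max(0.0, value))
```

Clamping rather than raising is a design choice for this sketch: a mistyped rate degrades to a safe bound instead of preventing the backend from starting.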

pipeline/backend/main.py

Lines changed: 5 additions & 0 deletions
@@ -18,6 +18,8 @@
 from core.redis_state import redis_state_manager
 from core.executor import pipeline_executor
 from api.routes import ai_automation, executions, health, model_infra, monitoring, pipelines
+from api.routes import pipelines, executions, health, monitoring, ai_automation
+from observability import setup_observability
 
 
 # ===================
@@ -133,6 +135,9 @@ async def lifespan(app: FastAPI):
     allow_headers=["*"],
 )
 
+# Configure metrics and error monitoring
+setup_observability(app)
+
 
 # ===================
 # Exception Handlers

pipeline/backend/observability.py

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+"""Observability setup for metrics and error monitoring."""
+import logging
+
+import sentry_sdk
+from fastapi import FastAPI
+from prometheus_fastapi_instrumentator import Instrumentator
+from sentry_sdk.integrations.asgi import SentryAsgiMiddleware
+
+from config import settings
+
+logger = logging.getLogger(__name__)
+
+
+def setup_observability(app: FastAPI) -> None:
+    """Configure Prometheus metrics and Sentry error monitoring."""
+    Instrumentator(
+        should_group_status_codes=False,
+        should_ignore_untemplated=True,
+        should_respect_env_var=False,
+        should_instrument_requests_inprogress=True,
+        excluded_handlers=["/metrics", "/health"],
+        env_var_name="ENABLE_METRICS",
+        inprogress_name="flexiroaster_inprogress",
+        inprogress_labels=True,
+    ).instrument(app).expose(app, include_in_schema=False)
+
+    if not settings.SENTRY_DSN:
+        logger.info("Sentry disabled - SENTRY_DSN not configured")
+        return
+
+    sentry_sdk.init(
+        dsn=settings.SENTRY_DSN,
+        environment=settings.SENTRY_ENVIRONMENT,
+        release=f"{settings.APP_NAME}@{settings.APP_VERSION}",
+        traces_sample_rate=settings.SENTRY_TRACES_SAMPLE_RATE,
+        profiles_sample_rate=settings.SENTRY_PROFILES_SAMPLE_RATE,
+    )
+    app.add_middleware(SentryAsgiMiddleware)
+    logger.info("Sentry error monitoring enabled")

pipeline/backend/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -30,6 +30,8 @@ pyyaml==6.0.1
 httpx==0.25.2
 structlog==23.2.0
 tenacity==8.2.3
+sentry-sdk[fastapi]==2.19.2
+prometheus-fastapi-instrumentator==7.0.0
 
 # Async support
 anyio==4.1.0

pipeline/docker-compose.yml

Lines changed: 88 additions & 0 deletions
@@ -94,6 +94,12 @@ services:
 
       # Logging
       LOG_LEVEL: INFO
+
+      # Error Monitoring
+      SENTRY_DSN: ${SENTRY_DSN:-}
+      SENTRY_ENVIRONMENT: ${SENTRY_ENVIRONMENT:-local}
+      SENTRY_TRACES_SAMPLE_RATE: ${SENTRY_TRACES_SAMPLE_RATE:-0.1}
+      SENTRY_PROFILES_SAMPLE_RATE: ${SENTRY_PROFILES_SAMPLE_RATE:-0.1}
     ports:
       - "8000:8000"
     depends_on:
@@ -198,9 +204,91 @@ services:
       - -c
       - airflow
 
+  # Prometheus metrics store
+  prometheus:
+    image: prom/prometheus:v2.54.1
+    container_name: flexiroaster-prometheus
+    command:
+      - "--config.file=/etc/prometheus/prometheus.yml"
+    volumes:
+      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
+      - prometheus-data:/prometheus
+    ports:
+      - "9090:9090"
+    depends_on:
+      backend:
+        condition: service_healthy
+    restart: always
+
+  # Grafana dashboards
+  grafana:
+    image: grafana/grafana:11.2.2
+    container_name: flexiroaster-grafana
+    ports:
+      - "3000:3000"
+    environment:
+      GF_SECURITY_ADMIN_USER: admin
+      GF_SECURITY_ADMIN_PASSWORD: admin
+      GF_USERS_ALLOW_SIGN_UP: "false"
+    volumes:
+      - grafana-data:/var/lib/grafana
+    depends_on:
+      - prometheus
+    restart: always
+
+  # ELK Stack - Elasticsearch
+  elasticsearch:
+    image: docker.elastic.co/elasticsearch/elasticsearch:8.15.1
+    container_name: flexiroaster-elasticsearch
+    environment:
+      ES_JAVA_OPTS: "-Xms512m -Xmx512m"
+      discovery.type: single-node
+      xpack.security.enabled: "false"
+    volumes:
+      - ./monitoring/elasticsearch/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml:ro
+      - elasticsearch-data:/usr/share/elasticsearch/data
+    ports:
+      - "9200:9200"
+    healthcheck:
+      test: ["CMD-SHELL", "curl -fsS http://localhost:9200/_cluster/health || exit 1"]
+      interval: 30s
+      timeout: 10s
+      retries: 10
+    restart: always
+
+  # ELK Stack - Logstash
+  logstash:
+    image: docker.elastic.co/logstash/logstash:8.15.1
+    container_name: flexiroaster-logstash
+    volumes:
+      - ./monitoring/logstash/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
+    ports:
+      - "5000:5000"
+      - "5044:5044"
+    depends_on:
+      elasticsearch:
+        condition: service_healthy
+    restart: always
+
+  # ELK Stack - Kibana
+  kibana:
+    image: docker.elastic.co/kibana/kibana:8.15.1
+    container_name: flexiroaster-kibana
+    environment:
+      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
+    ports:
+      - "5601:5601"
+    depends_on:
+      elasticsearch:
+        condition: service_healthy
+    restart: always
+
 volumes:
   postgres-data:
   redis-data:
+  prometheus-data:
+  grafana-data:
+  elasticsearch-data:
 
 networks:
   default:
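
After `docker-compose up -d`, it can be useful to confirm that the new observability services actually answer on their published ports. The following is a minimal sketch, not part of this commit. The host ports are taken from the Compose changes above; the health paths for Prometheus, Grafana, and Kibana are assumptions based on each project's documented health endpoints, and `is_up` / `check_stack` are hypothetical helper names.

```python
import urllib.error
import urllib.request

# Ports come from the Compose file; the paths (other than the
# Elasticsearch one, which the healthcheck already uses) are assumptions.
HEALTH_URLS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
    "elasticsearch": "http://localhost:9200/_cluster/health",
    "kibana": "http://localhost:5601/api/status",
}


def is_up(url: str, timeout: float = 3.0, opener=urllib.request.urlopen) -> bool:
    """Return True when the URL answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def check_stack(urls=HEALTH_URLS, opener=urllib.request.urlopen) -> dict:
    """Map each service name to a boolean reachability result."""
    return {name: is_up(url, opener=opener) for name, url in urls.items()}
```

Calling `check_stack()` from a shell on the Docker host returns a dictionary such as `{"prometheus": True, ...}`. The `opener` parameter exists only so the check can be exercised without a running stack.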
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+cluster.name: flexiroaster-observability
+node.name: flexiroaster-es01
+network.host: 0.0.0.0
+discovery.type: single-node
+xpack.security.enabled: false
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+input {
+  beats {
+    port => 5044
+  }
+
+  tcp {
+    port => 5000
+    codec => json
+  }
+}
+
+filter {
+  if [service] == "flexiroaster-backend" {
+    mutate {
+      add_field => { "[@metadata][target_index]" => "flexiroaster-backend-%{+YYYY.MM.dd}" }
+    }
+  } else {
+    mutate {
+      add_field => { "[@metadata][target_index]" => "flexiroaster-logs-%{+YYYY.MM.dd}" }
+    }
+  }
+}
+
+output {
+  elasticsearch {
+    hosts => ["http://elasticsearch:9200"]
+    index => "%{[@metadata][target_index]}"
+  }
+
+  stdout { codec => rubydebug }
+}
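
The filter above routes events by their `service` field, so a client has to set that field for its logs to land in the `flexiroaster-backend-*` indices. The sketch below illustrates that contract from the sending side. It is not part of this commit: `target_index`, `encode_event`, and `send_event` are hypothetical names, the `message` field is an assumed payload shape, and it assumes the `tcp` input accepts newline-terminated JSON documents on port `5000`.

```python
import json
import socket
from datetime import datetime, timezone


def target_index(event: dict, timestamp: datetime) -> str:
    """Mirror the filter block: pick the index prefix from the `service` field."""
    if event.get("service") == "flexiroaster-backend":
        prefix = "flexiroaster-backend"
    else:
        prefix = "flexiroaster-logs"
    # Logstash renders %{+YYYY.MM.dd} from the event's @timestamp in UTC.
    return f"{prefix}-{timestamp.astimezone(timezone.utc):%Y.%m.%d}"


def encode_event(message: str, service: str = "flexiroaster-backend") -> bytes:
    """Serialize one log event as a newline-terminated JSON document."""
    return (json.dumps({"service": service, "message": message}) + "\n").encode("utf-8")


def send_event(payload: bytes, host: str = "localhost", port: int = 5000) -> None:
    """Write one encoded event to the Logstash TCP input."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload)
```

For example, `send_event(encode_event("pipeline started"))` would submit one event, and `target_index` predicts which daily index it should appear under in Kibana.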
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+global:
+  scrape_interval: 15s
+  evaluation_interval: 15s
+
+scrape_configs:
+  - job_name: 'flexiroaster-backend'
+    metrics_path: /metrics
+    static_configs:
+      - targets: ['backend:8000']
+
+  - job_name: 'prometheus'
+    static_configs:
+      - targets: ['prometheus:9090']
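
Prometheus scrapes the backend's `/metrics` endpoint as plain text, one sample per line. As a rough illustration of what the scraper reads, here is a simplified parser for that text format. It is a sketch, not part of this commit: `parse_samples` is a hypothetical helper, it assumes samples carry no trailing timestamps, and the sample payload is invented for illustration (only the `flexiroaster_inprogress` name comes from the instrumentation settings in this commit).

```python
def parse_samples(text: str) -> dict:
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified: skips comment lines and assumes no trailing timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and # HELP / # TYPE lines
        # The value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Invented sample payload; real output depends on the routes being served.
SAMPLE = """\
# HELP http_requests_total Total number of requests.
# TYPE http_requests_total counter
http_requests_total{handler="/api/pipelines",method="GET",status="200"} 12.0
flexiroaster_inprogress{handler="/api/pipelines",method="GET"} 1.0
"""
```

Feeding the body of `http://localhost:8000/metrics` into `parse_samples` gives a quick way to check that the in-progress gauge and request counters are being exported before wiring up Grafana panels.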
