Full Thesis Manuscript: The final thesis manuscript (PDF and LaTeX source) is available in the ./docs/thesis directory.
Modern distributed systems, composed of microservices and heterogeneous components, are increasingly complex and prone to performance anomalies that can compromise reliability and service quality. Traditional centralized monitoring requires shipping raw telemetry to a central server, which creates bandwidth bottlenecks and data privacy risks.
This framework proposes an end-to-end Site Reliability Engineering (SRE) architecture integrating real-time edge detection via Long Short-Term Memory Autoencoders (LSTM-AE), federated learning (FedAvg) for collaborative model training without centralizing raw telemetry, causal graph-based root cause analysis (RCA), and adaptive Q-learning thresholding. By combining localized anomaly detection at the edge with global model aggregation, the framework enhances detection accuracy while preserving data privacy through Differential Privacy (DP-SGD).
- Decentralized Inference: Pushed sub-second anomaly detection directly to edge nodes, eliminating the need to transmit high-resolution raw telemetry to a central server.
- Privacy-Preserving Aggregation: Implemented Federated Learning secured by Differential Privacy (DP-SGD), achieving global model consensus exclusively through encrypted gradient sharing.
- Automated Root Cause Tracing: Developed a deterministic PageRank-based causal analyzer that traverses dynamic microservice dependency graphs to isolate the origin of cascading failures.
- Adaptive Alerting: Replaced static SLO thresholds with a Reinforcement Learning (Q-learning) agent that dynamically balances precision and recall, reducing alert volume by 94.4% (from 502 to 28 alerts) in the reported evaluation.
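The adaptive alerting idea can be illustrated with a small tabular Q-learning loop. This is a minimal sketch, not the thesis implementation: the threshold grid, the three actions (lower, keep, raise), the cost-based reward, and all hyperparameters are assumptions chosen for illustration.

```python
import random

# Hypothetical grid of candidate thresholds on a normalized anomaly score.
THRESHOLDS = [i / 10 for i in range(1, 10)]  # 0.1 .. 0.9
ACTIONS = (-1, 0, 1)                         # lower, keep, raise the threshold


def reward(threshold, scores, labels, fp_cost=1.0, fn_cost=5.0):
    """Negative alerting cost of `threshold` on one labelled batch (assumed costs)."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return -(fp_cost * fp + fn_cost * fn)


def train_threshold_agent(batches, episodes=200, alpha=0.1, gamma=0.9,
                          epsilon=0.2, seed=0):
    """Tabular Q-learning where the state is the index of the current threshold."""
    rng = random.Random(seed)
    n = len(THRESHOLDS)
    q = [[0.0] * len(ACTIONS) for _ in range(n)]
    for _ in range(episodes):
        state = rng.randrange(n)
        for scores, labels in batches:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
            nxt = min(n - 1, max(0, state + ACTIONS[a]))
            r = reward(THRESHOLDS[nxt], scores, labels)
            # Standard Q-learning update.
            q[state][a] += alpha * (r + gamma * max(q[nxt]) - q[state][a])
            state = nxt
    return q


def greedy_threshold(q, start=0, steps=20):
    """Follow the greedy policy for a few steps and return the threshold reached."""
    state = start
    for _ in range(steps):
        a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        state = min(len(THRESHOLDS) - 1, max(0, state + ACTIONS[a]))
    return THRESHOLDS[state]


if __name__ == "__main__":
    # Toy batches of (anomaly scores, ground-truth labels); purely illustrative.
    demo = [([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]),
            ([0.3, 0.15, 0.7, 0.95], [0, 0, 1, 1])]
    print(greedy_threshold(train_threshold_agent(demo)))
```

The asymmetric costs (`fn_cost` larger than `fp_cost`) are one way to encode the precision-recall trade-off; a real deployment would derive the reward from operator feedback or labelled incidents.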
flowchart TD
subgraph Edge Layer ["Edge Nodes (Microservices)"]
E1[Node 1 <br> LSTM-AE]
E2[Node 2 <br> LSTM-AE]
E3[Node N <br> LSTM-AE]
end
subgraph FL Layer ["Federated Aggregation"]
FC[Federated Coordinator <br> FedAvg + DP-SGD]
end
subgraph Central Control Plane ["Central Coordination & Telemetry"]
RCA[Root Cause Analyzer <br> Causal Graph PageRank]
RL[Threshold Optimizer <br> Q-Learning]
Prom[SRE Dashboard <br> Prometheus & Grafana]
end
%% Federated Learning Flow
E1 -- Encrypted Gradients --> FC
E2 -- Encrypted Gradients --> FC
E3 -- Encrypted Gradients --> FC
FC -. Global Weights .-> E1
FC -. Global Weights .-> E2
FC -. Global Weights .-> E3
%% Telemetry and Alerting Flow
E1 ==>|Anomaly Alerts & Traces| RCA
E2 ==>|Anomaly Alerts & Traces| RCA
E3 ==>|Anomaly Alerts & Traces| RCA
RCA --> RL
RL -. Dynamic Thresholds .-> E1
RCA --> Prom
FC --> Prom
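The federated flow in the diagram (edge nodes send updates to the coordinator, which returns global weights) can be sketched as a single FedAvg round with DP-style clipping and Gaussian noise. This is a simplified illustration under stated assumptions: updates are treated as flat lists of weight deltas, `clip_norm` stands for the clipping norm C, `sigma` for the noise multiplier σ, and aggregation is weighted by a hypothetical per-client sample count. Transport encryption and privacy accounting are out of scope here.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale `update` so that its L2 norm is at most `clip_norm` (C)."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def privatize(update, clip_norm, sigma, rng):
    """Clip, then add N(0, sigma^2 * C^2) noise to every coordinate."""
    clipped = clip_update(update, clip_norm)
    return [v + rng.gauss(0.0, sigma * clip_norm) for v in clipped]


def fedavg(global_weights, client_updates, client_sizes):
    """Apply the sample-weighted average of client updates to the global weights."""
    total = sum(client_sizes)
    new_weights = list(global_weights)
    for update, n_k in zip(client_updates, client_sizes):
        for i, v in enumerate(update):
            new_weights[i] += (n_k / total) * v
    return new_weights


if __name__ == "__main__":
    rng = random.Random(0)
    raw_updates = [[3.0, 4.0], [0.3, 0.4]]  # toy updates from two clients
    noisy = [privatize(u, clip_norm=1.0, sigma=0.5, rng=rng) for u in raw_updates]
    print(fedavg([0.0, 0.0], noisy, client_sizes=[100, 300]))
```

Clipping before adding noise bounds each client's influence on the aggregate, which is what lets the noise scale be expressed relative to C.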
The framework relies on the integration of the following domains:
- Time-Series Forecasting: Unsupervised reconstruction of 38-dimensional telemetry using an LSTM-Autoencoder.
- Federated Optimization: The global model objective is minimized using the FedAvg algorithm over K clients.
- Differential Privacy: Client updates are bounded by an L2 clipping norm C and obscured via Gaussian noise N(0, σ²C²).
- Causal Tracing: Utilizing Breadth-First Search (BFS) and PageRank over directed acyclic dependency graphs to calculate impact probability scores.
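The causal tracing item can be illustrated with a small, dependency-free sketch: BFS collects the services reachable from the alerting one, and a power-iteration PageRank over the anomalous subset yields scores that sum to 1. The edge direction (caller to callee), the filtering by anomalous services, and the function names are assumptions for illustration; the repository's stack lists NetworkX, whose `networkx.pagerank` would normally replace the hand-written loop.

```python
from collections import deque


def reachable(graph, start):
    """BFS over dependency edges (caller -> callee) starting at `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def pagerank(graph, nodes, damping=0.85, iterations=100):
    """Power-iteration PageRank restricted to `nodes`; scores sum to 1."""
    nodes = sorted(nodes)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            out = [d for d in graph.get(n, ()) if d in rank]
            if out:
                share = damping * rank[n] / len(out)
                for d in out:
                    nxt[d] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for m in nodes:
                    nxt[m] += damping * rank[n] / len(nodes)
        rank = nxt
    return rank


def rank_root_causes(graph, alerting_service, anomalous):
    """Rank anomalous services reachable from the alerting service."""
    candidates = reachable(graph, alerting_service) & set(anomalous)
    candidates.add(alerting_service)
    scores = pagerank(graph, candidates)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical service dependency graph.
    deps = {"gateway": ["orders", "users"], "orders": ["db"],
            "users": ["db"], "db": []}
    for service, score in rank_root_causes(deps, "gateway", {"gateway", "orders", "db"}):
        print(service, round(score, 3))
```

Because rank flows along dependency edges, it accumulates at the deepest anomalous dependency, which matches the intuition that a failing downstream service is the likelier origin of a cascade.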
Extensive benchmarking was performed across the Server Machine Dataset (SMD), Train-Ticket, and Alibaba Cluster Traces.
| Metric | Result | Context |
|---|---|---|
| Best Detection Fidelity | F1: 0.839, AUC-ROC: 0.950 | Config C ( |
| Statistical Variance | F1: | Multi-seed evaluation confirmed stable AUC-ROC ( |
| Privacy Phase Transition | Critical Drop at | F1 drops sharply from 0.859 to 0.235 at |
| Alert Volume Reduction | 94.4% Decrease | Q-learning adaptive thresholding reduced alert volume from 502 (static) to 28 alerts, achieving an FPR of just 0.0155. |
| Bandwidth Compression | 80%+ Decrease | Sending Zstd-compressed torch.save binaries instead of JSON reduced federated communication overhead. |
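The alert-volume figure in the table follows from the two reported alert counts. As a quick check of the arithmetic:

$$
\frac{502 - 28}{502} = \frac{474}{502} \approx 0.944
$$

That is, roughly a 94.4% reduction relative to the static threshold.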
State-of-the-Art (SOTA) Competitiveness: The results demonstrate that federated LSTM-AE configurations achieve an
| Domain | Core Technologies |
|---|---|
| Deep Learning & FL | PyTorch, NumPy, Scikit-Learn |
| Graph Algorithms | NetworkX |
| Telemetry & Observability | Prometheus, Grafana, psutil |
| Infrastructure Deployment | Docker, Docker Compose, Kubernetes, Terraform (AWS) |
This framework demonstrates the viability of fully decentralizing anomaly detection in massive, heterogeneous microservice architectures. By pushing intelligence to the edge and aggregating knowledge via Federated Learning, the system achieves detection fidelity matching centralized state-of-the-art architectures without compromising data privacy or saturating network bandwidth. Furthermore, the integration of Differential Privacy reveals a critical privacy-utility threshold, emphasizing the necessity of rigorous hyperparameter tuning (
To facilitate peer review, a fully self-contained Dockerized demonstration is provided. It includes a live Prometheus exporter executing real-time inference on the SMD dataset and a comprehensive 4-dashboard Grafana SRE suite.
Refer to the Demonstration Guide, or start the demo directly:
cd demo
docker-compose up -d
source ../venv/bin/activate
python3 prometheus_exporter.py
(The Grafana interface will automatically bind to http://localhost:3000)
If you use this framework or its findings in your research, please cite the manuscript:
@mastersthesis{moghioros2026federated,
author = {Moghioroș, Eric},
title = {A Privacy-Preserving, Federated Anomaly Detection Framework for Distributed Microservice Architectures},
school = {Babeș-Bolyai University of Cluj-Napoca},
year = {2026},
type = {Bachelor's Thesis}
}