
🧠 P4 In-Network Aggregation for Federated Learning

Accelerating Federated Learning by moving gradient aggregation directly into programmable network switches using P4.

This project demonstrates the concept of In-Network Intelligence, where the gradient aggregation phase of Federated Learning (FL) is executed directly inside a programmable switch instead of a centralized server.

By performing the aggregation within the network data plane, we reduce network congestion, cut the server's aggregation workload, and lower end-to-end training latency, following the principles of the SwitchML architecture.


🌐 Overview

Federated Learning (FL) allows multiple distributed clients to collaboratively train a machine learning model without sharing their raw data.

In traditional FL systems:

  1. Each worker trains locally.
  2. Workers send gradients to a Parameter Server.
  3. The server aggregates the gradients (see the sketch below).
  4. The updated model is broadcast back to workers.
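
Step 3 is typically an element-wise average of the workers' gradient vectors (FedAvg-style). A minimal sketch, with illustrative names and values, not code from this project:

```python
# Element-wise averaging of one gradient vector per worker (FedAvg-style).
def aggregate(gradient_vectors):
    num_workers = len(gradient_vectors)
    return [sum(column) / num_workers for column in zip(*gradient_vectors)]

grads = [[0.1, -0.2, 0.3, 0.0],   # worker 1
         [0.2, -0.1, 0.1, 0.4],   # worker 2
         [0.0,  0.3, 0.2, 0.2]]   # worker 3
print(aggregate(grads))  # ~[0.1, 0.0, 0.2, 0.2], up to float rounding
```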

However, this architecture creates a severe network bottleneck when model sizes become large.

This project solves that problem by moving gradient aggregation into the network switch itself using P4.


🚨 The Problem

In traditional Federated Learning:

  • Each worker sends large gradient vectors to the server.
  • The server must wait for N packets from N workers.
  • It then sums the gradients and redistributes the result.

Issues

❌ Heavy network congestion
❌ Server CPU overhead
❌ High latency
❌ NIC bandwidth saturation

For modern neural networks containing millions of parameters, this approach becomes inefficient.


💡 The Solution

Using programmable switches with P4, we intercept worker packets and perform gradient aggregation inside the network switch.

Workflow

  1. Workers compute gradients.
  2. Gradients are quantized and packed into custom packets.
  3. Packets pass through the P4 programmable switch.
  4. The switch:
    • Reads the gradient values
    • Aggregates them using stateful registers
    • Drops intermediate packets
  5. After receiving contributions from all workers, the switch sends only ONE aggregated packet to the server.

Result

Traditional FL
Workers → Switch → Server
N packets received

In-Network Aggregation
Workers → P4 Switch (Aggregation) → Server
1 packet received

This dramatically reduces network load and improves efficiency.


🛠 Tech Stack

| Technology         | Purpose                              |
| ------------------ | ------------------------------------ |
| P4 (P4_16)         | Data plane programming language      |
| BMv2 simple_switch | Software P4 switch target            |
| Mininet            | Virtual network topology             |
| Scapy (Python)     | Custom packet creation & sniffing    |
| Docker             | Containerized networking environment |

⚠️ Handling P4 Limitations

1️⃣ No Floating Point Support

P4 switches support integer ALU operations only, but neural network weights are floating point.

Solution

Workers quantize gradients:

quantized_weight = int(float_weight * 1000)
  • Converted to 32-bit signed integers
  • Sent inside the packet
  • Server dequantizes after aggregation (see the sketch below)
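
A minimal sketch of both directions of the scheme. The `SCALE` factor comes from the formula above; dividing by the number of workers on the server side is an assumption about what "dequantizes" includes:

```python
SCALE = 1000  # fixed-point scale factor from the quantization step above

def quantize(weight: float) -> int:
    """Worker side: map a float gradient to a 32-bit signed integer."""
    return int(weight * SCALE)

def dequantize(aggregated: int, num_workers: int = 3) -> float:
    """Server side: undo the scaling. Dividing by num_workers to average
    is an assumption -- the server may instead keep the raw sum."""
    return aggregated / SCALE / num_workers

assert quantize(0.5) == 500
assert dequantize(quantize(0.5) * 3) == 0.5  # three identical contributions
```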

2️⃣ Stateful Memory Management

Switches must maintain running sums across multiple packets.

Solution

P4 register arrays store partial sums:

registers[param_index] += incoming_weight

This allows the switch to maintain persistent aggregation state.
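
Note that the line above is pseudocode: P4 registers are accessed with an explicit read-modify-write, not a single `+=`. A minimal Python model of the switch's per-packet logic, where all names and the reset-after-N-contributions policy are assumptions drawn from the workflow described earlier:

```python
# A minimal Python model of the switch's stateful logic -- NOT P4 code.
NUM_WORKERS = 3
WEIGHTS_PER_PACKET = 4

sum_reg = {}    # param_index -> running sums (models a P4 register array)
count_reg = {}  # param_index -> number of contributions seen so far

def on_packet(param_index, weights):
    """Read-modify-write, as the switch's register actions would do."""
    sums = sum_reg.setdefault(param_index, [0] * WEIGHTS_PER_PACKET)
    for i, w in enumerate(weights):
        sums[i] += w
    count_reg[param_index] = count_reg.get(param_index, 0) + 1
    if count_reg[param_index] == NUM_WORKERS:
        count_reg[param_index] = 0
        sum_reg[param_index] = [0] * WEIGHTS_PER_PACKET
        return sums   # forward ONE aggregated packet to the server
    return None       # drop the intermediate packet
```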


🏗 Project Architecture

Custom FL Packet Header

The custom header fl_update_t is encapsulated inside:

Ethernet → IPv4 → UDP → FL Header

It contains:

| Field          | Description                        |
| -------------- | ---------------------------------- |
| worker_id      | ID of the worker sending gradients |
| param_index    | Index of the parameter chunk       |
| w1, w2, w3, w4 | Quantized gradient values          |
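
A Scapy mirror of this header might look as follows. The class name, the 32-bit field widths, and the UDP port binding are assumptions based on the table above and the server section below:

```python
from scapy.all import Packet, bind_layers
from scapy.fields import IntField, SignedIntField
from scapy.layers.inet import UDP

class FLUpdate(Packet):
    """Python-side view of the fl_update_t header (widths assumed 32-bit)."""
    name = "fl_update"
    fields_desc = [
        IntField("worker_id", 0),
        IntField("param_index", 0),
        SignedIntField("w1", 0),
        SignedIntField("w2", 0),
        SignedIntField("w3", 0),
        SignedIntField("w4", 0),
    ]

# Let Scapy dissect the header automatically on the server's port.
bind_layers(UDP, FLUpdate, dport=5555)
```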

Components

👨‍💻 Workers (h1, h2, h3)

  • Generate random gradient vectors
  • Quantize weights
  • Send packets using Scapy (see the sketch below)
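
A hedged sketch of one worker transmission, reusing the FLUpdate class sketched above. The server address, interface name, and gradient vector size are illustrative, not taken from this project's code:

```python
import random
from scapy.all import Ether, sendp
from scapy.layers.inet import IP, UDP

SCALE = 1000

def send_update(worker_id: int, param_index: int, iface: str = "h1-eth0"):
    grads = [random.uniform(-1.0, 1.0) for _ in range(4)]  # random gradients
    q = [int(g * SCALE) for g in grads]                    # quantize to ints
    pkt = (Ether() / IP(dst="10.0.0.4") / UDP(dport=5555) /
           FLUpdate(worker_id=worker_id, param_index=param_index,
                    w1=q[0], w2=q[1], w3=q[2], w4=q[3]))
    sendp(pkt, iface=iface, verbose=False)

send_update(worker_id=1, param_index=0)
```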

🔀 P4 Switch (s1)

Responsibilities:

  • Parse custom packet header
  • Read current weight sums
  • Perform integer addition
  • Maintain aggregation state
  • Forward only the final aggregated packet

🖥 Parameter Server (h4)

  • Listens on UDP port 5555
  • Receives aggregated packet
  • Dequantizes weights
  • Uses them for model update (see the sketch below)
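
A hedged sketch of the server loop. It assumes the FLUpdate class and port binding from the packet-header section, and that dequantization includes averaging over workers (an assumption):

```python
from scapy.all import sniff

SCALE, NUM_WORKERS = 1000, 3

def handle(pkt):
    if FLUpdate in pkt:  # relies on bind_layers(UDP, FLUpdate, dport=5555)
        upd = pkt[FLUpdate]
        weights = [upd.w1, upd.w2, upd.w3, upd.w4]
        print(f"Chunk {upd.param_index} ->",
              [w / SCALE / NUM_WORKERS for w in weights])

sniff(iface="h4-eth0", filter="udp port 5555", prn=handle)
```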

🌍 Network Topology

        h1 (Worker)
           \
        h2 (Worker) ---- s1 (P4 Switch) ---- h4 (Server)
           /
        h3 (Worker)

Workers send gradients → switch aggregates → server receives single packet.
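
For illustration only, the star shape above as a plain Mininet topology. The real run_mininet.py attaches a BMv2 simple_switch; the default switch is used here purely to show the shape:

```python
from mininet.net import Mininet
from mininet.topo import Topo

class FLTopo(Topo):
    """Three workers and one server in a star around a single switch."""
    def build(self):
        s1 = self.addSwitch("s1")
        for host in ("h1", "h2", "h3", "h4"):
            self.addLink(self.addHost(host), s1)

if __name__ == "__main__":
    net = Mininet(topo=FLTopo())
    net.start()
    net.pingAll()  # sanity-check connectivity before running workers
    net.stop()
```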


🐳 How to Run (Using Docker)

The entire environment is fully containerized.

No need to install P4, Mininet, or BMv2 locally.


1️⃣ Build and Start Container

docker-compose up -d --build

2️⃣ Start the Network Topology

Compile the P4 program and start Mininet.

docker exec -it p4_fl_aggregator_container python3 run_mininet.py

⚠️ Keep this terminal open.


3️⃣ Start the Parameter Server

Open a new terminal:

docker exec -it p4_fl_aggregator_container mx h4 python3 client/server.py

4️⃣ Start Worker 1

Open another terminal:

docker exec -it p4_fl_aggregator_container mx h1 python3 client/worker.py --id 1

5️⃣ Start Workers 2 & 3

Open another terminal:

docker exec -it p4_fl_aggregator_container /bin/bash -c "mx h2 python3 client/worker.py --id 2 & mx h3 python3 client/worker.py --id 3"

6️⃣ Execute Workers

The worker scripts wait for a keypress before transmitting. Press Enter in the Worker 1 terminal, then immediately press Enter in the Worker 2/3 terminal.

This simulates concurrent gradient updates.


🎯 Expected Output

On the Parameter Server terminal, you will observe that although each worker sent its own packet into the switch, the switch aggregates the gradients and the server receives only a single aggregated packet per parameter chunk.

Example:

Received aggregated gradients:
Chunk 1 → [10234, -3422, 8765, 2231]
Chunk 2 → [4521, 2234, -1234, 7642]

This confirms that the aggregation occurred inside the P4 switch.


⚡ Key Advantages

✔ Reduces network congestion
✔ Minimizes server computation load
✔ Enables line-rate aggregation
✔ Demonstrates in-network computing
✔ Inspired by SwitchML architecture


🔮 Future Improvements

  • Support more workers
  • Implement vectorized gradient aggregation
  • Integrate with real ML frameworks
  • Deploy on hardware programmable switches
  • Add secure aggregation mechanisms

📚 Inspiration

This project is inspired by the research system:

SwitchML: Scaling Distributed Machine Learning with In-Network Aggregation (Sapio et al., NSDI 2021)


👨‍💻 Author

Sadvik Kumar
B.Tech CSE (AI & ML)


⭐ If you found this project interesting, consider starring the repository!