Skip to content

Latest commit

 

History

History
557 lines (486 loc) · 36.5 KB

File metadata and controls

557 lines (486 loc) · 36.5 KB

PhishNChips - System Architecture Diagram (Simplified)

1. High-Level System Overview

┌──────────────────────────────────────────────────────────────────┐
│                        CLIENT LAYER                              │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐         │
│  │   Browser    │   │   Browser    │   │   Browser    │         │
│  │  Extension   │   │  Extension   │   │  Extension   │         │
│  │  (US/EU/ASIA)│   │              │   │              │         │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘         │
│         │ POST /report     │                  │                 │
│         │ {url, region,    │                  │                 │
│         │  timestamp}      │                  │                 │
└─────────┼──────────────────┼──────────────────┼─────────────────┘
          └──────────────────┴──────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│                     APPLICATION LAYER                            │
│                  ┌────────────────────┐                          │
│                  │  FastAPI Backend   │                          │
│                  │  • Validation      │                          │
│                  │  • ES Client       │                          │
│                  │  • Sync Routes     │                          │
│                  └─────────┬──────────┘                          │
└────────────────────────────┼─────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│              DISTRIBUTED DATABASE LAYER                          │
│           Elasticsearch Cluster (3-Node)                         │
│                                                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │ NODE 1          │  │ NODE 2          │  │ NODE 3          │ │
│  │ es-us-east      │  │ es-eu-central   │  │ es-asia-south   │ │
│  │                 │  │                 │  │                 │ │
│  │ Shard 0 (P) ────┼──┼→Shard 0 (R) ────┼──┼→Shard 0 (R)     │ │
│  │ Shard 1 (R) ←───┼──┼─Shard 1 (P) ────┼──┼→Shard 1 (R)     │ │
│  │ Shard 2 (R)     │  │ Shard 2 (R) ←───┼──┼─Shard 2 (P)     │ │
│  │                 │  │                 │  │                 │ │
│  │                 │  │                 │  │                 │ │
│  │                 │  │                 │  │                 │ │    
│  └─────────────────┘  └─────────────────┘  └─────────────────┘ │
│                                                                  │
│  Legend: (P)=Primary  (R)=Replica  ──→ Replication              │
└──────────────────────┬───────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│                   VISUALIZATION LAYER                            │
│                  ┌────────────────────┐                          │
│                  │  Kibana Dashboard  │                          │
│                  │  • Threat Heatmap  │                          │
│                  │  • Risk Timeline   │                          │
│                  │  • Regional Stats  │                          │
│                  └────────────────────┘                          │
└──────────────────────────────────────────────────────────────────┘

2. Detailed Component Breakdown

2.1 Client Layer Components

┌─────────────────────────────────────────────────────────────────┐
│                    Browser Extension (Client)                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────┐      ┌───────────────┐      ┌────────────┐  │
│  │  content.js   │      │ background.js │      │  popup.js  │  │
│  │               │      │               │      │            │  │
│  │ • DOM analyze │      │ • URL monitor │      │ • UI logic │  │
│  │ • Form detect │─────►│ • Risk calc   │◄─────│ • Manual   │  │
│  │ • Pattern     │      │ • API send    │      │   report   │  │
│  │   matching    │      │ • Storage     │      │            │  │
│  └───────────────┘      └───────┬───────┘      └────────────┘  │
│                                 │                               │
│                                 │ JSON Payload:                 │
│                                 │ {                             │
│                                 │   "url": "http://...",        │
│                                 │   "region": "US",             │
│                                 │   "timestamp": "2025-11-01"   │
│                                 │ }                             │
│                                 │                               │
│                                 ▼                               │
│                         HTTP POST Request                       │
│                         http://localhost:8000/report            │
└─────────────────────────────────────────────────────────────────┘

2.2 Application Layer Components

┌─────────────────────────────────────────────────────────────────────┐
│                     FastAPI Application Server                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │                    Request Pipeline                        │    │
│  │                                                            │    │
│  │  1. HTTP Request ──► 2. Pydantic ──► 3. Business ──► 4. ES│    │
│  │     Received          Validation      Logic          Client│    │
│  │                                                            │    │
│  │     ┌──────────┐      ┌──────────┐    ┌──────────┐   ┌───┴──┐ │
│  │     │ CORS     │      │ Input    │    │ Region   │   │ ES   │ │
│  │     │ Middleware─────►│ Sanitize ├───►│ Validate ├──►│Index │ │
│  │     │          │      │          │    │          │   │ API  │ │
│  │     └──────────┘      └──────────┘    └──────────┘   └──────┘ │
│  │                                                            │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │              Elasticsearch Python Client                   │    │
│  │                                                            │    │
│  │  • Connection Pool (max 10 connections)                   │    │
│  │  • Automatic retry logic (3 attempts)                     │    │
│  │  • Request timeout: 30s                                   │    │
│  │  • Sniffing: Discover all cluster nodes                   │    │
│  │  • Round-robin load balancing                             │    │
│  │                                                            │    │
│  │  Connections to:                                          │    │
│  │  ├─► es-us-east:9200                                      │    │
│  │  ├─► es-eu-central:9201                                   │    │
│  │  └─► es-asia-south:9202                                   │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                                                     │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │                    API Endpoints                           │    │
│  │                                                            │    │
│  │  POST /report      → Index new phishing report            │    │
│  │  GET  /threats     → Retrieve recent threats              │    │
│  │  GET  /hotspots    → Regional aggregation stats           │    │
│  │  GET  /cluster/health → Cluster status                    │    │
│  │  GET  /stats       → Index statistics                     │    │
│  └────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

2.3 Distributed Database Layer - Node Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                    Single Elasticsearch Node Architecture                │
│                          (Replicated 3x in Cluster)                      │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────┐     │
│  │                    HTTP/Transport Layer                        │     │
│  │  • HTTP REST API (Port 9200)                                   │     │
│  │  • Transport Protocol (Port 9300) - Inter-node communication   │     │
│  └────────────────┬───────────────────────────────────────────────┘     │
│                   │                                                      │
│                   ▼                                                      │
│  ┌────────────────────────────────────────────────────────────────┐     │
│  │                    Cluster Service                             │     │
│  │                                                                │     │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐     │     │
│  │  │ Master       │  │ Discovery    │  │ Cluster State    │     │     │
│  │  │ Election     │  │ (Zen2)       │  │ Management       │     │     │
│  │  │              │  │              │  │                  │     │     │
│  │  │ • Quorum vote│  │ • Ping nodes │  │ • Routing table  │     │     │
│  │  │ • Leader     │  │ • Join req   │  │ • Index metadata │     │     │
│  │  │   selection  │  │ • Heartbeat  │  │ • Shard location │     │     │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘     │     │
│  └────────────────────────────────────────────────────────────────┘     │
│                   │                                                      │
│                   ▼                                                      │
│  ┌────────────────────────────────────────────────────────────────┐     │
│  │                    Index/Search Service                        │     │
│  │                                                                │     │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐     │     │
│  │  │ Index Engine │  │ Search Engine│  │ Aggregation      │     │     │
│  │  │              │  │              │  │ Engine           │     │     │
│  │  │ • Doc parsing│  │ • Query parse│  │ • Bucket agg     │     │     │
│  │  │ • Analysis   │  │ • Score calc │  │ • Metric agg     │     │     │
│  │  │ • Indexing   │  │ • Result rank│  │ • Pipeline agg   │     │     │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘     │     │
│  └────────────────────────────────────────────────────────────────┘     │
│                   │                                                      │
│                   ▼                                                      │
│  ┌────────────────────────────────────────────────────────────────┐     │
│  │                    Shard Management                            │     │
│  │                                                                │     │
│  │  Primary Shards (3):  ┌─────────┐  ┌─────────┐  ┌─────────┐  │     │
│  │                       │ Shard 0 │  │ Shard 1 │  │ Shard 2 │  │     │
│  │                       │ (5GB)   │  │ (5GB)   │  │ (5GB)   │  │     │
│  │                       └────┬────┘  └────┬────┘  └────┬────┘  │     │
│  │                            │            │            │        │     │
│  │  Replica Shards (3):       ▼            ▼            ▼        │     │
│  │                       ┌─────────┐  ┌─────────┐  ┌─────────┐  │     │
│  │                       │Replica 0│  │Replica 1│  │Replica 2│  │     │
│  │                       │ (5GB)   │  │ (5GB)   │  │ (5GB)   │  │     │
│  │                       └─────────┘  └─────────┘  └─────────┘  │     │
│  │                                                                │     │
│  │  Shard Allocation:                                            │     │
│  │  • Node 1: Shard 0 (P), Shard 1 (R), Shard 2 (R)             │     │
│  │  • Node 2: Shard 0 (R), Shard 1 (P), Shard 2 (R)             │     │
│  │  • Node 3: Shard 0 (R), Shard 1 (R), Shard 2 (P)             │     │
│  └────────────────────────────────────────────────────────────────┘     │
│                   │                                                      │
│                   ▼                                                      │
│  ┌────────────────────────────────────────────────────────────────┐     │
│  │                    Storage Layer (Lucene)                      │     │
│  │                                                                │     │
│  │  ┌────────────────────────────────────────────────────────┐   │     │
│  │  │                    Translog (WAL)                      │   │     │
│  │  │  • Write-ahead log for crash recovery                 │   │     │
│  │  │  • Fsync policy: per request (durability=request)     │   │     │
│  │  │  • Size: 512MB before flush                           │   │     │
│  │  │  • Location: /usr/share/elasticsearch/data/translog   │   │     │
│  │  └────────────────────────────────────────────────────────┘   │     │
│  │                                                                │     │
│  │  ┌────────────────────────────────────────────────────────┐   │     │
│  │  │              Lucene Index Segments                     │   │     │
│  │  │                                                        │   │     │
│  │  │  Segment 1  ┌─────────┐  ┌──────────┐  ┌──────────┐  │   │     │
│  │  │  (100MB)    │Documents│  │Inverted  │  │Doc Values│  │   │     │
│  │  │             │ Store   │  │Index     │  │          │  │   │     │
│  │  │             └─────────┘  └──────────┘  └──────────┘  │   │     │
│  │  │                                                        │   │     │
│  │  │  Segment 2  ┌─────────┐  ┌──────────┐  ┌──────────┐  │   │     │
│  │  │  (200MB)    │Documents│  │Inverted  │  │Doc Values│  │   │     │
│  │  │             │ Store   │  │Index     │  │          │  │   │     │
│  │  │             └─────────┘  └──────────┘  └──────────┘  │   │     │
│  │  │                                                        │   │     │
│  │  │  • Immutable once written                             │   │     │
│  │  │  • Merged periodically by background threads          │   │     │
│  │  │  • Checksum validation on read                        │   │     │
│  │  └────────────────────────────────────────────────────────┘   │     │
│  │                                                                │     │
│  │  Docker Volume: es-XX-data (Persistent)                       │     │
│  └────────────────────────────────────────────────────────────────┘     │
└──────────────────────────────────────────────────────────────────────────┘

3. System Interaction Flows

3.1 Write Request Flow

Browser Extension → FastAPI → Coordinating Node → Primary Shard → Replica Shard

1. Browser sends: {url, region, timestamp}
2. FastAPI validates request (Pydantic)
3. Coordinator routes to Primary Shard
   • Hash-based routing: hash(doc_id) % 3
4. Primary Shard:
   • Writes to Translog (WAL)
   • Indexes in memory
   • Replicates to Replica Shard
5. Replica Shard:
   • Writes to Translog
   • Indexes in memory
   • Sends ACK to Primary
6. Primary checks quorum (2/2 = success)
7. Returns success to client
8. Background: Refresh every 1s (makes data searchable)

┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ Browser  │────►│ FastAPI  │────►│ Primary  │────►│ Replica  │
│          │     │          │     │  Shard   │     │  Shard   │
│          │◄────│          │◄────│          │◄────│ (ACK)    │
└──────────┘     └──────────┘     └──────────┘     └──────────┘

Key Features: Hash Routing, Sync Replication, Quorum Writes, Translog

3.2 Read Request Flow

Kibana/FastAPI → Coordinator → Scatter to All Shards → Gather Results

1. Kibana sends query request
2. Coordinator parses query
3. SCATTER: Query broadcast to all 3 shards (parallel)
   • Adaptive replica selection (least loaded)
4. Each shard:
   • Scans inverted index
   • Scores and ranks results
   • Returns top K (IDs + scores)
5. GATHER: Coordinator merges results
   • Global sort by score
   • Apply pagination
6. FETCH: Get full documents (if needed)
7. Return to client

                    ┌──────────────┐
                    │ Coordinator  │
                    └───┬──────────┘
                        │ Scatter
         ┌──────────────┼──────────────┐
         ▼              ▼              ▼
    ┌────────┐     ┌────────┐     ┌────────┐
    │Shard 0 │     │Shard 1 │     │Shard 2 │
    │(Query) │     │(Query) │     │(Query) │
    └───┬────┘     └───┬────┘     └───┬────┘
        │ Gather       │              │
        └──────────────┴──────────────┘
                       │
                  ┌────▼────┐
                  │  Merge  │
                  │ Results │
                  └─────────┘

Key Features: Scatter-Gather, Adaptive Replica Selection, Parallel Execution

3.3 Aggregation Flow (Regional Statistics)

GET /hotspots → Aggregation by region with avg risk_score

1. Coordinator sends aggregation query to all shards
2. Each shard computes PARTIAL aggregations using Doc Values (column store)
   • Shard 0: US(count:2, sum:1.7), EU(count:1, sum:0.4)
   • Shard 1: US(count:1, sum:0.95), EU(count:1, sum:0.3), ASIA(count:1, sum:0.7)
   • Shard 2: US(count:2, sum:1.45), ASIA(count:1, sum:0.55)
3. Coordinator MERGES partial aggregations:
   • US: avg = (2+1+2)/(1.7+0.95+1.45) = 0.82
   • EU: avg = (1+1)/(0.4+0.3) = 0.35
   • ASIA: avg = (1+1)/(0.7+0.55) = 0.625
4. Return final aggregated results

Benefits: Uses Doc Values (fast, no full document load), Distributed computation

Key Features: Doc Values, Distributed Aggregation, Partial Results Merging

3.4 Replication Flow

Synchronous Primary-Replica Replication (RF=1)

Primary Shard → Replica Shard (synchronous, <50ms)

1. Primary receives write
2. Primary writes to translog (fsync)
3. Primary replicates to Replica
4. Replica writes to translog (fsync)
5. Replica sends ACK to Primary
6. Primary checks quorum (2/2 nodes)
7. Primary returns success to client

Timeline:
T0:        Write arrives at Primary
T0+20ms:   Translog fsync complete
T0+30ms:   Replica receives & writes
T0+50ms:   ACK back to Primary
T0+60ms:   Success to client

Guarantees:
• Strong consistency (quorum writes)
• Zero data loss (both have translog)
• Replica ready for immediate promotion

Key Features: Synchronous Replication, Quorum, Translog (WAL)

3.5 Fault Tolerance & Recovery

Scenario: Node 1 (Master) Fails

Timeline:
─────────
T0:       Node 1 goes offline (has Shard 0 Primary)
T0+10s:   Nodes 2 & 3 detect missing heartbeat
T0+15s:   Master election (Zen Discovery)
          • Node 2 elected as new master (quorum: 2/3)
T0+20s:   Shard reallocation
          • Shard 0 Replica (on Node 2) promoted to Primary
          • New Replica allocated to Node 3
T0+25s:   Data sync (translog replay + segment copy)
T0+30s:   Cluster status: GREEN ✓

Recovery Process:
1. Failure Detection (heartbeat timeout)
2. Master Election (quorum-based, Zen Discovery)
3. Replica Promotion (immediate, already in-sync)
4. Shard Reallocation (create new replica)
5. Translog Replay (ensure 100% data integrity)

Result:
• Recovery Time: <30 seconds
• Data Loss: ZERO (translog ensures durability)
• Availability: 100% during recovery (reads from replicas)

Key Features: Zen Discovery, Automatic Failover, Translog Recovery

4. Distributed Features Summary

4.1 Sharding (Partitioning)

Configuration:
• 3 Primary Shards
• 1 Replica per shard
• Total: 6 shards (3P + 3R)

Routing: shard_id = hash(doc_id) % 3

Distribution:
Node 1: Shard 0(P), Shard 1(R), Shard 2(R)
Node 2: Shard 0(R), Shard 1(P), Shard 2(R)
Node 3: Shard 0(R), Shard 1(R), Shard 2(P)

Benefits:
✓ Uniform distribution (no hotspots)
✓ Deterministic routing
✓ Parallel writes (3x throughput)
✓ Horizontal scalability

4.2 Replication

Strategy: Synchronous Primary-Replica (RF=1)

Layers:
1. Shard Replication: Primary → Replica (sync)
2. Translog Replication: WAL persisted on both
3. Cluster State: Master broadcasts to all nodes
4. Segment Replication: Checksum-verified copies

Guarantees:
✓ Strong consistency (quorum writes)
✓ Zero data loss (translog durability)
✓ Immediate replica promotion
✓ Bit-for-bit identical replicas

4.3 Consistency

Mechanisms:
1. Quorum Writes: 2/2 nodes (Primary + Replica) must ACK
2. Primary Term: Prevents split-brain (term numbers)
3. Sequence Numbers: Ordered operations, gap detection
4. NRT Refresh: Data searchable within 1s (eventual)
5. Version Control: Optimistic concurrency (last-write-wins)

Consistency Level: Strong (writes), Eventual (reads, <1s)

5. Data Flow Summary

┌─────────────────────────────────────────────────────────────────┐
│                   COMPLETE DATA FLOW                            │
└─────────────────────────────────────────────────────────────────┘

INGESTION FLOW:
───────────────
Browser Extension ──► FastAPI Backend ──► ES Coordinating Node
                                              │
                            ┌─────────────────┼─────────────────┐
                            ▼                 ▼                 ▼
                       Primary Shard 0   Primary Shard 1   Primary Shard 2
                            │                 │                 │
                            ├────► Replica    ├────► Replica    ├────► Replica
                            ▼                 ▼                 ▼
                       Persistent        Persistent        Persistent
                       Storage (Vol)     Storage (Vol)     Storage (Vol)

QUERY FLOW:
───────────
Kibana Dashboard ──► ES Coordinating Node
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
   Query Shard 0       Query Shard 1       Query Shard 2
   (Any replica)       (Any replica)       (Any replica)
        │                   │                   │
        └───────────────────┴───────────────────┘
                            │
                    Merge Results
                            │
                            ▼
                      Return to Client

REPLICATION FLOW:
─────────────────
Primary Shard ──[Sync]──► Replica Shard
      │                         │
      ├─► Translog             ├─► Translog
      ├─► Memory Buffer         ├─► Memory Buffer
      ├─► Segment Flush         ├─► Segment Flush
      └─► Checksum Verify       └─► Checksum Verify

FAILURE RECOVERY:
─────────────────
Node Failure Detected ──► Master Election (if master down)
                            │
                            ▼
                    Update Cluster State
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
   Promote Replica     Reallocate Shards    Sync New Replica
   to Primary          to Healthy Nodes     from Primary
        │                   │                   │
        └───────────────────┴───────────────────┘
                            │
                    Cluster Health: GREEN ✓

6. Technology Stack Mapping

┌─────────────────────────────────────────────────────────────────┐
│  LAYER                │  TECHNOLOGY      │  PURPOSE             │
├───────────────────────┼──────────────────┼──────────────────────┤
│  Client               │  Chrome Extension│  URL detection       │
│                       │  JavaScript      │  User interaction    │
├───────────────────────┼──────────────────┼──────────────────────┤
│  API Gateway          │  FastAPI         │  REST endpoints      │
│                       │  Uvicorn         │  ASGI server         │
│                       │  Pydantic        │  Data validation     │
├───────────────────────┼──────────────────┼──────────────────────┤
│  Distributed Database │  Elasticsearch   │  Indexing & search   │
│                       │  Lucene          │  Storage engine      │
│                       │  Zen Discovery   │  Cluster coordination│
├───────────────────────┼──────────────────┼──────────────────────┤
│  Visualization        │  Kibana          │  Dashboards          │
│                       │  Vega/D3.js      │  Charts              │
├───────────────────────┼──────────────────┼──────────────────────┤
│  Infrastructure       │  Docker          │  Containerization    │
│                       │  Docker Compose  │  Orchestration       │
│                       │  Docker Volumes  │  Persistent storage  │
└─────────────────────────────────────────────────────────────────┘

This comprehensive diagram covers all major system components, their interactions, and distributed features as implemented in the PhishNChips platform.