This document provides a complete specification of the HOSER (student) and LM-TAD (teacher) model architectures used in knowledge distillation for trajectory prediction.
- Overview
- HOSER Student Architecture
- LM-TAD Teacher Architecture
- Parameter Count Breakdown
- Architecture Comparison
- Architecture Diagram
This research investigates knowledge distillation from a large transformer-based teacher model (LM-TAD) into a compact student model (HOSER) for real-time trajectory prediction. The student has about 96.7% fewer parameters than the teacher while maintaining competitive accuracy through distillation training.
Key Metrics:
- Student (HOSER): ~4.45M parameters, ~13 ms inference latency
- Teacher (LM-TAD): ~137M parameters, ~430 ms inference latency
- Compression Ratio: ~30.7× fewer parameters (136.79M / 4.45M)
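The compression figures follow directly from the two parameter totals. The snippet below is a minimal sketch of that arithmetic; `compression_stats` is a hypothetical helper name, and the totals are copied from the parameter tables in this document rather than read from a checkpoint.

```python
from typing import Tuple

# Totals copied from the parameter tables in this document.
STUDENT_PARAMS = 4_450_330
TEACHER_PARAMS = 136_790_016


def compression_stats(student: int, teacher: int) -> Tuple[float, float]:
    """Return (teacher-to-student ratio, percent reduction in parameters)."""
    ratio = teacher / student
    reduction_pct = 100.0 * (1.0 - student / teacher)
    return ratio, reduction_pct


ratio, reduction_pct = compression_stats(STUDENT_PARAMS, TEACHER_PARAMS)
print(f"ratio: {ratio:.1f}x, reduction: {reduction_pct:.1f}%")
```

Running the same function on counts taken from loaded models would be a more direct check than relying on the documented totals.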
The HOSER (Holistic Semantic Representation) model consists of three main components: Road Network Encoder, Trajectory Encoder, and Navigator.
The Road Network Encoder encodes the road network structure using learned embeddings and graph neural networks.
- Road ID Embedding: `nn.Embedding(40060, 64)`
  - Input: Road IDs (vocabulary size: 40,060 road segments)
  - Output: 64-dimensional embedding per road
  - Parameters: 2,563,840
- Road Attribute Encoders:
  - Length encoder: `nn.Linear(1, 16)` → 32 params (16 weights + 16 biases)
  - Type encoder: `nn.Embedding(10, 16)` → 160 params
  - Longitude encoder: `nn.Linear(1, 16)` → 32 params (16 weights + 16 biases)
  - Latitude encoder: `nn.Linear(1, 16)` → 32 params (16 weights + 16 biases)
  - Total road embedding dimension (ID + attributes): 64 + 16 + 16 + 16 + 16 = 128
- Zone ID Embedding: `nn.Embedding(300, 128)`
  - Input: Zone IDs (300 learned hierarchical zones)
  - Output: 128-dimensional embedding per zone
  - Parameters: 38,400
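As a quick check of the embedding sizes listed above, the counts can be reproduced with two small helpers. This is a minimal sketch in plain Python; `embedding_params` and `linear_params` are hypothetical helper names, and `linear_params` assumes PyTorch's default `bias=True`, which adds one bias term per output feature.

```python
def embedding_params(num_embeddings: int, dim: int) -> int:
    """An embedding table stores one dim-sized vector per index."""
    return num_embeddings * dim


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weight matrix (out x in) plus, optionally, one bias per output feature."""
    return in_features * out_features + (out_features if bias else 0)


road_id = embedding_params(40_060, 64)      # road ID table
road_type = embedding_params(10, 16)        # road type table
scalar_attrs = 3 * linear_params(1, 16)     # length, longitude, latitude encoders
zone_id = embedding_params(300, 128)        # zone ID table
road_dim = 64 + 4 * 16                      # ID embedding + four 16-d attribute encodings

print(road_id, road_type, scalar_attrs, zone_id, road_dim)
```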
- Road GAT:
  - Architecture: 2-layer Graph Attention Network (v2)
  - Input channels: 128
  - Hidden channels: 128
  - Output channels: 128
  - Edge feature dimension: 2 (intersection attributes)
  - Number of attention heads: PyG default (1 unless overridden)
  - Activation: LeakyReLU (implicit in GAT)
  - Parameters: ~65,792
- Zone GCN:
  - Architecture: 2-layer Graph Convolutional Network
  - Input channels: 128
  - Hidden channels: 128
  - Output channels: 128
  - Edge weights: Precomputed zone adjacency
  - Activation: ReLU (implicit in GCN)
  - Cached: True (for inference efficiency)
  - Parameters: ~65,792
Road Network Encoder Total Parameters: ~2,734,423
The Trajectory Encoder processes the historical trajectory sequence using temporal encoding and transformer blocks.
- Road-Zone Fusion MLP:
  - Layer 1: `nn.Linear(256, 64)` → 16,384 params
  - Activation: GELU
  - Layer 2: `nn.Linear(64, 1)` → 64 params
  - Fusion: `road_emb + sigmoid(MLP(concat(road_emb, zone_emb))) * zone_emb`
  - Output dimension: 128
- Time Embedding: `nn.Linear(1, 128)`
  - Input: Normalized timestamp (0-1)
  - Output: 128-dimensional temporal embedding
  - Activation: Cosine (applied after the linear layer: `cos(time_emb(t))`)
  - Parameters: 256 (128 weights + 128 biases)
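The gated fusion and the cosine time embedding described above can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions, not the HOSER code: `fuse` and `time_embedding` are hypothetical names, the scalar `gate_logit` stands in for the output of the two-layer fusion MLP, and toy 2-dimensional vectors replace the real 128-dimensional ones.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse(road_emb: List[float], zone_emb: List[float], gate_logit: float) -> List[float]:
    """road_emb + sigmoid(gate_logit) * zone_emb, element-wise.

    gate_logit stands in for MLP(concat(road_emb, zone_emb)), a single scalar.
    """
    gate = sigmoid(gate_logit)
    return [r + gate * z for r, z in zip(road_emb, zone_emb)]


def time_embedding(t: float, weights: List[float], biases: List[float]) -> List[float]:
    """cos(w_i * t + b_i) per output dimension: cosine applied after a Linear(1, d)."""
    return [math.cos(w * t + b) for w, b in zip(weights, biases)]


fused = fuse([1.0, 2.0], [0.5, -0.5], gate_logit=0.0)  # gate = sigmoid(0) = 0.5
temb = time_embedding(0.0, weights=[3.0, 1.0], biases=[0.0, math.pi])
print(fused, temb)
```

Because the gate is a single scalar per road, it scales the whole zone embedding up or down rather than selecting individual dimensions.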
- Configuration:
  - Number of layers: 2
  - Hidden dimension: 256 (128 spatial + 128 temporal)
  - Number of heads: 2
  - Head dimension: 128
  - Max sequence length: 1024
  - Dropout rate: 0.0 (disabled for optimal performance)
  - Gradient checkpointing: Disabled (sufficient VRAM)
- Per-Layer Architecture (2 layers total):
  - Causal Self-Attention Block:
    - Query/Key/Value projection: `nn.Linear(256, 768)` → 196,608 params (no bias)
      - Projects to 3× hidden_dim for Q, K, V
      - Calculation: 256 × 768 = 196,608 (`bias=False` as per code)
    - Output projection: `nn.Linear(256, 256)` → 65,536 params (no bias)
      - Calculation: 256 × 256 = 65,536
    - Relative position embeddings (K): `nn.Linear(2, 128)` → 256 params
      - Input: [distance, time_interval]
    - Relative position embeddings (V): `nn.Linear(2, 128)` → 256 params
    - Attention dropout: 0.0
    - Residual dropout: 0.0
    - Masking: Causal mask + sequence length mask
  - Feed-Forward Network:
    - Layer 1: `nn.Linear(256, 1024)` → 262,144 params
    - Activation: GELU
    - Layer 2: `nn.Linear(1024, 256)` → 262,144 params
    - Dropout: 0.0
  - Layer Normalization:
    - Pre-attention LayerNorm: `nn.LayerNorm(256)` → 512 params
    - Pre-FFN LayerNorm: `nn.LayerNorm(256)` → 512 params
- Per-Layer Parameters: 787,968
- Total Transformer Parameters (2 layers): 1,575,936
Trajectory Encoder Total Parameters: ~1,592,513
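The attention block combines a causal mask with a sequence length mask. One way to picture the combination is the boolean matrix below. This is an illustrative sketch only: `attention_mask` is a hypothetical helper, and it assumes that padding is handled by masking key positions beyond the valid length, which may differ from how the actual implementation builds its mask.

```python
from typing import List


def attention_mask(max_len: int, valid_len: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Causal part: j <= i (no looking ahead).
    Length part: j < valid_len (padding keys are never attended to).
    """
    return [[j <= i and j < valid_len for j in range(max_len)]
            for i in range(max_len)]


mask = attention_mask(max_len=4, valid_len=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Rows that correspond to padded query positions still produce outputs here; in practice those positions would simply be ignored downstream.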
The Navigator generates next-step predictions using attention-based routing over candidate roads.
- Distance Projection: `nn.Linear(1, 128)` → 256 params (128 weights + 128 biases)
  - Input: Euclidean distance from last position to candidate
  - Output: 128-dimensional distance embedding
- Angle Projection: `nn.Linear(1, 128)` → 256 params (128 weights + 128 biases)
  - Input: Bearing angle from last position to candidate
  - Output: 128-dimensional angle embedding
- Query Projection: `nn.Linear(384, 128)` → 49,152 params
  - Input: Concatenation of:
    - Trajectory embedding: 256 dims
    - Destination zone embedding: 128 dims
    - Total: 384 dims
  - No bias
- Key Projection: `nn.Linear(384, 128)` → 49,152 params
  - Input: Concatenation of:
    - Candidate road embedding: 128 dims
    - Distance projection: 128 dims
    - Angle projection: 128 dims
    - Total: 384 dims
  - No bias
- Value Projection: `nn.Linear(128, 1)` → 128 params
  - Computes attention score for each candidate
  - No bias
  - Activation: `tanh(Q + K)` before value projection
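The scoring rule above is a form of additive attention: each candidate's logit is the value projection applied to `tanh(Q + K)`. The sketch below shows the computation with toy 2-dimensional projections in place of the real 384 → 128 → 1 shapes. All names (`matvec`, `candidate_logits`, the weight arguments) are hypothetical and chosen for illustration; the weights are hand-picked, not learned.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row per output feature


def matvec(w: Matrix, x: Vector) -> Vector:
    """Bias-free linear projection: y = W x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def candidate_logits(query_in: Vector, key_ins: List[Vector],
                     w_q: Matrix, w_k: Matrix, w_v: Vector) -> Vector:
    """Additive scoring: logit_k = w_v . tanh(W_q q + W_k k_k) for each candidate k."""
    q = matvec(w_q, query_in)
    logits = []
    for key_in in key_ins:
        k = matvec(w_k, key_in)
        hidden = [math.tanh(qi + ki) for qi, ki in zip(q, k)]
        logits.append(sum(vi * hi for vi, hi in zip(w_v, hidden)))
    return logits


# Toy sizes (2-d inputs, 2-d hidden) instead of the real 384 -> 128 -> 1.
identity = [[1.0, 0.0], [0.0, 1.0]]
logits = candidate_logits(
    query_in=[0.0, 0.0],
    key_ins=[[0.0, 0.0], [1.0, 1.0]],
    w_q=identity, w_k=identity, w_v=[1.0, 1.0],
)
print(logits)
```

The query is projected once and reused for every candidate, so the cost grows linearly with the number of candidate roads.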
The Time Estimator predicts travel time to each candidate road.
- Trajectory Encoder: `nn.Linear(256, 64)` → 16,384 params
  - Input: Trajectory embedding (256 dims)
  - Activation: GELU
- Road Encoder: `nn.Linear(128, 64)` → 8,192 params
  - Input: Candidate road embedding (128 dims)
  - Activation: GELU
- Output Layer: `nn.Linear(128, 1)` → 128 params
  - Input: Concatenation of trajectory and road encodings (64 + 64 = 128)
  - Output: Predicted log-normalized travel time
Navigator Total Parameters: ~123,394
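The time estimator's forward pass (two GELU-activated projections, concatenation, then a linear output) can be sketched as follows. This is a toy illustration with 1-dimensional inputs and hand-picked weights; `project` and `estimate_log_time` are hypothetical names, and biases are omitted to mirror the weight-only counts listed above.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row per output feature


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def project(w: Matrix, x: Vector) -> Vector:
    """Linear projection followed by GELU."""
    return [gelu(sum(wi * xi for wi, xi in zip(row, x))) for row in w]


def estimate_log_time(traj_emb: Vector, road_emb: Vector,
                      w_traj: Matrix, w_road: Matrix, w_out: Vector) -> float:
    """Concatenate the two encodings and apply the output layer."""
    joint = project(w_traj, traj_emb) + project(w_road, road_emb)
    return sum(wo * h for wo, h in zip(w_out, joint))


log_time = estimate_log_time(
    traj_emb=[1.0], road_emb=[0.0],
    w_traj=[[1.0]], w_road=[[1.0]], w_out=[1.0, 1.0],
)
print(log_time)
```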
| Component | Parameters |
|---|---|
| Road Network Encoder | 2,734,423 |
| Trajectory Encoder | 1,592,513 |
| Navigator | 123,394 |
| Total | ~4,450,330 |
The LM-TAD (Language Model for Trajectory Anomaly Detection) model is a transformer-based autoregressive model operating on fine-grained grid cells.
- Model Type: Causal transformer (GPT-2 style)
- Number of Layers: 8
- Number of Heads: 12
- Embedding Dimension: 768
- Vocabulary Size: 51,663 (grid cell tokens)
- Block Size: 1024 (max sequence length)
- Dropout Rate: 0.1 (during training)
- Learning Rate: 0.0003
- Integer POE: False (uses learned positional embeddings)
- Bias: False (no bias terms in linear layers)
- Token Embedding: `nn.Embedding(51663, 768)`
  - Input: Grid cell token IDs
  - Output: 768-dimensional embedding
  - Parameters: 39,677,184
- Position Embedding: `nn.Embedding(1024, 768)`
  - Input: Position indices (0 to block_size-1)
  - Output: 768-dimensional position embedding
  - Parameters: 786,432
- Embedding Dropout:
  - Rate: 0.1
  - Applied after embedding sum
Embedding Total Parameters: 40,463,616
Per-Layer Architecture (8 layers total):
- Query/Key/Value Projection: `nn.Linear(768, 2304)`
  - Projects to 3× embedding dimension
  - Parameters: 1,769,472
  - No bias (`config.bias = False`)
- Attention Heads: 12
  - Head dimension: 768 / 12 = 64
  - Attention dropout: 0.1
- Output Projection: `nn.Linear(768, 768)`
  - Parameters: 589,824
  - No bias
- Residual Dropout: 0.1
Attention Parameters per Layer: 2,359,296
- Expansion Layer: `nn.Linear(768, 3072)`
  - 4× expansion ratio
  - Activation: GELU
  - Parameters: 2,359,296
  - No bias
- Projection Layer: `nn.Linear(3072, 768)`
  - Parameters: 2,359,296
  - No bias
- Dropout: 0.1
MLP Parameters per Layer: 4,718,592
- Pre-Attention LayerNorm: `nn.LayerNorm(768)`
  - Parameters: 1,536 (scale + bias)
- Pre-MLP LayerNorm: `nn.LayerNorm(768)`
  - Parameters: 1,536
LayerNorm Parameters per Layer: 3,072
Total Parameters per Transformer Layer: 7,080,960
Total Transformer Parameters (8 layers): 56,647,680
- Final LayerNorm: `nn.LayerNorm(768)`
  - Parameters: 1,536 (scale + bias)
- LM Head: `nn.Linear(768, 51663)`
  - Projects to vocabulary size for next-token prediction
  - Parameters: 39,677,184
  - No bias (`bias=False`)
  - Often weight-tied with token embedding (not counted separately if tied)
Output Layer Parameters: 1,536 (LayerNorm only; LM Head counted in total separately)
| Component | Parameters |
|---|---|
| Token Embedding | 39,677,184 |
| Position Embedding | 786,432 |
| Transformer Blocks (8×) | 56,647,680 |
| Final LayerNorm | 1,536 |
| LM Head | 39,677,184 |
| Total | ~136,790,016 |
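The teacher total in the table above can be reproduced from the configuration values alone, which also makes the effect of weight tying explicit. The function below is a minimal sketch: `gpt_param_count` is a hypothetical helper, it assumes bias-free linear layers, and it counts each LayerNorm with both scale and bias to match the per-layer figures in this document.

```python
def gpt_param_count(vocab_size: int, block_size: int, n_layer: int,
                    n_embd: int, tie_lm_head: bool = False) -> int:
    """Parameter count for a GPT-2 style model with bias-free linear layers."""
    token_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    attention = n_embd * 3 * n_embd + n_embd * n_embd   # QKV + output projection
    mlp = 2 * (n_embd * 4 * n_embd)                     # expansion + projection
    layer_norms = 2 * (2 * n_embd)                      # two LayerNorms, scale + bias
    blocks = n_layer * (attention + mlp + layer_norms)
    final_ln = 2 * n_embd
    lm_head = 0 if tie_lm_head else vocab_size * n_embd  # tied head reuses token_emb
    return token_emb + pos_emb + blocks + final_ln + lm_head


untied = gpt_param_count(51_663, 1024, 8, 768)
tied = gpt_param_count(51_663, 1024, 8, 768, tie_lm_head=True)
print(f"untied: {untied:,}  tied: {tied:,}")
```

If the LM head shares its weights with the token embedding, the number of unique parameters drops by the size of one embedding table, so the effective compression ratio depends on which convention is used.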
Road Network Encoder: ~2,734,423 (61.4%)
├── Road ID Embedding: 2,563,840
├── Type Embedding: 160
├── Attribute Linear Layers: 96
├── Zone ID Embedding: 38,400
├── Road GAT (2 layers): ~65,792
└── Zone GCN (2 layers): ~65,792
Trajectory Encoder: ~1,592,513 (35.8%)
├── Road-Zone Fusion MLP: 16,448
├── Temporal Encoder: 256
└── Transformer (2 layers): 1,575,936
    ├── Attention Blocks (incl. relative position embeddings): 525,312
    ├── FFN Blocks: 1,048,576
    └── LayerNorm: 2,048
Navigator: ~123,394 (2.8%)
├── Metric Projections: 512
├── Attention Projections: 98,432
└── Time Estimator: 24,704
──────────────────────────────────────────
TOTAL STUDENT PARAMETERS: ~4,450,330
Embeddings: 40,463,616 (29.6%)
├── Token Embedding: 39,677,184
└── Position Embedding: 786,432
Transformer Blocks: 56,647,680 (41.4%)
├── Attention (8 layers): 18,874,368
├── MLP (8 layers): 37,748,736
└── LayerNorm (16 layers): 24,576
Output: 39,678,720 (29.0%)
├── Final LayerNorm: 1,536
└── LM Head: 39,677,184
──────────────────────────────────────────
TOTAL TEACHER PARAMETERS: 136,790,016
| Metric | HOSER (Student) | LM-TAD (Teacher) | Ratio |
|---|---|---|---|
| Total Parameters | 4.45M | 136.79M | 30.7× |
| Embedding Dimension | 128 | 768 | 6.0× |
| Number of Layers | 2 | 8 | 4.0× |
| Attention Heads | 2 | 12 | 6.0× |
| Vocabulary Size | 40,060 (roads) | 51,663 (grid cells) | 1.3× |
| Inference Latency | ~13ms | ~430ms | 33.1× |
| Memory Footprint | ~17 MB | ~548 MB | 32.2× |
Compression Summary:
- Student has 96.7% fewer parameters than the teacher
- Student is ~33× faster than the teacher
- Student uses ~97% less memory than the teacher
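The memory and latency ratios can be sanity-checked with simple arithmetic. The sketch below assumes the footprint is dominated by weights stored as 32-bit floats (4 bytes each) and uses decimal megabytes; activations, buffers, and framework overhead are ignored, so the results are only expected to land near the figures in the table. `weight_megabytes` is a hypothetical helper name.

```python
def weight_megabytes(n_params: int, bytes_per_param: int = 4) -> float:
    """Size of the raw weights in decimal megabytes (1 MB = 1e6 bytes)."""
    return n_params * bytes_per_param / 1e6


student_mb = weight_megabytes(4_450_330)
teacher_mb = weight_megabytes(136_790_016)
memory_saving_pct = 100.0 * (1.0 - student_mb / teacher_mb)
speedup = 430.0 / 13.0  # latencies in ms, as quoted in this document

print(f"student: {student_mb:.1f} MB, teacher: {teacher_mb:.1f} MB")
print(f"memory saving: {memory_saving_pct:.1f}%, speedup: {speedup:.1f}x")
```

Storing weights in half precision would halve both footprints while leaving the ratio between them unchanged.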
| Aspect | HOSER (Student) | LM-TAD (Teacher) |
|---|---|---|
| Architecture Type | Hierarchical spatial + Transformer | Pure transformer |
| Input Representation | Road IDs + graph structure | Grid cell tokens |
| Spatial Reasoning | GAT/GCN on road network | Learned from sequence |
| Context Modeling | 2-layer causal attention | 8-layer causal attention |
| Position Encoding | Relative (distance + time) | Absolute learned |
| Activation Functions | GELU, LeakyReLU, tanh | GELU |
| Dropout | 0.0 (disabled) | 0.1 |
| Output | Next road + travel time | Next grid cell token |
| Training Objective | CrossEntropy + MAPE + KL | CrossEntropy (perplexity) |
| Deployment | Production-ready | Research/offline |
HOSER (Student):
- Optimized for real-time inference (<20ms latency)
- Explicit spatial reasoning through graph neural networks
- Hierarchical design: zones → roads → trajectories
- Minimal parameters for edge deployment
- Joint prediction of location + time
LM-TAD (Teacher):
- Designed for trajectory anomaly detection (not prediction)
- Learned spatial patterns from fine-grained grid representation
- Deep transformer for rich contextual modeling
- Optimized for detection accuracy over speed
- Outputs perplexity scores (repurposed for distillation)
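Perplexity, the score LM-TAD produces for anomaly detection, is the exponential of the average negative log-probability the model assigns to the observed tokens. The following sketch shows the standard definition on made-up probabilities; `perplexity` is a hypothetical helper and the numbers are not model outputs.

```python
import math
from typing import List


def perplexity(token_probs: List[float]) -> float:
    """exp of the mean negative log-probability of the observed tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


typical = perplexity([0.5, 0.5, 0.5])     # every step is fairly predictable
unusual = perplexity([0.5, 0.01, 0.5])    # one very unlikely step
print(typical, unusual)
```

A single improbable step raises the score sharply, which is why perplexity works as an anomaly signal. For distillation, the per-step probability distribution is the useful part, not the aggregated score.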
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#e8f5e9','primaryTextColor':'#1b5e20','primaryBorderColor':'#388e3c','lineColor':'#666','secondaryColor':'#e3f2fd','tertiaryColor':'#fff3e0'}}}%%
graph TB
subgraph Input["📥 INPUT"]
A1[Historical Trajectory<br/>Road IDs: T×1<br/>Temporal Info: T×1]
A2[Destination<br/>Road ID: 1]
A3[Candidate Roads<br/>IDs: K×1<br/>Distance: K×1<br/>Angle: K×1]
end
subgraph RoadNet["🗺️ ROAD NETWORK ENCODER"]
B1[Road ID Emb<br/>40060 → 64]
B2[Attributes<br/>len/type/lon/lat → 64]
B3[Road GAT<br/>2 layers, 128d]
B4[Zone ID Emb<br/>300 → 128]
B5[Zone GCN<br/>2 layers, 128d]
B1 --> B3
B2 --> B3
B4 --> B5
end
subgraph TrajEnc["🛣️ TRAJECTORY ENCODER"]
C1[Spatial Fusion<br/>Road + Zone → 128d]
C2[Temporal Encoding<br/>cos time_emb → 128d]
C3[Concat<br/>256d = 128 + 128]
C4[Transformer Block 1<br/>2 heads, 256d]
C5[Transformer Block 2<br/>2 heads, 256d]
C6[Trajectory Embedding<br/>T×256d]
C1 --> C3
C2 --> C3
C3 --> C4
C4 --> C5
C5 --> C6
end
subgraph Nav["🎯 NAVIGATOR"]
D1[Query<br/>Traj + Dest → 384d]
D2[Key<br/>Cand + Metrics → 384d]
D3["Attention<br/>tanh(Q + K) → K scores"]
D4[Time Estimator<br/>Traj + Cand → K times]
end
subgraph Output["📤 OUTPUT"]
E1[Next Road Logits<br/>K×1]
E2[Travel Time Pred<br/>K×1]
end
A1 --> B3
A1 --> C1
A2 --> B5
A2 --> D1
A3 --> B3
A3 --> D2
B3 --> C1
B5 --> C1
B5 --> D1
B3 --> D2
C6 --> D1
C6 --> D4
D1 --> D3
D2 --> D3
D3 --> E1
D4 --> E2
style Input fill:#e3f2fd
style RoadNet fill:#e8f5e9
style TrajEnc fill:#fff3e0
style Nav fill:#fce4ec
style Output fill:#f3e5f5
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#fce4ec','primaryTextColor':'#880e4f','primaryBorderColor':'#c2185b','lineColor':'#666'}}}%%
graph TB
subgraph Input["📥 INPUT"]
A[Grid Token Sequence<br/>T×1<br/>51,663 vocab]
end
subgraph Emb["🔤 EMBEDDINGS"]
B1[Token Embedding<br/>51663 → 768d]
B2[Position Embedding<br/>1024 → 768d]
B3[Sum + Dropout 0.1]
B1 --> B3
B2 --> B3
end
subgraph Trans["🔄 TRANSFORMER (8 layers)"]
C1[Layer 1<br/>12 heads, 768d]
C2[Layer 2<br/>12 heads, 768d]
C3[Layer 3<br/>12 heads, 768d]
C4[...]
C5[Layer 8<br/>12 heads, 768d]
C1 --> C2
C2 --> C3
C3 --> C4
C4 --> C5
end
subgraph Block["📦 TRANSFORMER BLOCK"]
D1[LayerNorm]
D2[Causal Attention<br/>12 heads × 64d]
D3[Residual + Dropout]
D4[LayerNorm]
D5[MLP<br/>768 → 3072 → 768<br/>GELU activation]
D6[Residual + Dropout]
D1 --> D2
D2 --> D3
D3 --> D4
D4 --> D5
D5 --> D6
end
subgraph Output["📤 OUTPUT"]
E1[Final LayerNorm<br/>768d]
E2[LM Head<br/>768 → 51663]
E3[Next Token Logits<br/>51663×1]
E4[Softmax<br/>Teacher Probs]
E1 --> E2
E2 --> E3
E3 --> E4
end
A --> B1
A --> B2
B3 --> C1
C5 --> E1
style Input fill:#e3f2fd
style Emb fill:#fff3e0
style Trans fill:#fce4ec
style Block fill:#f3e5f5
style Output fill:#e8f5e9
%%{init: {'theme':'base', 'themeVariables': {'primaryColor':'#e8f5e9','lineColor':'#666'}}}%%
graph LR
subgraph Student["🎓 HOSER (Student)<br/>4.45M params"]
S1[Trajectory Input]
S2[HOSER Model]
S3[Student Logits]
S1 --> S2
S2 --> S3
end
subgraph Teacher["👨🏫 LM-TAD (Teacher)<br/>136.8M params<br/>🔒 Frozen"]
T1[Grid Tokens<br/>via Mapping]
T2[LM-TAD Model]
T3[Teacher Probs<br/>Temperature τ=2.0]
T1 --> T2
T2 --> T3
end
subgraph Loss["📉 COMBINED LOSS"]
L1[CrossEntropy<br/>Hard Labels]
L2[MAPE<br/>Time Prediction]
L3[KL Divergence<br/>λ=0.01]
L4[Total Loss<br/>= CE + MAPE + λ·KL]
L1 --> L4
L2 --> L4
L3 --> L4
end
S1 -.->|"Road→Grid<br/>Mapping"| T1
S3 --> L1
S3 --> L2
S3 -.->|"Soft Targets"| L3
T3 -.->|"Soft Targets"| L3
L4 -.->|"Backprop<br/>Student Only"| S2
style Student fill:#e8f5e9
style Teacher fill:#fce4ec
style Loss fill:#fff3e0
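The combined loss in the diagram above (CE + MAPE + λ·KL with τ = 2.0 and λ = 0.01) can be written out for a single prediction step as follows. This is a sketch under several assumptions that the diagram does not settle: the teacher probabilities are taken to be already mapped from grid cells onto the candidate roads and renormalized, the student logits are softened with the same temperature, and the divergence is computed as KL(teacher || student). All function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(student_logits: List[float], teacher_probs: List[float],
                 target_idx: int, pred_time: float, true_time: float,
                 tau: float = 2.0, lam: float = 0.01) -> float:
    """CE on the hard label + MAPE on travel time + lam * KL on soft targets."""
    ce = -math.log(softmax(student_logits)[target_idx])
    mape = abs(pred_time - true_time) / abs(true_time)
    kl = kl_divergence(teacher_probs, softmax(student_logits, tau))
    return ce + mape + lam * kl


student_logits = [2.0, 0.0]
matching_teacher = softmax(student_logits, temperature=2.0)  # teacher agrees: KL = 0
loss_agree = distill_loss(student_logits, matching_teacher, 0, pred_time=11.0, true_time=10.0)
loss_disagree = distill_loss(student_logits, [0.1, 0.9], 0, pred_time=11.0, true_time=10.0)
print(loss_agree, loss_disagree)
```

With λ = 0.01 the KL term acts as a mild regularizer: even a teacher that strongly disagrees with the student shifts the total loss only slightly compared with the cross-entropy and MAPE terms.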
- HOSER implementation: `/models/hoser.py`, `/models/trajectory_encoder.py`, `/models/navigator.py`, `/models/road_network_encoder.py`
- LM-TAD wrapper: `/critics/lmtad_teacher.py`
- Configuration: `/config/Beijing.yaml`
- Training: `/train_with_distill.py`
- Distillation methodology: `/docs/LMTAD-Distillation.md`
- Model checkpoints: `/docs/reference/MODEL_LOCATIONS.md`
- Evaluation guide: `/docs/EVALUATION_PIPELINE_GUIDE.md`
- HOSER: Cao et al., "Holistic Semantic Representation for Navigational Trajectory Generation"
- LM-TAD: Mbuya et al., "Trajectory Anomaly Detection with Language Models", SIGSPATIAL 2024 (arXiv:2409.15366)
Last Updated: 2025-01-06
Validation Status: ✅ All layer details documented, parameter counts verified, diagrams created, size comparison added