Deep Learning system that counts crowds at transit stops and automatically dispatches extra buses when capacity is exceeded.
Problem • Solution • Screenshots • Why Density Maps • Architecture • How It Works • Get Started
Urban public transport stops face a silent crisis: unmonitored overcrowding.
When 80 people pile up at a bus stop designed for 50, there's no automated system to detect the overload. Dispatchers rely on manual reports or complaints; by the time they react, citizens have already waited 30+ minutes in frustration.
Traditional approaches fail for two key reasons:
Object detection models (YOLO, Faster R-CNN) collapse in dense crowds. When dozens of people overlap, only partial features (a head, a shoulder) are visible. Bounding-box algorithms can't draw boxes around what they can't isolate.
Scale variation breaks detectors. People close to the camera appear 5x larger than those in the back. A single frame contains massive scale differences that standard detectors weren't designed to handle.
The problem isn't a lack of cameras; it's the lack of intelligent counting behind them.
Crowd Detection doesn't try to draw boxes around people; it predicts density.
We built a Density Map Regression system powered by CSRNet (Congested Scene Recognition Network). Instead of isolating individuals, the model generates a continuous heatmap where each pixel represents crowd density. The total count is simply the integral (sum) of the entire map.
This approach is fundamentally more robust against occlusion, scale variation, and real-world noise.
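To make the "count = sum of the map" idea concrete, here is a minimal NumPy sketch. It does not run CSRNet; it builds a toy density map the way dot-annotated training targets are commonly built (one normalized Gaussian per head) and shows that summing the map recovers the head count. The helper names `gaussian_blob` and `density_map_from_heads`, the map size, and `sigma` are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_blob(shape, center, sigma=4.0):
    """Return a 2D Gaussian centred on `center` (row, col) that sums to 1."""
    rows, cols = np.indices(shape)
    r0, c0 = center
    blob = np.exp(-((rows - r0) ** 2 + (cols - c0) ** 2) / (2.0 * sigma ** 2))
    # Normalise over the whole grid so each person contributes exactly 1.
    return blob / blob.sum()


def density_map_from_heads(shape, head_positions, sigma=4.0):
    """Accumulate one unit-mass Gaussian per annotated head position."""
    density = np.zeros(shape, dtype=np.float64)
    for center in head_positions:
        density += gaussian_blob(shape, center, sigma)
    return density


# Three hypothetical head annotations; two of them overlap heavily.
heads = [(10, 12), (11, 15), (30, 40)]
density = density_map_from_heads((48, 64), heads)

# Integrating (summing) the map gives back the number of people.
estimated_count = float(density.sum())
print(round(estimated_count, 3))
```

Note that the two overlapping blobs merge into a single bright region in the map, yet the sum still accounts for both people. That is the property that makes density regression tolerant of occlusion.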
| | Object Detection (YOLO) | Density Map (CSRNet) |
|---|---|---|
| Output | Bounding boxes; overlapping people go undetected (Missed: 60%) | Continuous density heatmap; Σ pixels = 73 |
| Result | Detected: 12/73 | Estimated: 73 (±3) |
- Upload: Citizen or camera feed sends a crowd image
- Preprocess: Image is normalized using ImageNet standards
- Predict: CSRNet generates a 2D density heatmap
- Count: Sum all pixel values = estimated crowd size
- Decide: If count > 50, the Dispatch Extra Bus alert is triggered
- Visualize: Original image + heatmap displayed side-by-side
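The predict, count, and decide steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in `app.py`: the `Decision` dataclass, the `analyze` function, and the `fake_model` stand-in are hypothetical names, and the real predictor would be the trained CSRNet rather than a stub.

```python
from dataclasses import dataclass
from typing import Callable, List

DensityMap = List[List[float]]


@dataclass
class Decision:
    count: int
    dispatch_extra_bus: bool


def count_from_density(density_map: DensityMap) -> int:
    """Sum every pixel of the density map and clamp the result at zero."""
    total = sum(sum(row) for row in density_map)
    return max(0, int(total))


def analyze(image, predict_density: Callable[[object], DensityMap],
            capacity: int = 50) -> Decision:
    """Predict -> Count -> Decide, with the model passed in as a callable."""
    density_map = predict_density(image)
    count = count_from_density(density_map)
    return Decision(count=count, dispatch_extra_bus=count > capacity)


def fake_model(_image) -> DensityMap:
    # Stand-in for CSRNet (assumption): a 10x16 map whose pixels sum to 80.
    return [[0.5] * 16 for _ in range(10)]


result = analyze(image=None, predict_density=fake_model)
print(result)
```

Passing the model in as a callable keeps the thresholding logic testable without loading any weights.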
| Challenge | Object Detection | Density Map Regression |
|---|---|---|
| Severe Occlusion | Fails: can't draw boxes around overlapping people | Robust: predicts density per pixel, no isolation needed |
| Scale Variation | Struggles: same frame has tiny & large people | Handles natively: dilated convolutions capture multi-scale context |
| Counting Accuracy | Misses 40-60% in dense scenes | ±5% error in ShanghaiTech benchmarks |
| Speed | Fast, but inaccurate | Fast and accurate |
| Real-world Noise | False positives from bags, signs, shadows | Learned density patterns from 400+ annotated scenes |

Bottom line: In a crowd of 73 people, YOLO might detect 12. CSRNet estimates 73 ± 3.
Crowd Detection Architecture

- Streamlit frontend
  - UI components: Image Uploader, Density Heatmap, Dispatch Alert UI
  - Analysis Engine:
    1. ImageNet Normalization
    2. CSRNet Inference (CPU)
    3. Density Map → Crowd Count
    4. Threshold Decision (> 50 = Alert)
- CSRNet model
  - Input: image (RGB)
  - Frontend: VGG-16 (Layer 0-23), pre-trained ImageNet weights; output: 512-ch feature maps
  - Backend: dilated convolutions, 6× Conv2d (dilation=2), channels 512 → 512 → 512 → 256 → 128 → 64; preserves spatial resolution
  - Output: Conv2d(64, 1, 1×1) → 2D Density Map (Heatmap); Σ(pixels) = Crowd Count
- Deployment target
  - Hugging Face Spaces (Streamlit SDK)
  - CPU-only inference via map_location
| Component | Details | Purpose |
|---|---|---|
| Frontend | VGG-16, layers 0-23, pre-trained on ImageNet | High-level feature extraction from crowd images |
| Backend | 6× Dilated Conv2d (dilation=2, padding=2) | Expands receptive field without losing spatial resolution |
| Output | Conv2d(64, 1, kernel_size=1) | Single-channel density map; sum = crowd count |
| Training Data | ShanghaiTech Part B | 400+ annotated crowd scenes with dot-labeled heads |
| Inference | CPU-optimized (map_location='cpu') | Runs on free-tier Hugging Face Spaces |
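The layer layout in the table can be written down as a short PyTorch module. The sketch below is a reconstruction from the table, not a copy of the class in `app.py`: the names `CSRNetSketch` and `make_layers` are hypothetical, the frontend is built by hand from the standard VGG-16 channel configuration instead of being sliced from torchvision, and the weights are randomly initialized (no ImageNet or ShanghaiTech weights are loaded).

```python
import torch
import torch.nn as nn


def make_layers(cfg, in_channels, dilation=1):
    """Build 3x3 Conv+ReLU layers; "M" entries become 2x2 max-pooling."""
    layers = []
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # padding == dilation keeps the spatial size for a 3x3 kernel.
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3,
                                    padding=dilation, dilation=dilation))
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)


class CSRNetSketch(nn.Module):
    # First 10 conv layers of VGG-16 with 3 pooling stages (23 modules).
    FRONTEND_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512]
    # Six dilated conv layers, as listed in the table above.
    BACKEND_CFG = [512, 512, 512, 256, 128, 64]

    def __init__(self):
        super().__init__()
        self.frontend = make_layers(self.FRONTEND_CFG, in_channels=3)
        self.backend = make_layers(self.BACKEND_CFG, in_channels=512, dilation=2)
        self.output_layer = nn.Conv2d(64, 1, kernel_size=1)

    def forward(self, x):
        return self.output_layer(self.backend(self.frontend(x)))


model = CSRNetSketch().eval()
with torch.no_grad():
    density = model(torch.zeros(1, 3, 64, 64))
print(tuple(density.shape))  # (1, 1, 8, 8)
```

The three pooling stages in the frontend reduce the map to 1/8 of the input size; the dilated backend then widens the receptive field without any further downsampling, which is the sense in which it preserves spatial resolution.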
| Component | Technology | Purpose |
|---|---|---|
| Framework | Streamlit 1.57 | Interactive web interface for image upload & visualization |
| Deep Learning | PyTorch 2.0+ | Model inference engine |
| Backbone | torchvision (VGG-16) | Pre-trained feature extractor weights |
| Visualization | Matplotlib (jet colormap) | Heatmap rendering of density maps |
| Image Processing | Pillow, NumPy | Image loading, tensor operations |
| Deployment | Hugging Face Spaces | Free CPU-tier cloud hosting |
The user uploads a JPG/PNG photo of a transit stop through the Streamlit interface.
The image is converted to RGB and normalized using ImageNet standards:

    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

CSRNet processes the image through VGG-16 frontend → Dilated Conv backend → 1×1 output layer, producing a 2D density heatmap.
The crowd count is the sum of the density map, floored at zero:

    crowd_count = max(0, int(np.sum(density_map)))

The count is then compared against the threshold:

- crowd_count ≤ 50 → Normal level
- crowd_count > 50 → EXTRA BUS ALERT
The system displays the original image alongside the density heatmap and shows a clear dispatch decision.
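For readers who want to see what the normalization step does numerically, here is a NumPy-only sketch of the same arithmetic. It assumes the app first scales pixels to the range [0, 1] (as torchvision's `ToTensor` does) before `Normalize` is applied; the source does not show that step, and the function name `normalize_image` is hypothetical.

```python
import numpy as np

# Per-channel ImageNet statistics, as used in the Normalize call above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_image(rgb_uint8):
    """Convert an HxWx3 uint8 image to a 3xHxW float32 normalized array."""
    scaled = rgb_uint8.astype(np.float32) / 255.0       # [0, 255] -> [0, 1]
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD  # broadcast over channels
    return normalized.transpose(2, 0, 1)                 # HWC -> CHW


# A tiny all-white test image (4 rows, 6 columns).
image = np.full((4, 6, 3), 255, dtype=np.uint8)
tensor = normalize_image(image)
print(tensor.shape, tensor[:, 0, 0])
```

Each channel ends up as `(pixel / 255 - mean) / std`, so a white pixel maps to a different value per channel. This matches the statistics the VGG-16 frontend was pre-trained with.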
- Python 3.10+ (Install Guide)
- Git (Install Guide)
Clone the repository and create a virtual environment:

    git clone https://github.com/ErenBalkis/Smart_Transit_AI.git
    cd Smart_Transit_AI
    python -m venv venv

Activate the environment and install the dependencies:

    # Linux / macOS
    source venv/bin/activate
    # Windows
    venv\Scripts\activate
    pip install -r requirements.txt

Make sure the trained best_model.pth file is in the project root directory, then start the app:

    streamlit run app.py

The app will open at http://localhost:8501.
Project structure:

    crowd_detection/
    ├── app.py               # Streamlit application + CSRNet model definition
    │                        #   ├─ CSRNet class (VGG-16 frontend + dilated backend)
    │                        #   ├─ Model loading with CPU mapping
    │                        #   ├─ Image preprocessing pipeline
    │                        #   └─ UI: upload, heatmap, metrics, dispatch alert
    ├── best_model.pth       # Trained CSRNet weights (ShanghaiTech Part B)
    ├── requirements.txt     # Python dependencies
    ├── info.md              # Technical specification document
    └── README.md            # This file

The secret sauce behind CSRNet's accuracy is dilated (atrous) convolutions. Standard pooling layers downsample the feature map, destroying spatial information critical for density estimation. Dilated convolutions solve this by expanding the receptive field without reducing resolution:

| | Standard Conv (3×3, dilation=1) | Dilated Conv (3×3, dilation=2) |
|---|---|---|
| Receptive field | 3×3 | 5×5 |
| Parameters | 9 | 9 (same!) |

Same computational cost, wider field of view, allowing the model to understand both local details and crowd-level patterns simultaneously.
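The receptive-field figures follow from simple arithmetic: a kernel of size k with dilation d spans k + (k - 1)(d - 1) pixels, and each stacked stride-1 layer adds that span minus one. The sketch below works this out for a single layer and for a stack of six layers like the backend; the function names are hypothetical, and the stacked figure is measured on the backend's input feature map, not on the original image.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span in pixels covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(num_layers, kernel_size=3, dilation=1):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    receptive_field = 1
    for _ in range(num_layers):
        receptive_field += effective_kernel_size(kernel_size, dilation) - 1
    return receptive_field


print(effective_kernel_size(3, dilation=1))    # 3
print(effective_kernel_size(3, dilation=2))    # 5
print(stacked_receptive_field(6, dilation=1))  # 13
print(stacked_receptive_field(6, dilation=2))  # 25
```

With the same nine weights per kernel, six dilated layers see a 25×25 window of the feature map where six ordinary layers would see 13×13.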
This project is deployed and running on Hugging Face Spaces:
- Uses Streamlit SDK with `map_location=torch.device('cpu')` for seamless CPU-tier inference
- Zero-config deployment: push to HF repo and it builds automatically
- The YAML frontmatter in this README configures the Space (SDK version, app file, etc.)
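The CPU mapping mentioned in the list above can be illustrated with a small round trip. This sketch assumes the checkpoint stores a plain `state_dict`; the actual layout of `best_model.pth` is not shown in this README. The helper `load_weights_on_cpu` and the file name `demo_weights.pth` are hypothetical, and a single 1×1 convolution stands in for the full model.

```python
import os
import tempfile

import torch
import torch.nn as nn


def load_weights_on_cpu(model, checkpoint_path):
    """Load a state_dict onto the CPU, even if it was saved from a GPU."""
    state_dict = torch.load(checkpoint_path, map_location=torch.device("cpu"))
    model.load_state_dict(state_dict)
    return model.eval()  # inference mode: no dropout or batch-norm updates


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_weights.pth")

    # Save weights from one module, then restore them into a fresh copy.
    source_model = nn.Conv2d(64, 1, kernel_size=1)
    torch.save(source_model.state_dict(), path)
    restored = load_weights_on_cpu(nn.Conv2d(64, 1, kernel_size=1), path)

print(restored.weight.device)
```

Without `map_location`, a checkpoint saved on a CUDA device would try to restore its tensors to that device and fail on a CPU-only host such as the free Spaces tier.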
Built by Eren Balkış
Smart cities start with smarter infrastructure.

