Date: 2025-05-07
- Built `docker-compose.yml` and `Dockerfile` for each use case (Infra and Ships).
- Ran containers on specific GPUs:

```bash
CUDA_DEVICES=<GPU_ID> docker compose up -d
CUDA_DEVICES=1 docker compose up -d
docker exec -it SAR_AIR_model_trainer bash
```
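For the `CUDA_DEVICES` variable to pin a service to one GPU, the compose file needs a device reservation wired to it. A minimal sketch (the actual `docker-compose.yml` is not shown here; the service name, build context, and default value are assumptions):

```yaml
services:
  model_trainer:
    build: .
    container_name: SAR_AIR_model_trainer
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["${CUDA_DEVICES:-0}"]   # GPU picked via the env var
              capabilities: [gpu]
```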
- Ensured unique container names to avoid conflicts.
- Removed old containers:

```bash
docker rm -f model_trainer
docker rm -f sar_ship_model_trainer
```
- Started the TensorBoard server:

```bash
~/.local/bin/tensorboard --logdir trained_models_home/ --port 6010 --host 0.0.0.0
```

- Accessed it remotely over an SSH tunnel:

```bash
ssh -L 6010:localhost:6010 gpaps@139.91.185.102
```
- `trainv4_Infra_sweeps_v2.py`: stable and used for production.
- `train_sweep.py`: initially promising, but led to gradient explosion and NaNs.
- Dropped `ClassAwareMapper` from Ships where it was unnecessary.
- Used gradient clipping:

```python
cfg.SOLVER.CLIP_GRADIENTS.ENABLED = True
cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE = "value"
cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE = 1.0
```
```
MAX_ITER: 30000
STEPS: (18000, 25000)
BASE_LR: 0.0002
BATCH_SIZE_PER_IMAGE: 256
MIN_SIZE_TRAIN: [640, 800, 1024, 1280]
```
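In Detectron2's config tree these hyperparameters would typically be set as follows (the exact key paths are assumptions based on standard Detectron2 usage, in the same style as the gradient-clipping snippet):

```python
cfg.SOLVER.MAX_ITER = 30000
cfg.SOLVER.STEPS = (18000, 25000)
cfg.SOLVER.BASE_LR = 0.0002
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 256
cfg.INPUT.MIN_SIZE_TRAIN = [640, 800, 1024, 1280]
```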
- `trainv4_Ships_sweeps_v2.py` with `ShipSARMapper`, which applies:
  - ResizeShortestEdge
  - Brightness, contrast
  - Horizontal and vertical flip
  - Random rotation
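The augmentation list above could be expressed with Detectron2's built-in transforms roughly like this (a sketch only; the parameter values are illustrative assumptions, not the ones actually used in `ShipSARMapper`):

```python
import detectron2.data.transforms as T

# Hypothetical augmentation pipeline matching the list above.
augmentations = [
    T.ResizeShortestEdge(short_edge_length=(640, 800, 1024), max_size=1333,
                         sample_style="choice"),
    T.RandomBrightness(0.8, 1.2),
    T.RandomContrast(0.8, 1.2),
    T.RandomFlip(horizontal=True, vertical=False),
    T.RandomFlip(horizontal=False, vertical=True),
    T.RandomRotation(angle=[-10, 10]),
]
```

Note that `T.RandomFlip` rejects `horizontal=True, vertical=True` in a single transform, hence the two separate flip entries.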
- Replaced `ClassAwareMapper` (1-class only)
- Handled StopIteration via proper dataset mapping
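The StopIteration fix can be sketched as an infinite iterator over the mapped dataset (a minimal stdlib illustration of the idea; the real training loop uses Detectron2's own data loader):

```python
from itertools import cycle

def infinite_loader(dataset):
    """Yield mapped samples forever instead of exhausting the dataset.

    Restarting at the end of the list means the training loop's next()
    call can never raise StopIteration mid-training.
    """
    yield from cycle(dataset)

loader = infinite_loader([{"file_name": "a.png"}, {"file_name": "b.png"}])
batch = [next(loader) for _ in range(5)]  # wraps past the end safely
```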
```
MAX_ITER: 15000
STEPS: (8000, 12000)
BASE_LR: 0.0001
BATCH_SIZE_PER_IMAGE: 256
```
- Added TensorBoard text logging via `logText(cfg, cfg.OUTPUT_DIR)`
- Removed in-script dataset registration
- Added evaluation hooks for validation metrics and visual output logging
- Avoided clutter: controlled logging output, model summaries
```bash
# Run sweep
./run_sweep.sh

# Remove stuck containers
docker rm -f model_trainer
docker rm -f sar_ship_model_trainer

# TensorBoard from remote
~/.local/bin/tensorboard --logdir trained_models_home/ --port 6010 --host 0.0.0.0
ssh -L 6010:localhost:6010 gpaps@139.91.185.102
```

✅ You’re ready for repeatable, scalable training across use cases. Next up: monitoring and inference!