This repository implements a complete deep-learning Visual-Inertial Odometry stack for a downward-facing UAV camera observing a planar surface, with six interlocking branches. Read this file first — it explains how the pieces fit together, the dataset format the branches assume, the order in which things are trained, and the conventions used by every SLURM script.
deep-vio/
├── data_generation/ Blender pipeline that produces synthetic
│ sequences (frames + poses + IMU).
├── data/ Generated datasets live here.
├── cluster/ Apptainer container + run helpers shared
│ across branches that lack their own venv.
├── vision_only/ Branch A — vision-only odometry (BranchA).
├── imu_only/ AirIMU + AirIO + 15-state velocity EKF.
├── prgflow/ PRGFlow VIO baseline — cascaded SqueezeNet
│ patch-flow + Madgwick IMU attitude.
├── prgflow_mod/ PRGFlow with an additional yaw head.
├── fusion_common/ Shared dataset / loss / metrics for the
│ two fusion branches below.
├── loose_fusion/ EKF-based loose coupling (no training).
├── cross_attention/ Tight coupling via cross-modal transformer.
└── CLAUDE.md This file.
Six branches in total (five trainable + EKF-only loose fusion):
| Branch | What it does | Trained? | Depends on |
|---|---|---|---|
| vision_only | RGB → relative pose | yes | dataset only |
| imu_only | IMU → body-frame velocity → trajectory via EKF | yes | dataset only |
| prgflow | Cascaded SqueezeNet patch flow + Madgwick IMU → VIO | yes | dataset only |
| prgflow_mod | PRGFlow + extra yaw head (cascaded t/y/s) | yes | dataset only |
| loose_fusion | EKF fuses BranchA pose + AirIO velocity + AirIMU | no | pretrained vision_only + imu_only checkpoints |
| cross_attention | Tight fusion: transformer cross-modal attention | yes | pretrained backbones (recommended) |
Every branch reads from a top-level dataset directory containing one
sub-directory per trajectory. The actual format on the cluster (after
the dataset.py refactor in commit 6b2d17d) is multi-rate with
separate CSVs for each modality:
dataset_root/
clover_100/
images/
000000.png
000001.png
... # camera-rate, ≈100 Hz
frame_index.csv # t, imu_index, image_path (one row per frame)
poses_body.csv # t, x, y, z, qw, qx, qy, qz (IMU-rate, ≈1000 Hz)
imu_clean.csv # t, gx, gy, gz, ax, ay, az (IMU-rate, ≈1000 Hz)
clover_101/
...
Trajectory naming. Sequences come in trajectory families
(clover, figure8, figure8_3d, lemniscate, oval, spiral)
with numerical suffixes that index difficulty:
- _100: baseline maneuver (smooth, well-represented in training)
- _101: moderate aggressiveness
- _102: most aggressive (fastest yaw rates, sharpest turns)
The _102 instances cause the largest accuracy regressions across
every method we've tested — see Empirical observations.
Conventions assumed by all branches
- Vision (camera) and IMU run at different rates, typically 100 Hz vs 1000 Hz (10 IMU samples per frame). frame_index.csv's imu_index column maps each camera frame to its corresponding IMU sample.
- Poses are stored as unit quaternions (qw, qx, qy, qz) plus position, both at IMU rate. The shared helper imu_only.utils.quat_to_rotmat_np converts per-sequence quaternion arrays to (N, 3, 3) rotation matrices.
- IMU CSV column order is t, gx, gy, gz, ax, ay, az (gyro-then-accel). The dataset loaders extract acc = data[:, 4:7] and gyro = data[:, 1:4]. Watch for this if you write any external preprocessing: the convention differs from many ROS bag dumps.
- IMU specific-force convention: ᴮa = ᴮF/m − Rᵀᴳg (an IMU at rest, level on the ground, reads +9.81 along its body-up axis). The sign matches the EKF predict step, where R a + g recovers world-frame acceleration.
- Gravity vector: g = [0, 0, -9.81] (Z-up world). Defined once in imu_only/utils.py (GRAVITY, GRAVITY_VEC).
- Image preprocessing: resize to (img_height, img_width), then ImageNet normalize (mean [0.485, 0.456, 0.406], std [0.229, 0.224, 0.225]). Photometric augmentation in training is applied identically to the two frames of a pair so the relative pose label stays valid.
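A minimal sketch of the column-order and specific-force conventions above, using a synthetic in-memory array rather than a real imu_clean.csv. The helper names split_imu and specific_force are hypothetical, and the IMU model f = Rᵀ(a_world − g) is an assumption chosen because it makes a level IMU at rest read +9.81.

```python
import numpy as np

GRAVITY_VEC = np.array([0.0, 0.0, -9.81])  # Z-up world, as in the conventions above


def split_imu(data: np.ndarray):
    """Split rows of (t, gx, gy, gz, ax, ay, az) into (t, gyro, acc)."""
    t = data[:, 0]
    gyro = data[:, 1:4]  # gyro columns come first in the CSV
    acc = data[:, 4:7]   # accelerometer columns come second
    return t, gyro, acc


def specific_force(R_wb: np.ndarray, a_world: np.ndarray) -> np.ndarray:
    """Assumed IMU model: body-frame specific force f = R^T (a_world - g)."""
    return R_wb.T @ (a_world - GRAVITY_VEC)


# Synthetic example: a level IMU at rest, sampled three times at 1 kHz.
f_rest = specific_force(np.eye(3), np.zeros(3))
rows = np.array([[i * 1e-3, 0.0, 0.0, 0.0, *f_rest] for i in range(3)])
t, gyro, acc = split_imu(rows)
```

Swapping the two slices is the failure mode to watch for when importing data that stores accelerometer columns first.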
Pose representation in fusion_common/dataset.py. Primary buffers
(acc, gyro, attitude, v_body, R, p) are stored at IMU rate.
Camera-rate views (R_cam, p_cam) are derived via the
frame_imu_indices mapping. Loose fusion uses the IMU-rate buffers
for its sample-by-sample EKF; the tight-fusion branches use the
camera-rate views for trajectory comparison and IMU-rate context
windows for AirIO input.
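The camera-rate views can be pictured as plain indexing into the IMU-rate buffers. The sketch below uses synthetic buffers; the name frame_imu_indices comes from the text above, while the relative-pose convention (frame k+1 expressed in frame k) is an assumption and may differ from what fusion_common/dataset.py actually does.

```python
import numpy as np

# Synthetic IMU-rate buffers: N poses, moving 0.1 m per sample along x.
N = 50
p = np.stack([np.linspace(0.0, 4.9, N), np.zeros(N), np.ones(N)], axis=1)  # (N, 3)
R = np.tile(np.eye(3), (N, 1, 1))                                          # (N, 3, 3)

# One IMU index per camera frame (10 IMU samples per frame).
frame_imu_indices = np.arange(0, N, 10)

# Camera-rate views are fancy-indexed slices of the IMU-rate buffers.
R_cam = R[frame_imu_indices]  # (F, 3, 3)
p_cam = p[frame_imu_indices]  # (F, 3)

# Assumed label convention: pose of frame k+1 expressed in frame k.
rel_R = np.einsum("kji,kjl->kil", R_cam[:-1], R_cam[1:])            # R_k^T R_{k+1}
rel_t = np.einsum("kji,kj->ki", R_cam[:-1], p_cam[1:] - p_cam[:-1])  # R_k^T (p_{k+1} - p_k)
```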
Reference: Wang et al. 2017 (DeepVO) + Zhou et al. 2019 (6D rotation) + PWC-Net-style correlation.
Inputs: two consecutive RGB frames [B, T, 3, H, W] plus their
sequence offset.
Outputs:
- trans [B, T, 3]: relative translation in metric units
- rot_6d [B, T, 6]: raw 6D rotation
- R [B, T, 3, 3]: Gram-Schmidt rotation matrix
- hidden: LSTM state (h, c) for trajectory continuation
- features [B, T, 128]: shared with the fusion branches
Architecture (4 stages):
- Siamese MobileNetV2 (layers 0-13, no pretraining) → 96 ch → 1×1 conv → 64 ch.
- Correlation layer with max_displacement=4 → 81 ch → two 3×3 conv blocks → 64 ch.
- Concat [corr | feat_t | feat_t1] (192 ch) → conv → AdaptiveAvgPool (4, 4) → flatten [B, T, 1024].
- 2-layer unidirectional LSTM (hidden=256, dropout=0.3) → FC(256, 128) → two heads (trans 3, rot 6). The rotation matrix is recovered from the 6D output via gram_schmidt.
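To see where the 81 correlation channels come from, here is a loop-based NumPy sketch of a cost volume with max_displacement=4. It is illustrative only: the repository's CorrelationLayer is described as vectorised via unfold, and the channel-mean normalisation used here is an assumption.

```python
import numpy as np


def correlation(feat_a: np.ndarray, feat_b: np.ndarray, max_displacement: int = 4) -> np.ndarray:
    """Cost volume between two (C, H, W) feature maps.

    Each output channel corresponds to one (dy, dx) shift of feat_b, so there
    are (2 * max_displacement + 1) ** 2 channels in total.
    """
    C, H, W = feat_a.shape
    d = max_displacement
    padded = np.pad(feat_b, ((0, 0), (d, d), (d, d)))  # zero-pad the spatial dims
    out = []
    for dy in range(-d, d + 1):
        for dx in range(-d, d + 1):
            shifted = padded[:, d + dy : d + dy + H, d + dx : d + dx + W]
            out.append((feat_a * shifted).mean(axis=0))  # mean over channels
    return np.stack(out, axis=0)


rng = np.random.default_rng(0)
fa = rng.standard_normal((64, 8, 8))
fb = rng.standard_normal((64, 8, 8))
corr = correlation(fa, fb)
print(corr.shape)  # (81, 8, 8)
```

With this loop ordering the zero-displacement channel sits at index 40, the centre of the 9×9 grid.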
Loss: smooth-L1 on translation + geodesic SO(3) (atan2
formulation), lambda_rot=100.
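A small NumPy sketch of the two rotation pieces named above: the 6D-to-matrix Gram-Schmidt step and a geodesic angle computed with atan2. The function bodies are illustrative and not copied from vision_only/utils.py or loss.py; the column layout (first three values give the first column, last three the second) is an assumption.

```python
import numpy as np


def gram_schmidt(rot_6d: np.ndarray) -> np.ndarray:
    """Map a 6-vector to a rotation matrix with columns b1, b2, b3."""
    a1, a2 = rot_6d[:3], rot_6d[3:]
    b1 = a1 / np.linalg.norm(a1)
    u2 = a2 - np.dot(b1, a2) * b1        # remove the component along b1
    b2 = u2 / np.linalg.norm(u2)
    b3 = np.cross(b1, b2)                # right-handed third axis
    return np.stack([b1, b2, b3], axis=1)


def geodesic_angle(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Rotation angle of R_pred^T R_gt, using atan2 rather than arccos."""
    R = R_pred.T @ R_gt
    cos_t = 0.5 * (np.trace(R) - 1.0)
    axis_vec = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    sin_t = 0.5 * np.linalg.norm(axis_vec)
    return float(np.arctan2(sin_t, cos_t))


R_id = gram_schmidt(np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
R_z90 = gram_schmidt(np.array([0.0, 1.0, 0.0, -1.0, 0.0, 0.0]))
angle = geodesic_angle(R_id, R_z90)  # a 90 degree yaw
```

The atan2 form avoids the unbounded arccos gradient as the cosine approaches ±1, which is exactly the regime of small per-frame rotation errors.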
Files:
vision_only/
├── model.py BranchA, CNNEncoder, CorrelationLayer (vectorised via unfold)
├── loss.py geodesic_loss_stable, branch_a_loss
├── dataset.py UAVOdometryDataset
├── train.py Adam 1e-4, weight_decay 1e-4, CosineAnnealingLR, AMP, grad-clip 1.0
├── evaluate.py ATE (Umeyama-aligned) + RTE + rot deg + plots
├── utils.py gram_schmidt, pose_to_matrix, umeyama_alignment, integrate_trajectory
├── test_pipeline.py synthetic-data smoke test
├── setup_env.slurm
├── train.slurm
├── test.slurm
├── requirements.txt
└── README.md full design choices write-up
Run:
sbatch vision_only/setup_env.slurm
sbatch vision_only/test.slurm
sbatch --export=ALL,DATA_ROOT=/path/to/data vision_only/train.slurm

References: Qiu et al. RA-L 2025 ("AirIO") and Qiu et al. 2023 ("AirIMU"). The full pipeline is AirIMU → AirIO → EKF, where:
- AirIMUNet corrects raw IMU and reports per-sample uncertainty.
- AirIONet predicts body-frame velocity + diagonal covariance.
- VelocityEKF (15-state) fuses both with IMU pre-integration.
Inputs: raw [acc, gyro] [B, W, 6].
Outputs:
- correction [B, W, 6], so â = a + σ̂_a and ω̂ = ω + σ̂_g,
- log_var [B, W, 6] clamped to [-10, 10] (3 accel + 3 gyro).
Architecture: Conv1d → BN → GELU stack (ch 64→128→128) → Dropout(0.5) → bi-GRU (hidden=128, 2 layers) → 2 MLP heads.
Loss (training, imu_only/loss.py): integrate
the corrected IMU forward over the window with forward-Euler
(integrate_imu_window), supervise on:
- geodesic SO(3) of the integrated end-of-window rotation,
- MSE of the integrated end-of-window velocity (world frame),
- MSE of the integrated displacement,
- Gaussian NLL on per-sample uncertainty against a window-summary residual.
Default loss weights: λ_rot=10, λ_vel=1, λ_pos=1, λ_nll=1e-4.
Window size is short by design (default 20 = 100 ms at 200 Hz)
because forward-Euler error compounds super-linearly with window
length.
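The window integration can be sketched in NumPy as below. This is a hedged illustration of forward-Euler integration under the gravity and specific-force conventions stated earlier, not the repository's integrate_imu_window; the names so3_exp and integrate_window are hypothetical and bias terms are omitted.

```python
import numpy as np

GRAVITY_VEC = np.array([0.0, 0.0, -9.81])


def so3_exp(phi: np.ndarray) -> np.ndarray:
    """Rodrigues formula with a first-order fallback near zero angle."""
    theta = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    if theta < 1e-8:
        return np.eye(3) + K
    return np.eye(3) + (np.sin(theta) / theta) * K + ((1.0 - np.cos(theta)) / theta**2) * (K @ K)


def integrate_window(acc, gyro, dt, R0, v0, p0):
    """Forward-Euler integration of one IMU window (no bias terms)."""
    R, v, p = R0.copy(), v0.copy(), p0.copy()
    for a, w in zip(acc, gyro):
        a_world = R @ a + GRAVITY_VEC          # specific force -> world acceleration
        p = p + v * dt + 0.5 * dt**2 * a_world
        v = v + a_world * dt
        R = R @ so3_exp(w * dt)
    return R, v, p


# 20 samples at 200 Hz of a level IMU at rest: nothing should move.
W, dt = 20, 1.0 / 200.0
acc = np.tile([0.0, 0.0, 9.81], (W, 1))
gyro = np.zeros((W, 3))
R_end, v_end, p_end = integrate_window(acc, gyro, dt, np.eye(3), np.zeros(3), np.zeros(3))
```

Because every step reuses the previous R, v and p, any error in the corrected IMU is re-integrated at each sample, which is why the window is kept short.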
Inputs: corrected [acc, gyro] [B, W, 6] + per-sample attitude
ξ ∈ ℝ³ [B, W, 3] (so(3) log of R).
Outputs: v_body [B, W, 3], log_var [B, W, 3] clamped.
Architecture: two parallel CNN1D encoders (IMU 6→64→128→128; attitude 3→32→64), concat → bi-GRU (hidden=128, 2 layers) → 2 MLP heads.
Loss (paper Eq. 3-5): Huber on velocity (δ=0.005) + λ_c=1e-4
× Gaussian NLL of residual under the predicted diagonal variance.
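As a rough NumPy illustration of this loss, the sketch below combines an elementwise Huber term with a diagonal Gaussian NLL. It assumes mean reduction and drops the constant term of the NLL; the exact reduction and scaling in imu_only/loss.py and in the paper may differ, and airio_loss_sketch is a hypothetical name.

```python
import numpy as np


def huber(residual: np.ndarray, delta: float = 0.005) -> np.ndarray:
    """Elementwise Huber penalty: quadratic inside delta, linear outside."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * abs_r**2, delta * (abs_r - 0.5 * delta))


def gaussian_nll(residual: np.ndarray, log_var: np.ndarray) -> np.ndarray:
    """Diagonal Gaussian negative log-likelihood without the constant term."""
    log_var = np.clip(log_var, -10.0, 10.0)   # same clamp range as the network output
    return 0.5 * (log_var + residual**2 / np.exp(log_var))


def airio_loss_sketch(v_pred, v_gt, log_var, lambda_c: float = 1e-4) -> float:
    """Huber on the velocity residual plus a lightly weighted NLL term."""
    r = v_pred - v_gt
    return float(huber(r).mean() + lambda_c * gaussian_nll(r, log_var).mean())


v_gt = np.zeros((4, 3))
loss_perfect = airio_loss_sketch(v_gt, v_gt, np.zeros((4, 3)))
loss_off = airio_loss_sketch(np.full((4, 3), 0.01), v_gt, np.zeros((4, 3)))
```

With δ=0.005 m/s, most residuals larger than a few millimetres per second fall in the linear regime, so the loss behaves close to L1 for typical errors.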
EKF — imu_only/ekf.py
15-state error-state EKF.
State (nominal): X = (R, ᴳv, ᴳp, b_a, b_g).
Error state: δX = [δξ, δᴳv, δᴳp, δb_a, δb_g] ∈ ℝ¹⁵.
Predict (Eq. 7):
R_{i+1} = R_i Exp((ω - b_g) Δt)
v_{i+1} = v_i + (R_i (a - b_a) + g) Δt
p_{i+1} = p_i + v_i Δt + 0.5 Δt² (R_i (a - b_a) + g)
b_a, b_g unchanged
Process noise = AirIMU log_var (rotated through R) + bias random
walk (defaults σ_ba_rw=1e-3, σ_bg_rw=1e-4).
Update (Eq. 10-12):
- Measurement: h(X) = Rᵀ ᴳv = body-frame velocity from AirIO.
- Jacobians: H_{δᴳv} = Rᵀ, H_{δξ} = Rᵀ [ᴳv]_×.
- Joseph form covariance update.
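The update step can be sketched with NumPy as follows. This is an illustration under the Jacobians listed above, not the code in imu_only/ekf.py: the function name velocity_update, the return values, and the choice to return the error-state correction without injecting it into the nominal state are all assumptions.

```python
import numpy as np


def skew(v: np.ndarray) -> np.ndarray:
    """Cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def velocity_update(P, R_wb, v_world, v_body_meas, R_meas):
    """One body-velocity update on a 15x15 error-state covariance.

    Error-state order: [d_xi, d_v, d_p, d_ba, d_bg], three entries each.
    Returns the error-state correction dx (15,) and the updated covariance.
    """
    H = np.zeros((3, 15))
    H[:, 0:3] = R_wb.T @ skew(v_world)   # Jacobian w.r.t. attitude error
    H[:, 3:6] = R_wb.T                   # Jacobian w.r.t. velocity error
    innovation = v_body_meas - R_wb.T @ v_world
    S = H @ P @ H.T + R_meas
    K = P @ H.T @ np.linalg.inv(S)
    dx = K @ innovation
    I_KH = np.eye(15) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ R_meas @ K.T   # Joseph form
    return dx, P_new


P0 = 0.1 * np.eye(15)
dx, P1 = velocity_update(P0, np.eye(3), np.array([1.0, 0.0, 0.0]),
                         np.array([1.2, 0.0, 0.0]), 0.01 * np.eye(3))
```

The Joseph form costs a few extra matrix products but keeps the covariance symmetric and positive semi-definite under rounding, which matters when the filter runs once per IMU sample.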
imu_only/
├── model.py AirIMUNet, AirIONet, CNNEncoder1D
├── loss.py airio_loss, airimu_loss, geodesic_loss_stable, gaussian_nll
├── dataset.py IMUWindowDataset (sliding windows)
├── ekf.py VelocityEKF (numpy)
├── utils.py SO(3) ops (torch + numpy), GT-velocity-from-poses, integrator
├── train_airimu.py Adam 1e-3, ReduceLROnPlateau, batch=128, window=20
├── train_airio.py same, window=1000 (5 s at 200 Hz); loads frozen AirIMU
├── evaluate.py per-IMU-sample EKF, ATE/RTE-5s/rot-deg, plots
├── test_pipeline.py synthetic helical-trajectory smoke test
├── setup_env.slurm
├── train_airimu.slurm
├── train_airio.slurm
├── test.slurm
├── requirements.txt
└── README.md full design choices write-up
Run order:
sbatch imu_only/setup_env.slurm
sbatch imu_only/test.slurm
sbatch --export=ALL,DATA_ROOT=/path imu_only/train_airimu.slurm
sbatch --export=ALL,DATA_ROOT=/path,\
AIRIMU_CHECKPOINT=imu_only/checkpoints/airimu/best.pt \
imu_only/train_airio.slurm

If you skip AirIMU, AirIO trains on raw IMU directly, which matches the "AirIO Net" baseline column in the paper.
Imported by loose_fusion and cross_attention via the standard
sys-path injection pattern at the top of each entry script:
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from fusion_common.dataset import PairedDataset
from fusion_common.loss import pose_loss
from fusion_common.metrics import trajectory_metrics

| File | Provides |
|---|---|
| dataset.py | PairedDataset — sliding window of T frame pairs with synchronized IMU context (imu_context samples ending at the next-frame timestamp). Returns frames_t, frames_t1, imu_acc, imu_gyro, attitude (per-sample so(3)), v_body_gt, trans_gt, R_gt. Also exposes _PairedSequence for evaluation scripts that want raw arrays. |
| loss.py | geodesic_loss_stable, pose_loss (smooth-L1 + λ × geodesic). |
| metrics.py | umeyama_alignment, trajectory_metrics (ATE, RTE-step, mean rotation deg, per-frame error series). |
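For readers unfamiliar with Umeyama-aligned ATE, the following NumPy sketch shows the idea behind the metrics.py row. The names umeyama_sketch and ate_rmse, their signatures, and the rigid (no-scale) default are assumptions; the repository's umeyama_alignment and trajectory_metrics may be organised differently.

```python
import numpy as np


def umeyama_sketch(src: np.ndarray, dst: np.ndarray, with_scale: bool = False):
    """Least-squares similarity transform so that dst ~ s * R @ src + t.

    src, dst: (N, 3) point sets with known one-to-one correspondence.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # avoid returning a reflection
    R = U @ S @ Vt
    var_s = (xs**2).sum() / src.shape[0]
    s = float(np.trace(np.diag(D) @ S) / var_s) if with_scale else 1.0
    t = mu_d - s * R @ mu_s
    return s, R, t


def ate_rmse(est: np.ndarray, gt: np.ndarray) -> float:
    """ATE as the RMSE of position error after aligning est onto gt."""
    s, R, t = umeyama_sketch(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))


rng = np.random.default_rng(0)
gt = rng.standard_normal((50, 3))
c, sn = np.cos(0.3), np.sin(0.3)
Rz = np.array([[c, -sn, 0.0], [sn, c, 0.0], [0.0, 0.0, 1.0]])
est = gt @ Rz.T + np.array([1.0, 2.0, 3.0])   # same shape, different frame
ate = ate_rmse(est, gt)                        # near zero: the offset is purely rigid
```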
PairedDataset requires imu_context IMU samples of history before
the first valid frame pair, so the first imu_context pairs of each
sequence are skipped at evaluation time and filled with identity.
No training. Loads three pretrained backbones, runs them independently, and steps a fusion EKF that consumes their outputs.
FusionEKF (extends imu_only.VelocityEKF):
- inherits predict() (corrected IMU + AirIMU log-var) and update_velocity() (AirIO body-frame velocity);
- adds update_vision_velocity(delta_t_vis, dt_frame, sigma): the vision translation divided by the frame interval (delta_t_vis / dt_frame) is treated as a body-frame velocity measurement and reuses the same Eq. 10-12 update model;
- adds update_vision_rotation(delta_R_vis, sigma_deg): the innovation is computed in so(3), and H is the identity on δξ to first order.
Files:
loose_fusion/
├── fusion_ekf.py FusionEKF subclass
├── evaluate.py end-to-end evaluator with --no_vision / --no_imu_update
├── test_pipeline.py
├── setup_env.slurm
├── eval.slurm
├── test.slurm
├── requirements.txt
└── README.md
Run:
sbatch loose_fusion/setup_env.slurm
sbatch loose_fusion/test.slurm
sbatch --export=ALL,DATA_ROOT=/path,\
VISION_CKPT=../vision_only/checkpoints/branch_a/best.pt,\
AIRIO_CKPT=../imu_only/checkpoints/airio/best.pt,\
AIRIMU_CKPT=../imu_only/checkpoints/airimu/best.pt \
loose_fusion/eval.slurm

Two tuning knobs worth sweeping: --vision_vel_sigma (default 0.05 m/s) and --vision_rot_sigma_deg (default 0.5°). Lower values make the EKF trust vision more.
Course-provided VIO baselines used as comparators. Both branches are
self-contained — they do not import fusion_common, do not
share backbones with the other branches, and have no per-branch
SLURM scripts (run via cluster/ Apptainer container or manual
python invocation).
Pipeline (both variants):
- Per-frame center crop → crop_size × crop_size grayscale patch (default 128).
- Madgwick filter integrates (gyro, accel, dt) from each frame's IMU window to produce body attitude R.
- Inter-frame attitude change is converted to a homography H_R = K (Rₜᵀ Rₜ₊₁) K⁻¹ and used to rotation-compensate the incoming patch before flow estimation.
- The network regresses a residual pseudo-similarity transform between the previous patch and the rotation-compensated current patch; the output is (scale, tx, ty) (PRGFlow) or (scale, yaw, tx, ty) (PRGFlow_mod).
- Body-frame velocity is recovered analytically from flow + altitude (v = patch_disp × altitude / (focal × dt)), then rotated into world frame and integrated. There is no learned EKF: this is loose vision-IMU coupling with an analytic projection model.
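The two analytic formulas in this pipeline are short enough to write out directly. The sketch below is a hedged illustration: the function names and the intrinsics values are made up for the example and are not taken from prgflow/warp.py or prgflow/fusion.py.

```python
import numpy as np


def rotation_homography(K: np.ndarray, R_t: np.ndarray, R_t1: np.ndarray) -> np.ndarray:
    """H_R = K (R_t^T R_t1) K^-1, the pure-rotation homography between two frames."""
    return K @ (R_t.T @ R_t1) @ np.linalg.inv(K)


def flow_to_velocity(patch_disp_px: np.ndarray, altitude_m: float,
                     focal_px: float, dt_s: float) -> np.ndarray:
    """v = patch_disp * altitude / (focal * dt) for a planar scene seen from above."""
    return patch_disp_px * altitude_m / (focal_px * dt_s)


# Hypothetical intrinsics for a 128x128 patch.
K = np.array([[500.0, 0.0, 64.0],
              [0.0, 500.0, 64.0],
              [0.0, 0.0, 1.0]])
H = rotation_homography(K, np.eye(3), np.eye(3))   # no attitude change -> identity
v = flow_to_velocity(np.array([10.0, -5.0]), altitude_m=2.0, focal_px=500.0, dt_s=0.01)
```

In this example a 10-pixel shift at 2 m altitude over 10 ms corresponds to 4 m/s, which shows how sensitive the recovered velocity is to both the altitude estimate and the frame interval.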
Architecture:
- PRGFlow (in prgflow/model.py): four cascaded SqueezeNet heads T₁ → S₁ → T₂ → S₂ (translation / scale, alternating). Each head sees [patch_t, warp(patch_{t+1}, H_total)] stacked on the channel dim and predicts a delta to H_total. Final decomposition: (s, tx, ty).
- PRGFlow_mod (in prgflow_mod/model.py): six cascaded heads T₁ → Y₁ → S₁ → T₂ → Y₂ → S₂, adding two yaw heads. Final decomposition: (s, yaw, tx, ty).
Empirically the yaw head hurts — see observation #5 in
Empirical observations.
Treat prgflow_mod as a cautionary tale, not an upgrade.
Files (mirrored across prgflow/ and prgflow_mod/):
prgflow/
├── model.py PRGFlow cascaded SqueezeNet
├── squeezenet.py SHead, THead, fire blocks
├── warp.py homography compose / decompose / warp_image
├── attitude.py Madgwick filter
├── fusion.py VIOFusion: end-to-end attitude + flow → velocity
├── dataset.py
├── preprocess.py cropping, normalization
├── loss.py
├── train.py manual python invocation (no train.slurm)
├── evaluate.py
└── test.py smoke test
Run: no SLURM scripts. Either drive via the shared Apptainer
container in cluster/ or run python train.py / python evaluate.py directly inside an environment that has the same
PyTorch + numpy + scipy + opencv stack as the other branches.
Idea: treat each modality as one token per timestep, concatenate
the two streams, and let a transformer encoder do self-attention over
the combined 2T tokens — equivalent to alternating within-modality
self-attention and cross-modality attention but parameter-shared.
CrossAttentionFusionNet:
- BranchA(frames_t, frames_t1) exposes its post-FC features [B, T, 128].
- _ImuFeatureExtractor runs AirIONet's CNN encoders + bi-GRU and takes the last sample of each window as a per-pair feature [B, T, 256].
- Both are projected to feat_dim=128 via vis_proj (128 → 128) and imu_proj (256 → 128).
- Add learned modality embedding ((2, D) table) + learned temporal positional embedding ((max_len, D)). The same temporal embedding is shared across modalities; the modality embedding distinguishes them.
- Concat as [B, 2T, D], run through nn.TransformerEncoder (2 layers by default, 4 heads, FFN width 256, GELU, norm_first=True, batch_first=True).
- LayerNorm; readout = mean of vision-half and IMU-half output tokens at each t. trans_head (3) and rot_head (6) → gram_schmidt(rot_6d).
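The token bookkeeping around the encoder can be followed in a shape-only NumPy sketch. Everything here is a stand-in: the embedding tables are random rather than learned, the transformer itself is replaced by an identity placeholder, and the variable names are hypothetical rather than taken from cross_attention/model.py.

```python
import numpy as np

B, T, D, max_len = 2, 5, 128, 256
rng = np.random.default_rng(0)

vis_tokens = rng.standard_normal((B, T, D))   # stand-in for vis_proj output
imu_tokens = rng.standard_normal((B, T, D))   # stand-in for imu_proj output

# Stand-ins for the learned tables (random here, trained in the real model).
modality_emb = rng.standard_normal((2, D))
pos_emb = rng.standard_normal((max_len, D))

if T > max_len:
    raise ValueError("sequence longer than the positional table")

vis_in = vis_tokens + modality_emb[0] + pos_emb[:T]   # same pos_emb for both streams
imu_in = imu_tokens + modality_emb[1] + pos_emb[:T]
tokens = np.concatenate([vis_in, imu_in], axis=1)     # (B, 2T, D)

encoded = tokens  # placeholder for the transformer encoder + final LayerNorm

# Readout: average the vision token and the IMU token that share timestep t.
fused = 0.5 * (encoded[:, :T] + encoded[:, T:])       # (B, T, D)

# Per-timestep cosine similarity between the two halves.
v_half, i_half = encoded[:, :T], encoded[:, T:]
cos = (v_half * i_half).sum(-1) / (
    np.linalg.norm(v_half, axis=-1) * np.linalg.norm(i_half, axis=-1)
)
```

Because token t and token T + t carry the same positional embedding, the encoder can tell which vision and IMU tokens belong to the same frame pair while the modality embedding keeps the two streams distinguishable.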
Training: Adam, weight_decay 1e-4, CosineAnnealingLR with warm-up + joint fine-tune.
- Phase 1 (--warmup_epochs=5): both backbones frozen, only the transformer + projections + heads update at lr=1e-4.
- Phase 2: full network unfrozen, lr drops to lr_finetune=2e-5.
Extra args: --feat_dim, --num_heads, --num_layers,
--ffn_hidden, --dropout.
Loss: pose_loss (λ_rot=100).
Diagnostic plot: <seq>_vis_imu_cos.png — per-timestep cosine
similarity between the two output tokens (vision vs IMU). Low
similarity = modalities disagree; the transformer is reconciling.
Files:
cross_attention/
├── model.py CrossAttentionFusionNet, _ImuFeatureExtractor
├── train.py warm-up + joint fine-tune + transformer args
├── evaluate.py
├── test_pipeline.py
├── setup_env.slurm, train.slurm, eval.slurm, test.slurm
├── requirements.txt
└── README.md
Run:
sbatch cross_attention/setup_env.slurm
sbatch cross_attention/test.slurm
sbatch --export=ALL,DATA_ROOT=/path,\
VISION_CKPT=../vision_only/checkpoints/branch_a/best.pt,\
AIRIO_CKPT=../imu_only/checkpoints/airio/best.pt \
cross_attention/train.slurm

To go from a fresh dataset to evaluation of every method:
data_generation/ (Blender)
│
▼
data/<run>/
│
┌─────────────┴──────────────┐
▼ ▼
vision_only/train.py imu_only/train_airimu.py
│ │
│ ▼
│ imu_only/train_airio.py (uses frozen AirIMU)
│ │
└────────────┬───────────────┘
▼
┌────────────┴────────────┐
▼ ▼
loose_fusion/eval cross_attention/train
│
▼
cross_attention/eval
(PRGFlow baselines run independently on data/<run>/.)
Or in shell:
# 1. Pretrain backbones
sbatch --export=ALL,DATA_ROOT=$D vision_only/train.slurm
sbatch --export=ALL,DATA_ROOT=$D imu_only/train_airimu.slurm
# (wait, then…)
sbatch --export=ALL,DATA_ROOT=$D,\
AIRIMU_CHECKPOINT=imu_only/checkpoints/airimu/best.pt \
imu_only/train_airio.slurm
# 2. Loose fusion (no training)
sbatch --export=ALL,DATA_ROOT=$D,\
VISION_CKPT=vision_only/checkpoints/branch_a/best.pt,\
AIRIO_CKPT=imu_only/checkpoints/airio/best.pt,\
AIRIMU_CKPT=imu_only/checkpoints/airimu/best.pt \
loose_fusion/eval.slurm
# 3. Tight fusion (independent training run)
sbatch --export=ALL,DATA_ROOT=$D,\
VISION_CKPT=vision_only/checkpoints/branch_a/best.pt,\
AIRIO_CKPT=imu_only/checkpoints/airio/best.pt \
cross_attention/train.slurm

Each branch that ships a setup_env.slurm has its own venv (<branch>/.venv/); venvs are not shared. The fusion branches still import sibling code via the sys.path injection trick, so the imported modules see whatever venv the importing branch is using.
vision_only, imu_only, loose_fusion, and cross_attention each have a near-identical set of SLURM scripts (prgflow and prgflow_mod do not; they share the Apptainer container in cluster/ instead):
| File | Job | Notes |
|---|---|---|
| setup_env.slurm | one-time venv create + torch + reqs | overridable via TORCH_INDEX_URL (default cu121) |
| test.slurm | python test_pipeline.py | ~1 min, runs the smoke test on synthetic data |
| train*.slurm | python train*.py | reads everything from ${VAR:-default} env vars |
| eval.slurm | python evaluate.py | only loose_fusion / cross_attention; imu_only and vision_only use train.slurm + a manual eval |
All scripts:
- set -euo pipefail,
- module load python/3.10|3.11|python and cuda/12.1|11.8|cuda (with || true so unknown clusters fall back gracefully),
- mkdir -p logs/,
- read SLURM_SUBMIT_DIR for the project dir,
- assume the venv lives at ${VENV_DIR:-${PROJECT_DIR}/.venv}.
To override anything pass it via --export=ALL,VAR=value at submit
time. Example: sbatch --export=ALL,EPOCHS=200,LR=5e-5 cross_attention/train.slurm.
- No pypose dependency. SO(3) ops are implemented in imu_only/utils.py (torch + numpy variants).
- Numerical stability rules: geodesic loss uses atan2, never arccos. SO(3) exp/log use small-angle Taylor expansions near θ = 0. AirIMU / AirIO log_var are clamped to [-10, 10].
- Mixed precision is on by default and is disabled with --no_amp. All training scripts use torch.cuda.amp when on a CUDA device.
- Gradient clipping at max-norm 1.0 everywhere there's an LSTM or GRU.
- Hidden state hygiene: LSTM / GRU hidden states are detached between batches; they are reset (None) at trajectory boundaries during training.
- Checkpoint format: every train*.py saves {epoch, model, optimizer, scheduler, args, val_*}, so reloading via state.get("model", state) works for both full checkpoints and bare state dicts.
- Augmentation rules for vision: photometric only (brightness + contrast, factor 0.2). Same jitter applied to both frames of a pair. No geometric augmentation (would invalidate the relative-pose label).
- All paths in scripts are absolute or relative to SLURM_SUBMIT_DIR, never to the user's pwd.
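The checkpoint convention in the list above can be illustrated without PyTorch. In this sketch plain dicts stand in for what torch.load would return, and extract_model_state is a hypothetical helper name, not a function from the repository.

```python
def extract_model_state(state: dict) -> dict:
    """Return model weights from either a full checkpoint or a bare state dict."""
    return state.get("model", state)


# Plain dicts standing in for loaded checkpoints.
full_ckpt = {
    "epoch": 12,
    "model": {"fc.weight": [0.1, 0.2]},
    "optimizer": {},
    "scheduler": {},
    "args": {},
}
bare_state = {"fc.weight": [0.1, 0.2]}

weights_from_full = extract_model_state(full_ckpt)
weights_from_bare = extract_model_state(bare_state)
```

One caveat of this pattern is that a bare state dict containing a parameter literally named "model" would be misread as a wrapped checkpoint.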
Every branch ships a test_pipeline.py that:
- Generates a tiny synthetic dataset (helical trajectory with analytical IMU readings, ~60-600 samples, ~3 sequences) in /tmp.
- Builds the network with random weights.
- Runs forward + backward + an optimizer step.
- Asserts shape correctness, finite gradients, and pipeline-specific invariants (e.g. EKF covariance stays PSD, SO(3) round-trips, integrator vs analytic helix matches).
sbatch <branch>/test.slurm is the fastest way to confirm a clone
works on a new cluster after setup_env.slurm. Total wall-clock <1
minute on a single GPU. Output ends with All checks passed.
- Validation Huber loss is a poor proxy for trajectory ATE. Empirically a 2.3× increase in val Huber on AirIO produced a 16× increase in ATE on the same evaluation sequence. Do not early-stop on Huber alone — track ATE on a held-out sequence during training, even if it's expensive. See observation #3 below.
- Per-axis velocity errors are wildly asymmetric in AirIO outputs. On clover trajectories the Z-axis RMSE is about 5× the X-axis RMSE; on planar oval trajectories the Y and Z errors are far smaller than X (roughly 40× and 4× smaller in the per-family table). This isn't a bug: it's the network failing to generalise across motion axes that aren't well-represented in training. Symptom: huge ATE on the first out-of-distribution trajectory family you evaluate.
- acc_pct / confidence outputs are modality-specific. If a model reports a translation-classification confidence, that signal can be 100% on a sequence whose rotation error is 100°. Don't use a single confidence channel as a global "did it work" signal.
- Adding a new prediction head without commensurate data hurts. PRGFlow's "improved" model with an added yaw head regressed 3-5× on translation and 10-25× on rotation versus the no-yaw baseline on the same evaluation set. Treat extra heads as a hypothesis to test, not a free improvement.
- Watch for degenerate constants in metric outputs. The PRGFlow baseline's scale prediction is literally constant (0.00824) on 4 of 5 evaluation sequences: mean = p95 = max identically. The network learned to output a default rather than predict scale. If you see suspiciously low variance in any metric, plot the per-frame distribution before trusting the mean.
- loose_fusion._R_prev_vis initialises to identity on reset(). The first vision rotation update is therefore silently skipped (innovation against itself); intentional, but worth knowing.
- Cross-attention's pos_emb has length max_len=256 by default. Training with --sequence_length > 256 triggers the runtime check at the top of forward.
- AirIONet clamps its log_var to [-10, 10]. If your loss starts oscillating it is not because the variance is exploding; investigate the Huber term first.
The results/ directory contains evaluations from before the fusion
branches existed: BranchA (vision_only), AirIO+EKF (imu_only), and
PRGFlow VIO (an external visual-inertial baseline not in this repo,
but a useful comparator). Eight non-obvious findings worth remembering
when interpreting future results:
BranchA's per-pair rotation error is 0.5°; PRGFlow VIO's is 6.9° — vision is 14× more accurate per frame. Yet PRGFlow's ATE is 0.72 m vs BranchA's 2.87 m — VIO is 4× better globally. Bias dominates trajectory error, not variance. Vision dead-reckons; VIO's IMU+EKF stops the bleed.
| Family | rmse_vx | rmse_vy | rmse_vz |
|---|---|---|---|
| oval (planar) | 0.17 | 0.004 | 0.04 |
| clover (3D loops) | 0.52 | 0.34 | 2.79 |
| spiral | 0.35 | 0.08 | 1.28 |
The Z-axis error is about 5× the X-axis error on clover (2.79 vs 0.52) even though
accelerometer/gyro physics are identical for both axes. The network
learned a motion-distribution prior, not an axis-symmetric inverse
model. Strong argument for diversifying training trajectories or for
data-augmentation that randomly rotates the body frame.
Same clover_100 sequence on two AirIO checkpoints:
| Checkpoint | val Huber | ATE | rotation |
|---|---|---|---|
| run0 | 0.000139 | 0.78 m | 14° |
| new_data_run | 0.000538 | 12.73 m | 104° |
A 2.3× rise in Huber → 16× rise in ATE. Per-window velocity residuals can be sub-millimeter while the integrated trajectory drifts meters. Validation should include trajectory-level metrics.
On oval_102 the model self-reports acc_pct = 100% (perfectly
confident) yet rotation error is 92.7°. On clover_102 it
self-reports acc_pct = 15.6% with rotation error of 105° — knows it
failed. The confidence head was trained on translation classification
and never saw rotation residuals during supervision. Lesson: any
confidence signal is partial, scoped to whatever it was supervised on.
Adding a 167K-parameter yaw head produced 3-5× worse translation and 10-25× worse rotation on the same evaluation sequences. Capacity without commensurate data caused specialisation in yaw at the expense of translation generalisation.
PRGFlow VIO's e_scale_mean, e_scale_p95, and e_scale_max are all
identically 0.00824 on 4 of 5 evaluation sequences. The network
learned a degenerate solution where it outputs a constant rather than
predicting scale. Worth adopting a habit: always plot the per-frame
distribution before trusting the mean of a metric.
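A quick automated check for this failure mode might look like the sketch below. The function name looks_degenerate and the relative tolerance are assumptions, and the arrays are synthetic rather than taken from results/.

```python
import numpy as np


def looks_degenerate(values, rel_tol: float = 1e-6) -> bool:
    """Flag a per-frame metric whose mean, p95 and max all coincide."""
    values = np.asarray(values, dtype=float)
    mean, p95, vmax = values.mean(), np.percentile(values, 95), values.max()
    scale = max(abs(vmax), 1e-12)   # guard against an all-zero series
    return bool(abs(vmax - mean) <= rel_tol * scale and abs(vmax - p95) <= rel_tol * scale)


# Synthetic per-frame series: a constant output versus an ordinary error distribution.
constant_scale_err = np.full(300, 0.00824)
healthy_err = np.abs(np.random.default_rng(0).normal(0.0, 0.01, 300))

flag_constant = looks_degenerate(constant_scale_err)
flag_healthy = looks_degenerate(healthy_err)
```

A check like this only catches the exact-constant case; a histogram of the per-frame values is still the more informative diagnostic.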
- Best result: oval_100, ATE 0.24 m in 20 s = 1.2 cm/s drift.
- Worst result: clover_102, same network, same training: ATE 26.9 m.
- Same parameters → 112× error gap from a 5% change in trajectory shape. This frames the fusion work as solving generalisation, not capability.
Wall-clock for evaluating 15 sequences: AirIO+EKF = 71 s, PRGFlow_imp = 2279 s (≈70% of which is preprocessing). The deployment story for onboard UAV use isn't accuracy alone — at the latency budget that matters for closed-loop control AirIO's accuracy gap is much smaller than the table suggests.
These observations directly motivate specific fusion design choices:
- Loose fusion's vision-rotation update is the highest-leverage measurement. AirIO rotations drift 8-119° while BranchA rotations are sub-degree per frame, so feeding vision rotation into the EKF patches AirIO's biggest weakness. Tune --vision_rot_sigma_deg aggressively low (0.1-0.3°) before --vision_vel_sigma.
- Cross-attention's interpretability is via vis_imu_cos. Low cosine similarity on clover_102 means the modalities disagree (which is correct, since IMU is failing) and the transformer is reconciling. High cosine similarity on planar trajectories means the modalities concur.
- Validation strategy for the fusion branches. Don't just track the pose loss. Pick 1-2 held-out sequences with different trajectory families and report ATE per epoch. The Huber/pose-loss validation curve will look smooth while ATE silently regresses.
- The bar to beat (small set, 5 trajectories). PRGFlow VIO scores ATE 0.72 m, RTE 0.005 m, rotation 6.9°. Loose fusion is unlikely to beat this outright (PRGFlow already fuses vision and IMU); the interesting comparison is cross_attention versus PRGFlow on identical sequences.
The full bibliography lives in each branch's README.md. The most
load-bearing papers:
- Wang, S. et al., 2017. "DeepVO." ICRA. (vision_only backbone)
- Sandler, M. et al., 2018. "MobileNetV2." CVPR. (vision_only backbone)
- Sun, D. et al., 2018. "PWC-Net." CVPR. (correlation layer)
- Zhou, Y. et al., 2019. "On the Continuity of Rotation Representations in Neural Networks." CVPR. (6D rotation)
- Qiu, Y. et al., 2025. "AirIO." RA-L. (imu_only)
- Qiu, Y. et al., 2023. "AirIMU." (imu_only AirIMU sub-network)
- Forster, C. et al., 2015. "IMU Preintegration on Manifold." RSS. (EKF kinematic model)
- Clark, R. et al., 2017. "VINet." AAAI. (cross_attention / fusion design)
- Vaswani, A. et al., 2017. "Attention Is All You Need." NeurIPS. (cross_attention)