Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
dataset.py	dataset.py
eval.slurm	eval.slurm
evaluate.py	evaluate.py
loss.py	loss.py
model.py	model.py
requirements.txt	requirements.txt
setup_env.slurm	setup_env.slurm
test.slurm	test.slurm
test_pipeline.py	test_pipeline.py
train.py	train.py
train.slurm	train.slurm
utils.py	utils.py

Branch A — Vision Only Odometry

Overview

Branch A is the vision-only baseline of a three-branch deep Visual-Inertial Odometry system designed for a downward-facing monocular camera on a UAV observing a planar surface. It takes two consecutive RGB frames as input and predicts the relative 6-DoF pose between them.

This branch serves two purposes:

A standalone vision-only odometry system
The visual encoder whose features are reused in Branch C (Visual-Inertial fusion)

Network Architecture

[Frame t]   ──┐
              ├──→ Siamese MobileNetV2 ──→ feat_t   [B, 64, H/16, W/16]
[Frame t+1] ──┘    (shared weights)    ──→ feat_t1  [B, 64, H/16, W/16]
                                │
                ┌───────────────┤
                ↓               │
        Correlation Layer       │
       [B, 81, H/16, W/16]      │
                │               │
        Compress convs          │
       [B, 64, H/16, W/16]      │
                └───────────────┘
                        │
              Concatenate [192ch]
                        │
              Compress + Pool
                        │
                Flatten [B, 1024]
                        │
                ┌───────┘  (per timestep)
                ↓
            2-layer LSTM
             hidden=256
                │
            FC(256→128)
            ┌────┴────┐
            ↓         ↓
        trans_head   rot_head
       Linear(128,3) Linear(128,6)
            ↓         ↓
         [B,T,3]    [B,T,6]
                        │
                 Gram-Schmidt
                        │
                    [B,T,3,3]

Design Choices and Justification

1. Siamese CNN Encoder

Choice: Two separate CNN encoders with shared weights, one per frame.

Justification: By encoding both frames with the same weights, both feature maps live in the same embedding space. This makes cross-frame comparison in the correlation layer geometrically meaningful — a feature responding to a corner in frame t will produce the same activation pattern for a corner in frame t+1. In contrast, the naive 6-channel stacked input used in DeepVO has no such symmetry guarantee.

Reference:

Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H., 2016. Fully-convolutional siamese networks for object tracking. ECCV Workshop on Visual Object Challenge.

2. MobileNetV2 Backbone

Choice: MobileNetV2 layers 0–13, trained from scratch.

Justification: Training data is synthetic (Blender-generated). MobileNetV2's depthwise separable convolutions reduce parameter count significantly compared to ResNet or VGG, reducing the risk of overfitting on limited synthetic data. Spatial stride of 16 (after layer 13) gives a 14×14 feature map for 224×224 input — sufficient resolution for texture matching on a planar scene while keeping computation tractable.

Reference:

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C., 2018. MobileNetV2: Inverted residuals and linear bottlenecks. CVPR 2018.

3. Correlation Layer

Choice: Explicit cross-frame feature correlation with max_displacement=4.

Justification: The correlation layer computes dot-product similarity between every location in feat_t and all displaced locations in feat_t1 within a 9×9 neighborhood. This gives the network an explicit motion signal before any learned processing — the inductive bias that features move smoothly between frames. Without this, the network must learn to compare features implicitly through subsequent convolutions, requiring more data and more parameters. This design was validated for optical flow (pixel-level motion) and the same argument applies to odometry (camera-level motion).

References:

Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T., 2015. FlowNet: Learning optical flow with convolutional networks. ICCV 2015.

Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T., 2017. FlowNet 2.0: Evolution of optical flow estimation with deep networks. CVPR 2017.

Sun, D., Yang, X., Liu, M.Y., Kautz, J., 2018. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. CVPR 2018.

4. Retaining feat_t and feat_t1 Alongside Correlation

Choice: Concatenate correlation map with both individual frame feature maps before spatial compression.

Justification: The correlation map encodes where things moved but discards appearance information. Individual frame features carry texture density, scene scale, and lighting context. For a downward-facing UAV over a planar scene, apparent texture scale directly encodes camera height, which is essential for accurate translation estimation in Z. PWC-Net demonstrated that combining the cost volume with per-frame features consistently outperforms using either alone.

Reference:

Sun, D., Yang, X., Liu, M.Y., Kautz, J., 2018. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. CVPR 2018.

5. LSTM for Temporal Consistency

Choice: 2-layer unidirectional LSTM, hidden size 256, hidden state carried across the full trajectory during inference.

Justification: Frame-to-frame pose predictions are independent without temporal modeling — each prediction has no knowledge of previous errors, causing rapid trajectory drift during dead-reckoning. The LSTM hidden state accumulates trajectory context, effectively learning to be consistent with its own history. DeepVO demonstrated that removing the LSTM from an otherwise identical architecture approximately doubles ATE on standard benchmarks. Hidden state is reset at trajectory boundaries and detached between batches to prevent gradient flow across trajectories.

References:

Wang, S., Clark, R., Wen, H., Trigoni, N., 2017. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional networks. ICRA 2017.

Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Computation 9(8), 1735–1780.

6. Separate Translation and Rotation Heads

Choice: Two independent linear layers after the shared FC, one for translation (→3) and one for rotation (→6).

Justification: Translation and rotation have different physical units (meters vs radians), different magnitudes, and different geometric properties. A shared head produces conflicting gradient signals because improving translation prediction may worsen rotation prediction and vice versa. Separate heads allow each to develop independently and are standard practice in pose regression literature.

Reference:

Kendall, A., Grimes, M., Cipolla, R., 2017. Geometric loss functions for camera pose regression with deep learning. CVPR 2017.

7. 6D Rotation Representation

Choice: Output 6 unconstrained numbers representing the first two columns of a 3×3 rotation matrix. Third column recovered via Gram-Schmidt orthogonalization.

Justification: All compact rotation representations with fewer than 5 dimensions have discontinuities on SO(3). Specifically:

Euler angles: gimbal lock at ±90° pitch, wrapping discontinuity at ±180°
Quaternions: double cover (q and -q represent the same rotation, creating two valid targets for the same pose), unit norm constraint breaks gradient flow
Axis-angle: discontinuity at 0° and 360°

The 6D representation is continuous everywhere on SO(3). Small changes in network output always produce small, predictable changes in the recovered rotation — which is exactly the smoothness assumption that gradient descent requires. Zhou et al. proved this is the minimum overhead representation satisfying this requirement.

Reference:

Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H., 2019. On the continuity of rotation representations in neural networks. CVPR 2019.

8. Geodesic Loss for Rotation

Choice: Angular distance on SO(3) computed via the atan2 formulation for numerical stability.

Justification: MSE on rotation parameters does not correspond to actual angular error. The geodesic distance θ is the true shortest path between two rotations on the SO(3) manifold:

θ = atan2(||skew(R_diff)||/2, (trace(R_diff)-1)/2)

where R_diff = R_pred^T @ R_gt. This θ is directly interpretable as angular error in radians and is exactly what you report in the evaluation table. Optimizing a loss function that matches the evaluation metric is a fundamental principle of supervised learning. The atan2 formulation is used instead of arccos because arccos has infinite gradient at 0 and π, causing NaN during training.

References:

Mahendran, S., Ali, H., Vidal, R., 2017. 3D pose regression using convolutional neural networks. CVPR Workshops 2017.

Hartley, R., Trumpf, J., Dai, Y., Li, H., 2013. Rotation averaging. IJCV 101(1), 86–128.

Training Details

Hyperparameter	Value	Justification
Optimizer	Adam	Adaptive LR, standard for deep odometry
Learning rate	1e-4	Standard starting LR for Adam on vision tasks
Weight decay	1e-4	L2 regularization, prevents overfitting on synthetic data
Scheduler	CosineAnnealingLR	Smooth LR decay, avoids sharp drops
Batch size	16	Balance between gradient quality and memory
Sequence length	10	10 frame pairs per LSTM unroll during training
λ_rot	100	Balances meter-scale translation with radian-scale rotation
Gradient clip	1.0	Prevents LSTM gradient explosion
LSTM dropout	0.3	Regularization between LSTM layers

Evaluation Metrics

ATE — Absolute Trajectory Error RMSE of per-frame translation error after Umeyama alignment of the full predicted trajectory to ground truth. Measures global consistency of the trajectory.

RTE — Relative Trajectory Error Mean per-frame translation error without global alignment. Measures local accuracy of individual pose predictions.

Mean Rotation Error Mean geodesic distance in degrees between predicted and ground truth rotation matrices across all frames.

Data Format

dataset/
  sequence_001/
    frames/
      frame_000000.png
      frame_000001.png
      ...
    poses.txt    ← 4x4 homogeneous transforms, 16 floats per line
    imu.txt      ← IMU readings: [ax, ay, az, gx, gy, gz] per line

Relative pose computed as:

T_rel = inv(T_world_t) @ T_world_t1
trans = T_rel[:3, 3]
R     = T_rel[:3, :3]

File Structure

branch_a/
├── model.py      BranchA, CNNEncoder, CorrelationLayer
├── loss.py       geodesic_loss_stable, branch_a_loss
├── dataset.py    UAVOdometryDataset
├── train.py      Training loop with TensorBoard logging
├── evaluate.py   Evaluation metrics + trajectory plots
├── utils.py      gram_schmidt, pose_to_matrix, Umeyama alignment
└── README.md     This file

References

Bertinetto et al., "Fully-Convolutional Siamese Networks for Object Tracking," ECCV 2016
Sandler et al., "MobileNetV2: Inverted Residuals and Linear Bottlenecks," CVPR 2018
Dosovitskiy et al., "FlowNet: Learning Optical Flow with Convolutional Networks," ICCV 2015
Ilg et al., "FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks," CVPR 2017
Sun et al., "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," CVPR 2018
Wang et al., "DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Networks," ICRA 2017
Hochreiter and Schmidhuber, "Long Short-Term Memory," Neural Computation 1997
Kendall et al., "Geometric Loss Functions for Camera Pose Regression with Deep Learning," CVPR 2017
Zhou et al., "On the Continuity of Rotation Representations in Neural Networks," CVPR 2019
Mahendran et al., "3D Pose Regression Using Convolutional Neural Networks," CVPR Workshops 2017
Hartley et al., "Rotation Averaging," IJCV 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Branch A — Vision Only Odometry

Overview

Network Architecture

Design Choices and Justification

1. Siamese CNN Encoder

2. MobileNetV2 Backbone

3. Correlation Layer

4. Retaining feat_t and feat_t1 Alongside Correlation

5. LSTM for Temporal Consistency

6. Separate Translation and Rotation Heads

7. 6D Rotation Representation

8. Geodesic Loss for Rotation

Training Details

Evaluation Metrics

Data Format

File Structure

References

FilesExpand file tree

vision_only

Directory actions

More options

Directory actions

More options

Latest commit

History

vision_only

Folders and files

parent directory

README.md

Branch A — Vision Only Odometry

Overview

Network Architecture

Design Choices and Justification

1. Siamese CNN Encoder

2. MobileNetV2 Backbone

3. Correlation Layer

4. Retaining feat_t and feat_t1 Alongside Correlation

5. LSTM for Temporal Consistency

6. Separate Translation and Rotation Heads

7. 6D Rotation Representation

8. Geodesic Loss for Rotation

Training Details

Evaluation Metrics

Data Format

File Structure

References