Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.
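As a rough illustration of the disentangled multi-scale aggregation mentioned above, the sketch below builds a k-hop adjacency mask in which a node connects only to itself and to nodes at shortest-path distance exactly k. It is a minimal pure-Python sketch, not the repository's implementation: the function names and the toy 4-joint chain are assumptions made for illustration, and the degree normalization applied in practice is omitted.

```python
from collections import deque


def hop_distances(num_nodes, edges):
    """All-pairs shortest-path hop counts, via one BFS per source node."""
    neighbors = [[] for _ in range(num_nodes)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    dist = [[None] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[src][v] is None:  # not visited yet
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def disentangled_adjacency(num_nodes, edges, k):
    """Binary mask: 1 on the diagonal and where hop distance == k, else 0."""
    dist = hop_distances(num_nodes, edges)
    return [
        [1 if (i == j or dist[i][j] == k) else 0 for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


# Hypothetical toy skeleton: a chain of four joints 0-1-2-3.
chain = [(0, 1), (1, 2), (2, 3)]
A1 = disentangled_adjacency(4, chain, 1)  # self-loops plus direct neighbors
A2 = disentangled_adjacency(4, chain, 2)  # self-loops plus exactly-2-hop nodes
```

Unlike taking powers of the adjacency matrix, this mask gives closer nodes no extra weight at larger scales, since each scale sees only the nodes at exactly that distance.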
@inproceedings{liu2020disentangling,
title={Disentangling and unifying graph convolutions for skeleton-based action recognition},
author={Liu, Ziyu and Zhang, Hongwen and Chen, Zhenghao and Wang, Zhiyong and Ouyang, Wanli},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
pages={143--152},
year={2020}
}
We release numerous checkpoints trained with various modalities and annotations on NTURGB+D and NTURGB+D 120. The accuracy of each modality links to the weight file.
| Dataset | Annotation | Joint Top1 | Bone Top1 | Joint Motion Top1 | Bone Motion Top1 | Two-Stream Top1 | Four-Stream Top1 |
|---|---|---|---|---|---|---|---|
| NTURGB+D XSub | Official 3D Skeleton | joint_config: 69.8 | bone_config: 65.7 | joint_motion_config: 66.3 | bone_motion_config: 65.5 | 72.1 | 74.8 |
| NTURGB+D XView | Official 3D Skeleton | joint_config: 77.2 | bone_config: 72.0 | joint_motion_config: 72.4 | bone_motion_config: 68.3 | 78.4 | 80.9 |
| NTURGB+D 120 XSub | Official 3D Skeleton | joint_config: 59.4 | bone_config: 56.7 | joint_motion_config: 56.6 | bone_motion_config: 52.3 | 61.0 | 62.7 |
| NTURGB+D 120 XSet | Official 3D Skeleton | joint_config: 61.0 | bone_config: 57.8 | joint_motion_config: 59.4 | bone_motion_config: 54.6 | 62.2 | 64.6 |
We also provide numerous checkpoints trained with BFL (Balanced Representation Learning) on NTURGB+D. The accuracy of each modality links to the weight file.
| Dataset | Annotation | Joint Top1 | Bone Top1 | Skip Top1 | Joint Motion Top1 | Bone Motion Top1 | Skip Motion Top1 | Two-Stream Top1 | Four-Stream Top1 | Six-Stream Top1 |
|---|---|---|---|---|---|---|---|---|---|---|
| NTURGB+D XSub | Official 3D Skeleton | joint_config: 78.0 | bone_config: 77.0 | skip_config: 78.9 | joint_motion_config: 75.4 | bone_motion_config: 73.9 | skip_motion_config: 73.8 | 80.7 | 81.8 | 82.4 |
| NTURGB+D XView | Official 3D Skeleton | joint_config: 82.4 | bone_config: 81.0 | skip_config: 81.6 | joint_motion_config: 79.4 | bone_motion_config: 77.1 | skip_motion_config: 76.9 | 84.2 | 85.3 | 85.7 |
Note
- We use the linear-scaling learning rate (Initial LR ∝ Batch Size). If you change the training batch size, remember to change the initial LR proportionally.
- For Two-Stream results, we adopt the 1 (Joint):1 (Bone) fusion. For Four-Stream results, we adopt the 2 (Joint):2 (Bone):1 (Joint Motion):1 (Bone Motion) fusion. For Six-Stream results, we adopt the 2 (Joint):2 (Bone):2 (Skip):1 (Joint Motion):1 (Bone Motion):1 (Skip Motion) fusion.
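The linear-scaling rule in the first note can be sketched as a one-line computation. The helper name and the numbers below (base LR 0.1 at batch size 128) are hypothetical and are not taken from the configs; read the actual base values from the config file you are using.

```python
def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Linear scaling: the initial LR changes in proportion to the batch size."""
    return base_lr * new_batch_size / base_batch_size


# Hypothetical values: a config tuned with LR 0.1 at batch size 128,
# retrained with batch size 64, halves the initial LR.
new_lr = scale_lr(base_lr=0.1, base_batch_size=128, new_batch_size=64)
print(new_lr)
```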
You can use the following command to train a model.
bash tools/dist_train.sh ${CONFIG_FILE} ${NUM_GPUS} [optional arguments]
# For example: train MSG3D on NTURGB+D XSub (Joint Modality) with one GPU, with validation, and test the last and the best (with best validation metric) checkpoint.
bash tools/dist_train.sh configs/msg3d/ntu60_xsub_LT_msg3d/j.py 1 --validate --test-last --test-best
You can use the following command to test a model.
bash tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${NUM_GPUS} [optional arguments]
# For example: test MSG3D on NTURGB+D XSub (Joint Modality) with metrics `top_k_accuracy`, and dump the result to `result.pkl`.
bash tools/dist_test.sh configs/msg3d/ntu60_xsub_LT_msg3d/j.py checkpoints/SOME_CHECKPOINT.pth 1 --eval top_k_accuracy --out result.pkl
You can use the following command to ensemble the results of different modalities.
cd ./tools
python ensemble.py
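
The weighted fusion described in the notes (for example 2:2:1:1 for Four-Stream) amounts to a weighted sum of per-class scores across modalities, followed by an argmax. The sketch below illustrates that arithmetic on made-up scores. It does not reproduce `ensemble.py`: the function names, the in-memory data layout, and the toy numbers are all assumptions.

```python
def fuse_scores(stream_scores, weights):
    """Weighted sum of per-class scores over streams.

    stream_scores: dict mapping stream name -> list of per-sample score lists.
    weights: dict mapping stream name -> fusion weight.
    """
    names = list(weights)
    num_samples = len(stream_scores[names[0]])
    fused = []
    for n in range(num_samples):
        total = [0.0] * len(stream_scores[names[0]][n])
        for name in names:
            for c, score in enumerate(stream_scores[name][n]):
                total[c] += weights[name] * score
        fused.append(total)
    return fused


def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        predicted = max(range(len(sample_scores)), key=sample_scores.__getitem__)
        hits += int(predicted == label)
    return hits / len(labels)


# Made-up scores for 2 samples and 3 classes (illustration only).
scores = {
    "joint": [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    "bone": [[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]],
    "joint_motion": [[0.3, 0.6, 0.1], [0.1, 0.2, 0.7]],
    "bone_motion": [[0.3, 0.6, 0.1], [0.2, 0.6, 0.2]],
}
labels = [0, 1]

# Four-Stream weights from the notes: 2 (Joint) : 2 (Bone) : 1 : 1.
four_stream = {"joint": 2, "bone": 2, "joint_motion": 1, "bone_motion": 1}
fused = fuse_scores(scores, four_stream)
accuracy = top1_accuracy(fused, labels)
```

With real results, each stream's scores would come from the file written by `--out` during testing; the exact structure of that file is not shown here and should be checked before loading it.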