Abstract
Human action recognition from skeleton data has attracted considerable research attention, driven by the availability of thousands of real videos that pose many challenges. Existing work has attempted to model the spatial characteristics and temporal dependencies of 3D joints using dynamic time warping, hand-crafted features, and spatial co-occurrence features. However, the representation derived from the spatial stream overemphasizes the temporal information and therefore has limited expressive power. Some studies treat skeleton sequences as frames to enhance the expressive power of the representation, but they lose generalization capability because the derived temporal smoothness is specific to a particular dataset. The proposed work uses joint distance maps as a base representation that encodes the spatial and temporal information into color texture images. We increase the expressive power by extracting feature maps from networks pre-trained on ImageNet, which diversifies the texture representation, and we propose a network architecture that models the temporal dependency explicitly. We also explore various fusion strategies to generate diverse representations from the feature maps of the pre-trained networks. The experimental results show that the proposed method achieves its best recognition accuracy when using decision-level fusion with a Random Forest meta-learner. The analysis also reveals that feature-level fusion offers a good trade-off: its recognition performance is on par with some decision-level fusion strategies while it has fewer tunable parameters. Extensive experimental results and comparative analysis on three benchmark datasets demonstrate that the proposed representation and network not only yield better recognition accuracy but also exhibit stronger generalization capability across multiple datasets.
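As a rough illustration of the joint distance map idea mentioned in the abstract, the sketch below builds a map whose rows are joint pairs and whose columns are frames, so that one image carries both spatial (pairwise distance) and temporal (frame order) information. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `joint_distance_map`, the Euclidean distance over all joint pairs, the global min-max scaling to 0..255, and the toy input are all hypothetical choices, and the color mapping and any resizing used in the paper are omitted.

```python
import math
from itertools import combinations


def joint_distance_map(sequence):
    """Build a single-channel joint distance map (hypothetical helper).

    sequence: list of frames; each frame is a list of (x, y, z) joint positions.
    Returns a list of rows, one per joint pair, with one value per frame,
    scaled to integers in 0..255 (assumed normalization, not from the paper).
    """
    num_joints = len(sequence[0])
    pairs = list(combinations(range(num_joints), 2))

    # Rows index joint pairs (spatial axis); columns index frames (temporal axis).
    raw = [[math.dist(frame[a], frame[b]) for frame in sequence] for a, b in pairs]

    # Global min-max scaling so the map can be stored as an 8-bit image channel.
    lo = min(min(row) for row in raw)
    hi = max(max(row) for row in raw)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant map
    return [[round(255 * (d - lo) / span) for d in row] for row in raw]


# Toy input: 2 frames, 3 joints (illustrative values only).
toy_sequence = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
]
jdm = joint_distance_map(toy_sequence)
```

With J joints and T frames the map has J(J-1)/2 rows and T columns. A map like this would then typically be converted to a color image and passed to a pre-trained network for feature extraction, a step this sketch leaves out.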
| Original language | English |
|---|---|
| Pages (from-to) | 3729-3746 |
| Number of pages | 18 |
| Journal | Journal of Ambient Intelligence and Humanized Computing |
| Volume | 13 |
| Issue number | 8 |
| Publication status | Published - Aug 2022 |
| Externally published | Yes |
Keywords
- 3D skeleton data
- Convolutional neural networks
- Decision-level fusion
- Feature-level fusion
- Human action recognition
- Long short-term memory networks
Fingerprint
Dive into the research topics of 'Skeleton-based human action recognition with sequential convolutional-LSTM networks and fusion strategies'. Together they form a unique fingerprint.