TY - GEN
T1 - Actor-Centric Spatio-Temporal Feature Extraction for Action Recognition
AU - Anil, Kunchala
AU - Bouroche, Mélanie
AU - Schoen-Phelan, Bianca
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
PY - 2024
Y1 - 2024
N2 - Action understanding involves the recognition and detection of specific actions within videos. This crucial task in computer vision has gained significant attention due to its multitude of applications across various domains. Current action detection models, inspired by 2D object detection methods, employ two-stage architectures. The first stage extracts actor-centric video sub-clips, i.e. tubelets of individuals, and the second stage classifies these tubelets using action recognition networks. The majority of these recognition models utilize a frame-level pre-trained 3D Convolutional Neural Network (3D CNN) to extract spatio-temporal features of a given tubelet. This, however, results in a suboptimal spatio-temporal feature representation for action recognition, primarily because the actor typically occupies a relatively small area of the frame. This work proposes the use of actor-centric tubelets instead of frames to learn spatio-temporal feature representations for action recognition. We present an empirical study of actor-centric tubelet and frame-level action recognition models and propose a baseline for actor-centric action recognition. We evaluated the proposed method on the state-of-the-art C3D, I3D, and SlowFast 3D CNN architectures using the NTURGBD dataset. Our results demonstrate that the actor-centric feature extractor consistently outperforms the frame-level and large pre-trained, fine-tuned models. The source code for the tubelet generation is available at https://github.com/anilkunchalaece/ntu_tubelet_parser.
AB - Action understanding involves the recognition and detection of specific actions within videos. This crucial task in computer vision has gained significant attention due to its multitude of applications across various domains. Current action detection models, inspired by 2D object detection methods, employ two-stage architectures. The first stage extracts actor-centric video sub-clips, i.e. tubelets of individuals, and the second stage classifies these tubelets using action recognition networks. The majority of these recognition models utilize a frame-level pre-trained 3D Convolutional Neural Network (3D CNN) to extract spatio-temporal features of a given tubelet. This, however, results in a suboptimal spatio-temporal feature representation for action recognition, primarily because the actor typically occupies a relatively small area of the frame. This work proposes the use of actor-centric tubelets instead of frames to learn spatio-temporal feature representations for action recognition. We present an empirical study of actor-centric tubelet and frame-level action recognition models and propose a baseline for actor-centric action recognition. We evaluated the proposed method on the state-of-the-art C3D, I3D, and SlowFast 3D CNN architectures using the NTURGBD dataset. Our results demonstrate that the actor-centric feature extractor consistently outperforms the frame-level and large pre-trained, fine-tuned models. The source code for the tubelet generation is available at https://github.com/anilkunchalaece/ntu_tubelet_parser.
KW - action detection
KW - action recognition
KW - untrimmed action detection in extended videos
UR - https://www.scopus.com/pages/publications/85200402116
U2 - 10.1007/978-3-031-58181-6_50
DO - 10.1007/978-3-031-58181-6_50
M3 - Conference contribution
AN - SCOPUS:85200402116
SN - 9783031581809
T3 - Communications in Computer and Information Science
SP - 586
EP - 599
BT - Computer Vision and Image Processing - 8th International Conference, CVIP 2023, Revised Selected Papers
A2 - Kaur, Harkeerat
A2 - Jakhetiya, Vinit
A2 - Goyal, Puneet
A2 - Khanna, Pritee
A2 - Raman, Balasubramanian
A2 - Kumar, Sanjeev
PB - Springer Science and Business Media Deutschland GmbH
T2 - 8th International Conference on Computer Vision and Image Processing, CVIP 2023
Y2 - 3 November 2023 through 5 November 2023
ER -