TY - GEN
T1 - Multi-stream Multi-attention Deep Neural Network for Context-Aware Human Action Recognition
AU - Rashmi, M.
AU - Guddeti, Ram Mohana Reddy
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Technological innovations in deep learning models have enabled solutions approaching human-level performance on a wide variety of computer vision tasks, such as object detection and face recognition. In contrast, Human Action Recognition (HAR) remains far from human-level ability due to several challenges, such as the diversity in how actions are performed. Because RGB-D cameras record data in multiple modalities, HAR using such video data is widely studied in current research. This paper proposes an approach for recognizing human actions using depth and skeleton data captured by the Kinect depth sensor. Attention modules, introduced in recent years, help models focus on the most important features in computer vision tasks. This paper proposes a multi-stream deep learning model with multiple attention blocks for HAR. First, the action data of the depth and skeletal modalities are represented using two distinct action descriptors, each of which generates an image from the action data gathered across numerous frames. The proposed deep learning model is trained on these descriptors. Additionally, we propose a set of score fusion techniques for accurate HAR using all the features and the trained CNN + LSTM streams. The proposed method is evaluated on two benchmark datasets using the well-known cross-subject evaluation protocol. The proposed technique achieved 89.83% and 90.7% accuracy on the MSRAction3D and UTD-MHAD datasets, respectively. The experimental results establish the validity and effectiveness of the proposed model.
AB - Technological innovations in deep learning models have enabled solutions approaching human-level performance on a wide variety of computer vision tasks, such as object detection and face recognition. In contrast, Human Action Recognition (HAR) remains far from human-level ability due to several challenges, such as the diversity in how actions are performed. Because RGB-D cameras record data in multiple modalities, HAR using such video data is widely studied in current research. This paper proposes an approach for recognizing human actions using depth and skeleton data captured by the Kinect depth sensor. Attention modules, introduced in recent years, help models focus on the most important features in computer vision tasks. This paper proposes a multi-stream deep learning model with multiple attention blocks for HAR. First, the action data of the depth and skeletal modalities are represented using two distinct action descriptors, each of which generates an image from the action data gathered across numerous frames. The proposed deep learning model is trained on these descriptors. Additionally, we propose a set of score fusion techniques for accurate HAR using all the features and the trained CNN + LSTM streams. The proposed method is evaluated on two benchmark datasets using the well-known cross-subject evaluation protocol. The proposed technique achieved 89.83% and 90.7% accuracy on the MSRAction3D and UTD-MHAD datasets, respectively. The experimental results establish the validity and effectiveness of the proposed model.
UR - https://www.scopus.com/pages/publications/85138492907
UR - https://www.scopus.com/pages/publications/85138492907#tab=citedBy
U2 - 10.1109/TENSYMP54529.2022.9864404
DO - 10.1109/TENSYMP54529.2022.9864404
M3 - Conference contribution
AN - SCOPUS:85138492907
T3 - 2022 IEEE Region 10 Symposium, TENSYMP 2022
BT - 2022 IEEE Region 10 Symposium, TENSYMP 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE Region 10 Symposium, TENSYMP 2022
Y2 - 1 July 2022 through 3 July 2022
ER -