Dual-Domain Attention for Facial Emotion Recognition and Fusion Attention Network


Tags: Computer Science, Attention Mechanism, Deep Learning
Published: June 21, 2023
Author: Minh-Hai Tran, Advisor: Tram-Tran Nguyen-Quynh
📚 Full text in Vietnamese, slides, and poster available here

Attention:

The ideas in my thesis were shared with another person and have been published at the 2023 ICIP and ASK conferences, but I am not listed as an author (even though they were my ideas, namely proposed methods 1 and 2). So, if you read a paper or a thesis with similar ideas, please note that I am not the one who copied them :)
 

Summary of the thesis:

The recognition of human emotions is a highly applicable and extensively researched problem in the field of computer vision. Various applications, such as ensuring driving safety, analyzing learning conditions, and understanding customer and human psychology, demonstrate the significance of emotion recognition. However, existing emotion recognition tasks often overlook context, focusing solely on facial expressions. Context plays a crucial role, contributing to approximately 40% of human emotion identification in real-life situations.
Modern deep learning methods employing Attention mechanisms have achieved notable success on real-world datasets related to human facial expressions. Nevertheless, Attention-based methods still have limitations, such as incomplete coverage of essential aspects, and they are not widely utilized in context-aware human emotion recognition.
This thesis delves into the challenges of human emotion recognition and modern methods employing Attention mechanisms to address these challenges. Additionally, it proposes a Dual-Domain Attention Module for improved object classification by combining context and spatial domains to generate comprehensive attention feature maps. The thesis introduces Fusion Attention, a method leveraging both global and local attention mechanisms to identify optimal features before making emotion predictions.
Furthermore, the Dual-Domain Attention Module is applied to context-aware human emotion recognition, utilizing features extracted from facial data in the dataset. The results obtained on FER2013, RAF-DB, and EMOTIC datasets demonstrate the effectiveness of the proposed methods compared to contemporary approaches.

1. INTRODUCTION

The recognition of human emotions plays a crucial role in human-computer interaction systems. There are various methods to identify a person's emotions, such as voice, facial expressions, or gestures. Facial emotion recognition involves multiple factors, including the surrounding environment, known as context [6], or psychological factors [7]. As a result, it has significant impacts and applications in daily life as well as in related research fields.
In the field of computer science, facial emotion recognition is a broad and extensively researched topic. In Figure 1, various facial expressions are illustrated.
Figure 1. Examples of facial emotions
Currently, applications of facial expression analysis are widely implemented in real-life scenarios, such as criminal psychology analysis, ensuring driving safety [8], and in educational settings, like analyzing cognitive learning states [9]. Moreover, automated systems for detecting and recognizing facial emotions have the ability to analyze customer psychology [10], particularly in large sectors such as commerce and advertising. Customers often express their emotions through their faces when observing and evaluating products or when providing feedback on services. Additionally, there are numerous applications in managing and analyzing crowds for governmental agencies or managers in places like stadiums and supermarkets [11].
With the advancement of technology, current facial emotion recognition methods can be categorized into two branches based on two types of image data: controlled data and complex real-world data. Recent studies have achieved high accuracy, exceeding 90% on controlled datasets such as CK+ [12] and over 80% on MMI [13]. However, accuracy tends to be lower on complex real-world datasets such as FER2013 [14], where it remains below 80%, and EmotiW [15]. This study focuses on experimenting with and improving models on real-world datasets.

2. PROPOSED METHOD

2.1. Dual-Domain Attention

Attention mechanisms have become crucial in modern techniques for facial emotion recognition. In this chapter, I propose a new attention module called Dual-Domain Attention (DDA) that integrates local context from the spatial domain and global context from the context domain of the feature maps. The module learns residual attention over the feature maps of both domains to focus on the facial parts where emotions are expressed most clearly, such as the eyes, nose, mouth, and eyebrows. This enables the model to generate more meaningful representations of facial features, thereby improving accuracy in facial emotion recognition. Experiments on FER2013 and RAF-DB demonstrate superior performance compared to existing advanced methods.
Figure 2. Dual-Domain Attention integrated into the 4 stages of the ResNet network. (a) ResNet architecture. (b) Dual-Domain Attention in stage 2 of ResNet. (c) Dual-Domain Fusion used to combine the two domains. (d) Multi-level attention.
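The exact module is defined in the thesis; as a rough illustration only, below is a minimal PyTorch sketch of the general idea: one branch scores the spatial domain (local context), another re-weights the context domain (interpreted here as channel-wise attention, a global context), the two maps are fused, and the result is applied to the stage features through a residual connection. The layer choices (7x7 spatial convolution, reduction ratio of 16) are assumptions, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

class DualDomainAttention(nn.Module):
    """Minimal sketch of a dual-domain attention block (hypothetical layout).

    One branch attends over the spatial domain (local context), the other over the
    context/channel domain (global context); their maps are fused and applied to the
    input with a residual connection.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Context (global) branch: squeeze spatially, then re-weight channels.
        self.context = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial (local) branch: score each location of the feature map.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        context_map = self.context(x)       # (B, C, 1, 1)
        spatial_map = self.spatial(x)       # (B, 1, H, W)
        fused = context_map * spatial_map   # broadcasts to (B, C, H, W)
        return x + x * fused                # residual attention

# Example: attach after stage 2 of a ResNet backbone (128-channel feature maps assumed).
if __name__ == "__main__":
    dda = DualDomainAttention(channels=128)
    features = torch.randn(2, 128, 28, 28)
    print(dda(features).shape)  # torch.Size([2, 128, 28, 28])
```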
Table 1: Comparison with SOTA methods on FER2013 (left) and RAF-DB (right)

| Method (FER2013) | Accuracy | Method (RAF-DB) | Accuracy |
| --- | --- | --- | --- |
| Inception [44] | 71.60% | RAN [36] | 86.90% |
| MLCNNs [35] | 73.03% | SCN [45] | 87.03% |
| ResNet-50 + CBAM [29] | 73.39% | DACL [46] | 87.78% |
| ResMaskingNet [29] | 74.14% | KTN [47] | 88.07% |
| LHC-Net [41] | 74.42% | EfficientFace [48] | 88.36% |
| ResNet-50 + DDA (ours) | 74.67% | MViT [49] | 88.62% |
| ResNet-34 + DDA (ours) | 74.75% | DAN [38] | 89.70% |
|  |  | ResNet-50 + DDA (ours) | 89.96% |
Grad-CAM:
Resnet50 Grad-CAM on RAF-DB
Resnet50 + DDA Grad-CAM on RAF-DB
The Grad-CAM++ visualizations for two models, Resnet50 and Resnet50 with DDA, are presented above: on the left for Resnet50 and on the right for Resnet50 with DDA. Each sequence shows the original image, the map at the 3rd stage, and the map at the 4th stage. The Resnet50 model with DDA captures more features than the model without DDA, which is particularly evident at the 3rd stage. In most images using DDA, the attention is concentrated on the eyes, nose, and mouth, indicating a better prediction of emotions after passing through the final stage.
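The figures use Grad-CAM++ [62]; the snippet below is only a hedged sketch of plain Grad-CAM [61], implemented with forward and backward hooks on a torchvision ResNet-50, to show how such stage-wise heat maps can be produced. The chosen layer (stage 3), the random weights, and the input tensor are placeholders, not the trained FER model.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, layer, image, class_idx=None):
    """Plain Grad-CAM heat map from one layer (the thesis figures use Grad-CAM++).

    image: (1, 3, H, W) tensor, already normalized for the model.
    Returns an (H, W) heat map scaled to [0, 1].
    """
    activations, gradients = {}, {}
    fwd = layer.register_forward_hook(lambda m, i, o: activations.update(a=o))
    bwd = layer.register_full_backward_hook(lambda m, gi, go: gradients.update(g=go[0]))

    logits = model(image)
    if class_idx is None:
        class_idx = int(logits.argmax(dim=1))
    model.zero_grad()
    logits[0, class_idx].backward()
    fwd.remove()
    bwd.remove()

    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)           # channel importance
    cam = F.relu((weights * activations["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0].detach()

# Example with a randomly initialized ResNet-50 (stand-in for the trained FER model);
# `model.layer3` corresponds to the "3rd stage" maps discussed above.
model = models.resnet50(weights=None).eval()
heat_map = grad_cam(model, model.layer3, torch.randn(1, 3, 224, 224))
print(heat_map.shape)  # torch.Size([224, 224])
```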
Confusion matrices:
Confusion matrix of Resnet50 on RAF-DB
Confusion matrix of Resnet50 + DDA on RAF-DB
 
The confusion matrix illustrates the effectiveness of two models, Resnet50 and Resnet50 with DDA, on the RAF-DB dataset. It is noticeable that the performance of the disgust and fear classes significantly improves when using DDA. The accuracy increases from 61% to 70% for the disgust class and from 57% to 72% for the fear class. Furthermore, the accuracy for the neutral class also sees an improvement, reaching up to 93%. The highest accuracy is achieved for the happy class, reaching 95%.
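For reference, the per-class figures quoted here correspond to the diagonal of a row-normalized confusion matrix (per-class recall). A small sketch with hypothetical label arrays, using scikit-learn; the class order is illustrative, not the RAF-DB label encoding:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for the 7 basic emotion classes (stand-ins, not real results).
classes = ["surprise", "fear", "disgust", "happy", "sad", "angry", "neutral"]
y_true = np.random.randint(0, 7, size=1000)   # ground-truth class indices
y_pred = np.random.randint(0, 7, size=1000)   # model predictions

cm = confusion_matrix(y_true, y_pred, labels=range(7))
per_class_acc = cm.diagonal() / cm.sum(axis=1)   # row-normalized diagonal
for name, acc in zip(classes, per_class_acc):
    print(f"{name:>8s}: {acc:.2%}")
```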

2.2 Local and Global Fusion Attention Network

I employ a fusion method to combine the final features of two pretrained emotion recognition models. The goal of this combination is to minimize the weaknesses of each model while maximizing their strengths. To generate a new feature of the same size as the inputs, I first concatenate the two features and pass them through a multi-layer perceptron (MLP). The features are then averaged before passing through a self-attention block and a local attention block, and finally through a fully connected network to classify emotions.
Figure 3: Local and Global Attention Network
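Reading the description above literally, below is a minimal PyTorch sketch of one possible fusion head: the two final feature maps are concatenated, projected back to the original width by an MLP (1x1 convolutions), averaged with the raw features, refined by a global self-attention block over spatial positions and a convolution-based local attention block, and classified by a fully connected layer. The token layout, the depthwise 3x3 local gate, and all dimensions are assumptions, not the exact architecture of Figure 3.

```python
import torch
import torch.nn as nn

class FusionAttentionHead(nn.Module):
    """Sketch of a fusion head combining two backbones' final feature maps."""

    def __init__(self, channels: int = 512, num_classes: int = 7, num_heads: int = 8):
        super().__init__()
        # MLP (1x1 convolutions) mapping the concatenated features back to `channels`.
        self.mlp = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )
        # Global branch: self-attention over all spatial positions.
        self.global_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Local branch: depthwise 3x3 gate as a stand-in for local attention.
        self.local_attn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.Sigmoid(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # f1, f2: (B, C, H, W) final feature maps from the two pretrained models.
        fused = self.mlp(torch.cat([f1, f2], dim=1))
        fused = (fused + f1 + f2) / 3                     # average with the raw features
        b, c, h, w = fused.shape
        tokens = fused.flatten(2).transpose(1, 2)         # (B, H*W, C)
        g, _ = self.global_attn(tokens, tokens, tokens)   # global self-attention
        g = g.transpose(1, 2).reshape(b, c, h, w)
        l = self.local_attn(fused) * fused                # local attention gating
        out = self.pool(g + l).flatten(1)                 # (B, C)
        return self.classifier(out)

# Example with hypothetical 512-channel 7x7 feature maps from two ResNet-style backbones.
head = FusionAttentionHead(channels=512, num_classes=7)
logits = head(torch.randn(2, 512, 7, 7), torch.randn(2, 512, 7, 7))
print(logits.shape)  # torch.Size([2, 7])
```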
Table 2: Comparison with other fusion methods on RAF-DB

| Fusion method | Model 1 | Model 2 | RAF-DB (%) |
| --- | --- | --- | --- |
| Late fusion | Resnet18 | Resnet34 | 86.35 |
|  | VGG11 | VGG13 | 86.08 |
|  | VGG11 | Resnet34 | 86.08 |
|  | Resnet34 | Resnet50+DDA | 89.21 |
|  | Resnet50+DDA (ImageNet) | Resnet50+DDA (VGGFace2) | 89.66 |
| Early fusion | Resnet18 | Resnet34 | 86.66 |
|  | VGG13 | Resnet34 | 85.49 |
|  | VGG11 | Resnet34 | 86.08 |
|  | Resnet34 | Resnet50+DDA | 88.97 |
|  | Resnet50+DDA (ImageNet) | Resnet50+DDA (VGGFace2) | 89.65 |
| Joint fusion | Resnet18 | Resnet34 | 86.05 |
|  | VGG13 | Resnet34 | 86.63 |
|  | VGG11 | Resnet34 | 86.40 |
|  | Resnet34 | Resnet50+DDA | 89.70 |
|  | Resnet50+DDA (ImageNet) | Resnet50+DDA (VGGFace2) | 88.23 |
| Fusion attention (ours) | Resnet18 | Resnet34 | 90.95 |
|  | VGG13 | Resnet34 | 90.92 |
|  | Resnet34 | Resnet50+DDA | 92.54 |
|  | Resnet50+DDA (ImageNet) | Resnet50+DDA (VGGFace2) | 93.00 |
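For context, the late- and early-fusion baselines in Table 2 follow the usual definitions [63]: late fusion combines the class scores of independently trained models, while early fusion concatenates their features before a single classifier. A hedged sketch of both, assuming score averaging for late fusion and a linear classifier for early fusion:

```python
import torch
import torch.nn as nn

def late_fusion(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Late fusion: each model classifies on its own; the class probabilities are averaged."""
    return (logits_a.softmax(dim=-1) + logits_b.softmax(dim=-1)) / 2

class EarlyFusion(nn.Module):
    """Early fusion: the feature vectors are concatenated and a single classifier sees both."""

    def __init__(self, dim_a: int, dim_b: int, num_classes: int = 7):
        super().__init__()
        self.classifier = nn.Linear(dim_a + dim_b, num_classes)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.cat([feat_a, feat_b], dim=-1))

# Example with hypothetical 512-d Resnet18/Resnet34 features on the 7 RAF-DB classes.
probs = late_fusion(torch.randn(4, 7), torch.randn(4, 7))
early = EarlyFusion(dim_a=512, dim_b=512)
logits = early(torch.randn(4, 512), torch.randn(4, 512))
```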
Confusion matrices:
Our method using Resnet18 and Resnet34
Late fusion method using Resnet18 and Resnet34
All emotion classes in the proposed method achieve accuracy levels exceeding 80%. Particularly, the neutral, happy, and surprise emotion classes all attain accuracy levels surpassing 90%, with the neutral class reaching the highest accuracy of 99%. This highlights the strength of the model combining self-attention and local attention methods.

2.3 Embracing Context-Aware Emotion Recognition

Facial emotion recognition has seen significant progress in modern times, with many deep learning methods focusing on facial features. However, practical applications still pose numerous challenges when predicting human emotions, as they depend on various factors such as body language and the context of events rather than solely on facial expressions [13]. In this chapter, I propose a network that combines multiple models to extract features from both the face and body, providing predictions for 26 emotions on the EMOTIC dataset [18]. My approach achieves higher accuracy compared to recent methods that concentrate solely on facial and body features of the individuals.
I address this problem by reusing a DDA-based facial emotion recognition model for feature extraction and, to assess its effectiveness, I once again compare it against models without DDA for facial feature extraction. Furthermore, I construct additional body features by extracting features from a segmentation model as a third input. Finally, I integrate these features to make predictions across 26 emotion classes.
My primary contributions are as follows:
  • I propose a network that combines multiple models, each playing a role in feature extraction for different images, such as facial images, body images, and distinctive body feature images.
  • I apply a model utilizing DDA for optimal feature extraction and experiment with various models for comparison.
  • My experiments on EMOTIC yield promising results when using only facial and body features as input features.
Figure 4: Proposed method for contextual emotion recognition
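As a rough illustration of the arrangement in Figure 4, the sketch below concatenates a face feature (from a DDA backbone), a body feature, and a segmentation-derived body feature, and maps them to 26 category scores. EMOTIC category recognition is treated here as a multi-label problem with a sigmoid/BCE objective; the feature dimensions, head layout, and loss are assumptions, not the thesis configuration.

```python
import torch
import torch.nn as nn

class ContextEmotionNet(nn.Module):
    """Sketch of a three-branch fusion head for EMOTIC's 26 emotion categories.

    The three feature extractors (face model with DDA, body model, segmentation
    encoder) are assumed to be pretrained and are supplied by the caller.
    """

    def __init__(self, face_dim: int = 2048, body_dim: int = 512,
                 seg_dim: int = 512, num_classes: int = 26):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(face_dim + body_dim + seg_dim, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, face_feat, body_feat, seg_feat):
        return self.head(torch.cat([face_feat, body_feat, seg_feat], dim=-1))

# Multi-label objective (assumption): one sigmoid score per emotion category.
model = ContextEmotionNet()
criterion = nn.BCEWithLogitsLoss()
scores = model(torch.randn(4, 2048), torch.randn(4, 512), torch.randn(4, 512))
loss = criterion(scores, torch.randint(0, 2, (4, 26)).float())
```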
Table 3: Experiments on the EMOTIC dataset (X = no third model)

| Model 1 | Model 2 | Model 3 | mAP |
| --- | --- | --- | --- |
| Resnet50 | Resnet34-DDA | Resnet34 | 29.8 |
| Resnet50 | Resnet34-DDA | X | 29.4 |
| Resnet50-DDA | Resnet18 | Resnet34 | 30.3 |
| Resnet50-DDA | Resnet18 | X | 29.6 |
| Resnet50-DDA | Resnet34-DDA | X | 30.6 |
| Resnet50-DDA | Resnet34-DDA | Resnet34 | 31.7 |
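The mAP values in Table 3 are the mean of the per-category average precision over the 26 EMOTIC classes. A short sketch of that metric with hypothetical score arrays, using scikit-learn's average_precision_score:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical outputs: 1000 test samples, 26 EMOTIC emotion categories.
y_true = np.random.randint(0, 2, size=(1000, 26))   # multi-hot ground truth (stand-in)
y_score = np.random.rand(1000, 26)                  # per-category sigmoid scores (stand-in)

ap_per_class = [average_precision_score(y_true[:, c], y_score[:, c]) for c in range(26)]
mean_ap = 100 * float(np.mean(ap_per_class))        # scaled to 0-100 to match Table 3's style
print(f"mAP: {mean_ap:.1f}")
```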

3. CONCLUSION

Throughout the research and implementation of this thesis, I have proposed two attention-based methods: Dual-Domain Attention and Fusion Attention, utilizing both self-attention and local attention for facial emotion recognition. Additionally, pretrained models for facial emotion expression have been employed to address context-aware emotion recognition. The effectiveness of the Dual-Domain Attention module has been demonstrated once again.
 
© Copyright by Harly. Do not reupload.

REFERENCES

[1]       L. E. Ishii, J. C. Nellis, K. D. Boahene, P. Byrne, and M. Ishii, “The importance and psychology of facial expression,” Otolaryngol. Clin. North Am., vol. 51, no. 6, pp. 1011–1017, 2018.
[2]       L. M. Mayo, J. Lindé, H. Olausson, and M. Heilig, “Putting a good face on touch: Facial expression reflects the affective valence of caress-like touch across modalities,” Biol. Psychol., vol. 137, pp. 83–90, 2018.
[3]       G. Tavares, A. Mourao, and J. Magalhaes, “Crowdsourcing facial expressions for affective-interaction,” Comput. Vis. Image Underst., vol. 147, pp. 102–113, 2016.
[4]       P. Ekman and W. V. Friesen, “Constants across cultures in the face and emotion.,” J. Pers. Soc. Psychol., vol. 17, no. 2, p. 124, 1971.
[5]       E. L. Rosenberg and P. Ekman, What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, 2020.
[6]       J. Yang, T. Qian, F. Zhang, and S. U. Khan, “Real-time facial expression recognition based on edge computing,” IEEE Access, vol. 9, pp. 76178–76190, 2021.
[7]       H. Aviezer, Y. Trope, and A. Todorov, “Body cues, not facial expressions, discriminate between intense positive and negative emotions,” Science, vol. 338, no. 6111, pp. 1225–1229, 2012.
[8]       J. K. McNulty and F. D. Fincham, “Beyond positive psychology? Toward a contextual view of psychological processes and well-being.,” Am. Psychol., vol. 67, no. 2, p. 101, 2012.
[9]       M. Jeong and B. C. Ko, “Driver’s facial expression recognition in real-time for safe driving,” Sensors, vol. 18, no. 12, p. 4270, 2018.
[10]     S. Zhong and J. Ghosh, “Decision boundary focused neural network classifier,” Intell. Eng. Syst. Artif. Neural Netw. ANNIE, 2000.
[11]     L. T. Dang, E. W. Cooper, and K. Kamei, “Development of facial expression recognition for training video customer service representatives,” in 2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2014, pp. 1297–1303.
[12]     S. J. Goyal, A. K. Upadhyay, R. S. Jadon, and R. Goyal, “Real-life facial expression recognition systems: a review,” Smart Comput. Inform., pp. 311–331, 2018.
[13]     L. F. Barrett, B. Mesquita, and M. Gendron, “Context in emotion perception,” Curr. Dir. Psychol. Sci., vol. 20, no. 5, pp. 286–290, 2011.
[14]     P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression,” in 2010 ieee computer society conference on computer vision and pattern recognition-workshops, IEEE, 2010, pp. 94–101.
[15]     M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in 2005 IEEE international conference on multimedia and Expo, IEEE, 2005, p. 5 pp.
[16]     I. J. Goodfellow et al., “Challenges in representation learning: A report on three machine learning contests,” in Neural Information Processing: 20th International Conference, ICONIP 2013, Daegu, Korea, November 3-7, 2013. Proceedings, Part III 20, Springer, 2013, pp. 117–124.
[17]     A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey, and T. Gedeon, “From individual to group-level emotion recognition: Emotiw 5.0,” in Proceedings of the 19th ACM international conference on multimodal interaction, 2017, pp. 524–528.
[18]     R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza, “EMOTIC: Emotions in Context dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 61–69.
[19]     R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza, “Context Based Emotion Recognition using EMOTIC Dataset,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–1, 2019, doi: 10.1109/TPAMI.2019.2916866.
[20]     G. Doherty-Sneddon, A. Anderson, C. O’malley, S. Langton, S. Garrod, and V. Bruce, “Face-to-face and video-mediated communication: A comparison of dialogue structure and task performance.,” J. Exp. Psychol. Appl., vol. 3, no. 2, p. 105, 1997.
[21]     A. Mehrabian, “Framework for a comprehensive description and measurement of emotional states.,” Genet. Soc. Gen. Psychol. Monogr., 1995.
[22]     S. Du, Y. Tao, and A. M. Martinez, “Compound facial expressions of emotion,” Proc. Natl. Acad. Sci., vol. 111, no. 15, pp. E1454–E1462, 2014.
[23]     M. A. Nicolaou, H. Gunes, and M. Pantic, “Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space,” IEEE Trans. Affect. Comput., vol. 2, no. 2, pp. 92–105, 2011.
[24]     K. Schindler, L. Van Gool, and B. De Gelder, “Recognizing emotions expressed by body pose: A biologically inspired neural model,” Neural Netw., vol. 21, no. 9, pp. 1238–1246, 2008.
[25]     J. Zou, Y. Han, and S.-S. So, “Overview of artificial neural networks,” Artif. Neural Netw. Methods Appl., pp. 14–22, 2009.
[26]     S. Sharma, S. Sharma, and A. Athaiya, “Activation functions in neural networks,” Data Sci, vol. 6, no. 12, pp. 310–316, 2017.
[27]     D. Svozil, V. Kvasnicka, and J. Pospichal, “Introduction to multi-layer feed-forward neural networks,” Chemom. Intell. Lab. Syst., vol. 39, no. 1, pp. 43–62, 1997.
[28]     R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in Neural networks for perception, Elsevier, 1992, pp. 65–93.
[29]     K. O’Shea and R. Nash, “An introduction to convolutional neural networks,” ArXiv Prepr. ArXiv151108458, 2015.
[30]     Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[31]     K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv, Apr. 10, 2015. doi: 10.48550/arXiv.1409.1556.
[32]     K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[33]     C. Szegedy et al., “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[34]     S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
[35]     F. Wang et al., “Residual attention network for image classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3156–3164.
[36]     L. Pham, T. H. Vu, and T. A. Tran, “Facial expression recognition using residual masking network,” in 2020 25Th international conference on pattern recognition (ICPR), IEEE, 2021, pp. 4513–4519.
[37]     K. Liu, M. Zhang, and Z. Pan, “Facial expression recognition with CNN ensemble,” in 2016 international conference on cyberworlds (CW), IEEE, 2016, pp. 163–166.
[38]     Y. Tang, “Deep learning using linear support vector machines,” ArXiv Prepr. ArXiv13060239, 2013.
[39]     A. Mollahosseini, D. Chan, and M. H. Mahoor, “Going deeper in facial expression recognition using deep neural networks,” in 2016 IEEE Winter conference on applications of computer vision (WACV), IEEE, 2016, pp. 1–10.
[40]     J. Yan, W. Zheng, Z. Cui, and P. Song, “A joint convolutional bidirectional LSTM framework for facial expression recognition,” IEICE Trans. Inf. Syst., vol. 101, no. 4, pp. 1217–1220, 2018.
[41]     B.-K. Kim, J. Roh, S.-Y. Dong, and S.-Y. Lee, “Hierarchical committee of deep convolutional neural networks for robust facial expression recognition,” J. Multimodal User Interfaces, vol. 10, pp. 173–189, 2016.
[42]     H.-D. Nguyen, S.-H. Kim, G.-S. Lee, H.-J. Yang, I.-S. Na, and S.-H. Kim, “Facial Expression Recognition Using a Temporal Ensemble of Multi-Level Convolutional Neural Networks,” IEEE Trans. Affect. Comput., vol. 13, no. 1, pp. 226–237, Jan. 2022, doi: 10.1109/TAFFC.2019.2946540.
[43]     K. Wang, X. Peng, J. Yang, D. Meng, and Y. Qiao, “Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition.” arXiv, Sep. 04, 2019. Accessed: Dec. 29, 2022. [Online]. Available: http://arxiv.org/abs/1905.04075
[44]     Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), IEEE, 2018, pp. 67–74.
[45]     Z. Wen, W. Lin, T. Wang, and G. Xu, “Distract your attention: Multi-head cross attention network for facial expression recognition,” ArXiv Prepr. ArXiv210907270, 2021.
[46]     M.-H. Hoang, S.-H. Kim, H.-J. Yang, and G.-S. Lee, “Context-aware emotion recognition based on visual relationship detection,” IEEE Access, vol. 9, pp. 90465–90474, 2021.
[47]     T. Mittal, P. Guhan, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha, “Emoticon: Context-aware multimodal emotion recognition using frege’s principle,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14234–14243.
[48]     S. Li, W. Deng, and J. Du, “Reliable Crowdsourcing and Deep Locality-Preserving Learning for Expression Recognition in the Wild,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI: IEEE, Jul. 2017, pp. 2584–2593. doi: 10.1109/CVPR.2017.277.
[49]     R. Jin, S. Zhao, Z. Hao, Y. Xu, T. Xu, and E. Chen, “AVT: Au-Assisted Visual Transformer for Facial Expression Recognition,” in 2022 IEEE International Conference on Image Processing (ICIP), IEEE, 2022, pp. 2661–2665.
[50]     R. Pecoraro, V. Basile, and V. Bono, “Local multi-head channel self-attention for facial expression recognition,” Information, vol. 13, no. 9, p. 419, 2022.
[51]     C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 325–341.
[52]     J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[53]     L. Liu et al., “On the variance of the adaptive learning rate and beyond,” ArXiv Prepr. ArXiv190803265, 2019.
[54]     A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, 2017.
[55]     C. Pramerdorfer and M. Kampel, “Facial expression recognition using convolutional neural networks: state of the art,” ArXiv Prepr. ArXiv161202903, 2016.
[56]     K. Wang, X. Peng, J. Yang, S. Lu, and Y. Qiao, “Suppressing uncertainties for large-scale facial expression recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6897–6906.
[57]     A. H. Farzaneh and X. Qi, “Facial expression recognition in the wild via deep attentive center loss,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 2402–2411.
[58]     H. Li, N. Wang, X. Ding, X. Yang, and X. Gao, “Adaptively learning facial expression representation via cf labels and distillation,” IEEE Trans. Image Process., vol. 30, pp. 2016–2028, 2021.
[59]     Z. Zhao, Q. Liu, and F. Zhou, “Robust lightweight facial expression recognition network with label distribution training,” in Proceedings of the AAAI conference on artificial intelligence, 2021, pp. 3510–3519.
[60]     H. Li, M. Sui, F. Zhao, Z. Zha, and F. Wu, “MVT: mask vision transformer for facial expression recognition in the wild,” ArXiv Prepr. ArXiv210604520, 2021.
[61]     R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
[62]     A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,” in 2018 IEEE winter conference on applications of computer vision (WACV), IEEE, 2018, pp. 839–847.
[63]     K. Gadzicki, R. Khamsehashari, and C. Zetzsche, “Early vs Late Fusion in Multimodal Convolutional Neural Networks,” in 2020 IEEE 23rd International Conference on Information Fusion (FUSION), Jul. 2020, pp. 1–6. doi: 10.23919/FUSION45008.2020.9190246.
[64]     S. Liu, M. Li, Z. Zhang, B. Xiao, and X. Cao, “Multimodal ground-based cloud classification using joint fusion convolutional neural network,” Remote Sens., vol. 10, no. 6, p. 822, 2018.
[65]     K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778. Accessed: Apr. 03, 2023. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html
[66]     A. López-Cifuentes, M. Escudero-Vinolo, J. Bescós, and Á. García-Martín, “Semantic-aware scene recognition,” Pattern Recognit., vol. 102, p. 107256, 2020.
[67]     S. I. Serengil and A. Ozpinar, “Lightface: A hybrid deep face recognition framework,” in 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), IEEE, 2020, pp. 1–5.
[68]     X. Ren, A. Lattas, B. Gecer, J. Deng, C. Ma, and X. Yang, “Facial geometric detail recovery via implicit representation,” in 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), IEEE, 2023, pp. 1–8.
[69]     O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, Springer, 2015, pp. 234–241.
[70]     T.-Y. Lin et al., “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, 2014, pp. 740–755.
[71]     W. Liu et al., “Ssd: Single shot multibox detector,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, Springer, 2016, pp. 21–37.