Automated Deep Learning Approaches for Multimodal Emotion Recognition: A Review of Fusion Strategies, Modalities and Architectures

Authors

  • Raja Abdulrahman University of Sahiwal, Sahiwal, 57000, Pakistan
  • Aleena Jamil University of Sahiwal, Sahiwal, 57000, Pakistan
  • Adeen Amjad University of Sahiwal, Sahiwal, 57000, Pakistan
  • Shafiq Hussain University of Sahiwal, Sahiwal, 57000, Pakistan
  • Muhammad Azhar Hong Kong Shue Yan University, Hong Kong SAR, China
  • Zunaira Aslam University of Sahiwal, Sahiwal, 57000, Pakistan
  • Ifra Shabbir COMSATS University Islamabad, Islamabad, 44000, Pakistan
  • Waqar Ahmad University of Sahiwal, Sahiwal, 57000, Pakistan
  • Arslan Ali Mansab University of Sahiwal, Sahiwal, 57000, Pakistan
  • Muhammad Hamza Akbar University of Sahiwal, Sahiwal, 57000, Pakistan
  • Muhammad Waqas University of Sahiwal, Sahiwal, 57000, Pakistan

DOI:

https://doi.org/10.66108/mna.v4i3.103

Keywords:

Multimodal Emotion Recognition, Deep Learning, Transformers, Fusion Strategies, Affective Computing

Abstract

Emotion recognition has garnered significant attention as one of the fastest-moving branches of artificial intelligence, driven by the growing demand for emotionally intelligent systems that improve Human-Computer Interaction (HCI). Early studies in this field relied mainly on unimodal models and hand-crafted features, which limited their ability to account for the expressiveness of human emotions and their contextual variability. Deep learning has radically changed emotion recognition by enabling automatic feature learning and robust modeling of complex affective behaviors. This paper presents a thorough review of the development of Multimodal Emotion Recognition (MER), focusing on the combination of speech, textual, and facial modalities. We critically synthesize models for each individual modality and trace how deep learning architectures have evolved from Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to state-of-the-art Transformer-based models capable of capturing long-range dependencies and cross-modal interactions. We further examine multimodal fusion techniques, from early and late fusion to advanced hybrid and attention-based schemes that dynamically integrate complementary information across modalities. Particular attention is given to recent work on low-resource and multilingual settings, where data scarcity and linguistic variation remain significant impediments. By surveying the latest architectural and fusion developments, this paper highlights current trends, performance improvements, and open gaps in MER, offering insights for the construction of robust, scalable, and inclusive emotion-aware systems.
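The early- versus late-fusion distinction mentioned in the abstract can be illustrated with a minimal sketch. All dimensions, the four-class label set, and the random linear classifiers below are hypothetical stand-ins for trained models, not part of the reviewed work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality feature vectors (hypothetical dimensions).
audio = rng.standard_normal(32)   # speech features
text = rng.standard_normal(16)    # textual features
face = rng.standard_normal(24)    # facial features

n_classes = 4  # e.g. happy, sad, angry, neutral (illustrative labels)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Early (feature-level) fusion: concatenate modalities,
# then feed one joint classifier.
fused = np.concatenate([audio, text, face])           # shape (72,)
W_early = rng.standard_normal((n_classes, fused.size))
early_probs = softmax(W_early @ fused)

# Late (decision-level) fusion: one classifier per modality,
# then average the per-modality class probabilities.
per_modality = []
for feats in (audio, text, face):
    W = rng.standard_normal((n_classes, feats.size))
    per_modality.append(softmax(W @ feats))
late_probs = np.mean(per_modality, axis=0)

print(early_probs.shape, late_probs.shape)  # both (4,), each summing to 1
```

Hybrid and attention-based fusion, as surveyed in the paper, sit between these two extremes: instead of a fixed concatenation or a uniform average, learned weights decide per input how much each modality contributes.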

Published

2025-12-21

How to Cite

Raja Abdulrahman, Aleena Jamil, Adeen Amjad, Shafiq Hussain, Muhammad Azhar, Zunaira Aslam, Ifra Shabbir, Waqar Ahmad, Arslan Ali Mansab, Muhammad Hamza Akbar, & Muhammad Waqas. (2025). Automated Deep Learning Approaches for Multimodal Emotion Recognition: A Review of Fusion Strategies, Modalities and Architectures. Machines and Algorithms, 4(3), 198–214. https://doi.org/10.66108/mna.v4i3.103

Section

Reviews