Multimodal transformer augmented fusion for speech emotion recognition
Blog Article
Speech emotion recognition is challenging due to the subjectivity and ambiguity of emotion. In recent years, multimodal methods for speech emotion recognition have achieved promising results. However, due to the heterogeneity of data from different modalities, effectively integrating information from different modalities remains a difficulty and a key breakthrough point of this research.
Moreover, given the limitations of feature-level and decision-level fusion methods, capturing fine-grained modal interactions has often been neglected in previous studies. We propose a method named multimodal transformer augmented fusion that uses a hybrid fusion strategy, combining feature-level fusion and model-level fusion, to perform fine-grained information interaction within and between modalities. A model-fusion module composed of three Cross-Transformer Encoders is proposed to generate a multimodal emotional representation for modal guidance and information fusion.
Specifically, the multimodal features obtained by feature-level fusion and the text features are used to enhance the speech features. Our proposed method outperforms existing state-of-the-art approaches on the IEMOCAP and MELD datasets.
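To make the idea of cross-modal guidance concrete, here is a minimal sketch (not the authors' released code) of a Cross-Transformer Encoder style block in PyTorch: the target modality provides the attention queries, a guiding modality provides the keys and values. The dimensions, the simple concatenation-and-projection used for feature-level fusion, and all names below are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class CrossTransformerEncoder(nn.Module):
    """One cross-modal block: queries come from the target modality,
    keys/values from the guiding modality, followed by a feed-forward layer."""

    def __init__(self, dim: int = 256, heads: int = 4, ff_mult: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, ff_mult * dim), nn.ReLU(), nn.Linear(ff_mult * dim, dim)
        )

    def forward(self, target: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # Target modality attends to the guiding modality.
        attn_out, _ = self.cross_attn(query=target, key=guide, value=guide)
        x = self.norm1(target + attn_out)
        return self.norm2(x + self.ff(x))


# Illustrative usage: speech features are enhanced first by text features and
# then by a feature-level fused representation. The fusion below (mean-pooled
# speech concatenated with text, then projected) is only an assumed placeholder.
batch, t_speech, t_text, dim = 2, 120, 40, 256
speech = torch.randn(batch, t_speech, dim)  # e.g. frame-level acoustic features
text = torch.randn(batch, t_text, dim)      # e.g. token-level text features

fuse_proj = nn.Linear(2 * dim, dim)         # assumed feature-level fusion
pooled_speech = speech.mean(1, keepdim=True).expand(-1, t_text, -1)
fused = fuse_proj(torch.cat([pooled_speech, text], dim=-1))

text_to_speech = CrossTransformerEncoder(dim)
fused_to_speech = CrossTransformerEncoder(dim)

enhanced = text_to_speech(speech, text)      # speech guided by text
enhanced = fused_to_speech(enhanced, fused)  # further guided by fused features
print(enhanced.shape)                        # torch.Size([2, 120, 256])
```

In this sketch, stacking two such blocks mirrors the hybrid strategy described above: the speech representation is refined by text and by the feature-level fused features before being passed to an emotion classifier.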