Speech emotion recognition using amplitude modulation parameters and a combined feature selection procedure

Arianna Mencattini, Eugenio Martinelli, Giovanni Costantini, Massimiliano Todisco, Barbara Basile, Marco Bozzali, Corrado Di Natale

Research output: Contribution to journal › Article

37 Citations (Scopus)

Abstract

Speech emotion recognition (SER) is a challenging task in demanding human-machine interaction systems. Standard approaches based on the categorical model of emotions reach low performance, probably because they model emotions as distinct and independent affective states. Starting from the recently investigated assumption of the dimensional circumplex model of emotions, SER systems are instead structured as the prediction of valence and arousal on a continuous scale in a two-dimensional domain. In this study, we propose the use of a PLS regression model, optimized according to specific feature selection procedures and trained on the Italian speech corpus EMOVO, and suggest a way to automatically label the corpus in terms of arousal and valence. New speech features related to speech amplitude modulation, caused by the slowly varying articulatory motion, and standard features extracted from the pitch contour have been included in the regression model. Over the seven primary emotions (including the neutral state), an average coefficient of determination R² of 0.72 is obtained for the female model (maximum of 0.95 for fear, minimum of 0.60 for sadness), and an R² of 0.81 for the male model (maximum of 0.89 for anger, minimum of 0.71 for joy).

Original language: English
Pages (from-to): 68-81
Number of pages: 14
Journal: Knowledge-Based Systems
Volume: 63
DOI: 10.1016/j.knosys.2014.03.019
Publication status: Published - 2014


Keywords

  • Audio signal modulation
  • Circumplex model of emotions
  • Partial least square (PLS) regression
  • Pearson correlation coefficient
  • Pitch contour characterization
  • Speech emotion recognition (SER)

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Management Information Systems
  • Information Systems and Management

Cite this

Speech emotion recognition using amplitude modulation parameters and a combined feature selection procedure. / Mencattini, Arianna; Martinelli, Eugenio; Costantini, Giovanni; Todisco, Massimiliano; Basile, Barbara; Bozzali, Marco; Di Natale, Corrado.

In: Knowledge-Based Systems, Vol. 63, 2014, p. 68-81.

@article{b272795466f44d7d8f736ed2c7a8cad5,
title = "Speech emotion recognition using amplitude modulation parameters and a combined feature selection procedure",
abstract = "Speech emotion recognition (SER) is a challenging framework in demanding human machine interaction systems. Standard approaches based on the categorical model of emotions reach low performance, probably due to the modelization of emotions as distinct and independent affective states. Starting from the recently investigated assumption on the dimensional circumplex model of emotions, SER systems are structured as the prediction of valence and arousal on a continuous scale in a two-dimensional domain. In this study, we propose the use of a PLS regression model, optimized according to specific features selection procedures and trained on the Italian speech corpus EMOVO, suggesting a way to automatically label the corpus in terms of arousal and valence. New speech features related to the speech amplitude modulation, caused by the slowly-varying articulatory motion, and standard features extracted from the pitch contour, have been included in the regression model. An average value for the coefficient of determination R2 of 0.72 (maximum value of 0.95 for fear and minimum of 0.60 for sadness) is obtained for the female model and a value for R2 of 0.81 (maximum value of 0.89 for anger and minimum value of 0.71 for joy) is obtained for the male model, over the seven primary emotions (including the neutral state).",
keywords = "Audio signal modulation, Circumplex model of emotions, Partial least square (PLS) regression, Pearson correlation coefficient, Pitch contour characterization, Speech emotion recognition (SER)",
author = "Arianna Mencattini and Eugenio Martinelli and Giovanni Costantini and Massimiliano Todisco and Barbara Basile and Marco Bozzali and {Di Natale}, Corrado",
year = "2014",
doi = "10.1016/j.knosys.2014.03.019",
language = "English",
volume = "63",
pages = "68--81",
journal = "Knowledge-Based Systems",
issn = "0950-7051",
publisher = "Elsevier",

}

TY  - JOUR
T1  - Speech emotion recognition using amplitude modulation parameters and a combined feature selection procedure
AU  - Mencattini, Arianna
AU  - Martinelli, Eugenio
AU  - Costantini, Giovanni
AU  - Todisco, Massimiliano
AU  - Basile, Barbara
AU  - Bozzali, Marco
AU  - Di Natale, Corrado
PY  - 2014
Y1  - 2014
N2  - Speech emotion recognition (SER) is a challenging framework in demanding human machine interaction systems. Standard approaches based on the categorical model of emotions reach low performance, probably due to the modelization of emotions as distinct and independent affective states. Starting from the recently investigated assumption on the dimensional circumplex model of emotions, SER systems are structured as the prediction of valence and arousal on a continuous scale in a two-dimensional domain. In this study, we propose the use of a PLS regression model, optimized according to specific features selection procedures and trained on the Italian speech corpus EMOVO, suggesting a way to automatically label the corpus in terms of arousal and valence. New speech features related to the speech amplitude modulation, caused by the slowly-varying articulatory motion, and standard features extracted from the pitch contour, have been included in the regression model. An average value for the coefficient of determination R2 of 0.72 (maximum value of 0.95 for fear and minimum of 0.60 for sadness) is obtained for the female model and a value for R2 of 0.81 (maximum value of 0.89 for anger and minimum value of 0.71 for joy) is obtained for the male model, over the seven primary emotions (including the neutral state).
KW  - Audio signal modulation
KW  - Circumplex model of emotions
KW  - Partial least square (PLS) regression
KW  - Pearson correlation coefficient
KW  - Pitch contour characterization
KW  - Speech emotion recognition (SER)
UR  - http://www.scopus.com/inward/record.url?scp=84899981373&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84899981373&partnerID=8YFLogxK
U2  - 10.1016/j.knosys.2014.03.019
DO  - 10.1016/j.knosys.2014.03.019
M3  - Article
AN  - SCOPUS:84899981373
VL  - 63
SP  - 68
EP  - 81
JO  - Knowledge-Based Systems
JF  - Knowledge-Based Systems
SN  - 0950-7051
ER  -