Medium term speaker state detection by perceptually masked spectral features

Cenk Sezgin*, Bilge Gunsel, Jarek Krajewski

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

We propose a method based on perceptual prosodic features for medium term speaker state classification, in particular sleepiness detection. Unlike existing methods, our features represent the spectral characteristics of speech in perceptual bands and track temporal content without any linguistic segmentation. In contrast to conventional methods, we aim to model transitions between non-sleepy and sleepy modes rather than emotional states. Together with the proposed compact feature set, the developed system enables discrimination of medium term speaker states at lower complexity than existing systems. This is achieved by constructing a dictionary for speech data based on the bag-of-words concept. It has been found that a training setup based on learned codewords yields a robust classifier for sleepy speech. Speaker state classification is performed by applying a two-class classification scheme to the observed test data. The numerical results, obtained on the Sleepy Language Corpus (SLC) using a Support Vector Machine (SVM) classifier, demonstrate a 10% average improvement in unweighted recall rates over the benchmark results. The introduced method is promising for online applications because of its frame-based feature extraction scheme, which differs from conventional segmental descriptor extraction techniques.
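The bag-of-words step described in the abstract can be illustrated with a minimal sketch. It assumes that each utterance has already been converted into a list of frame-level feature vectors; the perceptually masked spectral features themselves are not reproduced here. The abstract does not say how the codewords are learned, so plain k-means is used as a stand-in, and the names `learn_codebook` and `encode_utterance` are hypothetical. The data is synthetic and serves only to make the example runnable.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(frame, codebook):
    """Index of the codeword closest to a frame."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(frame, codebook[k]))


def learn_codebook(frames, num_codewords, iterations=20, seed=0):
    """Learn codewords with plain k-means (an assumption, not the paper's method)."""
    rng = random.Random(seed)
    codebook = [list(f) for f in rng.sample(frames, num_codewords)]
    for _ in range(iterations):
        clusters = [[] for _ in codebook]
        for frame in frames:
            clusters[nearest_codeword(frame, codebook)].append(frame)
        for k, members in enumerate(clusters):
            if members:  # keep the old codeword if its cluster is empty
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def encode_utterance(frames, codebook):
    """Normalized histogram of codeword occurrences over an utterance's frames."""
    counts = [0] * len(codebook)
    for frame in frames:
        counts[nearest_codeword(frame, codebook)] += 1
    return [c / len(frames) for c in counts]


# Synthetic 2-D "frames" drawn around two centers (illustrative data only).
rng = random.Random(1)
training_frames = (
    [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(50)]
    + [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(50)]
)
codebook = learn_codebook(training_frames, num_codewords=2)
histogram = encode_utterance(training_frames[:10], codebook)
```

Each utterance is reduced to one fixed-length histogram regardless of its duration, and no linguistic segmentation is involved. Histograms of this kind could then be passed to a two-class classifier such as an SVM.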

Original language: English
Pages (from-to): 26-41
Number of pages: 16
Journal: Speech Communication
Volume: 67
Publication status: Published - Mar 2015

Bibliographical note

Publisher Copyright:
© 2014 Elsevier Ltd. All rights reserved.

Keywords

  • Bag-of-words
  • Medium term speaker state
  • Perceptual audio features
  • Sleepiness detection
  • Speaker emotion recognition
