

  • Description:

CREMA-D is an audio-visual data set for emotion recognition. The data set consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states (happy, sad, anger, fear, disgust, and neutral). 7,442 clips of 91 actors with diverse ethnic backgrounds were collected. This release contains only the audio stream from the original audio-visual recording. The samples are split among train, validation, and test sets so that all samples from a given speaker belong to exactly one split.

Split Examples
'test' 1,556
'train' 5,144
'validation' 738
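
The speaker-disjoint property described above can be illustrated with a small sketch. This is not the procedure used to build the released splits, which the description does not specify; it is a generic grouping approach. The function name `split_by_speaker`, the fractions, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_speaker(samples, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign (speaker_id, clip) pairs to splits so no speaker spans two splits.

    `fractions` gives the approximate share of *speakers* (not clips) for
    train and validation; the test split receives the remaining speakers.
    These values are illustrative only.
    """
    # Group clips by speaker so each speaker is moved as one unit.
    by_speaker = defaultdict(list)
    for speaker_id, clip in samples:
        by_speaker[speaker_id].append(clip)

    # Shuffle the speakers deterministically, then cut the list in three.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_train = int(len(speakers) * fractions[0])
    n_val = int(len(speakers) * fractions[1])
    groups = {
        "train": speakers[:n_train],
        "validation": speakers[n_train:n_train + n_val],
        "test": speakers[n_train + n_val:],
    }
    return {
        name: [(s, clip) for s in ids for clip in by_speaker[s]]
        for name, ids in groups.items()
    }


# Toy data: 10 made-up speakers with 3 clips each.
toy = [(f"spk{i:02d}", f"clip{i:02d}_{j}") for i in range(10) for j in range(3)]
splits = split_by_speaker(toy)
```

Because whole speakers are assigned to a split, the number of clips per split only approximately follows the requested fractions when speakers contribute different numbers of clips.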
  • Feature structure:
FeaturesDict({
    'audio': Audio(shape=(None,), dtype=int64),
    'label': ClassLabel(shape=(), dtype=int64, num_classes=6),
    'speaker_id': string,
})
  • Feature documentation:
Feature     Class       Shape    Dtype   Description
audio       Audio       (None,)  int64
label       ClassLabel  ()       int64
speaker_id  Tensor      ()       string
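
As a rough sketch of how the features listed above might be consumed, the snippet below iterates over examples and rescales the integer audio samples to floats. It assumes the dataset is registered in TensorFlow Datasets under the name `crema_d` and that the `int64` samples lie in the signed 16-bit PCM range; neither is stated in this section. The helper names `pcm_to_float` and `iter_examples` are hypothetical.

```python
def pcm_to_float(samples, bits=16):
    """Scale integer PCM samples to floats in [-1.0, 1.0).

    Assumption: the samples use a signed `bits`-bit range (16-bit here).
    """
    scale = float(2 ** (bits - 1))
    return [s / scale for s in samples]


def iter_examples(split="train"):
    """Yield (audio_floats, label, speaker_id) tuples for one split.

    Assumption: the dataset name "crema_d" is available to tfds.load.
    The import is done lazily so pcm_to_float can be used without
    TensorFlow Datasets installed.
    """
    import tensorflow_datasets as tfds

    dataset = tfds.load("crema_d", split=split)
    for example in tfds.as_numpy(dataset):
        audio = pcm_to_float(example["audio"].tolist())
        label = int(example["label"])  # integer class index, 0 to 5
        speaker_id = example["speaker_id"].decode("utf-8")
        yield audio, label, speaker_id


# The scaling step on a few hand-picked values.
scaled = pcm_to_float([0, 16384, -32768])
```

The audio shape `(None,)` means clips have variable length, so batching typically requires padding or cropping to a common length first.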
  • Citation:
  title={ {CREMA-D}: Crowd-sourced emotional multimodal actors dataset},
  author={Cao, Houwei and Cooper, David G and Keutmann, Michael K and Gur, Ruben C and Nenkova, Ani and Verma, Ragini},
  journal={IEEE transactions on affective computing},