# Pre-trained Speech Processing Models Contain Human-Like Biases that Propagate to Speech Emotion Recognition
This repository contains code for the paper Pre-trained Speech Processing Models Contain Human-Like Biases that Propagate to Speech Emotion Recognition, which appears in Findings of EMNLP 2023. Please create an issue and tag me (@isaacOnline) if you have any questions.
The Python packages necessary to run the majority of the code in this repo are listed in `mac_env.yml` and `unix_env.yml`, which specify the environments we used for running experiments on macOS and Ubuntu machines, respectively. When preprocessing data with propensity score matching, we used `psmpy`, and because of package conflicts, we created a separate environment (`psmpy_env.yml`) exclusively for that purpose.
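If you use conda, for example, the Ubuntu environment can be created with `conda env create -f unix_env.yml` (and likewise for `mac_env.yml` and `psmpy_env.yml`).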
Data used for this project comes from a variety of sources, most of which we are not able to redistribute. We have included information about the files in our `data` directory (e.g., the names of the specific clips we used). Links to the datasets are below.
- The data in `audio_iats/mitchell_et_al` comes from the paper Does Social Desirability Bias Favor Humans? Explicit–Implicit Evaluations Of Synthesized Speech Support A New HCI Model Of Impression Management
- The data in `audio_iats/pantos_perkins` comes from the paper Measuring Implicit and Explicit Attitudes Toward Foreign Accented Speech
- The data in `audio_iats/romero_rivas_et_al` comes from the paper Accentism on Trial: Categorization/Stereotyping and Implicit Biases Predict Harsher Sentences for Foreign-Accented Defendants
- The data in `CORAAL` comes from the Corpus of Regional African American Language. We used all CORAAL components that were recorded after the year 2000 and available in October of 2022.
- The data in `EU_Emotion_Stimulus_Set` comes from The EU-Emotion Stimulus Set: A validation study
- The data in `MESS` comes from the paper Categorical and Dimensional Ratings of Emotional Speech: Behavioral Findings From the Morgan Emotional Speech Set
- The data in `speech_accent_archive` can be downloaded using the file `downloading/download_saa.py`
- The data in `TORGO` comes from The TORGO database of acoustic and articulatory speech from speakers with dysarthria
- The data in `UASpeech` comes from Dysarthric Speech Database for Universal Access Research
- The data in `buckeye` comes from The Buckeye Corpus
After acquiring these datasets and placing them in the `data` directory, you will need to run the scripts in the `preprocessing` directory. These scripts clean the datasets and create the metadata that is later used for extracting embeddings. The `preprocessing/process_buckeye.py` and `preprocessing/process_coraal.py` scripts need to be run before `preprocessing/match_buckeye_coraal.py`, but otherwise the scripts can be run in any order. Some of these scripts will need to be run using the environment you create with `psmpy_env.yml`.
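For example, assuming the scripts are runnable without arguments, the ordered Buckeye/CORAAL steps would be `python preprocessing/process_buckeye.py`, then `python preprocessing/process_coraal.py`, and finally `python preprocessing/match_buckeye_coraal.py` (the last likely under the `psmpy` environment, since it performs the propensity score matching).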
If you would like to extract embeddings for a new dataset, you will need to create an `all.tsv` file, examples of which can be seen in the `data` directory. This file contains a header listing the directory where the dataset's wav files can be found, followed by relative paths from this directory to the wav files. Each wav file will need to be accompanied by its sequence length. You can use the functions in `downloading_utils.py` to find this sequence length, as well as to ensure the audio clips have a uniform number of channels.
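Based on the description above, an `all.tsv` manifest should look roughly like the following (the paths and sample counts here are purely illustrative):

```
/path/to/dataset/wavs
speaker1/clip_001.wav	51200
speaker1/clip_002.wav	48000
```

If you would rather compute sequence lengths yourself, here is a minimal sketch (not the repo's `downloading_utils.py` API) using the `soundfile` package:

```python
import soundfile as sf

def sequence_length(wav_path: str) -> int:
    """Number of audio samples (frames) in a wav file."""
    return sf.info(wav_path).frames
```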
We use models from the HuBERT, wav2vec 2.0, WavLM, and Whisper model families. To download the relevant HuBERT and WavLM checkpoints, you may be able to use the file `downloading/download_model_ckpts.py` (depending on whether the links we used are still working). This file uses URLs defined in `downloading/urls.py`, which may need to be updated in the future. As of publication, the wav2vec 2.0 models we used are available here. We use the Wav2Vec 2.0 Base (no finetuning), Wav2Vec 2.0 Large (no finetuning), and Wav2Vec 2.0 Large LV-60 (no finetuning) checkpoints. The Whisper models are downloaded automatically when extracting embeddings.
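Assuming the download script takes no arguments, running `python downloading/download_model_ckpts.py` from the repository root should fetch those checkpoints; if a link has broken, the URL to update will be in `downloading/urls.py`.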
Scripts for extracting embeddings are available in the `embedding_extraction` directory (`extract_whisper.py`, `hubert.py`, `wav2vec2.py`, and `wavlm.py`). If you want to extract embeddings for a new dataset, you can add the dataset to these files. Embedding extraction was generally the most time-consuming part of this project. When extracting embeddings for Whisper, you'll need to make sure you're using the `extract-embeddings` branch of my Whisper fork.
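As a hedged example of setting that up (the clone location is whatever the fork link points to): clone the fork, run `git checkout extract-embeddings`, and install it into your environment with `pip install -e .` so that `extract_whisper.py` imports the modified package.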
Once embeddings have been extracted, you can run the scripts in `plots/eats` to carry out the embedding association tests. These will save the SpEAT ds and p-values to files in `plots/eats/test_results` (the result files from our experiments are currently stored there). A script used for creating some of the plots in the paper is available at `plots/eats/plot_all_results.py`. To estimate the standard error of the SpEAT ds, there are scripts in `plots/standard_error`. The results from our standard error estimation are in `plots/standard_error/all_mean_mean_results.csv`.
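For intuition, the SpEAT d has the same form as the WEAT effect size of Caliskan et al. (2017), computed over pooled speech embeddings rather than word vectors. Below is a minimal sketch, not the repo's implementation; the array names and shapes are assumptions (`X`/`Y` are target embedding matrices, `A`/`B` attribute embedding matrices, one row per clip):

```python
import numpy as np

def cosine_sims(u, M):
    """Cosine similarity between vector u and each row of matrix M."""
    return (M @ u) / (np.linalg.norm(M, axis=1) * np.linalg.norm(u))

def association(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus to set B."""
    return cosine_sims(w, A).mean() - cosine_sims(w, B).mean()

def effect_size(X, Y, A, B):
    """WEAT-style d: difference in mean target-attribute association,
    scaled by the standard deviation over all target associations."""
    s_x = np.array([association(x, A, B) for x in X])
    s_y = np.array([association(y, A, B) for y in Y])
    return (s_x.mean() - s_y.mean()) / np.concatenate([s_x, s_y]).std(ddof=1)
```

A positive d indicates that the X targets are more strongly associated with attribute set A (relative to B) than the Y targets are.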
To train downstream SER models, you can use the file `embedding_extraction/train_emotion_model.py`. Weights of the SER models we trained are in `dimension_models/model_objects`. You can use them to predict valence in the input datasets using `embedding_extraction/predict_valence.py`.
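As a hedged usage sketch (assuming default arguments suffice): `python embedding_extraction/train_emotion_model.py` to fit a model on the extracted embeddings, then `python embedding_extraction/predict_valence.py` to generate valence predictions.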