
Visual Speech In Real Noisy Environments (VISION)

A Novel Benchmark Dataset and Deep Learning-based Baseline System


1 Oct 2022

In this paper, we present VIsual Speech In real nOisy eNvironments (VISION), a first-of-its-kind audio-visual (AV) corpus comprising 2500 utterances from 209 speakers, recorded in real noisy environments including social gatherings, streets, cafeterias and restaurants. While a number of speech enhancement frameworks that exploit AV cues have been proposed in the literature, there are no visual speech corpora recorded in real environments with a sufficient variety of speakers to enable evaluation of AV frameworks’ generalisation capability in a wide range of background visual and acoustic noises. The main purpose of our AV corpus is to foster research in the area of AV signal processing and to provide a benchmark corpus that can be used for reliable evaluation of AV speech enhancement systems in everyday noisy settings. In addition, we present a baseline deep neural network (DNN) based spectral mask estimation model for speech enhancement. Comparative simulation results with subjective listening tests demonstrate significant performance improvement of the baseline DNN compared to state-of-the-art speech enhancement approaches.
