FOURTH PORTAL
GATEWAY TO THE FOURTH INDUSTRIAL REVOLUTION
A Novel Benchmark Dataset and Deep Learning-based Baseline System
ResearchGate
1 Oct 2022
In this paper, we present VIsual Speech In real nOisy eNvironments
(VISION), a first-of-its-kind audio-visual (AV) corpus comprising
2500 utterances from 209 speakers, recorded in real noisy environments
including social gatherings, streets, cafeterias and restaurants.
While a number of speech enhancement frameworks that exploit AV cues
have been proposed in the literature, there are no visual speech
corpora recorded in real environments with a sufficient variety of
speakers to enable evaluation of AV frameworks’ generalisation
capability in a wide range of background visual and acoustic noises.
The main purpose of our AV corpus is to foster research in the area
of AV signal processing and to provide a benchmark corpus that can be
used for reliable evaluation of AV speech enhancement systems in
everyday noisy settings. In addition, we present a baseline deep
neural network (DNN) based spectral mask estimation model for speech
enhancement. Comparative simulation results with subjective listening
tests demonstrate significant performance improvement of the baseline
DNN compared to state-of-the-art speech enhancement approaches.