A new study by the University of East Anglia (UEA) suggests computers are now better at lip-reading than humans.
The peer-reviewed findings will be presented for the first time at the eighth International Conference on Auditory-Visual Speech Processing (AVSP) 2009, held at the University of East Anglia from September 10-13.
A research team from the School of Computing Sciences at UEA compared the performance of a machine-based lip-reading system with that of 19 human lip-readers. They found that the automated system significantly outperformed the human lip-readers, achieving a recognition rate of 80 per cent compared with only 32 per cent for human viewers on the same task.
Furthermore, they found that machines can exploit very simple features that represent only the shape of the face, whereas human lip-readers need full video of the person speaking.
The study also showed that the dynamics and full appearance of speech gestures are very important for learning to lip-read. This contrasts with the traditional approach to lip-reading training, in which viewers are taught to spot key lip shapes from static (often drawn) images.
Using a new video-based training system, viewers with very limited training significantly improved their ability to lip-read monosyllabic words, which in itself is a very difficult task. It is hoped this research might lead to novel methods of lip-reading training for the deaf and hard of hearing.
“This pilot study is the first time an automated lip-reading system has been benchmarked against human lip-readers and the results are perhaps surprising,” said the study’s lead author Sarah Hilder.
“With just four hours of training it helped them improve their lip-reading skills markedly. We hope this research will represent a real technological advance for the deaf community.”
Agnes Hoctor, campaigns manager at the RNID, said: “This research confirms how difficult the vital skill of lip-reading is to learn and why RNID is campaigning for people who are deaf or hard of hearing to have improved access to classes. We would welcome the development of video-based or online training resources to supplement the teaching of lip-reading. Hearing loss affects 55 per cent of people over 60 so, with the ageing population, demand to learn lip-reading is only going to increase.”
The AVSP conference is being held in the UK for the first time since its inception in 1998. The University of East Anglia will host cutting-edge researchers, including psychologists, engineers, scientists and linguists, from as far afield as Australia, Canada and Japan.
As part of the conference, delegates will take part in a Visual Speech Synthesis Challenge in which a number of visual speech synthesizers, or ‘talking heads’, will battle it out to determine the most intelligible and visually appealing system.
AVSP runs as a satellite conference to Interspeech 2009, which will be held in Brighton. Topics under discussion will include machine recognition of audiovisual speech; the role of gestures accompanying speech; modeling, synthesis and recognition of facial gestures; and speech synthesis.
Keynote speakers will be Dr Peter Bull of the University of York, who will be exploring “The Myth of Body Language”, and Prof Louis Goldstein of the University of Southern California, whose presentation is entitled “Articulatory Phonology and Audio-Visual Speech”.
“Comparison of human and machine-based lip-reading” by Sarah Hilder, Richard Harvey and Barry-John Theobald is published in the Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP) 2009 on Thursday September 10 2009.
The research will be presented on Saturday September 12 at the International Conference on Auditory-Visual Speech Processing (AVSP) 2009 at the University of East Anglia.
For more information about the conference, please visit www.avsp2009.co.uk.
Part of the lip-reading test used to compare the performance of the machine-based lip-reading system and human lip-readers can be downloaded here: http://www.jtuk.com/training/part1.html