Abstract
This tutorial provides an overview of the causes of lip sync (audio
to video synchronization) errors and their correction and cure. Examples
of some of the more common sources of errors are given and correction
methods using the Pixel AD3000 and AD3100
series audio synchronizers (tracking audio delays) are outlined.
Prepared by:
J. Carl Cooper
Mirko Vojnovic
Chris Smith
March 27, 2002 (updated August 12, 2004)
The sources and solution
One industry magazine reporter commented: “... how come
some stations don't bother to correct the audio/video delay inequality?
Sometimes when I watch Headline News I think I've somehow had my sound
discriminator captured by a different channel's audio carrier. Hello?
Anybody home? Compensating audio delay devices do exist. Honest!”
(1).
While compensating delays exist, the cure is not as easy as simply putting
an audio delay in the audio feed to the transmitter.
The causes of audio to video synchronization errors in video systems,
commonly known as lip sync errors, are usually quite subtle. These errors
are often the result of buildup of video delays at several locations,
without provision for compensating audio delay. The problem is now so
pervasive that it receives attention from virtually every segment of
the TV industry.
Obviously the advertisers do not want to have their commercials aired
with bad lip sync. Many are now monitoring for lip sync problems and
asking for “free make-ups” when they find them. Bad lip
sync is also a big concern to newscasters, reporters, politicians and
others who are trying to convey a message of trust and sincerity to
their audience. Without proper lip sync these people can be perceived
by the viewers as less interesting, more unpleasant, less influential,
and less successful than if the same person were viewed with proper
lip sync (2).
It is important to recognize that for even a simple system, video delays
of several frames can occur. These video delays usually result in audio
being advanced relative to video. More recently, as in the case of some
serial audio (e.g. Dolby Digital) decoders, audio can actually be delayed
with respect to video, which further complicates the problem.
Part of the mismatch actually occurs in the television camera due to
delays in the CCD sensors, and part in the television signal production
environment. Various video signal processing devices such as frame synchronizers,
DVEs, A-D and D-A converters are typical video delay sources. It is
important to note that many of these devices can instantly change their
video delay by throwing away or repeating frames of video to avoid video
memory overflow. Jumps can also occur when the operating mode of these
devices changes.
A frequent problem in MPEG encoding and decoding (as well as with other
video delay sources) is the introduction of large, quick changes in
video delay. These delay changes are caused by instant changes in
complexity and motion in the video image, and are often seen on network
feeds. A typical system video delay chart for individual video delay
components and the total delay is shown below.

Note that the error ranges from approximately 2.5 to 8.5 frames, with
instant jumps of more than 3 frames at some instances.
From time to time, standards committees in various countries have addressed
the problem of audio to video timing errors, and have set standards
or guidelines for the maximum allowable amounts of these errors. For
the most part, these standards or guidelines are in agreement that lip
sync errors start to become noticeable if the audio leads the video
by 25 ms to 35 ms, or if the audio lags the video by 80 ms to 90 ms.
The smaller limit on advanced audio tends to support the theory that
advanced audio is more annoying because it is an unnatural condition.
In June 2003, the Advanced Television Systems Committee (ATSC) issued
a finding on this topic (3). The ATSC finding
states:
“One of the overarching goals of the DTV broadcasting system is
to deliver audio and video in proper synchronization to the viewer.”
The finding continues:
“... at the inputs to the DTV encoding devices…the sound
program should never lead the video program by more than 15 milliseconds,
and should never lag the video program by more than 45 milliseconds.”
And:
“Pending [a finding on tolerances for system design], designers
should strive for zero differential offset throughout the system.”
In 1993, the International Telecommunication Union (ITU) was apparently
thinking along the same lines. In its Draft New Recommendation [DOC.
11/59] (4) the ITU reported that errors of +20
ms (audio advanced) and -40 ms (audio delayed) were "detectable"
and errors of +40 and -160 ms were "subjectively annoying".
The draft recommendation suggested that an overall tolerance (for production,
presentation, distribution and transmission) of +20 ms to –40
ms was appropriate.
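The ATSC and ITU figures quoted above can be gathered into a small check. The following Python sketch is only an illustration: the function names, the 30 frames per second figure used to convert frames to milliseconds, and the treatment of values exactly on a threshold are assumptions made for this example, not part of either document. The sign convention follows the ITU text (positive means audio advanced).

```python
FRAME_RATE_HZ = 30.0  # assumption: nominal 30 frames per second


def frames_to_ms(frames: float, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Convert a delay expressed in video frames to milliseconds."""
    return frames * 1000.0 / frame_rate_hz


def classify_itu(offset_ms: float) -> str:
    """Classify an offset using the ITU draft thresholds quoted in the text.

    Positive = audio advanced, negative = audio delayed.
    Treating the threshold values themselves as inclusive is an assumption.
    """
    if offset_ms >= 40.0 or offset_ms <= -160.0:
        return "subjectively annoying"
    if offset_ms >= 20.0 or offset_ms <= -40.0:
        return "detectable"
    return "below detectability"


def within_atsc_limits(offset_ms: float) -> bool:
    """True if audio leads by at most 15 ms and lags by at most 45 ms."""
    return -45.0 <= offset_ms <= 15.0


if __name__ == "__main__":
    # A 2.5 frame video delay with no audio compensation leaves audio advanced.
    offset = frames_to_ms(2.5)
    print(f"{offset:.1f} ms: {classify_itu(offset)}, "
          f"ATSC ok: {within_atsc_limits(offset)}")
```

Under these assumptions, even the smallest uncompensated video delay in the 2.5 to 8.5 frame range mentioned earlier falls outside both sets of limits.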
Clearly, television facilities need to keep audio to video synchronization
problems in mind. It is impractical to remove all of the video delays
so the solution is to add compensating delays to the audio. Adding tracking
audio delays, known as audio synchronizers, at every place where there
is a substantial video delay is a good system solution to lip sync problems.
In situations where an operator is already present (such as the master
control room), providing an easy to adjust, fast tracking audio synchronizer
allows the operator to correct upstream lip sync errors. Rapid corrections
can be made on the air without introducing unwanted audio artifacts.
Typical applications are remote news feeds or network feeds that arrive
at the station with a lip sync problem. This is usually due to the separate
transmission paths that video and audio signals traverse (e.g. video
going through satellite uplink, and audio via land line).
It should be remembered that video delays often change by several
frames in only a few seconds. A quick change in video delay requires
the audio delay to track quickly across widely different values.
Unfortunately, audio is continuous and there are no frames which can be
repeated or deleted. Old style variable audio delay designs used memory
address jumping, which causes discontinuities in the audio data. Many
products attempted to cover the resulting pops, clicks and gaps, with only
limited success. Even with the masking attempts, these variable delays
still generate many undesirable artifacts in normal program audio, and are
generally considered
totally unsuitable for any broadcast quality audio.
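The discontinuity caused by address jumping can be illustrated with a short Python sketch. This is not a model of any particular product: the 48 kHz sample rate, the 1 kHz test tone, the jump position and the helper names are all assumptions chosen for the example. It reads a continuous tone through a delay, once with a fixed delay and once with the read address jumping by 24 samples, and compares the largest sample-to-sample step in the output.

```python
import math

SAMPLE_RATE_HZ = 48_000  # assumed audio sample rate
TONE_HZ = 1_000          # assumed test tone frequency


def tone(n: int) -> float:
    """Sample n of a continuous sine test tone."""
    return math.sin(2.0 * math.pi * TONE_HZ * n / SAMPLE_RATE_HZ)


def delayed_output(num_samples: int, jump_at: int,
                   delay_before: int, delay_after: int) -> list[float]:
    """Read the tone through a delay whose read address jumps at jump_at."""
    out = []
    for n in range(num_samples):
        delay = delay_before if n < jump_at else delay_after
        out.append(tone(n - delay))
    return out


def largest_step(samples: list[float]) -> float:
    """Largest difference between consecutive output samples."""
    return max(abs(b - a) for a, b in zip(samples, samples[1:]))


# Fixed delay: the output stays as smooth as the input tone.
steady = delayed_output(480, jump_at=480, delay_before=0, delay_after=0)
# Address jump of 24 samples (half a tone period) near a waveform peak.
jumped = delayed_output(480, jump_at=108, delay_before=0, delay_after=24)

print(f"largest step, fixed delay:  {largest_step(steady):.3f}")
print(f"largest step, address jump: {largest_step(jumped):.3f}")
```

The jump produces a step many times larger than anything in the undisturbed tone, which is the kind of discontinuity heard as a pop or click.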
The most successful audio delay technology uses first-in-first-out
memories, or FIFOs. The FIFO has independent write and read operations,
which allows the reading to be faster or slower than the
writing. The audio delay is controlled by causing the reading to catch
up with the writing (to decrease delay) or to lag behind the writing
(to increase the delay). This method works very well for making slowly
changing delay adjustments, on the order of a 0.5% rate of change, and
does not cause gaps, pops or clicks. Unfortunately the technique does
create an audio pitch shift. This pitch error occurs because the ‘playback’
of the audio from the FIFO is not at the same speed as the ‘recording’
of the audio. There is a direct correspondence between the read/write
rate difference and the pitch error. Most listeners can hear pitch shifts
of around 1%, and people who have musical talent can hear on the order
of 0.1%. If the rate of delay change is fast, then a noticeable pitch
change is created.
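The correspondence between the read/write rate difference, the pitch error and the rate of delay change can be written out as a minimal Python sketch. The function names and the 48 kHz write rate are assumptions for illustration; the cents figure uses the standard musical definition of 1200 cents per octave and is not taken from the text above.

```python
import math


def pitch_error_percent(write_rate_hz: float, read_rate_hz: float) -> float:
    """Pitch error of the FIFO output relative to the input, in percent."""
    return (read_rate_hz / write_rate_hz - 1.0) * 100.0


def pitch_error_cents(write_rate_hz: float, read_rate_hz: float) -> float:
    """The same pitch error in musical cents (1200 cents per octave)."""
    return 1200.0 * math.log2(read_rate_hz / write_rate_hz)


def delay_slew(write_rate_hz: float, read_rate_hz: float) -> float:
    """Seconds of delay gained (+) or lost (-) per second of program.

    Reading slower than writing lets the FIFO fill, so the delay grows.
    """
    return (write_rate_hz - read_rate_hz) / write_rate_hz


if __name__ == "__main__":
    write_rate = 48_000.0            # assumed write (input) sample rate
    read_rate = write_rate * 0.995   # read 0.5% slower to increase the delay
    print(f"pitch error: {pitch_error_percent(write_rate, read_rate):.2f} % "
          f"({pitch_error_cents(write_rate, read_rate):.1f} cents)")
    print(f"delay slew:  {delay_slew(write_rate, read_rate):.4f} s per s")
```

In this sketch the pitch error and the delay slew have the same magnitude as a fraction, so a faster delay change always costs a proportionally larger pitch shift.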
With a small rate of delay change, the amount of time required to make
large delay changes is correspondingly large, and the audio will be
out of sync for a long time while the delay is catching up. For example,
assume a 0.5% rate of delay change is used and an MPEG encoder generates
a quick 6 frame change of video delay when a commercial is switched
in. The audio delay will not be able to catch up to the new video delay
for 6 / 0.005, or 1200 frames. At 30 frames per second that is 40 seconds,
so the lip sync will be off for an entire 30 second commercial! Audio
delays used for
tracking video delays must have fast rates of change to keep up with
the video, which leads to a noticeable pitch change in the audio.
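The catch-up arithmetic above can be reproduced with a few lines of Python. The function names are hypothetical, and the 30 frames per second figure is an assumption consistent with 1200 frames taking 40 seconds.

```python
FRAME_RATE_HZ = 30.0  # assumption: nominal 30 frames per second


def catch_up_frames(delay_change_frames: float, rate_of_change: float) -> float:
    """Frames needed to absorb a delay change at a given rate of change.

    rate_of_change is a fraction, e.g. 0.005 for 0.5%.
    """
    return delay_change_frames / rate_of_change


def catch_up_seconds(delay_change_frames: float, rate_of_change: float,
                     frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """The same catch-up time expressed in seconds."""
    return catch_up_frames(delay_change_frames, rate_of_change) / frame_rate_hz


def rate_needed(delay_change_frames: float, window_frames: float) -> float:
    """Rate of change required to absorb a delay change within a window."""
    return delay_change_frames / window_frames


if __name__ == "__main__":
    frames = catch_up_frames(6, 0.005)
    seconds = catch_up_seconds(6, 0.005)
    print(f"6 frame jump at 0.5%: {frames:.0f} frames, {seconds:.0f} seconds")
    print(f"rate needed to catch up in 30 frames: {rate_needed(6, 30):.0%}")
```

Turning the calculation around, absorbing the same 6 frame jump within 30 frames would need a 20% rate of change, far beyond the pitch shifts that listeners tolerate.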
In order to both minimize perceptible pitch shifts during normal small
delay changes, and to allow rapid audio delay change after large video
delay changes, it is necessary that the audio synchronizer incorporate
a pitch correction circuit. With pitch correction capability, it is
possible to make rapid delay changes while the pitch correction circuit
reduces the corresponding audio pitch artifacts to a level where the viewer
will not notice them. The Pixel Instruments AD3000
and AD3100 audio synchronizers provide a high
performance pitch shifting capability as part of the audio synchronizing
function. As of the writing of this tutorial, they are the only audio
synchronizers on the market with this capability.
(1) Mario Orazio, "The Masked Engineer", TV Technology 13.12
(December 1995): 5-10.
(2) Dr. Byron Reeves & Dave Voelker, research report Effects of
Audio-Video Asynchrony on Viewer's Memory, Evaluation of Content and
Detection Ability (1993).
(3) ATSC Implementation Subcommittee Finding, Doc.IS-191, 26 June 2003.
(4) International Telecommunication Union Document 11A/47-E, 13 October
1993.
(5) Randy Hoffner, “A/V Synchronization: How Bad is Bad?”,
TV Technology, 14 May, 2003.