How to Reduce Latency and Improve Audio Fidelity

More and more businesses and residential telephone subscribers are moving to Internet-based service providers. According to the Canadian Radio-television and Telecommunications Commission (CRTC), there were 5.5 million retail “Voice Over Internet Protocol” (VoIP) residential telephone lines in Canada in 2013 [1] – representing nearly 50% of that market. These numbers will only increase in the future, as the arguments for choosing VoIP are so compelling: caller ID, easily-added features such as online contact lists, black lists of blocked callers, virtual numbers, physical portability (bring your VoIP line with you on holidays!), multi-way conferences, and, perhaps most significantly, dramatically reduced costs.

In spite of the abundant benefits of using Internet-based telephone service, one of the common criticisms of VoIP telephony centres on voice quality – the listener’s perception of the fidelity of the audio being received or transmitted.

The continuous audio of speech is broken down, typically, into 50 “packets” of 20ms duration each, for every second of sound being transmitted. Those 50 packets have to be digitized, transmitted to the party at the other end of the call, received, and converted back to analog audio, for the listener to hear. The choice of digitizing strategy (“codec”) will determine the best-case perception of quality; audio quality from that level can be eroded due to problems with the transmission of those digitized packets of sound – each packet must arrive in a timely fashion, ideally in the order in which they were transmitted. But with internet data transmission, VoIP typically cannot rely on such “guaranteed service”. Packets can arrive out of order, delayed, or they may indeed not arrive at all.

Latency is the term used to describe the delay incurred by a packet of data in transit from its origin to its destination. Contributing to latency are factors such as:

  • Physical distance between the two endpoints
  • Number of physical “hops” required to be traversed
  • Bandwidth of the physical ‘hops’
  • Demands upon the network(s) being used by other network users at the same time


How much does latency erode our perception of audio quality?

In an attempt to quantify audio quality, a measure called the Mean Opinion Score (MOS) was developed. Originally, the MOS was obtained by having a panel of listeners offering their opinions on the audio quality of sounds in controlled conditions, but in the VoIP world, these subjective measures have been quantified in terms of the hazards faced by packet data on a network: latency, jitter (the variation in latency) and packet loss. These MOS numbers range from 1 (“bad quality”) through 3 (“fair quality”) to 5 (“excellent quality”). Historically PSTN (circuit switched) calls have a MOS score in the 4.0 to 4.5 range and cellular networks a MOS score in the 3.5 to 4.0 range.

Ignoring packet loss can have a significant impact on the MOS. Here’s a chart showing the dependence of MOS upon latency, measured in milliseconds:


How to reduce latency and improve audio fidelity


We can see that any latency values greater than about half a second (500 milliseconds) will probably result in MOS values less than 3 – anywhere from “fair” quality (3) to “poor” quality (2) and “bad” audio quality (1).

So what can we do to reduce latency? First we have to know from where latencies arise. According to [2] internet transmission latency comes from four basic sources:

  1. This is a physical limitation; information is limited by the speed of light; travel across North America (say, 5000km) must take at least 16ms; travel to the moon (about 380,000km) is going to take at least 1250ms.
  2. Doing digital to analog conversion, encryption, and compression on the sending end, and then decompression, decryption, and digital to analog conversion on the receiving end, all take some amount of time
  3. Assuming that we’re transmitting our audio over a shared network, (as opposed to a dedicated one of which we are the sole users) delays are introduced when other users’ data must be interleaved with our own, or when a flood of other users’ data causes our stream to be delayed
  4. Grouping/Batching. Typically audio is digitized in 20ms packets; the packet cannot be transmitted until the entire 20ms interval is converted. This introduces a 20ms delay at minimum, assuming the digitization is instantaneous (which it won’t be). On slower or congested links, the effects of packet serialization–known as jitter–may be more prevalent. The techniques to manage this, such as buffering, may introduce additional delays. Decompression methods used by some codecs may also introduce delays when lining up packets.


So what can be done to reduce latency?

  1. Make sure the audio stream’s transmission path is as short as possible, all other things being equal. Of course, this is often out of our control. Choice of network backbones, failovers, and re-routings all occur more or less outside of users’ control
  2. Ensure that there is adequate processing horsepower and memory at both endpoints. Insufficient computing power can result in slowdowns and interruptions to completing the work at hand – encoding and decoding the audio stream. Some encode/decode schemes (codecs) have greater processing burdens than others. In all cases, VoIP demands timely processing
  3. Ensure that there is adequate bandwidth on all legs of the network(s) over which the audio is being transmitted. The more traffic being handled, the more likely it happens that some of the audio packets are delayed. “Bursty” traffic can be especially problematic in this respect. If our audio has to share a network with bursty or high-volume (i.e. close to network capacity) traffic, the timeliness of our transmissions will likely be impacted (which is by definition high latency). Whenever possible, prioritize your audio traffic by specifying the appropriate QoS (Quality of Service [3]) or DiffServ (Differentiated Services [4]) settings on any network equipment in your control.
  4. Optimize the trade-off between packet size and overhead ratio. The overhead for a single RTP packet is fixed; make the packet too small, and you’ll be wasting a lot of bandwidth; make it too large, and you’ll be adding to latency.


Further to the multiplexing issue above, much very low-level work has been done with respect to a problem known as “bufferbloat” [5]. Routers and switches have to buffer incoming packets when those packets’ outgoing destinations are “busy”, either through congestion or just contention. For many years, there was a tendency to design these devices with ever-larger buffers for handling such situations. Research has shown, however, that this tendency has actually caused breakage of congestion-avoidance algorithms, resulting in greater, and more variable, latencies. So in addition to the other strategies mentioned above, selecting the right networking equipment is also very important to latency minimization.




[1] http://www.crtc.gc.ca/eng/publications/reports/policymonitoring/2014/cmr5.htm figure 5.2.1

[2] Workshop on Reducing Internet Latency, 2013. Internet Society (internetsociety.org)

[3] http://en.wikipedia.org/wiki/Quality_of_service

[4] http://en.wikipedia.org/wiki/Differentiated_services

[5] http://en.wikipedia.org/wiki/Bufferbloat