Friday, November 13, 2009

Understanding VoIP

The elements of traditional telephony—status, address and supervisory signaling, digitization, and so on—must have functional parallels in the VoIP world for systems to function as people expect them to, and more importantly, for VoIP to interact with the PSTN properly.

This section examines how digital voice is packetized, the signaling and transport protocols involved, the components of a VoIP network, and the factors that can cause problems in VoIP networks along with ways to mitigate them.


Understanding Packetization

IP networks move data around in small pieces known as packets. Once voice has been digitized, it becomes just another binary payload to carry in a packet. VoIP uses Digital Signal Processors (DSPs) for the codec functions. The digitized voice is then packaged in an appropriate protocol structure to move it through the IP infrastructure.


DSPs

DSPs are specialized chips that perform high-speed codec functions. DSPs are found in IP phones, where they encode the analog speech of the user and decode the digitized contents of the packets arriving from the other end of the call. DSPs are also used on IOS gateways at the interface to PSTN circuits, to convert a digital circuit or an analog circuit to packetized voice. DSPs also convert streams from one codec to another and support conferencing, call park, and other telephony features. DSPs are a vital component of a VoIP system. Different chip types have varying capacities, but the general rule is that you want as many DSP resources available to you as possible. The DSP calculator on cisco.com will help you work out how many you need.


Real-Time Transport Protocol (RTP)

RTP was developed to better serve real-time traffic such as voice and video. Voice payloads are encapsulated by RTP, then by UDP, then by IP. A Layer 2 header of the correct format is then applied; its type depends on the link technology in use on each router interface. A single voice call generates two one-way RTP/UDP/IP packet streams. UDP provides multiplexing and checksum capability; RTP provides payload identification, timestamps, and sequence numbering.
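To make the RTP fields named above concrete, here is a minimal Python sketch that builds and parses the 12-byte fixed RTP header (version, payload type, sequence number, timestamp, and source identifier). The function names are hypothetical; the field layout and the static payload type numbers (0 for G.711 mu-law, 18 for G.729) follow the RTP specifications rather than anything stated in this post. UDP and IP encapsulation would normally be left to the operating system's socket layer.

```python
import struct

RTP_VERSION = 2
PT_PCMU = 0    # G.711 mu-law static payload type
PT_G729 = 18   # G.729 static payload type


def build_rtp_packet(payload_type, seq, timestamp, ssrc, payload):
    """Prepend a 12-byte fixed RTP header (no padding, extension, or CSRC list)."""
    byte0 = RTP_VERSION << 6        # V=2, P=0, X=0, CC=0
    byte1 = payload_type & 0x7F     # M=0, 7-bit payload type
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload


def parse_rtp_header(packet):
    """Decode the fixed header fields from the first 12 bytes of a packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "payload_type": byte1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }
```

A receiver can read the payload type field to tell voice from video without inspecting the payload itself.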

Payload identification allows us to treat voice traffic differently from video, for example, simply by looking at the payload type in the RTP header, which simplifies our configuration tasks. Timestamping and sequence numbering allow VoIP devices to reorder RTP packets that arrived out of sequence and play them back with the same timing with which they were recorded, minimizing delays and jerkiness. There is no provision for retransmission of a lost RTP packet.
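The reordering described above can be sketched in a few lines of Python. This is an illustration only, not how any particular phone or gateway implements its jitter buffer: packets are modeled as hypothetical `(sequence, timestamp, payload)` tuples, and the helper names are made up for this example. The sort tolerates the 16-bit sequence number wrapping from 65535 back to 0, and the gap check only reports losses, since RTP never retransmits.

```python
def reorder_by_sequence(packets):
    """Sort (seq, timestamp, payload) tuples by sequence number.

    Distances are measured relative to the first packet received, in
    16-bit modular arithmetic, so a wrap from 65535 to 0 sorts correctly.
    """
    if not packets:
        return []
    base = packets[0][0]

    def distance(pkt):
        d = (pkt[0] - base) & 0xFFFF
        return d - 0x10000 if d >= 0x8000 else d

    return sorted(packets, key=distance)


def missing_sequences(ordered):
    """List sequence numbers absent between consecutive ordered packets."""
    missing = []
    for (a, _, _), (b, _, _) in zip(ordered, ordered[1:]):
        gap = (b - a) & 0xFFFF
        missing.extend((a + i) & 0xFFFF for i in range(1, gap))
    return missing


def playout_offsets_ms(ordered, clock_rate=8000):
    """Playback time of each packet relative to the first, from RTP timestamps."""
    first = ordered[0][1]
    return [((ts - first) & 0xFFFFFFFF) * 1000 // clock_rate for _, ts, _ in ordered]
```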

Each RTP stream is accompanied by a Real-Time Transport Control Protocol (RTCP) stream. RTCP monitors the quality of the RTP stream, allowing devices to record events such as packet count, delay, loss, and jitter (delay variation).
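As an illustration of one statistic RTCP can report, the following Python sketch computes a running interarrival jitter estimate. The post does not say how jitter is calculated; the smoothing formula used here, J = J + (|D| - J)/16, is the one given in the RTP specification, while the function name and the `(rtp_timestamp, arrival_ms)` input format are assumptions made for this example.

```python
def interarrival_jitter(packets, clock_rate=8000):
    """Estimate jitter in RTP timestamp units.

    packets: list of (rtp_timestamp, arrival_ms) in arrival order.
    For each packet, 'transit' is arrival time minus send timestamp,
    both in timestamp units; jitter is a smoothed average of how much
    transit changes from one packet to the next.
    """
    jitter = 0.0
    prev_transit = None
    for rtp_ts, arrival_ms in packets:
        transit = arrival_ms * clock_rate / 1000 - rtp_ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter
```

With an 8000 Hz clock, dividing the result by 8 converts it to milliseconds.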

A single voice packet by default contains a payload of 20 msec of voice (either uncompressed or compressed). Because sampling occurs 8000 times per second, 20 msec gives us 160 samples; with uncompressed G.711, each sample is one byte, so the payload is 160 bytes. Dividing 8000 by 160 shows that a one-way voice stream generates 50 packets per second, each carrying 160 bytes of payload.

If we use compression, we can squeeze the 160-byte payload down to 20 bytes using the G.729 codec. We still have 160 samples, still 20 msec of audio, but reduced payload size.
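The packetization arithmetic above can be checked with a short Python sketch. The function names are hypothetical; the inputs (8000 samples per second, 20 msec per packet, 64 kbps for G.711 and 8 kbps for G.729) come from the text.

```python
def samples_per_packet(sample_rate_hz=8000, packet_ms=20):
    """Number of voice samples represented by one packet."""
    return sample_rate_hz * packet_ms // 1000


def packets_per_second(packet_ms=20):
    """Packets generated per second for a one-way voice stream."""
    return 1000 // packet_ms


def payload_bytes(codec_bps, packet_ms=20):
    """Voice payload size in bytes: bits per second * seconds / 8."""
    return codec_bps * packet_ms // 8000


G711_BPS = 64000
G729_BPS = 8000

print(samples_per_packet())      # 160 samples in 20 msec
print(packets_per_second())      # 50 packets per second
print(payload_bytes(G711_BPS))   # 160 bytes
print(payload_bytes(G729_BPS))   # 20 bytes
```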

Codecs

The codecs supported by Cisco include the following:
• G.711 (64kbps)—Toll-quality voice, uncompressed.
• G.729 (8kbps), with two variants:
  • Annex A variant: less processor-intensive, allows more voice channels encoded per DSP chip; lower audio quality than G.729
  • Annex B variant: allows the use of Voice Activity Detection and Comfort Noise Generation; can be applied to G.729 or G.729-A


The values for bandwidth shown do not include the Layer 3 and Layer 2 overhead; the actual bandwidth used by a single (one-way) voice stream can be significantly larger. The following tables summarize the additional overhead added by packetization and Layer 2 encapsulation (assuming 50 packets per second (pps)):
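The per-stream bandwidth including overhead can be estimated with a small Python sketch. The 40-byte RTP/UDP/IP header (12 + 8 + 20 bytes) and the 50 pps rate are from the text; the Layer 2 overhead values below (18 bytes for Ethernet, 6 bytes each for PPP and Frame Relay) are commonly quoted figures assumed for this example and are not taken from the original tables. The function name is hypothetical.

```python
IP_UDP_RTP_HEADER = 40   # 20 (IP) + 8 (UDP) + 12 (RTP) bytes

# Assumed Layer 2 overhead per packet, in bytes (illustrative values).
L2_OVERHEAD = {"ethernet": 18, "ppp": 6, "frame_relay": 6}


def stream_bandwidth_bps(payload, link, pps=50):
    """One-way bandwidth in bits per second, including L3 and L2 overhead."""
    packet_bytes = payload + IP_UDP_RTP_HEADER + L2_OVERHEAD[link]
    return packet_bytes * 8 * pps


for codec, payload in (("G.711", 160), ("G.729", 20)):
    for link in L2_OVERHEAD:
        kbps = stream_bandwidth_bps(payload, link) / 1000
        print(f"{codec} over {link}: {kbps} kbps")
```

Under these assumptions, G.711 over Ethernet comes to 87.2 kbps rather than the nominal 64 kbps, and G.729 over Ethernet to 31.2 kbps rather than 8 kbps.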


When using G.729, the RTP/UDP/IP header of 40 bytes is twice the size of the 20-byte voice payload. This consumes significant bandwidth just for header transmission on a slow link. The recommended solution is to use Compressed RTP (cRTP) on slow WAN links. cRTP reduces the RTP/UDP/IP header to 2 bytes without checksums or 4 bytes with checksums. The effect of using cRTP is illustrated in the following table. (Note: Ethernet is not included because it is not classified as a slow link.)
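A rough Python sketch of the savings from cRTP follows. The header sizes (40 bytes uncompressed, 2 or 4 bytes with cRTP) and the 20-byte G.729 payload at 50 pps are from the text; the 6-byte Layer 2 overhead for a PPP link is an assumed, commonly quoted value, and the function name is hypothetical.

```python
PPP_OVERHEAD = 6   # assumed Layer 2 overhead in bytes (illustrative)


def g729_bandwidth_bps(header_bytes, l2_overhead=PPP_OVERHEAD, payload=20, pps=50):
    """One-way G.729 stream bandwidth for a given RTP/UDP/IP header size."""
    return (payload + header_bytes + l2_overhead) * 8 * pps


print(g729_bandwidth_bps(40))   # 26400 bps, full 40-byte header
print(g729_bandwidth_bps(4))    # 12000 bps, cRTP with checksums
print(g729_bandwidth_bps(2))    # 11200 bps, cRTP without checksums
```

Under these assumptions, cRTP cuts the per-stream bandwidth on the PPP link by more than half.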

Voice Activity Detection (VAD)

Phone conversations on average include about 35% silence. In Cisco Unified Communications, by default silence is packetized and transmitted, consuming the same bandwidth as speech. In situations where bandwidth is very scarce, the VAD feature can be enabled, causing the voice stream to be stopped during periods of silence. The theory here is that the bandwidth otherwise used for silence can be reclaimed for voice or data transmission. VAD also adds Comfort Noise Generation (CNG), which fills in the dead silence created by the stopped voice flow with white noise. VAD should not be taken into account during the network design bandwidth allocation process because its effectiveness varies with background noise and speech patterns. VAD is also made ineffective by Music on Hold and fax features. In reality, VAD typically causes more problems than it solves, and it is usually wiser to add the necessary bandwidth.
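To illustrate the idea of suppressing silence, here is a toy Python sketch. It is not the algorithm used by G.729 Annex B or by any Cisco product; it simply compares the average amplitude of each 20 msec frame against a hypothetical threshold and marks quiet frames as suppressed. The function names, the threshold value, and the 16-bit PCM input format are all assumptions.

```python
def frame_is_speech(samples, threshold=500):
    """Toy energy check: mean absolute amplitude of 16-bit PCM samples."""
    if not samples:
        return False
    return sum(abs(s) for s in samples) / len(samples) >= threshold


def suppress_silence(frames, threshold=500):
    """Tag each frame as ('voice', frame) or ('silence', None).

    A sender would transmit only the 'voice' frames; the receiver would
    play comfort noise in place of the 'silence' ones.
    """
    return [("voice", f) if frame_is_speech(f, threshold) else ("silence", None)
            for f in frames]
```

Background noise that sits near the threshold would flip frames between the two states, which hints at why VAD savings are hard to predict.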


Additional DSP Functions

In addition to digitizing voice, DSP resources are used for the following:
  • Conferencing: DSPs mix the audio streams from the conference participants and send each participant the mix of everyone else's audio (the full mix minus that participant's own stream).
  • Transcoding and Media Termination Points (MTP): A transcoder changes a packetized audio stream from one codec to another, perhaps for transit across a slow WAN link. MTPs provide a point for the stream to be terminated while other services are set up.
  • Echo Cancellation: DSPs provide the calculation power needed to analyze the audio stream and filter out the repetitive patterns that indicate echo. Echo is a chief cause of perceived poor voice quality, so echo cancellation is an important function.
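The conference mixing described in the list above can be sketched in Python as follows. This is a simplified illustration, not DSP firmware: it assumes every participant contributes a frame of the same length of 16-bit linear PCM samples, and the function name is hypothetical.

```python
def mix_minus(frames):
    """Return, for each participant, the sum of everyone else's audio.

    frames: dict mapping participant name to a list of integer PCM samples
    (all lists the same length). Output samples are clipped to the 16-bit range.
    """
    total = [sum(column) for column in zip(*frames.values())]
    mixed = {}
    for name, own in frames.items():
        mixed[name] = [max(-32768, min(32767, t - s)) for t, s in zip(total, own)]
    return mixed
```

Summing once and subtracting each participant's own samples avoids recomputing a separate mix from scratch for every listener.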