IP Telephony

October 26th, 1998

 

Tommi Koistinen

Nokia Telecommunications

Tommi.Koistinen@ntc.nokia.com

 

Johan Haeggström

Nokia Telecommunications

Johan.Haeggstrom@ntc.nokia.com

 

 

Abstract

IP telephony has rapidly grown into a challenger to traditional circuit-switched telephony. The current market situation and the business models involved are briefly reviewed at the beginning of this paper. The relevant standardisation bodies and their main achievements are discussed next. On the technical side, the emphasis is on the digital signal processing functions of voice over IP network equipment. The very important topic of quality of service in voice over IP applications is discussed in the last section of this paper.

 

 

Contents

1. HISTORY OF IP TELEPHONY
2. BUSINESS MODELS
3. MARKETS
4. STANDARDISATION
4.1 ITU-T STANDARDISATION
4.1.1 H.323, Packet-Based Multimedia Communications Systems
4.2 IETF STANDARDISATION
4.2.1 PSTN and Internet Interworking, PINT
4.2.2 Audio/Video Transport, AVT
4.2.3 IP Telephony, IPTEL
4.2.4 Multiparty Multimedia Session Control, MMUSIC
4.2.5 Comparing SIP with H.323
4.3 ETSI STANDARDISATION
4.4 IMTC STANDARDISATION
5. DSP FUNCTIONS
5.1 VOICE COMPRESSION
5.2 DTMF PROCESSING
5.3 DATA AND FAX FUNCTIONS
5.4 ECHO CANCELLATION
5.5 ERROR CORRECTION AND ENCRYPTION
5.6 IMPLEMENTATION OF DSP FUNCTIONS
6. CLIENT SOFTWARE AND EQUIPMENT
6.1 CLIENT SOFTWARE
6.1.1 VocalTec InternetPhone
6.1.2 Microsoft NetMeeting
6.1.3 Voxware VoxPhone Pro
6.2 DSP ACCELERATION BOARDS
6.3 GATEWAYS
6.3.1 Hypercom’s IEN 6000 Gateway
6.3.2 Blue Wave Systems’ VoIP Platform
6.4 GATEKEEPERS
6.4.1 Ericsson’s Gatekeeper
7. QUALITY OF SERVICE
7.1 REQUIREMENTS FOR SPEECH COMMUNICATION
7.2 VOIP TERMINAL USABILITY
7.3 SPEECH QUALITY ISSUES IN VOIP
7.3.1 Effects of Data Transmission on Speech Quality
7.3.2 Effects of Speech Processing on QoS
8. CONCLUSIONS
REFERENCES

 

 

1. HISTORY OF IP TELEPHONY

Voice over IP (VoIP) technology has a short history. VocalTec pioneered the Internet telephony market in 1995 with PC software that opened a voice connection between two PCs over an IP-based network. The product was ideally suited for the Internet. After that, several competing software packages were launched in quick succession. In 1996 the first interworking trials between IP networks and the PSTN were made. In 1997 Delta Three launched the first phone-to-phone service for commercial use. In the meantime, various standardisation organisations had started their work. The ITU-T H.323 standard (originally targeted at LANs and later extended to IP telephony) has gained the most attention, and H.323 compliance is sought after in the VoIP industry. The development of VoIP technology is summarised and predicted as follows (according to Jeff Pulver [1]):

1995 - The year of the Hobbyist

1996 - The year of the IP Telephony Client

1997 - The year of the Gateway

1998 - The year of the Gatekeeper

1999 - The year of the Application

 

2. BUSINESS MODELS

The three basic VoIP scenarios are depicted in figures 1, 2 and 4.

 

 

 

Figure 1. PC-to-PC scenario.

 

 

The first scenario is the oldest and also the simplest form of VoIP application. PCs are connected together over an IP network. A lot of commercial software is available for terminal PCs. Speech is compressed and decompressed by the PC software. A corporation might want to move all its internal voice and fax traffic to its own, hopefully well managed, intranet. Local and long-distance tolls are thereby avoided.

 

Figure 2. PC-to-PSTN scenario.

 

 

 

The second scenario adds interworking between a desktop PC and an ordinary fixed or mobile phone. As an example, a user in Finland may select, from a PC application or web browser, the nearest gateway of his/her ITSP (Internet Telephony Service Provider) in the destination country, for example in New York. The ITSP's gateway then sets up a local call in New York to the wanted phone number. The price of the service may be less than 50 % of the price of the long-distance call. Figure 3 shows the global network of VocalTec gateways [2], which are part of Delta Three's coverage offer.

 

 

 

Figure 3. Global network of VocalTec gateways (source: VocalTec).

 

 

In the third scenario a user makes a local PSTN call to a specific number of his/her ITSP's local gateway, enters some identification digits and dials the long-distance destination number. The ITSP's local gateway transfers the compressed speech parameters to the destination gateway, which then dials the wanted number as a local call in the same way as in scenario 2.

 

 

Figure 4. PSTN-to-PSTN scenario.

 

 

As the current driver in the VoIP business is the toll-bypass capability, it might seem that old telecom operators with huge investments will be crushed by ITSP newcomers. For new entrants the VoIP market is easy to enter, and because the technology has only a short history, newcomers accept rapid changes with less resistance. The old telecom operators, in turn, have a large customer base and ready billing and management platforms, and they own the local exchanges. They might then use VoIP technology to cut infrastructure costs and to extend their value-added services.

 

On the equipment side, the datacom vendors don't necessarily understand the real-time and delay-sensitive characteristics of speech applications as well as the telecom vendors do. The telecom vendors, in turn, lack a deeper understanding of Internet protocols. More co-operation, partnerships and acquisitions of small knowledgeable companies have happened and will happen in the near future. Neither the old telcos nor the telecom vendors have much to lose by following and researching VoIP technology, but they have very much to lose by not taking part in it. Sooner or later we are most likely again approaching the third scenario of a purely packetised network, based on IP, ATM or something else.

 

 

 

After the bypassing phase is over, the real business at the service layer will start. The integration of voice and data will be the major factor in the creation of new services. Most of the new services have not been imagined yet, but the first commercial value-added services could be:

 

 

3. MARKETS

The revenue of the total Internet telephony market follows an exponential growth curve. The total revenue is predicted to reach almost $2 billion by the end of 2001 [3]. This growth is based mainly on the widespread exploitation of toll-bypass capabilities, emerging interoperability standards, the fulfilment of multimedia communication needs, and the overall integration of voice and data networks. In the following we concentrate on the equipment vendor market.

 

Figure 5. The market share of gateway vendors in 1997 [3].

 

 

Until recently the gateway market has been dominated by stand-alone gateways offering a capacity of 12 to 100 ports. The current shift is towards integrated gateways, which pushes the smallest vendors out of the market. Integration here means the integration of different interfaces but also the integration of the VoIP platform with routing functions and/or with fixed/mobile switching platforms. An integrated gateway has advantages in scalability, shorter delay, and the possibility to prioritise traffic. It is usually the more cost-efficient choice. The price per port ranges from $300 to $2500 depending on the total size of the system.

 

The biggest gateway vendors (figure 5) and their complex relationships with telecom vendors are briefly introduced in the following. VocalTec has been there from the very beginning. VocalTec is strongly focused on software (InternetPhone etc.) but doesn't necessarily have the same hardware competence as some of the others. Micom is owned by Nortel. Vienna Systems works with Siemens. Lucent's gateways will interoperate with VocalTec's gateways. Netspeak is partially owned by Motorola, which in turn has a strong partnership with VocalTec. Some gatekeeper functionality is in many cases integrated into gateways. Separate gatekeeper products are available at least from Ericsson and VocalTec.

 

4. STANDARDISATION

There are a number of standardisation bodies working on Internet telephony. The most important ones are ITU-T, IETF, ETSI and the International Multimedia Teleconferencing Consortium (IMTC). The ATM Forum is also doing some VoIP work. In addition, there are a couple of smaller standardisation bodies such as the MIT Internet Telephony Consortium and the Technical Advisory Committee, which is working on an interworking proposal called Internet Protocol Device Control (IPDC).

4.1 ITU-T STANDARDISATION

In ITU-T, VoIP work has mainly been done in Study Group 16, which is responsible for multimedia standards for terminals, modems, protocols and signal processing [4]. The most important VoIP standards are H.323 and the standards related to it, such as H.245, H.225, H.450, G.723 and G.729.

4.1.1 H.323, Packet-Based Multimedia Communications Systems

Instead of specifying all parts of a video telephony system itself, H.323 consists of a collection of standards. Some of these, such as H.225, have been developed particularly for H.323, and others, such as G.711 and IETF RTP, have been adopted into H.323 as such. Only a small part of the system is specified in the H.323 standard itself. A logical and an OSI-based representation of H.323 can be seen in figures 6 and 7.

 

 

 

 

Figure 6. Logical architecture of H.323. Receive Path Delay is used for synchronising audio and video streams. [5]

 

 

 

 

 

Figure 7. Protocol view of H.323 (over IP). [6]

 

 

Although H.323 was originally developed for LANs, the standard is independent of the three lowest OSI layers (network layer, link layer and physical layer) [5], [6], [7]. The standard specifies a multipoint video conferencing system, and it has found most popularity in IP applications.

 

All H.323 systems must support audio. The mandatory codec is G.711 (64 kbit/s) and other supported codecs are G.722 (7 kHz bandwidth at 64, 56 and 48 kbit/s), G.723 (5.3 and 6.4 kbit/s), G.728 (16 kbit/s), G.729 (8 kbit/s) and GSM codecs (5.6 – 13 kbit/s). All of these codecs are speech codecs, but it is also possible to use audio codecs (MPEG1). [8] In VoIP H.323 systems the most popular codecs are G.723.1, G.729A and GSM Full Rate.

 

In H.323 systems video support is optional. If video is supported, the mandatory codec is H.261. In recent years video coding techniques have evolved rapidly and H.261 no longer represents the state of the art. Therefore newer codecs have been developed, namely H.263 and H.263+. Another optional application is data transfer using T.120. These data applications can include simple file transfers or advanced applications like shared whiteboards or shared office-type applications.

 

H.323 includes several standards for signaling and controlling a session. H.225 specifies a signaling protocol, called Registration, Admission and Status, which is a subset of Q.931 (signaling for ISDN). This protocol is used between all entities of a H.323 system, i.e., end-points (terminals and gateways) and gatekeepers. RAS messages are sent between end-points and gatekeepers for registration, admission control, bandwidth control, sending status information and setting up connections between the network entities. For call setup H.225 specifies a separate call signaling protocol, which is used over a separate signaling channel, set up by RAS signaling. In case of VoIP, these two signaling channels are sent over two separate TCP/IP connections. After call setup, the RAS signaling channel may be closed. [6]

 

At call setup and also during a call, H.245 messages are used for exchanging information about the capabilities of the end-points. H.245 also provides messages for opening and closing media streams (audio, video and data), for flow control and for other general commands.

 

Media streams are transported using IETF’s Real-time Transport Protocol (RTP). RTP includes identification of payload type, sequence numbering, time-stamping and delivery monitoring. In the case of VoIP, RTP packets are carried over the User Datagram Protocol (UDP), which provides only unreliable transport, though with a checksum. To monitor and control the reliability of RTP media streams, another IETF protocol is used, namely the Real-time Transport Control Protocol (RTCP). This protocol can also be used for synchronising audio and video streams.
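The RTP fields mentioned above (payload type, sequence number, timestamp) can be illustrated with a minimal sketch that packs and unpacks the 12-byte fixed RTP header as laid out in RFC 1889. The function names and the example values are illustrative assumptions, not part of any particular product; header extensions, padding and CSRC lists are deliberately left out.

```python
import struct

def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Build the 12-byte fixed RTP header (version 2, no padding/extension/CSRC)."""
    byte0 = 2 << 6                                   # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

def unpack_rtp_header(data):
    """Parse the fixed header back into its fields."""
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    return {"version": b0 >> 6, "marker": bool(b1 >> 7),
            "payload_type": b1 & 0x7F, "seq": seq,
            "timestamp": ts, "ssrc": ssrc}

# Example: payload type 0 is PCMU (G.711 u-law) in the RTP audio/video profile.
# The sequence number, timestamp and SSRC below are arbitrary example values.
header = pack_rtp_header(payload_type=0, seq=1, timestamp=160, ssrc=0x1234)
fields = unpack_rtp_header(header)
```

The receiver uses the sequence number to detect lost or reordered packets and the timestamp to schedule playout; neither UDP nor RTP retransmits anything.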

 

Using gateways, H.323 systems can interwork with different types of networks (H.246). This is shown in figure 8. H.323 version 2 also includes supplementary services (H.450) and security and encryption procedures (H.235). [8]

 

 

 

 

Figure 8. H.323 gateway scenarios. (MCU is Multipoint Control Unit.) [5]

 

4.2 IETF STANDARDISATION

VoIP work in Internet Engineering Task Force is done in four Working Groups (WGs): PINT, AVT, IPTEL and MMUSIC.

4.2.1 PSTN and Internet Interworking, PINT

PINT is working on service interworking between PSTN and IP. [9] The idea is that interworking enriches current services and also new services will be developed. Examples of services are click-to-dial and click-to-fax. PINT is drafting a Service Support Transfer Protocol (SSTP). Important interworking issues are, e.g., security and access to Intelligent Network (IN) services.

4.2.2 Audio/Video Transport, AVT

Audio/Video Transport WG develops protocols for real-time transmission of audio and video over multicast UDP/IP. [10] The idea is that these protocols can be used by large scale systems in conjunction with resource management protocols. In the future, this will enable low-delay services with bandwidth control. The most important RFC is "RTP: A transport protocol for real-time applications" (RFC 1889). In addition there are a number of RFCs related to RTP payload formats. Important issues are sequence numbering and time-stamping of streams. Later, the AVT WG will develop simple control protocols for authentication and encryption.

4.2.3 IP Telephony, IPTEL

This IETF WG develops protocols for signaling and capability exchange between VoIP terminals and other VoIP equipment. [11] IPTEL is working on two such protocols. "Call processing syntax" will define advanced methods for call setup, where the callee can also give input. Services using this protocol will also be defined. The other protocol concerns call routing between VoIP gateways (GWs). The selection of a GW should be based on many criteria, such as client and service provider preferences, the availability of GWs and of course the destination address. This protocol should enable scalable and bandwidth-efficient VoIP networks.
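To make the gateway selection criteria more concrete, the following is a purely hypothetical sketch; it is not the IPTEL protocol, whose details are not given here. It assumes each gateway advertises a dialled-number prefix, an availability flag and a cost that stands in for client and service provider preferences; all names and numbers are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Gateway:
    name: str         # hypothetical gateway identifier
    prefix: str       # dialled-number prefix the gateway serves
    available: bool   # whether the gateway currently has free capacity
    cost: float       # stand-in for client/provider preference (lower is better)

def select_gateway(gateways, number):
    """Pick an available gateway with the longest matching prefix, then lowest cost."""
    candidates = [g for g in gateways
                  if g.available and number.startswith(g.prefix)]
    if not candidates:
        return None
    return min(candidates, key=lambda g: (-len(g.prefix), g.cost))

gateways = [
    Gateway("us-generic", "1", True, 0.10),
    Gateway("nyc-a", "1212", True, 0.05),
    Gateway("nyc-b", "1212", False, 0.01),   # cheapest, but unavailable
]
chosen = select_gateway(gateways, "12125550100")
```

Preferring the longest prefix keeps the PSTN leg of the call as local as possible, which is the point of toll bypass; the cost field only breaks ties between equally specific gateways.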

4.2.4 Multiparty Multimedia Session Control, MMUSIC

MMUSIC bases its work on Mbone technology. [12] The WG aims at developing protocols for teleconferencing over Internet. Examples of protocols drafted by MMUSIC are:

 

 

SIP is an important draft and is a strong competitor to H.323.

4.2.5 Comparing SIP with H.323

Although H.323 is the most widespread VoIP protocol, there has been a lot of criticism against it. First of all, it is not an IP type of protocol. Instead of being flexible and lightweight, it is a complex set of standards that use connection-oriented signaling based on ISDN (Q.931). SIP is a typical IP protocol, and it is expected that SIP will gain acceptance quite fast and start competing against H.323 [13], [14], [15]. Already now MCI, for example, favors SIP over H.323.

 

The complexity of H.323 is clearly shown, e.g., in call setup, which requires about 10 messages and takes about 20 to 30 s. SIP requires only four messages for call setup. Also, because the H.323 protocols are so complex, gateways from different vendors cannot interoperate properly. Some improvements to this are expected with versions 2 and 3 of the standard. When comparing feature by feature, H.323 has more to offer. SIP, however, is a more open and scalable protocol, and SIP can actually interoperate with H.323. H.323 is backed by many large computer and telecommunication companies such as Microsoft, Intel and Ericsson, and therefore it will remain the leading VoIP standard for several years to come.

4.3 ETSI STANDARDISATION

ETSI VoIP activity is centered around an ETSI project called TIPHON (Telecommunications and Internet Protocol Harmonization Over Networks). The mission of TIPHON is to combine IP with other telecommunication technologies to enable VoIP networks to interwork with Switched Circuit Networks (SCN). [16] TIPHON will develop service oriented solutions that a variety of operators can use. Wherever possible, TIPHON will use available standards, of which the most important one is H.323 (version 2). Even though ETSI is working in Europe, TIPHON deliverables are aimed at gaining world-wide acceptance. Companies supporting TIPHON are, e.g., AT&T, Cisco, Ericsson, Lucent, Intel, Microsoft, Motorola, Nokia, Nortel, Siemens, Philips and Telia.

 

Main work items of TIPHON are:

 

 

These items are addressed in six work groups, which relate loosely to the above list:

 

 

In addition TIPHON has a Specialist Task Force (STF 114), with two full-time members.

 

 

 

 

TIPHON is working with four scenarios:

 

 

4.4 IMTC STANDARDISATION

The objective of the International Multimedia Teleconferencing Consortium is to bring together all organisations involved in the development of multimedia teleconferencing products and services, to help create and promote the adoption of the required standards. [17] This means that IMTC does not develop standards itself and is mainly focused on ITU-T standards such as T.120, H.320, H.323 and H.324. The organisation consists of 140 member organisations, mainly telecommunication manufacturers such as 3Com, Alcatel, Cisco, Ericsson, IBM, Lucent, NEC, Nokia, Nortel and Philips. The key activities of IMTC are:

 

 

5. DSP FUNCTIONS

A gateway implements a wide range of digital signal processing (DSP) functions. The most important DSP function is voice compression. Other DSP functions include echo cancellation, DTMF tone detection and generation, modem and fax detection and demodulation, and error detection. These are discussed in detail in the following subsections.

5.1 VOICE COMPRESSION

Voice is compressed to save network capacity and to enable VoIP traffic over low-speed modem connections. Typical compression ratios range from 4:1 to 10:1. Additional compression (about 2:1) is gained by using a Voice Activity Detection (VAD) function and discontinuous transmission (DTX), i.e. only voice is transmitted, not silence. A block diagram of the voice activity detection of the G.723 codec [18] is depicted in figure 9. The purpose of the VAD is to reliably detect the presence or absence of speech and to convey this information to the Comfort Noise Generation (CNG) algorithm. The CNG algorithm creates noise parameters that match the background noise and sends them in Silence Insertion Descriptor (SID) frames to the receiving end. Comfort noise generation is used to avoid unpleasant noise modulation when transmission is switched off.
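The interplay of VAD, DTX and SID frames can be sketched with a toy energy-threshold detector. This is not the VAD of G.723, which is considerably more elaborate; the threshold, the hangover length and the rule of sending a single SID frame at the start of each silence period are simplifying assumptions made only for illustration.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def dtx_decisions(frames, threshold=0.01, hangover=1):
    """Classify each frame as 'VOICE' (send speech), 'SID' (send noise
    parameters once) or 'NONE' (send nothing)."""
    decisions = []
    hang = 0            # frames still treated as speech after speech ends
    sid_sent = False    # whether this silence period already has its SID
    for frame in frames:
        if frame_energy(frame) >= threshold:
            hang = hangover
            sid_sent = False
            decisions.append("VOICE")
        elif hang > 0:
            hang -= 1                   # hangover avoids clipping word endings
            decisions.append("VOICE")
        elif not sid_sent:
            sid_sent = True
            decisions.append("SID")
        else:
            decisions.append("NONE")
    return decisions

loud = [0.5, -0.5, 0.5, -0.5]
quiet = [0.01, -0.01, 0.01, -0.01]
result = dtx_decisions([loud, loud, quiet, quiet, quiet, loud])
```

In this example half of the frames are silent, yet only one of them costs any transmission capacity, which is where the additional gain of roughly 2:1 comes from in ordinary conversation.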

 

 

Figure 9. A block diagram of voice activity detection (G.723).

 

 

Three factors related to voice compression are complexity, delay and quality. The algorithm complexity increases hand in hand with compression ratio and/or achieved speech quality (see figure 10 and table 1).

 

 

Figure 10. A quality comparison of some speech codecs (source AudioCodes [19]).

 

 

The most common speech codecs on the VoIP market are proprietary solutions and the H.323-compliant low-rate codecs G.723.1 (the 6.4 kbps rate is a TrueSpeech variant) and G.729 [20]. Table 1 summarises the features of some of these codecs.

 

 

Codec            Rate (kbps)   Complexity   Quality (MOS)   Frame size
G.711 a/u-law    64            -            4.1             -
G.723.1          6.4/5.3       high         3.9             30 ms
G.726 ADPCM      32            low          3.8
G.728            16            low          3.6
G.729            8             medium       3.9             10 ms
G.729A           8             medium       <3.9            10 ms
G.729B (DTX)     8             medium       3.9             10 ms
GSM FR           13            low          3.5             20 ms
GSM HR           5.6           high         3.5             20 ms
GSM EFR          12.2          high         >4              20 ms

 

 

Table 1. The most common speech codecs in VoIP applications.
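The rates and frame sizes in Table 1 translate directly into compression ratios and payload sizes. The sketch below works this out for the codecs whose frame size is listed. It additionally estimates the bit rate seen on the IP network, assuming one codec frame per packet and 40 bytes of IPv4, UDP and RTP headers per packet (20 + 8 + 12 bytes, without options); these packetisation assumptions are not from the table and real products may bundle several frames per packet.

```python
# Rates (bit/s) and frame sizes (ms) as listed in Table 1.
CODECS = {
    "G.723.1 (6.4)": (6400, 30),
    "G.723.1 (5.3)": (5300, 30),
    "G.729": (8000, 10),
    "GSM FR": (13000, 20),
    "GSM HR": (5600, 20),
    "GSM EFR": (12200, 20),
}
PCM_RATE = 64000                  # G.711 reference rate in bit/s
HEADER_BITS = (20 + 8 + 12) * 8   # assumed IPv4 + UDP + RTP headers per packet

def codec_figures(rate_bps, frame_ms, frames_per_packet=1):
    """Compression ratio against 64 kbit/s PCM, payload size and IP-level rate."""
    payload_bits = rate_bps * frame_ms * frames_per_packet // 1000
    packet_ms = frame_ms * frames_per_packet
    return {
        "compression_ratio": PCM_RATE / rate_bps,
        "payload_bits": payload_bits,
        "ip_rate_bps": (payload_bits + HEADER_BITS) * 1000 / packet_ms,
    }

for name, (rate, frame) in CODECS.items():
    f = codec_figures(rate, frame)
    print(f"{name:14s} {f['compression_ratio']:5.1f}:1 "
          f"{f['payload_bits']:4d} bits/frame "
          f"{f['ip_rate_bps'] / 1000:6.1f} kbit/s on IP")
```

Under these assumptions the packet headers outweigh the payload for every low-rate codec, so short frames trade bandwidth for delay: bundling more frames per packet lowers the IP-level rate but adds one frame time of delay per extra frame.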

 

 

If automatic gain control (AGC) and speech enhancement techniques like noise cancellation are implemented, they are usually combined with the speech codec. As an example of a complex encoder structure (which justifies implementation on signal processors), the block diagram of the G.723.1 codec is depicted in figure 11.

 

 

 

Figure 11. Block diagram of the G.723.1 voice codec.

 

 

The G.723.1 codec encodes speech or other audio signals in frames using linear predictive analysis-by-synthesis coding. The excitation signal for the high rate coder (6.4 kbps) is Multipulse Maximum Likelihood Quantization (MP-MLQ) and for the low rate coder (5.3 kbps) Algebraic-Code-Excited Linear-Prediction (ACELP). The frame size is 30 ms and there is an additional look ahead of 7.5 ms, resulting in a total algorithmic delay of 37.5 ms. All additional delays in the coder are due to processing delays of the implementation, transmission delays in the communication link and buffering delays of the multiplexing protocol.
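The delay arithmetic above can be written out as a small helper. The frame and look-ahead values are those given for G.723.1 in the text; the `frames_per_packet` parameter is an added assumption that models a sender bundling several frames into one packet, where each extra frame adds one frame time of waiting.

```python
def algorithmic_delay_ms(frame_ms, lookahead_ms, frames_per_packet=1):
    """Time the encoder must wait before the first frame of a packet can leave.

    A whole frame plus the look-ahead has to be collected before encoding;
    every further frame bundled into the same packet adds one frame time.
    """
    return frames_per_packet * frame_ms + lookahead_ms

G7231_FRAME_MS = 30.0
G7231_LOOKAHEAD_MS = 7.5

single = algorithmic_delay_ms(G7231_FRAME_MS, G7231_LOOKAHEAD_MS)
bundled = algorithmic_delay_ms(G7231_FRAME_MS, G7231_LOOKAHEAD_MS,
                               frames_per_packet=2)
print(single, bundled)
```

This figure is only the lower bound set by the algorithm; processing, transmission and buffering delays come on top of it.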

5.2 DTMF PROCESSING

Speech is compressed in gateways to save capacity on the IP link. DTMF (Dual Tone Multi Frequency) tones are corrupted if they are compressed (e.g. with G.723.1) in the same way as speech. The tones may sound all right to a human listener, but they are probably degraded beyond the specifications of tone-controlled devices. DTMF tone detection is therefore usually implemented in gateways so that the tones can bypass the codec (see figure 12).

 

 

 

Figure 12. DTMF-handling in stand-alone gateways.

 

 

A tone is detected in the near-end gateway (coming from the left). Signaling (e.g. Frame Relay Forum FRF.11 annex A [21] or some proprietary format) is used to transfer the tone to the far-end gateway, where it is regenerated onto the 64 kbps line. Meanwhile the speech channel may be idle.
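One common way to implement the tone detection step is the Goertzel algorithm, which measures the signal power at each of the eight standard DTMF frequencies. The sketch below is an illustrative assumption about how a detector might work, not the method of any particular gateway; the block length of 205 samples at 8 kHz and the acceptance threshold are example choices, and a production detector would add twist, duration and talk-off checks.

```python
import math

ROWS = (697, 770, 852, 941)        # DTMF low-group frequencies in Hz
COLS = (1209, 1336, 1477, 1633)    # DTMF high-group frequencies in Hz
KEYS = ("123A", "456B", "789C", "*0#D")

def goertzel_power(samples, freq, rate=8000):
    """Squared magnitude of the signal component at `freq` (Goertzel recursion)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def detect_dtmf(samples, rate=8000, min_share=0.5):
    """Return the detected key, or None if the block does not look like DTMF."""
    energy = sum(x * x for x in samples)
    if energy == 0.0:
        return None
    row_p = [goertzel_power(samples, f, rate) for f in ROWS]
    col_p = [goertzel_power(samples, f, rate) for f in COLS]
    r = row_p.index(max(row_p))
    c = col_p.index(max(col_p))
    # For two pure tones the two Goertzel powers sum to about energy * N / 2,
    # so the share is close to 1 for DTMF and small for speech-like signals.
    share = (row_p[r] + col_p[c]) / (energy * len(samples) / 2)
    return KEYS[r][c] if share >= min_share else None

def make_tone(f1, f2=None, n=205, rate=8000):
    """Generate a test block containing one or two unit-amplitude sine waves."""
    return [math.sin(2 * math.pi * f1 * i / rate)
            + (math.sin(2 * math.pi * f2 * i / rate) if f2 else 0.0)
            for i in range(n)]

key = detect_dtmf(make_tone(770, 1336))   # the frequency pair of key "5"
```

Once a key is detected, the gateway can send it as a signaling event instead of passing the tone through the speech codec.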

 

The most usual cases of DTMF-tone generation are depicted in figure 13. When a key is pressed on a fixed phone, the tone is generated by the phone itself. When a call is being set up, the call progress tones (line busy, ringing etc.) are received from the tone generator (TG) of the far-end MSC/FSC. When a key is pressed on a mobile phone, a signaling message is forwarded to the near-end MSC, which then generates the outgoing tone. The far-end case goes as before.

 

Figure 13. Sources of DTMF-tones.

 

5.3 DATA AND FAX FUNCTIONS

Fax and dial-up modems use the very same 64 kbps PCM channel as speech does. Speech channels are compressed by the codec for transmission, but if a fax/modem signal is put through the same procedure, the fax/modem connection fails completely. A special fax/modem detector or external signaling is needed to route speech and fax/modem signals appropriately. The possible solutions are depicted in figure 14:

 

 

 

Figure 14. Solutions to support modem/fax connections.

 

 

The first case saves transmission capacity (64 kbps to 9.6 kbps) but adds two modems at both endpoints of the network-side transmission link. Current products support demodulation by the internal DSP at rates up to 14.4 kbps. Faster rates (like V.90) may be detected and then routed without demodulation in 64 kbps PCM format, or they can be routed to external modem boards. The fax case is illustrated in more detail in figure 15.

 

Figure 15. VoIP fax demodulation and remodulation (source Micom [22])

5.4 ECHO CANCELLATION

Two-wire to four-wire conversion (the hybrid) between the phone and the central office switch gives rise to echo in fixed networks (figure 16). The so-called near-end echo comes back so fast that it is imperceptible, but the far-end echo must be cancelled because the transmission delay alone may make it noticeable. In mobile networks there are no hybrid circuits, but the round-trip delay is much longer (due to e.g. channel and voice coding). Acoustic echo may also cause perceivable degradation in both cases.

 

 

Figure 16. Sources of echo.

 

 

 

Figure 17. Bypassing and echo cancellation.

 

 

In bypassing applications echo cancellers could be placed as in figure 17. Echo cancellation may be integrated into the gateway, or external cancellers may be used. Echo is cancelled from the 'uplink' direction, so the delay is approximately fixed (and no echo is let loose into the IP network).

 

 

 

Figure 18. PSTN-to-native IP call and echo cancellation.

 

 

In VoIP systems where a long IP network delay is present, echo cancellation is absolutely required, especially in PSTN to native VoIP interworking (figure 18), where the VoIP terminal doesn't necessarily have an acoustic echo canceller (B) at all. If that is the case, the acoustic echo should be cancelled from the decoded signal in the gateway (A).

 

 

 

Figure 19. A block diagram of echo canceller (from G.165 [23]).

Figure 19 depicts the block diagram of an echo canceller according to standard G.165 [23]. The echo canceller reduces the near-end echo present on the send path by subtracting an estimate of that echo from the near-end echo. Some fundamental requirements are set on echo cancellers: rapid convergence, a subjectively low returned echo level during single talk and low divergence during double talk.
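The principle of subtracting an echo estimate can be sketched with a normalised LMS (NLMS) adaptive filter. G.165 sets performance requirements rather than prescribing an algorithm, so NLMS is an assumption chosen here because it is a widely used technique; the filter length, step size and the simulated echo path are invented example values, and double-talk detection is omitted.

```python
import random

def nlms_echo_canceller(far_end, near_end, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the far-end echo from the near-end signal.

    Returns the residual (echo-cancelled) signal, sample by sample.
    """
    weights = [0.0] * taps       # current estimate of the echo path
    history = [0.0] * taps       # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, near_end):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        e = d - estimate                           # what is left after subtraction
        norm = sum(h * h for h in history) + eps   # input power normalisation
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual

# Simulated single-talk case: the near-end signal is pure echo of the far end.
rng = random.Random(0)
far = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.0, 0.5, 0.3, -0.1]                  # hypothetical hybrid response
near = [sum(echo_path[k] * far[n - k]
            for k in range(len(echo_path)) if n - k >= 0)
        for n in range(len(far))]
res = nlms_echo_canceller(far, near)

echo_power = sum(v * v for v in near[-500:])
residual_power = sum(v * v for v in res[-500:])
```

The number of taps must cover the echo path delay, which is why cancellers are placed close to the hybrid: the tail to be modelled stays short and roughly fixed. During double talk the adaptation would have to be frozen, otherwise the near-end speech disturbs the estimate.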

5.5 ERROR CORRECTION AND ENCRYPTION

Forward error correction (FEC) for voice frames is not a very relevant issue in IP networks, because it typically cannot handle the usual type of transmission error, the loss of whole packets (interleaving is out of the question because of the inherent extra delay). Nevertheless, currently available products usually include some sort of error detection function, typically a plain CRC (which is not FEC), which can be used to request retransmission of corrupted frames. Usually there is no time for retransmission, and the CRC check is just used as an indication for the bad frame handling procedure.
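The use of a CRC as a trigger for bad frame handling can be sketched as follows. The choice of CRC-32 (from Python's zlib module) and the policy of repeating the last good frame are assumptions made for illustration; the text does not say which checksum or concealment method actual products use.

```python
import zlib

def protect(frame: bytes) -> bytes:
    """Append a CRC-32 checksum to an encoded voice frame."""
    return frame + zlib.crc32(frame).to_bytes(4, "big")

def receive(packets):
    """Check each packet; on loss or corruption repeat the last good frame.

    A lost packet is represented by None. The result contains one frame per
    input packet (None if nothing good has been received yet).
    """
    last_good = None
    played = []
    for pkt in packets:
        ok = (pkt is not None and len(pkt) > 4
              and zlib.crc32(pkt[:-4]) == int.from_bytes(pkt[-4:], "big"))
        if ok:
            last_good = pkt[:-4]
        played.append(last_good)      # bad frame handling: repeat previous frame
    return played

good = protect(b"frame-1")
second = protect(b"frame-2")
corrupted = bytes([second[0] ^ 0x01]) + second[1:]    # one flipped bit
played = receive([good, corrupted, None, protect(b"frame-4")])
```

No retransmission is requested here, which matches the observation that there is usually no time for it in a real-time voice stream.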

 

Some vendors like Clarent/Fortress [24] offer encryption for VoIP-based VPNs, but it is not a widespread function yet.

5.6 IMPLEMENTATION OF DSP FUNCTIONS

Digital signal processors are used to implement the DSP functions discussed above. Low-level (assembly) software development is quite a tedious task, and in many cases the DSP software is bought ready-made. For example, AudioCodes [19] has a strongly integrated software package for their dedicated DSP chip (figure 20). Several gateway vendors use this chip in their hardware implementations.

 

Figure 20. Block diagram of DSP software for AudioCodes chip (source AudioCodes [19]).

6. CLIENT SOFTWARE AND EQUIPMENT

6.1 CLIENT SOFTWARE

Internet telephony client software is developing towards ever more feature-rich conferencing packages. Document sharing together with audio and/or video conferencing capabilities sets heavy requirements on the network connection (a fixed IP access is naturally preferred). A full-duplex sound card is also a basic requirement for two-way conversation. Despite the promised H.323 compliance, real interoperability between different client software is still lacking. Proprietary voice compression algorithms may bring some benefit in voice quality, but this again restricts interoperability. Directory services, like Microsoft Internet Locator Server (ILS, ils.microsoft.com), formerly known as User Location Service (ULS), enable users to find each other without knowing their exact IP addresses. Three client software packages are briefly reviewed in the following. Their range of features is much the same; only the names differ slightly.

6.1.1 VocalTec InternetPhone

InternetPhone 5.0 [25] offers PC-to-Phone communication using VocalTec's gateway technology. It promises enhanced audio and video quality and includes a full suite of multimedia features with voice-mail, file exchange and document sharing capabilities, but it is not H.323 compliant. VocalTec Communication Client (VCC), which is currently in the public beta testing phase, will extend especially the document sharing capabilities (including full support for Microsoft Office). The main features of InternetPhone 5.0 are listed below:

 

 

6.1.2 Microsoft NetMeeting

NetMeeting 2.1 [26] supports the H.323 standard (ITU G.711 and G.723 audio) and the Internet Engineering Task Force (IETF) RTP and RTCP specifications for controlling audio flow to improve voice quality. On MMX-enabled computers, NetMeeting uses MMX-enabled audio codecs to improve the performance of the audio compression and decompression algorithms. This should result in lower CPU use and improved audio quality during a call. The main features of Microsoft NetMeeting 2.1 are:

 

6.1.3 Voxware VoxPhone Pro

VoxPhone™ Pro 3.0 [27] is H.323-compliant and promises to completely interoperate with all three major clients in the industry - Netscape Conference, Microsoft NetMeeting and Intel Video Phone. VoxPhone™ Pro 3.0 incorporates Voxware's proprietary voice compression technology including RT29, RT24 and RT28 codecs. VoxPhone main features are:

 

6.2 DSP ACCELERATION BOARDS

The processing delay of the voice compression algorithm could be decreased significantly, and the host's CPU resources saved, by using an add-on DSP board with the client software. The extra cost, and the fact that it is much easier to update client software than embedded DSP software, have kept add-on DSP boards from becoming a great success. As an example, Quicknet InternetPhoneJACK [28] is an add-on board with full-duplex audio, built-in echo cancellation and support for DSP-based voice compression/decompression. It supports multiple clients (VocalTec InternetPhone, Microsoft NetMeeting), and a standard phone may be attached to it and used to make and receive Internet phone calls (e.g. the phone rings when an Internet call arrives).

6.3 GATEWAYS

The first large-scale IP telephony networks have been built using gateways (GWs) between IP and the PSTN. This also means that VoIP GWs are the first larger-scale VoIP market.

 

First VoIP GWs were built using PCs as a platform and adding DSP and I/O cards as well as VoIP software. This approach works for small sized office type of systems. The next step was to add more DSPs and to stack several computers to form a larger system. This type of system lacks basic requirements of telecommunication equipment, which are fault tolerance and centralised operation and maintenance.

 

In order to meet the requirement for larger-capacity GWs, specialised GWs have been developed. These can be divided into dedicated GWs and integrated GWs. A dedicated GW is a standalone device consisting of a computer chassis with, e.g., a Compact PCI or VME backplane bus and a number of plug-in units. A typical configuration includes one master processor plug-in unit, ten DSP plug-in units and a couple of I/O plug-in units. An integrated GW is based on some existing network element such as a router or a Remote Access Concentrator (RAC). Router-based GWs have the advantage that one delay source is removed, since the data packets are put directly into and taken directly from the routing queues. [29] It is also possible to put real-time speech packets ahead of non-real-time data packets. Integrating two devices into one should provide a more cost-efficient solution. A disadvantage of integrated GWs is that there is less flexibility in using separate VoIP control protocols, e.g., when a gatekeeper is used.
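Putting speech packets ahead of data packets in a router's output queue can be illustrated with a strict-priority queue. The class below is a hypothetical sketch of the idea, not the queueing discipline of any specific router; real equipment would typically also bound the voice share so that data traffic is not starved.

```python
import heapq
from itertools import count

class PriorityOutputQueue:
    """Strict-priority output queue: voice packets always leave before data."""
    VOICE, DATA = 0, 1

    def __init__(self):
        self._heap = []
        self._arrival = count()   # tie-breaker keeps FIFO order within a class

    def enqueue(self, packet, is_voice):
        priority = self.VOICE if is_voice else self.DATA
        heapq.heappush(self._heap, (priority, next(self._arrival), packet))

    def dequeue(self):
        """Return the next packet to transmit, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

queue = PriorityOutputQueue()
queue.enqueue("data-1", is_voice=False)
queue.enqueue("voice-1", is_voice=True)
queue.enqueue("data-2", is_voice=False)
queue.enqueue("voice-2", is_voice=True)
order = [queue.dequeue() for _ in range(4)]
```

With this discipline a speech packet waits at most for the data packet already being transmitted, instead of for everything queued ahead of it.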

 

Some manufacturers offer integrated or dedicated VoIP GWs with up to a thousand voice ports, such as Hypercom’s IEN 6000. [29] There are plans for systems even ten times larger. Still, the market is lacking true carrier-class VoIP GWs integrated into switching centers. The first large-scale, carrier-class VoIP GWs will probably be introduced during next year.

 

A key issue for VoIP manufacturers is how they are going to implement interworking with SS7. The current consensus is that this will be done by a separate signaling GW entity. The other part of the GW, the media GW, would implement at least:

 

 

Only the first function is specified in H.323, but it is expected that the other two will be added in version 3 of the standard. [15]

 

In order to give a better view on GWs, two different GW products are presented. The first one is something between a dedicated and an integrated large-scale GW and the second one is a DSP-card, which can be used as a basis for implementing a GW.

6.3.1 Hypercom’s IEN 6000 Gateway

A typical example of a standalone VoIP gateway is Hypercom’s IEN 6000, shown in figure 21. [30] Hypercom calls it a multiservice access gateway for voice, fax and data service over IP, Frame Relay (FR) and ATM. The IEN 6000 has full hardware redundancy, with dual power supplies and hot-swappable processor modules. The data is put through an integrated cell/packet/circuit switch, and the gateway also has an integrated LAN router. The forwarding rate of the router is 150 000 packets per second and the internal bus speed is 344 Mbit/s.

 

 

Figure 21. Hypercom IEN 6000 [30]

 

 

The IEN 6000 supports several types of WAN interfaces: FR, ATM, HDLC, ISDN, X.25, IP and PPP/SLIP. In addition, it has both analog and digital voice interfaces, with up to 960 fax or voice connections on a single node. For VoIP applications the gateway supports the G.711, G.723.1 and G.729A speech codecs. It also supports G.165 echo cancelling and silence suppression. Quality of Service (QoS) can be improved by protocol prioritisation, congestion control and end-to-end packet recovery. A Hypercom innovation called Modem Relay is used for sending analog modem calls over an IP network.

6.3.2 Blue Wave Systems’ VoIP Platform

An example of a state-of-the-art VoIP board is Blue Wave Systems’ CPCI/C6400 TMS320C6x Telecommunications Platform. [31] This platform is a Compact PCI 6U card, which is also available in a regular PCI format. (See figure 22.) In the product brochure [31], the card is said to be optimised for multichannel telecommunication processing, such as modem pools and VoIP applications. The board has several processors: one Motorola MPC860 PowerQUICC control processor and up to four Texas Instruments TMS320C6201 DSPs. Each processor can have up to 16 Mb of SDRAM. In addition, the control processor has 2 Mb of flash memory and each DSP has 512 kb of SBSRAM. The platform has several interfaces:

 

Figure 22. Blue Wave Systems CPCI/C6400 TMS320C6x Telecommunications Platform. (Those four white chips are Texas Instruments TMS320C6201 DSPs.) [31]

 

 

The two serial ports of each DSP as well as the control processor and the PMC site are connected to an ECTF H.110 compliant switch array (Lucent T8100). This means that, e.g., traffic coming from a PMC E1/T1 card can be switched through the T8100 to a DSP and back through the T8100 to the control processor and further out through the Ethernet interface. The benefit of this type of operation is that traffic channels don’t have to go via the host. The T8100 enables each DSP to access 192 bi-directional 64 kbit/s channels.

6.4 GATEKEEPERS

In order to build a functional VoIP network, some entity must have control of the network. This entity is called a gatekeeper (GK) and it has been added, e.g., to H.323 version 2. Still, it is not completely clear what the functions of the GK are. At the least, a GK should authenticate, register and control all VoIP equipment that belongs to its control zone. This could also include VoIP terminals. A GK should also provide address translation between VoIP and other networks. Optional GK features are, e.g., call control signaling, call authorisation and management, and bandwidth management.
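
The mandatory and optional GK functions listed above can be sketched as a small in-memory registry. This is a hypothetical illustration and not the H.323 RAS protocol: the class Gatekeeper, its method names, the boolean credentials check and the single shared bandwidth pool are all assumptions made for the example.

```python
class Gatekeeper:
    """Toy gatekeeper: registration, address translation, bandwidth control."""

    def __init__(self, zone_bandwidth_kbit_s):
        self._endpoints = {}                     # alias -> transport address
        self._free_kbit_s = zone_bandwidth_kbit_s

    def register(self, alias, address, credentials_ok=True):
        """Register an endpoint in the zone if it passes authentication."""
        if not credentials_ok:
            return False
        self._endpoints[alias] = address
        return True

    def translate(self, alias):
        """Translate an alias (e.g. a phone number) to a transport address."""
        return self._endpoints.get(alias)

    def admit(self, caller, callee, kbit_s):
        """Admit a call only between registered endpoints with bandwidth left."""
        if caller not in self._endpoints or callee not in self._endpoints:
            return False
        if kbit_s > self._free_kbit_s:
            return False
        self._free_kbit_s -= kbit_s
        return True

    def release(self, kbit_s):
        """Return the bandwidth of a finished call to the zone."""
        self._free_kbit_s += kbit_s
```

A real GK would also have to handle registration timeouts, several zones and the signaling towards gateways, all of which are left out here.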

6.4.1 Ericsson’s Gatekeeper

Since the gatekeeper (GK) functionality is not particularly hardware dependent, Ericsson has designed their GK software using Java. [32] This means that the GK can run on almost any computer platform. This approach has a number of advantages, such as scalability, easy portability and fast upgradability. Scalability is an important issue for Ericsson, since they want to build carrier-class solutions. Here are the main features of the Ericsson GK:

 

 

7. QUALITY OF SERVICE

In VoIP networks it is hard to achieve the same level of Quality of Service (QoS) as in POTS. There are several reasons for this. One obvious problem area is speech quality (which is thoroughly covered in this paper). There are also other, vaguer issues, such as service accessibility and usability. In [16] QoS is defined as: "The collective effect of service performance which determine the degree of satisfaction of a user from the service." (This definition has been taken from ITU-T specifications.)

7.1 REQUIREMENTS FOR SPEECH COMMUNICATION

ITU-T has a number of specifications concerning speech quality in PSTN. These specifications cannot be used as such for VoIP networks, since it is not feasible to meet all the requirements, e.g., the delay requirements of G.114. [33] presents a few basic facts of speech and speech quality in telecommunication networks.

In a two-way conversation ~60% of the time is only background noise.

This indicates that packet transmission could be efficient for speech communication. Excluding those parts of the speech signal that contain only noise or other background sounds is called silence suppression.

Speech communication is sensitive to delay and especially variations in delay, i.e., delay jitter.

This is one of the most difficult problems in VoIP networks. All parts of a VoIP system have to be studied in order to be able to reduce delays. Buffering must be used at the edges of the packet network to reduce the effect of delay jitter.

Speech is not so sensitive to random bit errors, as long as the signal to noise ratio stays over ~30 dB.

In speech communication no one expects to have a "CD-quality" connection. People are used to noisy analog phone lines and the frequency bandwidth is anyhow limited to 3.4 kHz. With advanced speech coding and data transmission, it is even possible to completely recover from random bit errors.

Losing complete blocks of speech can introduce severe artifacts and reduce intelligibility.

Packet loss is a problem in VoIP networks, especially if the network is heavily loaded or used for other traffic, such as bursty data transmissions. A network node (e.g., a router) starts dropping packets when the load reaches the capacity of the node. In order to avoid packet loss, VoIP networks should be built with some spare capacity. Spare capacity also reduces delay and delay jitter, and jitter is itself related to packet loss: if smaller delay jitter buffers are used at network edges, more packets are lost, as they arrive too late.

If speech coding or some other form of speech compression is used, it should modify the speech signal as little as possible.

Speech quality degradation is almost negligible with today’s state-of-the-art speech coding techniques, even with speech bandwidth reduced to 1/10th of the original. It is actually desirable to use speech coding, since lower bit rate reduces delays and packet loss in VoIP networks.
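
The effect of speech coding on the bit rate actually seen by the IP network can be estimated with a short calculation. The sketch below assumes 40 bytes of headers per packet (20 bytes IPv4 without options, 8 bytes UDP, 12 bytes RTP without extensions) and ignores link-layer framing. The function name and the choice of packet intervals are assumptions made for the example.

```python
# Assumed per-packet overhead: IPv4 (20) + UDP (8) + RTP (12) bytes.
HEADER_BYTES = 20 + 8 + 12


def ip_bandwidth_kbit_s(payload_bytes, packet_interval_ms):
    """Bit rate at the IP layer for one direction of one call."""
    bits_per_packet = (payload_bytes + HEADER_BYTES) * 8
    return bits_per_packet / packet_interval_ms  # bits per ms equals kbit/s


# G.711 at 64 kbit/s, sent as 160-byte payloads every 20 ms.
g711 = ip_bandwidth_kbit_s(160, 20)

# G.723.1 at 6.3 kbit/s, one 24-byte frame every 30 ms.
g7231 = ip_bandwidth_kbit_s(24, 30)
```

Under these assumptions G.711 needs 80 kbit/s at the IP layer and G.723.1 about 17 kbit/s. The header overhead therefore reduces the roughly tenfold saving of the codec to a little under fivefold on the network.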

7.2 VOIP TERMINAL USABILITY

The first VoIP applications were based on using "multimedia PCs" with internet connections and VoIP software. Although personal computers are not the ideal platform for VoIP, a lot of effort is put into enhancing PC-based VoIP. A number of drawbacks of PC-based VoIP are listed in [33]. Both parties need:

 

 

In addition, the session has to be synchronised, so that both parties are simultaneously "on-line". They might agree on starting a call at a certain time or they might use a regular phone or some other means of communication for agreeing on a VoIP session.

 

All of the above problems are avoided if VoIP services can be accessed through regular phones. This means that the operator has a PSTN-VoIP gateway somewhere in its network. New types of VoIP terminals are also emerging. These can be multipurpose IP terminals for VoIP, web browsing, e-mail, etc. An example of such a terminal is Alcatel’s Internet Screenphone, which is shown in figure 23.

Figure 23. Alcatel Internet Screenphone. [34]

 

7.3 SPEECH QUALITY ISSUES IN VOIP

As already pointed out, speech communication sets some basic requirements on VoIP QoS. Here the requirements are divided into those concerning speech processing and those concerning data transmission. In general, it is easier to enhance the QoS of speech processing functions, since these are centralised and almost isolated issues. On the network side there is a lot of complex behavior that affects, e.g., delays and packet loss rates.

7.3.1 Effects of Data Transmission on Speech Quality

There is one outstanding QoS problem that threatens the future of VoIP. IP was originally designed for non-real-time data transmission and cannot guarantee low-delay connections. There are a number of methods for supporting real-time traffic in IP, such as RTP and RSVP. The fact still remains that the IP protocol itself doesn’t support real-time traffic.

 

In [29] it is said that in a best-case scenario the round-trip delay is 600 ms on an international VoIP call. In G.114 a round-trip delay of up to 300 ms is considered acceptable. In the case of, e.g., satellite links in the PSTN, round-trip delays of up to 500 ms might be experienced. Figure 24 shows the basic sources of delay in a VoIP network. As can be seen in the figure, there is no single solution for reducing delay. A few general guidelines might be:

Figure 24. Network delays in a packet switched network such as the Internet. The most important delays are buffering (1) and processing (2) delays in routers, buffering delays (3) at network edges and transmission delays (4). One important aspect is also that routers start dropping packets when they are heavily loaded (5).

 

 

Because of delay jitter, some sort of data buffering must be used at network edges. This ensures that a constant stream of speech frames can be reproduced. (The next section discusses some advanced buffering schemes.)
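
The trade-off of a fixed buffer at the network edge can be illustrated with a few lines of code. The sketch below counts how many packets arrive too late to be played out for a given buffer depth. It assumes that the playout deadline is the smallest observed network delay plus the buffer depth, which a real receiver could only estimate. The delay values are invented for the example.

```python
def late_packets(network_delays_ms, buffer_ms):
    """Count packets that miss their playout time with a fixed jitter buffer.

    network_delays_ms must be a non-empty list of per-packet one-way delays.
    A packet is late if its delay exceeds (smallest delay + buffer_ms).
    """
    deadline_ms = min(network_delays_ms) + buffer_ms
    return sum(1 for delay in network_delays_ms if delay > deadline_ms)


# Hypothetical per-packet network delays in milliseconds.
delays = [40, 55, 42, 120, 48, 70, 41, 95]

late_with_20_ms = late_packets(delays, 20)
late_with_50_ms = late_packets(delays, 50)
late_with_100_ms = late_packets(delays, 100)
```

For this invented trace, a 20 ms buffer loses three packets, a 50 ms buffer loses two and a 100 ms buffer loses none, at the cost of 100 ms of additional delay.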

 

Preferably, the network has some mechanism for recovering from packet losses. Although it is not possible to use retransmission because of the real-time requirements, at least speech frame numbering should be used in order to find out if packets have been lost. On the Internet, packet loss rates of over 10 % are common, which puts very high demands on the speech decoder.
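
Detecting lost packets from frame numbering can be sketched as follows. The example assumes a 16-bit sequence number that wraps around, as used in RTP, and assumes that packets arrive in order. Reordered packets would need additional handling that is not shown. The function name is an assumption made for the example.

```python
def count_lost_frames(seq_numbers, modulus=1 << 16):
    """Count missing frames in a list of in-order sequence numbers.

    Gaps are computed modulo `modulus`, so a wrap from 65535 to 0 is not
    mistaken for a loss. Duplicated sequence numbers are ignored.
    """
    if not seq_numbers:
        return 0
    lost = 0
    previous = seq_numbers[0]
    for seq in seq_numbers[1:]:
        gap = (seq - previous) % modulus
        if gap == 0:
            continue  # duplicate packet, nothing was lost
        lost += gap - 1
        previous = seq
    return lost
```

For the received sequence 65533, 65534, 0, 1, 4 the function reports three lost frames: number 65535 and numbers 2 and 3. The decoder can then apply some form of concealment for exactly those frames.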

7.3.2 Effects of Speech Processing on QoS

As discussed earlier in the paper, there are a number of speech processing functions used in VoIP. From the speech quality point of view, the most important ones are speech coding, echo cancelling and silence suppression. It is also important to try to reduce delay and delay jitter in speech processing functions.

 

Speech codecs used in VoIP applications produce high quality in clean speech conditions. There are, however, a number of problematic situations:

 

 

Tandem speech coding is especially important in cases where VoIP is interworking with other networks that use speech coding, e.g., GSM. Transcoding between G.723.1 and a GSM codec produces poor speech quality.

 

Speech codecs are usually quite robust to random bit errors, but losing complete speech frames requires special mechanisms, such as the bad frame handling used in GSM.

 

According to G.114, echo cancelling must be used in cases where the round-trip delay exceeds 50 ms. In the PSTN, echo cancelling is implemented in the switching centers, so the echo canceller removes acoustic echo that the far-end terminal produces. Echo cancelling becomes somewhat more complicated in VoIP, where the delays are longer and may vary. Therefore VoIP terminals must implement echo cancelling.

 

If silence suppression is properly implemented it should not degrade speech quality. The most important parts are the algorithms used for Voice Activity Detection (VAD) and comfort noise generation. Especially in case of comfort noise, the noise parameters should be generated at the far-end based on surrounding background sounds. These parameters should also be regularly updated.
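
The basic principle of VAD-driven silence suppression can be sketched with a simple energy threshold. This is only an illustration: the detectors used with standardised codecs are considerably more elaborate, and comfort noise generation is not shown. The threshold value, the hangover length and the function names are assumptions made for the example.

```python
def frame_energy(samples):
    """Mean squared amplitude of one speech frame."""
    return sum(s * s for s in samples) / len(samples)


def suppress_silence(frames, threshold, hangover=1):
    """Return the indices of the frames that would be transmitted.

    A frame is sent if its energy reaches the threshold. After active
    speech, `hangover` further frames are sent so that weak word endings
    are not clipped.
    """
    sent = []
    remaining = 0
    for index, frame in enumerate(frames):
        if frame_energy(frame) >= threshold:
            remaining = hangover
            sent.append(index)
        elif remaining > 0:
            remaining -= 1
            sent.append(index)
    return sent


# Hypothetical frames of four samples each: quiet, loud, quiet, quiet, quiet, loud.
frames = [
    [0, 1, -1, 0],
    [100, -120, 90, -80],
    [1, 0, 0, -1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [200, -150, 100, -90],
]
transmitted = suppress_silence(frames, threshold=1000)
```

With these invented frames, frames 1, 2 and 5 are transmitted: the two loud frames and one hangover frame. A fixed threshold fails when the background noise level changes, which is one reason why practical detectors adapt to the surrounding background sounds.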

 

Speech coding introduces a couple of delay sources. These can be divided into system delays, which are independent of the implementation, and processing delays, which depend on the processing capacity of the system. The largest system delay is determined by the size of the coded speech frames. This delay is, e.g., 37.5 ms for G.723.1 (a 30 ms frame plus 7.5 ms of look-ahead) and 15 ms for G.729 (a 10 ms frame plus 5 ms of look-ahead). With today’s efficient DSPs, processing delays are quite small, in the order of 5-10 ms.
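
The codec system delay and its share of the total delay can be worked out with a short calculation. The frame and look-ahead lengths below are those of G.723.1 and G.729. The budget function is a simplified sketch that assumes the same delay in both directions and counts only the codec, processing, jitter buffer and network delays. The jitter buffer and network values used in the example are illustrative assumptions, not measurements.

```python
# (frame length, look-ahead) in milliseconds.
CODEC_FRAME_AND_LOOKAHEAD_MS = {
    "G.723.1": (30.0, 7.5),
    "G.729": (10.0, 5.0),
}


def codec_system_delay_ms(codec):
    """System (algorithmic) delay: one full frame plus the look-ahead."""
    frame_ms, lookahead_ms = CODEC_FRAME_AND_LOOKAHEAD_MS[codec]
    return frame_ms + lookahead_ms


def round_trip_delay_ms(codec, processing_ms, jitter_buffer_ms, network_one_way_ms):
    """Simplified symmetric round-trip delay budget."""
    one_way_ms = (codec_system_delay_ms(codec) + processing_ms
                  + jitter_buffer_ms + network_one_way_ms)
    return 2 * one_way_ms


# Assumed: 10 ms processing, 50 ms jitter buffer, 50 ms one-way network delay.
rtt_g7231 = round_trip_delay_ms("G.723.1", 10, 50, 50)
rtt_g729 = round_trip_delay_ms("G.729", 10, 50, 50)
```

With these assumed values the round-trip delay is 295 ms for G.723.1 and 250 ms for G.729. In this example G.723.1 leaves almost no margin against a 300 ms round-trip limit, so any increase in network delay or buffer depth would exceed it.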

 

According to [29], Hypercom uses a default delay jitter buffer of 50-100 ms. Delay jitter may also vary during a VoIP call. Therefore, VoIP GW manufacturers such as Cisco, Motorola, Hypercom and Netrix have decided to implement smart buffering mechanisms, where the buffer size can be changed during a call. [29] Such a GW needs to constantly measure network conditions in order to decide when to make these buffer adjustments. If the buffer size is to be increased, additional speech segments must be synthesised. If the buffer size is to be decreased, some parts of the speech signal must be dropped. These adjustments are preferably made during silence or in those parts of the speech signal that have low energy. An implementation of a smart jitter buffer is shown in figure 25.

Figure 25. Method for timing adjustment, when speech data from a packet network is interfaced to a D/A converter. Speech transmitted from the originating end (at the left), propagates with variable delay through the packet switched network. At the receiving end, delay jitter is compensated by using a leaky filter controlled by the queue depth detector and the idle word counter. The DSP is used for speech decoding. [33]
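
The principle of adjusting the buffer size only during silence can be sketched as follows. This is not the leaky filter method of figure 25 and not the algorithm of any of the vendors mentioned above. The class name, the rule of aiming at twice the measured jitter, the limits and the step size are all assumptions made for the example.

```python
class AdaptiveJitterBuffer:
    """Toy jitter buffer whose target depth changes only during silence."""

    def __init__(self, target_ms=50, min_ms=20, max_ms=100, step_ms=10):
        self.target_ms = target_ms
        self.min_ms = min_ms
        self.max_ms = max_ms
        self.step_ms = step_ms
        self._desired_ms = target_ms

    def observe_jitter(self, jitter_ms):
        """Choose a desired depth: twice the measured jitter, within limits."""
        desired = 2 * jitter_ms
        self._desired_ms = max(self.min_ms, min(self.max_ms, desired))

    def on_frame(self, is_silence):
        """Move one step towards the desired depth, but only during silence."""
        if is_silence and self._desired_ms != self.target_ms:
            if self._desired_ms > self.target_ms:
                self.target_ms = min(self._desired_ms,
                                     self.target_ms + self.step_ms)
            else:
                self.target_ms = max(self._desired_ms,
                                     self.target_ms - self.step_ms)
        return self.target_ms
```

If a jitter of 40 ms is measured while the target depth is 50 ms, the target stays at 50 ms during active speech and then grows in 10 ms steps to 80 ms over the following silent frames. Limiting each change to one small step during silence keeps the inserted or dropped signal segments short.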

 

8. CONCLUSIONS

The current market value of the VoIP industry is more or less based on by-passing the arguably high long distance tariffs. In the near future, these tariffs will most likely come down in response to competitive alternatives like VoIP technology. To be able to compete with circuit switched traffic, the VoIP industry should solve the voice quality and interoperability problems quickly. Later on, the service infrastructure that is built upon VoIP technology will decide the success or failure of VoIP networking. VoIP technology naturally has a very good starting point for integrating voice and data applications into commercial value-added services. The market shares of VoIP equipment vendors have not stabilized yet. The big telecom vendors are about to release their first carrier-class products. Integration of gateways (and gatekeepers) into their existing switching platforms will substantially raise the market potential of the VoIP industry.

 

There are a number of standardisation organisations working with VoIP. VoIP business is expected to grow extremely fast and all sorts of computer and telecommunication organisations are trying to keep up with the pace. VoIP standardisation is dominated by ITU-T H.323 and a lot of effort is put into enhancing the standard. Competing standards, such as IETF SIP, are also being developed, but the most value will lie in developing interoperability between both standards and equipment.

 

VoIP technology is heavily based on digital signal processing. New voice coding techniques, like variable rate coding, which is very suitable for packetized networks, are being researched to reach higher compression rates with better quality. Support for higher modem speeds (V.90) is also expected to become available in the near future.

 

The most important entities of VoIP systems are terminals, gateways and gatekeepers. These have been developed in the same order as listed above, so that the latest addition is the gatekeeper entity. On the whole, VoIP equipment and software are a lot more complex than traditional telephony equipment.

 

Poor QoS is one of the biggest obstacles to widespread use of VoIP. Currently, only high-capacity Intranets can be used for serious VoIP business. QoS problem areas are poor terminal usability, large delays and delay jitter, tandem speech coding, difficulties with echo cancelling and packet losses. As soon as these problems are solved, in addition to problems related to security, billing, network management and interoperability, the Internet can be taken into use as a global VoIP backbone.

 

REFERENCES

[1] The Pulver Report, October 8, 1998.

[2] VocalTec World Wide Virtual Network.

<http://www.vocaltec.com/products/products.htm>

[3] A trusted source.

[4] ITU-T, General area of responsibility of Study Group 16, 1998.

<http://www.itu.org/itudoc/itu-t/com16/gen_area_35051.html>

[5] ITU-T Recommendation H.323, Packet-based multimedia communications systems, Feb. 1998.

[6] ETSI STC SMG12, Study of H.323 as a multimedia protocol for a GPRS/UMTS real time voice and video services, Motorola, 15.-19.6.1998.

[7] Trillium, H.323 Tutorial – Protocols Specified by H.323, 11.9.1998.

<http://www.webproforum.com/beta/trillium/topic04.html>

[8] Toga, J. & ElGebaly, H. Demystifying Multimedia Conferencing Over the Internet Using the H.323 Set of Standards, Intel Technology Journal, 2Q 1998.

[9] IETF, PSTN and Internet Interworking (pint) Charter, 5.10.1998.

<http://www.ietf.org/html.charters/pint-charter.html>

[10] IETF, Audio/Video Transport (avt) Charter, 5.10.1998.

<http://www.ietf.org/html.charters/avt-charter.html>

[11] IETF, IP Telephony (iptel) Charter, 5.10.1998.

<http://www.ietf.org/html.charters/iptel-charter.html>

[12] IETF, Multiparty Multimedia Session Control (mmusic) Charter, 31.7.1998.

<http://www.ietf.org/html.charters/mmusic-charter.html>

[13] IETF Internet-Draft, SIP: Session Initiation Protocol, 18.9.1998.

[14] Toga, J. H.323 compared with SIP, Intel, 1998.

[15] Bernier, P. Will SIP be a Drain on H.323’s Momentum, Sounding Board 5/98.

<http://www.soundingboardmag.com/articles/851feat2.html>

[16] Spergel, L. & Kimchi G. ETSI TIPHON, Project Overview, 1998.

[17] IMTC, IMTC Background.

<http://www.imtc.org/i/about/i_objctv.htm>

[18] ITU-T Recommendation G.723.1 - Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s, 1996.

[19] Low Bit Rate Vocoder and Fax/Data Relay Modules AC4801D-M. Product Overview. AudioCodes. November 1996.

[20] ITU-T Recommendation G.729 - Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP), 1996.

[21] Voice Over Frame Relay Implementation Agreement FRF.11. Frame Relay Forum. Version 1.0, May 1997.

[22] Voice/Fax Over IP: Internet, Intranet, and Extranet. Technology Overview. MICOM Communications Corp.

[23] ITU-T Recommendation G.165 - Echo Cancellers, 1993.

[24] Clarent and Fortress partner to offer industry's first network-level encryption for Internet telephony. Press release. September 14, 1998.

<http://www.clarent.com/pressroom/pressreleases/pr_1998-09-14-2.html>

[25] Microsoft Netmeeting 2.1. Overview.

<http://www.microsoft.com/netmeeting/>

[26] VocalTec InternetPhone 5.0. Overview.

<http://www.vocaltec.com/products/products.htm>

[27] Voxware VoxPhone Pro 3.0. Overview.

<http://www.voxware.com/productsandtechnologies-mstr.htm>

[28] Quicknet Internet PhoneJACK. Overview.

<http://www.vocaltec.com/oem/quicknet.htm>

[29] Cray, A. Voice Over IP: Hear’s How, 4/1998.

<http://saxphone.agora.com/roundups/voiceip.html>

[30] Hypercom, IEN 6000 data sheet, 1998.

<http://www.hypercom.com/netsys/HNS_Web/Solutions/IP/Products.htm>

[31] LSI Datasheet, CPCI/6400 TMS320C6x Telecommunications Platform, 1998.

<http://www.lsi-dsp.com/products/sheetview.cgi?cpcic6400.htm>

[32] Ericsson, H.323 Gatekeeper System, 1998.

<http://www.ericsson.se/gatekeeper/>

[33] Haeggström, J. Speech quality aspects of recent patents and patent applications concerning packetised speech transmission, 1998.

[34] Alcatel, Internet Screenphone, 1998.

<http://www.alcatel.com/telecom/mbd/products/products/detailed/term/>