Author Archive

SIP Load Balancing != IP Based-Load Balancing

May 12th, 2011 by Dorgham Sisalem under SIP

When it is time to scale up a SIP infrastructure the network planner will most likely ask himself: Because DNS is not a sufficient solution, would a simple IP load balancer be OK?

A simple IP load balancer would act as a front-end for the SIP cluster and all traffic going to the SIP cluster would pass the load balancer. This can be achieved by having a DNS entry for the SIP cluster that maps the URL of the cluster to an IP address that is served by the load balancer. The IP load balancer would then distribute the incoming SIP traffic using some load distribution mechanism such as round-robin or based on the hash of the source IP address.

Such an approach might be sufficient for the case when the SIP nodes in the cluster are transaction stateless SIP proxies. In all other cases, this simple approach would not work:

  • Responses and requests for the same transaction should traverse the same nodes. Hence, the load balancer should at least be able to route the responses based on the VIA header, otherwise the response will reach a SIP node that knows nothing about the transaction and will most likely just drop the response or generate an error. This means that the load balancer will need to act as a transaction stateless proxy and parse at least the VIA headers.
  • If all requests that belong to the same dialog are expected to be processed by the same server in the cluster, then round-robin or a hash of the source IP address will not work either. This would be the case if, for example, the SIP server is collecting and generating CDRs or is an IVR. Why round-robin is not an option should be clear. Using a hash of the source IP address for determining the SIP node could work in a perfect world. However, a SIP client might change its IP address during a dialog, or the size of the cluster might change: if a server is added to or removed from the cluster, the hashing mechanism will map many existing dialogs to the wrong node.
  • In some scenarios such as clusters of PSTN gateways, the nodes of the cluster might generate calls themselves. In this case the load balancer will need to be able to route the incoming responses to the right nodes. This will require the load balancer to be able to process the SIP headers and route the responses using the VIA headers.
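The effect of a changing cluster size on source-IP hashing can be illustrated with a small Python sketch. The hash function (SHA-256), the modulo mapping and the client address range are assumptions chosen for illustration and are not taken from any particular load balancer.

```python
import hashlib


def pick_node(source_ip: str, node_count: int) -> int:
    """Map a source IP address to a node index (stable hash modulo cluster size)."""
    digest = hashlib.sha256(source_ip.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def remapped_fraction(clients, old_size: int, new_size: int) -> float:
    """Fraction of clients whose node changes when the cluster is resized."""
    moved = sum(
        1 for ip in clients
        if pick_node(ip, old_size) != pick_node(ip, new_size)
    )
    return moved / len(clients)


# Hypothetical client population: 2000 private addresses.
clients = [f"10.0.{i // 256}.{i % 256}" for i in range(2000)]

fraction = remapped_fraction(clients, 3, 4)
print(f"{fraction:.0%} of clients are sent to a different node after growing from 3 to 4 nodes")
```

With a uniform hash only about a quarter of the clients keep their node when a fourth server is added, so most ongoing dialogs would suddenly reach a server that knows nothing about them.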

So, in short, a load balancer for a cluster of SIP nodes must have some SIP logic. The level of SIP logic will depend however on the usage scenario and the type of servers in the cluster as well as the expectations of the operator.
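As a rough idea of the minimum SIP logic involved, the following Python sketch extracts the host and port from the topmost Via header of a message, which is the information a balancer needs to route a response. The function name and the sample host names are hypothetical, and the parser is deliberately simplified: it ignores header line folding and IPv6 literals and assumes port 5060 when none is given.

```python
def top_via_sent_by(sip_message: str):
    """Return (host, port) from the first Via header of a SIP message."""
    lines = sip_message.split("\r\n")
    for line in lines[1:]:                  # skip the request or status line
        if not line:
            break                           # empty line ends the header section
        name, _, value = line.partition(":")
        if name.strip().lower() in ("via", "v"):   # "v" is the compact form
            first_value = value.split(",")[0]      # several values may share a line
            sent_by = first_value.split(";")[0].split()[1]
            host, _, port = sent_by.partition(":")
            return host, int(port) if port else 5060
    raise ValueError("no Via header found")


response = (
    "SIP/2.0 200 OK\r\n"
    "Via: SIP/2.0/UDP node2.cluster.example:5070;branch=z9hG4bK776\r\n"
    "Via: SIP/2.0/UDP client.example;branch=z9hG4bK123\r\n"
    "Call-ID: a84b4c76e66710\r\n"
    "\r\n"
)
print(top_via_sent_by(response))
```

A production balancer would of course rely on a complete SIP parser; the point is only that the routing decision depends on message content and not on the IP packet header.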

In general one can implement a SIP load balancer in one of two ways:

  • Transparent: The existence of the load balancer is transparent to both the clients and servers. Clients send their traffic to the load balancer, which forwards the traffic to the servers without adding any SIP headers. The servers use the load balancer as a kind of router to send their responses back to the clients. The VIA and Record-Route headers in the SIP messages leaving the load balancer will include the IP address of the load balancer. This can be achieved either by convincing the nodes in the cluster to use the IP address of the load balancer when adding a VIA or Record-Route header, or by having the load balancer manipulate the messages leaving the cluster and replace the IP addresses included in the messages with its own address.
  • Non-Transparent: The load balancer acts as an outbound proxy that receives traffic from clients, then adds VIA and possibly RR headers and forwards the traffic to some server.

The transparent mode has the advantage that the addresses of the nodes in the cluster are hidden from the clients, which provides topology hiding. Also, when the servers in the cluster support NAT traversal, then in the case of symmetrical NATs the clients expect that incoming calls are routed through the same SIP server that is handling the registrations and outgoing calls of the client. With the non-transparent approach the load balancer would have to deal with the NAT traversal aspect itself. With the transparent approach the different servers in the cluster would each be responsible for a subset of the clients, which would keep the complexity of the load balancer low and its capacity high.

A major advantage of the non-transparent approach is that the load balancer acts as a SIP proxy and can, for example, reroute requests that are rejected by an overloaded server to another one.
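The rerouting behavior can be sketched in a few lines of Python. The `send` callable is a hypothetical stand-in for the proxy's transaction layer and is assumed to return the final SIP status code; treating only 503 as a reason to try the next server is likewise an assumption of this sketch.

```python
from typing import Callable, Optional, Sequence, Tuple


def forward_with_failover(
    request: str,
    servers: Sequence[str],
    send: Callable[[str, str], int],
) -> Tuple[Optional[str], int]:
    """Try the servers in order and move on whenever one answers 503.

    Returns (server, status) for the first server that did not reject the
    request as overloaded, or (None, 503) if every server rejected it.
    """
    for server in servers:
        status = send(server, request)
        if status != 503:
            return server, status
    return None, 503


def demo_send(server: str, request: str) -> int:
    """Hypothetical transport: pretend that sip1 is overloaded."""
    return 503 if server == "sip1" else 200


print(forward_with_failover("INVITE sip:bob@example.com SIP/2.0", ["sip1", "sip2"], demo_send))
```

An IP-level balancer cannot do this, because it never sees the 503 response as anything other than a packet travelling back to the client.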

SIP Load Balancing != DNS Based-Load Balancing

March 31st, 2011 by Dorgham Sisalem under SIP

Scaling up a SIP infrastructure is theoretically simple. Have a DNS entry for a cluster of servers and when more resources are needed just add another server in the cluster and announce it in the DNS entry for the cluster. A SIP entity that wants to reach the cluster (SIP client) needs only to resolve the DNS entry and choose one of the IP addresses returned by the DNS server.

As already described by Robert in a previous blog, several servers can be added to an SRV record so that resolving for the cluster tekelec.com would result in the following records:

_sip._tcp.tekelec.com. IN SRV   10 1 5060 crowned.tekelec.com.

_sip._tcp.tekelec.com. IN SRV   10 1 5065 crested.tekelec.com.

_sip._tcp.tekelec.com. IN SRV   10 1 6065 golden.tekelec.com.

In this example, all servers have the same priority and weight, which will lead the client to choose each one of the servers approximately 33 percent of the time.
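How a well-behaved client could pick a server from these records is sketched below in Python. This is a simplified reading of the SRV selection rules (lowest priority first, then a weighted random choice) and not a complete implementation of the algorithm in RFC 2782; the class and function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, NamedTuple


class Srv(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


# The three records from the example above.
RECORDS = [
    Srv(10, 1, 5060, "crowned.tekelec.com."),
    Srv(10, 1, 5065, "crested.tekelec.com."),
    Srv(10, 1, 6065, "golden.tekelec.com."),
]


def choose_srv(records: List[Srv], rng: random.Random) -> Srv:
    """Pick one record: lowest priority value wins, weights decide within it."""
    lowest = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == lowest]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        return rng.choice(candidates)   # all weights zero: plain random choice
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random()
counts = Counter(choose_srv(RECORDS, rng).target for _ in range(3000))
for target, count in sorted(counts.items()):
    print(f"{target:25s} chosen {count / 3000:.0%} of the time")
```

Each target should come out at roughly a third of the selections. The distribution only materializes if every client actually runs such a selection for each new session, which is exactly what the list below shows cannot be taken for granted.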

While in theory using DNS should result in perfect load distribution, in reality this does not always work for a number of reasons, including:

  • Stickiness: Often it is the case that all requests belonging to one session must go to the same destination. This is especially the case when load balancing between application servers or IVRs for example. In the best of all worlds, the session initiation request would require a DNS resolution and a certain destination would be chosen. All subsequent requests belonging to the same session would include the IP address of the chosen destination. This is unfortunately not always guaranteed. Hence, if a BYE request for example included as the destination a SIP URI that requires DNS resolution then the BYE might end up at a different destination that knows nothing about the session.
  • Implementation: The SIP client that wants to reach the cluster must be intelligent enough to do the load balancing by itself. However, in reality many SIP clients have only an insufficient understanding of DNS:
    • Some implementations work like this: resolve the address, cache the result, use the first entry of the result and never resolve this address again, at least until the next reboot. Such SIP clients will not learn about changes to the DNS entry.
    • In case an entry in the DNS record is no longer reachable, the SIP client should try another entry in the DNS record. Lots of SIP clients do not do this and report back an error instead.
    • The SIP client should use the weights and priorities to load balance the traffic. However, some just use the first entry in the DNS record.

Besides the implementation and stickiness issues, DNS-based load balancing has a number of other issues. For a client to know about the availability of a new server in the cluster it would have to query the DNS server rather often. However, DNS queries are time and resource consuming. If a DNS query takes half a second, then the session establishment is delayed by half a second. In order to reduce this load, lots of SIP clients implement a cache in which the results of DNS resolutions are saved. This means, however, that the client will not know about the availability of a new server in the cluster until the expiration of the cached DNS record.

Another issue with DNS-based load balancing is that it is rather static in nature. The used weights and priorities can take the static capabilities of different servers in the cluster into account. However, dynamic capacities of these servers such as their momentary load or availability are not considered.

Hence, DNS load balancing can be a useful tool for distributing traffic in some cases, such as geographically distributed proxies that do not require stickiness and whose number and availability do not change frequently. For clusters of application servers or PBX gateways, for which stickiness is of utmost importance and where the momentary load and availability must be considered in the load balancing strategy, DNS does not provide an adequate solution.

DNS and SIP: Threats and Protection

February 16th, 2011 by Dorgham Sisalem under SIP

Similar to email or web services, SIP components can use DNS and ENUM servers to find out how to route a SIP message. Using DNS has the great advantage that the operator does not need to maintain local routing tables. DNS enables the operator to support load balancing and geographical redundancy, as well as to change the IP addresses of some destination, only by using DNS servers and without the need to provision each SIP server separately with the needed routing information.

However, using DNS is not without costs both in terms of reliability and security. DNS servers can fail. In this case, SIP servers that contact the failed DNS server will not get a reply and will time out. This introduces significant delays to the call establishment. In order to avoid these delays, multiple DNS servers must be used and the SIP servers should proactively monitor the status of these servers to avoid contacting a failed server. Additionally, SIP servers should implement a DNS cache so the results of DNS queries are saved locally and can be used to serve requests for destinations that were already resolved, even if no DNS server is available. This way, the cache can help keep the VoIP service functioning, at least partially, even if no DNS servers are reachable.

Attacks that affect the DNS service will also have negative effects on the VoIP service. Here we can distinguish between two types of attacks: redirection and overloading attacks. The goal of redirection attacks is to forward SIP requests to a malicious site. This is achieved by providing a SIP server with manipulated responses for its DNS queries. Hence, a SIP server that tries to locate the IP address of the VoIP server of example.com ends up at a server belonging to the attacker. This can be achieved by intercepting the DNS queries, guessing the content of a DNS query and blindly answering it, DNS cache poisoning or hijacking a DNS server. By forwarding a SIP request to a manipulated server, the attacker can implement a man-in-the-middle attack and either reply to the call himself (pretty bad if the call was going to the bank, for example) or manipulate the SIP requests so as to reduce the security level, so that the call ends up being established without any encryption, allowing the attacker to eavesdrop on the communication.

Overloading attacks are based on misusing the query/response nature of DNS. When a SIP server issues a DNS query then it will block some memory and processing capacity while waiting for the response. On average it takes 1.3 DNS queries to receive an answer with the mean resolution latency less than 100 msec. The resolution latency is considerably increased in the following cases:

  • Irresolvable names
  • Congested networks and overloaded servers

With overloading attacks the attacker aims at misusing and increasing the processing resources needed for resolving domain names, which can lead to memory depletion or blocking of the entire server. This can be achieved by causing the SIP server to resolve domain names that are either irresolvable or are served by overloaded servers.

This kind of attack can be mounted by sending SIP requests to the SIP server with an irresolvable domain name included in a header that is used by the SIP server for routing the messages, e.g. Via or Route headers, or in the Request-URI. Such requests are otherwise well formatted SIP requests that comply with the SIP standard in every respect.

An attacker can ensure that a domain name is irresolvable by launching a denial of service attack on the authoritative server of this domain. Another approach is to actually register a number of domain names and set the addresses of the authoritative servers of these domains to hosts that do not reply to DNS queries or do not exist at all. For registering a domain name the attacker is supposed to provide his name, address and payment information to a domain name registration company. However, as the name and address information are usually not verified and stolen credit cards can be used for payment, the attacker can falsify this information and hide his identity.

Using DNSSEC (RFC 4033) or secured links, e.g., TLS or IPsec, between DNS servers and SIP servers can minimize the possibility of eavesdropping, guessing and cache poisoning and hence the chance of a redirection attack. However, these approaches increase the complexity of using DNS, increase the processing and bandwidth needs of the DNS servers, and require some cooperation between the different entities maintaining the different DNS servers. Also, if an attacker manages to hijack a DNS server, then at least the domains for which the hijacked server acts as the authoritative server will still be unprotected.

The effects of overloading attacks can be reduced by implementing a DNS cache at the SIP servers. By caching not only the positive but also the negative responses to a DNS query, a SIP server will not query a malicious address again until the cached negative answer expires. This will greatly reduce the number of DNS queries issued by the SIP server. Additionally, SIP servers should add a “received” parameter to the topmost Via header of the requests they receive. In the “received” parameter the SIP server includes the IP address from which the request was received. This way, when the response is routed back, the SIP server will not have to resolve any DNS entry in the Via headers.
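A cache that stores both positive and negative answers could look roughly like the following Python sketch. The class name, the fixed lifetimes and the convention that the resolver raises `LookupError` for irresolvable names are assumptions for illustration; a real cache would honor the TTL values carried in the DNS responses.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class DnsCache:
    """Tiny DNS cache that remembers failures as well as successes."""

    def __init__(
        self,
        resolver: Callable[[str], List[str]],
        positive_ttl: float = 300.0,
        negative_ttl: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolver = resolver          # hypothetical: raises LookupError on failure
        self._positive_ttl = positive_ttl
        self._negative_ttl = negative_ttl
        self._clock = clock
        # name -> (expiry time, addresses or None for a cached failure)
        self._entries: Dict[str, Tuple[float, Optional[List[str]]]] = {}

    def lookup(self, name: str) -> Optional[List[str]]:
        """Return the addresses for name, or None if it is known to be irresolvable."""
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[0] > now:
            return entry[1]                # served from cache, no DNS query issued
        try:
            addresses = self._resolver(name)
        except LookupError:
            self._entries[name] = (now + self._negative_ttl, None)
            return None
        self._entries[name] = (now + self._positive_ttl, addresses)
        return addresses
```

With such a cache, a flood of requests carrying the same irresolvable domain name costs one slow DNS query per negative lifetime instead of one per request.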

VoIP Interconnection: NNI vs. UNI

November 10th, 2010 by Dorgham Sisalem under SIP

In my last blog entry, I discussed why VoIP peering is needed and what is slowing its introduction. In this sequel, I will take a closer look at some of the issues operators need to deal with once they decide to peer.

SIP was designed to work in a similar manner to email services. That is, a caller that wants to reach a callee either sends its requests to the proxy server responsible for the callee or to its own proxy. The caller’s proxy then forwards the request to the callee’s proxy. DNS is used to discover the IP addresses of the involved proxies. While this model was rather successful with email services, VoIP providers decided that such a model is too open for their needs. Allowing users to access network components such as proxies, media servers or PSTN gateways was deemed to be too insecure. This was the moment for the SBC vendors, who introduced session border controllers that separate the end users from the VoIP service provider. SBCs terminate the SIP sessions of the users and establish new ones to the operator’s servers.

Figure 1 UNI vs. NNI scenario

When it comes to peering, operators show a similar reluctance to allow other operators to be able to send traffic directly to their servers and gateways. From a high level point of view one might ask, “Why not just use SBCs on the network to network interface (NNI) as it was done on the user to network interface (UNI)?”

I would say that the main difference between the UNI and NNI stems from the traffic characteristics and security requirements. At the UNI, SBCs are usually located as close as possible to the users, and, hence, there are many of them, with each box dealing with a low amount of traffic.  In contrast, operators will only have a couple of peering points to other operators and will route a high amount of traffic through these points. Besides scale, the kind of traffic control needed at the NNI is different from that needed at the UNI. While SBCs need to keep local registrations and support user authentication, at the NNI border components need only to worry about call signaling, and no user relevant processing is needed. Further, the media compression styles to be supported between two operators can be negotiated beforehand in service level agreements (SLA). Therefore there is less of a need to worry about all compression styles, and the need for transcoding is less urgent than in the case of UNI, where an operator does not know the transcoding supported by user devices. Further, even if transcoding was needed at the NNI, the operator can use dedicated servers with special hardware at the peering points and does not have to equip each border component with this expensive hardware.

From the security point of view the concerns are also different. SBCs need to prevent user fraud, ensure that users’ behavior conforms to the operator’s policy, and protect the network from malicious traffic. At the NNI side the concern is more about ensuring that SLAs are respected, filtering unwanted traffic, and ensuring the interoperability between the SIP components of the peering partners.

Besides the security and traffic issues, border components at the NNI are expected to provide features that are not necessarily needed by an SBC. This includes providing CDRs to enable billing between operators, as well as flexible routing mechanisms that allow the NNI box to forward incoming calls directly to a gateway or application server. SBCs at the UNI are more relay points between the user agents and the operator service platform, which is responsible for the routing and the CDR generation.

So in short, while the requirements of the UNI and NNI might seem rather similar at first, there is actually a need for border elements at the NNI that differ in their feature set and architecture from the SBCs used at the UNI. While at the UNI lower scale devices with more complex SIP processing capabilities and user oriented logic are needed, at the NNI high scale servers are needed that support flexible routing, denial of service protection and SLA management.

VoIP Interconnection: Why bother, and how come it is not there yet?

September 29th, 2010 by Dorgham Sisalem under SIP

The first step towards next generation networks was to replace the core TDM-based networks with SIP components (please note that most of the argumentation here applies just as well to BICC networks). Strangely enough, the usage of IP-based technology usually stops at the borders of the network. TDM is still the basis for interconnection with other operators or even with other divisions of the same company. This is rather inefficient due to multiple reasons:

  • QoS reductions: Moving from IP to TDM and possibly back to IP means that the voice communication will have to be transcoded from IP packets to TDM and back to IP packets. This will introduce delays and possibly losses.
  • Service limitations: As TDM networks are designed for voice only, any other service such as video or messaging can not be exchanged transparently between operators. Actually, not even high quality voice can be used in an end-to-end manner as both operators will have to agree on G.711.
  • Costs: One of the main reasons for moving to IP-based core networks was to reduce costs. The same should apply to peering.

Monetary aspects certainly play a role in delaying the all-IP interconnection. Especially smaller providers of VoIP services seem to be making a living off of the termination fees they receive from their TDM peering partners. Hence, they do not show much willingness to move to a pure IP-based interconnection, which is usually thought to be free or based on some flat rates, as is the case with interconnection of pure IP packets.

However, this does not explain why TDM is still the prevailing interconnection technology between large operators or between the divisions of the same operator.

Looking deeper into this issue we can identify the following reasons for why VoIP peering is not so popular yet:

  • Timing: Lots of operators have just finished migrating their core networks to SIP, and hence it is only natural that they are taking their time to go into the next step.
  • Organization: Interconnection infrastructure is often managed by a different organization than the mobile or fixed divisions inside an operator. In order to go IP, the interconnection division would have to be convinced to move to IP as well.
  • Security: VoIP as an IP based application still has the taste of something unknown and insecure to many operators. With TDM, there is already a long history and extensive experience in all kinds of troubles that might arise. Operators know how to deal with fraud and interoperability problems in TDM and have established solutions to deal with these issues. VoIP technology poses new threats and problems that require new solutions and often a non-negligible learning curve.
  • Interoperability: Even after years of standardization and interoperability testing it is not uncommon to have interoperability issues between SIP equipment from different vendors. Solving the interoperability issue when rolling out a VoIP core network is rather straightforward. The vendors selected by the operators will most likely not get paid until all interoperability issues are solved. In the case of a peering scenario with other operators this is not as simple. Different operators and even different divisions of the same operator will have their own equipment. The willingness to change the configuration of one’s own equipment or to pressure a vendor to update an implementation will not be too high.
  • Monetary: Especially in the case of large operators interconnection costs over TDM are predictable and low. Hence, the costs of moving to a new technology and establishing new business relations might not be economically justifiable. 

SIP Overload Control: The next QoS?

August 17th, 2010 by Dorgham Sisalem under SIP

Throughout the last 40 years or so, quality of service (QoS) and congestion control were among the most popular Internet-related research and standardization topics in academia and industry. There must be at least a couple thousand papers (I must confess that I have contributed to this by at least 10 or 20 papers) that describe possible solutions for supporting some level of QoS in the Internet. I assume that the major reason for this is the vagueness of the problems to be solved and the impossibility of getting a simple answer that fits all scenarios. The interest for these topics has, however, gone down considerably in the last years after recognizing that TCP can work very well with only one or two congestion control schemes and that over-provisioning the bandwidth is an efficient way for achieving a good level of QoS.

Recently, the IETF started the SIP Overload Control group, and I am starting to wonder whether SIP overload research and standardization will actually be the follow-up for classical QoS and congestion control research.

As discussed in a previous blog, there are two possible architectures for solving the overload issue at a SIP component. With the local or stand-alone architecture, the overloaded SIP component will drop or reject excess traffic by itself. In the cooperative architecture, the overloaded SIP component will ask its neighbors to reduce the amount of traffic they are sending.

In my view, the local approach does not require standardization. Actually, it is questionable whether one single approach can solve all related issues. Algorithms dealing with the communication between user agents and proxies will be different from those dealing with the communication between proxies. Overload control when UDP is used as the transport protocol will often look different from the cases when TCP or SCTP is used. So in the end, it will be left to the vendor to decide the best approach to use. The overload control logic used at different SIP components will most likely be different depending on the type of the component, e.g., proxy vs. application server, the deployment scenario and the vendor’s preferences.

Achieving a cooperative type of overload control will require the exchange of information between the SIP components. While this should push the overload away from the congested component there are still issues.

The overload is only pushed away one hop. That is, if some component is informed that a neighbor is overloaded then this component will have to deal with the excess traffic by itself and can not ask its neighbors to reduce their traffic and hence can not propagate the overload information to the source of the overload. So this component will have to make the decision about which SIP messages to drop or reject. Not knowing the exact needs of the overloaded component means that the wrong SIP messages might be dropped. However, I am sure someone will come up with an XML extension of SIP in which an overloaded SIP component will indicate to its neighbors what kind of SIP messages should be treated and in which way.

So similar to QoS and congestion control, we have again a problem for which it is unlikely to find a simple solution that fits all scenarios.

This is not to say that the issue of SIP overload control should not be researched. I actually believe that there should be more work on specifying and testing standalone algorithms as each SIP component will need to deploy some overload control. I just hope that the search for the perfect solution will not become the next holy grail of PhD and IETF work that will make SIP even more complex than it is today. 

When you have SIP, do you need MEGACO?

July 8th, 2010 by Dorgham Sisalem under SIP

The first gateways that translated between VoIP and PSTN were rather simple in design. They consisted of a single box that translated SS7 into VoIP (H.323 or SIP) and TDM into RTP. In the late nineties a number of protocols were specified, starting with SGCP followed by IPDC, MGCP, and ending with MEGACO. These protocols changed the design of gateways into a distributed architecture that consists of a signaling gateway, media gateway and media gateway controller.

  • The signaling gateway is responsible for translating between different protocols.
  • The media gateway is responsible for all media-related tasks, such as translating between TDM and RTP or transcoding media.
  • The media gateway controller controls the media gateways and uses the MEGACO protocol to provide the instructions needed for fulfilling their tasks.

This architecture has the following advantages:

  • Centralized signaling and distributed media processing: For various reasons, such as billing, management and provisioning, it is beneficial to have all the signaling information processed at a centralized location. However, the media gateways should be placed at the points where they are most needed. So, for example, in the case of a peering scenario in which one operator is peering with multiple operators, it is desirable to have all the signaling information processed at a central location that might reside in a large data center of the operator. The media processing should, however, be placed as close to the interconnection points as possible.
  • Scaling: Processing of media and signaling requires different kinds of resources. Hence, with a distributed architecture, it is possible to add new media gateways to handle more data flows or more complex scenarios independently of the media gateway controller and signaling gateways.
  • Evolution: With a distributed architecture it would be possible to enhance the scope of the controlling components without having to touch the media gateways. So, if a signaling unit supports translation only between H.323 and SS7 at the beginning, it can be upgraded to support SIP in a second stage without any effects on the media gateways.
  • Vendor independence: As the media gateways and the media gateway controllers are communicating with each other using a standardized protocol, it should be possible to buy them from different vendors. This would enable some vendors to concentrate, for example, on optimizing media handling whereas others might dedicate more work to signaling.

 

While this distributed architecture surely has its advantages, one should not confuse it with MEGACO. MEGACO is one protocol that can enable this architecture but is not the only way of getting there.

Using SIP, one can achieve the same distribution and benefits without having to introduce another complex protocol into the network.

[Figure: distributed peering scenario using SIP only]

This figure presents a distributed peering scenario using SIP only. A central gateway controller acts as the contact point for exchanging calls with the peering partners. However, the servers responsible for the media are located near the peering points because it would be disadvantageous to route all the media traffic to the central point and redistribute it from there to the final destinations.

When a call comes in at the controller, the call is processed (step 1) and the INVITE request is forwarded to the NNI (Network-Network Interface) component closest to the peering point (step 2); in a MEGACO architecture the controller would be sending MEGACO commands instead. The NNI is a component that has media processing capabilities and a SIP stack (in a MEGACO architecture this would be a MEGACO stack). Based on the content of the SIP INVITE request the media processing capabilities of the NNI are instructed to do the required tasks.

So basically the usage of MEGACO is replaced by SIP. This provides for a simpler central controller as this server only needs to support SIP. The same advantages of scalability, distribution and vendor independence are still maintained.

In a future blog I will try to demonstrate how the various packages and commands of MEGACO are replaced in a SIP architecture.

SIP and Congestion Control: Exploring Solutions

May 25th, 2010 by Dorgham Sisalem under SIP

In continuation of my last posting – The Problem with SIP Congestion Control – I would like to briefly discuss some proposed solutions for solving this issue.

As mentioned in the last posting, the only mechanism provided in the SIP specifications is for the server to reply back with a 503 response. However, as the 503 response is kind of a binary mechanism, the server will either receive traffic or not. This can easily lead to oscillations and might cause the entire SIP infrastructure to become unstable under overload situations.

In general, one can distinguish between standalone and feedback based overload mechanisms.

In the standalone approach, an overload control mechanism is implemented at the SIP server. This server will then monitor its own resources, e.g., memory, CPU and bandwidth. Based on the monitored resources the server will recognize when it starts to become overloaded and will have to deal with the incoming traffic by either rejecting new calls or even dropping them. Rejecting is preferred, as dropping a request will cause the caller to retransmit the request and hence the server will end up having to deal with the same call a number of times.

When dropping or rejecting requests the server will have to ensure that running calls are not interrupted, i.e., it would be a bad idea to accept a new call but reject a BYE request, as losing the BYE request might cause irregularities in the charging process. An example of a standalone approach can be found in the paper – Protecting VoIP Services Against DoS Using Overload Control.
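The admission decision of such a standalone mechanism can be sketched as follows in Python. The single load figure between 0 and 1, the threshold of 0.8 and the function name are assumptions for illustration and are not taken from the cited paper.

```python
def admit(method: str, in_dialog: bool, load: float, threshold: float = 0.8) -> bool:
    """Decide whether to process a request (True) or reject it with 503 (False).

    load is a hypothetical aggregate of the monitored resources (CPU, memory,
    bandwidth), normalized to the range 0..1.
    """
    if load < threshold:
        return True                      # not overloaded: accept everything
    # Overloaded: requests that belong to running calls must still get through,
    # otherwise calls are not terminated cleanly and charging becomes unreliable.
    if in_dialog or method in ("BYE", "ACK", "CANCEL"):
        return True
    return False                         # new sessions are rejected, not dropped


print(admit("INVITE", in_dialog=False, load=0.95))   # new call under overload
print(admit("BYE", in_dialog=True, load=0.95))       # end of a running call
```

Rejecting with an explicit response rather than silently dropping the request avoids the retransmissions described above.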

The other approach for overload control is more of a cooperative process. In this scenario the overloaded server regulates the amount of traffic it is receiving from its neighbors by informing them about its current load. The load information can be sent to the neighboring servers by either adding a header in the SIP response or by using the SUBSCRIBE/NOTIFY mechanisms of SIP. The neighboring servers will then adapt the amount of traffic that they are sending to the overloaded server. In case they have to reduce the amount of traffic they want to send to the overloaded server then they will also inform their neighbors to send less traffic to them. This way, the congestion is pushed to the border of the network and calls are not forwarded to the core components only to be dropped there.  A survey of different control mechanisms can be found in the article – Session Initiation Protocol (SIP) Server Overload Control: Design and Evaluation.
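How a neighboring server might adapt its sending rate to the reported load is sketched below in Python. The additive-increase/multiplicative-decrease policy, the target load and all numeric values are assumptions chosen for illustration; they are not taken from the cited survey or from any SIP specification.

```python
class UpstreamThrottle:
    """Adapts the call rate sent to one downstream server based on its load reports."""

    def __init__(self, max_rate: float = 100.0, min_rate: float = 1.0,
                 target_load: float = 0.8) -> None:
        self.max_rate = max_rate        # calls per second when there is no overload
        self.min_rate = min_rate        # never throttle below this rate
        self.target_load = target_load  # reported load above this means "slow down"
        self.rate = max_rate

    def on_feedback(self, reported_load: float) -> float:
        """Update and return the allowed rate after a load report (0..1)."""
        if reported_load > self.target_load:
            self.rate = max(self.min_rate, self.rate * 0.5)   # back off quickly
        else:
            self.rate = min(self.max_rate, self.rate + 5.0)   # recover slowly
        return self.rate


throttle = UpstreamThrottle()
for load in (0.95, 0.95, 0.60, 0.60):
    print(f"reported load {load:.2f} -> allowed rate {throttle.on_feedback(load):.1f} calls/s")
```

A gradual adjustment of this kind is one way to avoid the on/off oscillations of the plain 503 mechanism. A server that has to lower its rate would in turn report a higher load to its own neighbors, which is how the congestion moves towards the border of the network.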

Both approaches have their pros and cons. The standalone approach can be deployed without having to rely on other SIP components in the network to support overload handling as well. It also does not require any standardization with regard to how status information is exchanged. This makes it the ideal approach for now. However, this mechanism does not cause the overall load of the SIP network to go down.

The feedback-based approach can adapt the number of calls in the network to the resources actually available and pushes the overload to the borders of the network. In this way, excess calls are prevented from even reaching the overloaded servers, and access points can consider using non-overloaded paths for establishing the calls. On the downside, a server that ignores the feedback information would still cause overload and packet drops. Hence, to be on the safe side, a SIP server will have to implement a combination of both approaches.

The Problem with SIP Congestion Control

April 13th, 2010 by Dorgham Sisalem under SIP

A DoS attack on a SIP component usually aims at overloading one or more resources of that component. These resources might include the CPU, memory or any other resource needed for processing incoming SIP messages. Once one or more resources are overloaded the SIP component will no longer be capable of dealing with the incoming traffic and will drop the excess traffic. This actually leads to an even higher amount of traffic in the network due to the retransmissions of the dropped packets.

The work in Ohta 2004 suggests that overload situations not only reduce the performance of a SIP server, but can ultimately lead to a complete failure of the VoIP service. The most straightforward approach for handling such situations is to ensure that the available processing resources of a SIP component are sufficient for handling SIP traffic arriving at the speed of the link connecting this component to the Internet. With modern access lines reaching gigabit speeds, provisioning the VoIP infrastructure of a provider to support such an amount of traffic, which is most likely several times the normal traffic level, can be cost prohibitive.

SIP does not provide much guidance on how to react to overload conditions. A server that is not capable of serving new requests, e.g. because it is overloaded, can reject incoming requests by sending a 503 Service Unavailable response back to the sender of the request. This signals to the sender that it should try forwarding the rejected request to another proxy and should not use the overloaded proxy for some time. Further, the 503 response can include a Retry-After header indicating the period of time during which the overloaded server should not be contacted. While this reduces the load on the overloaded proxy, it directs the traffic that has caused the overload to another proxy, which might then get overloaded itself.
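To make the rejection concrete, the following Python sketch assembles such a 503 response from the headers of the rejected request. It is a simplified illustration: the `build_503` helper and the sample header values are hypothetical, and details a real server has to handle (several Via headers, adding a tag to the To header) are left out.

```python
def build_503(request_headers: dict, retry_after_s: int) -> str:
    """Build a 503 response that asks the sender to back off.

    request_headers holds the headers of the rejected request; the ones
    identifying the transaction are copied into the response.
    """
    lines = ["SIP/2.0 503 Service Unavailable"]
    for name in ("Via", "From", "To", "Call-ID", "CSeq"):
        lines.append(f"{name}: {request_headers[name]}")
    # Number of seconds during which the sender should avoid this server.
    lines.append(f"Retry-After: {retry_after_s}")
    lines.append("Content-Length: 0")
    # Header lines end with CRLF; an empty line terminates the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


# Made-up request headers, for illustration only.
invite = {
    "Via": "SIP/2.0/UDP client.example.com;branch=z9hG4bK776asdhds",
    "From": "<sip:alice@example.com>;tag=1928301774",
    "To": "<sip:bob@example.com>",
    "Call-ID": "a84b4c76e66710@client.example.com",
    "CSeq": "314159 INVITE",
}
print(build_503(invite, retry_after_s=30))
```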

The figure below depicts a scenario in which a load balancer distributes the traffic to two different proxies.

Load balancer distributes the traffic to two different proxies

In the case of a DoS attack it is most likely that all the SIP servers in a SIP cluster will be affected and overloaded at the same time. When the first server replies with a 503, the load balancer will forward the traffic destined for that server to the other server. With the additional traffic this server will become overloaded as well and will issue 503 replies. Shifting the traffic from one server to another has only made the situation worse for the second server.

This shifting of traffic can also lead to on-off behavior. Consider the case in which an attacker generates traffic that causes both servers to run at 100% of their capacity. When one of them issues a 503 response, the traffic destined for it will be forwarded to the other, which will now receive traffic at 200% of its capacity and will hence issue a 503 response as well. If the Retry-After value of the first server expires before that of the second server, the first server will suddenly receive traffic at 200% of its capacity and will reject it with another 503. This on-off behavior can even lead to a lower average throughput, making the 503 approach less than optimal for cases in which a SIP component receives SIP messages from only a small number of other SIP instances.

If a SIP server receives requests from a large number of user agents, the 503 approach can work much more efficiently, as only the user agents that receive the 503 response will try another destination. Further, the on-off behavior will not be observed in this case, as spreading the 503 responses among the clients gives the overloaded SIP instance more fine-grained control over the amount of work it receives. Naturally, if the senders are bots that do not respect the Retry-After header, using a 503 will not be sufficient for protecting the server from getting overloaded.
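The on-off behavior can be reproduced with a toy simulation. The model below is a deliberately crude sketch and all of its parameters are assumptions: time advances in ticks, the load balancer splits the offered load evenly over the servers it is currently allowed to use, and a server that is offered more than its capacity serves what it can in that tick and then answers with a 503 whose Retry-After keeps it out of rotation.

```python
def simulate(offered: float, capacity: float, retry_after: list, ticks: int) -> float:
    """Return the average load served per tick by the whole cluster.

    retry_after[i] is the number of ticks server i asks to be left alone
    after it has signalled overload with a 503.
    """
    blocked_until = [0] * len(retry_after)
    served_total = 0.0
    for t in range(ticks):
        available = [i for i, until in enumerate(blocked_until) if t >= until]
        if not available:
            continue  # every server is in back-off: this tick's traffic is lost
        share = offered / len(available)
        for i in available:
            served_total += min(share, capacity)
            if share > capacity:
                # Overloaded: reply with 503 and drop out of rotation.
                blocked_until[i] = t + 1 + retry_after[i]
    return served_total / ticks


# Two servers with a capacity of 100 each and different Retry-After values.
print(simulate(offered=150, capacity=100, retry_after=[3, 5], ticks=100))
print(simulate(offered=220, capacity=100, retry_after=[3, 5], ticks=100))
```

In this model the cluster serves the full 150 units while the load stays below capacity, but once the offered load exceeds the combined capacity the servers spend most ticks in back-off and the average served load falls below the capacity of even a single server.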

There are already a couple of attempts at solving the issue. A group was created within the IETF that is looking at different approaches for making SIP servers more resistant to overload and at different algorithms for congestion control in the context of SIP. I will provide an overview of this work in my next blog posting.
