Posts Tagged ‘IETF’

SIMPLE Working Group Update

March 1st, 2011 by Ben Campbell under SIP

As I and others on this blog have mentioned on several occasions, the SIMPLE (or the more formal and rather awkward: “SIP for Instant Messaging and Presence Leveraging Extensions”) working group of the IETF has been responsible for defining how to do Presence and Instant Messaging applications using SIP and related protocols. The SIMPLE working group has existed for some time; in fact, it’s one of the oldest ongoing working groups in the Real-time Applications and Infrastructure (RAI) area of the IETF. I am currently a co-chair for SIMPLE.

I write to tell you that SIMPLE’s work is almost done. We are finally seeing the light at the end of this long tunnel. Of the four remaining work items, one is in the AUTH48 state. (This means that the RFC editor has presented a candidate for the final RFC version back to the authors for any last minute edits and approval.) One entered Working Group Last Call (WGLC) last week. There are only two work items that may still see controversy, and one of those is in IESG review.

These drafts are, respectively, draft-ietf-simple-msrp-acm, draft-ietf-simple-simple, draft-ietf-simple-msrp-sessmatch, and draft-ietf-simple-chat.

The first draft extends the MSRP protocol to allow the endpoints to negotiate which one will open a TCP connection to its peer. I blogged about this draft some time ago. We should see publication of the resulting RFC any day now. In fact, it’s already been assigned a number: RFC 6135. [Update: RFC 6135 was officially published on February 28.]

The second, draft-ietf-simple-simple (aka “SIMPLE made Simple”), is an informational draft that acts as a road-map and secret-decoder-ring for the various specifications produced by the SIMPLE working group. (Keep in mind that there is no one protocol known as SIMPLE. But we still tend to use the term SIMPLE informally to refer to the resulting suite of protocols and architecture.) The fact that this draft is in WGLC means the author believes that this draft is essentially ready to be sent to the IESG for final review and publication. It’s possible that the last call review could uncover some controversial point that would require more work. But given the nature of this draft, I expect that any WGLC feedback is more likely to be clarification and editorial comments.

We do know in advance, however, that draft-ietf-simple-simple may require minor editing to reflect the final disposition of the last two drafts below. This means that, regardless of its current completion state, draft-ietf-simple-simple will be the last draft to be published by SIMPLE.

Draft-ietf-simple-msrp-sessmatch describes an extension to MSRP to make it more friendly to Session Border Controllers (SBCs). The way that MSRP devices match TCP connections to message sessions means that, if an MSRP session traverses an SBC, that SBC has to re-write the To-Path and From-Path header fields in a manner similar to an MSRP Relay. Some working group participants expressed concern that this requirement could impact SBC performance. The sessmatch draft would allow supporting endpoints to work across SBCs that do not change MSRP messages en route. However, there are still ongoing discussions concerning the impact on security and interoperability.

Assuming that the sessmatch draft has not become a moot point by then, I plan to go into considerably more detail on it and the surrounding controversy in my next blog entry.

Then, finally, there’s draft-ietf-simple-chat. This draft defines how to create MSRP “chatrooms” with conference servers. There’s still some controversy over how this draft interacts with some similar work from the XCON working group.

Hopefully, we will resolve the issues around these last two drafts soon, at which time I hope to be able to title a blog entry “SIMPLE Finally Done!”

Early Media or Late Charging?

November 22nd, 2010 by Jiri Kuthan under SIP

In today’s article, I would like to address something that is frequently confusing to SIP newcomers: the concept of early media. Early media is about exchanging voice before the call actually happens – but isn’t the call actually happening once you begin to hear each other? What then is this feature good for?

That question is not rhetorical, because early media introduces a bunch of protocol traps. For example, a call can be forked in SIP to multiple destinations. They can all start exchanging early media with the caller, and the caller’s phone may totally confuse the caller by playing multiple conversations in parallel. The network may be confused as well, because the call setup under early media is neither completed nor declined. The early media keeps “setting up” the call, and resources remain allocated in a server with no real impact on the service. One could misuse this behavior to mount a DoS attack on the server, or have an endless “early media” conversation between two cooperating parties. So why are we having this?

This artifact can only be understood with knowledge of SIP history and the effort to mimic the PSTN in SIP. In particular, PSTN queuing announcements (“please wait until an operator is available”) and gateway interoperability have been the most frequently debated cases for which the notion of early media was introduced to SIP. It is not really a signaling necessity, though. A call could technically be declared established during the initial announcements, even though this phase of a call is worthless to the caller. This way the call setup would complete earlier, occupy fewer network resources and shorten the forking race-condition window.

However, that’s not the way billing works in the PSTN model. Billing is frequently postponed to the moment when a caller gets a “real service,” such as a human representative of an airline. The SIP standard has chosen to mimic this model in the IP environment. In short, “early media” is as troublesome as its side effects appear.

As a colleague in the IETF has mentioned humorously, “It should have been named ‘late charging.’”

How do SIP endpoints find the right servers?

September 22nd, 2010 by Robert Sparks under SIP

When a SIP endpoint is ready to register with a service, it has the name of the service and the Address of Record (AoR) that it wants to register under. Both of these are constructed as SIP URIs. For example, my phone might register sip:Robert.Sparks@tekelec.com by sending a REGISTER request to sip:tekelec.com. It takes the domain name from that URI and uses it to start a series of DNS queries as specified in RFC 3263: Locating SIP Servers.

The algorithms specified in that RFC allow an endpoint to learn what transport (UDP, TCP, TLS over TCP, SCTP) to use, and what IP address and port to send the message to. They also give the service provider tools to provide redundancy and load-leveling. Here’s a short overview of how it works:

[Figure: overview]

The endpoint will first make a query for all Naming Authority Pointer (NAPTR) records for tekelec.com. These records allow service providers to advertise various services. The records that are returned might look like this:

[Figure: NAPTR records]

The service field contains strings like “SIP+D2U” identifying the service being advertised – the full set of strings currently defined for SIP is:

  SIP+D2T (SIP over TCP)
  SIPS+D2T (SIP over TLS over TCP)
  SIP+D2U (SIP over UDP)
  SIP+D2S (SIP over SCTP)
  SIPS+D2S (SIP over TLS over SCTP)

As new service strings are standardized, they will be registered with IANA.

The order and service fields allow the service provider to say things like “If you support it, you must use TCP” or “Try TCP first and if that fails, try UDP”. The numbers are processed from lowest to highest. Records with lower order values are inspected first. Once a record is found with a protocol the endpoint supports, it will only consider other records with that same order value. When multiple records appear with the same order value, they are considered in preference order.

Some examples:

If these records are returned, the service is saying “If you support TCP, use that. If it fails, stop. Only try UDP if you don’t support TCP.”

  tekelec.com. IN NAPTR 10 50 "s" "SIP+D2T" "" _sip._tcp.tekelec.com.
  tekelec.com. IN NAPTR 20 50 "s" "SIP+D2U" "" _sip._udp.tekelec.com.

If the following records are returned, the service is saying “Try SCTP first if you support it. If you don’t or it fails, try TCP. If you don’t support that, or it fails, try UDP”:

  tekelec.com. IN NAPTR 50 10 "s" "SIP+D2S" "" _sip._sctp.tekelec.com.
  tekelec.com. IN NAPTR 50 20 "s" "SIP+D2T" "" _sip._tcp.tekelec.com.
  tekelec.com. IN NAPTR 50 30 "s" "SIP+D2U" "" _sip._udp.tekelec.com.

Let’s proceed assuming those last three records were returned, and that the endpoint I’m using only supports TCP and UDP. In this case, the endpoint will use the second of those three records, learning that it should use TCP and that it should take “_sip._tcp.tekelec.com” as input into the next step.
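The selection rules above can be sketched in a few lines of Python. This is only an illustration of the order/preference logic as described here, not a complete RFC 3263 implementation. The names Naptr, SERVICE_TRANSPORT, candidate_naptrs, STRICT and FALLBACK are made up for the example, and the records are the two sets shown above.

```python
from typing import NamedTuple


class Naptr(NamedTuple):
    order: int
    preference: int
    service: str       # e.g. "SIP+D2T"
    replacement: str   # the SRV name to query in the next step


# Transport advertised by each service string listed earlier.
SERVICE_TRANSPORT = {
    "SIP+D2U": "UDP",
    "SIP+D2T": "TCP",
    "SIPS+D2T": "TLS",
    "SIP+D2S": "SCTP",
    "SIPS+D2S": "TLS-SCTP",
}


def candidate_naptrs(records, supported):
    """Return the records the endpoint may try, in the order it should try them."""
    usable = [r for r in records if SERVICE_TRANSPORT.get(r.service) in supported]
    if not usable:
        return []
    # The lowest order value with a supported protocol wins; records with any
    # other order value are then ignored entirely.
    best_order = min(r.order for r in usable)
    same_order = [r for r in usable if r.order == best_order]
    # Within that order value, lower preference values are tried first.
    return sorted(same_order, key=lambda r: r.preference)


# "If you support TCP, use that. If it fails, stop."
STRICT = [
    Naptr(10, 50, "SIP+D2T", "_sip._tcp.tekelec.com."),
    Naptr(20, 50, "SIP+D2U", "_sip._udp.tekelec.com."),
]

# "Try SCTP, then TCP, then UDP."
FALLBACK = [
    Naptr(50, 10, "SIP+D2S", "_sip._sctp.tekelec.com."),
    Naptr(50, 20, "SIP+D2T", "_sip._tcp.tekelec.com."),
    Naptr(50, 30, "SIP+D2U", "_sip._udp.tekelec.com."),
]

for record in candidate_naptrs(FALLBACK, {"TCP", "UDP"}):
    print(record.service, record.replacement)
```

For an endpoint that supports only TCP and UDP, the FALLBACK set yields the TCP record first and the UDP record as a fallback, while the STRICT set yields the TCP record alone.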

The endpoint now queries the DNS for all the SRV records matching “_sip._tcp.tekelec.com”. The SRV records returned will have this form:

[Figure: SRV record format]

The endpoint will process all records ordered by the priority field, from lowest to highest. If multiple records have the same priority, the endpoint will choose randomly from them, weighting the probability of selecting a particular record using the weight field. This gives the service provider a tool to realize a form of load distribution.

Assume the following records are returned:

_sip._tcp.tekelec.com. IN SRV   10 1 5060 crowned.tekelec.com.
_sip._tcp.tekelec.com. IN SRV   10 1 5065 crested.tekelec.com.
_sip._tcp.tekelec.com. IN SRV   10 2 6065 golden.tekelec.com.

The endpoint will randomly choose crowned.tekelec.com 1/4 of the time, crested.tekelec.com 1/4 of the time, and golden.tekelec.com 2/4 = 1/2 of the time. Let’s assume the random selection chose crested.tekelec.com. The endpoint knows to use port 5065 when it sends its request.
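Here is a rough Python sketch of that weighted choice, using the three records above. It is a simplification: it only looks at the lowest priority group and ignores corner cases such as zero weights or falling back to the next priority after a failure. The names Srv, selection_odds and pick_server are invented for the example.

```python
import random
from typing import NamedTuple


class Srv(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


SRV_RECORDS = [
    Srv(10, 1, 5060, "crowned.tekelec.com."),
    Srv(10, 1, 5065, "crested.tekelec.com."),
    Srv(10, 2, 6065, "golden.tekelec.com."),
]


def lowest_priority_group(records):
    """Keep only the records that share the lowest priority value."""
    lowest = min(r.priority for r in records)
    return [r for r in records if r.priority == lowest]


def selection_odds(records):
    """Probability that each target is chosen from the lowest priority group."""
    group = lowest_priority_group(records)
    total = sum(r.weight for r in group)
    return {r.target: r.weight / total for r in group}


def pick_server(records, rng=random):
    """Pick one record at random, weighted by the weight field."""
    group = lowest_priority_group(records)
    return rng.choices(group, weights=[r.weight for r in group], k=1)[0]


print(selection_odds(SRV_RECORDS))
chosen = pick_server(SRV_RECORDS)
print(chosen.target, chosen.port)
```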

Finally, the endpoint looks up A or AAAA (depending on whether it is using IPv4 or IPv6) records for crested.tekelec.com, yielding the IP address to send the request to.

At this point the endpoint has the information it needs to send the request to the right server.

That example assumed that the endpoint was starting with a SIP URI. The RFC 3263 steps are the same whether the endpoint is preparing to send a REGISTER request or an INVITE request to start a call.

Sometimes, the endpoint starts with an E.164 formatted telephone number instead of a SIP URI. The ENUM specs define how to convert that telephone number into a URI. Once the endpoint has performed that conversion, it follows the same RFC 3263 algorithm discussed above starting with that URI to find the server to contact.
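As I understand the ENUM specs, the first step of that conversion is purely mechanical: strip the number down to its digits, reverse them, separate them with dots, and append e164.arpa. The resulting domain name is then queried for NAPTR records that carry the URI. A small sketch of just that first step follows; the function name enum_domain and the sample number are made up for illustration.

```python
def enum_domain(e164_number, suffix="e164.arpa"):
    """Build the domain name to query for an E.164 number (first ENUM step only)."""
    # Keep the digits only: drop "+", dashes, spaces and other separators.
    digits = [ch for ch in e164_number if ch.isdigit()]
    if not digits:
        raise ValueError("no digits found in number")
    # Reverse the digits, put a dot between them, and append the ENUM suffix.
    return ".".join(reversed(digits)) + "." + suffix


print(enum_domain("+1-555-123-4567"))
```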

SIP Overload Control: The next QoS?

August 17th, 2010 by Dorgham Sisalem under SIP

Throughout the last 40 years or so, quality of service (QoS) and congestion control have been among the most popular Internet-related research and standardization topics in academia and industry. There must be at least a couple thousand papers (I must confess that I have contributed at least 10 or 20 of them) that describe possible solutions for supporting some level of QoS in the Internet. I assume that the major reason for this is the vagueness of the problems to be solved and the impossibility of getting a simple answer that fits all scenarios. Interest in these topics has, however, gone down considerably in recent years, after the recognition that TCP can work very well with only one or two congestion control schemes and that over-provisioning the bandwidth is an efficient way of achieving a good level of QoS.

Recently, the IETF started the SIP Overload Control group, and I am starting to wonder whether SIP overload research and standardization will actually be the follow-up for classical QoS and congestion control research.

As discussed in a previous blog, there are two possible architectures for solving the overload issue at a SIP component. With the local or stand-alone architecture, the overloaded SIP component will drop or reject excess traffic by itself. In the cooperative architecture, the overloaded SIP component will ask its neighbors to reduce the amount of traffic they are sending.
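To make the difference concrete, here is a toy Python model of the two architectures. It is only a sketch under my own assumptions: traffic is a single request rate, and in the cooperative case every neighbor is asked to cut its rate by the same fraction. Neither rule comes from a specification, and the function names are invented.

```python
def local_control(offered, capacity):
    """Stand-alone: the overloaded component sheds the excess traffic itself."""
    accepted = min(offered, capacity)
    rejected = offered - accepted
    return accepted, rejected


def cooperative_control(offered_by_neighbor, capacity):
    """Cooperative: ask every neighbor to reduce what it sends.

    Assumption for this sketch: all neighbors are throttled by the same fraction.
    """
    total = sum(offered_by_neighbor.values())
    if capacity >= total:
        return dict(offered_by_neighbor)  # no overload, nothing to reduce
    keep = capacity / total
    return {name: rate * keep for name, rate in offered_by_neighbor.items()}


print(local_control(150, 100))
print(cooperative_control({"proxy-a": 100, "proxy-b": 100}, 100))
```

In the first case the overloaded component still has to receive and reject 50 requests; in the second, the excess never reaches it, but each neighbor now has to decide what to do with the traffic it was told not to send.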

In my view, the local approach does not require standardization. In fact, it is questionable whether one single approach can solve all related issues. Algorithms dealing with the communication between user agents and proxies will be different from those dealing with the communication between proxies. Overload control when UDP is used as the transport protocol will often look different from the cases when TCP or SCTP is used. So in the end, it will be left to the vendor to decide the best approach to use. The overload control logic used at different SIP components will most likely differ depending on the type of the component (e.g., proxy vs. application server), the deployment scenario and the vendor’s preferences.

Achieving a cooperative type of overload control will require the exchange of information between the SIP components. While this should push the overload away from the congested component there are still issues.

The overload is only pushed away one hop. That is, if some component is informed that a neighbor is overloaded, then this component will have to deal with the excess traffic by itself; it cannot ask its own neighbors to reduce their traffic and hence cannot propagate the overload information to the source of the overload. So this component will have to decide which SIP messages to drop or reject. Not knowing the exact needs of the overloaded component means that the wrong SIP messages might be dropped. However, I am sure someone will come up with an XML extension of SIP in which an overloaded SIP component will indicate to its neighbors what kind of SIP messages should be treated and in which way.

So, similar to QoS and congestion control, we again have a problem for which we are unlikely to find a simple solution that fits all scenarios.

This is not to say that the issue of SIP overload control should not be researched. I actually believe that there should be more work on specifying and testing standalone algorithms as each SIP component will need to deploy some overload control. I just hope that the search for the perfect solution will not become the next holy grail of PhD and IETF work that will make SIP even more complex than it is today. 

IETF Meeting #78

August 11th, 2010 by Jiri Kuthan under SIP

From the VoIP perspective, the 78th IETF meeting last week in Maastricht went quite as expected. After all, RFC 3261 is already eight years old, widely deployed, and accompanied by numerous specifications that fix and extend it, so there is simply little news to be expected and heralded.

More novelty typically comes from the BoF (Birds of a Feather) sessions. These are meetings set up to verify with the larger audience if a new work proposal is acceptable to the IETF community. BoFs are always an interesting mixture of enthusiastic proponents of new and interesting pieces of work in confrontation with all sorts of opponents. In this IETF meeting I found the most exciting BoF to be on a broadband home gateway topic. Previously, the IETF focused on protocol design and was rather cautious about other aspects such as “box design,” “system architectures” and “network architectures.” As an interesting change, the broadband home gateway BoF was well received. On that note, a Cisco-backed presentation on telepresence was quite warmly welcomed as well, despite the fact that it didn’t actually suggest any protocol work.

Yet BoFs are nowadays still meetings with relatively stable agendas. Where the really new efforts form is in the so-called “BarBoFs”: loosely organized meetings where new ideas are debated in small circles of friendly experts before soliciting feedback from a larger community. Let me pick “TLS@DNSSEC” for you. While the name may sound terribly technical, the idea behind it is very simple: use the DNS as a certification authority. There are several reasons why I believe this makes really good sense. Technically, a structured hierarchical system is built in, and with DNSSEC the system is fairly secure as well. Economically, providers have always been very favorable with the DNS name. Most importantly, I think, such a system would have a good adoption path that is application-independent. The DNS hierarchy is today deployed world-wide, well understood both technically and in terms of trust relationships, and in use by numerous applications. If you would like to have a new app leveraging a trust relationship established using certificates, then DNS is just an incredibly viable way to achieve it.

SIP Trunking: Request Routing

July 13th, 2010 by Adam Roach under SIP

SIP trunking, broadly defined, is a service in which an Internet Telephony Service Provider (ITSP) provides service to a customer-operated Private Branch Exchange (PBX). There has been considerable work on defining parameters around commercial SIP Trunk offerings over the past few years, including the SIPconnect effort within the SIP Forum and the Business Trunking specification developed by ETSI.

One of the problems that has remained most pervasive, however, is the means by which an ITSP knows where to send messages destined for a particular customer. Early offerings frequently required manual provisioning of customer IP addresses – calls addressed to one of a customer’s phone numbers would be routed to the address that they gave the ITSP when the service was set up. Unfortunately, this approach suffers from a large number of shortcomings. For example, the additional provisioning step of gathering IP address information from customers leads to less efficient provisioning and higher operational costs. Also, this kind of setup requires customers to contact their ITSP if they ever need to change the IP address of their PBX. And, since such provisioning changes often take hours or days, this approach can leave customers without phone service for very long periods of time.

The first serious attempts to solve this problem came from the IMS network, and were modeled on the way IMS handles single users with multiple AORs. Basically, the PBX would register a single identity – a lead number, for example – and the ITSP would presume that calls for all the identities associated with that PBX should be routed to the same destination. It was a very simple solution to the problem, and it worked passably for the kinds of environments that IMS can assume (i.e., tightly controlled walled garden networks, where non-standard behavior can be provisioned into SIP servers by bilateral agreement between the ITSP and the PBX owner).

This naïve solution to the problem, however, suffered from a number of drawbacks. Significant details about processing of inbound INVITE requests were left unspecified, leading to very real deployment issues in the field. Further, this very real change to the semantics of REGISTER – that is, its nature of registering many disparate AORs instead of a single AOR – was not signaled between the PBX and the ITSP. Outside of tightly-controlled walled garden networks, this led to situations in which the ITSP or the PBX thought the IMS mechanism was in use while the other end did not. The resulting call failures – which often would involve signaling loops – were difficult to diagnose, and even more difficult to solve. The solution also suffered from being designed without significant input from SIP protocol experts, making mistakes such as defining a wildcarding syntax that is fundamentally incompatible with SIP syntax in general.

However, the key problems were far more structural than these, which could be solved by minor tweaks to the specification. In particular, while these attempts did manage to make basic calls work under the right circumstances, they were designed without regard for key registration-based mechanisms developed within the IETF. Interaction with the registration event package was added as an afterthought, and in a way that assumed everyone in the network would be aware of the new REGISTER semantics. No provisions were made for allowing the use of temporary GRUUs, which are a critical part of the ability to make and receive calls in an anonymous fashion.

To address this situation, the IETF took on work near the end of last year to specify a mechanism for registering multiple AORs with a single SIP message. This work was spurred predominantly by the SIP Forum’s SIPconnect work. Within the SIPconnect effort, it became apparent that the existing solutions weren’t sufficient for the more general architectures they wanted to enable. The resulting working group – called MARTINI – has been working at a feverish pitch over the past six months to produce a mechanism that solves the registration problem, while addressing the shortcomings of the previous mechanisms.

The proposed solution [1] has largely stabilized, and is now entering a final comment period within the MARTINI working group before being passed off to the IETF leadership for publication. At a high level, this solution sidesteps a large number of the problems that existed in prior solutions by closely simulating what would happen if the PBX sent a separate REGISTER message for each of its phone numbers. In other words, it uses REGISTER to update a registration database, in contrast to earlier solutions that were effectively updating a broader domain routing database.

The solution also includes significant provisions to ensure that previously-defined registration-related mechanisms in SIP remain viable for PBXes that choose to use it.
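The contrast between the two approaches can be shown with a toy registration database in Python. This is purely illustrative: the function names, the phone numbers and the contact URI format are my own inventions for the sketch, not the syntax defined by the working group.

```python
def lead_number_route(lead_number, pbx_contact):
    """Earlier approach: one registered identity stands in for the whole PBX."""
    # Only the lead number appears in the registration database. Every other
    # number depends on routing rules provisioned out of band.
    return {lead_number: pbx_contact}


def per_number_bindings(numbers, pbx_host):
    """Bulk approach: behave as if each number had sent its own REGISTER."""
    # Hypothetical contact format, chosen only to make the bindings visible.
    return {number: f"sip:{number}@{pbx_host}" for number in numbers}


numbers = ["+12145550100", "+12145550101", "+12145550102"]
print(lead_number_route(numbers[0], "sip:pbx.example.com"))
print(per_number_bindings(numbers, "pbx.example.com"))
```

Because the second form leaves an ordinary binding per number in the registration database, mechanisms that already work against that database have something to operate on.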

With any luck, then, we should finally have a general-purpose solution to the problem of how to route requests over a SIP trunk to a PBX finished and stabilized within the year. Combined with the other work being done in the SIP Forum SIPconnect group, this should lead to a well-defined, unified specification that allows ITSPs to quickly and confidently deploy SIP trunking services. And that can only be a good thing for SIP.

__

[1] Full Disclosure: I am the editor of the solution developed by the working group, and have been deeply involved in its design.

FAQ: What is Early Media?

May 19th, 2010 by Robert Sparks under SIP

In short, early media is any media related to an attempt to initiate a session that arrives before the session is fully established.

SIP negotiates the setup of media streams using the offer/answer technique defined in RFC 3264. The simplest call establishment might look something like the flow shown in Figure 1.

Figure 1

The offer contains a description (using the Session Description Protocol) of how and where Alice is willing to receive media. This description specifies the address and port at which Alice will receive packets, the protocol used to send those packets (typically RTP using the AVP profile), and information about how the media will be encoded in those packets.
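To show where those pieces live in an offer, here is a small Python sketch that pulls the address, port, protocol and payload formats out of a minimal SDP body. The sample offer and the function name parse_offer are made up for illustration, and the parser deliberately ignores most of SDP (attributes, port ranges, multiple addresses).

```python
SAMPLE_OFFER = """\
v=0
o=alice 2890844526 2890844526 IN IP4 pc33.example.com
s=-
c=IN IP4 192.0.2.1
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
"""


def parse_offer(sdp):
    """Extract where and how the offerer is willing to receive each media stream."""
    session_address = None
    streams = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("c="):
            # c=<nettype> <addrtype> <connection-address>
            address = line[2:].split()[2]
            if streams:
                streams[-1]["address"] = address  # media-level override
            else:
                session_address = address         # session-level default
        elif line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            media, port, proto, *formats = line[2:].split()
            streams.append({
                "media": media,
                "address": session_address,
                "port": int(port),
                "proto": proto,
                "formats": formats,
            })
    return streams


print(parse_offer(SAMPLE_OFFER))
```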

Once Bob receives Alice’s offer, he can start sending media based on that description right away. In many deployments, that’s a normal behavior for Bob’s endpoint, especially if Bob doesn’t answer immediately. His endpoint may send a ringing sound or an announcement as shown in Figure 2.

Figure 2
 

This frequently happens when Alice’s SIP INVITE reaches a PSTN gateway. When the gateway receives the INVITE it tries to set up a call on the PSTN side and will send any media it receives before that call completes (such as ringback) back to Alice. See Adam’s SIP-I and SIP-T Challenges blog entry for more detail.

The core SIP specification did not include any mechanism to ensure that provisional responses (like the 183 Session Progress in the above example) are reliably delivered. One could be lost in transit and neither end would know. RFC 3262 defines an extension to SIP to add that reliability mechanism in the form of a Provisional Response ACKnowledgement (PRACK) request. When using this extension, the responder will retransmit the provisional response, following a prescribed retransmission timer algorithm, until it receives a corresponding PRACK, as shown in Figure 3. This extension is especially important if the answer carried in the provisional response contains information that Alice would need to be able to make sense of what was in the media packets. For instance, in some uses of SRTP, that answer will contain data that Alice’s endpoint must receive before it can decrypt the media streams.

Figure 3
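As a rough sketch of that retransmission timing: my reading of RFC 3262 is that the interval starts at T1 (500 ms by default in RFC 3261), doubles after each retransmission, and that the responder gives up if no PRACK arrives within 64*T1. The Python below simply lists when the retransmissions would go out under those assumptions; check the RFC before relying on the exact values. The function name is invented.

```python
def retransmit_times(t1=0.5, give_up_after=None):
    """Seconds (after the first send) at which the provisional response is resent."""
    if give_up_after is None:
        give_up_after = 64 * t1   # assumed overall limit, per my reading of RFC 3262
    times = []
    elapsed, interval = 0.0, t1
    # Keep retransmitting while the next send still falls inside the limit.
    while give_up_after > elapsed + interval:
        elapsed += interval
        times.append(elapsed)
        interval *= 2             # the interval doubles each time
    return times


print(retransmit_times())
```

A PRACK arriving at any point simply stops the schedule; the sketch only shows the worst case where none ever arrives.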

SIP proxies can “fork” an INVITE request as they forward it, delivering the request to multiple locations. In Figure 4, Alice’s call to Bob is delivered in parallel to Bob’s cell phone, his desk phone, and his home phone. All three phones ring simultaneously. The protocol has mechanisms that will stop the ringing at Bob’s cell phone and home phone as soon as he answers his desk phone (making it unlikely that Alice will have to deal with more than one of Bob’s phones being answered). However, if more than one of Bob’s devices sends early media, Alice’s phone will have to do something reasonable while receiving multiple streams. Many existing endpoints play only one stream to Alice (chosen arbitrarily) and quietly discard the information from the other streams. Adam explored some of the consequences of this for PSTN interworking in an earlier article.

Figure 4
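The “play one stream, discard the rest” behavior can be sketched as follows. This is an assumption-laden toy, not how any particular endpoint works: streams are identified by an opaque key (in practice that might be an RTP SSRC or a source address), the first stream seen is the one played, and the class and method names are invented.

```python
class EarlyMediaSelector:
    """Play the first early-media stream that shows up; quietly drop the others."""

    def __init__(self):
        self.chosen = None

    def should_play(self, stream_key):
        """Return True if packets from this stream should be rendered."""
        if self.chosen is None:
            self.chosen = stream_key   # arbitrary choice: first stream wins
        return stream_key == self.chosen

    def answered(self, stream_key):
        """Once a device answers with a 200 OK, switch to that device's stream."""
        self.chosen = stream_key


selector = EarlyMediaSelector()
for key in ["cell", "desk", "cell", "home"]:
    print(key, selector.should_play(key))
selector.answered("desk")
print("desk", selector.should_play("desk"))
```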

It’s worth remembering that SIP does not put a limit on how far apart in time the INVITE and 200 OK occur. It’s possible to place an INVITE and wait many minutes, or even hours, before the call is answered with a 200 OK. During that entire time, the calling endpoint may be receiving (and sending) early media. This happens, for example, for some calls that transition into the PSTN and terminate on IVRs. Some of those systems are configured to leave the call in the ringing state, playing announcements and collecting keypresses, potentially until the end of the call. The call might only be “answered” (resulting in a 200 OK from the SIP side of the gateway) if the interaction with the IVR caused a connection with a human agent.

SIP and NAT Traversal: If not SBCs, then how?

April 20th, 2010 by Adam Roach under SIP

Several previous entries in this blog have dealt with the issues that arise when SBCs and other back-to-back user agents (B2BUAs) are included in a SIP network. Of course, SBCs do serve useful purposes in the network – that’s why they were deployed – and you can’t really get rid of them until you understand how you’re going to do those things without an SBC in place.

One of the biggest issues that SBCs typically address is helping the audio and video sessions that are set up with SIP get through NATs and firewalls. If we get rid of SBCs, how do we do this? Luckily, the IETF has developed a suite of tools for exactly this purpose: STUN, TURN, and ICE. And, although they’ve been a long time in coming, the final RFC versions of these protocols are about to be published in the upcoming few weeks.

STUN, defined in RFC 5389, allows clients to determine that they are behind a NAT; and, if they are, to figure out which public address and port has been assigned to them by the NAT. Depending on the kind of NAT, this may be sufficient to allow NAT traversal for media. The load on a STUN server is generally very low, since it only has to process one message exchange for each call established. There’s also an adjunct RFC, RFC 5780, which will be published soon; it allows clients to determine some of the properties of the type of NAT they’re behind.

Once a client has used STUN to determine the address assigned to it on the firewall, it can then send this address to the other SIP device as the location to send media.

Figure 1: Using STUN to find an external IP address
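For a feel of how little is involved in that exchange, here is a Python sketch of the two ends of it as I understand RFC 5389: building a Binding Request (a 20-byte header with no attributes) and decoding the XOR-MAPPED-ADDRESS value that comes back. It handles IPv4 only, does no network I/O, and skips parsing of the response header and attribute list. The function names are my own; encode_xor_mapped_ipv4 exists only so the decoder can be exercised.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001


def build_binding_request(transaction_id=None):
    """Return a 20-byte STUN Binding Request with no attributes."""
    if transaction_id is None:
        transaction_id = os.urandom(12)   # 96-bit transaction ID
    # Header: message type, message length (0 = no attributes), magic cookie.
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def decode_xor_mapped_ipv4(value):
    """Decode the value of an XOR-MAPPED-ADDRESS attribute (IPv4 only)."""
    _reserved, family, xport = struct.unpack("!BBH", value[:4])
    if family != 0x01:
        raise ValueError("only IPv4 is handled in this sketch")
    # The port is XORed with the top 16 bits of the cookie, the address with all 32.
    port = xport ^ (MAGIC_COOKIE >> 16)
    (xaddr,) = struct.unpack("!I", value[4:8])
    address = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
    return address, port


def encode_xor_mapped_ipv4(address, port):
    """Inverse of the decoder, included only to exercise it."""
    (addr,) = struct.unpack("!I", socket.inet_aton(address))
    return struct.pack("!BBHI", 0, 0x01, port ^ (MAGIC_COOKIE >> 16), addr ^ MAGIC_COOKIE)


request = build_binding_request()
print(len(request), request[:8].hex())
print(decode_xor_mapped_ipv4(encode_xor_mapped_ipv4("203.0.113.7", 54321)))
```

The address and port that come out of the decoder are what the client would then advertise to the other SIP device as the place to send media.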

 

TURN, defined in the forthcoming RFC 5766, uses a network server to act as a relay for client media. The SIP endpoint uses TURN to set up an association with the TURN server, and then advertises the TURN server’s address as the place that media is to be sent to. The endpoint then uses the TURN association to send and receive media through the TURN server to and from the remote endpoint. This has a significantly higher chance of success than using STUN alone. On the other hand, the load on a TURN server is generally very high, as it must relay every packet in a media session to and from the endpoint.

Figure 2: Using TURN to relay media

 

ICE, defined in the upcoming RFC 5245, doesn’t have dedicated network servers per se. ICE is a technique employed by the endpoints to find the “best” viable path between the endpoints. ICE uses both STUN and TURN as means to collect potential candidate addresses. The endpoints then try these candidate addresses (along with other addresses they have, such as local IP addresses) pair-wise with the other endpoint. There’s a ranking system that ICE uses to try to find the “best” path (direct is better than through a NAT; through a NAT is better than using TURN, etc). This allows it to set up an optimal connection with the other terminal without needing detailed information about the network topology.

Figure 3: Using STUN and TURN with ICE

 

In practice, the application of ICE gives endpoints about the same chance of success as TURN does. The key difference is that when the endpoints use ICE, the TURN server isn’t burdened with calls that could have succeeded using STUN or direct connections.
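The ranking system mentioned above boils down to a pair of formulas. The Python sketch below uses the priority formulas and the recommended type preference values as I recall them from the ICE specification (host 126, peer reflexive 110, server reflexive 100, relayed 0); treat the constants as assumptions and check RFC 5245 before depending on them. The function names are invented.

```python
# Recommended type preferences as I recall them from the ICE specification.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(kind, local_pref=65535, component_id=1):
    """Priority of a single candidate: type preference dominates."""
    return (2**24) * TYPE_PREFERENCE[kind] + (2**8) * local_pref + (256 - component_id)


def pair_priority(controlling, controlled):
    """Priority of a candidate pair, from the two candidates' priorities."""
    g, d = controlling, controlled
    return (2**32) * min(g, d) + 2 * max(g, d) + (1 if g > d else 0)


host = candidate_priority("host")     # a local address
srflx = candidate_priority("srflx")   # learned through STUN
relay = candidate_priority("relay")   # allocated on a TURN server

print(host, srflx, relay)
print(pair_priority(host, host) > pair_priority(srflx, host) > pair_priority(relay, host))
```

The ordering is the point: a direct path outranks one through a NAT, which outranks one through a TURN relay, so the relay is only used when nothing better works.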


The Problem with SIP Congestion Control

April 13th, 2010 by Dorgham Sisalem under SIP

A DoS attack on a SIP component usually has the aim of overloading one or more resources of the SIP components. These resources might include the CPU, memory or any other resource needed for processing incoming SIP messages. Once one or more resources are overloaded the SIP component will no longer be capable of dealing with the incoming traffic, which will lead to the dropping of excess traffic. This will actually lead to an even higher amount of traffic in the network due to the retransmissions of the dropped packets.

Work done at Ohta 2004 suggests that overload situations not only reduce the performance of a SIP server, but can finally lead to a complete failure of the VoIP service. The most straightforward approach for handling such situations is to ensure that the available processing resources of a SIP component are sufficient for handling SIP traffic arriving at the speed of the link connecting this component to the Internet. With modern access lines reaching gigabit speeds, provisioning the VoIP infrastructure of a provider to support such an amount of traffic, which is most likely several times the normal traffic level, can be cost prohibitive.

SIP does not provide much guidance on how to react to overload conditions. A server that is not capable of serving new requests, e.g. because it is overloaded, could reject incoming messages by sending a 503 Service Unavailable response back to the sender of the request. This signals to the sender that it should try forwarding the rejected request to another proxy and not use the overloaded proxy for some time. Further, the 503 response includes a Retry-After header indicating the period of time during which the overloaded server should not be contacted. While this reduces the load on the overloaded proxy, it results in directing the traffic that has caused the overload to another proxy, which might then get overloaded itself.
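On the sender’s side, honoring a 503 with Retry-After amounts to a small piece of bookkeeping, sketched below in Python. This is only an illustration under simple assumptions: proxies are tried in a fixed order, Retry-After is a number of seconds, and the class and method names are invented.

```python
import time


class ProxySelector:
    """Skip proxies that have answered 503 until their Retry-After period ends."""

    def __init__(self, proxies, clock=time.monotonic):
        self.proxies = list(proxies)
        self.clock = clock
        self.blocked_until = {}

    def on_503(self, proxy, retry_after_seconds):
        """Record that this proxy asked not to be contacted for a while."""
        self.blocked_until[proxy] = self.clock() + retry_after_seconds

    def pick(self):
        """Return the first proxy that is not currently blocked, or None."""
        now = self.clock()
        for proxy in self.proxies:
            if now >= self.blocked_until.get(proxy, float("-inf")):
                return proxy
        return None


selector = ProxySelector(["proxy1.example.com", "proxy2.example.com"])
selector.on_503("proxy1.example.com", 30)
print(selector.pick())
```

Note that all the rejected traffic lands on the next proxy in the list, which is exactly the side effect described above.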

The figure below depicts a scenario in which a load balancer distributes the traffic to two different proxies.

Load balancer distributes the traffic to two different proxies

In the case of a DoS attack it is most likely that all the SIP servers in a SIP cluster will be affected and will be overloaded at the same time. When the first server replies with a 503, the load balancer will forward the traffic destined for that server to the other server. With the additional traffic, this server will become overloaded as well and will issue 503 replies. Shifting the traffic from one server to another has only made the situation worse for this server.

This shifting of traffic can also lead to an on-off behavior. Consider the case in which an attacker is generating traffic that causes both servers to run at 100% of their capacity. When one of them issues a 503 response, the traffic destined for it will be forwarded to the other, which will now receive traffic at 200% of its capacity. This server will, hence, issue a 503 response as well. If the Retry-After value of the first server expires before that of the second server, the first server will suddenly receive traffic at 200% of its capacity and will reject the traffic with another 503. This on-off behavior can even lead to a lower average throughput, making the 503 approach suboptimal for cases in which a SIP component receives SIP messages from only a small number of other SIP instances.

When a SIP server receives requests from a large number of user agents, the 503 approach can work much more efficiently, as only the user agents that receive the 503 response will try another destination. Further, the on-off behavior will not be observed in this case, as spreading out the 503 responses among the clients gives the overloaded SIP instance more fine-grained control over the amount of work it receives. Naturally, if the senders are bots that do not respect the Retry-After header, using a 503 will not be sufficient for protecting the server from getting overloaded.

There are already a couple of attempts at solving the issue. In conjunction with the IETF, a group was created that is looking at different approaches for making SIP servers more resistant to overload and at different algorithms for congestion control in the context of SIP. I will provide an overview of this work in my next blog posting.
