Wait a Second!
SIP has several robustness mechanisms that leverage being able to say “Wait a bit before you try that again.”
A 486 Busy Here response can contain a Retry-After header field, allowing the endpoint to say “Please don’t try to call here again for 30 minutes,” based perhaps on knowledge it obtained from its user’s calendar.
A 500 Server Internal Error can use Retry-After to say “Something’s keeping me from servicing this particular request right now, but please try again in 5 seconds.”
A 503 Service Unavailable error can use Retry-After to say something much stronger: “Something’s keeping me from servicing _any_ requests. Don’t send me anything more for at least 30 seconds.” As we’ll see in a moment, this is a very strong statement – one that needs to be carefully invoked.
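To make the difference in scope concrete, here is a minimal client-side sketch in Python. It assumes that a 503 silences everything sent to the element that returned it, while any other response carrying Retry-After (such as a 486 or 500) only holds back the same method to the same target. The names `parse_retry_after`, `RetryGate`, `record`, and `may_send` are hypothetical, chosen for illustration, and are not taken from any SIP stack.

```python
import re


def parse_retry_after(value: str) -> int:
    """Return the leading delay in seconds from a Retry-After header value.

    Any trailing comment or parameters are ignored, so "30",
    "120 (in a meeting)" and "18000;duration=3600" all parse.
    """
    match = re.match(r"\s*(\d+)", value)
    if match is None:
        raise ValueError(f"no delay found in Retry-After value: {value!r}")
    return int(match.group(1))


class RetryGate:
    """Tracks the earliest time a request may be sent again (illustrative only)."""

    def __init__(self):
        # Maps a scope key to the earliest timestamp at which sending is allowed.
        self._not_before = {}

    def record(self, status, retry_after, next_hop, method, target, now):
        delay = parse_retry_after(retry_after)
        if status == 503:
            # Assumption: a 503 applies to all traffic toward this next hop.
            key = ("hop", next_hop)
        else:
            # Assumption: other responses apply only to this method and target.
            key = ("request", next_hop, method, target)
        self._not_before[key] = max(self._not_before.get(key, 0.0), now + delay)

    def may_send(self, next_hop, method, target, now):
        keys = [("hop", next_hop), ("request", next_hop, method, target)]
        return all(now >= self._not_before.get(key, 0.0) for key in keys)
```

With this gate, a 500 received for an INVITE to one user leaves requests to other users through the same server untouched, while a 503 from that server blocks all of them until its timer expires.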
The SIP Events architecture provides a way for an event server (such as a presence server) to say “I’m tearing down this subscription, and I need you to resubscribe, but don’t try to do so until at least 20 seconds have passed.” It does this with a NOTIFY containing a Subscription-State header similar to this:
Subscription-State: terminated;reason=probation;retry-after=20
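A subscriber needs to pull the wait time out of that header before it schedules its next SUBSCRIBE. The following is a rough Python sketch of that step, not a complete header parser. It takes only the header value (the part after "Subscription-State:"), and the function names `parse_subscription_state` and `resubscribe_delay` are hypothetical. Treating a missing retry-after as a delay of 0 is an assumption made here for simplicity.

```python
def parse_subscription_state(value: str):
    """Split a Subscription-State header value into (state, params)."""
    parts = [part.strip() for part in value.split(";")]
    state = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        name, _, val = part.partition("=")
        params[name.strip().lower()] = val.strip()
    return state, params


def resubscribe_delay(value: str):
    """Return the seconds to wait before resubscribing, or None.

    None means the header does not describe a subscription that was
    terminated with reason=probation.
    """
    state, params = parse_subscription_state(value)
    if state == "terminated" and params.get("reason") == "probation":
        # Assumption: no retry-after parameter means no minimum wait.
        return int(params.get("retry-after", "0"))
    return None
```

For the header shown above, `resubscribe_delay("terminated;reason=probation;retry-after=20")` yields 20, while an active subscription yields None.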
These mechanisms allow servers to avoid, and even redistribute, load. A registrar handling a burst of simultaneous registrations can quickly tell some or all of them to wait, using different wait times to spread the returning load out a little. One node in a cluster of presence servers can move its subscriptions to its peers by throwing all of its subscriptions into probation, as described above, again using a range of different wait times for different subscriptions. As the clients re-establish their subscriptions, the mechanisms for finding SIP servers can distribute the subscriptions among the peers.
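One simple way to pick "a range of different wait times" is to draw each one at random from a window. The Python sketch below assumes a uniform spread between a minimum and maximum wait. The window size, the function names `spread_retry_after` and `probation_header`, and the subscription identifiers are all illustrative choices, not something SIP prescribes.

```python
import random


def spread_retry_after(subscription_ids, min_wait, max_wait, rng=None):
    """Assign each subscription a wait in seconds drawn uniformly
    from [min_wait, max_wait], so the clients do not all return at once."""
    rng = rng or random.Random()
    return {sid: rng.randint(min_wait, max_wait) for sid in subscription_ids}


def probation_header(wait: int) -> str:
    """Build the Subscription-State header for a NOTIFY that ends a subscription."""
    return f"Subscription-State: terminated;reason=probation;retry-after={wait}"


if __name__ == "__main__":
    # Hypothetical subscription identifiers, for demonstration only.
    waits = spread_retry_after(["sub-1", "sub-2", "sub-3"], min_wait=20, max_wait=80)
    for sid, wait in waits.items():
        print(sid, probation_header(wait))
```

A wider window spreads the returning load further at the cost of leaving some clients without their subscription for longer.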
While the mechanisms are useful in the situations I’ve described so far, and may be exactly the tools an application relying on a limited external resource like a specialized DSP needs, they aren’t sufficient to handle the general case of overload protection. The granularity the tools work at is either very small (affecting this particular method applied to this particular resource), or very large (affecting all traffic between two elements). The IETF’s SIP Overload Control (SOC) working group is developing richer ways to help a server avoid being overloaded.
But even with those tools, there are situations where crushing load can appear before mechanisms at the SIP layer have a chance to help. Avalanche-restart scenarios, in which whole campuses or even cities full of clients come online at the same time, perhaps because power has been restored, are a good example. In the extreme, action closer to the physical layer of the network (such as using firewalls to introduce the load in smaller increments) is warranted.
Finally, like most tools, using them without understanding what they do can lead to surprising results. Any code that generates a 503 Service Unavailable response, for example, deserves careful inspection. Some early proxy implementations make the mistake of forwarding 503 responses, when they should be taking a received 503 as input into generating their own final response. By blindly forwarding a 503, they are saying “Stop talking to me” instead of “I can’t find something that can handle this request,” which leads to unintended failures, such as the following:
Here, Alice and Bob are in a SIP dialog, perhaps for a phone call. Carol and Dave are in a separate dialog, either a different call or perhaps they have a subscription set up.

Something goes wrong with Bob’s UA and it has to return a 503 to a request it received. Proxy 2 does the wrong thing and forwards the 503.

Now Carol’s next request towards Dave can’t be forwarded through Proxy 2, even though there was nothing really preventing Proxy 2 from being able to service the request. Carol and Dave have lost service unnecessarily and have no idea why.

Alice (or anyone else whose requests towards Dave would have taken the path from Proxy 1 to Proxy 2) can’t reach Dave either.
Proxy 2 should have returned its own response, probably a 480 Temporarily Unavailable, to the request that elicited the 503 from Bob’s UA. That way only the requests Alice was sending to Bob would be affected.
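The decision Proxy 2 should have made can be sketched in a few lines of Python. This is an illustration of the rule described here, not code from any real proxy. The function name `status_to_send_upstream`, the `proxy_overloaded` flag, and the `replacement` parameter are hypothetical, and the default of 480 simply follows the suggestion above.

```python
def status_to_send_upstream(received_status: int,
                            proxy_overloaded: bool = False,
                            replacement: int = 480) -> int:
    """Choose the status code a proxy sends upstream for a downstream response.

    A received 503 is treated as input to the proxy's own decision and is
    passed along unchanged only when the proxy itself cannot service requests.
    """
    if received_status == 503 and not proxy_overloaded:
        # The downstream element is in trouble, not this proxy, so say
        # "I can't reach that target" rather than "stop talking to me".
        return replacement
    return received_status
```

With this rule, Bob’s 503 reaches Proxy 1 as a 480 for that one request, and Carol’s traffic toward Dave keeps flowing through Proxy 2.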