SIP and “Secure” Communication: What does it mean?
One of the recurring topics in the discussion of SIP security is how you give users the information they need to make informed decisions. In most of these conversations, a parallel is drawn between web browser security and SIP security – usually, in terms of “why can’t SIP terminals have a simple lock icon that tells the user the call is secure?” And all major web browsers do have a simple visual indicator, like these two from Internet Explorer and Firefox:
![]()
Unfortunately, the issue with SIP is significantly more difficult than that. With web browsers, you really need to ensure only two things: that the website you’re connecting to is the web site you think you’re connecting to (authentication), that no one other than you and the website can see the information you’re sending and receiving (confidentiality). For the web, this is easy to do because TLS (used by https) provides both of these properties.
With SIP, you have at least five different major problems to solve – and possibly more, depending on how you account for them: Caller-ID, Called Party Identity, Media Privacy, Media Authentication, and Signaling Confidentiality.
Caller ID and Called Party Identity
First, when a call arrives, the user is going to want to know who is calling, similar to Caller-ID on today’s PSTN. Jiri did a series of posts (1,
2,
3) detailing the need for identity in the SIP network. (While this is a good treatment of the need for identity, I think its conclusion – that we should use the same spam-prevention mechanisms as email – is a bit naïve; as Ben later points out, 94% of all email is spam, and I think we need to do better than that.)
While some techniques can be employed to “spoof” caller ID information on the PSTN, it’s difficult to do, so people generally can and do trust what their phone says when it rings. On the other hand, since SIP signaling flows all the way out to the edge of the network, this kind of identity is much easier to fake in a SIP network. Some deployment architectures have developed specialized “transitive trust” models that get you pretty close to what the PSTN provides today, but they don’t work across the general Internet, or when you transition from one architecture to another.
A more bulletproof means of conveying identity can be performed with RFC 4474, which uses cryptography to let a proxy on the call path make an assertion about the calling party’s identity. Unfortunately, RFC 4474 does suffer from some deployment difficulties, such as perceived deficiencies in key distribution, the difficulty in asserting ownership of phone numbers, and bad interactions with SBCs. And while there are good answers to each of those issues, they still have slowed down acceptance of RFC 4474 as a solution.
A related issue is validation that the person you’re trying to reach is the person you’ve actually reached. For example, if Alice is trying to reach Bob but really reaches Charlie, she needs to know this to make an informed decision. This is even more important when Alice is trying to reach, for example, her bank. There are fairly benign reasons that the called party might not be who the caller was trying to reach – a call-forwarding service, for example – but it also may indicate something more nefarious. To fill this niche, RFC 4916
defines a mechanism for conveying called party identity back to a calling party. It shares RFC 4474’s strengths (cryptographic assertions, leveraging the web’s public key infrastructure), but suffers from the same drawbacks as well.
One interesting twist to the behavior of RFCs 4474 and 4916 is that they only protect the caller and called parties’ addresses, not their names. To protect things like caller names, it becomes necessary to use a mechanism like cryptographic certificates with S/MIME.
Media Privacy and Authentication
Another user expectation of “secure calls” is a guarantee that third parties cannot intercept their call. This is especially important when users make calls on a shared network, such as a public WiFi network, a hotel network, or certain types of cable networks. Unless the media itself is encrypted, anyone on the same network can use any one of a variety of easy-to-use call interception tools, including some very sophisticated, free ones, and record any call or calls they want to.
The other issue with media is ensuring that the media you receive is coming from the person you think it is. The ability to insert new media into a call can be highly damaging for certain types of calls.
Unfortunately, this area has historically suffered from too many solutions, as opposed to not enough. Luckily, the IETF finally winnowed the solution space down to a single approach for SIP media encryption: RFC 5763. There is also a competing solution in zRTP. This approach has some interesting properties that Jiri discussed in a previous posting – but it also suffers some non-technical drawbacks (see my response at the end of that article) that are likely to limit its deployment outside of the opensource and hobbyist communities. And, while zRTP provides encryption, it requires an onerous manual step to ensure that you’re talking to the person you think you’re talking to (and, without this protection, your call can be listened to by a sophisticated attacker in the middle of the network).
Hopefully, with the recent publication of RFC 5763, we’ll start seeing more vendor support for media privacy and authentication.
Signaling Confidentiality
A final aspect of SIP security that needs to be addressed is confidentiality of the signaling information itself. For voice calls, access to the signaling allows you to figure out who called whom and when. And, while the privacy implications of exposing that kind of information are evident enough, things get much worse once you start mixing in features like instant messaging and presence: eavesdroppers on this information can learn highly sensitive information, such as the contents of instant message conversations.
Support of TLS to protect information as it passes between network entities (say, from a phone to its proxy) is required by the baseline SIP protocol, and has fairly good implementation (on the average, approximately 50% of the implementations at the SIPit interop event
have had TLS support over the past few years). That’s a really good way to ensure that arbitrary third parties can’t eavesdrop on the information being sent.
But TLS doesn’t protect information from being intercepted by servers on the call path.
And while I might be happy to get my SIP service from bobs-discount-voip.com, I may be a bit more reticent to trust them with things I send and receive via instant messages – things like my banking information. And that brings us back to the use of S/MIME certificates, which can be used to hide this kind of information from proxies on the path (while still providing them enough information to route messages correctly).
Summary
So, back to the original question: if you wanted to have a simple, visual indicator to indicate that a call is secure… what would it mean? Is it a promise that the phone number on the caller ID is correct? How about the name? Does it mean that the media is encrypted? And, if it is, can you be sure it’s coming from where you think it’s coming from? Is the signaling protected? And, if so, is it protected from everyone, or can proxies along the call path read it? There are so many degrees of freedom here that there’s no good way to render them all to the user in a sensible fashion. And an all-or-nothing indicator (like a single lock icon) is completely nonsensical – as you’ve seen, SIP security is just about as far from “all-or-nothing” as you can get.
At this point, sadly, it’s mostly a moot point anyway – just about all SIP service providers employ exactly none of these techniques. But as user expectations around identity and privacy start colliding with the reality of service providers’ carelessness, we’re going to run into a few challenges making sure that users can be given the information they need to make informed decisions.

