Mobile giants toppled by HLR failures

Mobile data networks are failing!

Well, some are. Okay, only a few – but they’re pretty high-profile failures. This month alone we’ve seen T-Mobile USA, France’s Orange and (just last week) O2 in the UK experience serious data outages on their networks.
 
Then there’s Verizon Wireless in the US, which has been suffering from service outages from late last year to as recently as February.
 
What’s going on?
 
According to GigaOm (citing reports from Computerworld UK and Information Age), the Orange and O2 blackouts can be traced back to the HLR (home location register):
 
The HLR plays its dispatch role by receiving a constant stream of signals from devices updating the database on their current locations and activities. According to Computerworld, a data glitch in an Orange HLR node generated error messages, which then multiplied as they got knocked back and forth around the network. Just because the HLR was failing, that didn’t stop devices from sending out their updates. Like a million kids screaming “look at me!” from the backseat while you’re trying to deal with the coffee you just spilled in your lap, smartphones kept pinging the suffering HLR creating a huge bottleneck. The end result: the whole system fails, leaving millions of handsets without their lifelines to the network core.
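The bottleneck described above can be illustrated with a toy queue model. This is only a sketch under invented assumptions: the function name `simulate_backlog`, the per-tick arrival rate and the capacity figures are all hypothetical and do not come from Orange, O2 or any real HLR. It shows only that when devices keep sending updates at a steady rate while capacity drops, the backlog grows quickly and drains slowly even after capacity returns.

```python
def simulate_backlog(arrivals, capacities):
    """Toy model: return the unprocessed-update backlog after each tick.

    arrivals   -- updates sent by devices every tick (hypothetical figure)
    capacities -- updates the HLR can process in each tick (hypothetical)
    """
    backlog = 0
    history = []
    for capacity in capacities:
        # Devices keep sending updates whether or not the HLR is coping.
        backlog += arrivals
        # The HLR clears as much of the backlog as its capacity allows.
        backlog -= min(backlog, capacity)
        history.append(backlog)
    return history


# Healthy node: capacity exceeds demand, so nothing queues up.
healthy = simulate_backlog(100, [120] * 5)

# Degraded node: capacity collapses for three ticks, then recovers.
degraded = simulate_backlog(100, [120, 120, 20, 20, 20, 120, 120])

print(healthy)
print(degraded)
```

In the degraded run the backlog is still large two ticks after capacity recovers, which is the toy-model version of handsets staying cut off after the original glitch has passed.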
 
The Verizon outages have been attributed to a similar kind of fault.
 
Interestingly, because none of these outages was a failure of RAN capacity, it all adds up to a great pitch for vendors championing Diameter signaling, says GigaOm’s Kevin Fitchard:
 
Diameter’s load balancing techniques would allow the network to shift the signaling load away from elements experiencing problems — isolating failures rather than allowing them to infect everything around them.
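The general idea of steering load away from a failing element can be sketched as a health-aware round robin. This is a generic illustration, not the Diameter protocol itself and not any vendor's product: the `SignalingRouter` class, its methods and the node names are all hypothetical.

```python
class SignalingRouter:
    """Hypothetical router that skips nodes currently marked unhealthy."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._next = 0  # index of the next node to try

    def mark_down(self, node):
        # Stop sending traffic to a node that is reporting errors.
        self.healthy.discard(node)

    def mark_up(self, node):
        # Resume sending traffic once the node has recovered.
        if node in self.nodes:
            self.healthy.add(node)

    def route(self):
        """Return the next healthy node in round-robin order, or None."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.healthy:
                return node
        return None  # every node is down; the caller must shed or queue load


router = SignalingRouter(["hlr-a", "hlr-b", "hlr-c"])  # invented node names
router.mark_down("hlr-b")
print([router.route() for _ in range(4)])
```

The point of the sketch is isolation: once `hlr-b` is marked down, it receives no further messages, so its errors are not fed back into the rest of the network.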
 
Just one problem, Fitchard points out: Verizon implemented Diameter signaling platforms from Tekelec a year ago, well before its own network outages started to occur:
 
Tekelec certainly isn’t to blame for the outages – they were caused by software bugs in other elements. Yet its diameter routers weren’t able to contain the problem, either, when the network started going haywire.
 
[EDITED TO ADD (17 July 2012): Tekelec has since contacted GigaOm (and Telecom Asia) to clarify that while Tekelec did announce Verizon as a customer for its Diameter signaling router in August 2011, Verizon hadn't actually deployed it by the time the operator began experiencing service outages at the end of the year.]
 
I don’t know how much of this is cause for alarm – I wouldn't call four network outages out of hundreds of 3G operators an epidemic – but if nothing else it serves as a warning to mobile broadband operators in the region to at least double-check their HLRs – and have a contingency plan in place, whether it’s Diameter signaling or something else.
 
(Judging from the experience in Hong Kong, they might want to check their AAA software and power supply as well.)