The FCC has released a report on its investigation into T-Mobile’s major network outage over the summer, finding that T-Mobile didn’t follow network reliability best practices that could have prevented the more than 12-hour outage or kept it from spreading to customers across the country.
T-Mobile’s widespread network outage started on June 15 around 12:30 p.m. and lasted until 12:45 a.m. on June 16, when it was fully resolved. Many T-Mobile customers couldn’t make calls or send texts, although data was still accessible to most customers.
It also affected other carriers’ customers, who had trouble completing calls to, and receiving calls from, users on T-Mobile’s network.
The staff report (PDF) released Thursday details what caused the outage and its impacts, as well as steps T-Mobile took to help ensure it doesn’t happen again.
The FCC’s Public Safety and Homeland Security Bureau (PSHSB) estimates that at least 41% of all calls on or to T-Mobile’s network failed during the outage, including 23,621 calls to 911. The FCC said it didn’t receive any comments that indicated anyone experienced physical harm as a direct result of the outage or not being able to reach 911.
Still, input from public commenters showed personal impacts, such as not being able to contact medical professionals, family members, emergency help or roadside assistance.
“The outage likely produced a large financial impact for individuals, employees, and businesses,” the report says. It cited commenters who could not communicate with clients or perform jobs such as scanning packages, and others who missed job interviews conducted by phone.
Notably, the 41% figure doesn’t include any call failures that might have happened to T-Mobile subscribers using VoLTE or Voice over Wi-Fi calling because an estimate couldn’t be determined, according to the report.
“However, PSHSB expects that if this number could be determined, it would result in PSHSB’s estimate being much larger,” the report states.
When the outage happened, T-Mobile said the issues primarily affected VoLTE calling, but according to the report, T-Mobile could only provide PSHSB with the number of failed call attempts for 3G and 2G networks.
Because customers with wireless service from other carriers had trouble completing calls to T-Mobile users and vice versa, the outage initially prompted concerns that multiple carriers were having network issues.
Overall, PSHSB estimated that more than 250 million calls, or 73% of calls from other carriers’ subscribers to T-Mobile customers, didn’t go through, based on confidential and non-confidential data other operators shared with the FCC.
AT&T, for example, reported that more than 30.4 million wireless and wireline calls were blocked that day, versus 213,704 on an average Monday. For the four hours between 2 p.m. and 6 p.m. on June 15, AT&T estimates that more than 99.9% of calls originating on its network were blocked from reaching T-Mobile’s.
More than 11.8 million Verizon wireless calls to T-Mobile customers were blocked during the outage. On an average Monday, by comparison, fewer than 10 calls per hour fail to make it from Verizon’s network to T-Mobile’s.
According to U.S. Cellular, during two periods that spanned 3.5 hours, 99% of calls from its network to T-Mobile’s were blocked, versus the usual 1.9%.
“T-Mobile’s outage was a failure,” said FCC Chairman Ajit Pai in a statement (PDF). “Our staff investigation found that the company did not follow several established network reliability best practices that could have either prevented the outage or at least mitigated its impact. All telecommunications providers must ensure they are adhering to relevant industry best practices, and I encourage network reliability standards bodies to apply their expertise to the issues identified in this report for further study.”
What was the issue?
The FCC’s report provides a detailed account of the outage that mostly aligns with T-Mobile’s explanation of the issue, but says measures could have been in place to prevent the outage or mitigate its impact.
At the time of the outage T-Mobile said the problem was triggered by a leased fiber outage from a third-party provider in the Southeast that exploited a routing platform configuration, causing circuit overload, resulting in an “IP traffic storm” that led to major capacity issues in the IMS core network.
Similarly, according to the report, the problem started when T-Mobile was installing new routers in part of its network and a fiber transport link failed. Fiber link failures are common, the FCC acknowledged, and wouldn’t normally cause an outage of the scale seen in June. However, the report says T-Mobile had misconfigured a link on a different router, one that wasn’t built to handle call signaling traffic, and had no fail-safe process in place to prevent the issue or to be notified of it. Other problems then compounded the failure.
“T-Mobile could have prevented the outage if it had audited its network during the new router integration to ensure that the traffic destined for the failed link would redirect to a router that was able to pass it,” the report said.
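The kind of audit the report describes can be illustrated with a minimal Python sketch. It is not T-Mobile’s tooling or the FCC’s method: the `Router` and `Link` classes, the traffic labels, and the router and link names are all hypothetical, chosen only to show the idea of checking, before a link fails, that its failover target is able to pass the traffic it would inherit.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    name: str
    # Traffic classes this router can pass (hypothetical labels).
    supported: set = field(default_factory=set)


@dataclass
class Link:
    name: str
    traffic_class: str
    # Router names tried, in order, if this link fails.
    failover: list = field(default_factory=list)


def audit_failover(links, routers):
    """Return (link name, reason) pairs for links with an unsafe failover."""
    findings = []
    for link in links:
        if not link.failover:
            findings.append((link.name, "no failover target configured"))
            continue
        target_name = link.failover[0]
        target = routers.get(target_name)
        if target is None:
            findings.append(
                (link.name, f"failover target {target_name} is unknown"))
        elif link.traffic_class not in target.supported:
            findings.append(
                (link.name,
                 f"failover target {target_name} cannot pass "
                 f"{link.traffic_class} traffic"))
    return findings


# Hypothetical inventory: one router handles call signaling, one does not.
routers = {
    "new-router": Router("new-router", {"call-signaling", "data"}),
    "legacy-router": Router("legacy-router", {"data"}),
}
links = [
    Link("fiber-a", "call-signaling", ["legacy-router"]),
    Link("fiber-b", "call-signaling", ["new-router"]),
]

findings = audit_failover(links, routers)
for link_name, reason in findings:
    print(f"{link_name}: {reason}")
# prints: fiber-a: failover target legacy-router cannot pass call-signaling traffic
```

In this made-up inventory, only `fiber-a` is flagged, because its traffic would fall over to a router that does not list call signaling among the traffic it can pass.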
Exacerbating the issue was a software problem in T-Mobile’s network “that had been latent for months and interfered with customers’ ability to initiate or receive voice calls during the outage.” The FCC said T-Mobile could’ve discovered the flaw and routing misconfiguration by validating in a test environment first.
As the outage continued and T-Mobile tried to restore service, engineering missteps made it worse: after engineers realized they had misdiagnosed the issue, certain links couldn’t be restored remotely.
“When T-Mobile engineers were able to access the equipment on site and correct their mistake by restoring the link an hour later, customers in the Atlanta market were again able to attempt to register to VoLTE. However, this again created additional congestion because T-Mobile engineers had not yet addressed the software error that prevented registrations from completing,” according to the report.
As device registration attempts on Voice over Wi-Fi and VoLTE flooded T-Mobile’s IP Multimedia Subsystem (IMS), the IMS core became overloaded and the outage spread from the Atlanta market nationwide, eventually affecting 3G and 2G networks as well as 911 networks.
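A toy Python model can show why registrations that never complete turn into a flood. It is only a sketch under stated assumptions: the function name, the per-tick device counts, and the capacity figure are invented for illustration and do not come from the report. The model assumes every device still waiting to register tries again on each tick.

```python
def simulate_registrations(ticks, new_devices_per_tick, capacity_per_tick,
                           registrations_complete):
    """Return the number of registration attempts seen on each tick.

    Every number passed in is hypothetical; the model only illustrates
    how retries pile up when registrations cannot complete.
    """
    pending = 0
    attempts_per_tick = []
    for _ in range(ticks):
        pending += new_devices_per_tick   # devices that newly need to register
        attempts = pending                # every waiting device tries again
        if registrations_complete:
            served = min(attempts, capacity_per_tick)
        else:
            served = 0                    # software error: nothing completes
        pending -= served
        attempts_per_tick.append(attempts)
    return attempts_per_tick


healthy = simulate_registrations(5, 100, 150, registrations_complete=True)
faulty = simulate_registrations(5, 100, 150, registrations_complete=False)

print(healthy)  # [100, 100, 100, 100, 100]
print(faulty)   # [100, 200, 300, 400, 500]
```

With registrations completing, the load stays flat and under the assumed capacity of 150 attempts per tick. With the error in place, the backlog grows every tick and passes that capacity on the second tick, which mirrors, in a very simplified way, the congestion the report describes.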
T-Mobile has already taken a number of steps that PSHSB concluded will likely prevent a similar outage from happening again: traffic can now be routed to a different path designed to handle it, and T-Mobile has improved its network capacity to handle increased congestion.