Optus identifies cause of nationwide outage, says ‘changes to routing information’ after software upgrade are to blame
Optus says “changes to routing information” following a “routine software upgrade” were to blame for last week’s nationwide outage, which affected 10.2 million Australians and 400,000 businesses.
Key points:
- A routine software upgrade was the cause of Optus’ nationwide outage last week
- The telco says it has taken steps to ensure that the outage does not reoccur
- The explanation for the outage comes after Optus offered customers extra free data as compensation
In a statement released on Monday afternoon, Optus said its network was affected by “changes to the routing information of an international peering network” at around 4:05am AEDT last Wednesday, “following a routine software upgrade”.
“These changes to routing information propagated through multiple layers in our network and exceeded preset safety levels on key routers that could not handle them,” the company said.
“This resulted in the routers disconnecting from the Optus IP Core network to protect themselves.”
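Optus has not published the technical detail behind this description. As a rough illustration only, the behaviour it describes resembles a router that accepts routing updates up to a preset limit and takes itself offline once an update would push it past that limit. The sketch below is a toy model under that assumption: the `Router` class, the `max_routes` limit, the `propagate` helper and all router names and address ranges are hypothetical and are not drawn from Optus’ network.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical router with a preset limit on the routes it will hold."""
    name: str
    max_routes: int                       # assumed "preset safety level"
    routes: set = field(default_factory=set)
    connected: bool = True

    def receive_update(self, prefixes):
        """Apply a routing update; disconnect if it would exceed the limit."""
        if not self.connected:
            return False
        proposed = self.routes | set(prefixes)
        if len(proposed) > self.max_routes:
            # Self-protection: leave the core network rather than overload.
            self.connected = False
            self.routes.clear()
            return False
        self.routes = proposed
        return True


def propagate(routers, prefixes):
    """Send one update to every router; return names of routers now offline."""
    for router in routers:
        router.receive_update(prefixes)
    return [router.name for router in routers if not router.connected]


if __name__ == "__main__":
    routers = [Router("core-a", 1000), Router("core-b", 1000)]
    normal = [f"10.0.{i}.0/24" for i in range(200)]
    flood = [f"172.16.{i // 256}.{i % 256}/32" for i in range(5000)]
    print(propagate(routers, normal))   # a modest update is absorbed
    print(propagate(routers, flood))    # an oversized update trips the limit
```

In this toy model every router shares the same limit and receives the same update, so all of them drop out together. That is consistent with the experts’ point later in the article that duplicating hardware alone would not stop a failure of this kind.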
The extent of the outage meant Optus engineers had to physically reconnect or restart the system, the telco said, and also meant the investigation into the cause “took longer than we would have liked”.
“The restoration required a large-scale effort from the team and in some cases required Optus to physically reconnect or reboot routers, requiring people to be sent to a number of locations in Australia,” an Optus spokesperson said.
“This is why the recovery was gradual throughout the afternoon.
“Given the widespread impact of the outage, the investigation into the issue took longer than we would have liked as we explored various paths to recovery.
“Restoring the network was our priority at all times and we then established the cause together with our partners.”
Optus says it has since made changes to its network to address the issue so it does not reoccur, and that it “will continue to invest” to improve the resilience and services of its network.
It comes after Optus made an extra 200GB of data available to customers from Monday to compensate for last Wednesday’s outage.
CEO said last week a software upgrade was “highly unlikely” to be the cause
Before Monday’s revelation by Optus, experts had theorized that the glitch was likely a “regular software upgrade gone wrong”.
“The problem is too widespread to be due to a broken cable or equipment failure,” said Tom Worthington, senior lecturer in computer science at the Australian National University in Canberra.
The software upgrade theory raised by telecommunications analysts and experts last Wednesday was put to Optus CEO Kelly Bayer Rosmarin, who dismissed the suggestion.
“It’s highly unlikely, our systems are actually very stable,” she told ABC Radio Sydney last Wednesday morning.
“We offer customers excellent coverage, this is very rare.”
On Monday afternoon, Mr Worthington said it was “no surprise that a software upgrade caused the Optus outage”, and that the problem would still have occurred even if there had been redundancy.
“This is a similar issue to the one that knocked out the Australian census in 2016,” he said.
“It would be possible to replicate all the hardware, but that would double the cost of servicing customers and would not stop a systematic outage of this kind.
“There are some clear lessons from the Optus outage: don’t have all your phones and internet provided by one company, [and] if you provide safety-critical services, ensure connections to multiple networks.”
Associate Professor Mark Gregory from RMIT University said the cause identified by Optus was a “human error” that resulted in a “cascading failure”.
“It appears that a routine software upgrade of one or more major routers was the cause of the outage,” he said.
“Optus has not explained what went wrong with the testing process that should have taken place before the routing software upgrade took place.
“Also, there is no explanation as to why there appears to have been a lack of redundancy of the key routers, so that if there were a problem, the key routers would switch to the redundant routers, which you would expect to be running the previous iteration of software.”
Mark Stewart, a research fellow at the Centre for Defence Communications and Information Networking at the University of Adelaide, said the cause of the outage was “predictable” and common with software updates.
“Network instabilities due to changes in routing information are a known and predictable problem, often associated with software updates,” he said.
“A major telecom company should have a disaster recovery plan that is more advanced than the average corporate network.
“They should have at least had a plan to roll back the changes or restart their systems remotely.
“Optus’ statement does not make clear in any way how this event was exceptional, or what preventive measures they had in place to mitigate the impact.”
Graeme Hughes, director of the Business Lab at Griffith University, said it was fortunate from an emergency communications perspective that the outage occurred when it did.
“If the outage had occurred a week earlier, at the height of the raging bushfires, the consequences would have been catastrophic,” he said.
Optus boss faces Senate on Friday
Optus is facing a number of inquiries and investigations as a result of the outage, including a Senate inquiry that will hold its first public hearings on Friday.
Ms Bayer Rosmarin is currently the only witness confirmed to appear.
The telco said in a statement that it supports the government and Senate reviews and will “cooperate fully”.
Optus’ explanation for the outage follows the federal government’s announcement earlier on Monday that it would require telecommunications companies in Australia to report their cybersecurity measures, in an effort to prevent a repeat of last year’s Optus cyber hack.
Under the laws, telecommunications companies would be classified as “critical infrastructure”, requiring their boards to report to the government on their cybersecurity strategies, in the same way as energy companies, hospitals and ports do.