The major power outage that grounded all British Airways (BA) flights earlier this week highlights the vital importance of embedded monitoring in data centre power systems, one of the key themes of the Embedded Blog.
Initial reports from BA pointed to a power surge at a data centre at Heathrow that shut down servers, with the backup systems failing to kick in. This superficial description raised many questions that are only slowly being answered, and I looked at this for EEnews Europe Power Management: Mystery surrounds power outage at British Airways
What this also highlights is the increasing need for intelligent power systems that monitor not only current and voltage but also the temperature profile of the racks. Connecting this data to the Internet of Things and effective big data analytics would have given some early warning of the emerging problems.
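The kind of early warning described above can be sketched as a simple threshold-and-trend check over per-rack telemetry. This is a minimal illustration only: the sensor names, operating limits and the rolling-trend heuristic are assumptions for the example, not details of BA's actual data centre systems.

```python
# Illustrative sketch of rack telemetry early warning.
# The sensor keys, limits and trend heuristic are assumptions.
from collections import deque

# Assumed acceptable operating ranges per telemetry channel.
LIMITS = {
    "voltage_v": (220.0, 245.0),
    "current_a": (0.0, 30.0),
    "temp_c": (10.0, 35.0),
}

class RackMonitor:
    def __init__(self, window=10):
        # Keep a short rolling history per channel for trend detection.
        self.history = {key: deque(maxlen=window) for key in LIMITS}

    def ingest(self, sample):
        """Check one telemetry sample; return a list of warning strings."""
        warnings = []
        for key, (lo, hi) in LIMITS.items():
            value = sample[key]
            self.history[key].append(value)
            if not lo <= value <= hi:
                warnings.append(f"{key} out of range: {value}")
            elif len(self.history[key]) == self.history[key].maxlen:
                # Flag a steady climb into the top 10% of the range
                # before the hard limit actually trips.
                h = list(self.history[key])
                rising = all(b >= a for a, b in zip(h, h[1:]))
                if rising and h[-1] > lo + 0.9 * (hi - lo):
                    warnings.append(f"{key} trending towards limit: {value}")
        return warnings

monitor = RackMonitor()
# An overheating rack is flagged even though voltage and current look normal.
print(monitor.ingest({"voltage_v": 230.0, "current_a": 12.0, "temp_c": 41.5}))
```

In a real deployment the samples would arrive over an IoT telemetry pipeline rather than a direct call, and the analytics would be far richer, but the principle is the same: correlating temperature with the electrical data catches a cooling shortfall before it shuts servers down.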
"There was a loss of power to the UK data centre which was compounded by the uncontrolled return of power which caused a power surge taking out our IT systems. So we know what happened we just need to find out why," said BA after its initial statements about a power surge. "It was not an IT failure and had nothing to do with outsourcing of IT, it was an electrical power supply which was interrupted."
The power supplier, UK Power Networks, has however categorically denied that there was a power surge. The BA statement points to problems with the power sequencing when starting up systems, although whether these were the main servers or the backup is unclear. This led to the messaging systems being compromised so that systems could not communicate, leading to the cancellation of all BA flights around the world on Saturday afternoon and Sunday.
Since then, staff at the data centre have pointed to the infrastructure as the problem. While the servers and power systems were upgraded, the cooling systems had not kept up. This led to temperature spikes, with servers and power supplies overheating and shutting down. This would be more consistent with the explanations given by BA, and also with the power sequencing problems if some systems failed to respond.
There are now reports that a contractor 'pulled out a plug', bypassing the entire power backup system. This seems highly unlikely, as that is exactly when a UPS should kick in.
This raises questions about the disaster recovery strategy and management's understanding of the risks in the data centre. It is possible the backup systems were in the same data centre and so suffered from the same infrastructure problem. If the backup was offsite, then the specification of its cooling system was also at fault, as the same problem hit there.
The company has now commissioned a detailed report on what actually happened which hopefully will allow other companies to learn from the problems this week.
Related stories on the Embedded blog:
- King Cobra rugged server design provides a “total rack” in only 2U
- IoT drives software defined power into the data centre
- IoT data deluge drives new hardware and software architectures