Outages can run into huge damages – just ask BA
The backbone of any company is the people who work in it, and while we all need people, we can't stop them from making mistakes.
Just ask British Airways. One of its data centres went down for 15 minutes on Saturday 27 May, causing an IT systems outage that lasted almost 48 hours, stranding tens of thousands of passengers and grounding all flights over the weekend.
A nightmare situation
It is now believed that a fail-safe measure designed to protect systems from hardware failure did not work. Backup generators were bypassed, causing the entire system to shut down. And when power was restored, the system was turned back on in a haphazard way that ended up causing physical damage to the system.
And as if that weren't bad enough, it is still not known why BA's second data centre wasn't able to pick up the slack.
Investigations into the BA incident continue, but it is clear that human error has played a big part in this catastrophe, which has cost the airline £150m ($189m, €170m) in damages.
No one can foresee the future, but if you want to prevent a nightmare, you need to ensure that your IT department implements proper processes and trains its staff properly, so that if an incident happens, potential damages can be mitigated.
This is what the Zero Outage Industry Standard association is trying to achieve.
It matters what type of people you hire as employees – in a critical situation, you need people who can think on their feet and have the expertise to identify urgent problems and seek to fix them quickly.
Highly skilled employees who are trained and certified to the latest standard, and who care about the company as a whole rather than just their own projects, bring more oversight and greater accuracy to all procedures, which in turn helps prevent grievous errors from occurring.
You also need to make sure that the employees you hire are ones who add value to the company – sadly, corporate espionage is a rising problem.
Recently Verelox, a web hosting provider in the Netherlands, suffered a huge outage on all of its services after an ex-employee deleted all customer information and wiped most of its servers. The damage was so bad that Verelox didn’t think it could get any of the data back.
However, in this case the web host got lucky – it had backed up all of its data and was able to restore all the servers.
But others have not been so fortunate. In the US, Allegro Microsystems is currently suing its ex-IT administrator, who allegedly left the firm a parting “gift” of malware that was timed to initiate just after the start of the new financial year.
The malware was designed to delete financial data so that it would be impossible for Allegro to complete its year-end audit. The incident has cost the firm $100,000 in damages, but it could have been prevented if processes providing checks and balances on its systems had been in place.
Processes need to be improved
But it’s also about implementing proper processes. Outages can happen to anybody, and the risk of damages is huge. Employees are usually trained in how to evacuate a building in the event of a fire. We place such importance on saving human lives, but how about saving our businesses?
There needs to be a culture whereby IT departments look at the worst things that could possibly happen and put in processes to prevent such an eventuality.
It would also make sense to put employees through simulated risk management scenarios and test how they deal with an outage, to iron out the best method for dealing with problems – this idea is now being used by IBM, but only for cybersecurity, not outages.
An outage could easily happen to you. Are you willing to wait until you become the next victim, or will you stand up to take preventative action?