In recent weeks both press and politicians have been exercised by the “glitch” in the UK air traffic control computer system. Around the same time, both the Financial Conduct Authority (FCA) and the Prudential Regulation Authority (PRA) heavily fined the Royal Bank of Scotland Group (RBS) for its handling of its own “glitch” in the summer of 2012. The FCA published a report outlining the causes and failures that severely affected customers of RBS and its subsidiaries, including NatWest and especially Ulster Bank.
While David Cameron and Vince Cable have said that the disruption to airline passenger services was unacceptable, prima facie it seems to me that NATS, the operator responsible for the UK air traffic control system, took the appropriate recovery and business continuity actions.
Of course any incident should be thoroughly investigated and learned from, and there may well be recommendations in terms of procedures and investment. But stuff happens, and an error in one line of code out of four million, as claimed by NATS, is in my opinion a certainty rather than a possibility. First and foremost for air traffic control, safety is the main concern, and it seems clear that it was never compromised.
On the other hand, a bank’s first priority is the protection of its customers’ money and assets, and that is impossible if one does not have accurate account balance figures.
The report produced by the FCA found failures at every level in RBS: at the operational level, at the oversight and audit level, and at the RBS Group strategic level. Customers became aware of errors in their accounts on 19th June 2012, and overall it took 29 days to rectify all the issues. Ulster Bank retail customers continued to experience problems until 10th July.
During the crisis BBC NI and Radio Ulster interviewed representatives of Ulster Bank, and it was hard to glean from them what the core problem was. In reality the people put forward to present a brave face and reassure the public did not have the technical expertise to explain the issues in simple terms. In my opinion this lack of clarity contributed to a belief by some in Northern Ireland and in the Republic that the bank had actually run out of funds.
The report shows that the initial problem occurred in an upgrade to the batch scheduler that did all the overnight processing for NatWest and Ulster Bank. A batch scheduler, at its simplest, is a computer program that controls a list of other programs, called jobs, that must be run in sequence. My experience of batch jobs in a banking environment is that they are designed so that, if a program fails, the batch may be restarted at different points and run again. Sometimes the jobs have to be run individually, and this is often called manual running; it tends to delay the overall processing time. This technology has been common in banking software for decades.

The upgrade was carried out on Sunday 17th June 2012, and the first run of the updated software was on Monday 18th June. Certain batches failed, but these were run manually and all the jobs completed before the open of business on Tuesday 19th. Given that there had been problems with processing time, it was decided to back out the software upgrade and return to the earlier version. This process failed catastrophically, and with Tuesday night’s processing “a significant number of jobs” were not run. Despite manual intervention, a backlog of jobs built up sufficiently to interfere with the next day’s processing, which in turn created more backlogs.
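For readers unfamiliar with the idea, the sketch below illustrates, in very simplified form, what a batch scheduler with restart points and manual running looks like. It is not RBS’s actual system, and the job names are purely hypothetical; it only shows the general pattern of running jobs in sequence and restarting from the point of failure.

```python
from typing import Callable, List, Optional, Tuple

# A (name, callable) pair stands in for a scheduled batch job.
Job = Tuple[str, Callable[[], None]]

def run_batch(jobs: List[Job], start_from: Optional[str] = None) -> Optional[str]:
    """Run jobs in order; return the name of the failed job, or None if all ran.

    If start_from is given, earlier jobs are skipped - this mimics a manual
    restart from a known checkpoint after a failure.
    """
    started = start_from is None
    for name, job in jobs:
        if not started:
            if name == start_from:
                started = True
            else:
                continue
        try:
            job()
            print(f"completed: {name}")
        except Exception as exc:
            print(f"FAILED at {name}: {exc}")
            return name
    return None

# Simulate a one-off failure so the restart path is exercised.
state = {"first_run": True}

def update_balances() -> None:
    if state["first_run"]:
        state["first_run"] = False
        raise RuntimeError("simulated job failure")

# Illustrative overnight sequence (hypothetical job names).
overnight: List[Job] = [
    ("collect_transactions", lambda: None),
    ("update_balances",      update_balances),
    ("produce_statements",   lambda: None),
]

failed_at = run_batch(overnight)
if failed_at:
    # "Manual running": an operator fixes the issue, then restarts the batch
    # from the failed job rather than re-running everything from the start.
    run_batch(overnight, start_from=failed_at)
```

The point of the pattern is simply that each failed job can be re-run without repeating the whole overnight sequence, which is why manual running works at all, and also why it is slow when large numbers of jobs fail.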
The report was critical of RBS technical services for not fully testing backward compatibility, and severely criticised them for not appreciating the business risk of the batch scheduler failing.
As with the air traffic control system, the batch scheduler software had been reliable and in place for many years.
It is reasonably certain that the RBS business continuity strategy and IT-related risk assessment will improve immeasurably as a result of this disaster. The fines amounted to £56 million, and RBS compensated customers to the tune of £73 million.
One wonders, however, what would have happened if the operational staff had decided not to back out the scheduler upgrade and had instead continued to run the failed batches manually, as they had done on Monday 18th June. By progressively fixing the errors in the scheduler over time, would we ever have known there was a problem?
The real lesson here is for organisations to properly appreciate and assess risk, and to realise that relatively minor IT glitches can have a major impact on the business.