LSE outage – five lessons for achieving better network dependability

September 11th, 2008

    The London Stock Exchange (LSE) experienced a seven-hour breakdown this Monday that left traders with barely more than an hour of trading book-ending the day. TradElect, a proprietary system developed by Accenture using HP ProLiant servers, Microsoft .Net and SQL Server, crashed due to overload and a software bug. Microsoft and Accenture are still investigating the root cause of the problem.
    We offer five lessons we took from this case. It illustrates why public e-communication networks must offer resilience that minimizes the risk of such network failures. The loss of connectivity between clients and the LSE’s trading systems forced the market to suspend trading shortly after 09:00. Nobody seems to know what this has cost (lost trades, etc.).

Recently I started a series of posts focusing on the dependability and resilience of public e-communication networks:

Dependability of public e-communication networks – ropes to skip – introduction

This Monday (2008-09-08) it happened: London’s stockbrokers were forced to twiddle their thumbs as trading on the main stock exchange shut down due to a software glitch.

On Monday the LSE claimed that the outage was not related to an increase in trading volumes. Instead, it blamed connectivity issues for the outage.

Here are some of the lessons – not necessarily in order of importance – we should take from this serious incident of network failure.

Lesson 1 – proper software and hardware architecture required

On Monday, TradElect, a proprietary system built for the LSE by Accenture on Microsoft technology, had an outage. Unfortunately, no technical information was provided.

Later, the LSE told some media outlets that a software problem had crashed its pricing system, leaving trading floors close to silent.

Without all the facts it is difficult to get a handle on this issue. Monday’s problems at the LSE were significant. Nevertheless, they represented the first major outage in eight years.
But the LSE outage has to result in traders rethinking their risk management. Traders must rethink their reliance on a single system that offers neither the software nor the hardware architecture needed for satisfactory dependability and redundancy.

For instance, ensuring that an overloaded system has access to additional processing capacity, fiber-optic capacity and redundant switches is a must. Such an investment is required to assure satisfactory dependability. The LSE is an important public e-communication network for traders and their clients around the globe. Its software and hardware architecture should reflect the importance the LSE has for financial markets.
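To make the redundancy point concrete, here is a minimal sketch of client-side failover between a primary and a hot-standby trading gateway. It is only an illustration under assumed names: the host names, port and reachability check are hypothetical and not part of TradElect or any actual LSE interface.

```python
# Hypothetical sketch of client-side failover between a primary and a standby
# trading gateway. Host names, port and check are illustrative assumptions,
# not part of TradElect or any actual LSE interface.
import socket
from contextlib import closing

GATEWAYS = [
    ("primary.trading.example.com", 9001),   # main matching-engine front end
    ("standby.trading.example.com", 9001),   # hot standby at a second site
]

def is_reachable(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to the gateway succeeds within the timeout."""
    try:
        with closing(socket.create_connection((host, port), timeout=timeout)):
            return True
    except OSError:
        return False

def pick_gateway():
    """Choose the first gateway that answers; raise if none do."""
    for host, port in GATEWAYS:
        if is_reachable(host, port):
            return host, port
    raise RuntimeError("no trading gateway reachable - halt order submission")

if __name__ == "__main__":
    host, port = pick_gateway()
    print(f"routing orders via {host}:{port}")
```

The design point is simply that order flow should never depend on a single endpoint; a real exchange would provide this at the network and application layers, not in client scripts.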

Lesson 2 – extensive testing required

Besides the software and hardware architecture that such a critical system requires, extensive testing is needed to avoid the problems that happened this Monday. The FTSE 100 ended down sharply on 2008-09-05 (Fri). This forced traders who were betting on further falls in the index to close their positions. In turn, the system experienced a kind of ‘overload’.

An exercise that just tests parts of the system will not cut it. Instead, the full system, including user access, trade execution, database redundancy and overload handling, must be part of an exercise that tests the system’s limits as experienced with Friday’s trading spike.

In this case, it seems as if no such comprehensive testing had been done before Friday. The Friday spike in trades led to the crash on Monday. The LSE outage demonstrates that extensive testing exercises are required to ensure that the system is dependable enough to deal with trading spikes.
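As a rough illustration of what ‘testing the full system’ might look like, here is a sketch that fires artificial order load at a staging copy of a trading system and reports a tail latency. The order-entry URL, message format and volumes are assumptions for illustration only; a real exercise would replay realistic order flow, like Friday’s spike, across user access, trade execution and the databases together.

```python
# Rough load-generation sketch for a full-system test environment.
# The order-entry URL, message format and volumes are assumptions; a real
# exercise would replay realistic order flow against a complete staging copy
# of the trading system, not a single component in isolation.
import concurrent.futures
import time
import urllib.request

TEST_ENDPOINT = "http://staging-order-entry.example.com/orders"  # hypothetical
TOTAL_ORDERS = 50_000        # total synthetic orders to fire
MAX_WORKERS = 200            # concurrent submitters (fires as fast as they allow)

def send_order(i: int) -> float:
    """Submit one synthetic order and return its round-trip latency in seconds."""
    payload = f'{{"id": {i}, "symbol": "TEST", "qty": 100, "side": "BUY"}}'.encode()
    req = urllib.request.Request(TEST_ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=2) as resp:
        resp.read()
    return time.perf_counter() - start

def run_load_test() -> None:
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        latencies = sorted(pool.map(send_order, range(TOTAL_ORDERS)))
    p99 = latencies[int(len(latencies) * 0.99)]
    print(f"sent {len(latencies)} orders, 99th percentile latency {p99 * 1000:.1f} ms")

if __name__ == "__main__":
    run_load_test()
```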

Lesson 3 – tried and tested recovery procedures are a must

Extensive testing based on realistic simulations of the full system also helps verify whether the recovery procedures put in place work properly.

The trading overload happened on Friday, and system monitoring already indicated at that time that not everything was working properly. Unfortunately, the LSE did not have the right procedures in place to fix the problem in the 62 hours the weekend gave it (Friday 19:00, after trading was done, through Monday 09:00).

Tried and tested recovery procedures are needed to avoid such problems. LSE management failed here by not insisting on having such procedures in place and providing the budget needed.
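One way to make ‘tried and tested’ concrete is to script the recovery steps and rehearse them against the time window actually available, in this case the 62-hour weekend. The steps and commands below are placeholders, not the LSE’s actual procedures; the point is that each step is automated, checked and timed.

```python
# Sketch of a timed recovery drill. The steps and commands are placeholders
# (plain echo), not the LSE's actual procedures; the point is that every
# recovery step is scripted, checked and measured against the window that
# is actually available - here, the trading-free weekend.
import subprocess
import time

WEEKEND_WINDOW_HOURS = 62  # Friday 19:00 through Monday 09:00

RECOVERY_STEPS = [
    ("stop order entry",        ["echo", "stopping order-entry services"]),
    ("restore trade database",  ["echo", "restoring last consistent snapshot"]),
    ("replay Friday's journal", ["echo", "replaying the transaction journal"]),
    ("reconcile positions",     ["echo", "reconciling positions against the journal"]),
    ("restart and smoke-test",  ["echo", "running end-to-end smoke tests"]),
]

def run_drill() -> None:
    start = time.monotonic()
    for name, command in RECOVERY_STEPS:
        step_start = time.monotonic()
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            raise SystemExit(f"drill failed at step '{name}': {result.stderr.strip()}")
        print(f"{name}: ok ({time.monotonic() - step_start:.1f}s)")
    total_hours = (time.monotonic() - start) / 3600
    print(f"full recovery took {total_hours:.2f}h of the {WEEKEND_WINDOW_HOURS}h window")

if __name__ == "__main__":
    run_drill()
```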

Lesson 4 – accurate and truthful information is the minimum one should expect

LSE’s communication regarding this outage has been poor, if not disastrous. For starters, the LSE responded very slowly in deciding whether it was an orderly market or not. Once it closed down the system around 09:00, it failed to put up information on its website. Even on 2008-09-11 (Thu) you cannot find any information about this seven-hour outage in the press section of the LSE:

LSE press releases

When such a calamity happens, assuring the continued confidence and trust customers have in your brand requires that you communicate accurately, quickly and truthfully about the problem. LSE failed the test.

Lesson 5 – new platforms highly dependent on LSE for reference price

Measured by value traded, Chi-X looks to have been down by a third compared with the previous business day, 2008-09-05 (Fri). Turquoise (which experienced a breakdown of its own last week) and ITG were similarly affected.

The above illustrates that traders and clients were reluctant to deal on these new platforms without a reference price available through the LSE trading board. Monday’s problems demonstrated that these exchanges rely on the LSE to set the market price – in other words, the dependence of alternative exchanges on the LSE board should not be underestimated.

Hence, the LSE outage affects not only the trades made by its own customers but also those made on the new platforms.

Conclusion

We can now return to business as usual, or take the necessary steps to improve the dependability and resilience of the LSE’s trading systems in the face of events such as trading overloads.

Customers should surely ask for the LSE’s communication strategy to be changed. Most important, monitoring the performance of individual components, like databases, software and servers, will not do (see the end-to-end probe sketch after the list below). Instead, regular exercises are needed that:

1) generate the required artificial ‘load’ in order to test the full trading system for instances of high-volume activity - as happened last Friday, AND

2) implement tried and tested recovery procedures that actually work – a weekend should be enough to fix such a problem, yet the LSE failed to do so over the weekend.
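As a sketch of what end-to-end monitoring (as opposed to per-component monitoring) could look like, the probe below periodically submits a synthetic order on a test instrument and alerts when the full round trip fails or slows down. The endpoint, instrument name and thresholds are illustrative assumptions, not any real LSE interface.

```python
# Sketch of an end-to-end probe: exercise the whole order path (connect,
# submit a synthetic order, get it acknowledged) instead of checking the
# database, servers and software in isolation. Endpoint, test instrument
# and thresholds are illustrative assumptions.
import time
import urllib.request

PROBE_ENDPOINT = "http://trading.example.com/orders"  # hypothetical order-entry URL
PROBE_INTERVAL_SEC = 30
MAX_ROUND_TRIP_SEC = 1.0

def probe_once() -> bool:
    """Submit one synthetic order on a test instrument and check the round trip."""
    payload = b'{"symbol": "PROBE.TEST", "qty": 1, "side": "BUY"}'
    req = urllib.request.Request(PROBE_ENDPOINT, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(req, timeout=MAX_ROUND_TRIP_SEC) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    return ok and (time.perf_counter() - start) <= MAX_ROUND_TRIP_SEC

if __name__ == "__main__":
    while True:
        if not probe_once():
            print("ALERT: end-to-end order path degraded or down")
        time.sleep(PROBE_INTERVAL_SEC)
```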

So will we learn from this incident, or are we willing to have it happen again sometime down the road, with effects we might be unable to imagine today? For instance, what will it do to financial markets worldwide if London – or any other important exchange – crashes again?

The costs to the economy must be exorbitant.

Tidbit

The LSE’s platform runs on TradElect, a proprietary system developed by Accenture using Microsoft .Net and SQL Server systems. It is about 15 months old and has been running since June 2007.

The LSE has touted that the system will enable it to expand and speed up its capacity for trades. September 2008 was the deadline given for the system to show that it can handle 10,000 continuous messages per second without problems.

So far it cannot, and I am not aware of any Microsoft-software-based solution that is currently able to do this.

LSE moved IT management for TradElect in-house from supplier Accenture. The latter still provides software development services to LSE.

As well, due to this outage I am not sure whether TradElect will be able to offer trades in Italian equities this month, as planned. Of course, as you know, the LSE acquired Borsa Italiana in 2007.




Tags: exchanges · lesson · monday · outage · resilience · traders · trading