Just another ComMetrics – social media monitoring, best metrics, marketing metrics weblog

CyTRAP Labs trend – computing in a cloud – your BlackBerry and YouTube services fail

February 26th, 2008

It can be as simple as a system update or a corrupted Internet routing table. The result is the same: you can neither check your e-mail on your BlackBerry nor access content on YouTube for several hours. Some people get a bit nervous when this happens.
During week 7 of this year, BlackBerry users suffered a North America-wide outage that lasted for hours. This was the second one within a year; we reported on the first one here:

Reliability and dependability of information networks – millions hit by BlackBerry systems failure

A key weakness in the way that RIM’s service operates is that it funnels all of the e-mail messages and other data sent over its network in North America through a single “network operations center” or NOC in Waterloo, Ontario. Another NOC in Ontario serves the Asian market. A similar center operates in Britain for the European traffic.

We depend on our backbone working. When a system fails, millions of people are unable to use their RIM devices, or PCs that rely on online tools running on a server somewhere.

Computing in a cloud can help reduce these risks, and various firms, such as Amazon (with Amazon Web Services) and Google, have built services using this approach. Nevertheless, these services can go down. The challenge is to

a) limit how often this happens and how long it lasts, and

b) restrict the failure in service to a small geographical area.

To illustrate: on Tuesday morning, 2008-02-19, Google search was briefly inaccessible to Swiss surfers in and around Zurich. A fiber optic cable had been cut, which interrupted the service for a short while.

In the above case, the service interruption had little if anything to do with Google using computing in a cloud. In fact, computing in a cloud allowed Google to limit the outage to users in a relatively small region. We had less luck on Sunday, 2008-02-24, when Pakistan’s attempt to block access to YouTube resulted in a global outage of about 2 hours. The service interruption involved Pakistan Telecom and the Asian internet service provider PCCW.

First, anyone in Pakistan attempting to go to YouTube was redirected to a different address. Unfortunately, PCCW leaked this information, and as a result YouTube was mistakenly blocked by internet service providers around the world. PCCW tried to correct the mistake after being contacted by YouTube staff.

All this happened because YouTube content included Danish cartoons depicting the Prophet Muhammad.

CyTRAP Labs take on the issue

With Amazon’s S3 service, companies can rent data storage on Amazon’s infrastructure. We are becoming ever more dependent on these types of services. And while a blackout of 2-4 hours might feel like a lot to a user when it happens, it might be acceptable if it happens only once a year.

Hence, a single outage is not a good measure for performance or quality of service. Instead, people need to look at the level of service over a period of time.
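To make "level of service over a period of time" concrete, here is a minimal sketch that turns a log of outages into an availability figure. The function name, the outage times and the one-year observation period are all assumptions made for illustration; they are not measured data.

```python
from datetime import datetime, timedelta

def measured_availability(outages, period_start, period_end):
    """Return the fraction of the period during which the service was up.

    `outages` is a list of (start, end) datetime pairs. Overlapping
    outages are not merged in this sketch.
    """
    period = period_end - period_start
    downtime = timedelta()
    for start, end in outages:
        # Clip each outage to the observation period.
        start = max(start, period_start)
        end = min(end, period_end)
        if end > start:
            downtime += end - start
    return 1 - downtime / period

# Hypothetical log: a single 2-hour outage during 2008 (366 days).
outages = [(datetime(2008, 2, 24, 18, 0), datetime(2008, 2, 24, 20, 0))]
availability = measured_availability(
    outages, datetime(2008, 1, 1), datetime(2009, 1, 1)
)
print(f"{availability:.4%}")
```

Judged over the whole year, a single 2-hour outage still leaves the service available well over 99.9% of the time, which is why one incident says little on its own.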

Even if one does not use outsourcing, an event such as the one caused by a construction worker in Switzerland who cut a fiber optic cable cannot be prevented easily. Neither can we prevent a government and an internet service provider from ‘IP hijacking’ an address (i.e. taking over the web site’s unique address by corrupting the internet’s routing tables). We know that if internet routing tables are screwed up, the flow of data around the world will be disrupted (see YouTube last Sunday). This can be prevented neither by computing in a cloud nor by running the infrastructure in-house.
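How can one bad entry in a routing table take over an address? One common mechanism is that routers prefer the most specific matching route (longest-prefix matching). Assuming that is what happens here, which this post does not state, the effect can be sketched as follows. The prefixes and the origin names are hypothetical.

```python
import ipaddress

def best_route(routing_table, address):
    """Pick the route whose prefix matches `address` most specifically."""
    addr = ipaddress.ip_address(address)
    matches = [(net, origin) for net, origin in routing_table if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the route with the largest prefix length wins.
    return max(matches, key=lambda route: route[0].prefixlen)[1]

# Hypothetical prefixes and origin names, for illustration only.
table = [(ipaddress.ip_network("10.0.0.0/22"), "legitimate-site")]
before = best_route(table, "10.0.1.5")

# A bogus, more specific announcement leaks into the table.
table.append((ipaddress.ip_network("10.0.1.0/24"), "hijacker"))
after = best_route(table, "10.0.1.5")
unaffected = best_route(table, "10.0.2.5")

print(before, after, unaffected)
```

Traffic for addresses covered by the more specific entry goes to the wrong place, while the rest of the original range is unaffected. The site owner cannot fix this alone, whether the servers sit in a cloud or in-house.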

Nonetheless, the biggest drawback of such a model – computing in a cloud – is that a business is putting critical operations in somebody else’s hands. When such a service goes offline one depends on somebody else to get the service up again.

Since it costs $0.15 per GB a month to store data, with additional fees to transfer data in and out, a service such as Amazon’s S3 is often cheaper than buying and operating equipment oneself. In fact, on its Web site Amazon touts that S3 allows customers to scale their data demands up or down rapidly, and that the service is available 99.99% of the time. If 99.99% means under an hour of outage a year (about 53 minutes), maybe that is a good measure of performance.
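The availability and price figures are easy to check with a little arithmetic. The sketch below assumes a 365-day year and models storage cost only; the function names and the 500 GB example are made up for illustration, and transfer fees are left out because the post gives no figures for them.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def allowed_downtime_hours(availability_percent):
    """Hours of outage per year that a given availability permits."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

def availability_for_downtime(downtime_hours):
    """Availability (in percent) implied by a yearly outage total."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)

def monthly_storage_cost(gigabytes, price_per_gb=0.15):
    """Storage cost in dollars per month; transfer fees are not modeled."""
    return gigabytes * price_per_gb

print(f"{allowed_downtime_hours(99.99) * 60:.1f} minutes per year")
print(f"{availability_for_downtime(3):.3f}%")
print(f"${monthly_storage_cost(500):.2f} per month for 500 GB")
```

An availability of 99.99% leaves room for roughly 53 minutes of outage a year, while 3 hours of outage a year corresponds to about 99.97%.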

So highly disruption-tolerant routing and data replication services are essential if computing in a cloud using locations around the globe is supposed to work. And therein lies the challenge.

Reliability and dependability of information networks – undersea cable cut, disrupting outsourcing work in India

Nonetheless, we are still vulnerable when undersea cables are damaged, construction workers cut a fiber optic cable or a government messes with the Internet routing tables (see Pakistan and YouTube).

All we can do is minimize these risks to acceptable levels, but we have to learn to live with them and cope with service interruptions of up to 12 hours or so.



Reliability and dependability of information networks – Taiwan’s earthquake – are we redundancy compliant?

Reliability and dependability of information networks – lessons from the Taiwan earthquake

UBS IT infrastructure fails – can we learn anything from this event?

Service-oriented architecture and security or what is the system really worth?



For instance, Google uses the computing in a cloud concept by having its 500,000 or more servers located in nearly 40 locations around the globe. Each of these locations is connected to the Internet and, most importantly, if one goes offline, data can be shifted elsewhere in Google’s network.

Because data are always backed up in multiple places, an outage in one place does not take the entire system down. The clouds Google uses are managed by a grid (see grid computing).
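The idea that an outage in one place need not take the whole system down can be sketched in a few lines. This is a toy model under stated assumptions, not a description of how Google's infrastructure works: the `Location` class, the site names and the stored item are all hypothetical.

```python
class Location:
    """A hypothetical data center holding a copy of some data."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.store = {}

def write_replicated(locations, key, value):
    """Store the value at every location that is currently online."""
    for loc in locations:
        if loc.online:
            loc.store[key] = value

def read_with_failover(locations, key):
    """Return (value, location name) from the first online copy."""
    for loc in locations:
        if loc.online and key in loc.store:
            return loc.store[key], loc.name
    raise LookupError(f"no online location holds {key!r}")

sites = [Location("zurich"), Location("dublin"), Location("oregon")]
write_replicated(sites, "video-42", "cat clip")

sites[0].online = False  # e.g. a cut fiber takes one site offline
result = read_with_failover(sites, "video-42")
print(result)
```

The read succeeds from the next site in the list. Only if every location holding a copy is offline at the same time does the request fail.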


