System update led to BlackBerry outage
By BRUCE MEYERSON, AP Technology Writer
After two days of silence about a lengthy outage in its BlackBerry
e-mail service, the company that makes the addictive mobile device
issued a jargon-laden update indicating that a minor software upgrade
had crashed the system.
The statement Thursday night by Research in Motion Ltd. said the
outage from Tuesday evening into Wednesday morning was triggered by
"the introduction of a new, non-critical system routine" designed to
optimize the cache, or temporary memory, on the computer servers that
run the BlackBerry network.
RIM said "the pre-testing of the system routine proved to be
insufficient."
The failed upgrade apparently set off a domino effect of glitches,
which the company referred to as "a compounding series of interaction
errors between the system's operational database and cache."
The Canadian company said a "failover process" to switch to a backup
system "did not fully perform to RIM's expectations." That led to a
delay in restoring service and "processing the resulting message
queue," a reference to the backlog of undelivered e-mail that
accumulated during the outage.
While most of the outage happened outside "work" hours, the
always-connected mentality fueled by BlackBerry's success left many
users feeling disjointed and aggravated when their devices stopped
buzzing. Grumbles were heard at the highest levels of business and
government, including the White House and the Canadian Parliament.
The outage and the company's delayed, tight-lipped response to the
situation angered some customers. It is an approach RIM has taken with
past service outages, which in fact have been rare.
Jim Balsillie, RIM's co-chief executive, downplayed the criticism of
the company's communications as "a trifle unfair," because the focus
was restoring service, and the primary means of contacting users was
unavailable.
"The issue is just how do you tell people what it is when it is e-mail
that people are counting on, and that very communications path is
down," Balsillie told The Associated Press in an interview.
Furthermore, he said, there was no information to disseminate until the
cause was identified.
"Once we had the facts, we made sure they were available," he said.
"People got a very accurate characterization of what it was."
Yet with the company rapidly expanding beyond its longtime focus on
business users -- the new BlackBerry Pearl has been a smash hit with
consumers since its launch last summer -- some experts say RIM needs
to get more savvy in dealing with problems.
"So far, all we have gotten from RIM are explanations fit for
engineers, not customers," said Richard S. Levick, whose firm Levick
Strategic Communications LLC specializes in crisis communications.
During the last major failures nearly two years ago, RIM waited hours
before confirming the problem, then issued a cryptic technological
description of what happened.
This time around, from the time the e-mail ceased flowing Tuesday
evening, it took RIM more than 12 hours to issue a vague
three-sentence statement acknowledging the disruption. No further
updates were provided until late Thursday's statement, prompting
criticism in online forums and Web logs.
"They have to stop thinking like engineers and start thinking like a
utility," Levick said. "When the telephone lines go down or the power
goes out, the first thing these utilities do is try to fix the problem
while simultaneously communicating with the media and customers. Why
does RIM think it can't do two things at once?"
John Corcoran, a Tampa, Fla., franchisee in the Wireless Toyz retail
chain, said his customers were generally patient.
"Because of how solid RIM has been in the past, BlackBerry users have
been very understanding and willing to give them the benefit of the
doubt this time around," he said. "Historically, BlackBerry users have
been part of an exclusive, tech-savvy club. Because of the knowledge
they hold it is very important for RIM to be upfront with them about
any operational issues."
RIM said it has ruled out security and capacity issues, along with
hardware failure or core software issues, as the cause of the
disruption. The company said it is improving its testing, monitoring
and recovery processes to prevent such an outage from happening again.
Copyright 2007 The Associated Press.