Friday, November 16, 2007

Yesterday's downtime

We had a bad outage yesterday on the newly-installed web server. This followed two days of needle-like 5-10 minute outages. Needless to say, we've gone back to the old server.

It was a bad one: four hours long and in the middle of the day. Worse, we didn't have a "down" page up. This wasn't for lack of trying; our server was completely non-responsive. When we got it back, we had a number of hours of "rolling outages" as the server caches refilled. Add a couple of logistical issues* and it was a nightmare. Although user comment has been kind (so kind that I fear that negative voices are going unheard!**), you have a right to expect more. This was a bad one, and we're going to learn from it.

I do want to stress that no data was lost. This was all about the "web server" (the part that sends you the page), not the "database servers," which have all the data. We have five live backups of your data now, and daily offsite backups too. We didn't have working web server backups. We should.

Details. The last blog post includes a paragraph that is, in retrospect, a bit funny. (Not funny-ha-ha, mind you.)
"If you don't notice anything, you can congratulate Felius [John], who just moved us to a new, dedicated web server."
Well, the new server was the problem. And if you can't congratulate him on that, you can congratulate him on getting things back up quickly once he was brought in. (Initially we thought we could do it without him, and it was the middle of the night in Australia.) He worked like a dog yesterday, and will be doing so today. Fortunately, we now have really excellent monitoring in place. The monitoring didn't help us in the crisis—we were monitoring a dead man—but it will help John reconstruct what happened.

In the wake of this, he has two jobs: Figure out what happened and make sure it never happens again. In system issues, John is the "decider," but we have a rough idea what needs to happen. First, we need webserver fail-over. Second, we need better tools for getting back on our feet. It makes no sense to have rolling blackouts for users when search engines take up about half our traffic. After that John will work to get the new webserver working, this time for good.
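To make the second point concrete, here is a rough Python sketch of the idea: while the caches are refilling, turn search-engine crawlers away with a "come back later" response so the capacity goes to people. This is only an illustration under stated assumptions, not what we run. The function names, the `recovering` flag, the crawler list and the one-hour retry value are all made up for the example.

```python
from typing import Dict, Tuple

# Hypothetical list of crawler user-agent fragments (an assumption for this sketch).
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")


def is_crawler(user_agent: str) -> bool:
    """Guess whether a request comes from a search-engine crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def admit(user_agent: str, recovering: bool) -> Tuple[int, Dict[str, str]]:
    """Return (HTTP status, extra headers) for an incoming request.

    While `recovering` is True (caches still refilling), crawlers get a
    503 with a Retry-After header; everyone else is served normally.
    """
    if recovering and is_crawler(user_agent):
        return 503, {"Retry-After": "3600"}  # ask the crawler to retry in an hour
    return 200, {}
```

A 503 with `Retry-After` is the standard way to tell a well-behaved crawler that the condition is temporary, so it should not cost anything in the search indexes. Crawlers that ignore the header would need a different approach.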

Casey, Chris and I are going to be doing our part to help on systems today. We can't do what John does, but we can do something. We're running on 8/12 memory cache. I don't expect problems, but I can't be sure.

Thanks for all your patience or, if you didn't have any, for your righteous indignation! We need them both.

In other news: (whew!)

*I'm in Cambridge, MA so I couldn't get into the server room to work on it, although I was about to drive up. Our "colo" guy, who should have been available, was unreachable too, something that's never happened before—and a good reason not to host out of Portland, ME where there's only one server guy at the colo. And our "remote reboot" wasn't installed yet.
**This is an interesting reversal of something I saw with the Second-Life post, where negative voices drowned out positive. I don't want to criticize members who cut us slack, but I think naysayers can also feel squelched.

13 Comments:

Blogger esta1923 said...

Since nothing was lost, and we have had great times with LT, I'm for not complaining. I salute the folks who worked so hard all day/night, and hope all goes well from now on. Esta1923

11/16/2007 11:32 AM  
Anonymous Anonymous said...

Suck it up, Tim. Your users love you, and think you're handling things the right way. Sorry!!
(But please, please, please, don't let anything like this happen again.)

11/16/2007 1:41 PM  
Blogger Jill ONeill said...

People are more than willing to forgive LibraryThing its growing pangs because you all are so upfront about failure. It's when folks try to sweep it under the rug that the community loses trust.

11/16/2007 1:58 PM  
Anonymous Anonymous said...

As LT grows bigger, in so many different ways (users, functions, etc.), it's to be expected that things will go kaput every once in a while. Knowing that you are working on it and keeping us informed is enough for me. Keep up the good work!

11/16/2007 4:37 PM  
Blogger Barbara said...

I think it's outrageous. Given how much I pay for this service . . .

Oh.

Hey, my library pays a LOT for our catalog and it goes down. Nobody says "I'm a little concerned you're not voicing your justifiable distress." But maybe that's because they get large checks instead of fan mail and it administers some sort of anesthetic.

11/16/2007 6:54 PM  
Blogger Unknown said...

"Just a little bump in the road"

At least when the road gets a pot hole, there is a back-up database

11/16/2007 7:57 PM  
Blogger Mark Nenadov said...

The righteous indignation is in the mail.

11/16/2007 9:33 PM  
Blogger Leslie Shelor said...

Stuff like this happens; I just missed you!

11/17/2007 11:09 AM  
Blogger Unknown said...

hate to say it but LT has been reducing my expectation quite a bit over quite a time so big outages do not outrage me much. At some stage - not too long ago - LT's amateur approach became mismatched to what its users could be expected to expect. You guys really need to get professional IT wise.

11/17/2007 2:51 PM  
Anonymous Anonymous said...

When you have downtime, could you do a transparent redirect so when whatever comes back up, people can just press refresh and go where they originally wanted to? I have now forgotten. That would be awesome, thanks!

11/18/2007 8:43 PM  
Blogger James said...

Um, yeah. Not such a big deal, really--maybe you've heard of a little operation called "Skype" that had kind of a big(ger) outage issue recently? ;-) Just use this to learn--and get some better IT support. Keep up the great work.

11/19/2007 2:28 AM  
Blogger C.A.Williams said...

Everyone has a bad day now and then. Last time I checked that even applied to web services.

11/20/2007 11:26 PM  
Blogger Lynne Rutter said...

well i don't know about these people but my stuff is backed up. downtime is not the end of the world. go outside and play once in a while for god's sake.

meanwhile i sent you $25 last night so i hope that helps buy you a bigger band-aid.

11/27/2007 2:45 AM  
