Tuesday, April 01, 2008

On Rotation

I have another rotation this week. That means my life is going to be a bloody rotten mess for the next seven days. All my regular readers and friends have heard me say that before. However, I thought I would take a second and write about what that means. In its simplest form, it means that I am the point of contact (POC, or "pock" as we pronounce it) for all the tens of millions of dollars of systems on the floor. However, that's a deceptively simple way of stating my responsibilities.

First and foremost, I am responsible for keeping those systems up. That means monitoring them through our various tools (not going to detail those here, sorry). When one of them coughs or sneezes, I have to investigate, trawl through logs, and diagnose the problem, even on systems that I don't work with except when I am on rotation. My normal day job is to work on the center-wide file system and its hosting cluster. When I go on rotation, I end up taking care of the IBM SP5 cluster and the Cray XT4 (amongst others).

This doesn't happen merely during the daylight hours. I carry a pager, and if something bad happens then I have to get up and take care of it. That means that if there is an issue, I need to be able to think and respond even when I have had little to no sleep for extended periods. My longest war with a machine, the now defunct seaborg, lasted nearly 30 hours. Then I had to call my boss, declare I was close to nonfunctional, and have someone else take over. My worst serial marathon session was when I had multiple outages (3) in four days. An outage means that the machine was considered unusable and I needed to work with the vendor (in this case IBM) to resolve the issue. It, uh, sucked.

I am not, as it might seem, doing this alone though. We have a team that is divided up based on machine. Some are assigned to the Cray. Others to the IBM. Still others to the Linux clusters. Or SGI. If I cannot figure out the problem, and I cannot even discern what is happening well enough to work with an engineer from the vendor, then I am supposed to seek out help. I tend not to do this as much as others. We're also supposed to document everything that happens during our week: what happened, why, and how we resolved it. This is tedious at best and can get messy when the center goes through a meltdown (rare, but it happens). However, it's absolutely crucial: we need to be able to figure out what happened and why, especially if some bits are actually trouble nodes or there's buggy software... and there are far more bugs in our stuff than yours.

Consider the number of Crays there are in the world. Then consider the number of PCs. Uh huh. We have a lot fewer people out there running into all the edge cases than you do. Yet at the same time, we have to have utilization rates that exceed 90 percent, and even 95 percent if at all possible. That's. Really. Hard. Only the best HPC centers can even come close to this. It's something that NERSC has largely done after breaking in the machines we get. It's something we're rather proud of. It's also something that exhausts us.

Right now, rotation comes around once every 11 weeks. It was "luck" that this has happened more often for me recently, but that's not the norm these days. When I was first hired, there were four people on rotation, and that meant they were going on once per month. If I think that people get strung out now, all I have to do is think back to June 2001. However, no matter what, whoever is coming off of rotation is strung out and dead. Even when we make a system as a whole more stable, there are still thousands of parts (19k processors in our XT4), and even if you have a 99.99% daily reliability rate... Uh huh. Yep. You still have a lot of things happening.
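To put rough numbers on that last point, here is a back-of-the-envelope sketch. It assumes "19k" means exactly 19,000 processors, that "99.99% daily reliability" is the chance any one part gets through a day without a problem, and that parts fail independently of each other. None of those assumptions are measured figures, and the function names are made up for illustration.

```python
# Back-of-the-envelope failure arithmetic.
# Assumptions (illustrative, not measured data):
#   - 19,000 parts ("19k processors")
#   - each part independently survives a day with probability 0.9999
PARTS = 19_000
DAILY_RELIABILITY = 0.9999


def expected_failures_per_day(parts: int, reliability: float) -> float:
    """Average number of parts that have a problem on a given day."""
    return parts * (1.0 - reliability)


def prob_at_least_one_failure(parts: int, reliability: float) -> float:
    """Chance that at least one part has a problem on a given day."""
    # The whole machine has a clean day only if every part does.
    return 1.0 - reliability ** parts


if __name__ == "__main__":
    avg = expected_failures_per_day(PARTS, DAILY_RELIABILITY)
    p_any = prob_at_least_one_failure(PARTS, DAILY_RELIABILITY)
    print(f"expected failures per day: {avg:.2f}")
    print(f"chance of at least one failure per day: {p_any:.1%}")
```

Under those assumptions that works out to about 1.9 problems a day on average, and roughly an 85% chance that something goes wrong on any given day. Real hardware does not fail independently, so treat this as a feel for the scale rather than a prediction.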

Normally for me, I get paged a couple of times a night. It's not unheard of for me to not get more than a single hour of contiguous sleep. Yet again, I, like everyone in my group, am expected to be able to go from sleep to attack mode, or from dead to critical thinking mode, in minimal time. This really isn't your average sysadmin job. It's harder and more demanding. People breathe down your neck when a $30 to $50 million machine face plants. It's not even strictly just a sysadmin job.

We do some system coding. We do a lot of debugging. We get to do esoteric things with computers that I hadn't even dreamed of before coming here. We work with software that has a user base that you can almost count on your fingers and toes. We also get to work with some of the truly brilliant people in the world. Sometimes we get to do some cutting edge work beyond what we normally do.

Y'know, it's a fantastic job, if you have nerves of steel, endurance, and a quick mind. This isn't for your average guy that thinks rebooting the computer is going to solve anything.

And that, folks, is rotation for my group at NERSC.
