On the production status of the Departmental Filestore
The Google TechTalk on the subject of Scrum, given by Ken Schwaber, contains one of my favourite quotes. You can see the whole thing here; and if you haven’t, it’s worthwhile devoting an hour to watching it. Can’t be bothered? Then don’t bother reading on. And the quote? To paraphrase,
our discipline has a tried and tested way of going faster. Cut corners, cut quality. That way you can produce more crap.
So, how does this relate to the DFS?
Well, I’m coming under pressure to declare that the DFS is production-ready. This isn’t a technical thing; it’s purely one of PR. Where does this pressure come from? From “on high” – ie, from someone one hop removed who sees nothing other than that we have a working cluster, so what’s holding things up? (Bob’s actually pretty reasonable about this – he’s stuck between a rock and an opinionated git.)
There have been unavoidable supplier and technical delays involved in getting this far. The trouble is that dates have been randomly selected on no basis whatsoever (actually, on the basis of having a meeting and me saying, “it will take n days of uninterrupted work by all involved with nothing else getting in the way, assuming no impediments, no unforeseen hitches, and the continued availability of the emotional energy required to sustain that velocity by all involved” – and then n being added to the date of that meeting); and then those dates have been missed because, for example, we required additional FC ports in order to plug in our development array and the vendor arbitrarily cancelled our order and didn’t tell us about it. That kind of thing. I work hard on it, at a rate that I consider sustainable – I’ve already managed to get to the point where I looked like a corpse and couldn’t focus my eyes whilst fixing the mess that the previous kit had put us in. It’s about expectation management. So stuff slips.
So, I’ve repeatedly resisted that pressure. Why?
First, I should point out that there is no difference between what we do now and what we would do with a system that is in production – providing nothing goes wrong. The difference happens when something does go wrong.
What would happen now is that we would run ourselves ragged making stuff up on the fly, recovering the system as quickly as possible, but basically choosing our path on the basis of our best expectation (which would be reasonable except that empirically, we’ve come to understand that Windows clustering seldom meets our best expectations).
In a production system, we would have already simulated that problem, developed and practised the recovery process, and have it documented and understood by at least two people.
That’s the difference.
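One way to make “documented and practised” concrete is to treat the runbook as something you can actually drill against rather than prose you hope is still accurate. This is only a minimal, hypothetical sketch – the step names are invented for illustration and the checks are stubbed out; nothing here is tied to the real DFS recovery procedure:

```python
# Hypothetical sketch: a recovery runbook encoded as checkable steps,
# so a drill verifies each step instead of improvising on the fly.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    check: Callable[[], bool]  # returns True if the step's post-condition holds


def run_drill(steps: List[Step]) -> List[str]:
    """Walk the runbook in order; report each step, stopping at the first failure."""
    results = []
    for step in steps:
        ok = step.check()
        results.append(f"{'PASS' if ok else 'FAIL'}: {step.description}")
        if not ok:
            break  # stop where reality diverges from the document
    return results


# A toy drill for a simulated switch failure (all checks stubbed to pass):
drill = [
    Step("Confirm surviving paths to the array", lambda: True),
    Step("Fail cluster groups over to a healthy node", lambda: True),
    Step("Verify shares are reachable by clients", lambda: True),
]

for line in run_drill(drill):
    print(line)
```

The point isn’t the code itself, but that a drill like this forces the recovery process to be rehearsed and kept current – which is exactly what separates a production system from one that merely works when nothing goes wrong.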
Now, in a recent meeting, I was challenged with this:
If we had held any of our current production systems up to the same standards of delivery, they wouldn’t be in production.
That may be true, but it isn’t a reason to cut corners and produce crap. It’s a comment on the other services that leak into production, not the state of the DFS.
Why is this the case?
- A lack of project management. We lack well-defined milestones. We’re not in the habit of setting them. Instead, our teams tend to operate in silos, interrupt-driven, incrementally doing development work when the stoking of production systems isn’t in the way. We don’t run clean iterations with clean, measurable, achievable milestones.
- A lack of project teams. We’re not geared up to fix the first problem because of the way that our department is organised. People live under fixed organisational structures. It means that sorting out the logistics (find a DBA; find a sysadmin; find some kit; find a project manager; etc) is difficult because you are naturally trying to squeeze attention out of a small number of vital people who don’t directly live in the same organisational branch that you do.
- A lack of capability to fix the above. This kind of short-term, well-defined, project-related work needs effort from middle management. It needs a willingness to devote people to well-identified pieces of work for a fixed period, and to let them get on. It needs up-front planning and project-management skills. It needs a bit of vision and a bit of courage.
So that’s where we are and what I think is wrong. We’re departmentally in a rut. It’s the role of the departmental directors to sort this out. I’m not sure it’s perceived as a problem; but it’s certainly the case that we could use people better, in more varied ways, and identify the key resource shortages if we were to stop spreading those key people so thinly. We should be giving everyone a more varied, more rewarding experience at work.
And if, when it comes to resourcing a project milestone, 60 people are left in the room because the key people have already been earmarked for working on a particular thing that month: well, then that’s a result, not a failure. Better to identify that problem than to settle for a working practice that permits you to ignore it.
The good news is that entropy is letting me fill in additional sections in my list of “what to do when part X craps out” anyway: we had a fairly hard switch failure last week and that went swimmingly well; although obviously the nodes that lost paths to the array behind it needed a reboot to find them again afterwards.