KPIs and SLAs – More important than outcomes?

Some time ago I was working in a vSphere role and an escalation came through about an unexpected reboot of a host. Sure, we all encounter a PSOD if we’re unlucky. They’re certainly not a normal occurrence, but a PSOD is a stop screen, and the default config doesn’t result in a reboot. I started looking at the logs: vpxd.log, hostd.log, vpxa.log, and so on. There was no indication of any failure. The logs just stopped, then restarted when the host was booting up.

That’s neither normal nor expected. No problem though, keep looking and something will show up. I had a look at the host SEL (System Event Log) to see if anything had been recorded there. The host had been rebooted by a user in a different team. Problem solved.

Do we want to solve problems, or say we’ve solved problems?

I saw the above tweet the other day. It’s fairly amusing, and later I saw a few people imply it was the process solution implemented as a result of the Facebook outage on 4th October 2021.

It resonated – how often do we implement solutions without solving problems? How often is a process introduced (‘Do not unplug’) that makes no attempt to solve the underlying problem (whatever it is that cannot be unplugged)?

All too often we focus on implementing a solution without taking a step back to ensure it actually solves the problem. If we ignore all the bad stuff we can only talk about the good stuff, right? A report that’s all green is better than a report that has red on it, right? All that matters is the report, right?

Ostrich management never ends well.