Some time ago I was working in a vSphere role and received an escalation about an unexpected reboot of a host. Sure, we all encounter a PSOD if we’re unlucky. They’re certainly not a normal occurrence, but a PSOD is a stop screen, and with the default config it doesn’t result in a reboot. I started looking at the logs: vpxd.log, hostd.log, vpxa.log, and so on. There was no indication of any failure. The logs just stopped, then started again when the host was booting up.
That’s neither normal nor expected. No problem though: keep looking and something will show up. I had a look at the host’s SEL (System Event Log) to see if anything was recorded there, and it was. The host had been rebooted by a user in a different team. Problem solved.
I went to have a chat with the person who rebooted a functional ESXi host to find out what caused them to do that. They said that the console was unresponsive. I don’t dispute the symptoms they experienced, but the log files showed that the host was functional and responsive. What I think happened was that the Java virtual console they were using didn’t initialize properly, or didn’t grab the keyboard correctly, so it appeared that the host wasn’t responding.
Rebooting the host was their response to the situation, which killed the virtual machines running on it and caused HA to be invoked for them. HA worked as it should have, great. However, the VMs that were running on the host still suffered a service outage.
This felt like a learning opportunity: how to diagnose an unresponsive host. Sadly, nobody was interested. I was met with “it was back within SLA”. They were entirely correct, but in my opinion this outage didn’t need to happen in the first place.
What’s more important: your KPI/SLA or your outcome?