Since it’s important to practice what you preach (apparently), here’s my post-incident report on a P1 homelab failure.
Timeline
09:30 - Services slow, services down
10:00 - Attempt to upgrade Ubuntu and reboot VM
10:00 - CPU spiking 100% across all 8 cores
10:15 - Increase core count to 16 and reboot VM
10:30 - Slow recovery but some services still down
16:00 - Server not on network
18:00 - Server powered on but no response
18:30 - Server disassembled and left to cool - fans cleaned a bit
19:00 - Services recovered
Findings
- There was no indication that temperature was an issue, even after our primary on-call engineer (me) said “it’s pretty hot in here (lounge)”.
- No monitoring of system stats present
- While upping the CPU core count helped at first, it ultimately made the situation worse by adding to the overheating
- No notification system in place for failures; notification that the system was down came via a third party (ADSB Exchange)
- Fans are really dirty
Notes
External logging (partially done) and monitoring need to be put in place.
Cans of compressed air have been ordered so that the fans can be cleaned out properly to help airflow.
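As a starting point for the missing temperature monitoring, here is a minimal sketch that reads the Linux thermal zones from sysfs and flags anything above a threshold. It assumes it runs somewhere the sensors are actually exposed (likely the Proxmox host rather than inside a VM), and the `WARN_CELSIUS` value and function names are my own placeholders, not part of any existing setup.

```python
from pathlib import Path

# Hypothetical warning threshold; pick a value suited to the hardware.
WARN_CELSIUS = 80.0


def parse_millidegrees(raw: str) -> float:
    """Convert a sysfs reading such as '45000\\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_thermal_zones(base: Path = Path("/sys/class/thermal")) -> dict[str, float]:
    """Return {zone_name: temperature_c} for every readable thermal zone."""
    readings = {}
    for temp_file in sorted(base.glob("thermal_zone*/temp")):
        try:
            readings[temp_file.parent.name] = parse_millidegrees(temp_file.read_text())
        except (OSError, ValueError):
            # Skip zones that cannot be read or parsed.
            continue
    return readings


def zones_over(readings: dict[str, float], limit: float = WARN_CELSIUS) -> list[str]:
    """Return the names of zones at or above the limit, sorted."""
    return sorted(name for name, temp in readings.items() if temp >= limit)


if __name__ == "__main__":
    hot = zones_over(read_thermal_zones())
    if hot:
        print(f"WARNING: hot thermal zones: {', '.join(hot)}")
```

Run from cron or a systemd timer, the warning line could be swapped for whatever notification channel ends up being used.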
A bigger rework of the “server cabinet” (it’s a few shelves in the lounge) needs to be done. If the server cabinet is moved into the garage then temperature and dust wouldn’t be too much of an issue. Actually having a proper server cabinet would be nice as well!
The reverse proxy setup is an annoying problem: if the main VM goes down then I lose access to friendly URLs for Proxmox. I’ve documented the IP in my runbooks, but it’d be nicer to pull the proxy setup into its own LXC container (partially done) that boots first.
There is a rogue Forgejo container running on boot and I’ve no idea where it’s set up. I need to remove it properly since it’s not needed.
Rclone mounts get corrupted very easily. I had to run the disable/enable/re-setup process twice for the Docker mounts. It also means that other Docker services can’t be started properly.
Conclusion
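One way to catch a broken mount before starting dependent containers is a small pre-flight check. This is only a sketch: it treats a mount as healthy if the path is a mount point and its directory can be listed, which is an assumption about what “corrupted” looks like here, and the function names are made up for illustration.

```python
import os


def mount_is_healthy(path: str) -> bool:
    """True if path is a mount point and its contents can be listed."""
    # os.path.ismount returns False for missing paths and for paths
    # that cannot be stat'ed, which covers a stale FUSE endpoint.
    if not os.path.ismount(path):
        return False
    try:
        os.listdir(path)
    except OSError:
        return False
    return True


def unhealthy_mounts(paths: list[str]) -> list[str]:
    """Return the subset of paths that fail the health check."""
    return [p for p in paths if not mount_is_healthy(p)]
```

A startup script could call `unhealthy_mounts` with the mount paths (for example a hypothetical `/mnt/rclone/media`) and refuse to start the Docker services until the list comes back empty.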
It was a typical homelab failure, lots learnt and lots to do to improve things. I’m a bit annoyed that I didn’t have temperature down as a potential failure in my system. I’ve no doubt decreased the lifespan of the hardware now as well.
It was made clear to me by stakeholders (my wife) that it was not acceptable that Home Assistant was down. My pay has been docked this month.