vSphere NFS Connection Status Alarm Auto-Remediation

One minor annoyance I had when moving to a new company was a load of alerts that were simply noise.  It was no one's fault, just a poor default VMware implementation.

VMware comes packaged with an alert that notifies you when a storage path is down.  The problem is that this is all it does.  Say you have the SNMP traps generated by that alert create a ticket.  Great, now you have a ticket, but little did you know that the problem fixed itself via some failover method.

So now you have a ticket, and you were woken up in the middle of the night for something that was benign.  Wouldn't it be great if another SNMP trap were sent that said "Hey, I'm OK now, close out the ticket"?

The assumption is that the SNMP trap collector, like HP's BSM/SiteScope tool, is smart enough to associate a later trap with the same alarm.  In my case, it is.  So here is how I redesigned an NFS alarm to send a trap stating a problem and then send another trap indicating that everything is OK.

First here is VMware's default alarm (vSphere 5.5 for reference):
Great, it lets me know when my storage paths have failed.  It doesn't set a status, so I don't end up with ugly marks on my servers, so that's OK I guess.  That isn't helpful for SNMP traps, though, so we need statuses.

Not good: the default alarm will simply keep sending SNMP traps to my destinations every 5 minutes until the problem is solved.  Thankfully HP BSM/SiteScope will just ignore the repeats, as it's smart enough to know the problem is the same as the ones sent before it.  The bad part is that the problem will remain open until the collector is informed otherwise.

So, since I don't have FC in-house, I simply removed those storage-based alerts and replaced them with NFS-related alerts that set a status.

Here is my alert mechanism (NFS Connection Status):
Essentially, alert when a disconnect is detected and normal when restored.
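
To make the idea concrete, here is a small sketch of that single alarm written out as plain data: one trigger that sets 'alert' and one that sets 'normal'.  This is not vCenter's API, just an illustration.  The two event IDs are my assumption of the NFS disconnect/restore events, so verify them against the event list in your own vCenter; the Trigger class and status_for function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    event_type_id: str  # vCenter event ID (assumed here; verify in your environment)
    status: str         # alarm status to set when that event fires


# One alarm, two triggers: the same alarm both raises and clears itself.
NFS_CONNECTION_ALARM = (
    Trigger("esx.problem.vmfs.nfs.server.disconnect", "alert"),
    Trigger("esx.problem.vmfs.nfs.server.restored", "normal"),
)


def status_for(event_type_id, triggers=NFS_CONNECTION_ALARM):
    """Return the status the alarm moves to, or None if the event is not a trigger."""
    for trigger in triggers:
        if trigger.event_type_id == event_type_id:
            return trigger.status
    return None
```

Because both triggers live in the same alarm, the status it raises is the same status it later clears.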

So what does it do?  Sends an SNMP trap once for each status.  Here is the workflow:
  1. vCenter sends an SNMP trap that an NFS disconnect has occurred.
  2. SNMP trap is captured.
  3. Ticket is created based on 'alert' status.
  4. vCenter sends an SNMP trap that NFS has reconnected.
  5. SNMP trap is captured.
  6. Ticket is closed based on 'normal' status.

So the cool thing is that my work has just been tracked for me.  Either I fixed the problem manually, or the system recovered on its own.  In either case, I didn't need to touch a ticket.  The key for this to work is that a single alarm must both raise and clear its own status.  If this were set up as two different alarms, statuses would not clear, and HP BSM/SiteScope wouldn't be able to relate the two events.
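
The workflow can be sketched as a toy trap collector.  This is only a model of the behavior described above, not how HP BSM/SiteScope is actually implemented; the TicketTracker class, its method names, and the idea of keying tickets on (alarm name, entity) are all assumptions made for illustration.

```python
class TicketTracker:
    """Toy model of a trap collector that relates traps by (alarm, entity)."""

    def __init__(self):
        self.open = {}     # (alarm, entity) -> ticket id
        self.closed = []   # ticket ids, in the order they were closed
        self._next_id = 1

    def handle_trap(self, alarm, entity, status):
        """Process one trap and return 'opened', 'closed', or 'ignored'."""
        key = (alarm, entity)
        if status == "alert":
            if key in self.open:
                return "ignored"  # repeat of a problem that is already ticketed
            self.open[key] = self._next_id
            self._next_id += 1
            return "opened"
        if status == "normal":
            ticket = self.open.pop(key, None)
            if ticket is None:
                return "ignored"  # nothing open under this alarm to clear
            self.closed.append(ticket)
            return "closed"
        return "ignored"  # statuses other than alert/normal are not modeled
```

Feeding it an 'alert' and then a 'normal' trap for the same alarm and host opens and then closes one ticket.  Sending the 'normal' trap under a different alarm name leaves the original ticket open, which is the two-alarm problem described above.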


