I've been dealing with an issue where Emulex OCe11102-FM adapters log out of the VSAN fabric and never recover.
Long story short, it turned out to be a firmware problem, and I was eventually able to get an alpha firmware build that fixed the issue.
In the meantime, I had created a PowerCLI script to auto-remediate every hour if a degraded path was detected. I wasn't fond of that solution, so I looked at vCenter alarms to see if they could do it for me. It turns out they can, but there are caveats to this approach.
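The PowerCLI script itself was specific to my environment and isn't reproduced here, but the core check it performed can be sketched in a few lines. The Python below is only a model of that logic — the host names, HBA names, and expected path count are all illustrative:

```python
# Sketch of the degraded-path check the hourly remediation script performed.
# Names and counts here are illustrative, not the actual PowerCLI script.

def degraded_hosts(path_counts, expected_paths=2):
    """Return hosts where any HBA reports fewer active paths than expected.

    path_counts: {host: {hba: active_path_count}}
    """
    return sorted(
        host
        for host, hbas in path_counts.items()
        if any(active < expected_paths for active in hbas.values())
    )

inventory = {
    "esx01": {"vmhba2": 2, "vmhba3": 2},   # healthy
    "esx02": {"vmhba2": 1, "vmhba3": 2},   # one fabric logged out
}
print(degraded_hosts(inventory))  # ['esx02']
```

Anything flagged by a check like this got the enter-maintenance-mode / reboot / exit-maintenance-mode treatment described below.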
What follows is an example of how to set something like this up and details specific to the errors/configuration I was dealing with.
- Alarm General Settings: Host, monitor for specific events
- Alarm Triggers:
  - Degraded Storage Path Redundancy: Warning
  - Entered Maintenance Mode: Alert
  - Host Connected: Normal
- Alarm Actions:
  - Enter Maintenance Mode upon "Warning" status
  - Reboot Host upon "Alert" status
  - Exit Maintenance Mode upon "Normal" status
- This does the following:
  - "Degraded Storage Path Redundancy" is detected, which places the alarm in "Warning" status.
  - Once "Warning" status is triggered, the "Enter Maintenance Mode" action is executed.
  - Once the host has entered maintenance mode, the "Entered Maintenance Mode" event triggers "Alert" status.
  - Once "Alert" status is triggered, the "Reboot Host" action is executed.
  - The host goes into a disconnected state while it reboots; when it comes back online, a "Host Connected" event fires, which sets the status back to "Normal".
  - Once "Normal" status is triggered, the "Exit Maintenance Mode" action is executed.
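The chain above is really just event → status → action. A minimal Python sketch of that mapping (statuses and actions as plain strings I've paraphrased — this is not the vCenter alarm API):

```python
# Minimal model of the alarm's event -> status -> action chain described above.
# Event and action names are paraphrased, not vCenter API identifiers.

STATUS_FOR_EVENT = {
    "DegradedStoragePathRedundancy": "Warning",
    "EnteredMaintenanceMode": "Alert",
    "HostConnected": "Normal",
}

ACTION_FOR_STATUS = {
    "Warning": "EnterMaintenanceMode",
    "Alert": "RebootHost",
    "Normal": "ExitMaintenanceMode",
}

def actions_for(events):
    """Replay a sequence of host events and return the actions the alarm fires."""
    return [ACTION_FOR_STATUS[STATUS_FOR_EVENT[e]] for e in events]

print(actions_for([
    "DegradedStoragePathRedundancy",  # path logs out of the fabric
    "EnteredMaintenanceMode",         # result of the first action
    "HostConnected",                  # host back online after the reboot
]))
# ['EnterMaintenanceMode', 'RebootHost', 'ExitMaintenanceMode']
```

The key trick is that each action's side effect produces the event that drives the next status, so one alarm definition chains three remediation steps.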
- The alarm has no notion of cluster-wide status, so if multiple hosts trigger the same degradation, they will all attempt to enter maintenance mode at the same time. Since they will all fail to reach "Entered Maintenance Mode", they will simply sit stuck.
- Think of this as a workaround only; I would not recommend relying on it long-term.
- Doing firmware updates on the cards also causes a degradation event, which kicks off this alarm.
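The first caveat can be made concrete with a toy model (my own simplification of the behavior described above, not actual DRS or maintenance-mode logic):

```python
# Toy model of the cluster-wide caveat: the alarm is per host, so simultaneous
# "Warning" triggers are uncoordinated. My own simplification, not vCenter code.

def maintenance_outcome(triggered_hosts):
    """One triggered host can evacuate and proceed; if several trigger at once
    they block each other's evacuations, none reaches "Entered Maintenance
    Mode", and the reboot step never fires for any of them."""
    if len(triggered_hosts) <= 1:
        return {"completed": list(triggered_hosts), "stuck": []}
    return {"completed": [], "stuck": list(triggered_hosts)}

print(maintenance_outcome(["esx02"]))
# {'completed': ['esx02'], 'stuck': []}
print(maintenance_outcome(["esx01", "esx02", "esx03"]))
# {'completed': [], 'stuck': ['esx01', 'esx02', 'esx03']}
```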
Configuration / Errors:
Host -> Nexus 5K -> Cisco MDS
OCe11102-FM Firmware Version: 220.127.116.11
Nexus 5K is configured for NPV
Cisco MDS is configured for NPIV
The particular symptom was that the connection from the host to the Nexus 5K kept logging out of the VSAN.
The link between the Nexus 5K and the Cisco MDS always stayed 'up'.
2013 Oct 21 12:16:23 SWITCHNAME %PORT-5-IF_TRUNK_DOWN: %$VSAN 20%$ Interface vfc1, vsan 20 is down (Gracefully shutdown)
2013 Oct 21 12:16:23 SWITCHNAME %PORT-5-IF_DOWN_NONE: %$VSAN 20%$ Interface vfc1 is down (None)
2013 Oct 21 12:16:23 SWITCHNAME %PORT-5-IF_TRUNK_DOWN: %$VSAN 20%$ Interface vfc1, vsan 20 is down (waiting for flogi)
2013 Oct 21 12:20:52 SWITCHNAME %PORT-5-IF_TRUNK_DOWN: %$VSAN 20%$ Interface vfc1, vsan 20 is down (Initializing)
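When watching for this failure, those syslog lines can be parsed mechanically. A small Python sketch — the regex is my own, written against the sample lines above, not an official NX-OS log grammar:

```python
import re

# Pulls timestamp, interface, and down-reason out of the NX-OS %PORT-5-IF_*
# syslog lines shown above.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} \d+ [\d:]+) \S+ "
    r"%PORT-5-IF_(?:TRUNK_)?DOWN\S*: .*?"
    r"Interface (?P<iface>\S+?),? .*?is down \((?P<reason>[^)]*)\)"
)

def down_events(lines):
    """Return (timestamp, interface, reason) for each interface-down line."""
    return [
        (m.group("ts"), m.group("iface"), m.group("reason"))
        for m in (LINE_RE.match(line) for line in lines)
        if m
    ]

sample = ("2013 Oct 21 12:16:23 SWITCHNAME %PORT-5-IF_TRUNK_DOWN: "
          "%$VSAN 20%$ Interface vfc1, vsan 20 is down (waiting for flogi)")
print(down_events([sample]))
# [('2013 Oct 21 12:16:23', 'vfc1', 'waiting for flogi')]
```

A vfc that bounces and then sits at "waiting for flogi" is the signature of this particular logout.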
The problem came down to bad firmware from Emulex. Something was getting hung, and the adapter would not log back into the fabric without resetting the virtual fabric (vfc) port on the switch or rebooting the ESXi host.
Port captures on the switch show that, under normal operating circumstances, the switch expects a FIP VLAN discovery packet from the host to begin FLOGI negotiation. When the problem occurs, that VLAN discovery packet never arrives. Packet captures on the host side didn't show the VLAN discovery packet either, which points to something at the firmware level that can only be seen in the behavior of the HBA itself and its diagnostic dumps. That is when we engaged Emulex to analyze those dumps.