I've been dealing with an issue where Emulex OCe11102-FM adapters log out of the VSAN fabric and never recover.
Long story short, it turned out to be a firmware problem, and I was eventually able to get an alpha firmware build that fixed the issue.
In the meantime, I had created a PowerCLI script to auto-remediate every hour if a degraded path was detected. I wasn't fond of that solution, so I looked at vCenter alarms to see if they could do it for me. It turns out they can, but there are caveats to this approach.
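The PowerCLI script itself was specific to my environment and isn't reproduced here, but the core check it performed can be sketched in a few lines. The Python below is only a model of that logic — the host names, HBA names, and expected path count are all illustrative:

```python
# Sketch of the degraded-path check the hourly remediation script performed.
# Names and counts here are illustrative, not the actual PowerCLI script.

def degraded_hosts(path_counts, expected_paths=2):
    """Return hosts where any HBA reports fewer active paths than expected.

    path_counts: {host: {hba: active_path_count}}
    """
    return sorted(
        host
        for host, hbas in path_counts.items()
        if any(active < expected_paths for active in hbas.values())
    )

inventory = {
    "esx01": {"vmhba2": 2, "vmhba3": 2},   # healthy
    "esx02": {"vmhba2": 1, "vmhba3": 2},   # one fabric logged out
}
print(degraded_hosts(inventory))  # ['esx02']
```

Anything flagged by a check like this got the enter-maintenance-mode / reboot / exit-maintenance-mode treatment described below.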
What follows is an example of how to set something like this up and details specific to the errors/configuration I was dealing with.
- Alarm General Settings: Host, monitor for specific events
- Alarm Triggers:
  - Degraded Storage Path Redundancy: Warning
  - Entered Maintenance Mode: Alert
  - Host Connected: Normal
- Alarm Actions:
  - Enter Maintenance Mode upon "Warning" status
  - Reboot Host upon "Alert" status
  - Exit Maintenance Mode upon "Normal" status
- This does the following:
  - "Degraded Storage Path Redundancy" is detected, which places the alarm in "Warning" status.
  - Once "Warning" status is triggered, the "Enter Maintenance Mode" action is executed.
  - Once the host has entered maintenance mode, the "Entered Maintenance Mode" event triggers "Alert" status.
  - Once "Alert" status is triggered, the "Reboot Host" action is executed.
  - The host goes into a disconnected state while it reboots; when it comes back online, a "Host Connected" event fires, which sets the status back to "Normal".
  - Once "Normal" status is triggered, the "Exit Maintenance Mode" action is executed.
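The chain above is really just event → status → action. A minimal Python sketch of that mapping (statuses and actions as plain strings I've paraphrased — this is not the vCenter alarm API):

```python
# Minimal model of the alarm's event -> status -> action chain described above.
# Event and action names are paraphrased, not vCenter API identifiers.

STATUS_FOR_EVENT = {
    "DegradedStoragePathRedundancy": "Warning",
    "EnteredMaintenanceMode": "Alert",
    "HostConnected": "Normal",
}

ACTION_FOR_STATUS = {
    "Warning": "EnterMaintenanceMode",
    "Alert": "RebootHost",
    "Normal": "ExitMaintenanceMode",
}

def actions_for(events):
    """Replay a sequence of host events and return the actions the alarm fires."""
    return [ACTION_FOR_STATUS[STATUS_FOR_EVENT[e]] for e in events]

print(actions_for([
    "DegradedStoragePathRedundancy",  # path logs out of the fabric
    "EnteredMaintenanceMode",         # result of the first action
    "HostConnected",                  # host back online after the reboot
]))
# ['EnterMaintenanceMode', 'RebootHost', 'ExitMaintenanceMode']
```

The key trick is that each action's side effect produces the event that drives the next status, so one alarm definition chains three remediation steps.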
- The alarm has no notion of cluster-wide status, so if multiple hosts trigger the same degradation, they will all attempt to enter maintenance mode at the same time. Since they will all fail to reach "Entered Maintenance Mode", they will simply sit stuck.
- Think of this as a workaround only; I would not recommend relying on it long-term.
- Doing firmware updates on the cards also causes a degradation event, which kicks off this alarm.
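The first caveat can be made concrete with a toy model (my own simplification of the behavior described above, not actual DRS or maintenance-mode logic):

```python
# Toy model of the cluster-wide caveat: the alarm is per host, so simultaneous
# "Warning" triggers are uncoordinated. My own simplification, not vCenter code.

def maintenance_outcome(triggered_hosts):
    """One triggered host can evacuate and proceed; if several trigger at once
    they block each other's evacuations, none reaches "Entered Maintenance
    Mode", and the reboot step never fires for any of them."""
    if len(triggered_hosts) <= 1:
        return {"completed": list(triggered_hosts), "stuck": []}
    return {"completed": [], "stuck": list(triggered_hosts)}

print(maintenance_outcome(["esx02"]))
# {'completed': ['esx02'], 'stuck': []}
print(maintenance_outcome(["esx01", "esx02", "esx03"]))
# {'completed': [], 'stuck': ['esx01', 'esx02', 'esx03']}
```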
Configuration / Errors:
Host -> Nexus 5K -> Cisco MDS
OCe11102-FM Firmware Version: 220.127.116.11
Nexus 5K is configured for NPV
Cisco MDS is configured for NPIV
The particular symptom was that the connection from the host to the Nexus 5K kept logging out of the VSAN.
The link between the Nexus 5K and the Cisco MDS always stayed 'up'.
2013 Oct 21 12:16:23 SWITCHNAME %PORT-5-IF_TRUNK_DOWN: %$VSAN 20%$ Interface vfc1, vsan 20 is down (Gracefully shutdown)
2013 Oct 21 12:16:23 SWITCHNAME %PORT-5-IF_DOWN_NONE: %$VSAN 20%$ Interface vfc1 is down (None)
2013 Oct 21 12:16:23 SWITCHNAME %PORT-5-IF_TRUNK_DOWN: %$VSAN 20%$ Interface vfc1, vsan 20 is down (waiting for flogi)
2013 Oct 21 12:20:52 SWITCHNAME %PORT-5-IF_TRUNK_DOWN: %$VSAN 20%$ Interface vfc1, vsan 20 is down (Initializing)
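When watching for this failure, those syslog lines can be parsed mechanically. A small Python sketch — the regex is my own, written against the sample lines above, not an official NX-OS log grammar:

```python
import re

# Pulls timestamp, interface, and down-reason out of the NX-OS %PORT-5-IF_*
# syslog lines shown above.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} \d+ [\d:]+) \S+ "
    r"%PORT-5-IF_(?:TRUNK_)?DOWN\S*: .*?"
    r"Interface (?P<iface>\S+?),? .*?is down \((?P<reason>[^)]*)\)"
)

def down_events(lines):
    """Return (timestamp, interface, reason) for each interface-down line."""
    return [
        (m.group("ts"), m.group("iface"), m.group("reason"))
        for m in (LINE_RE.match(line) for line in lines)
        if m
    ]

sample = ("2013 Oct 21 12:16:23 SWITCHNAME %PORT-5-IF_TRUNK_DOWN: "
          "%$VSAN 20%$ Interface vfc1, vsan 20 is down (waiting for flogi)")
print(down_events([sample]))
# [('2013 Oct 21 12:16:23', 'vfc1', 'waiting for flogi')]
```

A vfc that bounces and then sits at "waiting for flogi" is the signature of this particular logout.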
The problem came down to bad firmware from Emulex. Something was getting hung, and the adapter would not log back into the fabric without resetting the virtual fabric (vfc) port on the switch or rebooting the ESXi host.
Port captures on the switch show that, under normal operating circumstances, the switch expects a FIP VLAN discovery packet from the host to begin FLOGI negotiation. When the problem occurs, that VLAN discovery packet never arrives. Packet captures on the host side didn't show the VLAN discovery packet either, which points to something at the firmware level that can only be seen in the behavior of the HBA itself and its diagnostic dumps. That is when we engaged Emulex to analyze those dumps.