VMware: physical vmnic# not showing up after upgrade...

Summary:
This was a weird one.  I had a couple of Dell FC630s (FX2 blades) w/ the integrated QLogic (Broadcom) 57810 card in them.  When I went to upgrade them from ESXi 6.0 to 6.5, the fun began.  Before the upgrade, my hosts could see the NICs just fine.  After the upgrade, they could only 'see' vmnic1.  A fresh install had the same issue.
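
If you want to confirm exactly which uplinks went missing, one option is to save the output of `esxcli network nic list` before and after the upgrade and diff the names. The snippet below is only a rough sketch of that idea: the function names are my own, and the sample text is an abbreviated, made-up stand-in for the real command output (which has more columns), not something captured from these hosts.

```python
def nic_names(esxcli_output):
    """Collect the vmnic names from text shaped like `esxcli network nic list` output."""
    names = set()
    for line in esxcli_output.splitlines():
        fields = line.split()
        # Data rows start with the NIC name; header and separator rows do not.
        if fields and fields[0].startswith("vmnic"):
            names.add(fields[0])
    return names


def missing_after_upgrade(before, after):
    """Return the NICs that were listed before the upgrade but not after."""
    return sorted(nic_names(before) - nic_names(after))


# Hypothetical, abbreviated samples for illustration only.
BEFORE = """\
Name    Driver  Link Status
------  ------  -----------
vmnic0  bnx2x   Up
vmnic1  bnx2x   Up
"""

AFTER = """\
Name    Driver  Link Status
------  ------  -----------
vmnic1  bnx2x   Up
"""

if __name__ == "__main__":
    print(missing_after_upgrade(BEFORE, AFTER))
```

With the sample text above, only vmnic0 is reported as missing, which matches what I saw on the affected hosts.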

Solution/Workaround:
In my case, I had to literally remove the FC630 blade from the FX2 enclosure so that all residual power would drain.  Once that was done, whatever was hanging the NIC firmware finally cleared, and ESXi could take control of the card.

Details:


It appears that my issue stemmed from the fact that, prior to the upgrade, I had SR-IOV enabled on vmnic0 in firmware but disabled in the host config.  I confirmed this when I worked w/ my 4 hosts.  2 hosts suffered these symptoms; before upgrading the other 2, I disabled SR-IOV in firmware, and they upgraded w/o issue.

You'd suspect driver or firmware problems at first, but then I'd expect both NICs to have issues, since they are two ports on a single card.  Instead, it was only the NIC that I had enabled SR-IOV on.  Interestingly, while looking through the vmkernel logs, I came across these messages:

2018-03-07T16:19:00.532Z cpu0:33635)Uplink: 9471: Opening device vmnic0
2018-03-07T16:19:00.532Z cpu33:32982)<3>bnx2x: [bnx2x_open:12914(vmnic0)]Recovery flow hasn't been properly completed yet. Try again later.
If you still see this message after a few retries then power cycle is required.
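
If you have a saved copy of the vmkernel log, a quick scan for that recovery-flow message tells you which uplinks are stuck. This is just a minimal sketch: the function name is mine, and the regex assumes the message looks like the lines quoted above (the number after `bnx2x_open:` is matched loosely, on the assumption that it can differ between driver versions).

```python
import re

# Matches the bnx2x recovery-flow error and captures the vmnic name.
# Assumption: the message format is the same as in the log lines quoted above.
RECOVERY_RE = re.compile(
    r"bnx2x: \[bnx2x_open:\d+\((?P<nic>vmnic\d+)\)\]"
    r"Recovery flow hasn't been properly completed yet"
)


def stuck_nics(log_lines):
    """Return the sorted vmnic names that logged the recovery-flow error."""
    found = set()
    for line in log_lines:
        match = RECOVERY_RE.search(line)
        if match:
            found.add(match.group("nic"))
    return sorted(found)


# The two log lines from this post, used as sample input.
SAMPLE = [
    "2018-03-07T16:19:00.532Z cpu0:33635)Uplink: 9471: Opening device vmnic0",
    "2018-03-07T16:19:00.532Z cpu33:32982)<3>bnx2x: [bnx2x_open:12914(vmnic0)]"
    "Recovery flow hasn't been properly completed yet. Try again later.",
]

if __name__ == "__main__":
    print(stuck_nics(SAMPLE))
```

To run it against a real log, pass the lines of the file instead of `SAMPLE`, e.g. `stuck_nics(open(path))` where `path` is wherever you saved the log.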

Prior to finding these messages, I had simply attempted to:

  1. Update firmware
  2. Fresh install ESXi

The problem never resolved.  You'd think a power cycle would clear it, as the log message above suggests, but it persisted whether I did a fresh install or not.  When trying to install 6.5U1, I'd get the following error, with the installer stopping @ 85%:

vmkctl.HostCtlException: Unable to get node: Sysinfo error: Not foundSee VMkernel log for details

Basically, all I'd ever find in the vmkernel log was that weasel (the ESXi installer) would complain and stop at the networking step.  Funny thing is that 6.0 didn't care and would install w/o halting.

Since these were 'blades' in an FX2 chassis, my 'power cycles (cold boot)' were not true power cycles.  In order to 'fix' the problem, the blades literally had to be pulled from the chassis so that whatever was hung in the NIC firmware could be reset and released.  It's bizarre, because you would think that a firmware update would resolve the issue.

Side note:
I almost started cursing Lam's name because the thought crossed my mind that someone was screwing w/ me due to this article:

Thankfully (or not), that wasn't the case.  I had just run into a pretty crappy bug.
