Converged Networking Perils...
Summary:
Had a wonderful experience where a P2V VM with bonded NICs brought down several of our ESXi hosts. HA compounded the problem by powering the VM up on other hosts once the host carrying it went down. These are the perils of converged networking, and why it's important to keep your ESXi management/storage traffic on physical ports separate from your other traffic. If these had been physically separate, the problem would have been isolated to one host and the cascading HA events would never have happened.
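If you hit something similar, two PowerCLI one-liners help while you're in the weeds: find which hosts have fallen over, and stop HA from restarting the culprit VM anywhere else. A minimal sketch, assuming a cluster named MyClusterName and a VM named ProblemP2VVM (both placeholders):

# List hosts in the cluster that are no longer in a Connected state (cluster name is a placeholder)
Get-Cluster MyClusterName | Get-VMHost | Where-Object {$_.ConnectionState -ne "Connected"} | Select-Object Name, ConnectionState
# Keep HA from powering the suspect VM back up on other hosts (VM name is a placeholder)
Get-VM ProblemP2VVM | Set-VM -HARestartPriority Disabled -Confirm:$false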
Here is the config in short:
Dell blade with two NPAR'd 10Gb ports --> internal Dell I/O Aggregator ports --> external Dell I/O Aggregator ports --> Nexus 5K
Management, vMotion, NFS, AND VM traffic go over these two ports.
One port goes over Fabric A, the other over Fabric B. Two physically separate uplinks.
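On the ESXi side that boils down to one standard vSwitch per host, backed by the two converged uplinks and carrying every port group. A rough PowerCLI sketch of that layout; the host name, IPs, and port group names are made-up placeholders, not our real config:

# One vSwitch, two converged uplinks, every traffic type riding on it (all names/IPs are placeholders)
$vmhost = Get-VMHost esx01.example.local
$vsw = New-VirtualSwitch -VMHost $vmhost -Name vSwitch0 -Nic vmnic0,vmnic1
New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch $vsw -PortGroup "Management Network" -ManagementTrafficEnabled:$true -IP 192.168.31.11 -SubnetMask 255.255.255.0
New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch $vsw -PortGroup "vMotion" -VMotionEnabled:$true -IP 192.168.42.11 -SubnetMask 255.255.255.0
New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch $vsw -PortGroup "NFS" -IP 192.168.69.11 -SubnetMask 255.255.255.0
New-VirtualPortGroup -VirtualSwitch $vsw -Name "VM Network"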
What happened:
The P2V VM with bonded NICs comes online. This seemed to trigger a spanning-tree-like event, which pushed the internal Dell I/O Aggregator ports into an 'error-disable'-like state. I say 'like' because neither of those functions exists on the Dell IOAs.
Looking at the Dell I/O Aggregator internal ports attached to the blade, we saw something like this (note the Vlan column):

Port      Description    Status    Speed         Duplex    Vlan
Te 1/12                  Up        10000 Mbit    Full      --

Port      Description    Status    Speed         Duplex    Vlan
Te 1/12                  Up        10000 Mbit    Full      1,31,42,69
Workaround:
We're waiting on Dell to explain why the IOA reacted the way it did. In the meantime, we've moved management, NFS, and vMotion onto Fabric B while leaving VM networking on Fabric A.
This way, if this ugly issue rears its head again, the ESXi hosts and VMs keep running; only the VMs' network traffic gets cut off.
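At the vSwitch level the split is just the failover order on each port group: make one vmnic active and the other unused, so the two traffic types can never land on the same fabric again. A rough sketch of the idea; the host, port group, and vmnic names (vmnic0 = Fabric A, vmnic1 = Fabric B) are assumptions, so adjust to your own layout:

# Assumes vmnic0 rides Fabric A and vmnic1 rides Fabric B (all names are placeholders)
$vmhost = Get-VMHost esx01.example.local
# Pin management, NFS, and vMotion to Fabric B only
foreach ($pg in "Management Network","NFS","vMotion") {
    Get-VirtualPortGroup -VMHost $vmhost -Name $pg | Get-NicTeamingPolicy | Set-NicTeamingPolicy -MakeNicActive vmnic1 -MakeNicUnused vmnic0
}
# Pin VM networking to Fabric A only, so a repeat event can't take host management with it
Get-VirtualPortGroup -VMHost $vmhost -Name "VM Network" | Get-NicTeamingPolicy | Set-NicTeamingPolicy -MakeNicActive vmnic0 -MakeNicUnused vmnic1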
Below is a quick snippet I wrote up to reconnect the network adapters on the VMs that got disconnected by the issue above.
Script snippet to reconnect several VMs:
# Grab every VM in the cluster (swap in your cluster name)
$ClusterVMs = Get-Cluster MyClusterName | Get-VM
# Find network adapters on powered-on VMs that aren't currently connected
$Problems = $ClusterVMs | Where-Object {$_.PowerState -eq "PoweredOn"} | Get-NetworkAdapter | Where-Object {$_.ConnectionState.Connected -ne $true}
# Reconnect them
$Problems | Set-NetworkAdapter -Connected:$true
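If Set-NetworkAdapter prompts you to confirm each adapter, tack on -Confirm:$false so it can run through a large batch of VMs unattended.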