VMware: vSAN Disk Group Cache Drive Dead or Error (VSAN Absent Disk)

A cache disk failed in one of my hosts and took its disk group down with it.  That is expected behavior, but for some reason the disk group also disappeared from the GUI, so I couldn't decommission it in order to replace the cache drive.  I had to do it through PowerCLI/esxcli instead.  I wish I had taken a screenshot, because it was kind of annoying.

PowerCLI Example:

Once you've deleted the offending disk group, you can create a new disk group using the replacement cache disk and the original capacity disks.
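For the esxcli route, the remove-and-recreate sequence might look roughly like the sketch below, run from the ESXi shell. This is not the original example from this post. The device names are hypothetical placeholders, and the `run` wrapper with its `DRY_RUN` switch is my own addition so the commands can be reviewed before anything is actually removed.

```shell
#!/bin/sh
# Sketch only. Device names are hypothetical placeholders; take the real ones
# from `esxcli vsan storage list` on the affected host.

# DRY_RUN=1 (the default here) prints each command instead of running it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Remove a disk group by its cache device. Removing the cache device removes
# the whole group. If the dead device can no longer be addressed by name, the
# -u <vSAN disk UUID> form can be tried instead of -s.
remove_disk_group() {
  run esxcli vsan storage remove -s "$1"
}

# Create a disk group: the first argument is the cache device, the rest are
# capacity devices (one -d flag per device).
add_disk_group() {
  cache="$1"; shift
  flags=""
  for d in "$@"; do flags="$flags -d $d"; done
  # $flags is intentionally unquoted so each -d/device pair splits into words.
  run esxcli vsan storage add -s "$cache" $flags
}

run esxcli vsan storage list                          # see what vSAN still has claimed
remove_disk_group "naa.OLD_CACHE"                     # hypothetical failed cache device
add_disk_group "naa.NEW_CACHE" "naa.CAP1" "naa.CAP2"  # hypothetical replacement + capacity
```

With `DRY_RUN` left at its default the script only prints the three esxcli commands. Set `DRY_RUN=0` once the device names have been checked against the `list` output.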

VMware: VXLAN to VXLAN traffic randomly fails or only works on the same ESXi host...

Here are the basics:

  • Leaf/Spine Architecture (basic illustration only shows the ToRs)
    • Basic Illustration for explanation purposes
  • vSphere 6.5U1 / vSAN 6.6
  • NSX 6.3.3
    • Multi-VTEP Deployment w/ LoadBalance-SRCID
    • Standard VLAN for VTEP connections.
  • 2x Nexus 9K ToRs
  • Dell R630s
Long story short, the switch vPCs were stripping the VLAN ID from frames before sending them to the peer ToR and on to the ESXi host.  The ESXi host dropped those frames, which caused these strange issues.  Multi-VTEP with LoadBalance-SRCID made this especially difficult to figure out because the failures looked essentially random.  The vPC link between the switches has configuration advantages, so in order to keep it we ran additional links between the switches and set them up as standard trunk connections.  Once that was done, we configured our NSX VTEP VLAN to traverse those trunk connections rather than the vPC.  This resolved our stripping issue.
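Because each host has more than one VTEP, a single ping test can pass while another path fails. One way to make that visible is to test every local VTEP vmknic against every remote VTEP IP. The sketch below only generates the `vmkping` commands for that matrix. The vmknic names and IP addresses are hypothetical, the 1572-byte payload assumes a 1600 MTU on the transport network, and this is not necessarily what support ran in our case.

```shell
#!/bin/sh
# Sketch only: all names and addresses below are hypothetical placeholders.
LOCAL_VMKS="vmk3 vmk4"                 # VTEP vmknics on this host
REMOTE_VTEPS="192.0.2.11 192.0.2.12"   # VTEP IPs on another host
PAYLOAD=1572                           # assumes MTU 1600 on the VTEP VLAN

# Print one vmkping command per (local vmknic, remote VTEP) pair.
# ++netstack=vxlan selects the VXLAN TCP/IP stack, -d sets don't-fragment,
# -s sets the payload size, -I picks the outgoing vmknic.
vtep_matrix() {
  for vmk in $LOCAL_VMKS; do
    for ip in $REMOTE_VTEPS; do
      echo "vmkping ++netstack=vxlan -d -s $PAYLOAD -I $vmk $ip"
    done
  done
}

vtep_matrix
```

On an ESXi host the output can be pasted in line by line, or piped to `sh`. A pattern where only some vmknic/VTEP pairs fail points at the path between the ToRs rather than at a single host.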

See past the page break for tools and more details on what we (mostly VMware NSX senior support staff) did to figure this out.
[FYI: Cisco's recommendation appears to be to use vPC between switches only if the downstream host links use a port channel (LACP) as well.  There are other factors in play in the larger scheme of the network fabric, but this is from the viewpoint of a compute engineer.]