vSAN: The cascade scenario that vSAN stretch cluster has issues with...


While testing a stretch cluster, we ran into strange failover behavior: in one scenario, the failover was simply not occurring. During this testing, we found a dirty little secret about stretch cluster failovers, one that makes me rethink whether stretch clusters are really worth doing.

Documented Failure Scenarios


All documented scenarios effectively deal w/ a 'single' type of failure. The problem is that disasters/failures can be multi-faceted and, in some instances, cascading. Take the Secondary Site Failure or Partitioned scenario, add a 'cascading failure' to it, and you end up in a whole world of trouble, depending on what the next 'failure' is.

The documented diagram for this scenario effectively depicts the failure of the interconnect between the two sites. What it fails to take into account is that there are typically 3 things involved in that interconnect:

  1. The networking between the two sites
  2. The preferred site routers
  3. The secondary site routers

So here is a slightly more involved diagram to highlight a case where the preferred (primary) site routers' link to the secondary site fails FIRST in a cascading failure scenario. When that link fails:

  1. VMs in Secondary Site are HA powered off
  2. VMs in Secondary Site are powered on in Preferred Site.
This is fine, but 'what if' the failure of the preferred site routers' link to secondary was simply the first sign of a greater disaster, w/ the preferred site going completely offline? What happens to your VMs? Do they fail over to the secondary site? Short answer is no. The reason?
The problem w/ this cascading failure scenario is that the witness has already detected that the secondary site cannot communicate w/ the preferred site and has already had HA fail the VMs over to the preferred site, EVEN though the witness can still communicate w/ the secondary site systems.

Once the preferred site goes dark, the witness cannot reach it, so it does not know the preferred site's actual status and cannot trigger an HA event to start the systems on the secondary site. The data on the secondary site has also been declared stale at this point, because the link between preferred and secondary was broken first. None of this would be an issue if the secondary site were the one to fail.
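To make the ordering problem concrete, here is a minimal Python sketch of the sequence described above. It is a toy model of the behavior we observed, not vSAN's or HA's actual decision logic. The function name `simulate` and the event names (`link_down`, `preferred_down`, `secondary_down`) are made up purely for illustration.

```python
def simulate(events):
    """Toy model: return the site where the VMs end up running, or None.

    Assumption: the VMs of interest start out running in the secondary site.
    This only mirrors the ordering described in this post; it is NOT how
    vSAN or vSphere HA is actually implemented.
    """
    preferred_up = True
    secondary_up = True
    secondary_stale = False   # secondary copy of the data declared stale?
    vms_at = "secondary"

    for event in events:
        if event == "link_down":
            # Inter-site link breaks while both sites are alive: the witness
            # sides w/ the preferred site and secondary's data goes stale.
            if preferred_up and secondary_up:
                secondary_stale = True
                if vms_at == "secondary":
                    vms_at = "preferred"   # HA restarts the VMs in preferred
        elif event == "preferred_down":
            preferred_up = False
            if vms_at == "preferred":
                # Failover to secondary is only possible w/ current data there.
                can_fail_over = secondary_up and not secondary_stale
                vms_at = "secondary" if can_fail_over else None
        elif event == "secondary_down":
            secondary_up = False
            if vms_at == "secondary":
                vms_at = "preferred" if preferred_up else None
        else:
            raise ValueError(f"unknown event: {event}")
    return vms_at


print(simulate(["link_down"]))                    # preferred
print(simulate(["link_down", "secondary_down"]))  # preferred
print(simulate(["link_down", "preferred_down"]))  # None
```

The last call is the cascade: the same two failures in a different order (or the secondary site failing instead) leave the VMs running somewhere, but link first and preferred site second leaves them running nowhere.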

So what can you do in this case?
  • Restore from backup
  • Contact VMware for the black magic voodoo to force a failover to secondary site.
What can VMware do to improve this? ¯\_(ツ)_/¯ 
All roads point to using storage policies and defining data locality (preferred only or secondary only), but at that point, you're working to make the applications running on top redundant yourself. I would like to be able to define a dual mirroring policy w/ a way to state which site is my actual 'preferred' site, but I'm unsure if that really gains me anything.

Release the black magic voodoo so that you can force failover to your secondary site? 
This is not without risk though, because remember, the data on the secondary site is stale: there may have been new data written while systems were up in the preferred site that the secondary site never received.
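The stale-data risk can be pictured w/ a toy write log. This is only an illustration under made-up assumptions: the write IDs and the helper name `writes_lost_on_forced_failover` are hypothetical, and real vSAN tracks component state very differently.

```python
def writes_lost_on_forced_failover(preferred_log, secondary_log):
    """Return the writes the preferred site accepted that secondary never got.

    Both arguments are ordered lists of hypothetical write IDs.
    """
    received = set(secondary_log)
    return [w for w in preferred_log if w not in received]


# The link broke after w2, then VMs kept writing in the preferred site.
preferred_log = ["w1", "w2", "w3", "w4"]
secondary_log = ["w1", "w2"]

print(writes_lost_on_forced_failover(preferred_log, secondary_log))
# ['w3', 'w4']
```

Anything in that returned list is what you would be giving up by forcing the secondary site online.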

It's somewhat of an edge case, but in a DR scenario, anything is possible.

Is vSAN stretch cluster worth it?
I'd argue probably not, knowing the behavior above. I'd be more likely to lean toward DR tools like SRM (even though it wouldn't be real-time replication) or rely on application-level replication tech. I'm sure there are use cases where a vSAN stretch cluster would make sense, but the very real failure scenario above definitely gives me pause.

