vSAN: Rebuilding an ESXi host that has vSAN claimed disks...

Summary:
While configuring my hosts, I ran into various issues.  One host simply stopped talking and its hostd service became unstable, which meant vCenter could not reach the ESXi host to manage it.  One problem was that my hosts were missing DNS PTR records, but even with that resolved, one host was still having issues.

Quick Fix (Assumes no data on vSAN disks, use info at your own risk):
Assuming you have vSAN claimed disks, this is how you can clear them up.
  1. Gather the list of disks on the host using this command:
    • ls /vmfs/devices/disks
  2. Entries with :1 or :2 appended are typically the partitions on your vSAN disks; you can double-check using this command:
    • partedUtil getptbl /vmfs/devices/disks/naa.#################
    • The return lists the disk's partition table: after the label and geometry lines, there is one line per partition.
  3. Once you've determined which ones have those partitions, delete them:
    1. partedUtil delete /vmfs/devices/disks/naa.################# 1
    2. partedUtil delete /vmfs/devices/disks/naa.################# 2
  4. Once all have been deleted, restart services:
    • services.sh restart
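
The steps above can be sketched as a small dry-run script. This is only an illustration, not a tested tool: the function names (candidate_disks, print_delete_commands) and the DISK_DIR variable are made up for the example, and the "exactly partitions 1 and 2, nothing else" filter is my own assumption, added to avoid flagging a boot disk that also carries higher-numbered partitions. It also skips vml.* entries on the assumption that they are aliases of the same devices. It only prints the partedUtil commands; it never runs them.

```shell
#!/bin/sh
# Dry-run sketch: prints the delete commands for disks that look
# vSAN-claimed. All names here are illustrative assumptions.

DISK_DIR="${DISK_DIR:-/vmfs/devices/disks}"

# Read device names on stdin; print each disk whose only partition
# entries are :1 and :2 (assumed heuristic, not a guarantee).
candidate_disks() {
    awk '
        /^vml\./ { next }                      # skip vml.* aliases (assumption)
        match($0, /:[0-9]+$/) {                # name ends in :<partition number>
            disk = substr($0, 1, RSTART - 1)
            part = substr($0, RSTART + 1)
            count[disk]++
            if (part == "1") has1[disk] = 1
            if (part == "2") has2[disk] = 1
        }
        END {
            for (d in count)
                if (count[d] == 2 && has1[d] && has2[d]) print d
        }
    ' | sort
}

# Print (do not run) the two delete commands for one disk.
print_delete_commands() {
    echo "partedUtil delete $DISK_DIR/$1 1"
    echo "partedUtil delete $DISK_DIR/$1 2"
}

# Only does anything when the disk directory actually exists.
if [ -d "$DISK_DIR" ]; then
    ls "$DISK_DIR" | candidate_disks | while read -r disk; do
        echo "# verify first: partedUtil getptbl $DISK_DIR/$disk"
        print_delete_commands "$disk"
    done
fi
```

Review the printed lines, confirm each disk with partedUtil getptbl as in step 2, and only then paste the delete commands by hand, followed by services.sh restart.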

Details:
After rebuilding the host from ISO, it continued to exhibit issues.  I tried adding it back to vCenter after the rebuild (mind you, I still had vSAN turned on and set to automatic on the cluster); it reached 80%, then failed with the following error:

A general system error occurred: Unable to push CA certificates and CRLs to host stupidESXihost.mydomain.local

Attempting to log in directly to the box via the fat client simply returned:

An unknown connection error occurred. (The server could not interpret the client's request. (The remote server returned an error: (503) Server Unavailable.))

After this, I rebuilt the host from ISO again, this time with vSAN turned off on my cluster object.  Unfortunately, it appears the damage had already been done: my vSAN disks were still claimed by vSAN, as indicated by the # symbol next to them in the install screens.

This appeared to cause the ESXi host to go into the error 503 state even after being rebuilt from scratch.  I had to delete the partitions on the vSAN-claimed disks and restart the services to get the host back into a healthy state.

Helpful Info:
http://www.virtuallyghetto.com/2013/09/additional-steps-required-to-completely.html
*The ESXCLI method described by Lam doesn't work in this case because the application server is in the 503 state, so no API/CLI methods are available.
