vROPs: Take Cluster Offline: Status: Removing Slice

Applies to vROPs running in non-HA mode. The steps below came from VMware support, who have an internal KB on this, so it should eventually be published or, more likely, the underlying issue will be fixed in a future release. This is mostly for my own notes, but it may be interesting for the tinkerers out there.

Case:
Attempting to remove a remote collector node from vROPs to replace it with a larger remote collector. The correct way would have been to move the 'instances' to another collector or data node before taking the collector node offline. Instead, I just removed the collector node from the admin console. That's when it got stuck in this state.

Removing a node is supposed to move its 'instances' to other collectors, but that didn't seem to happen, and the cluster got stuck taking itself offline. To finish removing the remote collector and get back to a stable state, VMware support walked me through the steps below.

Resolution to Case:
DO NOT ATTEMPT THE STEPS BELOW UNLESS YOUR CASE MATCHES THE ABOVE EXACTLY OR YOU REALLY KNOW WHAT YOU ARE DOING.


First, I moved all 'instances' to other collectors or data nodes, then took the remote collector node offline.

If stuck in this state, SSH to the master node and stop its services in the following order:
service vmware-vcops stop
service vmware-casa stop
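
To confirm both services are actually down before touching any files (my own sanity check, not part of the support-provided steps), the same service wrapper can report status:

#Verify both services report as stopped
service vmware-vcops status
service vmware-casa status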

#Backup current casa.db script to tmp directory
cp /storage/db/casa/webapp/hsqldb/casa.db.script /tmp/casa.db.script.$(date +%F.%H%M)

#Modify casa script
vi /storage/db/casa/webapp/hsqldb/casa.db.script

#Look for the clusterMembership block below. In the original post the old values were struck out and the new values highlighted; in plain text: change "onlineState" and "online_state" from "ONLINE" to "OFFLINE", and change "remove_node_state" from "REMOVING" to "NONE". After the edit, the row should look like this:
INSERT INTO CASA_DOCS VALUES('clusterMembership','{"onlineState":"OFFLINE","cluster_name":"vROPsMastersoftheUniverse","is_ha_enabled":false,"ha_transition_state":null,"initialization_state":"NONE","remove_node_state":"NONE","document_time":1435068088309,"online_state":"OFFLINE","online_state_time":1434726691663,"online_state_reason":"online_state.change_reason.remove_a_slice","cluster_members":[],"admin_slices":[],"installation_state":"DONE","slices":{"00059a72-349f-47ed-a910-4c01c9ac4e2e":{"slice_uuid":"00059a72-349f-47ed-a910-4c01c9ac4e2e","is_admin_node":true,"ip_address":"server1000.local","slice_name":"p00x04vropm001","membership_state":null},"a6ba9353-c240-40b1-bc5c-9d7d5d667ab2":{"slice_uuid":"a6ba9353-c240-40b1-bc5c-9d7d5d667ab2","is_admin_node":false,"ip_address":"server900.local","slice_name":"p00x04vropd001","membership_state":null},"ee53d2ce-0c9d-4091-9410-b4ccb08305ad":{"slice_uuid":"ee53d2ce-0c9d-4091-9410-b4ccb08305ad","is_admin_node":false,"ip_address":"server400.local","slice_name":"p00x03vropc001","membership_state":null},"4862541a-b66a-4dbd-b1de-035082468ca8":{"slice_uuid":"4862541a-b66a-4dbd-b1de-035082468ca8","is_admin_node":false,"ip_address":"server300.local","slice_name":"p00x07vropc001","membership_state":null},"7dbbaf73-e409-47cc-9f98-8e3b6d66dec5":{"slice_uuid":"7dbbaf73-e409-47cc-9f98-8e3b6d66dec5","is_admin_node":false,"ip_address":"server200.local","slice_name":"p00x04vropc001","membership_state":null},"4dcc7006-81c2-42f4-a295-4f540d3a1297":{"slice_uuid":"4dcc7006-81c2-42f4-a295-4f540d3a1297","is_admin_node":false,"ip_address":"server10.local","slice_name":"p00x04vropd002","membership_state":null},"2df836e9-fb5f-49ef-9327-84a544f26e51":{"slice_uuid":"2df836e9-fb5f-49ef-9327-84a544f26e51","is_admin_node":false,"ip_address":"Server20.local","slice_name":"savvis-manheim","membership_state":null},"12adbdfb-79bc-44e2-bd2d-dadd100b5931":{"slice_uuid":"12adbdfb-79bc-44e2-bd2d-dadd100b5931","is_admin_node":false,"ip_address":"Server30.local","slice_name":"roswell-manheim","membership_state":null}}}')
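
After saving, I like to diff the edited file against the backup to confirm that only those state values changed (my own habit, not one of the support steps; substitute the timestamp of your backup file):

#Only the onlineState, online_state, and remove_node_state values should differ
diff /tmp/casa.db.script.<timestamp> /storage/db/casa/webapp/hsqldb/casa.db.script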

#If you are removing a data node, follow the gemfire.properties steps below. They do not apply to removing a remote collector.
cp $ALIVE_BASE/user/conf/gemfire.properties /tmp/gemfire.properties.$(date +%F.%H%M)

vi $ALIVE_BASE/user/conf/gemfire.properties 

Change "serverCount" entry in gemfire.properties file to number that matches number of nodes (data and master) IF removed node was a data node.

#Start services
service vmware-casa start
service vmware-vcops start
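
Once both start commands return, the same status checks from earlier can confirm the services came back up (again, my own sanity check, not a support step); the cluster can then be brought online from the admin console.

#Confirm both services restarted cleanly
service vmware-casa status
service vmware-vcops status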

Comments

Thomas Bridle said…
Big thanks, this worked on a vROPs 6.7 cluster even though the collector node had already disappeared from both the UI and that file. Whether offline or online, the cluster never lost its spinning status, which meant I couldn't upgrade: the 'node removal in progress' error was still occurring over three days later.
