vROPs: Take Cluster Offline: Status: Removing Slice
Applies to vROPs running in non-HA mode. The fix below came from VMware support. They have an internal KB on this, so it should eventually be published or, more likely, fixed in a future release. This is mostly for my own notes, but it may interest the tinkerers out there.
Case:
I was attempting to remove a remote collector node from vROPs to replace it with a larger remote collector. The correct way would have been to move the 'instances' to another collector or data node before taking the collector node offline. Instead, I just removed the collector node from the admin console. That's when it got stuck in this state.
vROPs is supposed to move the 'instances' to other collectors, but that didn't seem to work, and the cluster got stuck in the state of being taken offline. To remove the remote collector and get back to a stable state, VMware support walked me through the steps below.
Resolution to Case:
DO NOT ATTEMPT BELOW STEPS UNLESS YOUR CASE MATCHES ABOVE EXACTLY OR YOU REALLY KNOW WHAT YOU ARE DOING.
I moved all 'instances' to other collectors or data nodes, then brought the remote collector node offline.
If stuck in this state, stop master node services in the following order:
service vmware-vcops stop
service vmware-casa stop
#Back up the current casa.db.script to the /tmp directory
cp /storage/db/casa/webapp/hsqldb/casa.db.script /tmp/casa.db.script.$(date +%F.%H%M)
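Before editing anything, it may be worth confirming that the backup copy really matches the original. The function below is a minimal sketch of that idea, not something VMware support provided; `backup_with_timestamp` is a hypothetical helper name, and the default /tmp destination simply mirrors the cp command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of vROPs): copy a file to a timestamped
# backup and verify that the copy is byte-identical to the source.
backup_with_timestamp() {
    src="$1"
    dest_dir="${2:-/tmp}"    # destination directory, /tmp by default
    dest="$dest_dir/$(basename "$src").$(date +%F.%H%M)"
    cp -p "$src" "$dest" || return 1
    # cmp -s is silent and exits non-zero if the two files differ
    cmp -s "$src" "$dest" || return 1
    # Print the backup path so the caller can note where it went
    printf '%s\n' "$dest"
}
```

It could be called as `backup_with_timestamp /storage/db/casa/webapp/hsqldb/casa.db.script`, and the same helper would work for gemfire.properties.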
#Modify casa script
vi /storage/db/casa/webapp/hsqldb/casa.db.script
#Look for the block below. Change "onlineState" and "online_state" from ONLINE to OFFLINE, and "remove_node_state" from REMOVING to NONE. The block is shown here with the new values in place.
INSERT INTO CASA_DOCS VALUES('clusterMembership','{"onlineState":"OFFLINE","cluster_name":"vROPsMastersoftheUniverse","is_ha_enabled":false,"ha_transition_state":null,"initialization_state":"NONE","remove_node_state":"NONE","document_time":1435068088309,"online_state":"OFFLINE","online_state_time":1434726691663,"online_state_reason":"online_state.change_reason.remove_a_slice","cluster_members":[],"admin_slices":[],"installation_state":"DONE","slices":{"00059a72-349f-47ed-a910-4c01c9ac4e2e":{"slice_uuid":"00059a72-349f-47ed-a910-4c01c9ac4e2e","is_admin_node":true,"ip_address":"server1000.local","slice_name":"p00x04vropm001","membership_state":null},"a6ba9353-c240-40b1-bc5c-9d7d5d667ab2":{"slice_uuid":"a6ba9353-c240-40b1-bc5c-9d7d5d667ab2","is_admin_node":false,"ip_address":"server900.local","slice_name":"p00x04vropd001","membership_state":null},"ee53d2ce-0c9d-4091-9410-b4ccb08305ad":{"slice_uuid":"ee53d2ce-0c9d-4091-9410-b4ccb08305ad","is_admin_node":false,"ip_address":"server400.local","slice_name":"p00x03vropc001","membership_state":null},"4862541a-b66a-4dbd-b1de-035082468ca8":{"slice_uuid":"4862541a-b66a-4dbd-b1de-035082468ca8","is_admin_node":false,"ip_address":"server300.local","slice_name":"p00x07vropc001","membership_state":null},"7dbbaf73-e409-47cc-9f98-8e3b6d66dec5":{"slice_uuid":"7dbbaf73-e409-47cc-9f98-8e3b6d66dec5","is_admin_node":false,"ip_address":"server200.local","slice_name":"p00x04vropc001","membership_state":null},"4dcc7006-81c2-42f4-a295-4f540d3a1297":{"slice_uuid":"4dcc7006-81c2-42f4-a295-4f540d3a1297","is_admin_node":false,"ip_address":"server10.local","slice_name":"p00x04vropd002","membership_state":null},"2df836e9-fb5f-49ef-9327-84a544f26e51":{"slice_uuid":"2df836e9-fb5f-49ef-9327-84a544f26e51","is_admin_node":false,"ip_address":"Server20.local","slice_name":"savvis-manheim","membership_state":null},"12adbdfb-79bc-44e2-bd2d-dadd100b5931":{"slice_uuid":"12adbdfb-79bc-44e2-bd2d-dadd100b5931","is_admin_node":false,"ip_address":"Server30.local","slice_name":"roswell-manheim","membership_state":null}}}')
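For anyone who would rather not hand-edit a line that long in vi, the state changes could be scripted with sed. This is only a sketch under assumptions: it assumes the stuck values in the file are exactly "ONLINE" and "REMOVING" and that the targets are "OFFLINE" and "NONE", and `fix_casa_states` is a hypothetical function name, not a VMware tool. Run it only with the services stopped and a backup already taken.

```shell
#!/bin/sh
# Hypothetical helper (not a VMware tool): rewrite the stuck cluster
# state values in a casa.db.script file.
# Assumption: the stuck values are "ONLINE" and "REMOVING".
fix_casa_states() {
    file="$1"
    # Write the edited text to a scratch file first, then copy it back
    # over the original so ownership and permissions are left alone.
    sed \
        -e 's/"onlineState":"ONLINE"/"onlineState":"OFFLINE"/' \
        -e 's/"online_state":"ONLINE"/"online_state":"OFFLINE"/' \
        -e 's/"remove_node_state":"REMOVING"/"remove_node_state":"NONE"/' \
        "$file" > "$file.new" || return 1
    cat "$file.new" > "$file" && rm -f "$file.new"
}
```

Usage would be `fix_casa_states /storage/db/casa/webapp/hsqldb/casa.db.script`. Each pattern includes the closing quote, so a value that is already "OFFLINE" or "NONE" is left untouched and the function is safe to run twice.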
#If you are removing a data node, also follow the gemfire.properties steps below. They do not apply when removing a remote collector.
cp $ALIVE_BASE/user/conf/gemfire.properties /tmp/gemfire.properties.$(date +%F.%H%M)
vi $ALIVE_BASE/user/conf/gemfire.properties
If the removed node was a data node, change the "serverCount" entry in gemfire.properties to match the number of remaining nodes (data and master).
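The serverCount change can also be sketched as a small script. The function below assumes the entry is a plain `serverCount=<number>` line, as is usual for a Java-style properties file; I have not confirmed the exact layout of the file, and `set_server_count` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: set serverCount in a gemfire.properties-style file.
# Assumption: the entry is written as a plain "serverCount=<number>" line.
set_server_count() {
    file="$1"
    count="$2"
    # Reject an empty or non-numeric count
    case "$count" in
        ''|*[!0-9]*) echo "count must be a number" >&2; return 1 ;;
    esac
    # Stop if the key is missing rather than silently changing nothing
    grep -q '^serverCount=' "$file" || {
        echo "serverCount not found in $file" >&2
        return 1
    }
    # Edit via a scratch file, then copy back to keep ownership and permissions
    sed "s/^serverCount=.*/serverCount=$count/" "$file" > "$file.new" || return 1
    cat "$file.new" > "$file" && rm -f "$file.new"
}
```

For example, `set_server_count "$ALIVE_BASE/user/conf/gemfire.properties" 2` would suit a cluster left with one master and one data node.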
#Start services
service vmware-casa start
service vmware-vcops start