Tuesday, January 21, 2020

Cohesity: How to hot-swap a failed Cohesity node...


How to remove (hot-swap) a faulty node from a Cohesity cluster.

1. Log in to the node in question using its hostname or node IP.
ssh cohesity@cohesity-node-1

2. Issue an IPMI command to power down the chassis. This also helps identify the node visually, since its lights turn off once the command runs.
cohesity-node-1::> sudo ipmitool chassis power off

(As soon as you enter this, the SSH session to that node is killed.)
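(If you want to double-check that the node is really off after your session drops, you can query its BMC over the IPMI LAN interface from any other host that can reach it. The IPMI IP, user, and password below are placeholders for your environment.)
# Query the chassis power state over the network via the node's BMC (lanplus interface).
# <ipmi-ip>, <ipmi-user>, and <ipmi-password> are placeholders; this should report the chassis power as off.
$ ipmitool -I lanplus -H <ipmi-ip> -U <ipmi-user> -P '<ipmi-password>' chassis power status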

3. Now remove the node from the cluster. Log in to the cluster and run the iris_cli command.
cohesity-cluster::> iris_cli -username=admin -password='xxxxx' node rm id= force=true
Force removal is only allowed when the node is unreachable; since you powered it off in the previous step, the force flag is accepted.
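(To double-check that the node really left the cluster membership, you can list the nodes again. The 'node ls' subcommand below is an assumption on my part and may differ by DataProtect version, so verify it against the iris_cli help on your cluster.)
# List the cluster nodes and confirm the removed node ID is gone.
# 'node ls' is assumed here; check the iris_cli help if your version differs.
cohesity-cluster::> iris_cli -username=admin -password='xxxxx' node ls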


4. Now insert the new node into the chassis, power it on, and assign the initial IP information.
(Log in as the cohesity user; under its home directory there are three main scripts.)
(It is important to run the configure_network.sh script (located at /home/cohesity/bin/network/configure_network.sh) to assign the IP address, netmask, gateway, and so on before the node can be added to the cluster.)
a. Select the option to assign the networking configuration for node management.
b. Configure the IP address and other networking information.
Note: If you are using multiple VLANs and node management sits on a non-native VLAN, have the interface name and bond information handy. I ran into an issue where, even after assigning the settings with the script, manual intervention was still required. In some cases the bond VLAN sub-interface is not created as part of the configuration, so you have to create the interface manually, set the IP address information, and restart the network service before you can move forward (a sketch of those commands follows the config file below).
>> Log in to the node via crash cart and run:
:>> cat /etc/sysconfig/network-scripts/ifcfg-bond0.<vlanID>
DEVICE=bond0.vlanID
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
VLAN=yes

IPADDR=
PREFIX=26
GATEWAY=
MTU=1500
ZONE=public
NM_CONTROLLED=no
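(After editing the file, bring the interface up with the new settings. This is a minimal sketch assuming a CentOS-style node using the legacy network service (NM_CONTROLLED=no) and VLAN ID 100 purely as an example value.)
# Restart the legacy network service so the bond VLAN sub-interface comes up with the new settings.
sudo systemctl restart network
# Confirm the sub-interface exists and carries the expected IP (bond0.100 is an example name).
ip addr show bond0.100
# Confirm the default route points at the gateway you configured.
ip route show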

5. Once the persistent IP is assigned, run the same configure_network.sh script again and select the option to configure IPMI, then apply the IPMI network configuration.
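(To verify the IPMI settings took effect, you can print the BMC's LAN configuration locally on the node; channel 1 is typical but can vary by hardware. The <new-ipmi-ip> below is a placeholder.)
# Print the BMC LAN configuration (IP, netmask, gateway) on channel 1.
cohesity-node-1::> sudo ipmitool lan print 1
# Optionally confirm the new IPMI IP is reachable from another host.
$ ping -c 3 <new-ipmi-ip>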

6. With the new node in the system, the cluster can discover the newly inserted node and add it to the cluster.
(I ran into an issue where the cluster UI did not discover it, so I added it manually with the command below. This is independent of the version the new node is running: in my case the new node was on a lower revision of DataProtect, and its OS was updated automatically once it was added to the cluster.)
:: iris_cli cluster add-nodes auto-update=true node-ids= node-ips= node-ipmi-ips=
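(For illustration only, here is the same command with a single node filled in. The node ID and IP addresses are made-up example values; pull the real ones from the discovery output or the node itself.)
# Hypothetical example values; substitute your real node ID, node IP, and IPMI IP.
:: iris_cli cluster add-nodes auto-update=true node-ids=1234567890 node-ips=10.10.10.14 node-ipmi-ips=10.10.11.14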

## Just as a sanity check, it helps to restart the Nexus service so that the VIPs are redistributed evenly. I ran into an issue previously where, after one node failed, a second node ended up hosting two VIPs, and the VIPs were not redistributed to the newly added node after the replacement, causing intermittent backup failures. Restarting the Nexus service resolves this.
::$ allssh.sh "sudo systemctl restart nexus"
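(To confirm the VIPs actually spread out after the restart, you can check which addresses each node is holding. The 10.10.12. subnet in the grep is just an example pattern; use the subnet your VIPs live in.)
# Show the IPv4 addresses bound on every node; each node should now hold roughly one VIP.
::$ allssh.sh "ip -4 addr show" | grep "10.10.12."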


(The rest happens in the background.)


You are welcome :)
Source: support.cohesity.com