Monday, May 2, 2022

AWS: How to improve S3 storage performance with partition and prefix best practices

 Learn Storage, Backup, Virtualization,  and Cloud. AWS, GCP & AZURE.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Amazon S3 is the most scalable storage service in the AWS storage portfolio. But how does its performance scale? Is there any tuning that can be done to make it more performant? Yes, there are guidelines and best practices. AWS has plenty of publicly available documentation on S3 performance and scalability, along with best-practice guidance. In this blog, let's take a look at S3 performance.


First: for smaller workloads (<50 total requests per second), none of the below applies, no matter how many total objects one has! S3 has a bunch of automated agents that work behind the scenes, smoothing out load all over the system, to ensure the myriad diverse workloads all share the resources of S3 fairly and snappily. Even workloads that burst occasionally up over 100 requests per second really don’t need to give us any hints about what’s coming…we are designed to just grow and support these workloads forever. S3 is a true scale-out design in action.

S3 scales to both short-term and long-term workloads far, far greater than this. We have customers continuously performing thousands of requests per second against S3, all day every day. Some of these customers simply 'guessed' how our storage and retrieval system works on their own, or may have come to S3 from another system that partitions namespaces using similar logic. We worked with other customers through our Premium Developer Support offerings to help them design a system that would scale basically indefinitely on S3. Today we're going to publish that guidance for everyone's benefit.

Some high-level design concepts are necessary here to explain why the approach below works. S3 must maintain a 'map' of each bucket's object names, or 'keys'. Traditionally, some form of partitioning would be used to scale out this type of map. Given that S3 supports a lexicographically sorted list API, it would stand to reason that the key names themselves are used in some way in both the map and the partitioning scheme…and in fact that is precisely the case: each key in this 'keymap' (that's what we call it internally) is stored and retrieved based on the name provided when the object is first put into S3 – this means that the object names you choose actually dictate how we manage the keymap.

Internally, the keys are all represented in S3 as strings like this:

bucketname/keyname

Further, keys in S3 are partitioned by prefix.

As we said, S3 has automation that continually looks for areas of the keyspace that need splitting. Partitions are split either due to sustained high request rates, or because they contain a large number of keys (which would slow down lookups within the partition). There is overhead in moving keys into newly created partitions, but with request rates low and no special tricks, we can keep performance reasonably high even during partition split operations. This split operation happens dozens of times a day all over S3 and simply goes unnoticed from a user performance perspective. However, when request rates significantly increase on a single partition, partition splits become detrimental to request performance. How, then, do these heavier workloads work over time? Smart naming of the keys themselves!

We frequently see new workloads introduced to S3 where content is organized by user ID, game ID, or another similar semi-meaningless identifier. Often these identifiers are incrementally increasing numbers, or date-time constructs of various types. The unfortunate part of this naming choice, where S3 scaling is concerned, is two-fold: First, all new content necessarily ends up being owned by a single partition (remember the request rates from above…). Second, all the partitions holding slightly older (and generally less 'hot') content get cold much faster than they would under other naming conventions, effectively wasting the available operations per second that each partition can support by making all the old ones cold over time.

The simplest trick that makes these schemes work well in S3 at nearly any request rate is to simply reverse the order of the digits in this identifier (use seconds of precision for date or time-based identifiers). These identifiers then effectively start with a random number – and a few of them at that – which then fans out the transactions across many potential child partitions. Each of those child partitions scales close enough to linearly (even with some content being hotter or colder) that no meaningful operations per second budget is wasted either. In fact, S3 even has an algorithm to detect this parallel type of write pattern and will automatically create multiple child partitions from the same parent simultaneously – increasing the system’s operations per second budget as request heat is detected.

Example 1
Consider this small sample of incrementally increasing game IDs in the fictional S3 bucket ‘mynewgame’:

2134857/gamedata/start.png
2134857/gamedata/resource.rsrc
2134857/gamedata/results.txt
2134858/gamedata/start.png
2134858/gamedata/resource.rsrc
2134858/gamedata/results.txt
2134859/gamedata/start.png
2134859/gamedata/resource.rsrc
2134859/gamedata/results.txt

All these reads and writes will basically always go to the same partition…but if the identifiers are reversed:

7584312/gamedata/start.png
7584312/gamedata/resource.rsrc
7584312/gamedata/results.txt
8584312/gamedata/start.png
8584312/gamedata/resource.rsrc
8584312/gamedata/results.txt
9584312/gamedata/start.png
9584312/gamedata/resource.rsrc
9584312/gamedata/results.txt

This pattern instructs S3 to start by creating partitions named:

mynewgame/7
mynewgame/8
mynewgame/9

These can be split even further, automatically, as key count or request rate increases over time.
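
To make the reversal concrete, here is a minimal sketch of how an uploader might build the reversed prefix before writing. The aws CLI call and the local filename are illustrative assumptions, not part of the original guidance:

id=2134857
rid=$(echo -n "$id" | rev)        # 2134857 -> 7584312
aws s3 cp start.png "s3://mynewgame/${rid}/gamedata/start.png"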

Clever readers will no doubt notice that using this trick alone makes listing keys lexicographically (the only currently supported way) rather useless. For many S3 use cases this isn't important, but for others a slightly more complex scheme is necessary in order to allow groups of keys to be easily listable. This scheme also works for more structured object namespaces. The trick here is to calculate a short hash (note: collisions don't matter here, just pseudo-randomness) and prepend it to the string you wish to use for your object name. This way, operations are again fanned out over multiple partitions. To list keys with common prefixes, several list operations can be performed in parallel, one for each unique character prefix in your hash.
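
Before the worked example below, here is a minimal sketch of computing such a prefix, assuming a one-character md5-based hash (any pseudo-random function would do) and the aws CLI; the local filename is hypothetical:

key="service_log.2012-02-27-23.com.mydomain.hostname1"
prefix=$(printf '%s' "$key" | md5sum | cut -c1)   # first hex character of the md5 digest
aws s3 cp service_log.gz "s3://myserverlogs/${prefix}/${key}"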

Example 2
Take the following naming scheme in fictional S3 bucket ‘myserverlogs’:

service_log.2012-02-27-23.hostname1.mydomain.com
service_log.2012-02-27-23.hostname2.mydomain.com
service_log.2012-02-27-23.hostname3.mydomain.com
service_log.2012-02-27-23.hostname4.mydomain.com
service_log.2012-02-27-23.john.myotherdomain.com
service_log.2012-02-27-23.paul.myotherdomain.com
service_log.2012-02-27-23.george.myotherdomain.com
service_log.2012-02-27-23.ringo.myotherdomain.com
service_log.2012-02-27-23.pete.myotherdomain.com

With thousands, or tens of thousands, of servers sending logs every hour, this scheme becomes untenable for the same reason as the first example. Instead, by combining hashing with reversed domain identifiers, a scheme like the following provides the best balance between performance and flexibility of listing:

c/service_log.2012-02-27-23.com.mydomain.hostname1
4/service_log.2012-02-27-23.com.mydomain.hostname2
9/service_log.2012-02-27-23.com.mydomain.hostname3
2/service_log.2012-02-27-23.com.mydomain.hostname4
b/service_log.2012-02-27-23.com.myotherdomain.john
7/service_log.2012-02-27-23.com.myotherdomain.paul
2/service_log.2012-02-27-23.com.myotherdomain.george
0/service_log.2012-02-27-23.com.myotherdomain.ringo
d/service_log.2012-02-27-23.com.myotherdomain.pete

These prefixes could be a mod-16 operation on the ASCII values in the string, or really any hash function you like; the above are made up to illustrate the point. The benefits of this scheme are now clear: it is possible to prefix-list all the useful sets (an hour across all domains, or an hour in a single domain), as well as sustain reads and writes well over 1500 per second across these keymap partitions (16 in all, using regular expression shorthand below):

myserverlogs/[0-9a-f]

16 prefixed reads get you all the logs for mydomain.com for that hour:

http://myserverlogs.s3.amazonaws.com?prefix=0/service_log.2012-02-27-23.com.mydomain
http://myserverlogs.s3.amazonaws.com?prefix=1/service_log.2012-02-27-23.com.mydomain

http://myserverlogs.s3.amazonaws.com?prefix=e/service_log.2012-02-27-23.com.mydomain
http://myserverlogs.s3.amazonaws.com?prefix=f/service_log.2012-02-27-23.com.mydomain
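
Scripting those 16 reads is straightforward. A minimal sketch that issues them in parallel, using the aws CLI as an assumption (the same requests can be made with any S3 client, or with plain HTTP GETs as above):

for p in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
  aws s3api list-objects-v2 --bucket myserverlogs \
    --prefix "${p}/service_log.2012-02-27-23.com.mydomain" &
done
wait    # all 16 prefix listings run concurrently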

As you can see, some very useful selections of your data become easily accessible given the right naming structure. The general pattern here is: after the partition-enabling hash, put the key-name elements you would most like to select by furthest to the left.

By the way: two or three prefix characters in your hash are really all you need; here's why. If we assume conservative targets of 100 operations per second and 20 million stored objects per partition, a four-character hex hash partition set in a bucket or sub-bucket namespace could theoretically grow to support millions of operations per second and over a trillion unique keys before we'd need a fifth character in the hash.
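
Working through that arithmetic: a four-character hex hash gives 16^4 = 65,536 possible prefixes (partitions); 65,536 partitions x 100 operations per second is roughly 6.5 million operations per second, and 65,536 partitions x 20 million keys is roughly 1.3 trillion unique keys.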

This hashing ‘trick’ can also be used when the object name is meaningless to your application. Any UUID scheme where the left-most characters are effectively random works fine (base64 encoded hashes, for example) – if you use base64, we recommend using a URL-safe implementation and avoiding the ‘+’ and ‘/’ characters, instead using ‘-‘ (dash) and ‘_’ (underscore) in their places.
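
As a minimal sketch of that last point (the byte count, bucket, and object name below are arbitrary assumptions):

prefix=$(head -c 9 /dev/urandom | base64 | tr '+/' '-_')   # 12 URL-safe base64 characters
aws s3 cp payload.bin "s3://mynewgame/${prefix}/payload.bin"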

 

Source:  https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/

                                                           You are Welcome :)

Thursday, December 17, 2020

Cohesity: In Azure, how to destroy Azure cluster nodes and repurpose them by adding them to a running cluster.


Learn Storage, Backup, Virtualization,  and Cloud. AWS, GCP & AZURE.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

1. Stop the cluster.

2. Destroy the cluster.

3. Wipe config on freed Nodes.

4. Add Nodes to Cluster.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Commands (cluster running code version 6.3.1g):

[cohesity@shcstypazbk003-000d3a317381-node-1 ~]$ iris_cli cluster stop

[cohesity@shcstypazbk003-000d3a317381-node-1 ~]$ iris_cli cluster status

[cohesity@shcstypazbk003-000d3a317381-node-1 ~]$ iris_cli cluster destroy id=<Cluster_ID>

[cohesity@shcstypazbk003-000d3a317381-node-1 ~]$ iris_cli cluster

[cohesity@ClusterName--node-1 ~]$ ps -ef | grep iris_cli

After the cluster destroy completes, log into each individual node and run iris_cli node status.

[cohesity@ClusterName--node-1 ~]$iris_cli node status

NODE ID                       : 123456789107

NODE IPS                      : 10.10.9.100, fe80::20d:3aff:fe31:7c6f

NODE IN CLUSTER               : false

CLUSTER ID                    : -1

CLUSTER INCARNATION ID        : -1

SOFTWARE VERSION              : 

LAST UPGRADED TIME            : 

NODE UPTIME                   : 

ACTIVE OPERATION              : 

MESSAGE                       : Node is not part of a cluster.

(If the response says the node is not part of a cluster, it is good to go.)

(But if the node says it is part of a cluster, you may need to wipe its data and config manually with the prepackaged rescue scripts shown below.)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

[support@azure-cohesity--node-1 bin]$ pwd

/home/cohesity/bin

[cohesity@azure-cohesity--node-1 bin]$ cd rescue/

[cohesity@azure-cohesity--node-1 rescue]$ ls

breakfix_nvme_ssd.sh  clean_node.sh  defsh.sh  erase_disk.sh  make_bootable_device.sh  reset_linux_users.sh  rollback_upgrade.sh

[cohesity@azure-cohesity--node-1 rescue]$

[cohesity@ClusterName--node-1 rescue]$ ./clean_node.sh

CLEAN NODE IS A DESTRUCTIVE OPERATION.  DO YOU WANT TO PROCEED? (Y/N): y

Cleaning...

[cohesity@ClusterName--node-1 rescue]$ reboot.sh 

RECEIVED REQUEST TO REBOOT NODE. DO YOU WANT TO PROCEED? (Y/N): y

Rebooting...

Connection to 10.249.8.135 closed by remote host.

By now, the node will be free and not part of a cluster.

At this step, you can move on to adding the nodes.
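
Before joining, it is worth confirming that every freed node now reports that it is not part of a cluster. A minimal sketch, assuming passwordless ssh between nodes and using the node IPs from the join step below:

for ip in 10.10.9.100 10.10.9.101 10.10.9.102; do
  echo "--- $ip ---"
  ssh cohesity@$ip 'iris_cli node status | grep "NODE IN CLUSTER"'
done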

1. Log into the Cluster where you want to join new nodes.

2. iris_cli

3. admin@127.0.0.1> cluster cloud-join node-ips=10.10.9.100,10.10.9.101,10.10.9.102

(These IPs are the IPs of the recently freed nodes.)


Monitor the node-add progress in Siren and/or the GUI.

Thursday, October 1, 2020

Cohesity: How to expand the cluster and how to remove a node from the cluster, with examples.


Learn Storage, Backup, Virtualization,  and Cloud. AWS, GCP & AZURE.

 ..........................................................................................................................................................................

Expand a Cluster

Perform the following steps before adding new nodes to the cluster.
1. Set up the non-native VLAN to be used for the node-add workflow. Two methods are available (a filled-in example follows this list).
Method A: Use the iris_cli.
a. Use the iris_cli vlan add command to set up the non-native VLAN. Example:
iris_cli vlan add if-name=bond0 id=101 subnet-mask-bits=8
b. Use the following command to set the non-native VLAN logical bond interface as primary. Replace vlan_id with the ID of the VLAN you added:
iris_cli ip config interface-name=<bond0.vlan_id> interface-role=primary
Alternatively, to configure the IP on a new node and access the node using that IP (not required if using Avahi to discover all nodes), use this command:
iris_cli ip config interface-name=<bond0.vlan_id> iface-ips=xx subnet-gateway=yy subnet-mask-bits=zz mtu=qq
Method B: Use the configure_network.sh script.
a. Use configure_network.sh option 10.
Location: /home/cohesity/bin/network/configure_network.sh
2. Restart the Nexus service:
sudo service nexus restart
3. Run ifconfig and ensure Avahi runs on the non-native VLAN bonded interface.
4. On any node in the existing cluster, start the node-add workflow from the UI and provide cluster IPs from the configured non-native VLAN.
NOTE: If necessary, you can configure cluster IPs and VIPs from the non-native VLAN and keep the IPMI in the native VLAN or some other subnet.
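
For reference, here is a filled-in sketch of the Method A commands. VLAN 101 and the /23 addressing below are hypothetical placeholders; substitute your own values:

iris_cli vlan add if-name=bond0 id=101 subnet-mask-bits=23
iris_cli ip config interface-name=bond0.101 interface-role=primary
iris_cli ip config interface-name=bond0.101 iface-ips=10.19.65.60 subnet-gateway=10.19.64.1 subnet-mask-bits=23 mtu=1500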
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Remove a Node from Cluster. 
This is the clean way. A trickier way would be to fail the node and let the cluster reconstruct the data in the background, provided you have the right redundancy settings in place, which can be checked under Storage Domain configuration.
  1. Log into Cluster/node
        > iris_cli cluster status
            (It lists Node ID with IP and Serial Numbers)
        > iris_cli node rm -id=<serial number of node>
            (It will prompt for the cluster username (admin) and password, followed by the message:
                "Success: Node ID: <Serial Number> marked for removal successfully.")

Note: There is no way to track the removal process using the CLI. However, if you log in to the Siren page and go to Scribe, it will show the KRemoveNode process and the metadata/replicas held by the node constantly decreasing, which indicates the node is being removed. The Scribe service tracks and manages metadata; metadata removal and data removal from the disks owned by the node in question run in parallel, but the metadata typically finishes quickly. Once the data has been reshuffled across the other nodes, logging into the node and running the same commands as above will show the message that the node is not part of a cluster, and/or the password will be reset to the default admin password rather than the one you changed for the entire cluster.


You are Welcome :)

Friday, September 25, 2020

Cohesity: Network-related troubleshooting during initial cluster build


Learn Storage, Backup, Virtualization,  and Cloud. AWS, GCP & AZURE.
 
 ..........................................................................................................................................................................
Network Related Troubleshooting during Initial Cluster Build With Examples

1. To begin with, start by looking at the 10G interfaces.
Bond0 is used by Cohesity nodes.
By default the 10G interfaces are included in bond0.
ens802f0/ens802f1 are the 10G interfaces on the C2xxx, 4xxx, and 6xxx series.
Mode 1: active-backup policy
Mode 4: 802.3ad dynamic link aggregation (LACP), active-active (recommended)

2. Check which interface is set as the primary interface in the running configuration.
[cohesity@node ~]$ primary_interface_name.sh
bond0 (This should be listed in the result)
[cohesity@node ~]$ allssh.sh 'ip a | grep bond'
(This lists which interfaces are members of bond0.)

3. Ensure 10G ports are connected.
[cohesity@node ~]$ sudo ethtool ens802f0
Settings for ens802f0:
  Supported ports: [ FIBRE ]
  Supported link modes:   10000baseT/Full
  Supported pause frame use: Symmetric
       ...
  Speed: 10000Mb/s
  Duplex: Full
  Port: Direct Attach Copper
  PHYAD: 0
  Transceiver: internal
  Auto-negotiation: off
       ...
  Link detected: yes
[cohesity@node ~] sudo ethtool ens802f1
Settings for ens802f1:
  Supported ports: [ FIBRE ]
  Supported link modes:   10000baseT/Full
  Supported pause frame use: Symmetric
       ...
  Speed: 10000Mb/s
  Duplex: Full
  Port: Direct Attach Copper
  PHYAD: 0
  Transceiver: internal
  Auto-negotiation: off
       ...
  Link detected: yes

4. Ensure the LLDP service is enabled on the switch ports.
[cohesity@node ~] sudo lldpctl ens802f0
-----------------------------------------------------
LLDP neighbors:
------------------------------------------------
Interface:    ens802f0, via: LLDP, RID: 3, Time: 6 days, 02:38:24
(This should list the chassis/serial number and other details of the connected switch.)
[cohesity@node ~] sudo lldpctl ens802f1
-----------------------------------------------------
LLDP neighbors:
-----------------------------------------------------
Interface:    ens802f1, via: LLDP, RID: 4, Time: 6 days, 02:42:21

5. Ensure the bond ports are up and active, with no issues.
[cohesity@node ~]$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
Slave Interface: ens802f0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a4:bf:01:2d:7f:56
Aggregator ID: 3
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0

6. When using the native VLAN, ensure the port config is correct in the switch running config.
nexus-1
interface port-channel 101
description vpc 101 cohesity-node1-ens802f0
switchport mode trunk
switchport trunk allowed vlan 50
switchport trunk native vlan 50
spanning-tree port type edge trunk
vpc 101
interface Ethernet1/5
description vpc 101 cohesity-node1-ens802f0
switchport mode trunk
switchport trunk allowed vlan 50
switchport trunk native vlan 50
channel-group 101 mode active
nexus-2
interface port-channel 101
description vpc 101 cohesity-node1-ens802f1
switchport mode trunk
switchport trunk allowed vlan 50
switchport trunk native vlan 50
spanning-tree port type edge trunk
vpc 101
interface Ethernet1/5
description vpc 101 cohesity-node1-ens802f1
switchport mode trunk
switchport trunk allowed vlan 50
switchport trunk native vlan 50
channel-group 101 mode active

7. Map the MAC address to the node IP using the arp tool. If a duplicate IP address is in use, this helps identify it.
[cohesity@node-1 ~]$ hostips
10.19.65.50 10.19.65.51 10.19.65.52 10.19.65.53
[cohesity@node-1 ~]$ ping 10.19.65.50
PING 10.19.65.50 (10.19.65.50) 56(84) bytes of data.
64 bytes from 10.19.65.50: icmp_seq=1 ttl=64 time=0.112 ms
64 bytes from 10.19.65.50: icmp_seq=2 ttl=64 time=0.071 ms
^C
[cohesity@node-1 ~]$ arp -na |grep 10.19.65.50
? (10.19.65.50) at 00:1e:67:9c:49:90 [ether] on bond0.101
[cohesity@node-1 ~]$ ifconfig bond0.101
bond0.101: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.19.65.50  netmask 255.255.254.0  broadcast 10.19.65.255
        inet6 fe80::21e:67ff:fe9c:4990  prefixlen 64  scopeid 0x20<link>
        ether 00:1e:67:9c:49:90  txqueuelen 1000  (Ethernet)

8. Check ethtool errors. Check the SFP, cable, or connection if there are errors, especially CRC errors.
[cohesity@node ~]$ sudo ethtool -S ens802f0 |egrep "dropped|error"
    rx_errors: 0
    tx_errors: 0
    rx_dropped: 0
    tx_dropped: 0
    rx_over_errors: 0
    rx_crc_errors: 0
    rx_frame_errors: 0
    rx_fifo_errors: 0
    rx_missed_errors: 24758
    tx_aborted_errors: 0
    tx_carrier_errors: 0
    tx_fifo_errors: 0
    tx_heartbeat_errors: 0
    rx_long_length_errors: 0
    rx_short_length_errors: 0
    rx_csum_offload_errors: 864
    rx_fcoe_dropped: 0
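
A single snapshot does not show whether the counters are still climbing. A minimal sketch that samples the same counters twice for comparison (the 30-second interval and the ens802f0 interface are assumptions):

for i in 1 2; do
  date
  sudo ethtool -S ens802f0 | egrep "dropped|error"
  sleep 30
done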

9. New cluster install with non-native VLAN trunk port configuration.
This example shows a customer wanting to use VLAN 101; VLAN 101 needs to be configured as the primary interface.
[cohesity@node-1 ~]$ primary_interface_set.sh bond0.101
[cohesity@node-1 ~]$ primary_interface_name.sh
bond0.101
When ./configure_network.sh is used, select option 10, which allows you to customize VLANs and IP info.

10. After the cluster is created, validate the VLANs for the VIPs:
[cohesity@optimus-64-11 ~]$ iris_cli vlan ls

11. Once all of these are verified and checked out, it is possible that Cohesity's Nexus service, which is responsible for networking, might need a restart too.
for i in `seq 180 184`; do ssh 10.123.23.$i date; done
for i in `seq 180 184`; do ssh 10.123.23.$i sudo systemctl stop nexus; done
for i in `seq 180 184`; do ssh 10.123.23.$i sudo systemctl restart nexus; done

12. These are the areas to look at; their logs may point to the issue:
[cohesity@node ~]$less nexus_exec.FATAL
[cohesity@node ~]$less nexus_exec.INFO
[cohesity@node ~]$ less logs/nexus_proxy_exec.INFO (Displays NEXUS Service issues).
[cohesity@node ~]$ ls -ltr  logs/*FATAL*

[cohesity@node ~]$cat /etc/sysconfig/network-scripts/ifcfg-bond0 


You are Welcome :)

Cohesity: How to create a new Cohesity cluster, with examples

Learn Storage, Backup, Virtualization,  and Cloud. AWS, GCP & AZURE.
.......................................................................................................................................................................... 

This is the method used to create a new cluster using IPMI.

This applies to C6xxx models. (If you use it for a C25xx or 4xxx model, set the value to "3" instead of "1" in the ipmitool lan commands below.)
C6xxx uses username: admin and password: administrator for IPMI.
1. Console into the very first node.
It will take you to a black screen.

[cohesity@node ~]$ sh (Type sh and press Enter)
UserName: cohesity
Password: Cohe$1ty

(This will take you to the cluster shell.) Then run the commands below:
sudo ipmitool lan print 1
sudo ipmitool lan set 1 ipsrc static
sudo ipmitool lan set 1 ipaddr 10.123.123.20
sudo ipmitool lan set 1 defgw ipaddr 10.123.123.1
sudo ipmitool lan set 1 access on
2. Now that you have enabled IPMI, you can use the IP address in a URL and access the KVM remotely.
Once logged in to the KVM:
[cohesity@node ~]$ cd bin/network
$ ls
(This will list the available scripts.)
3. Select the configure_network.sh script.
[cohesity@node ~]$ ./configure_network.sh
(It will list 12 options. Select option 7 to configure LACP bonding across the two 10G ports on the Cohesity side. You must have 10G LACP configured the same way on the switch side too.)
LACP config on the switch side should look like this:
SwitchA
interface Ethernet1/5
description  cohesity-node1-ens802f0
switchport mode trunk
switchport trunk allowed vlan 50
switchport trunk native vlan 50
channel-group 101 mode active
mtu 9216
SwitchB:
interface Ethernet1/5
description cohesity-node1-ens802f1
switchport mode trunk
switchport trunk allowed vlan 50
switchport trunk native vlan 50
channel-group 101 mode active
mtu 9216
4. In the event the BMC/IPMI port becomes unresponsive, log into IPMI from another node and run this to reboot it:
ipmitool -I lanplus -U admin -P administrator -H 10.123.123.20 mc reset cold
(If an IPMI interface is frozen, you can use this to reset it over IPMI from a different node.)
5. Part of ./configure_network.sh assigns the node IP. You can now ssh into that node IP (e.g. 10.123.123.40).
6. Once ssh'd into the node IP:
[cohesity@node ~]$ cat /proc/net/bonding/bond0 (This gives info on what kind of bond config is in place)
It shows something like this:
[cohesity@node ~]$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
Slave Interface: ens802f0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a4:bf:01:2d:7f:56
Aggregator ID: 3
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
7. [cohesity@node ~]$ avahi-browse -tarp
(This goes out and discovers all the nodes connected in the cluster using IPv6 internal processes. If this doesn't see any nodes, it needs to be looked at.)

8. At this stage, you can use the node IP in a URL and should be able to see all the nodes in discovery, so you can start creating the Cohesity cluster.
This is an interactive session; you get to assign node IPs, VIPs, SMTP, DNS, and NTP servers.
At the end of the interactive session, it gives a message notifying you that the cluster has been created and that you can log in at the provided URL using the admin user.
Username: admin
Password: admin
Note: If you want to update gflags and other settings, you may do so at this point in time.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Validation Steps for Cluster Settings:
1. Now that the cluster is up, you can run this on any node. MII status should show up on all the nodes that are part of the cluster.
[cohesity@node ~]$ allssh.sh 'cat /proc/net/bonding/bond0' | grep MII
MII Status: up
MII Polling Interval (ms): 100
MII Status: up
MII Status: up
(The same four lines repeat for each of the five nodes in this cluster; every MII status should be up.)
2. [cohesity@node ~]$ allssh.sh 'cat /proc/net/bonding/bond0' | grep Mode
(This should list the link aggregation mode. Mode 4, i.e. LACP, is dynamic link aggregation.)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
3. [cohesity@node ~]$ iris_cli node status
4. [cohesity@node ~]$ iris_cli cluster status
5. [cohesity@node ~]$ allssh.sh hostips
(This will list all node IPs in the cluster.)
6. [cohesity@node ~]$ less logs/iris_proxy_exec.FATAL (lists any FATALs related to the iris service)
You are Welcome :)

Thursday, February 13, 2020

Netapp: Excessive DNS queries by Netapp Harvest Server against monitored Netapp Cluster

Learn Storage, Backup, Virtualization,  and Cloud. AWS, GCP & AZURE.
..........................................................................................................................................................................

Synopsis: The NetApp NAbox OVA image for version 2.5, treated as the GA release, works well for this use case. However, the server generates thousands of DNS queries against the monitored cluster in less than 24 hours.

I ran into a problem where the DNS server got choked by the NAbox server, which generated about 70K DNS requests (A and AAAA lookups) in less than 24 hours.

Resolution:
1.     Use the IP address of the cluster (i.e., the cluster management IP) as the source for the cluster to be monitored.
2.     Use the beta version of the newer 2.6 release, which bundles dnsmasq as a local resolver.
I went with option 2, as it was seamless and it is worth being able to keep using hostnames rather than IP addresses.


You are Welcome :)