VMware vSphere High Availability Basics

VMware vSphere HA is one of the core features of a cluster, so let’s look at it more precisely. High Availability (HA) enables a cluster of ESXi hosts to work together so that they provide a higher level of availability for virtual machines than a single ESXi host could by itself. In brief, HA pools the virtual machines and the ESXi hosts in the cluster and protects them against events such as host failures, host isolation and application crashes. The requirements for HA are a minimum of two hosts, vCenter Server and shared storage.

Photo Credits: VMware.com

One ESXi host goes down

By default, HA uses the management network (Service Console/Management Network VMkernel connections). Let’s take a scenario where there are three ESXi hosts in a cluster. If a physical server (an ESXi host) goes down, its VMs will be restarted on the other ESXi hosts. Applications can also be set up to start on the other physical servers. Of the three physical servers in the cluster, one is elected as master. The master keeps track of the other ESXi hosts through their heartbeats, which are exchanged over the management network. The master always expects to receive heartbeat responses from the other ESXi hosts.

Only the management network went down

If at any moment the master detects that a host is down, it reports that to vCenter Server and all of that host’s VMs are restarted on the other ESXi hosts. More interesting is the case where only the management network goes down while another network, such as the datastore network, is still working. This is referred to as an isolation incident. In that case, the master learns through the datastore heartbeat that the ESXi host is still active, so the VMs are not restarted on other ESXi hosts.

Only the Datastore network went down

Now, what if only the datastore network went down and not the management network? The master will still receive heartbeat messages from the other ESXi hosts, but no data reaches the datastore. This is where another element of HA comes in: VMCP (VM Component Protection), a component that detects whether a VM still has access to its datastore. When the datastore heartbeat reports failures, the VMs are restarted on other ESXi hosts where the datastore is still sending alive heartbeat messages.

In all three scenarios, HA implies downtime, as the VMs are restarted on other ESXi hosts, but this usually takes only minutes. Another point to keep in mind is that host-level HA only watches the physical host. For example, if a particular VM encounters a BSOD or a kernel panic, host monitoring will not know about it, because the physical server (ESXi host) is still communicating with the master. Detecting that kind of failure requires the separate ‘VM and Application monitoring’ option, which is enabled in the configuration steps below.
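To summarise the three scenarios, here is a minimal sketch of the decision as I described it above. It is not the real FDM logic: `ha_response` and its two arguments are names I made up for illustration, and in a real cluster the isolation response is configurable rather than fixed.

```shell
#!/bin/sh
# Sketch only: mirrors the scenarios described in this post,
# not the actual FDM implementation. ha_response is a hypothetical name.
#   $1 = management-network heartbeat ("up" or "down")
#   $2 = datastore heartbeat / datastore access ("up" or "down")
ha_response() {
    mgmt="$1"
    datastore="$2"
    case "${mgmt}:${datastore}" in
        up:up)     echo "healthy: no action" ;;
        down:up)   echo "isolated: VMs stay on this host" ;;
        up:down)   echo "datastore lost: VMCP restarts VMs elsewhere" ;;
        down:down) echo "host failed: restart VMs on other hosts" ;;
        *)         echo "unknown input" >&2; return 1 ;;
    esac
}

# Only the management network went down:
ha_response down up
```

Each branch corresponds to one of the headings above, plus the healthy case where both heartbeats arrive.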

How does the election of the master take place?

When HA is activated in vSphere, the election process takes around 10-15 seconds. While HA is being enabled, an agent called FDM (Fault Domain Manager) is installed on each host to run HA. Its logs can be checked at /var/log/fdm.log. The election is decided by an algorithm with two rules. First, the host with access to the greatest number of datastores wins.

Now, what if all ESXi hosts see the same number of datastores? There will be a tie. This is where the second rule kicks in: the host with the lexically highest Managed Object ID (MOID) is chosen. Note that every object in vCenter Server has a MOID, for example ESXi servers, folders and VMs. Lexically here means that the IDs are compared as strings, character by character, rather than as numbers. Care must be taken when attempting to rig this election because, for example, host-99 is in fact higher than host-100.
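The two rules can be sketched with a plain sort. This is only an illustration under my own assumptions: `elect_master` is a hypothetical helper, and the input format (one host per line, MOID and datastore count separated by a single space) is invented for the example.

```shell
#!/bin/sh
# Sketch of the two election rules; not how FDM is implemented.
# Input on stdin: "<moid> <datastore-count>", one host per line.
elect_master() {
    # Rule 1: most datastores first (numeric, descending).
    # Rule 2: on a tie, lexically highest MOID first (byte order).
    LC_ALL=C sort -t ' ' -k2,2nr -k1,1r | head -n 1 | cut -d ' ' -f 1
}

# host-99 and host-100 both see 4 datastores, so the MOID decides.
printf 'host-100 4\nhost-99 4\nhost-7 3\n' | elect_master   # prints host-99
```

The string comparison stops at the first differing character, and "9" sorts after "1", which is why host-99 beats host-100.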

What if ... ?

So what if vCenter Server goes down after HA has been set up?

The answer is that HA will still work, as it now has the capacity to power on the vCenter Server. The FDM agents are self-sufficient: they carry on the election process and can start the vCenter Server. They run on the ESXi hosts, not inside the vCenter Server.

Enable and Configure vSphere HA

I will be using the free labs provided by VMware to set up HA.

1. First, select the cluster, then click on ‘Actions’ and then ‘Settings’.

Photo Credits: VMware.com

2. Choose ‘vSphere Availability’ on the left, then click on ‘Edit’.

Photo Credits: VMware.com

3. Click on ‘Turn ON vSphere HA’.

Photo Credits: VMware.com

4. Choose the ‘Failures and Responses’ option and enable ‘VM and Application monitoring’.

Photo Credits: VMware.com

5. Under ‘Admission Control’, select the ‘Cluster resource percentage’ option.

Photo Credits: VMware.com

6. Click on ‘Heartbeat Datastores’ and select ‘Automatically select datastores accessible from the host’.

Photo Credits: VMware.com

7. On the ‘Summary’ tab, click on ‘vSphere Availability’; it should show vSphere HA: Protected.

Photo Credits: VMware.com

ESXi installation on my Dell Laptop and hands on VMware Labs

If you are wondering why I would install a bare-metal hypervisor on a laptop, I assure you it is for educational and testing purposes only. It turned out to be quite difficult to get this done. After some research, it looks like my Dell Inspiron n5110 motherboard will not allow me to install ESXi 6.x. Probably some drivers are missing or the motherboard does not support it.

Here is what my processors look like in the configuration menu on VMware vSphere Center

Anyway, I was able to inject some network drivers (VIB files) into the ESXi 5.0 image, which allowed me to install ESXi 5.0 on the laptop. You can follow the instructions at the link on how to make your unsupported NIC work with ESXi. Once installed, VMware gives you a two-month free trial before you need to purchase a license.

Another way of experimenting with VMware vSphere is to deploy a lab from labs.hol.vmware.com. It is very easy to deploy labs and access the VMware vSphere web client. All credentials are available in the readme.txt file found on the desktop. A lab manual is also shown alongside while you work in the lab environment.

I am sure this will help anyone get into hands-on labs quickly, and it is a nice start for beginners.

Out of Memory (OOM) in Linux machines

I have not posted anything on my blog for some months; I must admit that I was really busy. Recently, a friend asked me about the Out of Memory messages in Linux. How are they generated? What are the consequences? How can they be avoided on Linux machines? There is no single answer, as an investigation has to be carried out to establish the root cause. Before getting into details about OOM, let’s be clear that whenever the kernel is starved of memory, it will start killing processes. Many Linux administrators will experience this, and one of the fastest ways to get rid of it is to add extra swap. However, this is not a definitive fix. A preliminary assessment needs to be carried out, followed by an action plan and a rollback methodology.

If the kernel cannot free up memory after killing some processes, this might lead to a kernel panic, deadlocks in applications, kernel hangs or several defunct processes on the machine. I know of cases where the machine changed runlevel. There are also cases of kernel panic in virtual machines where the cluster is not healthy. In brief, the OOM killer is a subsystem that kills one or more processes with the aim of freeing memory. In the article Linux kernel crash simulation using kdump, I explained how to activate kdump to generate a vmcore for analysis. However, to get the debug messages from an out-of-memory error into the vmcore, a sysctl parameter needs to be configured. I will be using a CentOS 7 machine to illustrate the OOM parameters and configurations.

1. To activate OOM debug in the vmcore file, set the parameter vm.panic_on_oom to 1 using the following command:

sysctl -w vm.panic_on_oom=1

To verify that the configuration has been applied, run sysctl -a | grep -i oom. It is not recommended to test such parameters in a production environment.

2. To find out which process to kill, the kernel calls a function in the kernel code named badness(). badness() calculates a numeric value for how bad a task has been. To be precise, it accumulates “points” for each process running on the machine and returns them to a function called select_bad_process() in the Linux kernel. This eventually starts the OOM mechanism, which kills the processes. The “points” are exposed in /proc/<pid>/oom_score. For example, here I have a server running Java.

As you can see, the process number is 2153 and its oom_score is 21.

3. Many considerations are taken into account when calculating the badness score. Some of the factors are the virtual memory size (VM size), the priority of the process (nice value), the total runtime, the running user and /proc/<pid>/oom_adj. You can also set the oom_score_adj value of any PID to between -1000 and 1000. The lowest possible value, -1000, is equivalent to disabling OOM killing entirely for that task, since it will always report a badness score of 0.

4. Let’s assume that you want to prevent a specific process from being killed.

echo -17 > /proc/$PID/oom_adj

5. If you know the process name of the SSH daemon and do not want it to be killed, use the following command:

pgrep -f "/usr/sbin/sshd" | while read PID; do echo -17 > /proc/$PID/oom_adj; done

6. To keep sshd protected automatically through a cron job that runs every minute, use the following:

* * * * * root pgrep -f "/usr/sbin/sshd" | while read PID; do echo -17 > /proc/$PID/oom_adj; done

7. Let’s now simulate the OOM killer messages. Use the following command to trigger an out-of-memory event on the machine:

echo f > /proc/sysrq-trigger

You will notice an OOM error message in /var/log/messages. As you can see here, PID 843 was scored by the OOM killer before being killed. The score is 4 in our case.

Before the 'Out of memory' error, there will be a call trace sent by the kernel.

8. To monitor how the OOM killer is generating scores, you can use the dstat command. To install the dstat package on an RPM-based machine, use:

yum install dstat

or, on a Debian-based distribution:

apt-get install dstat

dstat is used to generate resource statistics. To use dstat to monitor the OOM killer score, use:

dstat --top-oom
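If dstat is not available, the same scores can be read directly from /proc. Below is a minimal sketch, assuming a Linux /proc layout; `top_oom` and its arguments are names I invented for this example, and the optional second argument only exists so the function can be pointed at a different directory.

```shell
#!/bin/sh
# Sketch: list the processes with the highest oom_score, highest first.
# top_oom is a hypothetical helper, not part of any package.
#   $1 = how many entries to show (default 5)
#   $2 = proc root (default /proc)
top_oom() {
    count="${1:-5}"
    proc_root="${2:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        # A process may exit while we iterate, so skip unreadable entries.
        [ -r "$dir/oom_score" ] || continue
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || name="?"
        # Output columns: score, PID, command name.
        echo "$score ${dir##*/} $name"
    done | sort -k1,1nr | head -n "$count"
}

top_oom 5
```

The process at the top of this list is the most likely candidate for the OOM killer at that moment.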


  • oom_score_adj is used in newer Linux kernels; oom_adj is the deprecated interface from older Linux machines.
  • Disabling the OOM killer under heavy memory pressure may cause the system to kernel panic.
  • Making a process immune is not a definitive way of solving the problem, for example with a Java application. Use a thread/heap dump to analyse the situation before making a process immune.
  • dstat is becoming an alternative to vmstat, netstat, iostat, ifstat and mpstat. For example, to monitor the top CPU and memory consumers, use dstat -c --top-cpu -dn --top-mem
  • Testing in a production environment should be avoided!
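Since oom_adj is deprecated, here is a hedged sketch of the same protection using oom_score_adj instead. `set_oom_score_adj` is a hypothetical helper written for this post; the optional third argument (a different proc root) is only there to make it easy to try without touching real processes.

```shell
#!/bin/sh
# Sketch: set oom_score_adj for one PID. Hypothetical helper, needs root
# to lower the value for a real process.
#   $1 = PID
#   $2 = value between -1000 and 1000 (-1000 disables OOM killing for the task)
#   $3 = proc root (default /proc)
set_oom_score_adj() {
    pid="$1"
    value="$2"
    proc_root="${3:-/proc}"
    # Reject values outside the documented range before writing.
    if [ "$value" -lt -1000 ] || [ "$value" -gt 1000 ]; then
        echo "value must be between -1000 and 1000" >&2
        return 1
    fi
    printf '%s\n' "$value" > "$proc_root/$pid/oom_score_adj"
}

# Example (as root), replacing the oom_adj loop from step 5:
# pgrep -f "/usr/sbin/sshd" | while read -r pid; do set_oom_score_adj "$pid" -1000; done
```

Writing -1000 here plays the same role as writing -17 to oom_adj in the earlier steps.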

A trip to a Wind Farm at Plaine Des Roches

This Sunday, the 16th of April, I came across an interesting location in the north-east at Plaine Des Roches, Mauritius, where electricity is produced by wind farms. The company behind it is Quadran, which has invested in this interesting, environmentally friendly project. Quadran is a global actor in renewable energy, encompassing hydroelectricity, solar energy, wind energy and biogas. It has 130 collaborators across 13 agencies and subsidiaries in metropolitan France and overseas, including Reunion Island (source: Quadran).

The electricity is produced from the kinetic energy of the wind. The wind turns the blades, which spin a shaft connected to a generator that makes electricity. At times when there is not enough wind, a fuel-powered engine switches on automatically for a few seconds to run the turbine, after which the wind takes over to turn the blades.

I believe that Mauritius, which is aiming to become a more eco-friendly island, should invest more in this type of project. This project, which involves 11 wind turbines with a power production of 9.35 MW, will satisfy the energy consumption of approximately 10,150 people.

However, the side effects of wind turbines are real. According to some sources, there are also reports of negative effects on radio and television reception in wind farm communities. Potential solutions include predictive interference modeling as a component of site selection. A 2007 report by the U.S. National Research Council noted that noise produced by wind turbines is generally not a major concern for humans beyond a half-mile or so. Low-frequency vibration and its effects on humans are not well understood and sensitivity to such vibration resulting from wind-turbine noise is highly variable among humans. – www.nap.edu

Hackers.mu attracted a massive crowd at the DevConMru 2017

This is yet another dazzling inspiration that hackers.mu brought to the minds of the audience today, the 1st of April 2017, at DevConMru Day 2. After the mesmerising speech at DevConMru by Logan, this time Codarren Velvindron, core member of hackers.mu, hit the conference room in front of so many attendees. Fast Coding Skills: a well-chosen topic, especially for the curious, whether beginners or professionals, who want to remove the barrier between themselves and the code. Codarren started the presentation by giving some examples of the applications he has ventured into, for example MariaDB.

The room was full with over fifty attendees. While some were sitting on the floor, others leaned up against the wall focussed on Codarren. I heard someone from the crowd murmuring “I want to be a hacker”.. 🙂


Several analogies were brought to the attention of the audience, such as the difficulties one encounters whilst coding. Tips and tricks to get relief from these difficulties were offered, such as playing, breaking a huge task into parts and analysing each of the mini parts. Another way to understand how code works is to “delete” part of it, after taking a backup, to see how it behaves in a different environment. Codarren also shared his experience of the IETF hackathon in which he participated.

Here are Codarren’s slides from DevConMru 2017:

Fast Coding Skills by Codarren Velvindron on Scribd

At the end, we thanked Codarren for the job done. Members of hackers.mu kept on responding to people from the audience who were showing interest in coding. Some questions from the audience were about the challenges faced in the IETF hackathon as well as Codarren’s favourite programming language. “Talk is cheap, show me the code” – Linus Torvalds.