Category: Linux System

Debugging disk issues with blktrace, blkparse, btrace and btt in Linux environment

blktrace is a block layer IO tracing mechanism which provides detailed information about request queue operations up to user space. There are three major components: a kernel component, a utility to record the i/o trace information for the kernel to user space, and utilities to analyse and view the trace information. This man page describes blktrace, which records the i/o event trace information for a specific block device to a file.

 

Photo credits : www.brendangregg.com
Photo credits : www.brendangregg.com

Limitations and Solutions with blktrace

There are several limitations and solutions when using blktrace. We will focus mainly on its goal and how the blktrace tool can help us in our day to day task. Assuming that you want to debug any IO related issue, the first command will be probably an iostat. These utilities can be installed through a yum install sysstat blktrace. For example:

iostat -tkx -p 1 2

The limitation here with iostat is that it does not gave us which process is utilising which IO.  To resolve this issue, we can use another utility such as iotop. Here is an example of the iotop output.

blktrace, iotop, blkparse, btt and btrace

Here iotop shows us exactly which process is consuming ‘which’ and ‘how’ much IO. But another problem with that solution is that it does not give detailed information. So, blktrace comes handy as it gives layer-wise information. How it does that? It sees what is going on exactly inside block I/O layer. When used correctly, it’s possible to generate events for all I/O request and monitor it from where it is evolving. Though it extracts data from the kernel, it is not an analysis tool and the interpretation of the data need to be carried out by you. However, you can feed the data to btt or blkparse to get the analysis done.

Before looking at blktrace, let’s check out the I/O architecture. Basically, at the userspace the user will write and read data. This is what the User Process at the user space. The user do not write directly to the disk. They first write to the VFS Page Cache and from which there are various I/O Scheduler and the Device Driver will interact with the disk to write the data.

Photo credits: msreekan.com
Photo credits: msreekan.com

The blktrace will normally capture events during the process. Here is a cheat sheet to understand blktrace event capture.

photo credits: programering.com
photo credits: programering.com

blkparse will parse and format events acquired from blktrace. If you do not want to run blkparse, btrace is a shortcut to generate data out of blktrace and blkparse. Finally we have btt which will analyze data from blktrace and generate time deltas for each layer.

Another tool to grasp before moving on with blktrace is debugfs which is a simple-to-use RAM-based file system specially designed for debugging purposes. It exists as a simple way for kernel developers to make information available to user space. Unlike /proc, which is only meant for information about a process, or sysfs, which has strict one-value-per-file rules, debugfs has no rules at all. Developers can put any information they want there.lwn.

So the first thing to do is to mount the debugfs file system using the following command:

mount -t debugfs debugfs /sys/kernel/debug

The aim is to allow a kernel developer to make information available in user space. The output of the command below describe how to mount and verify same. You can use the command mount to test if same has been successful. Now that you have the debug file system, you can capture the events.

Diving into the commands

1.So you can use blktrace to trace out the I/O on the machine.

blktrace -d /dev/sda -o -|blkparse -i -

2. At the same time, on another console launch the following command to generate some I/O for testing purpose.

dd if=/dev/zero of=/mnt/test1 bs=1M count=1

From the blktrace console you will get an output which will end up as follows :

CPU0 (8,0):
 Reads Queued:           2,       60KiB Writes Queued:       5,132,   20,524KiB
 Read Dispatches:        2,       60KiB Write Dispatches:       61,   20,524KiB
 Reads Requeued:         0 Writes Requeued:         0
 Reads Completed:        2,       60KiB Writes Completed:       63,   20,524KiB
 Read Merges:            0,        0KiB Write Merges:        5,070,   20,280KiB
 Read depth:             1         Write depth:             7
 IO unplugs:            14         Timer unplugs:           9
Throughput (R/W): 8KiB/s / 2,754KiB/s
Events (8,0): 21,234 entries
Skips: 166 forward (1,940,721 -  98.9%)

3. Same result can also be achieved using the btrace command. Apply the same principle as in part 2 once the command has been launched.

btrace /dev/sda

4. In part 1, 2 and 4 the blktrace commands were launched in such a way that it will run forever – without exiting. In this particular example,  I will output the file name for analysis. Assume that you want to run the blktrace for 30 seconds, the command will be as follows:

blktrace -w 30 -d /dev/sda -o io-debugging

5. On another console, launch the following command:

dd if=/dev/sda of=/dev/null bs=1M count=10 iflag=direct

6. Wait for 30 seconds to allow step 4 to be completed. I got the following results just after:

[[email protected] mnt]# blktrace -w 30 -d /dev/sda -o io-debugging
=== sda ===
  CPU  0:                  510 events,       24 KiB data
  Total:                   510 events (dropped 0),       24 KiB data

7. You will notice at the directory /mnt  the file will be created. To read it use the command blkparse.

blkparse io-debugging.blktrace.0 | less

8. Now let’s see a simple extract from the blkparse command:

8,0    0        1     0.000000000  5686  Q   R 0 + 1024 [dd]
8,0    0        0     0.000028926     0  m   N cfq5686S / alloced
8,0    0        2     0.000029869  5686  G   R 0 + 1024 [dd]
8,0    0        3     0.000034500  5686  P   N [dd]
8,0    0        4     0.000036509  5686  I   R 0 + 1024 [dd]
8,0    0        0     0.000038209     0  m   N cfq5686S / insert_request
8,0    0        0     0.000039472     0  m   N cfq5686S / add_to_rr

The first column shows the device major,minor tuple, the second column gives information about the CPU and it goes on with the sequence, the timestamps, PID of the process issuing the IO process. The 6th column shows the event type, e.g. ‘Q’ means IO handled by request queue code. Please refer to above diagram for more info. The 7th column is R for Read, W for Write, D for block, B for Barrier operation and finally the last one is the block number and a following + number is the number of blocks requested. The final field between the [ ] brackets is the process name of the process issuing the request. In our case, I am running the command dd.

9.The output can be also analyzed using btt command. You will get almost the same information.

btt -i io-debugging.blktrace.0

Some interesting information here is D2C means the amount of time the IO has been spending into the device whilst Q2C means the total time take as there might be different IO concurrent.

A graphical user interface to makes life easier

The Seekwatcher program generates graphs from blktrace runs to help visualize IO patterns and performance. It can plot multiple blktrace runs together, making it easy to compare the differences between different benchmark runs. You should install the seekwatcher package if you need to visualize detailed information about IO patterns.

The command to be used to generate a picture to for analysis is as follows where seek.png is the output of the png name given.

seekwatcher -t io-debugging.blktrace.0 -o seek.png

What is also interesting is that you can create a movie-like with seekwatch to view the graph.

seekwatcher -t io-debugging.blktrace.0 -o seekmoving.mpg --movie

Tips:

  • For the debugfs, you can also edit the /etc/fstab file to make the mount point permanently. 

  • By default, blktrace will capture all events. This can be limited with the argument -a.
  • In case you want to capture persistently for a long time or for a certain amount of time, then use the argument -w.
  • blktrace will also stored the extracted data in local directory with a format device.blktrace.cpu, for example sda1.blktrace.cpu.
  • At step 1 and 2, you will need to fire a CTRL +C to stop the blktrace.
  • As seen in part 2, you have created test test1 file, do delete it same may consume disk space on your machine.

  • On part 5, the size of the file created in the  /mnt will not exceeds more that 1M since same has been specified in the command.
  • At part 9, you will noticed several other information which can be helpful such as D2C and  Q2C.
    • Q2D latency – time from request submission to Device.
    • D2C latency – Device latency for processing request.
    • Q2C latency – total latency , Q2D + D2C = Q2C.
  • To be able to generate the movie, you will have to install mencoder with all its dependencies.

Linux memory analysis with Lime and Volatility

Lime is a Loadable Kernel Module (LKM) which allows for volatile memory acquisition from Linux and Linux-based devices, such as Android. This makes LiME unique as it is the first tool that allows for full memory captures on Android devices. It also minimises its interaction between user and kernel space processes during acquisition, which allows it to produce memory captures that are more forensically sound than those of other tools designed for Linux memory acquisition. – Lime. Volatility framework was released at Black Hat DC for analysis of memory during forensic investigations.

Analysing memory in Linux can be carried out using Lime which is a forensic tool to dump the memory. I am actually using CentOS 6 distribution installed on a Virtual Box to acquire memory. Normally before capturing the memory, the suspicious system’s architecture should be well known. May be you would need to compile Lime on the the suspicious machine itself if you do not know the architecture. Once you compile Lime, you would have a kernel loadable object which can be injected in the Linux Kernel itself.

Linux memory dump with Lime

1. You will first need to download Lime on the suspicious machine.

git clone https://github.com/504ensicsLabs/LiME

2. Do the compilation of Lime. Once it has been compiled, you will noticed the creation of the Lime loadable kernel object.

make

3. Now the kernel object have to be loaded into the kernel. Insert the kernel module. Then, define the location and format to save the memory image.

insmod lime-2.6.32-696.23.1.el6.x86_64.ko "path=/Linux64.mem format=lime"

4. You can view if the module have been successfully loaded.

lsmod | grep -i lime

Analysis with Volatility

5. We will now analyze the memory dump using Volatility. Download it from Github.

git clone https://github.com/volatilityfoundation/volatility

6.  Now, we will create a Linux profile. We will also need to download the DwarfDump package. Once it is downloaded go to Tools -> Linux directory, then create the module.dwarf file.

yum install epel-release libdwarf-tools -y && make

7. To proceed further, the System.map file is important to build the profile. The System.map file contains the locations of all the functions active in the compiled kernel. You will notice it inside the /boot directory. It is also important to corroborate the version appended with the System.map file together the version and architecture of the kernel. In the example below, the version is 2.6.32-696.23.1.el6.x86_64.

8. Now, go to the root of the Volatility directory using cd ../../ since I assumed that you are in the linux directory. Then, create a zip file as follows:

zip volatility/plugins/overlays/linux/Centos6-2632.zip tools/linux/module.dwarf /boot/System.map-2.6.32-696.23.1.el6.x86_64

9. The volatility module has now been successfully created as indicated in part 8 for the particular version of the Linux and kernel version. Time to have fun with some Python script. You can view the profile created with the following command:

python vol.py --info | grep Linux

As you can see the profile LinuxCentos6-2632 profile has been created.

10. Volatile contains plugins to view details about the memory dump performed. To view the plugins or parsers, use the following command:

python vol.py --info | grep -i linux_

11. Now imagine that you want to see the processes running at the time of the memory dump. You will have to execute the vol.py script, specify the location of the memory dump, define the profile created and call the parser concerned.

python vol.py --file=/Linux64.mem --profile=LinuxCentos6-2632x64 linux_psscan

12. Another example to recover the routing cache memory:

python vol.py --file=/Linux64.mem --profile=LinuxCentos6-2632x64 linux_route_cache

Automating Lime using LiMEaid

I find the LiMEaid tools really interesting to remote executing of Lime. “LiMEaide is a python application designed to remotely dump RAM of a Linux client and create a volatility profile for later analysis on your local host. I hope that this will simplify Linux digital forensics in a remote environment. In order to use LiMEaide all you need to do is feed a remote Linux client IP address, sit back, and consume your favorite caffeinated beverage.” – LiMEaid

Tips:

  • Linux architecture is very important when dealing with Lime. This is probably the first question that one would ask.
  • The kernel-headers package is a must to create the kernel loadable object.
  • Once a memory dump have been created, its important to take a hash value. It can be done using the command md5sum Linux64.mem
  • I would also consider to download the devel tools using yum groupinstall “Development Tools” -y
  • As good practice as indicated in part 8 when creating the zip file, use the proper convention when naming the file. In my case I used the OS version and the kernel version for future references.
  • Not all Parsers/Plugins will work with Volatile as same might not be compatible with the Linux system.
  • You can check out the Volatile wiki for more info about the Parsers.

 


Auditing Linux Operating System with Lynis

Auditing a Linux System is one of the most important aspect when it comes to security. After deploying a simple Centos 7 Linux machine on virtual box, I made an audit using Lynis. It is amazing how many tiny flaws can be seen right from the beginning of a fresh installation. Lynis Enterprise performs security scanning for Linux, macOS, and Unix systems. It helps you discover and solve issues quickly, so you can focus on your business and projects again.Cisofy.

Credits: cisofy.com
Credits: cisofy.com

Introduction

The Lynis tool performs both security and compliance auditing. It has a free and paid version which comes very handy especially if you are on a business environment. The installation of the Lynis tool is pretty simple. You can install it through the Linux repository itself, download the tar file or clone it directly from Github.

 

Scanning Performed by Lynis

1. I downloaded the tar file with the following command:

wget https://cisofy.com/files/lynis-2.6.0.tar.gz

2. Then, just untar the file and get into it

tar -xzf lynis-2.6.0.tar.gz && cd lynis

3. Once into the untar directory, launch the following command:

./lynis audit system --quick

 
As you can see from the output above, there are several suggestions at the end of the scan. In case the paid version of the application was used, more information and commands as how to remediate the situation would be given including support from Lynis. As regards to the free version, you can also debug by yourself several security aspects from the suggestions.
 
Suggestions, Compliance and Improvement.

 
1.The first two suggestions were about minimum and maximum password age.

Configure minimum password age in /etc/login.defs [AUTH-9286]

Configure maximum password age in /etc/login.defs [AUTH-9286]

To check the minimum and maximum password age, use the chage command :
chage -l
 

2. Use chage -m root to set the minimum password age and chage -M root to set maximum password age:

Also, you will have to set the parameter in the /etc/login.defs file

3. Delete accounts which are no longer used [AUTH-9288]

It is also suggested to delete accounts which are no longer in use. This suggestion was prompted as I created a user  “nitin” account during installation and did not use it yet. For the purpose of this blog, I deleted it using userdel -r nitin

4. Default umask in /etc/profile or /etc/profile.d/custom.sh could be more strict (e.g. 027) [AUTH-9328]

Default umask values are taken from the information provided in the /etc/login.defs file for RHEL (Red Hat) based distros. Debian and Ubuntu Linux based system use /etc/deluser.conf. To change default umask value to 027 which is actually 022 by default, you will need to modify the /etc/profile script as follows:

5. To decrease the impact of a full /home file system, place /home on a separated partition [FILE-6310]

  To decrease the impact of a full /tmp file system, place /tmp on a separated partition [FILE-6310]

  To decrease the impact of a full /var file system, place /var on a separated partition [FILE-6310]

In the article Move your /home to another partition, you will have detailed explanations to sort out this issue.

6. Disable drivers like USB storage when not used, to prevent unauthorized storage or data theft [STRG-1840]

   Disable drivers like firewire storage when not used, to prevent unauthorized storage or data theft [STRG-1846]

To disable USB and firewire storage drivers, add the following lines in /etc/modprobe.d/blacklist.conf then do a modprobe usb-storage && modprobe firewire-core

blacklist firewire-core
blacklist usb-storage

7. Split resolving between localhost and the hostname of the system [NAME-4406]

This issue is only about hostname and localhost in /etc/hosts which could confuse some applications installed on the machine. According to cisofy, for proper resolving, the entries of localhost and the local defined hostname, could be split. Using some middleware and some applications, resolving of the hostname to localhost, might confuse the software.

8. Install package ‘yum-utils’ for better consistency checking of the package database [PKGS-7384]

      Consider running ARP monitoring software (arpwatch,arpon) [NETW-3032]

The yum-utils and arpwatch are nice tools to perform more debugging and verification. Install it using the following commands:

yum install yum-utils arpwatch -y

9. You are advised to hide the mail_name (option: smtpd_banner) from your postfix configuration. Use postconf -e or change your main.cf file (/etc/postfix/main.cf) [MAIL-8818]

You just have to uncomment the following line and lauch a postconf -e. However, since this is a fresh install, and I’m not using postfix, it is better to stop the service.

 10.  Check iptables rules to see which rules are currently not used [FIRE-4513]

Since, I’m not on a production environment, it is very difficult to identify unused iptables rules right now. Once on the production environment, this situation is different. According to Cisofy, the best way is to “use iptables –list –numeric –verbose to display all rules. Check for rules which didn’t get a hit and repeat this process several times (e.g. in a few weeks). Finally remove any unneeded rules.”

 11. Consider hardening SSH configuration [SSH-7408]

  •     – Details  : AllowTcpForwarding (YES –> NO)
  •     – Details  : ClientAliveCountMax (3 –> 2)
  •     – Details  : Compression (YES –> (DELAYED|NO))
  •     – Details  : LogLevel (INFO –> VERBOSE)
  •     – Details  : MaxAuthTries (6 –> 2)
  •     – Details  : MaxSessions (10 –> 2)
  •     – Details  : PermitRootLogin (YES –> NO)
  •     – Details  : Port (22 –> )
  •     – Details  : TCPKeepAlive (YES –> NO)
  •     – Details  : UseDNS (YES –> NO)
  •     – Details  : X11Forwarding (YES –> NO)
  •     – Details  : AllowAgentForwarding (YES –> NO)

Again, hardening SSH is one of the most important to evade attacks especially from SSH bots. It all depends how your network infrastructure is configured and whether it is accessible from the internet or not. However, these details viewed are very informative.

12. Periodic system scan, malware and ransomware scanners are now a must. According to statistics, servers are being hacked constantly. Pervasive Monitoring is becoming a heavy cash deal for malicious softwares. 

The Lynis Command

Lynis documentation is pretty straight forward with a cheat sheet. The arguments are self explicit. Here are some hints.

1.Performs a system audit which is the most common audit.

lynis audit system

2. Provides command to do a remote scan.

lynis audit system remote <host>

3. Views the settings of default profile.

lynis show settings

4. Checks if you are using most recent version of Lynis

lynis update info

5. More information about a specific test-id

lynis show details <test-id>

6. To scan whole system

lynix --check-all Q

7. To see all available parameters of Lynis

lynis show options

At the end of any Lynis command, it will also prompt you where the logs have been stored for your future references. It is usually in /var/log/lynis.log. The systutorial on lynis is also a good start to grasp the command. All common systems based on Unix/Linux are supported. Examples include Linux, AIX, *BSD, HP-UX, macOS and Solaris. For package management, the following tools are supported:- dpkg/apt, pacman, pkg_info, RPM, YUM, zypper.


Securing MySQL traffic with Stunnel in a jail environment on CentOS

Stunnel is a program by Michal Trojnara that allows you to encrypt arbitrary TCP connections inside SSL. Stunnel can also allow you to secure non-SSL aware daemons and protocols (like POP, IMAP, LDAP, etc) by having Stunnel provide the encryption, requiring no changes to the daemon’s code. It is a proxy designed to add TLS encryption functionality to existing clients and servers without any changes in the programs’ code. Its architecture is optimized for security, portability, and scalability (including load-balancing), making it suitable for large deployments. – Stunnel.org

The concept that lies behind Stunnel is about the encryption methodology that is used when the client is sending a message to a server using a secure tunnel. In this article, we will focus on using MySQL alongside Stunnel. MariaDB Client will access the MariaDB server database using the Stunnel for more security and robustness.



Photo Credits: danscourses.com
Photo Credits: danscourses.com

I will demonstrate the installation and configuration using the CentOS distribution which is on my Virtual Box lab environment. I created two CentOS 7 virtual machines with hostname as stunnelserver and stunnelclient. We will tunnel the MySQL traffic via Stunnel. You can apply the same concept for SSH, Telnet, POP, IMAP or any TCP connection.

The two machines created are as follows:

  1. stunnelserver : 192.168.100.17 – Used as the Server
  2. stunnelclient : 192.168.100.18 – Used as the Client

Basic package installation and configuration on both servers

1. Install the Stunnel and OpenSSL package on both the client and the server.

yum install stunnel openssl -y

2. As we will be using Stunnel over MariaDB, you can use the MariaDB repository tools to get the links to download the repository. Make sure you have the MariaDB-client package installed on the stunnelclient which will be used as client to connect to the server. Also, install both packages on the stunnelserver. The commands to install the MariaDB packages are as follows:

sudo yum install MariaDB-server MariaDB-client

3. For more information about installations of MariaDB, Galera etc, refer to these links:

MariaDB Galera Clustering

MariaDB Master/Master installation

MariaDB Master/Slave installation

Configuration to be carried out on the stunnelserver (192.168.100.17)

 

 

4. Once you have all the packages installed, it’s time to create your privatekey.pem. Then, use the private key to create the certificate.pem. Whilst creating the certificate.pem, it will prompt you to enter some details. Feel free to fill it.

openssl genrsa -out privatekey.pem 2048

openssl req -new -x509 -days 365 -key privatekey.pem -out certificate.pem

5. Now comes the most interesting part to configure the stunnel.conf file by tunnelling it to the MySQL port on the stunnelserver. I observed that the package by default does not come with a stunnel.conf or even a Init script after installing it from the repository. So, you can create your own Init script. Here is my /etc/stunnel/stunnel.conf on the server:

chroot = /var/run/stunnel
setuid = stunnel
setgid = stunnel
pid = /stunnel.pid
debug = 7
output = /stunnel.log
sslVersion = TLSv1
[mysql]
key = /etc/stunnel/privatekey.pem
cert = /etc/stunnel/certificate.pem
accept = 44323
connect = 127.0.0.1:3306

6. Position your privatekey.pem and certificate.pem at /etc/stunnel directory. Make sure you have the right permission (400) on the privatekey.pem.

 

7. Based upon the configuration in part 5, we will now create the /var/run/stunnel directory and assign it with user and group of stunnel:

useradd -G stunnel stunnel && mkdir /var/run/stunnel && chown stunnel:stunnel /var/run/stunnel

8.  The port 44323 is a non reserved port which I chose to tunnel the traffic from the client.

9. As we do not have the Init script by default in the package, start the service as follows:

stunnel /etc/stunnel/stunnel.conf

10. A netstat or ss command on the server will show the Stunnel listening on port 44323.

Configuration to be carried out on the stunnelclient (192.168.100.18)

11. Here is the stunnel.conf file on the stunnel client :

verify = 2
chroot = /var/run/stunnel
setuid = stunnel
setgid = stunnel
pid = /stunnel.pid
CAfile = /etc/stunnel/certificate.pem
client = yes
sslVersion = TLSv1
renegotiation = no
[mysql]
accept = 24
connect = 192.168.100.17:44323

12. Import the certificate.pem in the /etc/stunnel/ directory.

scp <user>@<ipofstunnelserver>:/etc/stunnel/certificate.pem

13. Based upon the configuration in part 11, we will now create the /var/run/stunnel directory and assign it with user and group of stunnel:

useradd -G stunnel stunnel && mkdir /var/run/stunnel && chown stunnel:stunnel /var/run/stunnel

14. You can now start the service on the client as follows:

stunnel /etc/stunnel/stunnel.conf

15. A netstat on the client will show the Stunnel listening on port 24.

16. You can now connect on the MySQL database from your client to your server through the tunnel. Example:

mysql -h 127.0.0.1 -u <Name of Database> -p -P 24



Tips:

  • When starting Stunnel, the log and the pid file will be created automatically inside the jail environment that is /var/run/stunnel.
  • You can also change the debug log level. Level is a one of the syslog level names or numbers emerged (0), alert (1), crit (2), err (3), warning (4), notice (5), info (6), or debug (7). All logs for the specified level and all levels numerically less than it will be shown. Use debug = debug or debug = 7 for greatest debugging output. The default is notice (5).
  • If you compile from source, you will have a free log rotate and Init scripts. Probably on CentOS, it’s not packaged with the script!
  • You can also verify if SSLv2 and SSLv3 have been disabled using openssl s_client -connect 127.0.0.1:44323 -ssl3 and try with -tls1 to compare.
  • For the purpose of testing, you might need to check your firewall rules and SELINUX parameters.
  • You don’t need MariaDB-Server package on the client.
  • Stunnel is running on a Jail environment. The logs and the PID described in part 5 and 11 will be found in /var/run/stunnel.
  • You can invoke stunnel from inetd. Inetd is the Unix ‘super server’ that allows you to launch a program (for example the telnet daemon) whenever a connection is established to a specified port. See the “Stunnel how’s to” for more information. The Stunnel manual can also be viewed here.

 


Out of Memory (OOM) in Linux machines

Since some months I have not been posting anything on my blog. I should admit that I was really busy. Recently, a friend asked me about the Out of Memory messages in Linux. How is it generated? What are the consequences? How can it be avoided in Linux machines? There is no specific answer to this as an investigation had to be carried out to have the Root Cause Analysis. Before getting into details about OOM, let’s be clear that whenever the Kernel is starved of memory, it will start killing processes. Many times, Linux administrators will experience this and one of the fastest way to get rid of it is by adding extra swap. However, this is not the definite way of solving the issue. A preliminary assessment needs to be carried out followed by an action plan, alongside, a rollback methodology.

If after killing some processes, the kernel cannot free up some memory, it might lead to a kernel panic, deadlocks in applications, kernel hungs, or several defunct processes in the machine. I know cases where the machine change run level mode. There are cases of kernel panic in virtual machines where the cluster is not healthy. In brief, OOM is a subsystem to kill one or more processes with the aim to free memory. In the article Linux kernel crash simulation using kdump, I gave an explanation how to activate Kdump to generate a vmcore for analysis. However, to send the debug messages during an out of memory error to the vmcore, the SYSCTL file need to be configured. I will be using a CentOS 7 machine to illustrate the OOM parameters and configurations.

1.To activate OOM debug in the vmcore file, set the parameter vm.panic_on_oom to 1 using the following command:

systctl -w vm.panic_on_oom=1

To verify if the configuration has been taken into consideration, you can do a sysctl -a | grep -i oom. It is not recommended to test such parameters in the production environment.

2. To find out which process the kernel is going to kill, the kernel will read a function in the kernel code called badness() . The badness() calculate a numeric value about how bad this task has been. To be precise, it works by accumulating some “points” for each process running on the machine and will return those processes to a function called select_bad_process() in the linux kernel. This will eventually start the OOM mechanism which will kill the processes. The “points” are stored in the /proc/<pid>/oom_score. For example, here, i have a server running JAVA.

As you can see, the process number is 2153. The oom_score is 21

3. There are lots of considerations that are taken into account when calculating the badness score. Some of the factors are the Virtual Memory size (VM size), the Priority of the Process (NICE value), the Total Runtime, the Running user and the /proc/<pid>/oom_adj. You can also set up the oom_score_adj value for any PID between -1000 to 1000. The lowest possible value, -1000, is equivalent to disabling OOM killing entirely for that task since it will always report a badness score of 0.

4. Let’s assume that you want to prevent a specific process from being killed.

echo -17 > /proc/$PID/oom_adj

5. If you know the process name of SSH Daemon and do not it from being killed, use the following command:

pgrep -f "/usr/sbin/sshd" | while read PID; do echo -17 > /proc/$PID/oom_adj; done

6. To automate the sshd from being killed through a cron which will run each minute use the following:

* * * * * root pgrep -f "/usr/sbin/sshd" | while read PID; do echo -17 > /proc/$PID/oom_adj; done
7. Let's now simulate the OOM killer messages. Use the following command to start an out of memory event 
on the machine.
echo f > /proc/sysrq-trigger 

You will notice an OOM error message in the /var/log/messages.
As you can notice here, the PID 843 was calculated by the OOM killer before killing it. 
There is also the score number which is 4 in our case.


Before the 'Out of memory' error, there will be a call trace which will be sent by the kernel.



8. To monitor how the OOM killer is generating scores, you can use the dstat command. To install the dstat 
package on RPM based machine use: 
yum install dstat 

or for debian based distribution use:
apt-get install dstat

Dstat is used to generate resource statistics. To use dstat to monitor the score from OOM killer use:
dstat -top-oom


TIPS:

  • oom_score_adj is used in new linux kernel. The deprecated function is oom_adj in old Linux machine.
  • When disabling OOM killer under heavy memory pressure, it may cause the system to kernel panic.
  • Making a process immune is not a definite way of solving problem, for example, when using JAVA Application. Use a thread/heap dump to analyse the situation before making a process immune.
  • Dstat is now becoming an alternative for vmstat, netstat, iostat, ifstat and mpstat. For example, to monitor CPU in a program, use dstat -c –top-cpu -dn –top-mem
  • Testing in production environment should be avoided!