Disk usage management and analysis on servers can become extremely annoying, even disastrous, if not carried out regularly. However, you can also brew some scripts to sort out the mess. In large companies, monitoring systems are usually set up to handle the situation, especially where there are several servers with sizeable partitions allocated for, say, the /var/log directory. An inventory tool such as OCS Inventory can also be used to monitor a large number of servers.
This blog post will be kept updated as my own notebook of the commands I use when managing disk usage. Feel free to send me your own tricks and tips, and I will add them to the article here 🙂
Managing disk space with the ‘find’ command
1. Find the total size of everything under the current directory that has not been modified for more than 1000 days.
find . -mtime +1000 -exec du -csh {} + | grep total$
2. Find files larger than 50 MB on the root filesystem (without descending into other mounted filesystems) and list them in long, human-readable format.
find / -xdev -type f -size +50M -exec ls -lh '{}' ';'
3. Find in /tmp every file or directory matching develop* with an mtime of more than 1 day and delete it.
find /tmp -name "develop*" -mtime +1 -exec rm -rf {} \;
4. Dump the list of directories under /home to /tmp/uniqDirectory, then count the number of entries inside each of them.
find /home -type d > /tmp/uniqDirectory && for i in $(cat /tmp/uniqDirectory); do echo $i; ls -l $i | wc -l; done
5. Find all files having the extension .sh or .jar and calculate their total size (run this from within /tmp, or replace . with the target path).
find . -type f \( -iname "*.sh" -or -iname "*.jar" \) -exec du -csh {} + | grep total$
6. Find all files in /tmp, check which ones are not being used by any process, and delete those that are unused.
find /tmp -type f | while read files ; do fuser -s $files || rm -rf $files ; done
7. During an incident, I once encountered a VM whose disk had switched to read-only mode after an intervention on the SAN. After several hours, the disk was back in read-write mode. At that time, there were several processes using the disk, such as screen, ATP, NFS, etc. I noticed that disk usage on the /var partition had climbed to 90%, even though the du command did not show the same amount consumed. To troubleshoot the issue, the following command came in handy and revealed the process still holding deleted files open. After restarting the service, usage was back to 2%.
lsof | grep "/var" | grep deleted
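If restarting the service is not an option, a common workaround is to truncate the deleted file through the /proc filesystem. The PID and file descriptor number come from the lsof output above; the values below (PID 1234, FD 5) are only placeholders:
: > /proc/1234/fd/5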
Another interesting issue that you might encounter is a sudden increase in log size caused by an application failure, for example a sudden burst of binary logs generated by MySQL, or a core dump being generated!
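A quick way to spot the culprit is to look for large files modified recently; the paths below (/var/log and /var/lib/mysql) are only examples and should be adapted to your setup:
find /var/log /var/lib/mysql -xdev -type f -mtime -1 -size +100M -exec ls -lh {} +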
Let’s say we have the crash package installed on a server. The crash package will generate a core dump for analysis of why the application crashed. This is sometimes annoying, as you cannot predict when an application is going to fail, especially if you have many developers and system administrators working on the same platform. I would suggest a script which sends a mail to a particular user whenever a core dump has been generated. I placed a script here on GitHub to handle such a situation.
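As a rough idea of what such a script can look like (this is only a minimal sketch, not the script linked above; it assumes core dumps land in /var/crash and that mailx is available):
#!/bin/bash
# Mail an alert if a core dump newer than one day is found under /var/crash
DUMPS=$(find /var/crash -type f -name "core*" -mtime -1)
if [ -n "$DUMPS" ]; then
    echo "$DUMPS" | mailx -s "Core dump detected on $(hostname)" admin@example.com
fi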
Naturally, logrotate is of great help, as are cron jobs to purge certain temporary logs. The du command is helpful, but when it comes to picking and choosing files for a particular reason, you will need to handle the situation with the find command.
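For instance, a cron entry like the following (the path /var/log/myapp and the 30-day retention are just examples) purges old temporary logs every night at 3 a.m.:
0 3 * * * find /var/log/myapp -name "*.log" -mtime +30 -delete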
Tips:
- You should be extremely careful when deleting files with the find command. Imagine a replication log file still being used by an Oracle database server getting deleted: that would be disastrous.
- Running lsof against a mount point can also be interesting when troubleshooting disk usage (see the example after this list).
- Also, make sure that you check the content of a log before acting on it, as any file can be named *log or *_log.
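For example, the following commands (with /var as the mount point and a hypothetical suspicious_file_log as a placeholder) list everything open on the /var filesystem and peek at a candidate file before deciding what to do with it:
lsof /var
file /var/tmp/suspicious_file_log && head -n 20 /var/tmp/suspicious_file_log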