Besides basic capacity metrics, I have been checking out CPU load (queues). Load is the average number of processes running or waiting to run on a system (on Linux it also counts processes blocked on disk I/O). Think of the line-ups in a store, where each check-out is a CPU core. As long as there are no line-ups (load/queues) for the check-out clerks, there is no issue.
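To put a number on the check-out analogy, here is a minimal Python sketch that divides the load average by the number of cores. The function names `load_per_core` and `current_load_per_core` are my own, made up for illustration; `os.getloadavg()` and `os.cpu_count()` are standard library calls, and `os.getloadavg()` only works on Unix-like systems.

```python
import os


def load_per_core(load_average, cores):
    """Return the load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


def current_load_per_core():
    """Read the 1-minute load average of this machine and normalise it per core."""
    one_minute, _five, _fifteen = os.getloadavg()  # Unix-like systems only
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return load_per_core(one_minute, cores)


if __name__ == "__main__":
    print(f"1-minute load per core: {current_load_per_core():.2f}")
```

A value that stays well below 1.0 roughly means the check-outs are keeping up; a value above 1.0 means there is a line-up somewhere.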
In this case there were line-ups for the CPU cores, but not all of the check-outs (cores) were being used. On a web application that is supposed to be multi-threaded, this is a clear indication of an issue. The cause can be I/O related, too many application instances running, or just poor threading/forking in the program code.
In this instance the root cause was having debug=on for Java, which created an I/O bottleneck to disk. This issue had been bothering me for a week or so, but this morning, while working with a Linux guy, we saw a server with .01% CPU and queues of 2, and I knew we had a clear victim to investigate.
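The pattern we spotted (a queue with almost no CPU in use) can be written down as a simple check. This is only a rough sketch: the function name `looks_io_bound` and both thresholds are assumptions of mine, not values from any tool, and you would feed it numbers from whatever monitoring you already have.

```python
def looks_io_bound(load, cpu_busy_percent, load_threshold=1.0, cpu_threshold=10.0):
    """Flag a server whose load is queued up while its CPUs sit nearly idle.

    load              -- load average (queue length) reported by the system
    cpu_busy_percent  -- overall CPU utilisation, 0 to 100
    load_threshold    -- hypothetical cut-off for "there is a line-up"
    cpu_threshold     -- hypothetical cut-off for "the CPUs are barely used"
    """
    return load >= load_threshold and cpu_busy_percent < cpu_threshold


if __name__ == "__main__":
    # The server from this post: .01% CPU with queues of 2.
    print(looks_io_bound(2.0, 0.01))
```

A busy server with high CPU and high load would not be flagged, because there the line-up is explained by the cores actually working.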
Last week a similar issue was encountered with sftp debug options.
Hey! If you're in production, turn off debug.
Regards.
Scott Wardley