Disk response time in milliseconds shows just how well a system is set up and configured. If you capture the response time of every disk, with iostat or via an agent collection (BMC/HP OVPA), you can oversee the health of the whole environment.
With most disks giving better than 5 ms response times, anything greater may be a cause for concern. If the slow disks are disks 0 and 1 on Unix (the internal disks) or C: and D: on Windows, and the response time is above 5 ms, then chances are either an application installed locally is thrashing the disk or the system is paging memory due to a bad setup or a lack of physical memory.
If the disks are SAN attached, a response time above 5 ms is more of a concern, because the SAN has a large cache in front of its disks, so you are writing to RAM instead of disk. SAN response time should be much better, and if you are having issues then the cause could be something like a bad query that is doing a table scan instead of an index lookup. Paging wouldn't be an issue here unless this is a diskless server, which is more and more common.
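As a rough illustration of the 5 ms rule of thumb, here is a minimal Python sketch that flags slow devices in an iostat-style report. The sample text, the `slow_disks` name and the single `await` column are assumptions made for the example: some iostat versions split that figure into `r_await` and `w_await`, and other platforms use different column names, so adjust the column lookup to match what your collector actually prints.

```python
# Illustrative sample only, not real measurements. The layout mimics an
# "iostat -x" style report with a single "await" column (an assumption;
# check the header your own iostat prints).
SAMPLE = """\
Device:  r/s   w/s   await  %util
sda      1.0   2.0   3.20   4.0
sdb      5.0   9.0   12.75  61.0
dm-0     0.5   0.1   5.00   1.0
"""

THRESHOLD_MS = 5.0  # the rule-of-thumb limit discussed above


def slow_disks(report, threshold_ms=THRESHOLD_MS):
    """Return {device: response_time_ms} for devices above the threshold."""
    lines = [ln.split() for ln in report.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    col = header.index("await")  # raises ValueError if the column is missing
    slow = {}
    for row in rows:
        value = float(row[col])
        if value > threshold_ms:  # strictly greater than the limit
            slow[row[0]] = value
    return slow


if __name__ == "__main__":
    for device, ms in slow_disks(SAMPLE).items():
        print(f"{device}: {ms:.2f} ms")
```

Run against data collected over time rather than a single snapshot, a check like this gives you the list of disks worth investigating first.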
Try to find the process ID that is accessing the disk/LUN. Your first stop is your senior Unix/Windows team leads, who can pinpoint the process and the LUN/disk. If it's a database LUN/disk, approach the DBAs, ask which database is on the LUN, and ask for the SQL query that the process ID is running. They should be able to investigate, identify the problem and thank you for helping out. If it's an application process, use the metrics, raise a ticket/incident to log the problem and go to the application owners/vendor with the details.
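If you want a shortlist of suspect processes before you talk to the team leads, one possible approach on Linux is to compare two snapshots of each process's `/proc/<pid>/io` counters. The Python sketch below works on text already read from those files; the function names and the sample snapshots are hypothetical. Note that these counters are per process, not per disk or LUN, and reading them for other users' processes normally needs root, so treat the result as a starting point only.

```python
def parse_proc_io(text):
    """Parse the 'name: value' lines of a /proc/<pid>/io file into a dict."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            counters[name.strip()] = int(value)
    return counters


def disk_bytes(counters):
    """Bytes the process actually caused to be read from or written to storage."""
    return counters.get("read_bytes", 0) + counters.get("write_bytes", 0)


def rank_by_io(before, after):
    """Rank PIDs present in both snapshots by bytes moved between them."""
    deltas = []
    for pid, text in after.items():
        if pid in before:
            moved = disk_bytes(parse_proc_io(text)) - disk_bytes(parse_proc_io(before[pid]))
            deltas.append((pid, moved))
    return sorted(deltas, key=lambda item: item[1], reverse=True)


# Hypothetical snapshots taken a few seconds apart (made-up numbers).
BEFORE = {
    101: "read_bytes: 1000\nwrite_bytes: 2000\n",
    202: "read_bytes: 500\nwrite_bytes: 500\n",
}
AFTER = {
    101: "read_bytes: 1500\nwrite_bytes: 2000\n",
    202: "read_bytes: 900000\nwrite_bytes: 700\n",
    303: "read_bytes: 10\nwrite_bytes: 10\n",  # started later, so it is skipped
}

if __name__ == "__main__":
    for pid, moved in rank_by_io(BEFORE, AFTER):
        print(pid, moved)
```

The process at the top of the list is the one to take to the DBAs or application owners along with the disk response time metrics.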
Don't let people reboot the box, as that may just delay finding the issue until it happens again. Shoot for root cause, and when the issue is resolved, take credit, document it and put a savings figure on the cost avoidance of a system upgrade. Then pull the list out and show your executive the next time you need money to add to your Capacity Planning tool kit.
Regards.
Scott Wardley