Sunday, November 28, 2010

Use statistics to tell the story.

You are more than likely to run into a system that has more problems with politics than capacity.

These systems usually have basic capacity issues related to disk I/O and memory that are impacting CPU utilization, shown as I/O wait (waitio). But because there are money people driving decisions instead of technical leaders, we end up with a major spend or project occurring instead of basic tuning.

Use statistics to tell the story. Lots of emotion surrounds these types of systems, and it's best to gather accurate statistics and be the provider of information that helps people make decisions or reduce risk. When people are ready they will use the data, and your message will get out.

iostat, when used over a good sampling period, is one of the best tools, followed by looking at run queues from a top command. Not everything can be resolved by an upgrade, especially something like FTP security logging with debug enabled that flogs the root disks.
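As a rough illustration of turning a sampling period into numbers people can act on, here is a minimal Python sketch. It assumes the %iowait readings have already been collected (for example, copied out of iostat's CPU report) into a plain list. The function name, the 20% threshold and the sample values are all hypothetical, not something iostat itself provides.

```python
import math
from statistics import mean

def summarize_iowait(samples, threshold=20.0):
    """Summarize %iowait readings taken over a sampling period.

    `samples` is a list of percentages collected elsewhere;
    `threshold` is an illustrative cut-off, not a standard value.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank 95th percentile: the value at position ceil(0.95 * n).
    rank = max(1, math.ceil(0.95 * len(ordered)))
    over = sum(1 for s in samples if s > threshold)
    return {
        "samples": len(samples),
        "mean": round(mean(samples), 1),
        "p95": ordered[rank - 1],
        "peak": ordered[-1],
        "pct_over_threshold": round(100.0 * over / len(samples), 1),
    }

if __name__ == "__main__":
    # Made-up readings purely for demonstration.
    print(summarize_iowait([5.0, 10.0, 15.0, 40.0]))
```

A mean on its own can hide the spikes, so the sketch also reports the peak and the share of samples over the threshold. Those are usually the numbers that tell the story.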

Regards.

Scott.

Monday, November 8, 2010

Heat Maps

With millions of pieces of data and numerous untold stories, the typical monthly report gets complicated. Expecting 20 pages per domain is common if supporting detail is included. If your remit covers Servers, Storage, Virtual Servers, Citrix, E-Messaging, Middleware, VOIP, Backup, Data Centre and Networks, the amount of detail is just too much for even the advanced technical person to take in.

Add on pricing, people and risk components, and it's no surprise your typical Executive is not getting a clear picture of where to put resources to help you.

Enter the Heat Map. One page that tells it all and shows it all in one view: the Application or Infrastructure component, a comment describing the issue, and a timeline to show just how long the problem has existed. The timeline is important because it shows that you and others have known about the problem and have had time to come up with a resolution. It also gives clarity into Management's decision-making process, or lack of it.

Some rules exist with Heat Maps. They can be used for both Application and Infrastructure components. Any yellow item requires a resolution plan or written agreement from the Executive that they are willing to live with the risk. All yellow items are reviewed every month to make sure none are forgotten. All red items get escalated to both the Business and the IT Executive with a very clear statement like:
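The rules above could be captured in a few lines of Python. This is only a sketch: the `HeatMapItem` fields, the `required_actions` function and the "green" status are my own illustrative names and assumptions, not part of any standard tool.

```python
from dataclasses import dataclass

@dataclass
class HeatMapItem:
    component: str            # Application or Infrastructure component
    status: str               # "green", "yellow" or "red" ("green" is assumed)
    comment: str              # short description of the issue
    months_open: int          # timeline: how long the issue has existed
    has_resolution_plan: bool = False
    risk_accepted_in_writing: bool = False

def required_actions(item):
    """Return the follow-up actions the Heat Map rules call for."""
    actions = []
    if item.status == "yellow":
        # Yellow needs a plan or a written acceptance of the risk.
        if not (item.has_resolution_plan or item.risk_accepted_in_writing):
            actions.append("obtain a resolution plan or written risk acceptance")
        # Every yellow item is reviewed monthly so it is not forgotten.
        actions.append("review at the monthly meeting")
    elif item.status == "red":
        actions.append("escalate to the Business and the IT Executive")
    return actions

if __name__ == "__main__":
    item = HeatMapItem("Oracle Inventory", "red", "thresholds breached daily", 6)
    print(item.component, required_actions(item))
```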

"Your application, Oracle Inventory, has been on our Issue list for the last 6 months and has gone into a critical state. We have highlighted this monthly, and you have agreed to accept the risk and proceed with the 'no action' plan. We believe that in its current state we cannot provide consistency of service, and we request that we turn off threshold monitoring, as the thresholds are being breached daily. We will remove the application from our support matrix and put it into a best-effort-only status. Please Approve."

Send it to the Application owner and CC their boss, and their boss, and maybe even their boss. Everyone is busy, and it could be that one person's view of an application is different from another's. Or one boss's plans for an application are different from the others'. One level may be thinking of phasing out the application, and his/her boss could be thinking of phasing out its Manager and building up the application. So cover all bases and make sure everyone knows there is risk.

Risk may also be related to the time required to react to a situation. For instance, a SAN addition takes 4 months to validate, quote, procure, build, ship, site prepare, deliver, install, cable, configure and test before going live. So turn things to a red status early if you need to.
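One way to make "turn it red early" less of a gut call is to compare how long you have left against how long the fix takes. The sketch below is only an illustration: the function name, the one-month warning buffer and the "green" fallback are assumptions of mine; the 4-month lead time is the SAN example above.

```python
def status_for_lead_time(months_until_needed, lead_time_months, buffer_months=1):
    """Pick a Heat Map colour from remaining time versus lead time.

    All thresholds here are illustrative; tune them to your own environment.
    """
    if months_until_needed <= lead_time_months:
        # No slack left: the work must start now or it will be late.
        return "red"
    if months_until_needed <= lead_time_months + buffer_months:
        # Inside the warning buffer: get a plan agreed.
        return "yellow"
    return "green"

if __name__ == "__main__":
    san_lead_time = 4  # months, from the SAN example
    for months_left in (3, 5, 9):
        print(months_left, status_for_lead_time(months_left, san_lead_time))
```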

Start with the one-page Heat Map, because sometimes you only get 15 minutes of focus from an Executive. If they are good, they will take the time to go through all the Capacity Reports, as it's their chance to drive strategy into the Infrastructure. Or perhaps this month's fire has their mind occupied. Either way, start out on the Heat Map and let it tell its story.

Regards.

Scott Wardley