Monday, December 13, 2010

Debug and CPU queues

As part of peak-season preparations, I have been reviewing multiple environments with the various Application Owners.

Besides basic capacity metrics I have been checking CPU load (run queues). Load is the average number of processes waiting to run on a system. Think of check-out lines at a store, where each check-out lane is a CPU core. As long as there are no line-ups (load/queues) at the check-outs, there is no issue.

In this case there were line-ups for the CPU cores, but not all of the check-outs (cores) were being used. On a web application that is supposed to be multi-threaded, this is a clear indication of an issue. The cause can be IO related, too many application instances running, or simply poor threading/forking in the program code.

In this instance the root cause was having debug=on for Java, which created an IO bottleneck to disk. The issue had been bothering me for a week or so, but this morning, working with one of our Linux guys, we saw a server with 0.01% CPU and a queue of 2, and I knew we had a clear victim to investigate.
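
For what it's worth, a quick check like the one below is enough to flag that kind of victim: nearly idle CPU but a run queue that never drains. It is a rough Python sketch for Linux only, standard library only, and the thresholds are invented for illustration.

# Minimal sketch (Linux, standard library only): flag a host that is nearly
# idle on CPU but still queuing work -- the "0.01% CPU, queue of 2" signature.
# The thresholds below are illustrative, not recommendations.
import os
import time

def cpu_busy_percent(interval=1.0):
    """Approximate CPU busy % from two /proc/stat samples."""
    def snapshot():
        with open("/proc/stat") as f:
            fields = [float(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]          # idle + iowait
        return sum(fields), idle
    total1, idle1 = snapshot()
    time.sleep(interval)
    total2, idle2 = snapshot()
    busy = (total2 - total1) - (idle2 - idle1)
    return 100.0 * busy / (total2 - total1)

load1, _, _ = os.getloadavg()                 # 1-minute load average
cores = os.cpu_count() or 1
busy = cpu_busy_percent()

if busy < 5.0 and load1 >= 1.0:
    print(f"Suspect: {busy:.2f}% CPU busy but load {load1:.2f} on {cores} cores "
          "-- likely blocked on IO (check debug logging).")
else:
    print(f"OK: {busy:.2f}% busy, load {load1:.2f} across {cores} cores.")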

Last week a similar issue was encountered with sftp debug options.

Hey! If you're in production, turn off debug.

Regards.

Scott Wardley


Sunday, November 28, 2010

Use statistics to tell the story.

You are more than likely to run into a system that has more problems with politics than capacity.

These systems usually have basic capacity issues related to disk IO and memory that impact CPU utilization, showing up as wait IO. But because Money people are driving decisions instead of Technical Leaders, we end up with a major spend/project instead of basic tuning.

Use statistics to tell the story. Lots of emotion surrounds these types of systems, and it's best to gather accurate statistics and be the provider of information that helps people make decisions or reduce risk. When people are ready they will use the data, and your message will get out.

iostat, used over a good sampling period, is one of the best tools, followed by looking at run queues off a top command. Not everything can be resolved by an upgrade, especially something like FTP security logging with debug enabled that flogs the root disks.
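
To show the idea of letting the numbers talk, here is a rough Python sketch (Linux only, standard library; the interval and sample count are just placeholders for a "good sampling period") that samples the 1-minute load and %iowait and prints a small summary you can paste into an email:

# Sketch: sample run queue (1-minute load) and %iowait over a period and
# summarise them, so the numbers -- not emotion -- tell the story.
# Interval and sample count are placeholders, not recommendations.
import os
import time

def cpu_times():
    with open("/proc/stat") as f:
        vals = [float(x) for x in f.readline().split()[1:]]
    return sum(vals), vals[4]                  # total jiffies, iowait jiffies

def sample(interval=30, count=20):
    loads, iowaits = [], []
    total_prev, iow_prev = cpu_times()
    for _ in range(count):
        time.sleep(interval)
        total, iow = cpu_times()
        loads.append(os.getloadavg()[0])
        iowaits.append(100.0 * (iow - iow_prev) / (total - total_prev))
        total_prev, iow_prev = total, iow
    return loads, iowaits

loads, iowaits = sample()
print(f"load avg : mean {sum(loads)/len(loads):.2f}  max {max(loads):.2f}")
print(f"%iowait  : mean {sum(iowaits)/len(iowaits):.2f}  max {max(iowaits):.2f}")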

Regards.

Scott.

Monday, November 8, 2010

Heat Maps

With millions of pieces of data and numerous untold stories, the typical monthly report gets complicated. Expecting 20 pages per domain is common if supporting detail is included. If your remit covers Servers, Storage, Virtual Servers, Citrix, E-Messaging, Middleware, VOIP, Backup, Data Centre and Networks, the amount of detail is just too much for even the advanced Technical person to take in.

Add on pricing, people and risk components, and it's no surprise your typical Executive is not getting a clear picture of where to put resources to help you.

Enter the Heat Map. One page that tells it all and shows it all in one view: the Application or Infrastructure component, a comment describing the issue, and a timeline to show just how long the problem has existed. The timeline is important because it shows you and others have known about the problem and have had time to come up with a resolution, and it also gives clarity into Management's decision-making process, or lack of it.

Some rules exist with Heat Maps. They can be used for both Application and Infrastructure components. Any yellow item requires a resolution plan or a written agreement from the Executive that they are willing to live with the risk. All yellow items are reviewed every month to make sure they are not forgotten. All red items get escalated to both the Business and the IT Executive with a very clear statement like:

"Your application name Oracle Inventory has been on our Issue list for the last 6 months and had gone into a critical state. We have highlighted this monthly and you have agreed to accept the risk and proceed with the 'no action' plan . We believe in it current state we can't not provide consistency of service and request we turn off threashold monitoring as they are being breached daily. We will remove the application from our support matrix and put it into a best effort only status. Please Approve."

Send it to the Application Owner and CC their boss, and their boss, and maybe even their boss. Everyone is busy, and it could be that one person's view of an application is different from another's, or one boss's plans for an application are different from the others'. One level may be thinking of phasing out the application while his/her boss could be thinking of phasing out its Manager and building up the application. So cover all bases and make sure everyone knows there is risk.

Risk may also be related to the time required to react to a situation. For instance, a SAN addition takes 4 months to validate, quote, procure, build, ship, site-prepare, deliver, install, cable, configure and test before going live. So turn things to a red status early if you need to.
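
If it helps, a heat map like this takes very little to mock up. The Python sketch below is purely illustrative (the component names, dates, comments and lead times are made up); it just applies the rules above, including turning long-lead items red early:

# Illustrative heat-map sketch. Component names, dates and lead times are
# invented; the status rules follow the post: yellow needs a resolution plan
# or signed risk acceptance, red gets escalated, and long-lead items go red early.
from datetime import date

REPORT_DATE = date(2010, 12, 13)   # the month being reported on

# (component, issue raised, comment, procurement lead time in days, runway in days)
ISSUES = [
    ("Oracle Inventory DB", date(2010, 5, 1),   "Thresholds breached daily",     0,   0),
    ("SAN Array 02",        date(2010, 9, 1),   "85% full, growing 3%/month",  120,  90),
    ("E-Messaging cluster", date(2010, 10, 15), "Queue backlog at month end",    0, 180),
]

def status(raised, lead_days, runway_days):
    age = (REPORT_DATE - raised).days
    # Red if we've been carrying the issue for more than 90 days, or if the
    # remaining runway is shorter than the time needed to procure a fix.
    if age > 90 or runway_days <= lead_days:
        return "RED"
    return "YELLOW" if age > 30 else "GREEN"

print(f"{'Component':<22}{'Status':<8}{'Age(d)':<8}Comment")
for name, raised, comment, lead, runway in ISSUES:
    age = (REPORT_DATE - raised).days
    print(f"{name:<22}{status(raised, lead, runway):<8}{age:<8}{comment}")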

Start with the one-page Heat Map, because sometimes you only get 15 minutes of focus from an Executive. If they are good, they will take the time to go through all the Capacity Reports, as it's their chance to drive strategy into the Infrastructure. Or perhaps this month's fire has their mind occupied. Either way, start with the Heat Map and let it tell its story.

Regards.

Scott Wardley

Friday, October 22, 2010

Hard Disk Response Time

Disk response time in milliseconds shows just how well a system is set up and configured. If you capture every disk's response time with iostat or via an agent collection (BMC/HP OVPA), you can oversee the health of the environment.

With most disks giving better than 5ms response times, anything greater may be a cause for concern. If the disks are disks 0 & 1 on Unix (internal disks) or C/D on Windows and the response time is >5ms, then chances are either an application installed locally is thrashing the disk or the system is paging memory due to a bad setup or a lack of physical memory.

If the disks are SAN attached, a >5ms response time is more of a concern, as the SAN has a large cache in front of it, so you are writing to RAM instead of disk. SAN response times should be much better, and if you are having issues it could be something like a bad query doing a table scan instead of an index lookup. Paging wouldn't be an issue unless this is a diskless server, which is more and more common.
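
To turn that 5ms rule of thumb into a repeatable check, something like this rough Python sketch can wrap iostat. It assumes Linux sysstat; since column layouts differ between versions and platforms, it finds the await column from the header rather than hard-coding a position, and the interval, count and threshold are only illustrative.

# Sketch: run iostat in extended device mode over a short sampling window and
# flag any device whose response time (await, in ms) exceeds 5 ms.
# Assumes Linux sysstat's iostat; note the first report covers averages since boot.
import subprocess

INTERVAL, COUNT, THRESHOLD_MS = 5, 12, 5.0

out = subprocess.run(
    ["iostat", "-dxk", str(INTERVAL), str(COUNT)],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

await_col, worst = None, {}
for line in out:
    cols = line.split()
    if not cols:
        continue
    if cols[0].startswith("Device"):
        # Header row: locate the response-time column for this iostat version
        # (plain "await" on older sysstat, "r_await"/"w_await" on newer ones).
        await_col = next((i for i, c in enumerate(cols) if "await" in c), None)
    elif await_col is not None and len(cols) > await_col:
        try:
            ms = float(cols[await_col])
        except ValueError:
            continue
        worst[cols[0]] = max(ms, worst.get(cols[0], 0.0))

for dev, ms in sorted(worst.items(), key=lambda kv: -kv[1]):
    flag = "  <-- investigate" if ms > THRESHOLD_MS else ""
    print(f"{dev:<12}{ms:6.1f} ms{flag}")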

Try to find the process ID that is accessing the disk/LUN. First stop is your senior Unix/Windows team leads to pinpoint the process and the LUN/disk. If it's a database LUN/disk, approach the DBAs, ask what database is on the LUN and ask for the SQL query behind the process ID. They should be able to investigate, identify and thank you for helping out. If it's an application process, then use the metrics, raise a ticket/incident to log the problem and go to the application owners/vendor with the details.
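
If the leads are busy, a quick Linux-only sketch like the one below (it usually needs root) can shortlist candidates by diffing per-process IO counters for a minute. Tools such as pidstat or iotop do the same job properly; this is just a triage aid.

# Sketch: shortlist processes doing the most disk IO over a short window by
# diffing /proc/<pid>/io counters (Linux only; usually needs root).
import glob
import time

def io_bytes():
    stats = {}
    for path in glob.glob("/proc/[0-9]*/io"):
        pid = path.split("/")[2]
        try:
            with open(path) as f:
                fields = dict(line.split(": ") for line in f)
            with open(f"/proc/{pid}/comm") as f:
                name = f.read().strip()
            stats[pid] = (name, int(fields["read_bytes"]), int(fields["write_bytes"]))
        except (OSError, KeyError, ValueError):
            continue        # process exited or counters unreadable
    return stats

before = io_bytes()
time.sleep(60)
after = io_bytes()

deltas = []
for pid, (name, r2, w2) in after.items():
    if pid in before:
        _, r1, w1 = before[pid]
        deltas.append((r2 - r1 + w2 - w1, pid, name, r2 - r1, w2 - w1))

for total, pid, name, rd, wr in sorted(deltas, reverse=True)[:10]:
    print(f"pid {pid:<8}{name:<20} read {rd/1e6:8.1f} MB  write {wr/1e6:8.1f} MB")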

Don't let people reboot the box, as that may just delay finding the issue again. Shoot for root cause, and when the issue is resolved, take credit, document it, record the savings as cost avoidance of a system upgrade, and pull the list out to show your Executive next time you need money to add to your Capacity Planning tool kit.

Regards.

Scott Wardley

Wednesday, October 20, 2010

Agents or Agentless Server Collection

Agent collection of metrics from servers is the lifeblood of any tightly controlled environment.

The question is how many? An HP agent for OpenView, a Tivoli agent for monitoring, a BMC agent for capacity collection, a Cirba agent for consolidation efforts... and on and on. An agent can cost from $40 to $500 to buy, and then you have to pay 20% on top of that for annual maintenance. Take that over a 4000-server environment and push the total cost out over the life of the equipment, and you are into many millions of dollars.

Personally, I like agentless collection by SNMP, as it's fast to implement and gathers 90% of the required metrics. Examples are MOM, SMS, BMC, Uptime, Quest, and my favourite open-source options are Cacti and Ganglia. It will not deep-dive into database filesystem utilization, but it will give you the baseline metrics and allow you to identify elevated baseline CPU, excessive CPU queues or memory issues.
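
To show how little is needed to get started, here is a rough sketch that polls the 1-minute load average over SNMP by shelling out to net-snmp's snmpget. The host names and community string are placeholders, and the targets have to expose the UCD-SNMP MIB (which Linux net-snmp agents typically do):

# Sketch: agentless poll of the 1-minute load average over SNMP using
# net-snmp's snmpget. Hosts and community string are placeholders; targets
# must run an SNMP agent exposing the UCD-SNMP MIB (laLoad.1).
import subprocess

HOSTS = ["web01.example.com", "web02.example.com", "db01.example.com"]  # placeholders
COMMUNITY = "public"                                                    # placeholder
LA_LOAD_1MIN = "1.3.6.1.4.1.2021.10.1.3.1"   # UCD-SNMP-MIB::laLoad.1

for host in HOSTS:
    try:
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq", host, LA_LOAD_1MIN],
            capture_output=True, text=True, timeout=5, check=True,
        ).stdout.strip()
        print(f"{host:<24} load(1m) = {out}")
    except (subprocess.SubprocessError, OSError) as err:
        print(f"{host:<24} unreachable ({err})")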

On a new site, do both. Talk the talk about agents and get the budget lined up for next year's spend, and do the agentless collection to show results. Your buddies are the Operating System teams, who should be the owners of the results and will help you get it going. Take their support, praise them for their efforts with the Executive, and let them use the data for their own purposes.

Most collections allow you to group servers by application owner or by infrastructure component: one group for all E-Messaging servers, one group for all servers on a particular SAN array, one group for all servers on a particular VLAN... all of these help you to be a leader when a big issue occurs.

A big plus for agentless collection is that most of it is only 1 to 15 minutes delayed, giving you a near real-time view of the environment. For the core Capacity Planners, harvest the data daily, put it into SAS and graph, trend or model away.
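
SAS does the heavy lifting, but purely as an illustration of "trend away", here is a tiny Python sketch (the daily numbers are invented) that fits a least-squares line to daily CPU averages and projects when a threshold would be crossed:

# Illustration only: fit a least-squares trend to daily average CPU and project
# when an 80% threshold would be crossed. The sample data is invented; in
# practice the figures come from the harvested daily collection.
daily_cpu = [42.0, 43.5, 44.1, 45.0, 46.2, 47.8, 48.5, 49.9, 51.0, 52.4]  # % busy per day
THRESHOLD = 80.0

n = len(daily_cpu)
xs = range(n)
mean_x = sum(xs) / n
mean_y = sum(daily_cpu) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_cpu))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(f"Growing {slope:.2f}% per day (today ~{intercept + slope * (n - 1):.1f}%)")
if slope > 0:
    days_left = (THRESHOLD - daily_cpu[-1]) / slope
    print(f"At this rate, {THRESHOLD}% CPU is roughly {days_left:.0f} days away.")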

Regards.

Scott Wardley

Tuesday, October 19, 2010

Where does it stop?

If you are a good Capacity Planner looking after servers, you may at some point get sideswiped by another domain. Several times I have been invited into a meeting just to take abuse about a lack of Data Centre rack space, SAN disk space or software licenses. People perceive these as capacity issues, and rightly so.

Surprise! Everything needs Capacity Planning, including staffing, buildings, and even parking spaces.

So stake out a spot, claim it, do it well and shrug off responsibility for the others, or become a generalist across all the domains and own that. Whatever you do, your goal is to ensure there are no surprises for the Executive. Document it, circulate it and escalate when there are issues. Ignoring a problem won't make it go away; it will just get outsourced instead.

Regards.

Scott Wardley

Wednesday, October 13, 2010

Runaway Processes

Tools help collect data that helps you identify patterns, like the following stuck backup process that hit several machines and created what we call a plateau: effectively, one CPU fully utilized running the process. In this case the machines had enough capacity, so there was no impact to production, but if it happened again we could have an avoidable service disruption.
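
Once the samples are harvested, a plateau like that is easy to flag automatically. This Python sketch is only an illustration (the sample data, window and tolerances are invented): it looks for utilization that jumps by roughly one core's share and then sits flat.

# Sketch: flag a "plateau" -- CPU utilization that jumps by roughly one core's
# share and then sits flat, the signature of a stuck single-threaded process.
# Sample data, window size and tolerances are invented for illustration.
CORES = 8
ONE_CORE = 100.0 / CORES          # ~12.5% of total on an 8-core box
WINDOW = 6                        # consecutive samples that must stay flat

samples = [10, 11, 9, 12, 24, 23, 24, 25, 23, 24, 24, 11, 10]  # % busy per interval

def find_plateau(series, baseline, step=ONE_CORE, window=WINDOW, tolerance=3.0):
    """Return (start, end) of the first flat run sitting ~one core above baseline."""
    for start in range(len(series) - window + 1):
        chunk = series[start:start + window]
        mean = sum(chunk) / window
        flat = max(chunk) - min(chunk) <= tolerance
        elevated = mean >= baseline + step - tolerance
        if flat and elevated:
            end = start + window
            while end < len(series) and abs(series[end] - mean) <= tolerance:
                end += 1
            return start, end - 1
    return None

baseline = sum(samples[:4]) / 4
hit = find_plateau(samples, baseline)
if hit:
    print(f"Plateau from sample {hit[0]} to {hit[1]}: roughly one core pegged "
          f"above the {baseline:.0f}% baseline -- check for a stuck process.")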

The hard part of catching these is the effort it takes to convince people there is an issue. It's best to document it and raise an incident ticket to get the "get out of jail" card. Make sure you document the fix and attach a dollar amount to it. For instance, if one machine costs $100k fully implemented and you just saved 20%, that's $20,000. For most sites this will add up to about $3 million a year. These metrics make a good case for further investment in tools or people and tell a success story, which your Manager will like.

Note that you can see the drop on the last day as the processes were removed.


Regards.

Scott Wardley

New Company - the mirror

A new site for a Capacity Planner is like going to a shopping mall: there are lots of areas of interest and just not enough time to see them all. Like any good shopper, we look in the mirror, figure out what doesn't look right and try to find something to fix it.

For me the closest mirror is the one showing Server Metrics. After a quick look around, I like to visit the Data Centre Inventory, which tells just how long the current wardrobe has been around. A solid hardware refresh policy will show up as only a few items more than 6 years old. This is infrequently the case, as most people have been driving value over the last 2 years of economic uncertainty.

Since the first quick look in the mirror was a concern, we next check our wallet to see just what we can afford to fix. The keeper of the wallet is Finance, and they will be quick to tell you there is no money, and that if you need any you will have to take it from your existing spending.

With a dire problem and no money, we have to do what most people do: cut the cable TV feed and save the monthly money until we can buy a new TV, and only then turn the cable back on. The first port of call is the vendor list, and an audit will soon show opportunities.

If Procurement is up for the challenge and we can drop some licensing spend, we now have some hardware refresh money or some funds for proper reporting tools. These will allow you to show just where to spend, instead of wasting your saved pennies on the fancy outfits your flashy, loud-talking friends say you need.

Next time... let's talk about tools.

Regards.

Scott Wardley