Friday, October 22, 2010

Hard Disk Response Time

Disk response time in milliseconds shows just how well a system is set up and configured. If you capture every disk's response time with iostat or via an agent collection (BMC/HP OVPA), you can oversee the health of the environment.

With most disks giving better than 5ms response times, anything greater may be a cause for concern. If the disks are disks 0 and 1 on Unix (internal disks) or C/D on Windows and the response time is >5ms, then chances are either an application installed locally is thrashing the disk or the system is paging memory due to a bad setup or a lack of physical memory.
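As a rough illustration, here is a minimal Python sketch that parses Linux iostat output and flags any device with an average wait above 5ms. The "await" column name, the 5ms cut-off and the sample interval are assumptions; column layout varies between sysstat versions, so adjust for your platform.

# flag_slow_disks.py - a minimal sketch, assuming Linux sysstat iostat with an
# "await" column (average I/O response time in milliseconds).
import subprocess

THRESHOLD_MS = 5.0  # anything slower than this deserves a closer look

def flag_slow_disks():
    # Two reports, 5 seconds apart; both the since-boot and current-interval
    # figures are scanned, so a flagged device is worth a look either way.
    out = subprocess.run(["iostat", "-dxk", "5", "2"],
                         capture_output=True, text=True, check=True).stdout
    await_col = None
    for line in out.splitlines():
        cols = line.split()
        if not cols:
            continue
        if cols[0].startswith("Device") and "await" in cols:
            await_col = cols.index("await")   # remember where the column sits
            continue
        if await_col is not None and len(cols) > await_col:
            try:
                wait_ms = float(cols[await_col])
            except ValueError:
                continue
            if wait_ms > THRESHOLD_MS:
                print(f"{cols[0]}: {wait_ms:.1f} ms average wait - investigate")

if __name__ == "__main__":
    flag_slow_disks()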

If the disks are SAN attached, a >5ms response time is more of a concern, as the SAN has a large cache on the front of it, so you are effectively writing to RAM instead of disk. SAN response time should be much better, and if you are having issues then it could be something like a bad query doing a table scan instead of an index lookup. Paging wouldn't be an issue unless this is a diskless server, which is more and more common.

Try to find the process ID that is accessing the disk/LUN. The first stop is your senior Unix/Windows team leads to pinpoint the process and the LUN/disk. If it's a database LUN/disk, approach the DBAs, ask which database is on the LUN, and ask for the SQL query behind that process ID. They should be able to investigate, identify and thank you for helping out. If it's an application process, then use the metrics, raise a ticket/incident to log the problem, and go to the application owners/vendor with the details.
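If you have access to the box yourself, a quick way to shortlist suspects before those conversations is to sample per-process I/O counters. A minimal sketch, assuming the psutil library is installed and the account running it can read other processes' counters:

# top_io_processes.py - a minimal sketch. Assumes the psutil library is
# installed and the account running it can read other processes' I/O counters.
import time
import psutil

def top_io_processes(sample_seconds=10, top_n=5):
    before = {}
    for p in psutil.process_iter(["pid", "name"]):
        try:
            io = p.io_counters()
            before[p.pid] = (p.info["name"], io.read_bytes + io.write_bytes)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue

    time.sleep(sample_seconds)

    deltas = []
    for p in psutil.process_iter(["pid", "name"]):
        if p.pid not in before:
            continue
        try:
            io = p.io_counters()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        name, old_total = before[p.pid]
        deltas.append((io.read_bytes + io.write_bytes - old_total, p.pid, name))

    # Heaviest I/O over the sample window first.
    for total, pid, name in sorted(deltas, reverse=True)[:top_n]:
        print(f"pid {pid} ({name}): {total / 1048576:.1f} MB in {sample_seconds}s")

if __name__ == "__main__":
    top_io_processes()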

Don't let people reboot the box, as that may just delay finding the issue until it happens again. Shoot for root cause, and when the issue is resolved, take credit, document it, book the savings as cost avoidance of a system upgrade, and pull the list out to show your Executive next time you need money to add to your Capacity Planning tool kit.

Regards.

Scott Wardley

Wednesday, October 20, 2010

Agents or Agentless Server Collection

Agent collection of metrics from servers is the lifeblood of any tightly controlled environment.

The question is how many? An HP agent for OpenView, a Tivoli agent for monitoring, a BMC agent for capacity collection, a Cirba agent for consolidation efforts... and on and on. An agent can cost from $40 to $500 to buy, and then you have to pay 20% on top of that each year for maintenance. Take that across a 4000-server environment, push the total cost out over the life of the equipment, and you are into many millions of dollars.
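To put rough numbers on that, here is a back-of-the-envelope calculation. The $300 average licence price, the five-year equipment life and the three agent products per server are illustrative assumptions, not quoted prices; plug in your own figures.

# agent_cost_estimate.py - back-of-the-envelope agent licensing cost.
# The averages below are illustrative assumptions, not quoted prices.
SERVERS = 4000
AVG_LICENCE = 300          # assumed mid-range of the $40-$500 spread
MAINTENANCE_RATE = 0.20    # 20% of licence cost per year
LIFE_YEARS = 5             # assumed life of the equipment
AGENT_PRODUCTS = 3         # e.g. monitoring + capacity + consolidation

purchase = SERVERS * AVG_LICENCE * AGENT_PRODUCTS
maintenance = purchase * MAINTENANCE_RATE * LIFE_YEARS
total = purchase + maintenance

print(f"Purchase:    ${purchase:,.0f}")     # $3,600,000
print(f"Maintenance: ${maintenance:,.0f}")  # $3,600,000 over 5 years
print(f"Total:       ${total:,.0f}")        # $7,200,000 - 'many millions'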

Personally, I like agentless collection by SNMP, as it's fast to implement and gathers 90% of the required metrics. Examples are MOM, SMS, BMC, Uptime and Quest, and my favourite open-source options are Cacti and Ganglia. It will not deep-dive into database filesystem utilization, but it will give you the baseline metrics and let you identify elevated baseline CPU, excessive CPU queues, or memory issues.
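To give a flavour of how little is needed to get started, here is a minimal sketch that polls hosts over SNMP using the net-snmp command-line tools. The host names are hypothetical, and it assumes snmpget/snmpwalk are installed, the targets answer SNMP v2c with a "public" read community, and they support the standard HOST-RESOURCES-MIB.

# snmp_poll.py - a minimal agentless polling sketch. Assumes the net-snmp
# snmpget/snmpwalk tools are installed and the targets answer SNMP v2c with
# the community string "public"; swap in your own community and host names.
import subprocess

HOSTS = ["server01", "server02"]                # hypothetical host names
SYS_UPTIME = "1.3.6.1.2.1.1.3.0"                # sysUpTime.0
HR_PROCESSOR_LOAD = "1.3.6.1.2.1.25.3.3.1.2"    # hrProcessorLoad, one row per CPU

def snmp(cmd, host, oid):
    out = subprocess.run([cmd, "-v", "2c", "-c", "public", "-Oqv", host, oid],
                         capture_output=True, text=True, check=True).stdout
    return out.strip().splitlines()

for host in HOSTS:
    uptime = snmp("snmpget", host, SYS_UPTIME)[0]
    loads = [int(v) for v in snmp("snmpwalk", host, HR_PROCESSOR_LOAD)]
    avg_cpu = sum(loads) / len(loads) if loads else 0
    print(f"{host}: uptime {uptime}, average CPU {avg_cpu:.0f}% over {len(loads)} cores")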

On a new site, do both. Talk the talk about agents and get the budget lined up for next year's spend, and do the agentless collection to show results. Your buddies are the operating system teams, who should be the owners of the results and will help you get it going. Take their support, praise their efforts to the Executive, and let them use the data for their own purposes.

Most collections allow you to group servers by application owner or by infrastructure component. One group for all e-messaging servers, one group for all servers on a particular SAN array, one group for all servers on a particular VLAN... all of these help you to be a leader when a big issue occurs.

A big plus for agentless is that most of it is only 1 to 15 minutes delayed, giving you that near-real-time view of the environment. For the core Capacity Planners, harvest the data daily, put it into SAS, and graph, trend or model away.
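If SAS isn't in the budget yet, even a simple linear trend over the daily averages answers the "when do we run out" question. A minimal sketch, assuming numpy is available; the sample data, the 75% planning ceiling and the straight-line fit are illustrative assumptions:

# cpu_trend.py - a minimal trending sketch. Assumes numpy is installed and
# daily_cpu holds one average CPU utilization percentage per day; the numbers
# below are made up for illustration.
import numpy as np

daily_cpu = [41, 43, 42, 45, 47, 46, 49, 51, 50, 53]   # illustrative data
THRESHOLD = 75.0                                        # planning ceiling (%)

days = np.arange(len(daily_cpu))
slope, intercept = np.polyfit(days, daily_cpu, 1)       # least-squares straight line

if slope > 0:
    current = intercept + slope * days[-1]
    days_left = (THRESHOLD - current) / slope
    print(f"Growing {slope:.2f}%/day; roughly {days_left:.0f} days to {THRESHOLD:.0f}%")
else:
    print("No upward trend in the sample window")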

Regards.

Scott Wardley

Tuesday, October 19, 2010

Where does it stop?

If you are a good Capacity Planner looking after servers, you may at some point get sideswiped by another domain. Several times I have been invited into a meeting just to take abuse about a lack of Data Centre rack space, SAN disk space or software licenses. People perceive these as capacity issues, and rightly so.

Surprise! Everything needs Capacity Planning, including staffing, buildings, and even parking spaces.

So stake out a spot, claim it, do it well and shrug off responsibility for the others, or become a generalist across all the domains and own that. Whatever you do, your goal is to ensure there are no surprises for the Executive. Document it, circulate it, and escalate when there are issues. Ignoring a problem won't make it go away; it will just get outsourced instead.

Regards.

Scott Wardley

Wednesday, October 13, 2010

Runaway Processes

Tools help collect data that helps you identify patterns, like the following stuck backup process that hit several machines and created what we call a plateau: effectively one CPU fully utilized running the process. In this case the machines had enough capacity, so there was no impact to production, but if it happened again we could have an avoidable service disruption.
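One way to catch the pattern early is to look for processes pinned at roughly one full CPU for a sustained period. A minimal sketch, assuming the psutil library is installed; the 90% cut-off and the 30-second sample window are arbitrary choices:

# find_plateaus.py - a minimal sketch to spot processes pinned at ~one CPU.
# Assumes the psutil library is installed; the 90% cut-off and 30-second
# window are arbitrary choices, not a standard.
import time
import psutil

SAMPLE_SECONDS = 30
ONE_CPU_PCT = 90.0   # a single stuck thread shows up near 100% of one core

procs = list(psutil.process_iter(["pid", "name"]))
for p in procs:
    try:
        p.cpu_percent(None)        # prime the counter; the first call returns 0
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

time.sleep(SAMPLE_SECONDS)

for p in procs:
    try:
        pct = p.cpu_percent(None)  # usage since the priming call
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue
    if pct >= ONE_CPU_PCT:
        print(f"pid {p.pid} ({p.info['name']}): {pct:.0f}% of one CPU over "
              f"{SAMPLE_SECONDS}s - possible runaway")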

The hard part of catching these is the effort it takes to convince people there is an issue. It is best to document it and raise an incident ticket to get the "get out of jail" card. Make sure you document the fix and put a dollar amount on it. For instance, if one machine costs $100k fully implemented and you just freed up 20% of it, that is $20,000 saved. For most sites this adds up to about $3 million a year. These metrics make a good case for further investment in tools or people and tell a success story, which your Manager will like.

Note that you can see the drop on the last day as the processes were removed.


Regards.

Scott Wardley

New Company - the mirror

A new site for a Capacity Planner is like going to a shopping mall. There are lots of areas of interest and just not enough time to see them all. Like any good shopper, we look in the mirror, figure out what doesn't look right, and try to find something to fix it.

For me the closest mirror is the one showing Server Metrics. After a quick look around I like to visit the Data Centre Inventory, which tells me just how long the current wardrobe has been around. A solid hardware refresh policy shows itself when there are only a few items more than 6 years old. This is infrequently the case, as most people have been driving value out of existing kit over the last two years of economic uncertainty.

Since the first quick look in the mirror was a concern, we next check our wallet to see just what we can afford to fix. The keeper of the wallet is Finance, and they will be quick to tell you that there is no money, and that if you need any then you need to take it from your existing spending.

With a dire problem and no money, we have to do what most people do: cut the cable TV feed, save the monthly money until we can buy a new TV, and only then turn the cable back on. The first port of call is the vendor list, and an audit will soon show opportunities.

If Procurement is up for the challenge and we can drop some licensing spend, we now have some hardware refresh money or some funds for proper reporting tools. These will let you show just where to spend, instead of wasting your saved pennies on the fancy outfits your flashy, loud-talking friends say you need.

Next time... let's talk about tools.

Regards.

Scott Wardley