Monday, May 23, 2011

Cacti to stop the Noise

Open source SNMP collection is a top priority for eliminating noise around capacity issues.

For all their strengths, our strategic capacity collectors like BMC, SRM, SAS ITRM and MOM/SCOM do not help in near real-time situations. A product like Cacti lets all the technical people spot baseline performance changes in near real time without impacting the environment.

With an organization-facing, accessible portal of graphs for CPU, memory, CPU queues, network/disk IO and filesystem utilization, we can stop the noise and spin when a system isn't responding.  Middle-layer management can see what is going on while the SMEs investigate and shoot for root cause.
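
Even before Cacti is in place, the same SNMP counters can be spot-checked from a script. A minimal sketch, assuming net-snmp's snmpget is installed and the target exposes the standard UCD-SNMP load average OID; the hostname and community string are placeholders:

    import subprocess

    # Placeholder host and community string; the OID is the standard
    # UCD-SNMP 1-minute load average (laLoad.1).
    HOST = "webserver01"
    COMMUNITY = "public"
    LOAD_OID = "1.3.6.1.4.1.2021.10.1.3.1"

    def snmp_get(host, community, oid):
        # Shell out to net-snmp's snmpget and return the bare value string.
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid], text=True)
        return out.strip()

    if __name__ == "__main__":
        print(f"{HOST} 1-minute load: {snmp_get(HOST, COMMUNITY, LOAD_OID)}")

The same pattern extends to the memory, disk and interface OIDs that Cacti graphs, which makes it easy to cross-check a graph that looks suspicious.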

Regards.

Scott Wardley

Software Release and Entire Web Tier is Down

Item - software release to web tier
Issues - dropped sessions
Root Cause - a very old log-shipping sync program had started up that had been disabled two releases before.
Identified - The top command showed wait IO. The wait IO tracked back to internal disks running at 27ms response time where 5ms was expected. The offending process was determined by looking at the length and volume of data written by each process (a rough sketch of that per-process check follows below).
Team members that tracked root cause - 2 of the 50 on the SWAT team.
False paths - the wait was first tracked to network IO, which was also impacted by the additional transmission requests caused by the dropped sessions. This was tracked to NFS use by the tier, hosted on one of the nodes (bad idea). Detailed investigation of the NFS node found a disk IO issue, and eventually the disk IO problem was found on all nodes.
Performance Impact - wait IO had taken out 3 cores of an 8-core box, reducing available CPU capacity, and the poor disk response time impacted the rest. The tier would stay alive until WebLogic sessions increased, and then sessions would drop.
Lesson Learned - the Performance Test environment should be an image copy of production prior to software changes!
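
The per-process check above can be approximated on Linux by reading /proc/<pid>/io, which keeps cumulative read/write byte counts. A minimal sketch, assuming a Linux web tier and enough privilege to read other users' process entries:

    import os

    def process_write_bytes():
        # Collect cumulative write_bytes per process from /proc (Linux only).
        totals = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/io") as f:
                    fields = dict(line.split(":") for line in f)
                with open(f"/proc/{pid}/comm") as f:
                    name = f.read().strip()
                totals[(pid, name)] = int(fields["write_bytes"])
            except (OSError, KeyError):
                continue  # process exited or access denied
        return totals

    if __name__ == "__main__":
        # Print the top 10 writers by cumulative bytes written to disk.
        writers = sorted(process_write_bytes().items(), key=lambda kv: kv[1], reverse=True)
        for (pid, name), nbytes in writers[:10]:
            print(f"{pid:>7}  {name:<20}  {nbytes / (1024 * 1024):10.1f} MB written")

Since the counters are cumulative, run it twice a few minutes apart and compare the numbers to see who is writing heavily right now rather than historically.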

Saturday, January 1, 2011

Hidden Backup

Ran into a server with high network IO in primetime. My first thought was that it was a backup, but after checking the historical usage over the last 40 days I figured that this wasn't the case. The particular server is at the hub of the Customer Billing environment, so I raised a severity ticket to bring the teams together to figure out what was going on.
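
To make that "check the last 40 days" comparison repeatable, one option is to build a per-hour baseline from the historical samples and flag anything well outside it. A minimal sketch, assuming the (timestamp, MB/s) samples have already been exported from your collector:

    from datetime import datetime
    from statistics import mean, stdev

    def is_anomalous(history, now, current_mbps, sigmas=3.0):
        """history is a list of (datetime, MB/s) samples; compare the current
        reading against the same hour of day in that history."""
        same_hour = [v for ts, v in history if ts.hour == now.hour]
        if len(same_hour) < 2:
            return False  # not enough history to judge
        baseline, spread = mean(same_hour), stdev(same_hour)
        return current_mbps > baseline + sigmas * max(spread, 1.0)

    # Example with made-up numbers: a quiet 21:00 baseline versus a 480 MB/s spike.
    history = [(datetime(2010, 11, d, 21), 12.0 + d % 3) for d in range(1, 29)]
    print(is_anomalous(history, datetime(2010, 12, 1, 21), 480.0))  # -> True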

By the time we looked again the issue was gone, yet nobody knew the cause. Investigation revealed that the cause was a full backup, and that it had been cancelled for the last few months based on a previous request to free up CPU cycles to catch up on some billing runs. It had been cancelled again while I was raising the severity ticket.

More digging revealed that a new media server had been implemented six months ago to off-host this backup, but it was forgotten.

Step 1: get a full backup run, as rebuilding from three months of incremental backups isn't recommended. Step 2: get the Project Office to complete the original media server implementation.

Driving these types of issues isn't in the normal remit of a Capacity Planner, but sometimes when we discover something we have to drive it until a firm resolution is in place.  Make sure you partner with your Backup Team and identify any other backups that aren't getting completed, as these could be capacity and performance issues.

Regards.

Scott Wardley

Monday, December 13, 2010

Debug and CPU queues

As part of a peak season, I have been reviewing multiple environments with the various Application Owners. 

Besides basic capacity metrics I have been checking out CPU load (queues). Load is the average number of processes waiting to run on a system. Like line-ups in a store: each check-out (clerk) is a CPU core, and as long as there are no lines (load/queues) at the check-outs there is not an issue.

In this case there were line-ups for the CPU cores, but not all of the check-outs (cores) were being used. On a web application that is supposed to be multi-threaded, this is a clear indication of an issue. The cause can be IO related, too many application instances running, or simply poor threading/forking in the program code.

In this instance the root cause was having debug=on for Java, creating an IO bottleneck to disk. This issue had been bothering me for a week or so, but this morning, while working with a Linux guy, we saw a server with 0.01% CPU and queues of 2, and I knew we had a clear victim to investigate.
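
A quick way to spot that pattern on a box is to put the load average next to the core count; high load with a near-idle CPU, like the 0.01% CPU with a queue of 2 above, usually means the queue is waiting on IO rather than CPU. A minimal sketch (the threshold is just the core count):

    import os

    def queue_check():
        # 1-, 5- and 15-minute load averages versus available cores (Unix only).
        load1, load5, load15 = os.getloadavg()
        cores = os.cpu_count() or 1
        print(f"load averages: {load1:.2f} / {load5:.2f} / {load15:.2f}  cores: {cores}")
        if load1 > cores:
            print("queues are forming: more runnable/waiting tasks than cores")
        # If load is up but CPU utilisation is near zero, suspect IO wait
        # (debug logging, NFS, slow disks) rather than CPU saturation.

    if __name__ == "__main__":
        queue_check()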

Last week a similar issue was encountered with sftp debug options.

Hey! If you're in production, turn off debug.

Regards.

Scott Wardley


Sunday, November 28, 2010

Use statistics to tell the story.

You are more than likely to run into a system that has more problems with politics than capacity.

These systems usually have basic capacity issues related to disk IO and memory that impact CPU utilization, showing up as wait IO. But because there are Money people driving decisions instead of Technical Leaders, we end up with a major spend/project occurring instead of basic tuning.

Use statistics to tell the story. Lots of emotion surrounds these types of systems, and it's best to gather accurate statistics and be the provider of information that helps people make decisions or reduce risk. When people are ready they will use the data and your message will get out.

iostat, when used over a good sampling period, is one of the best tools, followed by looking at run queues from a top command.  Not everything can be resolved by an upgrade, especially something like FTP security logging with debug enabled that flogs the root disks.
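
When presenting the numbers, a short summary of the sample carries further than raw output. A minimal sketch that reduces a series of disk response time (await) readings to a few statistics; the sample values are placeholders for whatever your iostat collection produces:

    from statistics import mean, quantiles

    def summarize(await_ms):
        """Summarize a list of per-interval disk await readings in milliseconds."""
        return {
            "samples": len(await_ms),
            "mean_ms": round(mean(await_ms), 1),
            "p95_ms": round(quantiles(await_ms, n=20)[-1], 1),  # 95th percentile
            "worst_ms": max(await_ms),
        }

    # Placeholder readings; in practice capture them over business hours with
    # something like: iostat -xd 30
    print(summarize([4.1, 5.3, 27.0, 31.5, 6.2, 4.8, 22.9, 5.0]))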

Regards.

Scott.

Monday, November 8, 2010

Heat Maps

With millions of pieces of data and numerous untold stories, the typical monthly report gets complicated. Expecting 20 pages per domain is common if supporting detail is included. If your remit covers Servers, Storage, Virtual Servers, Citrix, E-Messaging, Middleware, VOIP, Backup, Data Centre and Networks, the amount of detail is just too much for even an advanced technical person to take in.

Add on pricing, people and risk components and it's no surprise your typical Executive is not getting a clear picture of where to put resources to help you.

Enter the Heat Map. One page that tells it all and shows it all in one view: the application or infrastructure component, a comment with the issue, and a timeline to show just how long the problem has existed. The timeline is important because it shows that you and others have known about the problem and have had time to come up with a resolution, and it also gives clarity into Management's decision-making process, or the lack of it.

Some rules exist with Heat Maps. They can be used for both Application and Infrastructure components. Any yellow item requires a resolution plan or a written agreement from the Executive that they are willing to live with the risk. All yellow items are reviewed every month to make sure they are not forgotten. All red items get escalated to both the Business and the IT Executive with a very clear statement like:
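
The one-pager itself doesn't need a fancy tool. A minimal sketch of the idea with made-up component names, issues and dates; the only real structure is component, issue, how long it has been on the list, and colour:

    from datetime import date

    # Hypothetical entries: (component, issue, first reported, status).
    HEATMAP = [
        ("Oracle Inventory", "disk IO at capacity, thresholds breached daily",
         date(2010, 5, 1), "RED"),
        ("E-Messaging", "mail store filesystem at 85% and growing",
         date(2010, 9, 1), "YELLOW"),
    ]

    def months_open(since, today=date(2010, 11, 8)):
        return (today.year - since.year) * 12 + (today.month - since.month)

    def print_heatmap():
        # One line per component; reds sort ahead of yellows alphabetically.
        for component, issue, since, status in sorted(HEATMAP, key=lambda e: e[3]):
            flag = "  << escalate to Business and IT Executive" if status == "RED" else ""
            print(f"{status:<7} {months_open(since):>2} mo  {component:<18} {issue}{flag}")

    print_heatmap()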

"Your application name Oracle Inventory has been on our Issue list for the last 6 months and had gone into a critical state. We have highlighted this monthly and you have agreed to accept the risk and proceed with the 'no action' plan . We believe in it current state we can't not provide consistency of service and request we turn off threashold monitoring as they are being breached daily. We will remove the application from our support matrix and put it into a best effort only status. Please Approve."

Send it to the Application Owner and CC their boss, and their boss, and maybe even their boss. Everyone is busy, and it could be that one person's view of an application is different from another's. Or one boss's plans for an application are different from another's. One level may be thinking of phasing out the application while his/her boss could be thinking of phasing out its Manager and building up the application. So cover all bases and make sure everyone knows there is risk.

Risk may also be related to the time required to react to a situation. For instance, a SAN addition takes 4 months to validate, quote, procure, build, ship, site-prepare, deliver, install, cable, configure and test before going live. So turn things to a red status early if you need to.

Start with the one-page heat map, because sometimes you only get 15 minutes of focus from an Executive. If they are good, they will take the time to go through all the Capacity Reports, as it's their chance to drive strategy into the Infrastructure. Or perhaps this month's fire has their mind occupied. Either way, start out on the heat map and let it tell its story.

Regards.

Scott Wardley

Friday, October 22, 2010

Hard Disk Response Time

Disk response time in milliseconds shows just how well a system is set up and configured. If you capture all the disks' response times with iostat or via an agent collection (BMC/HP OVPA), you can oversee the health of the environment.

With most disks giving better than 5ms response times, anything greater may be a cause for concern. If the disks are disks 0 & 1 on Unix (internal disks) or C/D on Windows and the response time is >5ms, then chances are either an application installed locally is thrashing the disk or the system is paging memory due to bad setup or lack of physical memory.

If the disks are SAN attached, a >5ms response time is more of a concern, as the SAN has a large cache in front of it so you are effectively writing to RAM instead of disk. SAN response time should be much better, and if you are having issues it could be something like a bad query doing a table scan instead of an index lookup.  Paging wouldn't be an issue unless this is a diskless server, which is more and more common.
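
A hedged sketch of that iostat capture, flagging any device whose average wait (await) breaks the 5ms rule of thumb; it assumes the Linux sysstat iostat and finds the await column by header name, since the column layout varies between versions:

    import subprocess

    THRESHOLD_MS = 5.0

    def slow_devices(interval=30, count=2):
        # One extended, device-only iostat sample; return devices with await > threshold.
        out = subprocess.check_output(
            ["iostat", "-xd", str(interval), str(count)], text=True).splitlines()
        slow, header = [], None
        for line in out:
            parts = line.split()
            if not parts:
                continue
            if parts[0].startswith("Device"):
                header = parts  # remember column positions from the header row
            elif header and "await" in header and len(parts) == len(header):
                await_ms = float(parts[header.index("await")])
                if await_ms > THRESHOLD_MS:
                    slow.append((parts[0], await_ms))
        return slow

    if __name__ == "__main__":
        for dev, ms in slow_devices():
            print(f"{dev}: {ms:.1f} ms average wait, worth a closer look")

Anything it flags is only a pointer; the iostat output itself, captured over a decent interval, is still what you take to the Unix or storage team.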

Try to find the process id that is accessing the disk/LUN. First stop is your senior Unix/Windows team leads to pinpoint the process and the LUN/disk. If it's a database LUN/disk, then approach the DBAs, ask what database is on the LUN, and ask for the SQL query behind the process id. They should be able to investigate, identify and thank you for helping out. If it's an application process, then use the metrics, raise a ticket/incident to log the problem and go to the application owners/vendor with the details.

Don't let people reboot the box, as it may just delay finding the issue again. Shoot for root cause, and when the issue is resolved take credit, document it, book the savings as cost avoidance of a system upgrade, and pull the list out to show your Executive the next time you need money to add to your Capacity Planning toolkit.

Regards.

Scott Wardley