Tuesday, June 14, 2011

Take out Before you Put in

Glory comes from implementing new projects, and rarely from untangling the mess they leave behind in the Data Centre.

When confronted with a new implementation, always ask "what is it replacing?" and put an entry in your calendar for a take-out of the older system. After the glory has passed you can go back and recover the rack space and power that decommissioning the inactive system will provide.

Track it and you will be surprised at the total power saved over the course of a year. My choice for power metrics is Ideas International, where you can get SPECint-style ratings called RPE for apples-to-apples comparisons of throughput capability and power consumption between source and target systems.
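As a back-of-the-envelope illustration of that comparison, here is a small Python sketch that works out RPE per watt for a source and a target system and the annual energy recovered by taking the old box out. The RPE figures, wattages and power tariff below are made-up placeholders; substitute the published ratings for your own systems.

# Rough sketch of the apples-to-apples comparison described above.
# The RPE ratings and wattages are placeholder numbers -- substitute the
# published figures for your own source and target systems.

HOURS_PER_YEAR = 24 * 365
POWER_COST_PER_KWH = 0.10  # assumed tariff, adjust for your facility

def rpe_per_watt(rpe, watts):
    """Throughput capability delivered per watt of power drawn."""
    return rpe / watts

def annual_kwh(watts):
    """Energy consumed in a year at a steady draw of `watts`."""
    return watts * HOURS_PER_YEAR / 1000.0

source = {"name": "old 8-core box", "rpe": 12000, "watts": 950}
target = {"name": "replacement",    "rpe": 15000, "watts": 400}

saved_kwh = annual_kwh(source["watts"]) - annual_kwh(target["watts"])
print(f"{source['name']}: {rpe_per_watt(source['rpe'], source['watts']):.1f} RPE/W")
print(f"{target['name']}: {rpe_per_watt(target['rpe'], target['watts']):.1f} RPE/W")
print(f"Taking out the source saves ~{saved_kwh:,.0f} kWh "
      f"(${saved_kwh * POWER_COST_PER_KWH:,.0f}) per year")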

Regards.

Scott Wardley




Monday, May 23, 2011

Cacti to stop the Noise

Open source SNMP collection is a top priority for eliminating noise about Capacity Issues.

Our strategic capacity collectors like BMC, SRM, SAS ITRM and MOM/SCOM do not help in near real-time situations. Having a product like Cacti allows all the technical people to spot changes from baseline performance in near real time without impacting the environment.

With an organization-facing portal of graphs for CPU, memory, CPU queues, network/disk IO and filesystem utilization, we can stop the noise and spin when a system isn't responding. Middle-layer management can see what is going on while the SMEs investigate and drive toward root cause.
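For illustration, here is a minimal Python sketch of the same idea outside of Cacti: poll a host's 1-minute load average over SNMP (via the net-snmp snmpget CLI) and shout when it moves well away from a short rolling baseline. The host name, community string and "twice the baseline" rule are placeholders, and the OID assumes the target exposes UCD-SNMP-MIB::laLoad.

# Minimal near-real-time baseline check in the spirit of the Cacti graphs
# above -- not Cacti itself. Assumes the net-snmp `snmpget` CLI is installed
# and the target exposes the UCD-SNMP 1-minute load average; host, community
# and thresholds are placeholders.

import subprocess
import time
from collections import deque

HOST = "webnode01"            # hypothetical target
COMMUNITY = "public"
LOAD_1MIN_OID = ".1.3.6.1.4.1.2021.10.1.3.1"   # UCD-SNMP-MIB::laLoad.1
BASELINE_SAMPLES = 60          # ~5 minutes of history at 5-second polls

def poll_load():
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, LOAD_1MIN_OID],
        text=True)
    return float(out.strip().strip('"'))

history = deque(maxlen=BASELINE_SAMPLES)
while True:
    load = poll_load()
    if len(history) == history.maxlen:
        baseline = sum(history) / len(history)
        if load > 2 * baseline:            # crude "something changed" rule
            print(f"{HOST}: load {load:.2f} is well above baseline {baseline:.2f}")
    history.append(load)
    time.sleep(5)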

Regards.

Scott Wardley

Software Release and Entire Web Tier is Down

Item - software release to web tier
Issues - dropped sessions
Root Cause - a very old log-shipping sync program had started up again that had been disabled 2 releases before.
Identified - The top command showed high wait I/O. The wait I/O was tracked back to internal disks running at 27 ms service times when they should have been at 5 ms. The offending process was determined by looking at the length and volume of data written by each process (see the sketch after this list).
Team members that tracked root cause - 2 of the 50 on the SWAT team.
False paths - the wait was first tracked to network I/O, which was also impacted by the additional transmission requests caused by the dropped sessions. This was traced to NFS use by the tier, hosted on one of the nodes (bad idea). Detailed investigation of the NFS node found the disk I/O issue, which eventually turned out to affect all nodes.
Performance Impact - wait I/O had taken out 3 of the cores on each 8-core box, reducing available CPU capacity, and poor disk response time impacted the rest. The tier would stay alive until WebLogic sessions increased, and then sessions would drop.
Lesson Learned - the Performance Test environment should be an image copy of production prior to software changes!
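A rough Python sketch of the kind of check that caught those 27 ms disks: run iostat -x once (sysstat package), parse the extended device statistics and flag anything whose await-style latency is well above what the disks are rated for. The 20 ms threshold is an assumption, and the header is parsed rather than hard-coded because column names differ between sysstat versions (await vs r_await/w_await).

# Flag devices reporting high service times in a single `iostat -x` snapshot.
# Threshold and parsing assumptions as noted above -- adjust for your own kit.

import subprocess

LATENCY_THRESHOLD_MS = 20.0   # the internal disks here were expected near 5 ms

def slow_devices(threshold=LATENCY_THRESHOLD_MS):
    out = subprocess.check_output(["iostat", "-x"], text=True).splitlines()
    header = None
    flagged = []
    for line in out:
        parts = line.split()
        if parts and parts[0].startswith("Device"):
            header = parts                      # column names for this sysstat version
            continue
        if header and len(parts) == len(header):
            stats = dict(zip(header, parts))
            latencies = [float(v) for k, v in stats.items() if "await" in k]
            if latencies and max(latencies) > threshold:
                flagged.append((parts[0], max(latencies)))
    return flagged

if __name__ == "__main__":
    for dev, ms in slow_devices():
        print(f"{dev}: {ms:.1f} ms service time -- worth a closer look")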

Saturday, January 1, 2011

Hidden Backup

Ran into a server with high network I/O in primetime. My first thought was that it was a backup, but after checking the historical usage over the last 40 days I figured that this wasn't the case. This particular server is at the hub of the Customer Billing environment, so I raised a severity ticket to bring teams together to figure out what was going on.

By the time we looked again the issue was gone, yet nobody knew the cause. Investigation revealed that the cause was a full backup, and that it had been cancelled for the last few months based on a previous request to free up CPU cycles to catch up on some billing runs. It had been cancelled again while I was raising the severity ticket.

More digging revealed a new media server had been implemented 6 months ago to off-host this backup, but it was forgotten.

Step 1: get a full backup run, as rebuilding from 3 months of incremental backups isn't recommended. Step 2: get the Project Office to complete the original media server implementation.

Driving these types of issues isn't in the normal remit of a Capacity Planner, but sometimes when we discover something we have to drive it until a firm resolution is in place. Make sure you partner with your Backup Team and identify any other backups that aren't getting completed, as these could be hiding capacity and performance issues.
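One way to run that check with the Backup Team is to sweep their job history for clients whose last successful full backup is too old. The Python sketch below assumes a hypothetical CSV export with client, type, status and end_date columns; adapt the parsing to whatever report your backup product actually produces, and note it won't flag clients that have never had a full backup at all.

# List backup clients whose most recent successful FULL is older than a cutoff.
# The CSV layout (client,type,status,end_date) is a hypothetical example.

import csv
from datetime import datetime, timedelta

MAX_FULL_AGE_DAYS = 30
REPORT = "backup_jobs.csv"    # hypothetical export from the backup product

def stale_fulls(path=REPORT, max_age_days=MAX_FULL_AGE_DAYS):
    last_full = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["type"].upper() == "FULL" and row["status"].upper() == "SUCCESS":
                ended = datetime.strptime(row["end_date"], "%Y-%m-%d")
                if row["client"] not in last_full or ended > last_full[row["client"]]:
                    last_full[row["client"]] = ended
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return sorted(client for client, ended in last_full.items() if ended < cutoff)

if __name__ == "__main__":
    for client in stale_fulls():
        print(f"{client}: no successful full backup in the last {MAX_FULL_AGE_DAYS} days")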

Regards.

Scott Wardley