Monday, May 23, 2011

Cacti to stop the Noise

Open source SNMP collection is a top priority for eliminating noise about capacity issues.

Our strategic capacity collectors (BMC, SRM, SAS ITRM, MOM/SCOM) do not help in near-real-time situations. A product like Cacti lets all the technical people spot changes from baseline performance in near real time without impacting the environment.

With an organization-facing portal that graphs CPU, memory, CPU queues, network/disk I/O, and filesystem utilization, we can stop the noise and spin when a system isn't responding. Middle management can see what is going on while the SMEs investigate and shoot for root cause.
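To make "baseline performance change" concrete, here is a minimal sketch in Python. It is not how Cacti works internally; the function name, the three-sigma threshold, and the sample values are all hypothetical and only illustrate the kind of check a graph lets people do by eye.

```python
from statistics import mean, stdev

def deviates_from_baseline(history, current, sigmas=3.0):
    """Return True if `current` is more than `sigmas` standard
    deviations away from the mean of the `history` samples."""
    if len(history) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        # A perfectly flat baseline: any change counts as a deviation.
        return current != mu
    return abs(current - mu) > sigmas * sd

# Hypothetical CPU utilization samples (percent) from a quiet period.
baseline_cpu = [22, 25, 24, 23, 26, 24]

print(deviates_from_baseline(baseline_cpu, 27))  # small wobble
print(deviates_from_baseline(baseline_cpu, 60))  # large jump
```

A fixed sigma threshold is the simplest choice; a real deployment would likely need per-metric tuning and some allowance for time-of-day patterns.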

Regards.

Scott Wardley

Software Release and Entire Web Tier is Down

Item - software release to web tier
Issues - dropped sessions
Root Cause - A very old log-shipping sync program that had been disabled 2 releases earlier was started again.
Identified - The top command showed waitio. The waitio was tracked back to internal disks responding in 27ms when they should have been at 5ms. The offending process was found by looking at the length and volume of data per process.
Team members who tracked root cause - 2 of 50 on the SWAT team.
False paths - The wait was first tracked to network I/O, which was also impacted by the additional transmission requests caused by dropped sessions. This was tracked to the tier's use of NFS, which was hosted on one of the nodes (bad idea). Detailed investigation of the NFS node found the disk I/O issue, and we eventually found the same disk I/O issue on all nodes.
Performance Impact - waitio had taken out 3 cores of an 8-core box, reducing available CPU capacity, and poor disk response time impacted the rest. The tier would stay alive until WebLogic sessions increased, and then sessions would drop.
Lesson Learned - The performance test environment should be an image copy of production prior to software changes!
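The two numbers that cracked this one (waitio share and disk response time) can be computed from raw counters. The sketch below assumes a Linux host, which the write-up above does not state, and uses the field layouts of /proc/stat and /proc/diskstats. The function names and the sample counter lines are made up for illustration; they are chosen to reproduce the 3-of-8-cores and 27ms figures, not taken from the actual incident.

```python
def cpu_iowait_fraction(stat_before, stat_after):
    """Fraction of CPU time spent in iowait between two samples of
    the aggregate 'cpu' line from /proc/stat."""
    def fields(line):
        parts = line.split()
        if parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        # user nice system idle iowait irq softirq steal
        return [int(x) for x in parts[1:9]]

    before, after = fields(stat_before), fields(stat_after)
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0


def avg_io_latency_ms(disk_before, disk_after):
    """Average milliseconds per completed I/O between two samples of
    one device's line from /proc/diskstats."""
    def fields(line):
        p = line.split()
        # p[3] reads completed, p[6] ms spent reading,
        # p[7] writes completed, p[10] ms spent writing
        return int(p[3]), int(p[6]), int(p[7]), int(p[10])

    r0, rms0, w0, wms0 = fields(disk_before)
    r1, rms1, w1, wms1 = fields(disk_after)
    ios = (r1 - r0) + (w1 - w0)
    return ((rms1 - rms0) + (wms1 - wms0)) / ios if ios else 0.0


# Made-up counter samples taken a few seconds apart.
cpu_t0 = "cpu 1000 0 500 6000 500 0 0 0"
cpu_t1 = "cpu 1200 0 600 6200 800 0 0 0"
disk_t0 = "8 0 sda 1000 0 8000 5000 2000 0 16000 10000 0 9000 15000"
disk_t1 = "8 0 sda 1100 0 8800 7700 2200 0 17600 15400 0 9900 23100"

iowait = cpu_iowait_fraction(cpu_t0, cpu_t1)
print(f"iowait share: {iowait:.1%}, cores lost on an 8-core box: {iowait * 8:.1f}")
print(f"average disk latency: {avg_io_latency_ms(disk_t0, disk_t1):.1f} ms")
```

Working from deltas between two samples matters here: the counters are cumulative since boot, so a single reading would hide a problem that started with the release.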