Monday, May 23, 2011

Software Release and Entire Web Tier is Down

Item - software release to web tier
Issues - dropped sessions
Root Cause - A very old log-shipping sync program, which had been disabled two releases earlier, started running again.
Identified - The top command showed high waitio. The waitio was traced back to the internal disks, which were responding in 27 ms when they should have been at 5 ms. The offending process was determined by looking at the length and volume of data handled by each process.
Team members that tracked root cause - 2 of the 50 on the SWAT team.
False paths - The wait was first tracked to network I/O, which was also impacted by the additional transmission requests caused by the dropped sessions. This was tracked to the tier's use of NFS, which was hosted on one of the nodes (bad idea). Detailed investigation of the NFS node found the disk I/O issue, and eventually the same disk I/O issue was found on all nodes.
Performance Impact - Waitio had taken out 3 cores of the 8-core box, reducing available CPU capacity, and poor disk response time impacted the rest. The tier would stay alive until WebLogic sessions increased, and then sessions would drop.
Lesson Learned - The performance test environment should be an image copy of production prior to software changes!
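
The disk check described under Identified (27 ms observed against an expected 5 ms) can be sketched in a few lines. This is only a rough illustration, not the tooling used during the incident. It assumes a Linux host that exposes counters in the /proc/diskstats layout; the function names and the 5 ms default threshold are hypothetical and chosen for this example. It takes two snapshots of the counters and divides the extra milliseconds spent on I/O by the extra I/Os completed.

```python
def parse_diskstats(text):
    """Parse /proc/diskstats-style text into {device: (ios_completed, ms_spent)}.

    Assumes the Linux field layout: major, minor, name, reads completed,
    reads merged, sectors read, ms reading, writes completed, writes merged,
    sectors written, ms writing, ...
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip blank or truncated lines
        name = fields[2]
        ios = int(fields[3]) + int(fields[7])    # reads + writes completed
        ms = int(fields[6]) + int(fields[10])    # ms spent reading + writing
        stats[name] = (ios, ms)
    return stats


def average_await_ms(before, after):
    """Average milliseconds per completed I/O between two snapshots."""
    result = {}
    for dev, (ios_after, ms_after) in after.items():
        if dev not in before:
            continue
        ios_before, ms_before = before[dev]
        delta_ios = ios_after - ios_before
        if delta_ios > 0:  # idle devices have no meaningful average
            result[dev] = (ms_after - ms_before) / delta_ios
    return result


def slow_disks(awaits, expected_ms=5.0):
    """Return devices whose average response time exceeds the expected value."""
    return {dev: ms for dev, ms in awaits.items() if ms > expected_ms}
```

On a live Linux box the two snapshots would come from reading /proc/diskstats a few seconds apart. Running the same comparison in the performance test environment before and after a release is one way to catch a change like this before it reaches production.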
