Wednesday, October 13, 2010

Runaway Processes

Tools help you collect data, and that data helps you identify patterns. Take the following stuck backup process, which hit several machines and created what we call a Plateau: effectively, one CPU is fully utilized running the process. In this case the machines had enough capacity, so there was no impact on production, but if it happens again we could have an avoidable service disruption.
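As a rough illustration of how a Plateau could be spotted in collected data, here is a minimal sketch in Python. It assumes you already have a series of whole-machine CPU utilization samples in percent. The function name `find_plateaus` and its parameters `min_samples` and `tolerance` are hypothetical and do not come from any particular monitoring tool. The idea is simply that utilization never drops below one CPU's share of the machine for a sustained period.

```python
def find_plateaus(samples, n_cpus, min_samples=12, tolerance=2.0):
    """Return (start, end) index pairs for sustained plateaus.

    samples     -- whole-machine CPU utilization, in percent (assumed input)
    n_cpus      -- number of CPUs in the machine
    min_samples -- how many consecutive samples count as "sustained" (assumption)
    tolerance   -- slack, in percentage points, below one CPU's share (assumption)
    """
    # One fully busy CPU shows up as 100 / n_cpus percent of the whole machine.
    floor = 100.0 / n_cpus - tolerance
    plateaus = []
    start = None
    for i, value in enumerate(samples):
        if value >= floor:
            if start is None:
                start = i  # a candidate plateau begins here
        else:
            # Utilization dropped below the floor: close any long-enough run.
            if start is not None and i - start >= min_samples:
                plateaus.append((start, i - 1))
            start = None
    # Handle a plateau that is still running at the end of the data.
    if start is not None and len(samples) - start >= min_samples:
        plateaus.append((start, len(samples) - 1))
    return plateaus


if __name__ == "__main__":
    # Made-up sample data for a 4-CPU machine: idle, then one CPU pinned, then idle.
    data = [5, 6, 4] + [26] * 15 + [5, 4]
    print(find_plateaus(data, n_cpus=4))
```

The thresholds are guesses and would need tuning to your sampling interval. A short legitimate job that pins a CPU for a few samples should not be reported, which is why a minimum run length is used.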

The hard part of catching these is the effort it takes to convince people there is an issue. It is best to document the problem and raise an incident ticket to get the "get out of jail" card. Make sure you document the fix and attach a dollar amount to it. For instance, if one machine costs $100k fully implemented, you just saved 20% of that, or $20,000. For most sites this will be about $3 million a year. These metrics make a good case for further investment in tools or people, and they tell a success story, which your manager will like.
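The per-machine figure above is simple arithmetic, and it can help to keep it in a small script so the same calculation goes into every incident ticket. The sketch below is only an illustration: the function name `capacity_savings` and the `machines` parameter are hypothetical, and the 20% is treated as the share of a machine's cost attributed to the runaway process.

```python
def capacity_savings(machine_cost, percent_saved, machines=1):
    """Dollar value of capacity recovered by fixing a runaway process.

    machine_cost  -- fully implemented cost of one machine, in dollars
    percent_saved -- share of the machine attributed to the runaway process
    machines      -- number of affected machines (hypothetical parameter)
    """
    return machine_cost * percent_saved / 100 * machines


if __name__ == "__main__":
    # The example from the text: a $100k machine with 20% saved.
    print(capacity_savings(100_000, 20))
```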

Note the drop on the last day, when the processes were removed.


Regards.

Scott Wardley
