Your MTTR

There will always be situations and problems in our software systems. I know many people feel we might get beyond most issues, but as long as we continue to develop and deploy software, I know we'll have issues. Hardware fails, bugs slip through tests and are triggered by edge cases we never anticipated, or perhaps the data is unexpected. That last one might be one of the most common issues, and a source of many security issues. Too many developers think that quality will always be high, but that's not always the case.

When things go wrong, how quickly can you get the system back up and running? Over time, a mean-time-to-recovery (MTTR) can help determine whether you are getting a better handle on your environment or things are getting worse. Both your operations staff and your developers should understand the system better over time, and hopefully get broken applications back up and running more quickly. Do you track your MTTR?

Or if you're in operations, maybe you track a mean-time-to-identification (MTTI). This is the time it takes to actually figure out what's wrong. I don't know anyone tracking this metric, but it's an interesting one to note. If we can't identify problems quickly, or the MTTI grows over time, perhaps we have a training or turnover issue. Or perhaps we have a disconnect between developers and operations staff. Even in a DevOps environment where developers are responsible for parts of the production environment, there will be differing levels of ability, and if the number rises, this metric might help you identify who needs more training or practice in troubleshooting.

For most of my career, I've reported on uptime (or downtime) to management. That's not a bad metric, but it doesn't help the dev or Ops staff understand where they might have problems. Many of us have ticketing systems where incidents are logged, and we add notes over time.
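If those tickets record when an incident was opened, when the cause was identified, and when service was restored, both numbers come down to simple arithmetic. Here is a minimal sketch in Python. The `Incident` class, its field names, and the sample timestamps are all assumptions made up for illustration; a real ticketing system will name and store these differently, and teams differ on whether recovery time starts at the report or at the identification. This sketch measures both from the moment the incident was opened.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical fields; real ticketing systems name these differently.
    opened: datetime      # when the problem was reported
    identified: datetime  # when the cause was understood
    resolved: datetime    # when service was restored


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a list of durations; an empty list averages to zero."""
    if not deltas:
        return timedelta(0)
    return sum(deltas, timedelta(0)) / len(deltas)


def mtti(incidents: list[Incident]) -> timedelta:
    """Mean time from an incident being opened to its cause being identified."""
    return mean_delta([i.identified - i.opened for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time from an incident being opened to service being restored."""
    return mean_delta([i.resolved - i.opened for i in incidents])


# Made-up sample incidents, purely for illustration.
incidents = [
    Incident(datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30), datetime(2024, 1, 3, 11, 0)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 10), datetime(2024, 1, 9, 15, 0)),
]
print("MTTI:", mtti(incidents))
print("MTTR:", mttr(incidents))
```

A single overall average says little on its own. Computing the same figures per month or per quarter is what shows whether the numbers are rising or falling.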
Knowing how long it takes to find a problem and then fix it gives you metrics that can help you improve your system reliability over time. That's if you use them to do so. If these are just numbers to try and make your group look good to upper management, then someone will manipulate things, closing tickets early or opening them late. They might even be more willing to close a ticket quickly and open another one to reduce the MTTI and MTTR times.

We can use metrics to improve how we work or just to look good. One of these will help build an effective, efficient, strong department that does a great job building and running applications. The other usually ends up building an environment where quality stagnates and people don't stay longer than necessary, and it keeps the traditional IT stereotypes alive. Which one do you work in and which would you prefer?

Steve Jones - SSC Editor