The Benefits of Good Versus Excellent Testing
It might seem strange to say that doing a “good” job of something is better than doing an “excellent” one. However, I’ve become convinced that this is true when it comes to software testing.
Early in my career, I thought that if we had more time for testing, we’d eventually find all the bugs. As time went by, I noticed that most escaped defects were things I never would have found even with all the time and testing in the world, or things that couldn’t have been easily prevented anyway. Good testing will always be important, but instead of investing in truly excellent testing, I’d rather assume things will go wrong and be prepared to handle failure.
Defining Good and Excellent Testing
For the purposes of this blog post, let’s define excellent and good testing.
- Excellent testing: thinking of as many test cases as you can, trying your best to run them all and continuing to test until you run out of time.
- Good testing: thinking of as many test cases as you can, but then choosing to run only the ones that will give you the most information about how well your product is working. Then, spending your remaining time planning how to detect and resolve failures you weren’t able to anticipate.
Planning for Failure
Why are there always failures that you won’t anticipate? I’ve learned some hard and painful lessons in this area. For one thing, you don’t know what you don’t know, and you certainly won’t be testing for it. I can tell countless stories of times when my team and I designed very robust test plans and tried every scenario and permutation we could think of, and the bug that bit us was still something we had no realistic way of knowing we should check for.
Another thing that contributes to the inability to think of every potential failure point is the number of dependencies in modern software systems. Things that are completely outside of your control can impact your system, and it’s very likely that you don’t even know all of the things your system is dependent on.
Given that we can’t be 100% confident that we’ll discover all important bugs, it’s important to invest in observability and monitoring for your system, as well as safe roll-out and roll-back strategies. If you can’t find all the bugs before you release new software, a very good backup plan is to be able to detect and resolve them quickly once they’re happening.
How to Combine Good Testing and Failure Detection
To get started with an approach that involves “good” testing with strong failure detection and resolution techniques, I’d recommend the following steps.
First of all, make sure to do that good, solid testing. Ask several people on your team to help come up with a detailed list of tests to run. Run the tests that will tell you the most, until you feel reasonably confident that you’ve uncovered most of the key problems and addressed the most important risks.
Then, instead of continuing to test further, invest the rest of your time in planning for failure. Here are some questions you could ask yourself to get started:
- Is there a way to roll out your changes in stages, or to a limited subset of users at first?
- Do you know how you’d quickly roll back to an earlier version if you needed to?
- What information can you collect that will signal something is going wrong?
- How do you want to be notified when a problem is detected?
- Who needs to take action when there is an issue?
- What are some good first troubleshooting steps to take when an issue is detected?
- Who else needs to be kept in the loop while issues are being investigated, and what kind of information do they need?
- Can you implement any extra resiliency or self-healing for any aspects of your product, so that straightforward issues could be resolved without human intervention?
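To make the first two questions more concrete, here is a minimal sketch of a staged roll-out check paired with a simple roll-back rule. Everything in it is an assumption for illustration: the function names, the 100-bucket scheme, and the 2% error-rate limit are hypothetical and would need to be adapted to your own system.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` of 100 buckets.

    Hashing the user ID keeps the assignment stable, so a user who gets the
    new version at 10% still gets it when the roll-out widens to 50%.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def should_roll_back(
    errors: int,
    requests: int,
    max_error_rate: float = 0.02,  # hypothetical limit, tune for your system
    min_requests: int = 100,       # avoid reacting to a handful of requests
) -> bool:
    """Return True when the new version's error rate exceeds the limit."""
    if requests < min_requests:
        return False
    return errors / requests > max_error_rate
```

Used together, you might expose a change to 5% of users, watch the error rate for that group, and only widen the percentage while `should_roll_back` keeps returning False.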
Testing Monitoring Systems
As you implement additional monitoring, keep in mind that the monitoring itself also needs to be tested. Do your alerts fire when you think they will? Are you happy with the thresholds you’ve set? Chaos testing can also be very informative: break some things on purpose and confirm that your monitoring systems identify the issue and correctly lead you towards the root cause.
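One way to check whether an alert fires when you expect is to feed it failures on purpose. The sketch below is a simplified stand-in for a real alerting rule, not any particular monitoring tool's API: the `ErrorRateAlert` class, its window size, and its threshold are all hypothetical.

```python
from collections import deque
from typing import Iterable


class ErrorRateAlert:
    """Fires when the failure rate over the last `window` results exceeds `threshold`."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.results = deque(maxlen=window)  # old results drop off automatically

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def firing(self) -> bool:
        # Stay quiet until the window is full, to avoid alerting on too little data.
        if len(self.results) < self.results.maxlen:
            return False
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results) > self.threshold


def simulate(alert: ErrorRateAlert, outcomes: Iterable[bool]) -> bool:
    """Feed a sequence of request outcomes to the alert and report its state."""
    for ok in outcomes:
        alert.record(ok)
    return alert.firing()


# Deliberately inject failures and confirm the alert reacts, then recovers.
alert = ErrorRateAlert(threshold=0.2, window=10)
healthy = simulate(alert, [True] * 10)
during_outage = simulate(alert, [False] * 5)
after_recovery = simulate(alert, [True] * 10)
```

If `during_outage` came back False here, that would be a sign the threshold or window is too forgiving for the kind of failure you care about.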
As your team’s focus shifts away from trying to identify absolutely every issue prior to releasing new software, your outcomes will begin to look a little different. If you’re doing a bit less testing than you were previously, you may find that more bugs are slipping out into production (or not: in many cases, most of the important defects are found during much earlier phases of testing, rather than during the last mile). In any case, since you now have better failure detection and resolution processes in place, the overall impact of each production issue should be much lower. Consider whether there are ways to quantify the impact of issues that occur in your system (rather than simply counting them) and look for this to trend downwards over time.
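As one possible way to quantify impact, you could multiply the number of affected users by the time each issue took to resolve. This is only a sketch: the "user-minutes" measure, the `Incident` fields, and all of the figures below are made up for illustration, and your own system may call for a different measure entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    name: str
    affected_users: int
    minutes_to_resolve: float


def impact(incident: Incident) -> float:
    """Impact in user-minutes: how many users were affected, and for how long."""
    return incident.affected_users * incident.minutes_to_resolve


def total_impact(incidents: List[Incident]) -> float:
    return sum(impact(i) for i in incidents)


# Hypothetical data: fewer incidents before, more incidents after.
before = [
    Incident("checkout outage", affected_users=1000, minutes_to_resolve=120),
    Incident("login errors", affected_users=500, minutes_to_resolve=60),
]
after = [
    Incident("slow search", affected_users=200, minutes_to_resolve=10),
    Incident("broken export", affected_users=50, minutes_to_resolve=5),
    Incident("checkout errors", affected_users=1000, minutes_to_resolve=15),
    Incident("stale cache", affected_users=10, minutes_to_resolve=30),
]
```

In this made-up example the second period has twice as many incidents, yet `total_impact(after)` is far lower than `total_impact(before)` because each issue was caught and resolved faster.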
One thing that’s almost certain is that you will notice more bugs once you’re watching production systems more closely, even if your users aren’t reporting them. In most systems, many small things are going wrong all the time, often with very low user impact (and sometimes the problems even resolve themselves), but reviewing this data can still give you insight into where a system lacks robustness.
Adjusting Your View of Failure
Given the changes in focus and expectations, you may want to adjust your views on what success looks like. Rather than focusing solely on lowering the number of escaped defects, it’s more important to optimize the speed at which issues can be detected and resolved.
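If you record when each issue started, when it was detected, and when it was resolved, tracking that speed can be quite simple. The sketch below assumes each incident is a hypothetical `(started, detected, resolved)` tuple of timestamps; the function names are my own and not part of any incident-tracking tool.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Tuple

# Each incident is (started, detected, resolved); this shape is an assumption.
IncidentTimes = Tuple[datetime, datetime, datetime]


def mean_minutes(deltas: List[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def detection_and_resolution_minutes(
    incidents: List[IncidentTimes],
) -> Tuple[float, float]:
    """Return (mean time to detect, mean time from detection to resolution)."""
    to_detect = mean_minutes([detected - started for started, detected, _ in incidents])
    to_resolve = mean_minutes([resolved - detected for _, detected, resolved in incidents])
    return to_detect, to_resolve


# Hypothetical incidents for illustration.
t0 = datetime(2024, 1, 1, 12, 0)
sample = [
    (t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=40)),
    (t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=30)),
]
detect_minutes, resolve_minutes = detection_and_resolution_minutes(sample)
```

Watching these two numbers over time gives you a success measure that lines up with the goal of detecting and resolving issues quickly, rather than only counting escaped defects.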
During post-incident discussions, instead of focusing only on things like “how can we make sure our testing won’t miss a bug like this again?” it’s also important to consider “is there a way we could detect a recurrence of this same issue in the future, and put an automatic resolution into place so that we don’t even need to take action on it?”
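One lightweight way to act on that second question is to keep a registry that maps known issue signatures to automatic fixes, and falls back to a human for anything unrecognized. This is a hypothetical sketch: the `RemediationRegistry` class, the signature strings, and the placeholder fix are all invented for illustration.

```python
from typing import Callable, Dict, Optional

# A remediation is any function that fixes the issue and returns a short summary.
Remediation = Callable[[], str]


class RemediationRegistry:
    """Maps known issue signatures to automatic fixes added after past incidents."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Remediation] = {}

    def register(self, signature: str, handler: Remediation) -> None:
        self._handlers[signature] = handler

    def handle(self, signature: str) -> Optional[str]:
        """Run the known fix, or return None so the issue is escalated to a person."""
        handler = self._handlers.get(signature)
        if handler is None:
            return None
        return handler()


def clear_cache() -> str:
    # Placeholder for a real fix, such as restarting a worker or purging a cache.
    return "cache cleared"


registry = RemediationRegistry()
registry.register("cache_full", clear_cache)

known = registry.handle("cache_full")
unknown = registry.handle("disk_failure")  # no handler yet, so a human is needed
```

Each post-incident discussion then has a natural follow-up: decide whether the issue deserves a new entry in the registry.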
A key thing to keep in mind as you build and maintain software is that it’s a good idea to be ready to handle failure, because in all likelihood, you’re not going to find all the bugs anyway.
Written by: Tina Fletcher
Tina Fletcher, Senior Director, Software Engineering at SkillsWave, holds a Bachelor of Science from McMaster University with a focus on computer science and neuroscience. If she could time travel, Tina would head back to July 20, 1969 to witness the Apollo 11 moon landing. In the present time, she enjoys tending to her vegetable garden.
