Tuesday, June 5, 2018

The Last Measurable Ounce of Quality Can Be Expensive



A Brief Lexicon of the Software Quality Landscape


The quality of a software product is a multi-dimensional measurement, spanning such things as functionality, correctness, performance, documentation, ease of use, flexibility, maintainability, and many others. Many of these qualities are difficult to measure, are difficult to see, and hence are difficult to manage. The result is that they are ignored by all but the most enlightened in management.

The topic of interest for this screed is correctness. This does not concern correctness in the end-user sense, meaning the correct meeting of requirements. It means "is the code doing what we think it should be doing?" Since we don't generally have provably correct programs, it is a matter of convincing ourselves, through a lack of evidence to the contrary, that our programs are working as we'd like. This precarious situation is captured by Edsger Dijkstra's observation that "program testing can be used to show the presence of bugs, but never to show their absence."

We have several levels of testing with which to convince ourselves of correctness. They are, from most detailed to most abstract, unit testing, integration testing, functional testing, and system testing. Note that there is no industry-wide agreement on these exact terms, but the general concepts are recognized.

In typical object-oriented designs, unit testing involves isolating a given class, driving its state and/or behavior, and verifying that we see what we expect. Integration testing is a layer above that, where we use multiple classes in the tests. Functional testing is a layer above that again, where we deploy our programs in the natural components that they would inhabit in production, like a server or a process. Finally, system testing covers testing in the full production-like environment.

Unit Testing


Our focus will be on the lowest level, namely unit testing. The expectation is that unit tests are both numerous (think many hundreds or even thousands) and extremely fast (think milliseconds). To be effective, these tests should be run on every single compile on every developer's machine across the organization. The goal is that the unit tests precisely capture the design intent behind the implementation of the class code, and that any violation of that intent results in immediate feedback to the developer making code changes.

I'd like to tell you that every developer is doggedly focused on both the quality of the production logic and the thoroughness of the unit tests that back that logic. Unfortunately, through a combination of poor training, lack of emphasis at the management level, and just plain laziness, developers produce tests that range from great all the way down to downright destructive (more on that in another blog entry). One of the easiest ways to track this testing from the outside is through code coverage.

Code Coverage


Code coverage is a set of metrics that can give developers and other project stakeholders a sense of how much of the production logic has been exercised by the unit tests. The simplest metric is "covered lines of code", also known as line coverage. This is usually a percentage: if a class has 50 lines of code and 60% line coverage, then 30 lines of that production logic are executed when the unit tests for that class run. There are other coverage metrics that can help you gauge the goodness of your tests, like branch coverage, class coverage, and method coverage. But here, we will focus on line coverage since that is the most widely used.
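The arithmetic behind that percentage can be sketched in a few lines of Java. The class and method names here are hypothetical and purely for illustration; they are not the API of any coverage tool.

```java
// Hypothetical helper for illustration; not part of any coverage tool.
final class LineCoverage {

    /** Line coverage as a percentage: executed lines over total executable lines. */
    static double percent(int coveredLines, int totalLines) {
        if (totalLines <= 0) {
            throw new IllegalArgumentException("totalLines must be positive");
        }
        if (coveredLines < 0 || coveredLines > totalLines) {
            throw new IllegalArgumentException("coveredLines must be between 0 and totalLines");
        }
        return 100.0 * coveredLines / totalLines;
    }

    public static void main(String[] args) {
        // The example from the text: 30 of 50 lines executed by the tests.
        System.out.println(percent(30, 50) + "%");
    }
}
```

Note that the number only says that a line was executed. It says nothing about whether anything checked the result of executing it.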

The general, common-sense assumption is that "more is better", so misguided management and deranged architects insist on 100% code coverage, thinking that would give the maximum confidence that the quality of the code is high. If we had an infinite amount of time and money to spend on projects, this conception could represent the optimum. Since this luxury has never been available in the last 4 billion years, we have to spend our money wisely. And this changes things drastically.

The truth is that it might cost M dollars+time to achieve, say, 80% line coverage, but it might take M *more* dollars+time to get that last 20%. In that case each percentage point in the last 20% costs four times as much as each point in the first 80%. In some cases, getting the last few percentage points might be extremely expensive. There are several reasons for this non-linear cost.

First, production logic should be tested through its public interface where possible rather than through a protected or private interface. It can be laborious to construct the conditions necessary to hit a line of code buried in try/catches and conditional logic behind public interfaces. This cost can be lowered by refactoring the code towards better testability, but this is a continuous struggle as new code is produced. There is a truism among veteran developers that increasing the testability of the production logic improves its design.
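As a minimal sketch of the problem, consider a catch block that only runs when a collaborator fails. Everything here (`ConfigLoader`, `FailingReader`, the `"default"` fallback) is a made-up example, not code from a real project. The happy path is covered by any ordinary test, but the catch block stays uncovered until someone writes a test double whose only job is to fail on demand.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical production class: the catch block is the hard-to-reach code.
final class ConfigLoader {
    static final String DEFAULT = "default";

    /** Returns the first character of the input, or DEFAULT if empty or unreadable. */
    static String firstChar(Reader in) {
        try {
            int c = in.read();
            return c < 0 ? DEFAULT : String.valueOf((char) c);
        } catch (IOException e) {
            // Only reachable when the reader itself fails.
            return DEFAULT;
        }
    }
}

// Hypothetical test double written solely to reach the catch block above.
final class FailingReader extends Reader {
    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        throw new IOException("simulated failure");
    }

    @Override
    public void close() {
        // Nothing to release.
    }
}

final class ConfigLoaderDemo {
    public static void main(String[] args) {
        System.out.println(ConfigLoader.firstChar(new StringReader("x")));  // happy path
        System.out.println(ConfigLoader.firstChar(new FailingReader()));    // catch block
    }
}
```

Here the method accepts a `Reader`, so the failure is easy to inject. If the method instead opened a file internally, reaching the same catch block through the public interface would be far more work, which is where refactoring for testability pays off.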

Second, some code has high complexity, commonly measured as cyclomatic complexity (roughly, the number of independent paths through the code). Arguably this code should be refactored, but projects do have a certain percentage of their code with high cyclomatic complexity that gets carried forward from sprint to sprint.
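To make the measure concrete, here is a small hypothetical method with its decision points marked. Using the common counting rule of one plus the number of decision points (with each `&&` or `||` counted as a decision, as many tools do), this method scores 5. Exact counting rules vary between tools, so treat the number as approximate.

```java
// Hypothetical example; the pricing rules are invented for illustration.
final class Shipping {

    /** Cyclomatic complexity is about 5: one base path plus four decision points. */
    static int cost(int weightKg, boolean express, boolean international) {
        int cost = 5;
        if (weightKg > 20) {                    // decision 1
            cost += 10;
        }
        if (express && !international) {        // decisions 2 and 3 (if, &&)
            cost += 7;
        } else if (express) {                   // decision 4
            cost += 15;
        }
        return cost;
    }

    public static void main(String[] args) {
        System.out.println(cost(25, true, true));
    }
}
```

Even this toy method needs several distinct inputs before every line has run, and the number of tests needed grows quickly as conditions are nested.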

The third reason is a bit technical. Languages like Java are compiled into bytecode. Code coverage tools typically work from an analysis of the bytecode, not the source code. The Java compiler will consume the source code and emit bytecode that may have extra logic in it, meaning code with extra branches. It might not be possible to control the conditions which would take one path or the other through such an invisible branch. Further complicating matters, the invisible logic can change from one Java compiler release to the next, putting a burden on the test logic to reverse engineer the conditions needed to cover it.
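One well-known instance is a `switch` on a `String`. The Java compiler typically translates it into a lookup on `hashCode()` followed by an `equals()` check, so the bytecode contains a branch for "the hash matched but the string did not" that appears nowhere in the source. The sketch below uses a made-up `Command` class and relies on the fact that `"Aa"` and `"BB"` have the same hash code. The exact bytecode depends on the compiler version, and some coverage tools filter out certain compiler-generated branches, so what a given report shows may differ.

```java
// Hypothetical example of a source-level switch that hides a bytecode branch.
final class Command {

    static int code(String name) {
        switch (name) {
            case "Aa":
                return 1;
            default:
                return 0;
        }
    }

    public static void main(String[] args) {
        // "Aa" and "BB" collide: both hash to the same int value.
        System.out.println("Aa".hashCode() == "BB".hashCode());
        // Hits the hidden "hash matches, equals() fails" path and falls to default.
        System.out.println(code("BB"));
    }
}
```

A test author has to know about the collision to exercise that path on purpose, which is exactly the kind of reverse engineering that makes the last few percent so costly.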

Summary


Based on the above discussion, achieving 100% line coverage can be very expensive. On teams that I have worked on over the years, a reasonable line coverage would be 70% or more. But you should let the development team determine this limit. If you force your teams to get to 100% line coverage, you are spending money that might be better spent on automation tests. In addition, I have seen cases where developers will short-circuit the unit tests by writing tests only for the purpose of increasing the coverage. You can readily identify these tests because they have no assertion or verification check in them - they just make a call and never check on the result.
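The difference between the two kinds of test is easy to see side by side. This is a sketch in plain Java rather than a test framework, and `Discount` and both test methods are invented for illustration. Both tests produce identical line coverage for `apply`, but only the second can ever fail.

```java
// Hypothetical production class.
final class Discount {
    static int apply(int priceCents, int percentOff) {
        return priceCents - priceCents * percentOff / 100;
    }
}

final class DiscountTests {

    // Coverage-only "test": executes every line of apply() but checks nothing.
    static void coverageOnly() {
        Discount.apply(1000, 25);
    }

    // Real test: same call, same coverage, but the result is verified.
    static void appliesPercentage() {
        int actual = Discount.apply(1000, 25);
        if (actual != 750) {
            throw new AssertionError("expected 750 but was " + actual);
        }
    }

    public static void main(String[] args) {
        coverageOnly();
        appliesPercentage();
        System.out.println("both tests passed");
    }
}
```

If someone breaks `apply` tomorrow, `coverageOnly` will keep passing and the coverage report will look exactly as healthy as before.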

In short, you should be careful what you ask for. Make sure you involve the development team in making the decision about code coverage. Spending another 50% of scarce testing dollars on that last 10% of coverage is unlikely to bring a return on investment.