Sunday, 29 December 2013

Is Defect Removal Efficiency a Fallacy?

I noticed a tweet by Lisa Crispin in which they said they had commented on this article: http://blog.btdconf.com/?p=43. I may not agree with everything in the article, but it makes some interesting points that I may come back to at a later date. I have still not seen that comment from Lisa yet...

However I did notice a comment by Capers Jones which I have reproduced below:

Capers Jones
December 29, 2013 at 10:01 am
This is a good general article but a few other points are significant too:

There are several metrics that violate standard economics and distort reality so much I regard them as professional malpractice:
1 cost per defect penalizes quality and the whole urban legend about costs going up more than 100 fold is false.
2 lines of code penalize high level languages and make non coding work invisible

The most useful metrics for actually showing software economic results are:
1 function points for normalization
2 activity-based costs using at least 10 activities such as requirements, design, coding, inspections, testing, documentation, quality assurance, change control, deployment, and management.

The most useful quality metric is defect removal efficiency (DRE) or the percentage of bugs found before release, measured against user-reported bugs after 90 days. If a development team finds 90 bugs and users report 10 bugs, DRE is 90%. The average is just over 85% but top projects using inspections, static analysis, and formal testing can hit 99%. Agile projects average about 92% DRE.

Most forms of testing such as unit test and function test are only about 35% efficient and find one bug out of three. This is why testing alone is not sufficient to achieve high quality.

Some metrics have ISO standards such as function points and are pretty much the same, although there are way too many variants. Other metrics such as story points are not standardized and vary by over 400%.

A deeper problem than metrics is the fact that historical data based on design, code, and unit test (DCUT) is only 37% complete. Quality data that only starts with testing is less than 50% complete. Technical debt is only about 17% complete since it leaves out cancelled projects and also the costs of litigation for poor quality.

My problem is with using this metric to measure the quality of software. I feel it is useless because it seems too easy to game. Once people (or a company) are aware that they are being measured, their behaviour adjusts on a psychological level (intended or not) so that they look good against whatever they are being measured against.

So, using the example given by Capers:

Say a company wants its DRE to look good, so it rewards its testing teams for finding defects, no matter how trivial. They end up finding 1000, while the customers still only find 10.

Using the above example, that means this company can record a DRE of 99.0099%. WOW - that looks good for the company.

Now let us say they really, really want to game the system and they report 1,000,000 defects against the customers' 10. The DRE is now 99.999%, and this starts to become a worthless way to measure the quality of the software.
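The arithmetic behind these examples can be sketched in a few lines of Python. This is only an illustration of the DRE definition quoted from Capers (bugs found before release divided by all known bugs); the function name `defect_removal_efficiency` is my own and not taken from any tool or standard.

```python
def defect_removal_efficiency(found_before_release: int, reported_by_users: int) -> float:
    """Return DRE as a percentage of all known defects (hypothetical helper)."""
    total = found_before_release + reported_by_users
    if total == 0:
        # With no known defects the ratio is undefined.
        raise ValueError("DRE is undefined when no defects are known")
    return 100.0 * found_before_release / total


# The customers report 10 defects in every scenario; only the internal count grows.
for internal in (90, 1_000, 1_000_000):
    dre = defect_removal_efficiency(internal, 10)
    print(f"{internal:>9,} internal vs 10 customer defects -> DRE {dre:.4f}%")
```

The customer's experience is identical in all three cases (10 defects), yet the metric climbs from 90% towards 100% purely because more internal defects were logged.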

It does not take into account the defects that still exist and have never been found. How long do you wait before you can say your DRE is accurate? What if the client finds another 100 six months later, or a year later? And if the company's testers find another 100 after they have released to the client, how is this included in the DRE %?
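To show how much the answer depends on when you stop counting, here is a rough sketch using the numbers from the questions above. The starting figures (90 internal, 10 customer defects at 90 days) come from Capers' example; how to classify the 100 defects the company finds after release is my assumption, shown both ways, since the DRE definition as quoted does not say.

```python
def dre(pre_release: int, post_release: int) -> float:
    """DRE as a percentage: pre-release defects over all known defects."""
    return 100.0 * pre_release / (pre_release + post_release)


pre_release = 90             # found by the team before release
customer_90_days = 10        # reported by users in the first 90 days
customer_later = 100         # reported by the client six months or a year later
company_after_release = 100  # found by the company's own testers after release

at_90_days = dre(pre_release, customer_90_days)
after_late_reports = dre(pre_release, customer_90_days + customer_later)

# Assumption A: post-release finds by the company are credited as "removed".
credited = dre(pre_release + company_after_release,
               customer_90_days + customer_later)
# Assumption B: they are treated as escaped defects, like customer reports.
escaped = dre(pre_release,
              customer_90_days + customer_later + company_after_release)

print(f"At 90 days:                {at_90_days:.2f}%")
print(f"After late client reports: {after_late_reports:.2f}%")
print(f"Company finds credited:    {credited:.2f}%")
print(f"Company finds as escaped:  {escaped:.2f}%")
```

The same release scores anywhere between 30% and 90% depending on the cut-off date and the counting rule, which is why a single DRE figure says little on its own.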

As any tester would ask:
IS THERE A PROBLEM HERE?
This form of measurement does not take into account the technical debt of dealing with this ever-growing list of defects. Measuring with such a metric is flawed by design, never mind the other activities mentioned by Capers, which also have hidden costs that will quickly spiral. By the time you have adhered to all of this, given the current marketplace of rapid delivery (DevOps), your competitors have prototyped, shown it to the client, adapted to meet what the customer wants and released, implementing changes to the product as the customer desires rather than focusing on numbers that quickly become irrelevant.

At the same time I question numbers quoted by Capers such as:
  • Most forms of testing such as unit test and function test are only about 35% efficient and find one bug out of three.
  • Other metrics such as story points are not standardized and vary by over 400%.
  • Quality data that only starts with testing is less than 50% complete.
  • Plus others
Where are the sources for all these metrics?  Are they independently verified?

Maybe I am a little more cynical about seeing numbers quoted, especially after reading "The Leprechauns of Software Engineering" by Laurent Bossavit. Without any attribution for the numbers, I do question their value.

To finish this little rant, I would like to use a quote from Albert Einstein:
"Not everything that can be counted counts, and not everything that counts can be counted."

3 comments:

  1. Like "cost per defect," the described "defect removal efficiency" penalizes writing quality code. It rewards writing lots of easily found defects, i.e. sloppiness.

    Replies
    1. Thank you, George, for your comment.

      What I want to see as a tester is hard-to-find defects. I want the challenge of testing software in which 'quality' has been at the forefront from design through code and test, rather than measuring something at the end and starting to, as Lisa Crispin commented within the article, 'beat up members of the team'.

      The number of defects found or not found is not important. IMO what matters is whether what we provide to our customer is useful, is what they want, and is good enough for them to make money from. That is a better way to measure the quality of software.

  2. I share your concern about DRE. Questioning the source of the figures is a neat ploy to point out that it is a pretty worthless metric. However, I think that the source for these claims isn't really relevant. Even if the sources are respectable studies, all that they could say is that in specific cases certain results were observed. It would be a huge, and tendentious, leap from that to say that the results were generally applicable, or that they could, or should, be obtained in other projects and other organisations. I'd prefer to attack the idea that DRE can ever be useful than to chase its advocates for the source of specific claims. I can't help feeling that chasing sources would leave people with the impression that DRE might be conceptually sound and only needs a rigorous study to prove it.

    James Christie
