
Tuesday, 14 October 2014

Risk vs Uncertainty in Software Testing

Traditionally, software testing appears to be based upon risk, and many models and examples of this have been published; just search the internet for ‘risk based testing’.

The following are a few examples from a quick search:

“The objective of Risk Analysis is to identify potential problems that could affect the cost or outcome of the project.” Ståle Amland, 1999, http://www.amland.no/WordDocuments/EuroSTAR99Paper.doc

“In simple terms – Risk is the probability of occurrence of an undesirable outcome.” ISTQB Exam Certification, What is Risk Based Testing, 2014, http://istqbexamcertification.com/what-is-risk-based-testing/

“Risk = You don’t know what will happen but you do know the probabilities; Uncertainty = You don’t even know the probabilities.” Hans Schaefer, Software Test Consulting, Norway, 2004, http://www.cs.tut.fi/tapahtumat/testaus04/schaefer.pdf

“Any uncertainty or possibility of loss may result in non-conformance of any of these key factors.” Alam and Khan, 2013, Risk-based Testing Techniques: A Perspective Study, http://www.academia.edu/3412788/Risk-based_Testing_Techniques_A_Perspective_Study

James Bach goes a little deeper and introduces risk heuristics:

“Risk is a problem that might happen” James Bach 2003 Heuristics of Risk Based Testing  http://www.satisfice.com/articles/hrbt.pdf

And continues with the following statement in the 'Making it All Work' section:

“…don’t let risk-based testing be the only kind of testing you do. Spend at least a quarter of your effort on approaches that are not risk focused…”

All of the examples above look at software testing and how to focus testing effort based upon risk; they make no mention of uncertainty. I have struggled to find any software testing models or articles on uncertainty which I feel could have value to the business in software projects. There are a few misconceptions about risk and uncertainty, with people commonly mixing the two together and stating they are the same.

Some of the articles appear to follow the fallacy of mixing risk with uncertainty and attempting to measure uncertainty in the same way as risk.  The issue I find with these articles is this: how can you measure something which has no statistical distribution?

One type of uncertainty that people attempt to measure is the number of defects in a product, using complex formulas based upon lines of code or some other wonderful statistical model.  Since the number of defects in any one product is uncertain, I am unsure of the merits of such measures and their reliability.
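To make concrete the kind of formula being questioned, here is a minimal sketch of a naive defects-per-KLOC estimate; the density figure and the calculation are illustrative assumptions of ours, not a real or recommended model.

    # Naive defect-prediction sketch. The fixed density figure and the
    # formula are illustrative assumptions, not a validated model.
    DEFECTS_PER_KLOC = 15  # hypothetical defects per 1,000 lines of code

    def predicted_defects(lines_of_code: int) -> float:
        """Predict a defect count purely from code size."""
        return (lines_of_code / 1000) * DEFECTS_PER_KLOC

    # 50,000 lines -> 750.0 'defects', regardless of what a defect
    # actually means to each individual user.
    print(predicted_defects(50_000))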



The concern here is how you would define a defect.  Surely it is not based only upon the number of lines of code or the number of test cases defined, but upon the uniqueness of each and every user?  In other words, what some may see as defects others will gladly ignore and say is OK; it is the character of the program.

Let’s look at what we mean by risk and uncertainty:

  • Risk: We don’t know what is going to happen next, but we do know what the distribution looks like.
  • Uncertainty: We don’t know what is going to happen next, and we do not know what the possible distribution looks like.

Michael Mauboussin - http://www.michaelmauboussin.com/

What does this mean to the lay person?

Risk can be judged against statistical probability, for example the roll of a die.  We do not know what the outcome (roll) will be (if the die is fair), but we know the outcome will be a number between 1 and 6, each with a one in six chance.

Uncertainty is where the outcome is not known and there is no statistical probability. An example of uncertainty is what your best friend intends to eat next week on Thursday at 5pm. Can you create a probability model for that event?

Basically, risk is measurable; uncertainty is not.
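As a minimal sketch of the distinction: with a fair die we can state the odds before any roll; with the dinner question there is no distribution to state, and any model we write down is a guess (the stand-in process below is exactly that, a made-up guess).

    import random

    # Risk: a fair die. The distribution is known before any roll,
    # so the odds can be computed in advance.
    def roll_die() -> int:
        return random.randint(1, 6)

    p_any_face = 1 / 6  # known a priori

    # Uncertainty: a real-world process with no stated distribution.
    # The outcomes below are a made-up stand-in; for the real event we
    # have no basis at all for assigning these (or any) probabilities.
    def friends_dinner_next_thursday() -> str:
        return random.choice(["pizza", "soup", "skips dinner", "something new"])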

“To preserve the distinction which has been drawn in the last chapter between the measurable uncertainty and an unmeasurable one we may use the term "risk" to designate the former and the term "uncertainty" for the latter.” : - Risk, Uncertainty, and Profit  Frank Knight 1921 -  http://www.econlib.org/library/Knight/knRUP7.html

The problem is that many people see everything as a risk and ignore uncertainty.  This is not a deliberate action; it is how our brains work to deal with uncertainty. The following psychological experiment shows this effect.

The following example of the Ellsberg paradox is taken from this article: http://www.datagenetics.com/blog/december12013/index.html

_____________

Let’s play a different thought experiment. Imagine there are two urns.

  • Urn A contains 50 red marbles and 50 white marbles.
  • Urn B contains an unknown mixture of red and white marbles (in an unspecified ratio).


You can select either of the Urns, and then select from it a random (unseen) marble. If you pick a red marble, you win a prize. Which Urn do you pick from?

  • Urn A 
  • Urn B 


In theory, it should not matter which urn you select from. Urn A gives a 50:50 chance of selecting a red marble. Urn B also gives you the same 50:50 chance.

Even though we don’t know the distribution of marbles in the second urn, since it only contains red and white marbles, this ambiguity equates to the same 50:50 chance.

For various reasons, most people prefer to pick from Urn A. It seems that people prefer a known risk rather than ambiguity.

People prefer to know the risk when making a decision rather than base it on uncertainty.

Next experiment: This time there is only one urn. In this urn is a mixture of red, white and blue marbles.

There are 90 marbles in total. 30 are Red, and the other 60 are a mixture of White and Blue (in an unknown ratio). You are given a choice of two gambles:

  • Gamble 1 you win $100 if you pick a Red marble.
  • Gamble 2 you win $100 if you pick a White marble.


Which gamble do you take? Having read the section above, you will not be surprised that most people select Gamble 1. They prefer their risk to be unambiguous. A quick check of the expected value of both gambles shows they are equivalent (each with a ⅓ probability of winning). People go with the known quantity.

____________

The summary of this is that we tend to gravitate towards known risks rather than uncertainty.
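For anyone who wants to verify the claim above that the two gambles are equivalent, here is a minimal sketch of the expected-value check; treating every possible white/blue split as equally likely is our own simplifying assumption.

    # Ellsberg gambles: 90 marbles, 30 red, the other 60 split between
    # white and blue in an unknown ratio. The $100 prize is from the text.
    PRIZE, TOTAL, RED = 100, 90, 30

    # Gamble 1: win on red. The red count is fixed, so the EV is fixed.
    ev_gamble_1 = PRIZE * RED / TOTAL  # 33.33

    # Gamble 2: win on white. Average over every possible white count,
    # treating each split 0..60 as equally likely (our assumption).
    ev_gamble_2 = sum(PRIZE * w / TOTAL for w in range(61)) / 61  # also 33.33

    print(ev_gamble_1, ev_gamble_2)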

What has all of this to do with software testing?

The majority of our testing effort is spent on testing based upon risk, with outcomes that are statistically known.  This is an important task; however, does it have more value than testing against uncertainty?  Using automated tools, it is possible to test against all the possible outcomes when we are using a risk-based testing approach.  Risk is based upon known probabilities, which machines are good at calculating and working through.

Since it is difficult to predict the future of uncertain events, and we find it even more difficult to adjust our minds to looking for uncertainties, an exploratory testing approach may provide good value against uncertainties.  Tools can be of use here, such as random data generators and emulators, where the data used for testing is not based upon risk but is entirely random and can produce unexpected results.

The key message of this article is that we need to beware of confusing uncertainty with risk, and to ask ourselves: are we testing based upon risk today, or upon uncertainty?  Each has value; however, sometimes one has more value than the other.

Monday, 27 January 2014

Measuring Exploratory Testing

A quick post on a concept we are working on within our company.

One of the difficulties I have found with implementing exploratory testing is finding a way to measure, at a high level (for stakeholders), how much testing you have done.  This article looks at this problem and tries to provide a solution.  It should be noted that there are already good ways of reporting automation (checking), so for this article that is out of scope.

The way we currently manage exploratory testing is by using time-boxed sessions (session based test management), and for reporting at a project level we can (and do) use dashboards.  This leaves open the question of how much (exploratory) testing has been done against the possible amount of testing time available.

After discussions with some work colleagues, we came up with the following concept (this was a great joint collaboration effort; I cannot claim the ideas as just mine).  The basic concept of session based test management is that you time box your exploration (charters) into sessions, where one session equates to one charter (if you have not come across the terminology of charters, refer to the session based test management link).  To simplify, we use an estimate that one session is half a day (sometimes you do more, sometimes less); we therefore have a crude way to estimate the possible number of charters you could run in a period of time.

For example, in a sprint/iteration of two weeks, each person could run a possible 20 sessions; if you have 5 testers, the total possible number of sessions within your sprint is 5 * 20 = 100.  No one on a project would be utilised like this 100% of the time, so the concept we came up with is that, for your project, you set a target for how much of your team's time should be spent doing exploratory testing.  The suggestion is to begin by setting this to a value such as 25%, with the aim of increasing it as your team moves more and more into automation for the checking and exploration for the testing, the goal being a 50/50 split between checking and testing.

Using the example above, we can now define a rough metric to see whether we are meeting our target (limited by time).

If we have 2 weeks, 5 testers, and a target of 25% exploratory, then by the end of the two weeks, if we are meeting our target, we would expect to have done 25 exploratory sessions.
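A minimal sketch of the arithmetic (the function name is ours, purely for illustration):

    # Rough exploratory-session metric described above.
    # Assumes half-day sessions: 2 sessions per tester per working day.
    def possible_sessions(working_days: int, testers: int) -> int:
        return working_days * 2 * testers

    possible = possible_sessions(working_days=10, testers=5)  # 100
    target = possible * 0.25                                  # 25 sessions
    actual = 15
    actual_pct = actual / possible * 100                      # 15%, below the 25% target
    print(possible, target, actual_pct)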

We can use this to report at a high level whether we are meeting our exploration targets, within a dashboard as shown below:

Possible sessions          100
% Target Sessions          25%
Number of actual sessions  25
% Actual Target            25%
Following this format, we can then use colours (red/green) to indicate whether we are above or below our target:

Possible sessions          100
% Target Sessions          25%
Number of actual sessions  15
% Actual Target            15%
We feel this would be a useful indication of the amount of time available and the amount of time actually spent doing exploratory testing rather than checking (manual or automated).

There are some caveats that go with using this type of measurement.

Within session based test management, the tester reports roughly the amount of time they spend on:
  • Testing
  • Reporting
  • Environment set-up
  • Data set-up
This is reported as a percentage of the total time in a session, so more detailed reporting can be done within a session; however, we feel this information would be of use at a project level rather than a stakeholder level.  It is something we could revisit and come back to if it would be of use to stakeholders.

Your thoughts on this concept would be most welcome.  We see this as a starting point for a discussion that will hopefully provide a useful way to report, at a high level, how much time we spend testing compared to checking.

We are not saying this will work for everyone, but for us it is an ideal way of saying to stakeholders: of all the possible time we could have spent testing (exploratory), this is the amount of time we did spend, and these are the associated risks.

Sunday, 29 December 2013

Is Defect Removal Efficiency a Fallacy?

I noticed a tweet by Lisa Crispin in which she said she had commented on this article: http://blog.btdconf.com/?p=43.  I may not agree with everything in the article, but it makes some interesting points that I may come back to at a later date (still not seen that comment from Lisa yet...).

However, I did notice a comment by Capers Jones, which I have reproduced below:

Capers Jones
December 29, 2013 at 10:01 am
This is a good general article but a few other points are significant too: There are several metrics that violate standard economics and distort reality so much I regard them as professional malpractice: 1) cost per defect penalizes quality, and the whole urban legend about costs going up more than 100 fold is false; 2) lines of code penalize high level languages and make non-coding work invisible.

The most useful metrics for actually showing software economic results are: 1) function points for normalization; 2) activity-based costs using at least 10 activities such as requirements, design, coding, inspections, testing, documentation, quality assurance, change control, deployment, and management.

The most useful quality metric is defect removal efficiency (DRE), or the percentage of bugs found before release, measured against user-reported bugs after 90 days. If a development team finds 90 bugs and users report 10 bugs, DRE is 90%. The average is just over 85%, but top projects using inspections, static analysis, and formal testing can hit 99%. Agile projects average about 92% DRE. Most forms of testing such as unit test and function test are only about 35% efficient and find one bug out of three. This is why testing alone is not sufficient to achieve high quality.

Some metrics have ISO standards, such as function points, and are pretty much the same, although there are way too many variants. Other metrics such as story points are not standardized and vary by over 400%. A deeper problem than metrics is the fact that historical data based on design, code, and unit test (DCUT) is only 37% complete. Quality data that only starts with testing is less than 50% complete. Technical debt is only about 17% complete since it leaves out cancelled projects and also the costs of litigation for poor quality.

My problem is the use of this metric, which I feel is useless, to measure the quality of software: it seems too easy to game.  Once people (or a company) are aware that they are being measured, their behaviour adjusts on a psychological level (intended or not) so that they look good against whatever they are being measured against.

So, using the example given by Capers:

Say a company wants their DRE to look good, so they reward their testing teams for finding defects, no matter how trivial, and they end up finding 1000; the customers still only find 10.

Using the above example, that means this company can record a DRE of 99.0099%. WOW, that looks good for the company.

Now let us say they really, really want to game the system and report 1,000,000 defects against the customers' 10; the metric now becomes a worthless way to measure the quality of the software.
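A minimal sketch of the DRE formula and of how trivially an inflated internal count games it (the figures are the ones used in the examples above):

    # Defect Removal Efficiency as defined in Capers Jones's comment:
    # defects found internally before release vs. user-reported defects after.
    def dre(internal: int, user_reported: int) -> float:
        return internal / (internal + user_reported) * 100

    print(dre(90, 10))         # 90.0    : the baseline example
    print(dre(1000, 10))       # 99.0099 : reward trivial internal bug reports
    print(dre(1_000_000, 10))  # 99.999  : inflate the count, DRE looks near-perfect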

It also does not take into account the defects that still exist and are never found.  How long do you wait before you can say your DRE is accurate?  What if the client finds another 100 six months later, or a year later?  If the company's own testing finds another 100 after release to the client, how is that included in the DRE percentage?

As any tester would ask:
IS THERE A PROBLEM HERE?
This form of measurement does not take into account the technical debt of dealing with an ever-growing list of defects.  Measuring using such a metric is flawed by design, never mind the other activities mentioned by Capers, which have their own hidden costs that will quickly spiral.  By the time you have adhered to all of this, given the current marketplace of rapid delivery (DevOps), your competitors have prototyped, shown it to the client, adapted to meet what the customer wants, and released, implementing changes to the product as the customer desires rather than focusing on numbers that quickly become irrelevant.

At the same time, I question the numbers quoted by Capers, such as:
  • Most forms of testing such as unit test and function test are only about 35% efficient and find one bug out of three.
  • Other metrics such as story points are not standardized and vary by over 400%.
  • Quality data that only starts with testing is less than 50% complete.
  • Plus others.
Where are the sources for all these metrics?  Have they been independently verified?

Maybe I am a little more cynical about quoted numbers, especially after reading "The Leprechauns of Software Engineering" by Laurent Bossavit.  Without any attribution for the numbers, I question their value.

To finish this little rant, I would like to use a quote often attributed to Albert Einstein:
"Not everything that can be counted counts, and not everything that counts can be counted."

Friday, 2 August 2013

Stop Doing Too Much Automation

When researching my article on testing careers for the Testing Planet, a thought struck me about the number of respondents who indicated that ‘test’ automation was one of their main learning goals.  This made me think a little about how our craft appears to be going down a path where automation is the magic bullet that can resolve all the issues we have in testing.
I have had the idea for this article floating around in my head for a while now, and the final push came when I saw the article by Alan Page (Tooth of the Angry Weasel), Last Word on the A Word, in which he said much of what I was thinking. So how can I expand on what I feel is a great article by Alan?
The part of the article that I found most interesting was the following:

“..In fact, one of the touted benefits of automation is repeatability – but no user executes the same tasks over and over the exact same way, so writing a bunch of automated tasks to do the same is often silly.”

This is similar to what I want to write about in this article.  I see, time and time again, dashboards and metrics being shared around stating that by running this automated ‘test’ a million times we have saved a tester running it manually a million times; therefore, if the ‘test’ took 1 hour and 1 minute to run manually and 1 minute to run automated, we have saved a million hours of testing.  This is very tempting for a business that speaks in value, and in this context that means costs.  Saving a million hours of testing by automating is a significant cost saving, and it is exactly the kind of tangible measure of ROI (return on investment) that business likes to see for doing ‘test’ automation.  Worryingly, this is how some companies sell their ‘test’ automation tools.
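The naive pitch works out like the sketch below; the point is the unstated assumption, not the arithmetic.

    # The 'savings' pitch: 61 minutes manually, 1 minute automated,
    # so each automated run 'saves' an hour; a million runs, a million hours.
    manual_minutes, automated_minutes, runs = 61, 1, 1_000_000

    saved_hours = (manual_minutes - automated_minutes) * runs / 60
    print(saved_hours)  # 1,000,000.0 hours 'saved'

    # The unstated assumption: that anyone would ever have run the test
    # manually a million times. No one would, so this is marketing, not ROI.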

If we step back for a minute and re-read the statement by Alan: the thing that most people who say we should automate all testing talk about is the repeatability factor.  Now let us really think about this.  When you run a test script manually, you do more than what is written down in the script.  You think both critically and creatively; you observe things far from the beaten track of where the script was telling you to go.  Computers see in assertions: true or false, black or white, 0 or 1; they cannot see what they are not told to see.  Even with the advances in artificial intelligence, it is very difficult for automation systems to ‘check’ more than they have been told to.  To really test, and test well, you need a human being with the ability to think and observe.  Going back to our million-times example: if we ran the same test a million times on a piece of code that has not changed, the chances of finding NEW issues or problems remain very slim; however, running it manually with a different person each time, our chances of finding issues or problems increase.  I am aware our costs also increase and there is a point of diminishing returns.  James Lyndsay has talked about this on his blog, where he discusses the importance of diversity.  The article also has a very clever visual aid to demonstrate why diversity is important, and as a side effect it helps to highlight the points of diminishing return.  This is the area the business needs to focus on, rather than how many times you have run a test.

My other concern is the use of metrics in automation to indicate how many of your tests you have automated or could automate.  How many of you have been asked this question?  The problem I see is what people mean by “how many of your tests”.  What is this question based upon?  Is it...
  • all the tests that you know about now?
  • all possible tests you could run?
  • all tests you plan to run?
  • all your priority one tests?
The issue is that this is a number that will constantly change as you explore and test the system and learn more. Therefore, if you start reporting it as a metric, especially as a percentage, it soon becomes a non-valuable measure which costs more to collect and collate than any benefit it may imply.  I like to use the following example as an extreme view.

Manager:  Can you provide me with a % of all the possible tests that you could run for system X that you could automate?
Me:  Are you sure you mean all possible tests?
Manager: Yes
Me: OK, easy, it is 0%
Manager:  ?????

Most people are aware that testing can involve an infinite number of tests, even for the most simple of systems, so any number divided by infinity will be close to zero; hence the answer provided in the example scenario above.  Others could argue that we only care about how much of what we have planned can be automated, or only the high-priority stuff, and that is OK, to a point; but be careful about measuring this percentage, since it can and will vary up or down, and this can cause confusion.  As we test we find new stuff, and as we find new stuff our number of things to test increases.

My final worry with ‘test’ automation is the amount of ‘test’ automation we are doing (hence the title of this article).  I have seen cases where people automate for the sake of automation, since that is what they have been told to do.  This links in with the previous statement about measuring tests that can be automated.  There needs to be some intelligence when deciding what to automate and, more importantly, what not to automate.  The problem is that when we are measured by the number of ‘tests’ we can automate, human nature will start to act in a way that makes us look good against what we are being measured.  There are major problems with this: people stop thinking about what would be the best automation solution and concentrate on trying to automate as much as they can, regardless of cost.

What!  You did not realise that automation has a cost?  One of the common problems I see when people sell ‘test’ automation is that they forget, conveniently or otherwise, to include the hidden costs of automation.  We always see figures for the amount of testing time (and money) saved by running this set of ‘tests’ each time.  What does not get reported, and is very rarely measured, is the amount of time spent maintaining the automation and analysing the results from it.  This is important, since this is time a tester could be spending doing some testing and finding new information, rather than confirming existing expectations.  This appears to be missing whenever I hear people talking about ‘test’ automation in a positive way.  What I see is a race to automate all that can be automated, regardless of the cost to maintain it.
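Extending the earlier sketch with the costs the pitch leaves out; every figure here is an illustrative assumption, not a benchmark.

    # Adjusting the naive ROI for the hidden costs: maintaining the suite
    # and analysing its results. All figures are illustrative assumptions.
    runs_per_year = 250               # e.g. one run per working day
    minutes_saved_per_run = 60
    maintenance_hours_per_year = 400  # keeping the suite alive
    analysis_hours_per_year = 150     # reviewing and triaging results

    gross_saving = runs_per_year * minutes_saved_per_run / 60  # 250 hours
    net_saving = gross_saving - maintenance_hours_per_year - analysis_hours_per_year
    print(net_saving)  # -300 hours: the 'saving' can easily be a net cost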

If you are looking at implementing test automation, you seriously need to think about what the purpose of the automation is.  I would suggest you do ‘just enough’ automation to give you confidence that the product appears to work in the way your customer expects.  This level of automation then frees up your testers to do some actual testing, or to create automation tools that can aid testing.  You need to stop doing too much automation and look at ways to make your ‘test’ automation effective and efficient, without it being a bloated, cumbersome, hard-to-maintain monstrosity (does that describe some people's current automation system?).  Also, automation is mainly code, so it should be treated the same as code and be regularly reviewed and refactored to reduce duplication and waste.

I am not against automation at all; in my daily job I encourage and support people to use automation to help them do excellent testing. I feel it plays a vital role as a tool to SUPPORT testing; it should NOT be sold on the premise that it can replace testing or thinking testers.

Some observant readers may wonder why I write ‘test’ in this way when mentioning ‘test’ automation.  My reasons for this can be found in the article by James Bach on testing vs. checking refined.

Sunday, 30 June 2013

Measuring Test Coverage

This article follows on from my previous article on why we need to explore, and looks at how, when we simplify the constructs of software development to expectations and deliverables, measuring test coverage becomes a difficult task.  I should acknowledge that the model I use here is extremely simplified and is only used to aid clarification. There are many more factors involved, especially within the expectations section, as Michael Bolton quite rightly commented on the previous article.

If we go back to our original diagram (many thanks to James Lyndsay), which shows our expectations and our deliverable: where they meet is where our expectations are met by the deliverable.

At a simple level we could then make the following reasonable deduction:
We can express all our known expectations as 100% and therefore, for measurement purposes, say that x% of our expectations have been met by the deliverable and y% have not.  This gives us a simple metric to measure how much of our expectations have been met.  This seems very clear and could, to some people, be a compelling measurement to use within testing.  The following diagram gives a visual reference to this.

This is only half the story, since on the other side is the part where we need to do some exploring and experimentation.  This is the stuff in the deliverable that we do not know about or expect, and it is the bread and butter of our testing effort.  The problem is that we do not know what is in this area, or how big or small it is (I will return to that point later).  We are now in a measurement discomfort zone: how do we measure what we do not know?  The following diagram shows a visual representation of this.

This measurement problem is compounded by the fact that, as you explore and discover more about the deliverable, your tacit knowledge can become more explicit and your expectations start to grow.  So you end up in the following situation:

Now your expectation percentage is 100%+ and, as you explore, it keeps growing. So your percentage of meeting (or not meeting) your expectations becomes a misleading and somewhat pointless metric.
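A minimal sketch of why the percentage misleads: the denominator grows as you explore, so 'coverage' falls with no change to the product (the numbers are illustrative).

    # '% of known expectations met' is unstable because exploration
    # keeps adding to the set of known expectations.
    met = 80

    known_before = 100
    print(met / known_before * 100)  # 80.0%: looks nearly done

    # After exploring, tacit knowledge surfaces 50 new expectations:
    known_after = 150
    print(met / known_after * 100)   # 53.3%: same product, worse-looking metric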

I was asked if there is anything that could be done to increase the area where the expectations are met by the deliverable, and this led to me adding another diagram, as shown below.

**Still not to scale

Since testing could, in theory, be an infinite activity, how much testing we do before we stop is determined by many factors; Michael Bolton has a superb list in an article here.

In summary, the amount we know and expect from a piece of software is extremely small in comparison to what we do not know about the software (deliverable); hence my first post in this article series on the need to explore the system to find useful information.  We need to be careful when using metrics to measure testing progress, especially when those measurements appear easy to gather.

Further Reading on Metrics and Testing.

Monday, 21 February 2011

Measuring Testing

I saw a couple of tweets by @Lynn_Mckee recently on the metrics that are used in testing.

There are many great papers on #metrics. Doug Hoffman's "Darker Side of Metrics" provides insight on behavior. http://bit.ly/gKPHcj #testing

Ack! So many more that are painful... Scary to read recent papers citing same bad premises as papers from 10 - 15 yrs ago. #testing #metrics

And it made me think about how we measure testing.

This article is not going to be

'This is how you should measure testing’

or

offer any ‘best practice’ ways of measuring

My concern with any of the ways in which we measure is that it is done without context or connection to the question you wish to have answered with the numbers. It becomes a set of numbers devoid of any information about their ‘real’ meaning. There are many and various debates within the software testing field about what should and should not be measured. My take on all of this is:

Can I provide useful and meaningful information with the metrics I track?

I still measure the number of test cases that pass and fail, and the number of defects found and fixed.

Is this so wrong?

If I solely presented these numbers, without any supporting evidence and a story about the state of testing, then yes, it is wrong, and it can be very dangerous.

I view the metrics gathered during testing as an indication that something might be correct or wrong, working or not working. I do not know this just from the metrics; it comes from talking to the team, debriefing, and discussing issues.

I capture metrics on requirement coverage, focus-area coverage, percentage of time spent testing, defect reporting, and system setup. So I have a lot of numbers to work with, which on their own can be misleading, confusing, and misinterpreted. If I investigate the figures in detail and look for patterns, I notice missing requirements, conflicting requirements, and what is stopping me from executing testing.

So what is this brief article saying?

Within the software testing community I see that we get hung up on metrics and how we measure testing, and I feel we need to take a step back.

What you measure is not as important as how you use and present the measurements you have captured. It is the stories that go with the metrics that are important, not the numbers.