Test your story point estimates with real data

From velocity to throughput in scrum teams.

I used velocity as a metric for planning and forecasting in agile teams for many years, before switching to throughput and cycle time. Should you switch too?

What is velocity?

Velocity is an efficiency measure widely used within the Scrum practitioner community. It is not formally part of Scrum itself; nonetheless, velocity is a fairly standard practice in Scrum teams, typically measured in estimated story points and charted in burn-up/burn-down charts.

How does it work? In short, the complexity of each Product Backlog Item (PBI) is estimated on a pre-determined scale of story points, for instance: 1, 3, 5, 8, 13. The estimated PBIs are then loaded into sprints at the sprint planning meeting. During the sprint, each time a PBI is moved to Done, its story points are tracked on burn-down or burn-up charts, and the total story points burnt down at the end of the sprint is your actual velocity. Burn-down charts usually also display an ideal work-remaining line. So, let’s say in your last sprint you moved to Done eight items weighing 5 story points each: that’s a velocity of 40 story points, and that becomes the capacity indicator for your team’s future sprints.
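As a minimal sketch, the velocity arithmetic boils down to a sum (the numbers below simply restate the eight-item example above):

```python
# Hypothetical sprint: story points of the PBIs moved to Done
# (eight items estimated at 5 points each, as in the example above).
completed_points = [5, 5, 5, 5, 5, 5, 5, 5]

# Velocity is simply the sum of story points completed in the sprint.
velocity = sum(completed_points)
print(velocity)  # 40
```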

Fig 1. Example of Burn-Down Velocity Chart

The Problem with Velocity as a metric, or “In God we trust, all others must bring data”

So far, so good. But if story points are estimations, I asked myself, how good are they?

Before answering, I need to warn you. There are different interpretations of the metric and its meaning within the agile community, and heated flame wars often pop up online, even if you merely attempt to experiment with hard data to see whether your hypothesis is grounded in reality. So much for PDCA and Inspect and Adapt!

Anyway, just so we are on the same page, I will use the definition given by Ron Jeffries, the inventor of story points: “So a story would be estimated at three points, which meant it would take about nine days to complete.” By that definition, story points are direct measures of duration and elapsed time.

Since we used them as a planning tool for the sprint, to estimate which work packages, and how many, we should pick up next, how could I check how useful they were? How consistent were those estimated elapsed times with reality? I decided to compare the estimated story points with the actual development times, to see how well they correlated overall. So I set up an experiment to check the quality of my team’s estimations.

The experiment

  • 1) Collect story points and cycle times (activity durations) from the last three sprints into an Excel table.
  • 2) Clean outliers, such as activities that went off the rails due to previously unknown factors.
  • 3) Plot the collected data on a scatterplot, with story points on the x axis and duration on the y axis.
  • 4) Draw a regression line and calculate R-squared (R²).
  • 5) Set pass/fail criteria. Every team and environment is different, but your target should be an R² as close to 1 as possible. I set myself a target of R² > 0.8 for my team’s estimates.
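The regression and R² calculation can be sketched in a few lines of Python with NumPy; the story-point and cycle-time values below are illustrative, not my team’s real data:

```python
import numpy as np

# Illustrative (made-up) data: estimated story points vs. measured
# cycle times in days, collected over a few sprints.
story_points = np.array([1, 1, 3, 3, 5, 5, 8, 8, 13])
cycle_days = np.array([2, 1, 4, 6, 7, 9, 11, 14, 20])

# Fit a least-squares regression line: cycle_days ~ slope * points + intercept.
slope, intercept = np.polyfit(story_points, cycle_days, 1)

# Compute R-squared from the residuals of the fitted line.
predicted = slope * story_points + intercept
ss_res = ((cycle_days - predicted) ** 2).sum()
ss_tot = ((cycle_days - cycle_days.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.2f}")  # compare against your pass/fail target
```

With well-correlated data like this sample, R² lands near 1; a diffuse cloud of points would push it toward 0.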

You can visually appreciate the difference between good estimations (Fig. 2a) and bad estimations (Fig. 2b).

To my dismay, my results were closer to Fig. 2b, and showed a very low correlation between estimated story points and real cycle times.

But Wait, story points are not about time and duration!

Some in the agile community believe that story points cannot be considered a measure of the duration of a PBI/task, and should instead be treated as a relative, not absolute, measure of complexity and effort.

But how many times has it happened to you that the effort put into developing a PBI of 13 story points was equal to thirteen times the effort of a PBI of 1 story point? If your answer is never or very rarely, continue reading.

A Leaner approach

The problem, however, was not story points; the problem was predicting single-point durations in non-repetitive, complex domains.

So, are we doomed? We have to answer the “When will it be done?” and “How much work will be done?” questions, yet we cannot create reliable estimates. Luckily, there is a different approach, built upon measuring throughput and probabilistic thinking.

The alternative, based on queueing theory and lean management, is the model lately adapted to become a core part of the Kanban method. It’s based on Little’s law and four metrics: throughput, cycle time, WIP, and item age. In this article, I’ll briefly discuss Little’s law and the use of throughput and WIP in Scrum teams. I’ll cover cycle time, item age, and forecasting in a future post.

Little’s law

Little’s law, in its original form, is expressed mathematically as:

L = λW

  • L = average number of customers in a system
  • λ = average arrival rate
  • W = average time that a customer stays in the system

To use the law in a knowledge-work domain with agile teams, we must borrow from the lean/operations management community, which in the early 1990s reformulated it with a stronger perspective on throughput, as follows: CT = WIP/TP, or the equivalent WIP = CT × TP.

  • WIP = average number of work items in the system
  • CT = average cycle time, the time work items spend in the system. Definitions vary across organizations, particularly on where to place the start and finish points that bound CT
  • TP = throughput rate, meaning the average number of work items closed in a given time period

There are some caveats, though. We must be conscious of the assumptions that make Little’s law work: the values are meant to be long-term averages, and the system we are measuring has to be stable, meaning arrival rates are more or less equal to exit rates in the long run.

The law looks very simple, but there is a lot beneath the surface. If you are in for some math and queuing theory, check the paper John D. C. Little of MIT himself wrote in 2011 for the 50th anniversary of the law.

In any case, for our purposes, we will use Little’s law to show that in a stable system, an increase in WIP tends to increase cycle time and worsen the throughput rate, all other conditions being equal.
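A quick numeric illustration of the CT = WIP/TP form (the throughput and WIP figures here are invented for the sake of the example):

```python
# A stable team closing, on average, 2 PBIs per day.
throughput = 2.0  # TP: work items finished per day

# Little's law, CT = WIP / TP: with throughput held constant,
# doubling the average WIP doubles the average cycle time.
ct_at_wip_10 = 10 / throughput  # 5 days average cycle time
ct_at_wip_20 = 20 / throughput  # 10 days average cycle time

print(ct_at_wip_10, ct_at_wip_20)
```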

The first steps

Once you’ve made up your mind, the first step is to start collecting throughput data. In both Scrum and Kanban, the throughput rate is calculated as the total number of PBIs moved to Done in a given time period, typically a day.

In Scrum, you can also look at throughput through the eyes of the sprint. Since you plan a certain number of PBIs for the sprint, you count the number of PBIs closed during the sprint, a concept that seems very similar to velocity. The difference from velocity is that here we sum real data, not estimations. It’s worth noting that your main view on throughput should be the daily one: you want to track throughput day by day. This way you get the right granularity and a consistent unit of measure for your metrics.

How do you calculate work in progress (WIP) in a Scrum setting? You define it as the number of PBIs included in the sprint backlog that are still waiting to be closed. As you move PBIs to Done, your throughput increases and your WIP decreases. You can track both WIP and throughput in a single throughput/WIP run chart.
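A minimal sketch of the bookkeeping behind such a run chart; the daily counts are invented, but chosen to match the sprint in the example that follows (25 PBIs planned, 14 closed):

```python
# Hypothetical sprint: 25 PBIs planned; daily_done[i] is the number
# of PBIs moved to Done on working day i (invented data).
planned = 25
daily_done = [1, 2, 0, 1, 3, 1, 2, 0, 2, 2]

wip = planned
for day, done in enumerate(daily_done, start=1):
    wip -= done  # WIP shrinks as items move to Done
    print(f"day {day:2d}: throughput={done}, WIP={wip}")

total_throughput = sum(daily_done)  # PBIs closed over the whole sprint
print(total_throughput, wip)        # 14 closed, 11 left in WIP
```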

The new chart gives you a data-based indication of your team’s capacity (i.e. a daily throughput rate of 1) and shows your daily progress. We can see from the chart that, with a throughput of 1, we closed 14 PBIs in the sprint, but we planned for 25, hence the WIP of 11 remaining at the end of the sprint. Why was that? Did we have team members on sick leave? Any unforeseen impediments? Were we too optimistic about the team’s capacity? Did unknown factors pop up?
In the next article, I’ll dig deeper into the lean measurement approach in Scrum teams, to understand teams’ operations and to correctly set up solid forecasting using Monte Carlo simulations.

What you can do from now

I suggest you run the experiment yourself and, depending on the results, decide whether you want to switch to a lean/Kanban approach. As we know, every team is different, so if story points work for you and your team, that’s fine, keep using them. Otherwise, you can try the approach I’m sharing here. The choice is yours.