Take a scientific approach to analyse and compare methods of estimating effort using story points.
Sarah Fan · Jan 7, 2021
Agile has been widely used in software development and project delivery. Although story points are the most popular way to estimate effort in Sprint Planning, they can be misused and mismeasured because they lack a thorough definition. Management may use them to measure a team's productivity or effectiveness, which leads delivery teams to inflate story points, game the system, and adopt other anti-agile behaviours. This reflects a misunderstanding of the meaning and purpose of story points.
There are extensive materials and discussions available online on this topic. In summary, agile is a project management philosophy that guides how projects are managed iteratively through a set of key values and principles.
Extreme Programming (XP) and Scrum are commonly used agile methods. XP evolved from the problems caused by the long development cycles of traditional development models, and “theorised” its key principles and practices after a number of successful trials in practice (e.g. continuous integration and automated testing, frequent small releases that incorporate continual customer feedback, and a teamwork approach). The “extreme” refers to taking these principles and practices to extreme levels. See the life cycle of the XP process in the diagram below.
Scrum is an empirical approach applying the ideas of industrial process control theory to systems development resulting in an approach that reintroduces the ideas of flexibility, adaptability and productivity. It encourages teams to learn from experience, self-organise, and reflect on their wins and losses for continuous improvement through a set of ceremonies (i.e. Sprint Planning, daily stand-up, iteration review and retrospective).
Both methods have similar processes and use story points for effort estimation. They both aim for small teams. XP is recommended for teams of between three and twenty project members, while Scrum is suited to fewer than ten engineers.
According to the process maps above, both XP and Scrum have a planning phase in which development team members discuss each prioritised backlog item, collectively estimate the effort required to complete it, and then make a Sprint forecast outlining how much work the team can achieve within the Sprint. The collective effort estimation is where story points come in. Story points represent the overall effort required to fully implement a product backlog item or any other piece of work. In the Scrum literature, effort is a multi-faceted construct consisting of risk, complexity and repetition.
Contradictory Views on Story Points
The effort estimated in a Sprint is a latent concept, meaning it cannot be directly observed or measured, unlike observable concepts such as temperature and distance. Therefore, it is not surprising to see different, even contradictory, views on how effort should be estimated, particularly on whether story points should be a function of time.
Ron Jeffries is one of the co-founders of the Extreme Programming (XP) software development methodology and one of the original signatories of the Agile Manifesto. In his 2019 post titled “Story Points Revisited”, he touched on the inception of the story point. In XP, stories used to be estimated in time using “Ideal Days” (i.e. implementation time in days without interruption). In reality, it is nearly impossible to find a day without meetings and other distractions, and therefore Ron and his team multiplied ideal days by a “Load Factor” to address this. A rule-of-thumb load factor is three, that is, three actual days to get one ideal day’s work done.
However, this confused stakeholders as they kept hearing people talking about taking three days to get a day’s work done since people usually left “ideal” out. As a result, they abstracted “ideal days” by renaming it to “points” and they “really only used the points to decide how much work to take into an iteration anyway”. Inherently, the story point is a reflection of the time needed to complete work.
On the other hand, in Scrum, story points are believed to be independent of time, as the excerpt below from the Scrum Framework shows:
The Scrum Framework itself does not prescribe a way for the Scrum Teams to estimate their work. The teams who rely on the Scrum Framework do not deliver their estimates of user stories based on time or person-day units. Instead, they provide their estimates by using more abstract metrics to compare and qualify the effort required to deliver the user stories.
However, I argue that because a Sprint is capped at a fixed time frame, using past velocities to define the scope of stories for the next Sprint (e.g. on average the delivery team can deliver x story points per Sprint) translates to x story points being completed within y ideal hours or days. Ultimately, each story point equals z ideal hours or days, even though the calculation isn’t done explicitly. Therefore, a story point is associated with time.
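The implicit conversion described above can be written out as simple arithmetic. The following is a minimal sketch, not something prescribed by Scrum or XP; the function name and the sample figures (120 ideal hours per Sprint, an average velocity of 30 points) are assumptions chosen only for illustration.

```python
def implied_ideal_hours_per_point(ideal_hours_per_sprint: float,
                                  average_velocity: float) -> float:
    """Return z = y / x: the ideal hours implicitly represented by one story point."""
    if average_velocity <= 0:
        raise ValueError("average velocity must be positive")
    return ideal_hours_per_sprint / average_velocity


# Hypothetical figures: y = 120 ideal hours available per Sprint,
# x = 30 story points delivered per Sprint on average.
z = implied_ideal_hours_per_point(120, 30)
print(z)  # 4.0 ideal hours per story point
```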
How story points are used implies the scale of measure their numeric values must support, and therefore provides evidence as to which estimation method is more appropriate. Story points are usually used to calculate velocity. Velocity is the rate of progress of a Scrum team: the sum of story points that the delivery team completes per Sprint. When calculating velocity, only entirely completed user stories that fulfil their Definition of Done are counted. A stable velocity is desired, as velocity is used by the product owner, who works with the delivery team, to
- predict the throughput of the delivery team better and therefore to determine if a story can be included or excluded from a Sprint,
- plan software releases more accurately (this requires all the user stories making up a project to be estimated consistently), and
- discover any issues in the agile practice (e.g. a see-sawing velocity pattern indicates the need for a finer-grained decomposition of stories).
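The velocity rule described above (count only stories that meet their Definition of Done, then sum their points) can be sketched as follows. The `Story` class, the story titles and the past velocities are hypothetical and exist only to make the example self-contained.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Story:
    title: str
    points: int
    done: bool  # True only if the story meets its Definition of Done


def sprint_velocity(stories: list[Story]) -> int:
    """Sum the points of fully completed stories; partial work counts as zero."""
    return sum(s.points for s in stories if s.done)


# Hypothetical Sprint: the unfinished 8-point story contributes nothing.
sprint = [
    Story("login form", 3, done=True),
    Story("password reset", 5, done=True),
    Story("audit log", 8, done=False),
]
print(sprint_velocity(sprint))  # 8

# Average velocity over hypothetical past Sprints, usable as a forecast.
past_velocities = [21, 18, 23, 20]
print(mean(past_velocities))  # 20.5
```

Note that summing and averaging points in this way already assumes the points sit on at least an interval scale, which is the subject of the next section.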
Jeff Sutherland, a co-founder of Scrum, emphasised in a post the importance of velocity, as it indicates the units of production per Sprint, which is a precondition to revenue:
Not knowing the velocity of team production is the root cause of 100% failure of [accurate release plans] in their board meetings.
Scale of Measure
In statistics, there are four types of data measurement scales: nominal, ordinal, interval and ratio.
- The nominal scale is a qualitative measurement, and can be called “labels” or “categories”. Even if numeric values are assigned to labels or categories (e.g. Yes = 1 and No = 0, jersey numbers of football players), the numbers are only used to categorise or identify elements. The only permissible numeric operation on a nominal scale is counting.
- The ordinal scale is a qualitative measurement that reports the ranking and ordering of data without establishing the size of the interval between them. For example, when measuring customer satisfaction (e.g. very unsatisfied = 1, unsatisfied = 2, neutral = 3, satisfied = 4 and very satisfied = 5), the ordinal scale is used as a comparison parameter to understand, by sorting, whether one value is greater or lesser than another. Because the interval is unknown, summation and subtraction are not meaningful, e.g. neutral minus unsatisfied doesn’t equal very unsatisfied.
- The interval scale is a quantitative measurement. Data on this scale is ordered and has equal intervals, so differences, the mean and the median are meaningful. The zero point is arbitrary, so values can be negative (e.g. temperature). Numbers can be added or subtracted but not meaningfully multiplied or divided (e.g. Net Promoter Score).
- The ratio scale is a quantitative measurement and is the most informative scale. Data on this scale has the characteristics of the interval scale plus an absolute zero point, and therefore cannot be negative. Numbers can be meaningfully added, subtracted, multiplied and divided (e.g. age, weight, distance). All statistical measures, including the mean, mode and median, can be calculated on the ratio scale. Another characteristic of the ratio scale is that it allows unit conversion (e.g. 8 ideal hours = 1 ideal day).
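The practical difference between the interval and ratio scales can be checked numerically. The sketch below uses temperature (interval) and distance (ratio), the same kinds of examples given in the list above; the helper names are illustrative only.

```python
import math


def celsius_to_fahrenheit(c: float) -> float:
    return c * 9 / 5 + 32


def km_to_miles(km: float) -> float:
    return km * 0.621371


# Interval scale: zero is arbitrary, so a ratio changes with the unit.
ratio_c = 20 / 10                                                # 2.0
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)  # 68 / 50 = 1.36
print(ratio_c, ratio_f)

# Ratio scale: zero is absolute, so a ratio survives unit conversion.
ratio_km = 20 / 10
ratio_mi = km_to_miles(20) / km_to_miles(10)
print(math.isclose(ratio_km, ratio_mi))  # True
```

"Twice as hot" is therefore not a meaningful statement about 20 °C and 10 °C, whereas "twice as far" is meaningful for 20 km and 10 km.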
Based on the characteristics of the four scales of measure discussed, the summation of story points suggests that the story point, in terms of the scale of measure, is a quantitative measurement and should be on either the interval scale or the ratio scale. The scale of measure requires the story points to be assigned appropriately.
The Fibonacci scale is commonly used for story points to address risk and uncertainty. The Fibonacci sequence is a series in which each number is the sum of the two preceding ones, starting from 0 and 1. The beginning of the sequence is as follows:
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ……
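For reference, the sequence above can be reproduced with a few lines of code. This is a minimal sketch; the function name is chosen for illustration.

```python
def fibonacci(count: int) -> list[int]:
    """Return the first `count` Fibonacci numbers, starting from 0 and 1."""
    sequence = []
    a, b = 0, 1
    for _ in range(count):
        sequence.append(a)
        a, b = b, a + b  # each new number is the sum of the two preceding ones
    return sequence


print(fibonacci(12))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
```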
There are variations of this sequence for practical reasons. For example,
0, 1, 2, 3, 5, 8, 13, 20, 40 and 100
There are several benefits of using the Fibonacci scale:
- It reduces the cognitive effort. Estimation at best is an educated guess based on knowledge and experience, and there is always something unknown until working on a task.
- It increases the accuracy of effort estimation. This may sound counter-intuitive, since one would generally expect a more granular scale such as 1, 2, 3, 4, 5, etc. to yield better accuracy than a less granular one. The explanation is two-fold:
  - When two large numbers are close to each other, they are harder to distinguish as estimates than two small numbers (e.g. 13 and 14 vs. 2 and 3 in days). People tend to argue more about whether a task is worth 2 or 3 days of effort than about the one-day difference between 13 and 14 days, even though the difference is the same.
  - Risk and uncertainty should be considered in the estimation as well to reflect the true effort. For example, when an estimated effort is 4, following the Fibonacci sequence, 5 will be assigned to address the potential risk and uncertainty.
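The rounding-up rule in the last point (a raw estimate of 4 becomes 5) can be sketched as a lookup on the scale. This is an illustrative sketch that assumes the modified scale quoted earlier; the function name and the choice to raise an error above 100 are assumptions, not part of any standard.

```python
from bisect import bisect_left

# Modified Fibonacci scale quoted in the text above.
SCALE = (0, 1, 2, 3, 5, 8, 13, 20, 40, 100)


def round_up_to_scale(raw_estimate: float, scale: tuple[int, ...] = SCALE) -> int:
    """Return the smallest scale value that is >= the raw estimate."""
    index = bisect_left(scale, raw_estimate)
    if index == len(scale):
        raise ValueError("estimate exceeds the scale; consider splitting the story")
    return scale[index]


print(round_up_to_scale(4))   # 5
print(round_up_to_scale(8))   # 8
print(round_up_to_scale(14))  # 20
```

The gap that the rounding adds grows with the size of the estimate, which is how the scale builds in more allowance for larger, riskier items.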
Therefore, for better estimation, it is recommended to use the smaller points on the Fibonacci scale. In our projects, we usually observed a cutoff value above which most tasks/stories were more likely not to be completed within a Sprint. This is a clear sign that we overestimated our capability to complete those tasks/stories (a common human tendency), and that the task/story needs to be broken down into smaller ones. Breaking stories down alleviates the overestimation and spreads the workload across multiple Sprints and/or people.
Due to the different views on the story point and the under-defined steps to estimate, it is not surprising to observe several methods used to estimate story points in practice.
Use the First Story as a Benchmark
Assign a number to the first story. Every other story in this Sprint is then compared to the first story. Future Sprints repeat this process and align on the same scale. For example, if a story is about the same amount of work as one you have already sized, give it the same number of points. The effort estimation is therefore clearly relative.
Pros
- The human brain is good at comparing, and therefore this method imposes less cognitive load.
- Since each story only needs to be compared with the first story, the estimation is relatively lightweight in cognitive processing and time-efficient.
- Being independent of development time, this method rewards team members for solving problems and focusing on value delivery.
- The story points are on the interval scale, so summation, subtraction and the median are meaningful. The interval scale is sufficient for the velocity calculation.
Cons
- The initial point assigned to the first story is entirely arbitrary. Even though the value doesn’t matter as all the following points are assigned relatively, the initial point and story do matter as they set up the story point system’s scale.
- Although it is easy to compare the efforts between tasks, it is difficult to gauge the magnitude of difference as the numeric values don’t pertain to anything directly measurable/observable.
- This method doesn’t translate story points into the time required to complete them, and therefore can’t answer a common question from stakeholders: how long will feature A take to complete?
T-Shirt Sizes
Use multiple sizes such as extra-small (XS), small (S), medium (M), large (L) and extra-large (XL) to estimate the effort at a high level. Each size corresponds to a value from the Fibonacci sequence, e.g. XS = 1, S = 2, M = 3, L = 8, XL = 13.
Pros
- As it gives a quick and rough estimate for how much work is expected for a project, it is time-efficient.
- Being independent of time, this method rewards team members for solving problems and focusing on value delivery.
Cons
- It is unclear how risk, complexity and repetition contribute to the size estimation, and it is therefore difficult to achieve consistency over time.
- The T-Shirt sizes are by nature ordered categories, corresponding to the ordinal scale. When converting to the Fibonacci sequence as story points, the value assignment is arbitrary. Someone can easily challenge why a size gets assigned one value instead of another.
- Summation, subtraction and averaging of the story points are not meaningful. For example, does the difference in effort between a medium story and a small story equal an extra-small story? What does a velocity of 20 story points mean, e.g. two large and two small stories?
- This method doesn’t translate story points into the time required to complete them, and therefore can’t answer a common question from stakeholders: how long will feature A take to complete?
Note: Some other methods use labels other than T-Shirt sizes to refer to effort, such as animals and gummy bears. Essentially they’re the same idea.
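The ambiguity of a summed velocity can be illustrated by enumerating which mixes of sizes add up to 20 points. The sketch below uses the example mapping given in this section (XS = 1, S = 2, M = 3, L = 8, XL = 13); the function name and the cap of four stories are assumptions made to keep the output short.

```python
from itertools import combinations_with_replacement

# Size-to-point mapping taken from the example in this section.
POINTS = {"XS": 1, "S": 2, "M": 3, "L": 8, "XL": 13}


def size_mixes_for(velocity: int, max_stories: int) -> list[tuple[str, ...]]:
    """List every mix of sizes (up to max_stories) whose points sum to velocity."""
    matches = []
    for n in range(1, max_stories + 1):
        for combo in combinations_with_replacement(POINTS, n):
            if sum(POINTS[size] for size in combo) == velocity:
                matches.append(combo)
    return matches


matches = size_mixes_for(velocity=20, max_stories=4)
for combo in matches:
    print(combo)
```

Among the results are two small plus two large stories and one extra-small, one medium and two large stories, so the same velocity describes quite different Sprints.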
Use Ideal Hours or Days
This method is from the XP methodology. For each story, the delivery team discusses how many ideal days or hours it requires. The ideal days or hours, expressed using the Fibonacci sequence, can be based on the average time a dev needs, or the time an average dev needs. To determine when a feature can be finished, we can use a load factor (e.g. one ideal day equals 3 actual days) or a percentage (e.g. assume that only 70% of the time in a day is available for actual work) to convert the ideal days or hours into actuals.
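The two conversions mentioned above can be sketched as follows. The function names are hypothetical, and the default values are simply the examples from the text (a load factor of 3 and 70% of the day available for actual work).

```python
def actual_days_from_load_factor(ideal_days: float, load_factor: float = 3.0) -> float:
    """Multiply ideal days by a load factor (3 is the rule of thumb cited earlier)."""
    return ideal_days * load_factor


def actual_days_from_focus(ideal_days: float, focus_fraction: float = 0.7) -> float:
    """Divide ideal days by the share of each day available for actual work."""
    if not 0 < focus_fraction <= 1:
        raise ValueError("focus_fraction must be in (0, 1]")
    return ideal_days / focus_fraction


print(actual_days_from_load_factor(5))      # 15.0
print(round(actual_days_from_focus(5), 1))  # 7.1
```

The two conventions are not interchangeable: a load factor of 3 corresponds to roughly 33% of the day being available, so a team should pick one and calibrate it against its own history.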
Pros
- Using ideal hours or days to estimate effort promotes story points to the ratio scale. This allows meaningful comparison between story points in various ways, such as summation, subtraction, multiplication, division, median, etc. For example, a story of 8 story points requires four times the effort of a story of 2 story points. The ratio scale also allows producing any metric of interest, e.g. error metrics.
- The measurement of effort is better defined and easier to explain.
- The three aspects of effort can be easily captured by time.
- The average times needed for specific tasks can be generalised directly, e.g. 2 ideal days for writing tests, 0.5 ideal days for documentation.
Cons
- Humans are not good at estimating effort in time, as we are inclined to overestimate our capability and thus underestimate the ideal days or hours needed.
- Management may hold the hours or days against the delivery team when development falls behind.
As there is no silver bullet, every method has pros and cons, and therefore, a trade-off is inevitable. A set of guiding principles is used to assess all three methods.
Easy to Clarify and Explain Story Points Assignment
The method of estimating effort needs to be easy to clarify and explain to others. Clear communication brings transparency, which builds trust and sets expectations without misunderstanding. In planning poker, one needs to articulate why x story points, rather than y, should be assigned to a task. As a relative estimation method, using the first story as a benchmark makes it easy to identify whether a task requires more or less effort than a reference task. However, it is hard to quantify the difference in effort through description alone without relying on ideal hours or an equivalent. Methods that use T-Shirt sizes or equivalents translate categories into arbitrary story points, so the difference in points hardly reflects the true magnitude. Using ideal hours or days is the most straightforward method to estimate effort: it is easy to understand and communicate, even to management, and provides a relatively objective representation of effort.
Verdict: Use Ideal Hours or Days > T-Shirt sizes = Use the First Story as a Benchmark
Accuracy
Realistically, effort estimation is an art at best. It is impossible to get accurate figures no matter what you do. The goal is to get a good enough estimate with little effort and to improve the numbers over time. Therefore, it is critical to learn from what happened in the past (history often repeats itself), iteratively improve the next estimate, and use a simple way to factor any significant difference into the new estimate.
Even if absolute accuracy is impossible to achieve, we should pick a method likely to produce more accurate estimation since effort estimations improve the agile practice over time. There is some empirical research supporting that using ideal days or hours is more accurate.
- “Relative Estimation of Software Development Effort: It Matters With What and How You Compare” provided empirical results showing that relative estimation can result in biased assessments of similarity and over-optimistic effort estimates. Tasks tend to be assessed as more similar than they actually are, and the perceived similarity of two tasks depends on the direction of the comparison. The estimation asymmetry was also observed when estimating the population of countries.
- “Extreme Programming: A Survey of Empirical Data from a Controlled Case Study” used ideal hours to estimate effort. The study revealed a significant improvement in estimation over time (see Figure 4, where the green line is the actual hours as the reference and the blue line is the divergence of the estimated hours from the actuals). This is due to the delivery team intentionally reducing task segments to a size of between 4 and 8 hours on average to improve their work control mechanisms.
- “Scrum + Engineering Practices: Experiences of Three Microsoft Teams” later also demonstrated that the accuracy of estimation using staff/person days can improve over time. The three delivery teams, which had different characteristics, improved quality, productivity and estimation accuracy through Scrum and nine engineering practices.
Verdict: Use Ideal Hours or Days > T-Shirt sizes > Use the First Story as a Benchmark
Meaningful Metrics
Metrics are used to track status, summarise data and reveal underlying issues, and therefore they should be meaningful and measure the right things. As discussed before, story points have to be on the interval scale or the ratio scale to produce a meaningful velocity. The ratio scale additionally makes it possible to produce further metrics (e.g. estimation error metrics, productivity metrics) to monitor improvement in agile processes.
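As an illustration of an estimation error metric that becomes available once effort is on the ratio scale, the sketch below computes the mean absolute percentage error (MAPE) between estimated and actual hours. MAPE is only one common choice and is an assumption here, not a metric named in this article; the sample data is hypothetical.

```python
def mean_absolute_percentage_error(estimated: list[float], actual: list[float]) -> float:
    """MAPE between estimated and actual hours, as a percentage of the actuals."""
    if len(estimated) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a <= 0 for a in actual):
        raise ValueError("actual hours must be positive")
    errors = [abs(e - a) / a for e, a in zip(estimated, actual)]
    return 100 * sum(errors) / len(errors)


# Hypothetical data: estimated ideal hours vs. actual hours for four stories.
estimated_hours = [4, 8, 8, 16]
actual_hours = [5, 8, 10, 20]
mape = mean_absolute_percentage_error(estimated_hours, actual_hours)
print(round(mape, 1))  # 15.0
```

Tracking a figure like this from Sprint to Sprint gives a simple way to see whether estimates are improving. Dividing the error by the actual hours is a ratio operation, which is why the same calculation would not be meaningful on T-Shirt sizes.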
Verdict: Use Ideal Hours or Days > Use the First Story as a Benchmark > T-Shirt sizes
Based on the assessment above, using ideal hours or days has shown a number of advantages over other methods. If your current estimation method works for you, stick to it as your team has found a way to work well with and around it. If you are about to start on agile or your existing method doesn’t work well as expected, try ideal hours or days.