Archive for July, 2010

Leaner than agile: Better products more quickly and cheaply

Friday, July 9th, 2010

This post advocates the use of a ‘lean’ software development process for web-based products and services. I outline why those involved in product marketing and software development should consider using a lean approach in their organisation. Where I use the word ‘product’ you can happily substitute ‘service’ if it suits you better.

I caveat the above with “web-based products and services” because the web offers fantastic opportunities to measure how customers value products and features. Value can be measured in various ways, for example by measuring usage and goal attainment, running split tests and gathering feedback. At least one of these feedback mechanisms is essential to the lean approach.

So what do we mean by a ‘lean’ approach? The term ‘lean’ is borrowed from lean manufacturing processes (such as the Toyota Production System). Wikipedia defines Lean Manufacturing:

Lean manufacturing or lean production, often simply, “Lean,” is a production practice that considers the expenditure of resources for any goal other than the creation of value for the end customer to be wasteful, and thus a target for elimination

For software development, my approach to lean is to:

Take the smallest possible step that can test an assumption or idea; move quickly in small increments and learn from each step we make; minimize work in progress and quickly learn what our users value.

This enables us to quickly determine what adds value whilst wasting less development time on features which do not. In practice this usually means minimizing work in progress, releasing early and often, measuring and testing value to the user, then iterating – quickly. Here’s the Wikipedia entry for Lean Software Development for others’ takes and perspectives.

Values and processes

Before we go into this in depth: I’m not saying one software development process is universally better than another. Every business is different and puts a different value on different things. The perfect process for a particular company can’t be found in a textbook or blog post; to a certain extent it requires some trial and error to match the company’s values, and those values may change over time as the company grows. Use the right tool for the job at hand.

When I use the word ‘values’ I’m not talking about some airy-fairy spiritual thing. Some businesses have customers that value stability and no change (or very slow change) in the product offering. Some services should be straightforward and just work – always, no exceptions, ever. Other businesses will make amazing gains from adapting their products quickly to delight their customers and win business from their competitors. Startups usually fall into the latter category (particularly early-stage startups). For them there is immense value in being able to put new features and products out quickly until they find the right markets and the right products/features for those markets.

For me, one of the primary duties of the product development team is to maximise return on investment or, put in less scary language, their effectiveness. In the context of product development, I define effectiveness as:

Effectiveness = value added / cost of adding that value

Our aim is to increase the value of our product from our users’ perspective (hopefully also leading to more users) at a reasonable cost. You can think of cost as money or time or a combination of both – the gist is the same.
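As a rough illustration of the effectiveness ratio, here is a minimal Python sketch. The feature names, the value figures and the cost figures are all invented for the example, and the units (say, extra revenue per month and developer-days) are assumptions rather than anything prescribed above.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    value_added: float  # assumed unit: estimated extra revenue per month
    cost: float         # assumed unit: developer-days spent

    @property
    def effectiveness(self) -> float:
        # Effectiveness = value added / cost of adding that value
        return self.value_added / self.cost


def rank_by_effectiveness(features):
    """Return the features sorted from most to least effective."""
    return sorted(features, key=lambda f: f.effectiveness, reverse=True)


# Hypothetical features with made-up numbers.
features = [
    Feature("one-click signup", value_added=1200.0, cost=4.0),
    Feature("redesigned dashboard", value_added=3000.0, cost=30.0),
    Feature("email digest", value_added=500.0, cost=2.0),
]

for feature in rank_by_effectiveness(features):
    print(f"{feature.name}: {feature.effectiveness:.1f}")
```

In this made-up data the dashboard adds the most value in absolute terms, yet it ranks last because it costs so much more to build. The hard part in practice is estimating value at all, which is what the measurement and feedback discussed here are for.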

Adding value requires discovery of what adds value

To increase value, it is essential that we find out which features users actually use and which features either help or encourage them to complete goals. This is an investigative process, a search that involves lots of trial, measurement, refinement, retrial and so on. For a startup, product development is a race: you want your product to be valued by users as quickly as possible. Learning from measurement, testing and feedback is *key* to the lean approach.

Discovering which features users really value, or how features ought to work, is rarely as simple as *just* asking them. Ask a sociologist: people often don’t do what they think and say they do. Over the years I’ve done a fair bit of split testing and of testing by observing users complete test tasks. In both cases we’d frequently be surprised at which features users did and did not use. Do gather and encourage verbal/written feedback from your users, but use it to inspire investigation: as a catalyst, not as sufficient evidence in itself.

The quick cycle of feedback followed by another iteration is essential here. With the non-lean approach you might get lucky, hit upon something and add a dramatic amount of value, but then again you might not. The same can be said for the lean approach, but the cost of finding out is lower and you can learn and try again quickly. The non-lean approach will probably lead to increased value. However, it probably won’t maximise value, or do so quickly and at minimal cost.
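For readers who have not run a split test, the sketch below shows one common way to compare two variants: a two-proportion z-test on conversion counts. This is a standard statistical technique rather than a method described in this post, and the function name and the visitor numbers are hypothetical.

```python
import math


def split_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rates of variants A and B.

    Uses a two-proportion z-test and returns
    (rate_a, rate_b, z, two_sided_p_value).
    Assumes at least one conversion and one non-conversion overall,
    otherwise the standard error below would be zero.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Made-up numbers: 1000 visitors per variant, 100 vs 130 reaching the goal.
rate_a, rate_b, z, p_value = split_test(100, 1000, 130, 1000)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")
```

A small p-value suggests the difference between the variants is unlikely to be down to chance alone. It says nothing about why users behaved differently, so the result is best treated as one more prompt for investigation.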

Minimize Work-in-progress and avoid queue bloat

Minimizing work-in-progress improves effectiveness by reducing the amount of work which does not add value. For software, by work-in-progress I don’t just mean what the developers are working on at that particular moment; I also include anything they have recently finished which has not yet been released.

What happens when people have to wait for the software development team to get their new features developed and released? Let’s define ‘lag’ as the time from requesting some work to the release of that work. Let’s consider how high lag leads to bloat and waste.

Often customers of software teams are like very hungry people waiting for their first meal in a day or two. Once they get served, they’re gonna eat. They’ll eat plenty more than they really need and become bloated. So, using the royal ‘I’… My development slot doesn’t come up for another 6 weeks, and even then the work won’t get released for, say, another 6 weeks. Any new features or improvements I want won’t get released for another 12 weeks, so I’m going to stuff my slot with as much as I can. This probably means I’ll ask for a larger slot of development time than I really need, making others wait even longer, increasing the general hunger and leading to more bloat. The queue of work starts to grow with work that is not really necessary.

High lag puts people off trying alternatives. If you don’t try alternatives then you’re not searching for the best solution, not innovating. Instead, people polish their new baby in the hope that it’ll succeed in what may be its one shot at success (at least for a fairly long time).
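One way to see why a bloated queue and high lag go together is Little’s Law from queueing theory, which is not mentioned in the original post: over the long run, in a stable system, average lag equals the average number of items waiting or in progress divided by the average release rate. The sketch below applies it to a hypothetical team; the function name and all the numbers are assumptions.

```python
def average_lag_weeks(items_in_system: float, released_per_week: float) -> float:
    """Little's Law: average time in the system = items in system / throughput."""
    if released_per_week <= 0:
        raise ValueError("throughput must be positive")
    return items_in_system / released_per_week


# Hypothetical team that releases 2 items per week on average.
bloated = average_lag_weeks(items_in_system=24, released_per_week=2)
lean = average_lag_weeks(items_in_system=4, released_per_week=2)
print(f"bloated queue: {bloated} weeks, lean queue: {lean} weeks")
```

With the same team and the same release rate, the only thing that changed between the two cases is how much work was allowed to pile up.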

Predictability vs Effectiveness

Organisations often put too much emphasis on predictability to the detriment of effectiveness (value added / cost of adding that value). They like the certainty that work will begin on features A, B, C, X, Y and Z in 6 weeks’ time and that 6 weeks after that these features will be released. Hit those timescales with those features and nobody can have the finger of blame pointed at them. Except, of course, the people who are responsible for growing the business fast, before it runs out of money or gets beaten to it by a competitor.

It’s not just bloat; there’s another problem with release cycles of 6 weeks or more. People become frustrated with the lack of progress and blame the busy development team. In their hunger, people start attempting to jump the queue. Jumping the queue leads to ill feeling and bad decisions. Some may argue that the solution is just to ban queue jumping, to staunchly refuse it. But resisting queue jumping is just treating a symptom rather than addressing the cause, which is a genuine need. I think queue jumping is inevitable in a business that has customers and competitors and is measuring, testing, listening and responding to the world around it. Priorities change as we learn more about things; it’s a fact of life.

Organising Lean

So let’s treat the cause rather than the symptom. If our approach is to learn from the customer by trialing features and ditching or iteratively improving them based on measurement, tests or feedback, then we’re learning. If we’re learning then our current priorities will change as we learn more. If our priorities change, then it’s important to be adaptive. The key to being adaptive is low lag, that is, being responsive and minimising work in progress – a lean approach.

With lean we are searching for how to add value. It’s a search, so by definition we don’t know everything up front before we begin. A common question about lean processes is what a project looks like and how it is organised. I like to define projects in terms of their goals rather than their steps. So for example, I might initially define a project as the goal, “Increase the percentage of forum users providing answers to other users’ questions”. Then, working with the team, we’d think about ways of doing this and how to measure and test them, all the time looking to keep the steps as small as possible.

You will likely have multiple projects going at the same time, depending upon the size of your team and how long it takes to collect sufficient data or other feedback, and you will switch between them as things are learnt and next steps are determined. I’ve found the best way to organise this is as a list or work ‘pipeline’ with the following sections:

Backlog => (Pending => In progress => Ready) => Released

Some people call this a Kanban chart or board and use an actual board with post-its. Personally, I prefer a shared document or a dedicated agile project tracking system such as Pivotal Tracker.

New work requests are added to the backlog, which is kept in priority order. Like an agile process, before the next iteration begins, the decision makers should review the backlog and determine the highest priority items which will progress into the pending section at the start of the next iteration.

The sections enclosed in parentheses – Pending, In progress and Ready – are the work-in-progress. We minimize this by releasing early and often.
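The pipeline above can be modelled in a few lines of Python. This is only a sketch of the idea: the `Pipeline` class, its method names and the example work items are hypothetical, and a shared document or a tool such as Pivotal Tracker does the same job without any code.

```python
STAGES = ["Backlog", "Pending", "In progress", "Ready", "Released"]
WIP_STAGES = ("Pending", "In progress", "Ready")


class Pipeline:
    def __init__(self):
        self.stages = {stage: [] for stage in STAGES}

    def request(self, item, priority):
        """Add a work request to the backlog (lower number = higher priority)."""
        self.stages["Backlog"].append((priority, item))
        self.stages["Backlog"].sort(key=lambda entry: entry[0])

    def start_iteration(self, slots):
        """Move the highest-priority backlog items into Pending."""
        chosen = self.stages["Backlog"][:slots]
        self.stages["Backlog"] = self.stages["Backlog"][slots:]
        self.stages["Pending"].extend(item for _, item in chosen)

    def advance(self, item):
        """Move an item one stage to the right, e.g. Ready -> Released."""
        for index, stage in enumerate(STAGES[1:-1], start=1):
            if item in self.stages[stage]:
                self.stages[stage].remove(item)
                self.stages[STAGES[index + 1]].append(item)
                return
        raise ValueError(f"{item!r} is not in Pending, In progress or Ready")

    def work_in_progress(self):
        """Count everything that has been started but not yet released."""
        return sum(len(self.stages[stage]) for stage in WIP_STAGES)


board = Pipeline()
board.request("answer-count badge", priority=2)
board.request("email prompt for unanswered questions", priority=1)
board.request("new forum theme", priority=3)
board.start_iteration(slots=2)
board.advance("email prompt for unanswered questions")  # Pending -> In progress
print(board.work_in_progress())
```

Note that an item still counts as work-in-progress while it sits in Ready; the count only drops when the item is actually released.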

Summary

So in summary, the aim of lean software development is to improve effectiveness (return on investment), adding maximum value through measurement and feedback, and reducing or eliminating work on features which do not add value.

  • Always take the smallest possible step necessary to determine if we’re adding value

    Keep asking the question, is this really the minimum step in better understanding if this feature will be valued by users?

  • Learn. Measurement and feedback are king

    Get feedback, that is: usage data, split testing and/or testing with users.

  • Move quickly

    Minimise work in progress and release often.

  • Iterate and investigate. Innovation is a search, don’t be afraid

    The quicker you move the more you can try; the more you try the more likely you are to get a result. Remember you are measuring and getting feedback so you’ll know if you make things worse. As you’re moving quickly, you can put things right quickly too.

  • Be responsive

    If you are not responsive, fear will take hold and bloat will gradually kill effectiveness (value/cost).

Website performance: Concurrency and its evil brother latency

Tuesday, July 6th, 2010

There was a lot of buzz about Ruby on Rails a few years ago; many would say there still is. I find Rails’ focus on productivity very attractive, so a few years ago, for web work, I shifted over from Java to Rails. One significant difference between the two environments is how concurrent requests are handled, in particular the potential impact of high latency requests. This post isn’t about Java vs Ruby. However, the differences and similarities between their deployments did get me thinking about the impact of latency and concurrency on web performance, and wondering about other approaches such as the event-based approach of Node.js.

I use the term “high latency” to mean requests where the response doesn’t complete in under a couple of seconds, though arguably a couple of seconds is not particularly quick.

In our Java web systems we had lots of threads available. We didn’t tend to worry all that much about the odd high latency request as there were always plenty of other threads to serve other requests. In reality though, that was perhaps a little naive. In many applications there’s plenty of opportunity for a thread to consume and hold resources needed by other threads, e.g. database connections.

Moving to Ruby on Rails was an eye-opener for lots of positive reasons, but (at the time) I was surprised to discover that Ruby servers are effectively single-threaded. There is threading in the de facto Ruby 1.8 implementation (MRI, Matz’s Ruby Interpreter), but the threads are green threads scheduled inside the interpreter rather than native operating system threads, so a blocking call in a C extension such as a database driver can stall the whole process. The threading model effectively means that processes are used for concurrency – one request per process at a time. On many systems (but by no means all) spawning processes can be slow, and processes also consume memory, even if it is just the web stack sitting idle. As such, there’s a relatively low limit on how many Ruby processes can co-exist compared with the number of threads available to a similar Java web system running on the same hardware.

The one-request-per-process model makes me much more paranoid about high latency requests than I was in my Java days. However, looking back, I wonder if our old Java systems would have benefited from greater attention to eliminating high latency requests. Server crashes tend to creep up on you, and often it’s not obvious what is causing them because it’s not one thing but a gradual degradation in responsiveness that eventually reaches a point at which it tips into a server crash or stall. I rather grandly think of this as the “tipping point of crashes”.

You see a similar effect on busy motorways (highways, autobahns...). Think of the lanes as being like the processes available to service requests from browsers concurrently. If there is a slow car in one lane, the lane starts backing up and vehicles start switching to the other lanes. Capacity is reduced and the likelihood of a slow car entering one of the other lanes increases. The problem effectively cascades until suddenly everything stops. Once the congestion has passed and you’ve travelled a few miles more you may wonder what caused the problem in the first place; there’ll be no obvious cause.

So how do we solve this? Well, continuing the motorway analogy, more lanes can obviously help. With more lanes you reduce the chance of a slow car affecting others; you move the tipping point further out, so a single slow car is less disruptive. However, the problem is still slow cars. Take the slow car off the road and the motorway copes with no problems. Herein lies the danger: whilst lanes help, don’t let more lanes fool you into ignoring slow cars.
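The motorway analogy can be turned into a toy simulation. The sketch below is a deliberately simplified model, not a measurement of any real server: requests arrive at a fixed interval, each one goes to whichever worker (lane) frees up first, and every Nth request is a slow car. The function name, its parameters and all the timings (in milliseconds) are assumptions chosen for illustration.

```python
import heapq


def longest_wait(num_workers, num_requests, arrival_interval, fast, slow, slow_every):
    """Return the longest time any request waited for a free worker.

    Every `slow_every`-th request takes `slow` ms; the rest take `fast` ms.
    A `slow_every` of 0 means there are no slow requests at all.
    """
    free_at = [0] * num_workers  # time at which each worker next becomes free
    heapq.heapify(free_at)
    max_wait = 0
    for i in range(num_requests):
        arrival = i * arrival_interval
        is_slow = slow_every > 0 and (i + 1) % slow_every == 0
        service_time = slow if is_slow else fast
        start = max(arrival, heapq.heappop(free_at))  # earliest free worker
        max_wait = max(max_wait, start - arrival)
        heapq.heappush(free_at, start + service_time)
    return max_wait


# 200 requests, one every 100 ms; normal requests take 200 ms, slow ones 5000 ms.
no_slow_cars = longest_wait(4, 200, 100, 200, 5000, slow_every=0)
four_lanes = longest_wait(4, 200, 100, 200, 5000, slow_every=20)
eight_lanes = longest_wait(8, 200, 100, 200, 5000, slow_every=20)
print(no_slow_cars, four_lanes, eight_lanes)
```

In this toy model, four workers cope without any queueing until one request in twenty becomes slow, at which point waits start to build. Doubling the number of workers relieves the congestion, but so does removing the slow requests, and that costs no extra lanes.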

Node.js is interesting because it doesn’t follow the common model of a process or thread per connection. Node.js provides an asynchronous framework which, put in very plain language, means the thread doesn’t sit around waiting for databases, filesystems, messaging systems etc. to reply to requests. Instead it registers a callback to be called when the database/filesystem etc. responds, and the thread proceeds to process the next request.

Programming a system like this is a bit of a paradigm shift for most web developers; it is more like programming a GUI. However, decoupling concurrent processing of requests from the number of processes/threads should result in lower memory footprints and fewer problems with occasional high latency requests blocking out new requests and causing congestion problems.
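To give a feel for the event-based model without using Node.js itself, here is a small sketch using Python’s standard asyncio library. It uses coroutines and `await` rather than explicit callbacks, but the principle is the same: a single thread starts a request, hands control back to the event loop while the backend is “busy”, and picks the request up again when the reply arrives. The handler name and the delays are invented, and `asyncio.sleep` merely stands in for a database or filesystem call.

```python
import asyncio


async def handle_request(name, backend_delay, finished):
    # Awaiting gives control back to the event loop instead of blocking the
    # thread, so other requests can make progress in the meantime.
    await asyncio.sleep(backend_delay)  # stand-in for a slow backend call
    finished.append(name)


async def main():
    finished = []
    # All three requests are served concurrently by one thread.
    await asyncio.gather(
        handle_request("slow report", 0.5, finished),
        handle_request("fast page 1", 0.05, finished),
        handle_request("fast page 2", 0.1, finished),
    )
    return finished


order = asyncio.run(main())
print(order)
```

The slow request is started first but finishes last, and the two fast requests are not held up behind it. That only holds while the waiting is genuinely non-blocking; a handler that does heavy computation or makes a blocking call would still stall everything on that thread.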

I’m planning to have a go with Node.js pretty soon, so it’ll probably form the basis of a future post here. In the meantime, here’s a good example explaining Node.js.

So my tip: no matter what the system, always pay attention to high latency requests, particularly if you are having occasional “mysterious” crashes.