No one likes using an unreliable computer system.

There can be few things more annoying (in terms of first-world problems, anyway) than having an application freeze and losing a load of work.

Losing work is annoying, but what if the system performs a more important function, such as managing your bank account or even helping an aircraft navigate?

Systems reliability is also critical for user uptake & acceptance. A now somewhat dated study by Bailey & Pearson (1983) found the top 5 factors affecting user satisfaction with a system were accuracy, reliability, timeliness, relevancy and confidence in the system. This is also extremely satisfying as it suggests a system's function is more important than its aesthetics (which we all knew anyway), even though aesthetics often has a lot more time dedicated to it.

Building a resilient system requires knowledge from multiple disciplines, from software architecture & practices to hardware & network architecture to well defined project requirements. In this article I am going to focus on the software design aspects.

Distributed Systems

Unless you are working on a fairly trivial application, or perhaps some kind of embedded system, the chances are that your application is made up of calls to multiple systems.

Leslie Lamport defines a distributed system as:

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"

Even a simple web/database setup involves a connection between many machines. At its simplest level (ignoring DNS, networky stuff etc) there is the user, the web server & the database. Even with this minimal setup there is a lot that can go wrong!

Current trends such as cloud computing, IoT, mobile and microservices only increase the number of systems & the different types of connections between them.

The mistake we all have made/continue to make

As developers we are an optimistic bunch and tend to believe in the reliability of communication over a network. Every developer should be familiar with Deutsch's 8 fallacies of distributed computing, yet we continue to ignore these principles.

Back to resiliency

Webster's dictionary (which seemed as good as any other dictionary) defines resilient as:

"Capable of withstanding shock without permanent deformation or rupture Tending to recover from or adjust easily to misfortune or change"

http://www.merriam-webster.com/dictionary/resilient

The "recover from" part is particularly relevant for us.

One approach to building systems is to try and anticipate every possible failure & develop preventative measures. Unfortunately this doesn't tend to work out well, as it's near impossible to think of everything that could go wrong.

Did your design really anticipate the system administrator's gerbil biting through an essential network cable?

Embrace failure

Instead of attempting to run away from failure we should embrace it - because shit happens, & when it hits the fan in one system it's great if it doesn't take out every other system. Which brings us to the topic of subsystems.

Subsystems

Many systems will fall over if a problem occurs in a subsystem.

Imagine a high traffic ecommerce site. This site's main function is to sell widgets. Now marketing (and it's nearly always marketing!) want to serve/spam some adverts to users to bring in some much needed funds.

Let's say a problem occurs in this advert system. Whilst marketing may tell you (& genuinely believe) that the advert serving system is critical to the success of widgets.com, the truth is that it's more important for the company to be able to continue selling widgets than for everything to fall over because the advert system went offline.

Sadly systems are often not put together this way. In an ideal scenario (& it's one companies such as Netflix strive for) no one subsystem should be able to bring down everything.

Measures of resiliency

We can measure how reliable a system is in a number of ways.

Some possible measures (originating from engineering) are:

  • Mean Time to Failure (MTTF)
  • Mean Time to Recovery (MTTR)
  • Mean Time Between Failures (MTBF)

Mean time to failure specifies the time you can expect the system to function under certain parameters before it fails. For example an application's MTTF could be 1 month under a load of 1,000 average users (whatever "average" means; it would be important to define this).

MTTR is how long after a failure it takes to get everything working again & is arguably more important than MTTF.
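
As a rough worked example, using the common engineering approximation Availability = MTTF / (MTTF + MTTR): an application with a MTTF of 1 month (roughly 720 hours) and a MTTR of 30 minutes gives 720 / 720.5, or about 99.93% availability. Note that halving the MTTR improves availability by exactly the same amount as doubling the MTTF, and it is usually a far cheaper thing to achieve.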

Whilst 30 minutes to restart an ecommerce application may be acceptable, it would, er, be a long time for a device such as a pacemaker!

Patterns

There are a number of patterns/approaches to designing fault-tolerant software.

Many of these patterns and approaches have been known for some time, are easy to understand and relatively trivial to implement.

One of the best references for these is Hanmer's Patterns for Fault Tolerant Software (http://au.wiley.com/WileyCDA/WileyTitle/productCd-1118351541.html). Sadly this book is out of print but it is available in electronic format from several sites including Safari Online – highly recommended.

Before we look at any approaches it is important to note that implementing any of these comes with trade-offs such as:

  • Decreased performance
  • Increased complexity
  • Longer development time
  • Harder debugging

Some of the most popular patterns & approaches include:

  • Bulkheads – bulkheads are a concept from ship engineering which, amongst other benefits, limit the damage a leak can do by providing watertight compartments. When applied to systems, a bulkhead approach ensures a problem in one subsystem does not bring down others
  • Automatic retry – by automatically retrying requests we can often resolve transient/timeout type errors
  • Timeouts – timeouts are a simple mechanism that can have a massive impact on a system's stability and scalability. By defining a period after which we abandon a network call we avoid tying up a system's resources on calls that are slow to succeed or never will, and can respond to the user quickly and present alternatives
  • Circuit breaker – an extension of the timeout idea where, after a set number of failures, we stop attempting to access a system for a while (see the sketch just after this list)
  • Supervisory patterns – patterns such as heartbeat, where a system periodically updates a field to indicate it is still alive
  • Input validation – never trust any input, even from your own systems! Many systems make the mistake of assuming inputs from other internal systems are free from error – don't! Mistakes happen and you want to ensure your system does not accept or propagate them. It is worth referring to Postel's Law (https://en.wikipedia.org/wiki/Robustness_principle). Postel wrote part of the TCP specification and said "Be conservative in what you do, be liberal in what you accept from others"
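
To make the circuit breaker idea concrete, here is a minimal sketch of the state machine behind it. This is a deliberately simplified illustration (the SimpleCircuitBreaker name is made up); real implementations also need thread safety, a half-open probing state, metrics and so on:

```csharp
using System;

// Illustrative only: a deliberately simplified circuit breaker.
public class SimpleCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _failureCount;
    private DateTime _openedAt;
    private bool _isOpen;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    public T Execute<T>(Func<T> call)
    {
        // While the breaker is open, fail fast instead of hammering a sick system.
        if (_isOpen)
        {
            if (DateTime.UtcNow - _openedAt < _openDuration)
                throw new InvalidOperationException("Circuit is open - failing fast.");

            _isOpen = false; // the break period is over, allow calls again
            _failureCount = 0;
        }

        try
        {
            var result = call();
            _failureCount = 0; // a success resets the count
            return result;
        }
        catch
        {
            // After a set number of failures, stop calling the downstream system for a while.
            if (++_failureCount >= _failureThreshold)
            {
                _isOpen = true;
                _openedAt = DateTime.UtcNow;
            }
            throw;
        }
    }
}
```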

Libraries exist to assist with implementing these patterns. The .NET library Polly (https://github.com/App-vNext/Polly) makes it very easy to wrap calls and implement many of these.
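
As an illustration, wrapping a call with retry & circuit breaker policies looks roughly like the sketch below. Treat it as indicative rather than definitive – the fluent API differs a little between Polly versions & the widgets URL is made up:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

public static class ResilientWidgetCalls
{
    // The endpoint is made up purely for illustration.
    private static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(2) // timeout: don't tie up resources on calls that may never return
    };

    public static async Task<string> GetWidgetsAsync()
    {
        // Automatic retry: have another go at transient failures, backing off between attempts.
        var retry = Policy
            .Handle<HttpRequestException>()
            .Or<TaskCanceledException>()
            .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(attempt));

        // Circuit breaker: after 5 consecutive failures stop calling the service for a minute.
        var breaker = Policy
            .Handle<HttpRequestException>()
            .Or<TaskCanceledException>()
            .CircuitBreakerAsync(5, TimeSpan.FromMinutes(1));

        // Compose the policies and wrap the actual network call.
        var policy = Policy.WrapAsync(retry, breaker);

        return await policy.ExecuteAsync(() =>
            Client.GetStringAsync("http://widgets.example.com/api/widgets"));
    }
}
```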

Fallback

An important approach to designing resilient systems is to provide fallback options for when failure does occur. For example Netflix, who arguably popularized the embrace failure approach, take the following approach with their personalized movie recommendation system:

  • The system first of all attempts to use the personalized recommendation system (1st choice)
  • If the personalisation system is not available then the system attempts to use a list of currently popular movies
  • If this fails then a fixed list is used
  • Finally, if even the fixed list is not available, the call simply returns nothing but doesn't present an error to the user

This approach ensures that if any of these subsystems are down then a user can still view movies even if they don’t get the best experience possible.
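
A fallback chain like this doesn't need anything exotic – in its simplest form it is just a series of alternatives tried in order. A rough sketch follows; the class & method names are invented for illustration, not Netflix's actual code:

```csharp
using System;
using System.Collections.Generic;

// Illustrative only - the names are made up.
public class RecommendationService
{
    public IList<string> GetRecommendations(string userId)
    {
        // Try each source in order of preference, falling through on any failure.
        var sources = new List<Func<IList<string>>>
        {
            () => GetPersonalisedRecommendations(userId), // 1st choice: personalised
            GetCurrentlyPopularMovies,                    // 2nd choice: currently popular
            GetFixedFallbackList                          // 3rd choice: fixed list
        };

        foreach (var source in sources)
        {
            try
            {
                return source();
            }
            catch (Exception)
            {
                // Log the failure & quietly move on to the next option.
            }
        }

        // Last resort: show nothing extra, but never show the user an error.
        return new List<string>();
    }

    private IList<string> GetPersonalisedRecommendations(string userId)
        => throw new NotImplementedException("Call to the personalisation subsystem goes here.");

    private IList<string> GetCurrentlyPopularMovies()
        => throw new NotImplementedException("Call to the popular titles subsystem goes here.");

    private IList<string> GetFixedFallbackList()
        => new List<string> { "A few", "known good", "titles" };
}
```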

For more info see the excellent preso at https://www.youtube.com/watch?v=3D0zS3kPNUU.

Wrapping calls

Caitie McCaffrey, in her The Verification of a Distributed System presentation, quotes an awesome 2014 study by Yuan, Luo, Zhuang, Rodrigues, and Zhao (https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf) which found that:

  • 92% of catastrophic failures were due to incorrect handling of non-fatal errors
  • 35% of catastrophic failures were caused by trivial mistakes in error handling logic

This is stuff we as devs are failing at, and we can do better.

One approach (favoured by Netflix) to dealing with this is to create client libraries for accessing services that also contain the error handling logic. This has some disadvantages around maintenance & flexibility but does ensure errors are handled correctly and consistently.
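
A sketch of the idea: a thin, hypothetical client library that owns the timeout, retry & fallback behaviour in one place, so every consumer gets the same error handling. The WidgetsClient name, endpoint and fallback choice are all invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical client library for an internal "widgets" service.
// Consumers never talk to the service directly; they go through this class.
public class WidgetsClient
{
    private readonly HttpClient _client;

    public WidgetsClient(Uri baseAddress)
    {
        _client = new HttpClient
        {
            BaseAddress = baseAddress,
            Timeout = TimeSpan.FromSeconds(2) // sensible default timeout baked in
        };
    }

    public async Task<IReadOnlyList<string>> GetWidgetNamesAsync()
    {
        const int maxAttempts = 3;

        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                var json = await _client.GetStringAsync("api/widgets");
                return ParseNames(json);
            }
            catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
            {
                // Transient failure: retry, then fall back rather than surfacing an error.
                if (attempt == maxAttempts)
                    return Array.Empty<string>(); // agreed fallback: an empty list
            }
        }

        return Array.Empty<string>();
    }

    private static IReadOnlyList<string> ParseNames(string json)
    {
        // Deserialisation & input validation live here - never trust the payload blindly.
        return new[] { json };
    }
}
```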

Netflix also mandate that all access to subsystems occurs via a library called Hystrix (https://github.com/Netflix/Hystrix). Hystrix contains built-in logic to retry & report failing systems, and Netflix even go as far as to monitor that calls are made using this library.

Suggestions

In addition to software approaches the following can help ensure systems are reliable:

  • Game days – where failures are deliberately created in a system and the team tests how the system (and they themselves) handle it
  • Fault injection – deliberately making systems fail to check that the rest of the system copes with this, e.g. Netflix's Simian Army, which randomly shuts down AWS instances, introduces latency etc
  • Formal testing – an approach that uses a mathematical model to test a system's assumptions. This has often been employed for hardware & critical systems but is rarely used for typical web based applications. Amazon found the usage of formal methods uncovered a number of issues (http://research.microsoft.com/en-us/um/people/lamport/tla/formal-methods-amazon.pdf)
  • Ops embedded in teams can help ensure teams design systems better, communicate knowledge & resolve issues quicker when they occur
  • Usage of formal methods such as IronFleet may assist in building provably correct distributed systems (http://research.microsoft.com/pubs/255833/IronFleet-onecol.pdf)
  • Actor-based systems such as Akka & Project Orleans may provide an effective way to handle & isolate faults
  • Test your systems! Yuan, Luo, Zhuang, Rodrigues, and Zhao (2014) found that in 58% of catastrophic failures the underlying faults could easily have been detected through simple testing of error handling code, & that a majority of production failures (77%) could be reproduced by a unit test (https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf) – see the small example after this list
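
To make that last point concrete, here is the kind of tiny, cheap test the study is talking about. The IAdvertService & ProductPage types are invented for illustration (echoing the widgets.com example earlier), and the test uses xUnit & Moq:

```csharp
using System;
using System.Collections.Generic;
using Moq;
using Xunit;

// Hypothetical types purely for illustration.
public interface IAdvertService
{
    IList<string> GetAdverts();
}

public class ProductPage
{
    private readonly IAdvertService _adverts;

    public ProductPage(IAdvertService adverts) => _adverts = adverts;

    // The page should degrade gracefully if the advert subsystem is down.
    public IList<string> GetAdvertsOrNone()
    {
        try { return _adverts.GetAdverts(); }
        catch (Exception) { return new List<string>(); }
    }
}

public class ProductPageTests
{
    [Fact]
    public void Page_still_works_when_advert_service_is_down()
    {
        // Arrange: simulate the advert subsystem failing.
        var adverts = new Mock<IAdvertService>();
        adverts.Setup(a => a.GetAdverts()).Throws(new TimeoutException());

        var page = new ProductPage(adverts.Object);

        // Act & Assert: no exception escapes & we get an empty (not null) list.
        Assert.Empty(page.GetAdvertsOrNone());
    }
}
```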

Conclusion

Failure is inevitable and we need to design & develop from this starting point. There are many easy to implement patterns that help ensure our systems continue to function should issues occur.

Further reading

  • https://github.com/CaitieM20/TheVerificationOfDistributedSystem – note Caitie has some awesome links (many of which I have used in this article) & a number of other presentations around resiliency that everyone should check out
  • http://www.slideshare.net/ufried/patterns-of-resilience
  • http://www.slideshare.net/InesSombra/architectural-patterns-of-resilient-distributed-systems
  • http://www.slideshare.net/palvaro/lineagedriven-fault-injection-sigmod15
  • Release It! https://www.amazon.com/Release-Production-Ready-Software-Pragmatic-Programmers/dp/0978739213