Wednesday, April 18, 2007

Five Nines in a Service Oriented World - Part 2(c) - Understanding the minimum operating requirements

Okay, so it's "nice" to have the preferred form of address for a customer when you put up the order response page. But should you display an "order not processed" message just because the service that tells you whether it's "Mr Jones" or "Steve Jones" isn't available, or should you just say "Dear Customer" instead? Clearly, for something like that, the answer is to go with "Dear Customer".
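As a sketch of that decision in code (the name-lookup service here is a hypothetical stand-in, not a real API), the call gets a short timeout and a hard fallback so the order page can never be held hostage by a nicety:

```java
import java.util.concurrent.*;

// A minimal sketch of degrading gracefully when a nice-to-have service is down.
// CustomerNameService and preferredAddress are illustrative assumptions.
public class SalutationExample {

    interface CustomerNameService {
        String preferredAddress(String customerId) throws Exception;
    }

    private final CustomerNameService nameService;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    SalutationExample(CustomerNameService nameService) {
        this.nameService = nameService;
    }

    // Never let a "nice to have" lookup block or fail the order response page.
    String salutation(String customerId) {
        Future<String> call = executor.submit(() -> nameService.preferredAddress(customerId));
        try {
            // Short timeout: if the answer isn't quick, it isn't worth waiting for.
            return "Dear " + call.get(250, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            call.cancel(true);
            return "Dear Customer"; // degrade, don't fail the order
        }
    }

    public static void main(String[] args) {
        // Stub service that is "down": every lookup throws.
        SalutationExample page =
            new SalutationExample(id -> { throw new Exception("service unavailable"); });
        System.out.println(page.salutation("42")); // prints "Dear Customer"
        page.executor.shutdown();
    }
}
```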

This isn't limited to trivial data-formatting elements, however; it can apply to things that appear much more critical. The key here is understanding how to aggressively degrade the application. That means not just planning for when a call fails, but actively making things fail: reducing the operation of the system to a minimal subset, and ensuring that the reliability of that subset is maximised.

With the failure lists above, it was a case of something failing and the calling service coping; with this approach it is a case of deliberately not accessing services and operating as if they never existed. Why would you want to do this? To reduce the risk of knock-on effects from failures in terms of system resources, livelocks, deadlocks and data errors. If you can shut down into a "safe" mode of operation you can at least keep the lights on and keep the core business running.
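One way to sketch such a safe mode (the mode names and service tags below are illustrative assumptions, not from any particular product) is to tag each dependency with the most degraded mode in which it is still worth calling, then flip a single runtime switch to make the optional calls stop happening entirely:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// A minimal sketch of a "safe" operating mode: drop the mode and every
// optional dependency stops being called, as if it never existed.
public class OperatingMode {

    enum Mode { NORMAL, DEGRADED, SAFE }

    private static final AtomicReference<Mode> current = new AtomicReference<>(Mode.NORMAL);

    // The most degraded mode in which each dependency is still called.
    private static final Map<String, Mode> stillUsedIn = Map.of(
        "order-core", Mode.SAFE,         // the minimum operating condition: always on
        "payment", Mode.DEGRADED,        // dropped only in full safe mode
        "recommendations", Mode.NORMAL   // first thing to be turned off
    );

    static void setMode(Mode mode) { current.set(mode); }

    // Deliberately not calling a service: in SAFE mode only SAFE-tagged calls survive.
    static boolean shallICall(String service) {
        Mode limit = stillUsedIn.getOrDefault(service, Mode.NORMAL);
        return limit.ordinal() >= current.get().ordinal();
    }

    public static void main(String[] args) {
        setMode(Mode.SAFE);
        System.out.println(shallICall("order-core"));      // true
        System.out.println(shallICall("recommendations")); // false
    }
}
```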

As an example, take a system that runs the core production line at a drug company, with a number of surrounding systems that can change what is being made and to which the production line reports. The critical factor is keeping the line operating, since every hour it isn't working means lost revenue for the company. Here the safe mode means designing the system so it can operate successfully without any links to external systems. This could involve log files being shipped after hours, or even tapes being couriered to another data centre. It also means understanding how long this form of operation can continue, giving IT and the business time to put contingency plans in place. The point here is that the manufacturing line is the minimum operating condition: if there is any risk that external factors could force it to operate below peak efficiency, those factors should be ruthlessly shut down and the core allowed to continue operating.
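A minimal sketch of that decoupling might look like the following, where every report the line produces lands on a local spool file first (the file location and record format are assumptions for illustration), and shipping it onwards, whether over the wire after hours or on couriered tape, happens entirely out of band:

```java
import java.io.IOException;
import java.nio.file.*;
import java.time.Instant;

// A minimal sketch of keeping the production line reporting locally when all
// external links are cut: records go to a local spool, shipped later.
public class ProductionSpool {

    private final Path spool;

    ProductionSpool(Path spool) { this.spool = spool; }

    // The line never blocks on a remote system: every report lands on local disk.
    void report(String record) throws IOException {
        String entry = Instant.now() + "\t" + record + System.lineSeparator();
        Files.writeString(spool, entry,
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        ProductionSpool s = new ProductionSpool(Path.of("line-reports.log"));
        s.report("batch=42 units=10000 status=OK");
        // A separate after-hours job ships the file; the line keeps running regardless.
    }
}
```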

This sort of approach is very important when dealing with 3rd party systems, where there could be legal or trust issues that require you to shut down access in a hurry, either because you believe the remote system has been compromised or because it is not meeting its SLAs.

The difference between minimal operation and planning for failure is that the services might not actually have failed; an operational decision is made to work without them. This is the critical difference. Planning for failure is about coping with failures as they happen operationally and implementing a mitigation plan. Deliberately degrading the system is a business and technical decision that the risk of failure or information error outweighs the benefits.

Another example of this would be when a system has to handle dramatically increased load due to an external event, maybe a surge in demand from an overly successful marketing campaign, or an external problem that has caused a surge in exception conditions. Rather than the system taking the classic "normal" two-step IT solution to this problem:
  1. Specify the hardware to a level so large it bankrupts the company
  2. When the system falls over because the company refused to go into bankruptcy go "na, na, told you so"
There is another choice here, which is to enable the business to have the "oh shit" moment and then start turning off the pieces they decide are currently non-critical, operating just the core required to handle the unexpected event. For an exception surge this might mean concentrating resources on the support functions; for a marketing campaign it might mean taking the business decision to take the orders and batch up the payments at the end of the day, or preventing people from searching for products because 95% of them are looking for the specific item from the campaign. A sketch of the payment batching option follows below.
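Here is how that batching decision might look in code (the Payment type and the surge flag are illustrative assumptions): orders are always taken, and while the surge flag is on, the payment step is queued for an end-of-day batch rather than settled inline:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// A minimal sketch of turning off a non-critical step under load: take the
// order now, queue the payment, settle the batch once the surge has passed.
public class PaymentBatching {

    record Payment(String orderId, long amountPence) {}

    private final AtomicBoolean surgeMode = new AtomicBoolean(false);
    private final Queue<Payment> deferred = new ConcurrentLinkedQueue<>();

    void takeOrder(Payment payment) {
        if (surgeMode.get()) {
            deferred.add(payment);   // keep taking orders; settle later
        } else {
            settleNow(payment);      // normal synchronous path
        }
    }

    // Run once the surge has passed, e.g. the end-of-day batch.
    void settleBatch() {
        Payment p;
        while ((p = deferred.poll()) != null) {
            settleNow(p);
        }
    }

    private void settleNow(Payment p) {
        System.out.println("settling " + p);
    }

    void enterSurge() { surgeMode.set(true); }
    void leaveSurge() { surgeMode.set(false); }

    public static void main(String[] args) {
        PaymentBatching shop = new PaymentBatching();
        shop.enterSurge();
        shop.takeOrder(new Payment("order-1", 1999)); // queued, not settled
        shop.takeOrder(new Payment("order-2", 2499));
        shop.leaveSurge();
        shop.settleBatch(); // end of day: settles both
    }
}
```

The business keeps its revenue-taking core running at full speed; the deferred step is a conscious trade of settlement latency for capacity.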

Designing a system so elements can be deliberately failed is a big change in the way applications are built today, but in a distributed SOA environment it is going to become more and more important.

Old ways won't work. So what does it take to actually do this in a system? First off, it means understanding what each service calls, and having a simple "Shall I call" check that can be toggled at runtime (not exactly difficult these days); if the answer is "don't" then you need a mitigation plan in place. Stage 1 is to put the check in place and have no mitigation, as in the sketch below. This is the cheap first step that enables your system to adapt as such challenges become your operational reality. The important bit is to really understand what the actual core operation of your business is, so you can plan for that rather than creating an SOA environment that is rich and dynamic when everything is fine, and buggered when there are issues.
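A stage-1 sketch of that check might look like the following; the guard, service names and fallback behaviour are assumptions for illustration, not a real framework:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// A minimal sketch of the stage-1 "Shall I call" check: every outbound call
// is wrapped in a guard that can be flipped off at runtime, with a fallback
// value standing in for a fuller mitigation plan.
public class CallGuard {

    private static final Map<String, Boolean> enabled = new ConcurrentHashMap<>();

    // Flip at runtime, e.g. from an admin console or a config poll.
    public static void setEnabled(String service, boolean on) {
        enabled.put(service, on);
    }

    // Stage 1: guard with no real mitigation - callers just get the fallback.
    public static <T> T call(String service, Supplier<T> remoteCall, T fallback) {
        if (!enabled.getOrDefault(service, true)) {
            return fallback; // deliberately degraded: operate as if it never existed
        }
        return remoteCall.get();
    }

    public static void main(String[] args) {
        // Normal operation: the remote call happens.
        System.out.println("Dear " + call("name-service", () -> "Mr Jones", "Customer"));

        // Operational decision: turn the service off without any failure occurring.
        setEnabled("name-service", false);
        System.out.println("Dear " + call("name-service", () -> "Mr Jones", "Customer"));
    }
}
```

The point of stage 1 is simply that the question can be asked at all: once the guard exists, adding real mitigations per service is an incremental job rather than a rewrite.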

It's not a tough question to ask the business... but it's one I've rarely seen asked.

