The software supportability and reliability site
Supportability is not a feature
Article copied from Adi Oltean's Weblog Antimail (October 2005), with permission of the author.
I recently read about a Linux to Windows switch story. It made me remember something: supportability is crucial to the success of your software. Your programs are out there in the wild, alone. There are no helpful debuggers that you (or anyone) can hook up. There are no creative developers around who would find the real issue in five minutes.
It's not uncommon to have extremely hard requirements on what you can do while investigating that server. For example, expect that corporate servers are completely secured. Hey - a serious email server is not even connected to the internet. The actual hardware is designed to run 24/7, and any shutdown, interruptions, etc. must be planned days or weeks in advance. In many cases, you are simply not allowed to reboot the machine or to hook up a kernel-mode debugger...
And, on top of that, in some cases your customer might have very stringent privacy concerns. He won't even let you touch his machine. Sometimes he will not send you any logs, except very basic text files coming from straightforward shell commands. That's because his machine contains stuff that he wants to keep private, and you need to respect that.
Supportability is hard stuff.
But what is supportability? Simply put, it is an inherent capacity of shipped software that allows easier diagnosis of problems in the field. In other words, the capability to "be supported" in the field. (Disclaimer: there is a more general definition of supportability, but here I will focus only on the specific ways in which supportability affects the whole process of software development.)
What ultimately matters is how the support process goes, how long it is going to take, and what the costs are for your customer and your company. Of course, an important goal is to minimize both the number of support incidents and the time spent on each incident. This is a complex goal that involves many people from many different areas. There are two main misconceptions here.
First, many people mistakenly believe that supportability only concerns the support department, nothing more. Not really - the support process starts directly with the customer. If your program displays the correct failure message with the right action steps, then your customer will be able to fix the problem himself, without calling the support department. And, if the customer is able to figure out the problem by himself (or if the program fixes the problem automatically), then the operational costs are reduced, and everyone will benefit from this.
Second, supportability is ultimately a responsibility of the development team. Not only must the app surface errors the right way; another great way to slash supportability costs is for the application to handle exceptional cases without even bothering the user. For example, if you want to create a new file in c:\foo and the directory is missing, then you can just create the directory automatically. It would be a bad idea to stop the application with a cryptic error message.
How do we get there? I guess that there is no secret sauce, other than thinking really carefully about how every supported scenario is supposed to work, about every line of code you write and how it can fail, and about every possible impact on the customer experience. And we are not done yet! You also have to think about the support engineers representing your own company, or even the support teams of your company's partners (remember that OEMs have support departments too, and they might need to support your code 24/7). And if you develop code that is used by thousands of applications (for example if you develop infrastructure code, or components that are part of the OS), then you will have thousands of OEMs, vendors or independent software developers supporting code that interacts with your stuff. That's why supportability is hard to get right.
Below, I have listed a few software development principles that I've learned personally while supporting shipped software. And I guess that these principles apply to every serious company that ships software in the field - whether it's Microsoft, Red Hat or SAP.
Supportability Principle #1 - There are a lot of ways software can fail. Understand all of them (or at least, as many as you can), and deal with them appropriately.
Personally, I think that any serious shipped software (that frequently calls APIs provided by the operating system) should contain no more than 10-20% "real" code. The remaining 80-90% is code that has to deal with potential errors from any "hook up" points that can fail - Win32 APIs that fail for a reason that you didn't anticipate, bad user input, resource allocation failures like out-of-memory or insufficient disk space, or even potential bugs in your own code.
I am always suspicious when I see a different ratio. For example when I see some code intended for shipping that enumerates some files, where 90% is the actual algorithm, and only 10% deals with errors during Win32 API calls - there is something really fishy going on...
Besides, you need to add solid logging/tracing/instrumentation to any software that runs in the field. Your running code should optionally generate logs for offline analysis in a very non-intrusive manner. All this instrumentation code adds to the error-handling code.
This point is so important that I feel I can never stress it enough. Your software will run into unexpected situations surprisingly soon after you ship it. Just don't assume that bad things will not happen. In other words, be prepared to tell the customer, programmatically, the bad news. Even small, innocent failures can cause ripple effects that ultimately affect the user experience.
There are a few points to keep in mind here. Failures come in different flavors, and they affect the user experience in different ways. You can have for example retryable failures, or non-retryable but recoverable failures, or simply non-recoverable failures.
If your code notices a problem, and the failure is expected to be temporary (for example, you are waiting for a service to start), then it makes sense to retry the initial operation a few times, with no loss in user experience. But never forget to log what happened (e.g. add an event to the event log noting that you had to retry five times to open some file). Be very aware that a longer "retry cycle" can be extremely annoying, and the application can be perceived as "hanging" from the user's point of view.
Second, if the error is recoverable but with minimal loss in user experience, then it might be a good idea to enter a special "safe mode" where the application offers restricted functionality, while still allowing the user at least to save his changes.
Finally, if there is absolutely no way to alleviate the problem in some automatic way, then do everything possible to inform the user where he stands right now.
Thinking about user experience in the context of failures is very hard. For example, thinking about the problems of an API like CreateFile is not just a matter of enumerating all the possible ways in which this API can fail. CreateFile can hang too - for example, when you are trying to access a file over the network and the remote SMB computer is non-responsive. And what will you need to do in that case? Even worse, what if your user wants to close your application while your main UI thread is stuck on this CreateFile call? What if he wants to shut down the machine? You will block the whole shutdown, forcing him to manually power off the machine. Aaarrrggghhh....
The idea is simple: if something might fail due to unexpected conditions, then make the failure scenarios as predictable as possible. Additionally, when something goes wrong, then the failure should be surfaced in the right way, again, as predictable as possible.
This simple philosophy should permeate the whole design of your software components, at all levels. For example, if you see a corrupted data structure in your process, stop right away. If you get an unexpected error code, stop right away. It would be much worse if your program continued its execution with no sign of the previous anomaly. Why? Think about it: if you "hide" a bug, with absolutely no trace left from this transient encounter, then it will be extremely hard for a support engineer to find the root cause of the problem. Especially if this is a crash that happens, say, once per month...
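A fail-fast invariant check can be as simple as the following sketch (the helper name `check_invariant` is invented for the example). The key is that it leaves a trace in the log before stopping, so the support engineer has something to start from:

```python
import logging
import sys

log = logging.getLogger("invariants")

def check_invariant(condition, message):
    """Stop immediately on corrupted state instead of running on with it.

    The critical log entry is the trace a support engineer will need when
    investigating that once-per-month crash.
    """
    if not condition:
        log.critical("invariant violated: %s", message)
        sys.exit(1)  # fail fast; do not continue with corrupt state
```

In a real product you would typically write a crash dump here as well; the point is that the program stops at the first sign of corruption rather than masking it.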
At first, this seems to conflict with the "retryable error" approach described above, but that's not the case. The key concept here is predictability. Be very careful not to retry operations unless you know for sure that they will ultimately succeed.
Supportability Principle #4 - Don't hide failures under the rug.
A classical problem that can happen to software in the field is represented by Access Violations. Let's assume that your code has a bug that tries to access invalid memory. While an AV is very frustrating for the user, the fact that it surfaced immediately is actually the right thing. No, I'm serious! Your software is already shipped in the field. What would you prefer? (a) A piece of code that generates unnoticed, hidden buffer overruns all the time? Or (b) code that immediately terminates its execution when something goes wrong?
Your code shouldn't attempt to hide AVs by blindly eating all SEH exceptions. An AV will quickly reveal the exact bug in your software. In fact, hitting an AV is the optimistic case, where the code touched inaccessible memory and did not get a chance to corrupt valid customer data. It is much worse when the invalid memory access does not cause an AV and actually corrupts some user data. Not only that, but swallowing SEH exceptions can leave locks acquired, which can cause random hangs in your application.
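The moral carries over to any exception mechanism, not just SEH: handle only the failures you expect, and let everything else crash loudly. A hedged sketch (the function `read_config` is invented for the example):

```python
def read_config(path):
    """Return the config file contents, or "" if the file does not exist yet.

    Only the expected failure (missing file on first run) is handled.
    A bare `except:` here would be the moral equivalent of eating every
    SEH exception: it would silently swallow genuine bugs - a TypeError
    from a bad argument, MemoryError, and so on - leaving no trace.
    """
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return ""  # expected: first run, no config written yet
```

Any unexpected exception propagates and terminates loudly, which is option (b) above - far preferable to silent corruption.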
And, not only that, but any AV is, technically, an invalid memory access - potentially an exploitable buffer overrun. Who would prefer having insecure software in the field that is being constantly attacked through hidden buffer overruns that nobody can identify?
I remember that the huge blackout that affected the eastern US a while back was due to an unfortunate series of events, the most interesting one being a software bug in a management system program.
In the end, what's the point of having all sorts of bells and whistles that are supposed to draw attention about a failure, if these remain silent due to a minor bug?
The user experience around failures should be defined in great detail, and there are several aspects to it - accessibility, for example. Say your software is designed to monitor a storage array, and the LUNs are shown graphically in green when everything is fine and in red when something is wrong. Is this OK? Absolutely not: what about people with color blindness?
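The usual fix is to never encode status in color alone - pair every color with a textual label (and, in a real UI, a distinct icon shape). A minimal hypothetical sketch (the status names and rendering function are invented for the example):

```python
# Each status carries both a color and a textual label, so the
# information survives even when the color cannot be perceived.
STATUS_STYLES = {
    "healthy": ("green", "OK"),
    "degraded": ("yellow", "DEGRADED"),
    "failed": ("red", "FAILED"),
}

def render_lun(name, status):
    """Render a LUN's status with a redundant textual cue."""
    color, label = STATUS_STYLES[status]
    return f"[{label}] {name} (shown in {color})"
```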
Note also that all these failure paths must not only be well designed but also properly tested. And here we reach the next principle:
If you expect some failures to occur, then you should make sure that you have solid code that deals with them, and that this code is properly tested. Not only that, but the test code should exercise these failures and verify that the right signals are sent to the user.
It is very tempting to skip testing of any failure paths and just verify that the software does its job on the main path. Big mistake.
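One practical way to test failure paths is fault injection: make the failing dependency injectable, force it to fail in the test, and assert that the user-visible signal is correct. A hedged sketch (all names are invented for the example):

```python
def save_report(write_fn, data):
    """Save a report via an injectable write function.

    Returns (ok, user_message). Taking write_fn as a parameter lets tests
    force the failure path without needing a genuinely full disk.
    """
    try:
        write_fn(data)
        return True, "Report saved."
    except OSError as exc:
        return False, f"Could not save the report: {exc}. Check disk space and permissions."

def test_save_report_disk_full():
    """Fault-injection test: force the failure and verify the user signal."""
    def failing_write(data):
        raise OSError("No space left on device")
    ok, message = save_report(failing_write, b"payload")
    assert not ok
    assert "No space left on device" in message
```

The main path gets tested anyway; it is the injected-failure test that catches the cryptic or missing error message before a customer does.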
This might be more challenging than it seems. Writing the correct amount of data to the event log requires a fine balance between offering too little (which can be insufficient if you want to figure out what happened) and too much (overloading the event log with too many things).
There is another tradeoff: the event log entries must be useful both for the user (who expects to find clear information there) and for the support engineer who wants to investigate the specific failure. The event log entry must not be confusing for the user (Principle #2), but you should also provide enough developer-oriented information in a less visible place (the event log private data, for example). If the customer contacts a support engineer, he will know where to start.
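One way to realize this split is to keep the user-facing message and the developer payload as separate fields of the same event record. A minimal hypothetical sketch (the `make_event` helper and field names are invented for the example):

```python
import json
import time

def make_event(user_message, **developer_details):
    """Build one event record with a clear user message and a separate
    developer payload (the analogue of event log private data)."""
    return {
        "time": time.time(),
        "message": user_message,                   # what the user reads
        "details": json.dumps(developer_details),  # what the support engineer needs
    }
```

For example, `make_event("Backup failed: destination unreachable", error_code=53, retries=5)` keeps the error code and retry count out of the user's face but one click away for the engineer.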
Supportability Principle #8 - Develop solid tools for data gathering at the customer site, and for offline data analysis. These tools must be available at least to your support engineers, along with the proper documentation.
You have to develop an easy-to-use script or binary that gathers all the interesting data that can help you make progress with the support case. This tool should collect all the relevant info, like the OS version, the list of installed applications, specific configuration details of the target machine (is it running in a cluster? etc.), the full or partial event log, interesting log files, versions of various binaries, etc. Along with the tool, you must include simple, precise instructions for gathering the data, which can be cut and pasted into the email sent to the customer. Optionally, you can add options to enable finer-grained selection of what data needs to be gathered.
This tool doesn't need to be very complicated, but it needs to be extremely reliable. Certain support cases can have an extremely critical status, with very high visibility both at your company and at the customer site. In these cases it is important to get the results back as quickly as possible, and in the most reliable way. Nothing is more annoying in a critical support incident than a data gathering tool with a minor bug that sends the whole support team in a completely off-track direction.
One more thing, which is pretty important: this data gathering tool should not affect the machine state in any way (except, maybe, creating some files in a directory). Some customers are very sensitive about running unsupported things on their production servers. Finally, think hard about privacy concerns around this data gathering tool. Avoid collecting customer-sensitive data in your diagnostics (like the contents of the Exchange database). In parallel, educate the user properly about what this tool will do and what type of data is sent back.
Supportability Principle #9 - Don't throw away what you learned from a support case. Maintain a document (or a Wiki web page) with all the interesting support cases from the past.
This doesn't need to be something fancy. A simple DOC file on a share is enough. Or an internal blog. The key here is to make a habit of updating this document after each interesting support case - it is very easy to get stuck in an endless treadmill of support cases and forget to communicate the gained experience to others. This source of information is valuable both for quick reference and for new support engineers who want to quickly get up to speed.
Supportability Principle #10 - All these "principles" are completely useless, as long as the feedback loop is not closed.
There is no substitute for having excellent communication with your customers and your support engineers. In the end, resolving each support case is a team effort, and good communication is critical for success.
3rd August 2006