Web hosting, IT operations, and service level management
Yesterday (in my post about the weather) I introduced the subjects of time series and probability. Today I want to show how they apply to one aspect of performance management -- working with service availability data, which you are probably collecting from several different measurement tools.
Suppose you need to improve the availability of a critical e-business application ...
Let's assume you have collected historical availability statistics (or estimates you are comfortable using) for all components of your Web environment. You'll need these for the devices and services you control yourself, and for those you don't, like your ISP and its connection to the Internet. Hopefully, many of those numbers will be close to 100%. But if you are having service quality problems, at least some of those availability scores must not be good enough.
Barring catastrophic outages that bring down an entire data center, the components that support your Web applications generally operate independently of one another. That is, Web server crashes are unrelated to database server problems, which are not affected by a back-end application service being down. So if you know the availability of each one separately, it's intuitively clear that you should be able to derive an estimate for the availability of the whole interconnected system, for a given application. In other words, you can estimate the level of application availability a customer experiences.
Here's how you can use the Web Site Availability Model to do that:
- Based on my conceptual diagram, create a spreadsheet model listing all your services, where each row of your sheet corresponds to a service that's used to deliver your Web site or Web-based applications.
- Check that the target application does not depend on any unlisted services. This would be most likely to happen in the bottom row, called Application Services, because what goes in that row will vary from one application to another.
- For each separate page of the target application, set aside one column of an availability model.
- In each column, highlight only the cells that correspond to services needed to serve that page. These cells are the only ones you will use in this instance of the model. Not every service listed is required for the delivery of every page. Simple applications may require only a small subset of the services you manage.
- In each row of your completed model, record the percentage of down time (100% minus availability) of each service in the first cell where that service is used (highlighted). Do not enter the same down time in other columns after the first highlighted cell, or you will be counting the same problem multiple times. Once you have entered a down time estimate in every row, the spreadsheet is already a useful document that can help refine your focus and tackle the biggest problems.
- Summing the percentages in each column will now give you the probability that your application will fail on any particular page. A customer may be equally upset if your home page is down, the product catalog is down, or the checkout process times out. But to improve overall application availability, you need to know which pages face the most problems.
- Summing the column totals for the whole model will tell you the overall chance of an application failure.
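The steps above can be sketched in a few lines of code. This is only an illustration of the bookkeeping, not part of the original model -- the service names, page names, and down time figures are all invented for the example:

```python
# A minimal sketch of the availability model described above. The
# down time figures are fractions (0.001 = 0.1% down time), and the
# services and pages are illustrative assumptions, not real data.

down_time = {
    "ISP link": 0.001,
    "Web server": 0.002,
    "Database": 0.005,
    "Checkout service": 0.010,
}

# Each "column" of the model: a page and the services it needs.
pages = {
    "Home": ["ISP link", "Web server"],
    "Catalog": ["ISP link", "Web server", "Database"],
    "Checkout": ["ISP link", "Web server", "Database", "Checkout service"],
}

def page_failure_probabilities(pages, down_time):
    """Sum each service's down time into the FIRST page that uses it,
    so the same problem is never counted twice (step 5 above)."""
    seen = set()
    per_page = {}
    for page, services in pages.items():
        total = 0.0
        for svc in services:
            if svc not in seen:       # count each service only once
                seen.add(svc)
                total += down_time[svc]
        per_page[page] = total
    return per_page

per_page = page_failure_probabilities(pages, down_time)
overall = sum(per_page.values())      # chance of failure on ANY page

for page, p in per_page.items():
    print(f"{page}: {p:.3f}")
print(f"Overall failure probability: {overall:.3f}")
```

With these made-up numbers, the Checkout column carries the most risk, which is exactly the kind of insight the spreadsheet is meant to surface.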
How's the weather in your data center?
Although it may not be immediately obvious, the above method assumes that your devices and services behave a lot like the weather in California (using yesterday's example). Just substitute up and down for sunshine and rain! Service outages do not happen for a single instant, then go away. Normally a service is 100% available, but when it goes down (becoming 0% available), it stays down for some time.
Therefore, the proposed method assumes that if a service is available the first time it's needed by any page of an application, it will remain available for all subsequent pages. Making this assumption greatly simplifies the challenge of aggregating device and service availability data from disparate sources.
If on the other hand you have reason to believe that an underlying service behaves independently from page to page of a Web transaction, the weather analogy breaks down. In that case, you have a bit more work to do to complete the cells of your availability model. First you must count the number of pages of the transaction that use the service, then you must compute the probability that none of those pages will fail when using the service.
This situation is what statisticians call a sequence of independent trials, like successive rolls of a die. For n pages, the probability of no failures is service availability raised to the power n. For example, if a service is 90% available overall, the probability that two pages will succeed in using that service is (.9)(.9) = .81, three pages (.9)(.9)(.9) = .729, and so on. Subtracting that result from 100% gives you the percentage of down time for that service, during your application. Now proceed as in steps 5-7 above -- enter this result in the first highlighted cell (the first page requiring the service), sum the component unavailability data by page, and sum the columns to derive the overall probability of an application outage.
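The independent-trials arithmetic above is easy to check mechanically. Here is a small sketch that reproduces the worked example from the paragraph:

```python
# Down time contribution of a service that behaves independently on
# each page (independent trials): with availability a, all n pages
# succeed with probability a**n, so something fails with 1 - a**n.

def independent_down_time(availability, n_pages):
    """Probability that at least one of n_pages fails using the service."""
    return 1.0 - availability ** n_pages

# The 90%-available service from the text:
print(round(independent_down_time(0.9, 2), 3))  # 1 - .81  -> 0.19
print(round(independent_down_time(0.9, 3), 3))  # 1 - .729 -> 0.271
```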
Note that all the possibilities for a service failure have once again been recorded in the first column (page) that uses the service, which does not matter if you are focused only on overall application availability. If you really want to estimate the probability of a particular page failing, you will have to spread the probability of failure across all the pages that use the service. This requires computing the conditional probabilities of each page failing, given that its predecessors did not. That, as they say in the textbooks, is left as an exercise for the reader.
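For readers who want a head start on that exercise, here is one hedged sketch under the independent-trials assumption: the chance that the application fails *first* at page k is the chance the service worked on the k-1 earlier pages and then failed, and those per-page figures sum back to the overall down time 1 - a^n:

```python
# Sketch of the "exercise for the reader", assuming independent
# trials with availability a. First failure at page k has
# probability a**(k-1) * (1 - a): success on the k-1 earlier pages,
# then a failure.

def first_failure_by_page(availability, n_pages):
    a = availability
    return [a ** (k - 1) * (1 - a) for k in range(1, n_pages + 1)]

probs = first_failure_by_page(0.9, 3)
print([round(p, 4) for p in probs])  # [0.1, 0.09, 0.081]
print(round(sum(probs), 4))          # 0.271, matching 1 - .9**3
```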
Details, details, ...
Astute readers may have noted that my suggested method glosses over a couple of important details:
- If you use a group of parallel servers (for example, with load balancing), you should treat the group as a single logical service in this model. The customer does not care whether one of those servers is up or down, if they can still obtain service from another server in the group. Of course, that is one reason why you are using parallel servers -- the goal is to achieve 100% availability for this service, regardless of whether an individual server is up or down.
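To see why a parallel group comes so close to 100%, note that (assuming the servers fail independently, which is itself an idealization) the group is down only when every member is down at once, so the group's availability is 1 minus the product of the individual unavailabilities. The server counts and figures below are illustrative:

```python
# Availability of a load-balanced group, assuming independent
# failures: the group is down only when EVERY member is down.

def group_availability(member_availabilities):
    all_down = 1.0
    for a in member_availabilities:
        all_down *= (1.0 - a)     # probability this member is also down
    return 1.0 - all_down

# Three servers that are each only 95% available still make a
# group that is available 99.9875% of the time:
print(round(group_availability([0.95, 0.95, 0.95]), 6))  # 0.999875
```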
- In practice, customers may take different paths through your applications, where those paths (usage scenarios or use cases) require different sets of devices and services. If service availability differs significantly based on application usage, you should treat each scenario as a separate application, or add weights to your model. Then you can combine the different use cases and come up with an overall availability estimate for the application. How you deal with this kind of detail will depend on your purpose in using the model.
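The weighting idea in the second point can be sketched as a simple weighted sum. The scenario names, traffic weights, and failure probabilities here are all assumptions for illustration:

```python
# Hedged sketch: combine per-scenario failure probabilities into one
# application-level estimate, weighted by how often customers follow
# each path. All numbers are illustrative.

scenarios = {
    # scenario: (traffic weight, failure probability from its own model)
    "browse only":       (0.60, 0.005),
    "browse + checkout": (0.40, 0.018),
}

# The weights should cover all traffic, i.e. sum to 1.
assert abs(sum(w for w, _ in scenarios.values()) - 1.0) < 1e-9

overall_failure = sum(w * p for w, p in scenarios.values())
print(f"Weighted application failure probability: {overall_failure:.4f}")
```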
Finally, my aim has not been to provide a rigorous exposition of data center forecasting as an application of probability theory. And I am aware that a probability is a quantity between 0 and 1, whereas percentages run from 0 to 100%. For descriptive clarity, though, it seems easier and more natural to use the terminology interchangeably. If you are actually doing the calculations, be sure to pick one or the other and not mix them!
[This post was first published on Blogger on November 2, 2005.]