Designing a service for high availability requires the elimination of single points of failure, together with effective fault detection and failover. I’ve seen a few different high availability solutions used with SAP recently, and this blog describes some of the lessons learned and the problems these solutions have run into.
High availability is about minimising both planned and unplanned downtime. Planned downtime could be for server patching, whilst unplanned downtime is caused by a fault in some component that takes the service offline. The high availability solution has to cost less than the cost of the downtime it avoids, or you are wasting money.
To develop a business case for high availability, first understand the costs of the planned downtime that you require to maintain the system, and the cost per hour of unplanned downtime occurring. If you don’t know these costs, stop and work them out with the business users.
Remember the ongoing costs, especially for support: in my experience, the extra complexity of high availability solutions such as clusters can add significantly to them. I have seen a few systems that actually have lower availability as a result of the complexity of their clusters.
With the two costs to compare, the decision is then purely financial, not technical. The mistake people make is trying to buy as many bells and whistles as possible with whatever budget they have. It’s important for the budget holder to understand what the options are and review the costs and benefits of each one.
I’d go as far as to say that the majority of clusters I’ve seen were not necessary; they were chosen for an assumed need and cost more than the downtime they prevented.
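As a rough sketch of that comparison, the snippet below adds the yearly cost of downtime to the yearly cost of running each option. Every figure and option name in it is a hypothetical placeholder, not data from a real system; substitute the numbers you have agreed with your business users.

```python
from dataclasses import dataclass


@dataclass
class DowntimeProfile:
    """Expected yearly figures for one design option (all hypothetical)."""
    planned_hours: float        # maintenance windows per year
    unplanned_hours: float      # expected outage hours per year
    annual_running_cost: float  # hardware, licences, support, training


def annual_cost(option, planned_cost_per_hour, unplanned_cost_per_hour):
    """Yearly downtime cost plus the yearly cost of running the option."""
    downtime_cost = (option.planned_hours * planned_cost_per_hour
                     + option.unplanned_hours * unplanned_cost_per_hour)
    return downtime_cost + option.annual_running_cost


# Hypothetical costs per hour of downtime, agreed with the business users.
PLANNED_COST_PER_HOUR = 500.0
UNPLANNED_COST_PER_HOUR = 5_000.0

# Hypothetical options: (planned hours, unplanned hours, running cost).
options = {
    "single good-quality server": DowntimeProfile(24, 8, 10_000),
    "hypervisor HA": DowntimeProfile(24, 3, 30_000),
    "application-layer cluster": DowntimeProfile(30, 2, 60_000),
}

costs = {
    name: annual_cost(option, PLANNED_COST_PER_HOUR, UNPLANNED_COST_PER_HOUR)
    for name, option in options.items()
}
cheapest = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: {cost:,.0f} per year")
print("cheapest option:", cheapest)
```

With these made-up inputs the most elaborate option is also the most expensive overall, because its running cost outweighs the unplanned downtime it saves. Different inputs will give a different answer, which is the point of working the numbers out first.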
High availability can be implemented at different layers in your infrastructure, from the basic hardware all the way up to application layer clustering.
The first thing to consider is how much you can achieve purely at the hardware layer. A good quality server with redundant components attached to a redundant storage and network infrastructure might offer all you need, and will be simpler and cheaper than any cluster.
Clustering should only be considered if you need more high availability than your hardware vendor can provide alone. Look at the fallacies of distributed computing to get an idea of some of the problems you can run into by clustering when it isn’t required. The extra complexity is not to be underestimated.
Cluster Option 1: Virtualisation (Hypervisor) Layer
Hypervisor high availability gives you the potential to have a single high availability solution for your entire organisation. This means a single team can become skilled in managing that single solution. If you have different solutions for each application, it gets so complex that the costs for support spiral and the actual result is less availability rather than more!
The big weakness in hypervisor high availability is that you need downtime for maintaining the operating system in the highly available virtual machine. This could mean a periodic planned downtime window, and operating system faults will lead to unplanned downtime.
Cluster Option 2: Database and Application Layer
There are a multitude of ways to cluster your application and your database. The most common example is Microsoft Failover Clustering into which you can install both SAP and a database such as Oracle or Microsoft SQL Server.
Application-layer clustering can get very complex, and can be unique to the application. You need to consider the impact it could have on tasks such as applying an SAP kernel patch: it could add more planned downtime than it ever saves in unplanned downtime.
In my experience, Microsoft Failover Clustering can add more problems than it resolves if you can’t invest enough in it to operate it properly, so you may want to consider hypervisor-layer high availability if you need a single solution for all your applications.
Cluster Option 3: Database Clustering
You need to take care to make the SAP application servers highly available as well as the database; it is surprisingly common to see a single point of failure attached to an expensive database cluster. Typically SAP will be highly available through application-layer or hypervisor-layer high availability.
The single most common and serious mistake in high availability solutions I’ve seen is having high availability in production only. It is done generally to save cost, but it will cost more in the long run.
If the quality assurance environment doesn’t match production exactly, you will encounter all of these problems:
- You need to patch the database, SAP and its kernel, operating system, firmware, hypervisor and drivers without any testing on how it will affect the cluster.
- You need to run major projects on your SAP system, such as upgrades, without testing the process on the cluster first. Even migrating SAP out of the cluster becomes an untested process.
- You cannot plan production downtime on a realistic environment, increasing the risk of overrunning maintenance windows.
- There is no training environment for the cluster technology, making it difficult for new operators to be confident in its operation.
I have seen customers running into all of the problems above as a result of having no cluster in quality assurance environments and the impacts can be, and were, serious.
If you don’t do a failover test during the project when you can get downtime, how do you know for sure it’s actually going to work?
I heard a story about a non-technical managing director who asked for a datacentre tour to see the investment he had made in his new highly available infrastructure with his own eyes. As the technician proudly pointed out which parts of the rack could fail without impact, the director started pulling the cables without any warning.
Production stayed up on that occasion, but I am not sure how many highly available infrastructures could withstand the rogue director test.
I’ve been asked to look at a few clustered environments that had no failover procedure and operators who had no confidence in using the cluster.
One of the hidden costs of high availability is the ongoing support and knowledge required to look after it. If you go into it without planning for that, your team will be left unable to work on it effectively, which will lead to exactly what you were trying to avoid: downtime.
Make sure the quality assurance cluster matches production, make sure you do testing and training before go-live, and make sure there are solid idiot-proof processes that anybody could follow at 3am on the weekend when you are likely to need them.
There are single points of failure (SPOFs) everywhere in computer networks and systems. One of the fallacies of distributed computing is the assumption that the network is reliable. It may well be reliable at the cluster itself, but are all of your key users in a single wing with a single access layer switch?
For example, if your most important business process is receiving deliveries and your entire business case for high availability is based on that, you need to consider if any of the equipment needed to operate the warehouse can fail. Is the terminal in the warehouse used for connecting to SAP fully redundant, including the power and network all the way back to the cluster?
The most obvious example of a missed SPOF occurs in the cluster itself, for example building a distributed cluster on top of a SAN that is wholly located within one datacentre. Even more obviously, clustered SAP application servers connect to non-clustered databases and vice versa all the time. It’s worth having a proper review to make sure nothing is needed to run the system that isn’t highly available.
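One way to make such a review systematic is to write the topology down and test what happens when each component fails on its own. The sketch below does this by brute force on a small, entirely hypothetical topology (the component names and links are invented for illustration): it removes one component at a time and checks whether the warehouse terminal can still reach the database.

```python
from collections import deque

# Hypothetical topology: each component lists what it connects onwards to.
LINKS = {
    "warehouse terminal": ["access switch"],
    "access switch": ["core switch A", "core switch B"],
    "core switch A": ["SAP node 1", "SAP node 2"],
    "core switch B": ["SAP node 1", "SAP node 2"],
    "SAP node 1": ["database"],
    "SAP node 2": ["database"],
    "database": [],
}


def reachable(links, start, goal, failed):
    """Breadth-first search from start to goal, skipping the failed component."""
    if failed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in links.get(node, []):
            if neighbour != failed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(links, start, goal):
    """Components whose failure alone cuts the path from start to goal."""
    return sorted(
        component for component in links
        if not reachable(links, start, goal, failed=component)
    )


spofs = single_points_of_failure(LINKS, "warehouse terminal", "database")
print(spofs)
```

In this made-up example the clustered SAP nodes and the core switches survive any single failure, while the terminal, its access switch and the non-clustered database do not. A real review also has to cover power, storage and anything else the service depends on.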
I previously worked for a company that had a fantastic high availability solution at the storage and hypervisor layers across two datacentres, ahead of its time in 2008. Faced with the urgent requirement to shut down power to one datacentre, we moved all the servers to the other in just a few minutes, leaving everyone mystified about how we did it without any heavy lifting or downtime.
On another occasion, the same fantastic cluster failed dramatically because a problem with the spanning tree in the network caused a momentary network-wide outage. On trying to restart the cluster from cold for the first time, we discovered that all the domain controllers required for access to the virtual machines’ disks had been migrated into virtual machines on the cluster itself, which caused half a day of downtime. It showed that a missed SPOF can be very subtle; this one existed only in the conceptual network topology created by an algorithm in the network switches.
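A circular dependency like the domain controller one only shows up when you start from cold, but it can be checked on paper beforehand. The sketch below is a simplified, hypothetical model of that situation (the component names and dependencies are my own illustration, not a record of the real environment). It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to work out a start order, and reports when no order exists.

```python
from graphlib import CycleError, TopologicalSorter


def cold_start_order(depends_on):
    """Return a valid start order, or None if a circular dependency
    means the environment cannot be started from cold."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None


# Hypothetical model: each component maps to what must be running first.
AS_BUILT = {
    "cluster storage access": ["domain controllers"],
    "virtual machines": ["cluster storage access"],
    # Every domain controller runs as a virtual machine on the cluster.
    "domain controllers": ["virtual machines"],
}

# Same model, assuming one domain controller is kept outside the cluster.
WITH_EXTERNAL_DC = {
    "cluster storage access": ["domain controllers"],
    "virtual machines": ["cluster storage access"],
    "domain controllers": [],
}

print("as built:", cold_start_order(AS_BUILT))
print("with external DC:", cold_start_order(WITH_EXTERNAL_DC))
```

The first model has no valid start order, because each component is waiting for another one in the loop. Keeping one dependency outside the cluster breaks the loop in the second model.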
To get your high availability right, make sure you have:
- A solid business case: it should be a purely business decision to decide the level of availability you can afford and should build to.
- A good design: test it to make sure it gives you the high availability you wanted, search it for missed SPOFs and get external help from vendors if you need it.
- Good procedures: test, document, and review. Good procedures are fundamental to being able to keep the cluster running.
- Trained staff: From the initial business case right through to ongoing operations, the staff need to be completely confident in operating the high availability technology or it won’t achieve its goals.
For more information about the topics covered in our blog, contact Absoft today via: firstname.lastname@example.org or T: +44 1224 707088 to speak with our SAP consultants.