When it comes to software QA and testing, the focus often falls on ensuring that software is bug-free and won’t pose significant security risks for developers or customers down the line. A less frequently discussed aspect of QA is performance testing, which gauges an application’s speed, response time, scalability, resource usage, and stability under load. One of the most important areas of performance testing is failover testing.
To fully understand why failover testing is such a critical aspect of the testing process, you must first understand what failovers are and how they can protect your bottom line.
Failover refers to a system’s ability to handle sudden failures by switching operations to a reliable backup. Failovers are a key feature of robust hardware and software systems. They rely on redundancies (secondary computing systems or servers) to ensure service availability even in the face of critical failures. These redundancies are always online and updated regularly to maintain the most recent instances of the applications, files, or systems in use.
However, failover should not be confused with failback. Where failover describes the shift of functionality to the redundancy following a failure, failback is the return of service to the original main server or system once it has been restored.
If the main system or server experiences a critical failure, a well-designed failover system will switch operations to the redundancy seamlessly. Assuming the redundancy is up to date, applications and data remain available even though the main system has failed, with minimal change to the end-user experience.
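As a rough sketch of the idea, the snippet below shows a client that falls back to a hypothetical backup endpoint when the primary stops responding. The URLs and timeout are placeholders; in practice this switch usually lives in a load balancer, DNS failover, or the platform itself rather than in application code.

```python
import requests

# Hypothetical primary and backup endpoints; substitute your own.
PRIMARY_URL = "https://primary.example.com/api/orders"
BACKUP_URL = "https://backup.example.com/api/orders"


def fetch_orders(timeout_seconds: float = 2.0) -> dict:
    """Try the primary first; fall back to the backup if it is unreachable."""
    for url in (PRIMARY_URL, BACKUP_URL):
        try:
            response = requests.get(url, timeout=timeout_seconds)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            continue  # this endpoint failed; try the next one
    raise RuntimeError("Both primary and backup endpoints are unavailable")
```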
Failovers exist primarily to reduce the risk of data loss and potential negative impact on software users. They’re popular mechanisms built into most modern software products, from SaaS applications to operating systems.
Failover testing refers to the testing technique that ensures a system has adequate failover mechanisms in place.
Failover testing in cloud environments follows the same principles as standard failover testing. Tests must be carried out to ensure the failover systems are working correctly. Such tests are extremely important for delivering value to customers and ensuring satisfaction with the product or service.
Imagine you have an application for logging customer orders from an e-commerce service provider. If the main server were to experience an unplanned failure and the failover system didn’t perform as required, hundreds or potentially thousands of users could be left unable to access the application. More serious consequences could include the loss of purchase data or orders.
In a case like this, having a robust failover in place ensures that service disruptions and data loss are kept to a minimum.
Failover testing helps software developers determine how well their application or system can handle critical failures by allocating extra resources and load balancing as necessary. When designing software for wide-scale use (particularly cloud-based applications), special attention needs to be paid to failover reliability and functionality. Failure of both the primary system and the failover could result in the loss of functionality and of any data created before the failure.
Failover testing helps engineers ensure that the system in place can allocate computing resources as needed during periods of heavy load. It also tests the system’s ability to replicate the data and operations in use to the redundancies and to transition operations to them seamlessly in the event of an abrupt system failure.
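One way to exercise that behaviour in an automated check is to stop the primary deliberately and assert that the service keeps responding. The sketch below assumes a hypothetical `cluster` test fixture wrapping whatever orchestration or chaos-engineering tooling your environment actually provides.

```python
import time


def test_failover_keeps_service_available(cluster):
    """Simulate a primary failure and verify the backup takes over.

    `cluster` is a hypothetical fixture exposing stop_primary(),
    restore_primary(), and is_serving_requests().
    """
    assert cluster.is_serving_requests(), "service should be healthy before the test"

    cluster.stop_primary()  # inject the failure
    try:
        deadline = time.monotonic() + 60  # allow up to 60 s for failover
        while time.monotonic() < deadline:
            if cluster.is_serving_requests():
                break
            time.sleep(1)
        else:
            raise AssertionError("backup did not take over within 60 seconds")
    finally:
        cluster.restore_primary()  # fail back so later tests start from a clean state
```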
When designing failover tests, bear in mind that failover testing starts with identifying the conditions that would lead to a system failure. Failover testing, therefore, builds on other performance testing methods, such as fault tolerance testing.
Before beginning failover testing, it’s important to address a few key considerations, like:
Having factored in these considerations, testers can then design test plans around them. To conduct a failover test, testing professionals will usually do the following:
When testing how efficiently your application and servers handle failovers, two KPIs are particularly useful: RTO and RPO.
Recovery Time Objective (RTO) represents the maximum amount of downtime that a system or service provider can accept before facing severe consequences. Intolerable consequences in terms of RTO can include:
When conducting failover testing, measuring the time between the failure and the point at which failover restores service helps with benchmarking expected recovery times against the RTO. Lower downtime means higher availability for customers and, ultimately, better customer satisfaction.
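In practice, the measurement can be as simple as recording timestamps around the event and comparing the observed downtime with the agreed RTO. The sketch below uses placeholder timestamps and an assumed 120-second target purely for illustration.

```python
from datetime import datetime, timedelta

# Assumed RTO target for illustration; use whatever your business has defined.
RTO_TARGET = timedelta(seconds=120)

# Placeholder timestamps recorded during a failover test run.
failure_detected_at = datetime(2024, 1, 15, 10, 0, 0)
service_restored_at = datetime(2024, 1, 15, 10, 1, 35)

observed_downtime = service_restored_at - failure_detected_at
print(f"Observed downtime: {observed_downtime.total_seconds():.0f} s")
print(f"Within RTO target: {observed_downtime <= RTO_TARGET}")
```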
While RTO is certainly important, it’s only half of the equation. RTO relates to the long-term impact on an application or service’s profitability (by influencing customer attitudes and behavior); however, RPO can have a more direct impact on customer and business data.
Recovery Point Objective (RPO) relates to the duration of time between a system's most recent backup and the point of failure. Like RTO, RPO represents a theoretical maximum; however, it’s more concerned with the maximum amount of data that service providers can afford to lose because of a failure.
When conducting failover testing, special attention should be paid to the duration between the last data backup and the initiation of failover. This window should be as short as possible to reduce the amount of data lost.
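The same timestamp-based approach works for RPO: compare the time of the last completed backup (or replication checkpoint) with the time of failure. The values and the five-minute target below are illustrative only.

```python
from datetime import datetime, timedelta

# Assumed RPO target for illustration; your acceptable window may be far smaller.
RPO_TARGET = timedelta(minutes=5)

# Placeholder timestamps from a failover test run.
last_backup_completed_at = datetime(2024, 1, 15, 9, 57, 0)
failure_detected_at = datetime(2024, 1, 15, 10, 0, 0)

data_at_risk_window = failure_detected_at - last_backup_completed_at
print(f"Data written in the last {data_at_risk_window.total_seconds():.0f} s could be lost")
print(f"Within RPO target: {data_at_risk_window <= RPO_TARGET}")
```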
It’s important to note that with any failure there is always some risk of data loss. Identifying an acceptable degree of loss and managing the actual data loss depend on how quickly the systems in place can restore service, either by initiating failover or by restoring functionality to the servers after the failure.
Well-rounded QA testing, particularly for SaaS or other cloud-based applications, also needs a strong focus on performance testing. Better performance testing (including failover testing) leads to better software and system reliability in production. For customers and end users, this means better service and a better user experience.
QA is concerned with ensuring that software products and services meet business-defined levels of performance and reliability to maximize profitability and minimize risk. To get the best out of your software products, having a comprehensive QA process is essential. See our tips for better QA success here.