
Chaos Engineering

Chaos Engineering is an experimental approach to testing the stability and fault tolerance of complex distributed systems by introducing controlled failures.

Chaos Testing as part of Chaos Engineering

Chaos Testing is a method of proactively testing systems in production as part of Chaos Engineering. It helps to:

determine whether a system can remain operational if one or more of its components fail;
see if the components are capable of returning to normal operation on their own;
find out how delays or failures affect system performance and user experience;
estimate how long it will take to restore the system to a normal state.

The basic idea is to deliberately inject faults into the system and then observe and record how it responds. The fault could be resource depletion, a server outage, reduced network bandwidth, or even accidental data deletion. Does that sound drastic? Yes! But it’s precisely under such conditions that you can identify hidden weaknesses, assess resilience, and prepare your team for real-world incidents.
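To make the idea of deliberate fault injection more concrete, here is a minimal sketch in Python. It is not taken from any of the tools mentioned in this article: the names `with_chaos` and `InjectedFault` and the parameters `failure_rate` and `max_delay_s` are hypothetical, chosen only for illustration. The wrapper randomly delays or fails calls to a function, roughly emulating a slow network or an unavailable dependency.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real outage when the wrapper decides to fail a call."""


def with_chaos(func, failure_rate=0.0, max_delay_s=0.0, rng=None, sleep=time.sleep):
    """Wrap func so that its calls are randomly delayed or failed.

    failure_rate: probability (0..1) that a call raises InjectedFault.
    max_delay_s:  upper bound of a random delay added before each call.
    rng, sleep:   injectable so that experiments can be reproducible and fast.
    """
    rng = rng or random.Random()
    name = getattr(func, "__name__", "call")

    def wrapper(*args, **kwargs):
        if max_delay_s > 0:
            # Emulate a slow network or a slow dependency.
            sleep(rng.uniform(0, max_delay_s))
        if rng.random() < failure_rate:
            # Emulate a component that is down.
            raise InjectedFault(f"injected failure in {name}")
        return func(*args, **kwargs)

    return wrapper
```

For example, `flaky = with_chaos(fetch_profile, failure_rate=0.1, max_delay_s=0.5)` would fail about one call in ten and slow down the rest, where `fetch_profile` stands for any function of your own. Real chaos tools usually inject faults at the infrastructure level (processes, hosts, network) rather than inside application code, so treat this only as an illustration of the principle.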

Course of the experiment

Testing consists of several key steps:

Determining the “normal” state of the system. The team measures and records the main parameters reflecting normal system operation (response time, request processing speed, error rate, etc.). This is necessary for further analysis of deviations during the test.
Forming a stability hypothesis. The team formulates a hypothesis describing how the system should behave in case of failure. For example: “When one of the servers goes down, the load should be redistributed to other nodes and the system should remain available to users.”
Introducing the failure (chaos). In this phase, testers deliberately create failures that may affect one or more components of the system. This may include stopping processes, emulating delays, or simulating a network failure or a database crash.
Analyzing the system response. The team records how closely the system's behavior matches the hypothesis, estimates recovery time, checks whether the system stayed available to users, and identifies which elements of the infrastructure need improvement.
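The four steps above can be sketched as a small simulation. Everything here is an assumption made for illustration: the `Node` and `Cluster` classes stand in for real servers behind a load balancer, the error rate is the only steady-state metric, and the 1% threshold in `run_experiment` is an arbitrary example value rather than a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True

    def handle(self, request_id: int) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served request {request_id}"


@dataclass
class Cluster:
    nodes: list

    def handle(self, request_id: int) -> str:
        # Round-robin with failover: try the next node if the chosen one is down.
        n = len(self.nodes)
        for offset in range(n):
            node = self.nodes[(request_id + offset) % n]
            try:
                return node.handle(request_id)
            except ConnectionError:
                continue
        raise ConnectionError("no healthy nodes")


def error_rate(cluster: Cluster, requests: int = 100) -> float:
    """Fraction of requests that fail; our only steady-state metric here."""
    errors = 0
    for i in range(requests):
        try:
            cluster.handle(i)
        except ConnectionError:
            errors += 1
    return errors / requests


def run_experiment(cluster: Cluster, victim: Node, max_error_rate: float = 0.01) -> dict:
    baseline = error_rate(cluster)      # 1. measure the "normal" state
    # 2. hypothesis: with `victim` down, the error rate stays <= max_error_rate
    victim.healthy = False              # 3. introduce the failure
    try:
        during = error_rate(cluster)
    finally:
        victim.healthy = True           # always roll the failure back
    after = error_rate(cluster)         # did the system return to normal?
    return {                            # 4. compare the response with the hypothesis
        "baseline": baseline,
        "during": during,
        "after": after,
        "hypothesis_holds": during <= max_error_rate,
    }
```

With three nodes and one of them stopped, the failover keeps the error rate at zero and the hypothesis holds. With a single node, every request fails during the experiment, which is exactly the kind of weakness such a test is meant to expose.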

Some of the most popular tools for Chaos Engineering include Netflix’s Simian Army, Chaos Dingo, PowerfulSeal, and Chaos HTTP Proxy.

The harder it is to disrupt the steady state, the more confident the testers can be in the reliability of the system. If a weakness is discovered during testing, the team prepares a report with recommendations for improving the system once the experiment is over.

By the way, load testing will help you check whether a site or an application can withstand a large flow of users.

Fault tolerance is one of the properties of Cloud Native. We recently wrote an article where we broke down all five properties.