AWS Is Now Offering to Inject Failures in Your Applications as Part of Its Service

Cloud giant AWS recently announced another addition to its vast range of offerings in the AWS environment – its own chaos-engineering-as-a-service solution. Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

The service is delivered through AWS Fault Injection Simulator (FIS), set to be released in 2021. According to AWS, FIS simplifies the process of setting up and running controlled chaos engineering experiments across a range of AWS services so teams can build confidence in their application behaviour.

At the time of the announcement during re:Invent 2020, FIS was able to run experiments against AWS services such as EC2, Elastic Kubernetes Service (EKS), ECS, and RDS.

A user can start by choosing among the pre-built templates in the service, which are available to use as a starting point for common chaos scenarios. These may include actions such as stopping an instance, throttling an API, and failing over a database.
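To give a feel for what such a scenario looks like once written down, here is a minimal sketch of the "stopping an instance" case. It builds the template as a plain Python dictionary in the shape accepted by the `create_experiment_template` call of the boto3 `fis` client. The target name `oneInstance`, the `chaos-ready` tag, the role ARN and the ten-minute restart delay are placeholders assumed for illustration, not values from AWS's announcement.

```python
import uuid


def build_stop_instance_template(role_arn, tag_key="chaos-ready", tag_value="true"):
    """Return keyword arguments for the FIS create_experiment_template call.

    Every name and value below is an illustrative placeholder.
    """
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Stop one tagged EC2 instance, then restart it",
        "roleArn": role_arn,  # role that FIS assumes to act on your resources
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration: restart the instance after ten minutes
                "parameters": {"startInstancesAfterDuration": "PT10M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
        # A real experiment should point at a CloudWatch alarm instead of "none".
        "stopConditions": [{"source": "none"}],
    }


def create_template(template_kwargs):
    """Send the template to AWS. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    fis = boto3.client("fis")
    response = fis.create_experiment_template(**template_kwargs)
    return response["experimentTemplate"]["id"]


template = build_stop_instance_template(
    "arn:aws:iam::123456789012:role/fis-experiment-role"  # hypothetical role
)
print(template["actions"]["stopInstance"]["actionId"])
```

Keeping the builder separate from the AWS call means the template can be reviewed, or checked in a test, before anything is actually disrupted.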

Many action types do not require any agents to be installed in your resources. However, instance-level faults, including increased CPU utilisation and memory utilisation, require the SSM agent.

These actions simulate disruptive events in your AWS environment, so it is important to restrict who can run FIS experiments. To that end, FIS is integrated with AWS Identity and Access Management (IAM), allowing you to control which users and roles have permission to access and run Fault Injection Simulator experiments, and which resources and services can be affected.
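As one possible illustration of that kind of restriction, the sketch below generates an IAM policy document that lets its holder start a single approved experiment template and stop or inspect the resulting experiments. The region, account ID and template ID are hypothetical placeholders, and the statement layout is an assumption about how a team might scope access, not a policy recommended by AWS.

```python
import json


def build_fis_run_policy(region, account_id, template_id):
    """Build an IAM policy allowing one specific FIS template to be run."""
    template_arn = (
        f"arn:aws:fis:{region}:{account_id}:experiment-template/{template_id}"
    )
    experiment_arn = f"arn:aws:fis:{region}:{account_id}:experiment/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Starting an experiment touches both the template and the
                # experiment that gets created from it.
                "Sid": "RunOneApprovedTemplate",
                "Effect": "Allow",
                "Action": ["fis:StartExperiment"],
                "Resource": [template_arn, experiment_arn],
            },
            {
                "Sid": "StopAndInspectExperiments",
                "Effect": "Allow",
                "Action": ["fis:StopExperiment", "fis:GetExperiment"],
                "Resource": [experiment_arn],
            },
        ],
    }


# Hypothetical region, account and template ID, for illustration only.
policy = build_fis_run_policy("eu-west-1", "123456789012", "EXT1a2b3c4d5e6f7")
print(json.dumps(policy, indent=2))
```

Anyone without this policy, or an equivalent one, would be unable to start the experiment, which is the point of the IAM integration.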

The injected actions create conditions that closely replicate an actual disruption: they genuinely consume CPU and memory and throttle API requests at the control plane level, so the experience is the same as any other throttling.

However, unlike in the real world, FIS provides built-in safety mechanisms in the form of stop conditions, which halt experiments before they run out of control. Once an experiment ends, you can review how your environment reacted, which helps you prepare for an actual disruption.
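In practice a stop condition is tied to a CloudWatch alarm: if the alarm fires while the experiment is running, FIS stops the experiment. The sketch below shows how such a condition might be attached to a set of template arguments before they are passed to `create_experiment_template`. The alarm ARN and the helper names are hypothetical, and the ARN check is only a rough sanity test.

```python
def cloudwatch_stop_condition(alarm_arn):
    """Return a FIS stop condition that halts the experiment when the alarm fires."""
    # Rough sanity check only; it does not prove the alarm exists.
    if ":cloudwatch:" not in alarm_arn or ":alarm:" not in alarm_arn:
        raise ValueError(f"not a CloudWatch alarm ARN: {alarm_arn}")
    return {"source": "aws:cloudwatch:alarm", "value": alarm_arn}


def with_stop_conditions(template_kwargs, alarm_arns):
    """Return a copy of the template arguments guarded by the given alarms."""
    guarded = dict(template_kwargs)  # shallow copy, original left untouched
    guarded["stopConditions"] = [cloudwatch_stop_condition(a) for a in alarm_arns]
    return guarded


# Hypothetical alarm that fires when the checkout error rate gets too high.
alarm = "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:checkout-error-rate"
unguarded = {"description": "demo", "stopConditions": [{"source": "none"}]}
guarded = with_stop_conditions(unguarded, [alarm])
print(guarded["stopConditions"])
```

Choosing an alarm on a customer-facing metric, such as error rate or latency, is a common way to make sure an experiment ends as soon as real users would start to notice it.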

This process is much like a vaccine, where part of a virus or bacterium is introduced into your body so that your immune system learns to recognise and combat it in an actual infection.

FIS supports Amazon CloudWatch and third-party monitoring tools via Amazon EventBridge, enabling you to use your existing metrics to monitor Fault Injection Simulator experiments. While an experiment is running, you can observe which actions have been executed. After it has completed, you can see details on which actions were run, whether stop conditions were triggered, how metrics compared to your expected steady state, and more.

This approach lets teams run and observe their experiments end to end, making it easier to find monitoring blind spots, performance bottlenecks, or other “unknown” weaknesses missed by traditional software tests.

Many companies now apply chaos engineering to their environments to build resilience against infrastructure, network and application failures. The practice was introduced in 2011 by the media giant Netflix during its migration to the cloud.

Aside from AWS, chaos engineering tooling is already available elsewhere, including Chaos Monkey, the open-source tool created by Netflix, and Gremlin, a commercial chaos-engineering-as-a-service platform.
