Chaos Testing for Microservices: Network Resilience and Latency Spikes

Your microservices work perfectly on a fast, reliable network. Production has neither. Chaos testing reveals what happens when the network does not cooperate.


1. Why functional tests miss resilience issues

Functional tests verify that your application works correctly when everything goes right. They confirm that a user can place an order, that the payment service processes charges, and that notification emails are sent. These tests are necessary but insufficient for microservices, because microservices fail in ways that functional tests never exercise.

In a microservices architecture, services communicate over the network. Networks introduce latency, packet loss, connection resets, DNS failures, and partial availability. A service that works perfectly when all dependencies respond in 50 milliseconds might behave unpredictably when a dependency responds in 5 seconds, returns errors for 10% of requests, or stops responding entirely.

The most dangerous resilience issues are the ones where the system does not crash but degrades silently. A slow dependency might cause thread pool exhaustion, which cascades to other services. A partially available database might cause some requests to succeed and others to fail, creating inconsistent state. A retry storm might amplify a small failure into a complete outage.

Chaos testing deliberately introduces these failure conditions in a controlled environment to verify that your system handles them gracefully. The goal is not to break things randomly but to verify specific resilience properties: circuit breakers trip correctly, timeouts are configured properly, fallback behavior works as designed, and graceful degradation provides an acceptable user experience.

2. Network fault injection fundamentals

Network fault injection introduces controlled network failures between services. The three most common fault types are: connection failures (the TCP connection cannot be established), response delays (the connection succeeds but the response takes abnormally long), and corruption (the response is malformed or truncated).

Tools like Toxiproxy, Chaos Monkey, and Litmus Chaos provide programmable network fault injection. Toxiproxy sits between your service and its dependencies as a proxy, and you can add "toxics" (latency, bandwidth limits, connection resets) via its API. This lets you inject faults from your test code without modifying the services themselves.
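As a concrete sketch, test code can drive Toxiproxy's HTTP API directly. This assumes a Toxiproxy server on its default API port 8474 and a proxy already registered between your service and a dependency; the proxy name, toxic names, and helper functions here are illustrative, not part of any official client.

```typescript
// Sketch: adding a latency toxic through Toxiproxy's HTTP API.
// Assumes Toxiproxy is listening on localhost:8474 and a proxy named
// "payments" is already registered (names are illustrative).

interface ToxicSpec {
  name: string;
  type: string;                      // e.g. "latency", "timeout", "reset_peer"
  stream: "upstream" | "downstream";
  toxicity: number;                  // probability (0..1) the toxic applies
  attributes: Record<string, number>;
}

// Build the JSON body for a downstream latency toxic.
export function latencyToxic(ms: number, jitter = 0): ToxicSpec {
  return {
    name: `latency_${ms}ms`,
    type: "latency",
    stream: "downstream",
    toxicity: 1.0,
    attributes: { latency: ms, jitter },
  };
}

// Apply the toxic from test setup; remove it again during teardown.
export async function injectLatency(proxy: string, ms: number): Promise<void> {
  const res = await fetch(`http://localhost:8474/proxies/${proxy}/toxics`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(latencyToxic(ms)),
  });
  if (!res.ok) throw new Error(`Toxiproxy rejected toxic: ${res.status}`);
}
```

Because the payload builder is separated from the HTTP call, the fault specification itself can be unit-tested without a running Toxiproxy server.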

For Kubernetes-based architectures, Chaos Mesh and Litmus provide network fault injection at the pod level. They can partition network traffic between pods, inject DNS failures, and simulate network bandwidth limitations. These tools operate at the infrastructure level, which means they can inject faults that application-level proxies cannot (such as DNS resolution failures or TCP RST packets).

At the browser level, Playwright's network interception can simulate some of these conditions from the user's perspective. You can delay API responses, return error status codes, or drop connections to see how the frontend handles backend failures. This is not a replacement for infrastructure-level chaos testing, but it verifies that the user experience degrades gracefully when backend services fail.
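A minimal sketch of two such failure modes as Playwright-style route handlers. The `RouteLike` interface below is deliberately narrowed to the two methods used here so the handlers can be exercised without a browser; in a real Playwright test you would pass them to `page.route()`.

```typescript
// Sketch: route handlers for two backend failure modes, written against a
// narrowed Route interface so they are testable without a browser.

interface RouteLike {
  fulfill(opts: { status: number; contentType: string; body: string }): Promise<void>;
  continue(): Promise<void>;
}

// Respond with an injected error status instead of reaching the backend.
export const failWith = (status: number) => async (route: RouteLike) =>
  route.fulfill({
    status,
    contentType: "application/json",
    body: JSON.stringify({ error: "injected" }),
  });

// Let the request through after an artificial delay.
export const delayBy = (ms: number) => async (route: RouteLike) => {
  await new Promise((r) => setTimeout(r, ms));
  await route.continue();
};
```

In a Playwright test these would be registered before navigation, for example `await page.route("**/api/payments*", failWith(503))`, and the assertions then check the visible fallback UI rather than the network itself (the route pattern is illustrative).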


3. Latency spike testing and timeout verification

Latency spikes are the most common and least tested network failure mode. A service that normally responds in 50 milliseconds might occasionally take 10 seconds due to garbage collection pauses, database lock contention, or cloud provider throttling. If your calling service has a 30-second timeout (or no timeout at all), it will hold connections open, consuming thread pool resources until the entire service becomes unresponsive.

Latency testing verifies that your services handle slow dependencies correctly. The key question is: when a dependency takes 10 seconds instead of 50 milliseconds, does your service time out and return an error within a reasonable period, or does it hang indefinitely? Many production outages trace back to missing or misconfigured timeouts.

To test latency handling, inject progressively increasing delays between services. Start at 1 second, then 5 seconds, then 10, then 30. For each delay level, verify three things: the calling service times out within its configured timeout, the timeout produces a meaningful error (not a generic 500), and the circuit breaker trips after a configured number of timeouts.
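The property these latency tests verify can be made explicit with a small guard around every cross-service call. A minimal sketch (the `TimeoutError` type and the operation names are illustrative, not a specific framework's API):

```typescript
// Sketch: a generic timeout guard for cross-service calls. A chaos test can
// then assert that an injected 10-second delay surfaces as a TimeoutError
// well before 10 seconds, and that the error names the slow dependency.

class TimeoutError extends Error {
  constructor(public readonly operation: string, public readonly ms: number) {
    super(`${operation} timed out after ${ms}ms`);
    this.name = "TimeoutError";
  }
}

export function withTimeout<T>(operation: string, ms: number, work: Promise<T>): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(operation, ms)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

Wrapping calls this way also produces the "meaningful error" the previous paragraph asks for: the caller learns which dependency was slow and what the budget was, instead of receiving a generic 500.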

Pay special attention to cascade effects. When Service A times out waiting for Service B, Service A's own response time increases. If Service C calls Service A, it also experiences increased latency. Without proper timeouts at each layer, a single slow service can make the entire system slow. Test the full chain, not just individual service pairs.

4. Simulating partial outages and degraded dependencies

Partial outages are harder to detect and handle than complete outages. When a service is completely down, health checks fail, alerts fire, and the team responds. When a service works for 80% of requests and fails for 20%, it might not trigger alerting thresholds, but 20% of users are experiencing errors.

Simulate partial outages by configuring your fault injection tool to fail a percentage of requests rather than all requests. Toxiproxy supports this through the toxicity parameter, which applies a toxic (such as a timeout or connection reset) to a configurable fraction of connections. You can also use a custom proxy that returns errors for a percentage of requests while forwarding the rest normally.
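The custom-proxy approach can be sketched in a few lines: a wrapper that fails a configurable fraction of calls and forwards the rest. The RNG is injectable so tests stay deterministic; the handler shape is illustrative, and in real use the upstream would be an actual service call.

```typescript
// Sketch: a fault-injecting wrapper that fails a configurable fraction of
// requests. Injecting the RNG keeps chaos tests deterministic.

type Handler = (req: string) => Promise<string>;

export function withFaultRate(
  upstream: Handler,
  failureRate: number,
  rng: () => number = Math.random,
): Handler {
  return async (req) => {
    if (rng() < failureRate) {
      throw new Error("injected fault: upstream unavailable");
    }
    return upstream(req);
  };
}
```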

Test your system's behavior at different failure percentages: 1%, 5%, 10%, 25%, and 50%. At a 1% failure rate, retries should absorb the errors transparently, invisible to users. At 10%, circuit breakers should start engaging. At 50%, fallback behavior should activate. If your system behaves identically at a 1% and a 50% failure rate, your resilience mechanisms are not working.
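The circuit-breaker behavior those thresholds exercise reduces to a small sketch: a count-based breaker that opens after N consecutive failures and serves a fallback while open. This is a minimal illustration, not a production implementation, which would also add a half-open state that periodically probes for recovery.

```typescript
// Sketch: a minimal count-based circuit breaker. Opens after `threshold`
// consecutive failures; while open, the fallback is served and the upstream
// is never called.

export class CircuitBreaker<T> {
  private failures = 0;
  private open = false;

  constructor(
    private readonly threshold: number,
    private readonly fallback: () => T,
  ) {}

  async call(work: () => Promise<T>): Promise<T> {
    if (this.open) return this.fallback();
    try {
      const result = await work();
      this.failures = 0; // any success resets the consecutive-failure count
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.threshold) this.open = true;
      return this.fallback();
    }
  }
}
```

A chaos test at a 50% fault rate would assert that the breaker is open and the fallback is being served, while the same test at 1% would assert it stays closed.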

Degraded dependencies are a variant of partial outages. The service responds but with reduced functionality. For example, a recommendation engine might return generic results instead of personalized ones during high load. Test that your frontend handles degraded responses gracefully: showing fallback content, hiding broken features, or displaying appropriate messaging to users. Tools like Assrt can generate E2E tests that verify user-facing behavior during degraded states, ensuring that error handling actually produces the intended user experience.
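A minimal sketch of the frontend side, assuming a hypothetical recommendations payload with a `personalized` flag (the shape and the "Popular right now" copy are illustrative):

```typescript
// Sketch: choosing what to render when the recommendations engine is
// degraded or unavailable. The payload shape is hypothetical.

interface Recommendations {
  personalized: boolean;
  items: string[];
}

// Personalized results when available; generic best-sellers with a label
// when the engine is degraded or down; never a broken widget.
export function renderableItems(
  res: Recommendations | null,
  bestSellers: string[],
): { items: string[]; note?: string } {
  if (!res || res.items.length === 0) {
    return { items: bestSellers, note: "Popular right now" };
  }
  return res.personalized
    ? { items: res.items }
    : { items: res.items, note: "Popular right now" };
}
```

A degraded-state E2E test then asserts on the visible outcome, such as the fallback label appearing, rather than on the network response alone.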

5. Running chaos tests in CI without breaking everything

Chaos testing in production (as Netflix famously does with Chaos Monkey) requires mature observability, automated recovery, and organizational buy-in. Most teams should start with chaos testing in CI or staging environments, where failures are contained and do not affect real users.

Structure chaos tests as a separate CI stage that runs after functional tests pass. This ensures you are testing resilience of a functionally correct system, not debugging functional bugs in a chaotic environment. The chaos stage deploys your services, runs a baseline functional test to confirm everything works, then injects faults and runs resilience-specific assertions.
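A sketch of that pipeline shape, using hypothetical GitHub Actions job names and npm scripts:

```yaml
jobs:
  functional-tests:
    # ... existing functional suite ...
  chaos-tests:
    needs: functional-tests        # only runs once functional tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker compose up -d app toxiproxy
      - run: npm run test:baseline # confirm the un-faulted system works
      - run: npm run test:chaos    # inject faults, assert resilience
      - run: docker compose down
```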

Keep chaos tests focused and deterministic. Each test should inject a specific fault (e.g., "add 5 seconds of latency to the payment service") and verify a specific resilience property (e.g., "the checkout page shows a retry message within 3 seconds"). Avoid random fault injection in CI; it makes failures hard to reproduce and debug. Save random chaos for staging game days where a human is monitoring.

Start small. Pick your most critical service dependency (usually the database or a core API) and write three chaos tests: what happens when it is down, what happens when it is slow, and what happens when it returns errors. Once these tests are stable and running in CI, expand to additional dependencies. A suite of 10 to 20 focused chaos tests covering your top dependencies provides significant confidence that your system handles failure gracefully.

Ready to automate your testing?

Assrt discovers test scenarios, writes Playwright tests from plain English, and self-heals when your UI changes.

$ npm install @assrt/sdk