Circuit Breaker Pattern: Fortifying Microservices Against Cascading Failures

In today’s world, distributed systems and microservices architectures have become the norm for building scalable, highly available applications. However, as the complexity of these systems increases, so does the likelihood of failures and network latency issues. To address these challenges, developers have turned to various patterns and techniques that improve application resilience. One such pattern is the Circuit Breaker pattern. In this post, we will explore the Circuit Breaker pattern in depth, discussing its benefits, implementation strategies, and real-world use cases.

Intent

The intent of the Circuit Breaker pattern is to enhance the resilience and reliability of distributed systems by providing mechanisms to detect and handle failures. It aims to prevent cascading failures, maintain a satisfactory user experience, and enable the system to gracefully degrade under adverse conditions. The Circuit Breaker pattern acts as a protective layer between services, monitoring their health and selectively breaking the circuit to isolate faulty or unresponsive components. By implementing this pattern, developers can improve fault tolerance, minimize service disruptions, and ensure the overall stability of the system.

Problem

In a distributed environment, calls to remote resources and services can encounter transient faults, such as slow network connections, timeouts, or temporary unavailability of resources. These faults often resolve themselves after a short duration. To handle such scenarios effectively, a robust cloud application should employ a strategy like the Retry pattern.
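As an illustrative sketch of that strategy (not a full Retry pattern implementation; the attempt count and delay are arbitrary), a simple retry helper might look like this:

import java.util.function.Supplier;

public class RetryExample {
    // Retries a transient operation a few times, pausing between attempts.
    public static String callWithRetry(Supplier<String> operation, int maxAttempts, long delayMillis)
            throws InterruptedException {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException ex) {
                lastFailure = ex; // remember the most recent failure
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // back off before the next attempt
                }
            }
        }
        throw lastFailure; // the fault did not clear within maxAttempts; give up
    }
}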

However, there are situations where faults arise from unanticipated events, which may take longer to rectify. These faults can range from partial loss of connectivity to complete service failure. In such cases, continually retrying an operation with low chances of success becomes futile. Instead, the application should promptly acknowledge the failure and handle it accordingly.

Moreover, in high-traffic scenarios, a failure in one part of the system can trigger cascading failures. For instance, an operation invoking a service can be configured with a timeout and respond with a failure message if the service doesn’t respond within the specified period. However, this approach may lead to multiple concurrent requests being blocked until the timeout expires. Consequently, critical system resources like memory, threads, and database connections can become depleted, causing failures in unrelated parts of the system that rely on the same resources. In such situations, it’s preferable for the operation to fail immediately and attempt to invoke the service only if success is probable. While setting a shorter timeout can mitigate the problem, it shouldn’t be so brief that the operation fails most of the time, even if the service request would eventually succeed.
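The following sketch makes that resource pressure concrete: even with a client-side timeout, each caller holds a thread until the timeout fires. Here slowRemoteCall() is a hypothetical stand-in for an unresponsive service:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        CompletableFuture<String> call = CompletableFuture.supplyAsync(TimeoutExample::slowRemoteCall);
        try {
            // The calling thread blocks here for up to 1 second.
            String response = call.get(1, TimeUnit.SECONDS);
            System.out.println(response);
        } catch (TimeoutException ex) {
            // Under high traffic, many callers can pile up in this blocked state.
            System.out.println("Service did not respond in time");
        }
    }

    // Hypothetical stand-in for a slow remote service.
    private static String slowRemoteCall() {
        try {
            Thread.sleep(5_000);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return "response";
    }
}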

Solution

The Circuit Breaker pattern, introduced by Michael Nygard in his book Release It!, is a valuable mechanism for preventing an application from repeatedly attempting an operation that is likely to fail. Instead of waiting for the fault to be resolved or wasting CPU cycles determining that the fault is long-lasting, the pattern allows the application to proceed. It also enables the application to detect whether the fault has been resolved: if the problem appears to be fixed, the application can try to invoke the operation again.

A service client should invoke a remote service via a proxy that functions similarly to an electrical circuit breaker. The pattern lets you define a failure threshold between two microservices: when the number of consecutive failures crosses that threshold, the circuit breaker trips, and for the duration of a timeout period all attempts to invoke the remote service fail immediately. After the timeout expires, the circuit breaker allows a limited number of test requests through to check whether the microservice has recovered. If those requests succeed, the proxy resumes normal operation; if not, the timeout period starts again.

Implementation

States in Circuit Breaker Pattern

The Circuit Breaker pattern consists of three primary states: Closed, Open, and Half-Open. These states play a crucial role in managing the behavior of the Circuit Breaker and determining how it handles requests to a potentially faulty service.

  1. Closed State:
    • In the Closed state, the Circuit Breaker operates normally, allowing requests to pass through to the service provider.
    • The Circuit Breaker monitors the responses from the service provider to evaluate their success or failure.
    • If the number of consecutive failures stays below a configured threshold, the Circuit Breaker remains in the Closed state.
    • If the failure count exceeds the threshold, the Circuit Breaker transitions to the Open state.
  2. Open State:
    • When the Circuit Breaker is in the Open state, it prevents requests from reaching the service provider.
    • Instead of allowing requests to proceed, the Circuit Breaker immediately responds with an exception or an error message.
    • The Open state serves as a protection mechanism to avoid overwhelming a faulty service or wasting resources on likely failed requests.
    • While in the Open state, the Circuit Breaker periodically checks if a timeout period has elapsed.
  3. Half-Open State:
    • After the timeout period in the Open state expires, the Circuit Breaker transitions to the Half-Open state.
    • In the Half-Open state, the Circuit Breaker allows a limited number of test requests to pass through to the service provider.
    • These test requests help determine whether the service provider has recovered or if the underlying fault still persists.
    • If all the test requests succeed, the Circuit Breaker transitions back to the Closed state, assuming the service provider has returned to a healthy state.
    • However, if any of the test requests fail, the Circuit Breaker returns to the Open state and restarts the timeout period to protect against continued failures.
+---------+       Failure Count > Threshold       +--------+
| Closed  |-------------------------------------->|  Open  |<-------------------
+---------+                                       +--------+                   |
    ^                                                  |                       |
    |                                                  | Timeout               |
    |                                                  |                       |
    |                                                  |                       |
    |     Success Count > Threshold                    v                       |
    |                                             +-----------+                |
    ----------------------------------------------| Half-Open |                |
                                                  +-----------+                |
                                                       |                       |
                                                       |                       |
                                                       | Test Request Failure  |
                                                       -------------------------

Java Implementation

// The possible states of the circuit breaker
public enum State {
  OPEN,
  CLOSED,
  HALF_OPEN
}

public class CircuitBreaker {
    private State state;
    private final int failureThreshold;  // consecutive failures before the circuit opens
    private final int successThreshold;  // consecutive successes before the circuit closes again
    private final long timeout;          // how long (ms) the circuit stays open before probing
    private int failureCount;
    private int successCount;
    private long lastFailureTimestamp;

    public CircuitBreaker(int failureThreshold, int successThreshold, long timeout) {
        this.failureThreshold = failureThreshold;
        this.successThreshold = successThreshold;
        this.timeout = timeout;
        this.state = State.CLOSED;
    }

    // Synchronized so that concurrent callers observe consistent state transitions.
    public synchronized void execute(Runnable action) {
        // Once the timeout has elapsed, let a limited number of test requests through.
        if (state == State.OPEN && System.currentTimeMillis() - lastFailureTimestamp > timeout) {
            state = State.HALF_OPEN;
            successCount = 0;
        }

        if (state == State.CLOSED || state == State.HALF_OPEN) {
            try {
                action.run();
                handleSuccess();
            } catch (RuntimeException ex) {
                handleFailure();
                throw ex; // propagate the original failure to the caller
            }
        } else {
            // Fail fast while the circuit is open instead of invoking the remote service.
            throw new CircuitBreakerOpenException("Circuit Breaker is open");
        }
    }

    private void handleSuccess() {
        successCount++;
        failureCount = 0;

        // Enough successful test requests: the service has recovered, close the circuit.
        if (state == State.HALF_OPEN && successCount >= successThreshold) {
            state = State.CLOSED;
        }
    }

    private void handleFailure() {
        failureCount++;
        successCount = 0;
        lastFailureTimestamp = System.currentTimeMillis();

        // A failed test request reopens the circuit immediately; in the Closed state
        // the circuit opens once the failure threshold is crossed.
        if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
            state = State.OPEN;
        }
    }
}

public class CircuitBreakerOpenException extends RuntimeException {
    public CircuitBreakerOpenException(String message) {
        super(message);
    }
}
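To see the class above in action, the following sketch drives it with a deliberately failing operation; fetchInventory() is a hypothetical placeholder for a real remote call:

public class CircuitBreakerDemo {
    public static void main(String[] args) {
        // Open after 3 consecutive failures, close again after 2 successful
        // test requests, and wait 10 seconds before probing the service.
        CircuitBreaker breaker = new CircuitBreaker(3, 2, 10_000);

        for (int i = 0; i < 5; i++) {
            try {
                breaker.execute(CircuitBreakerDemo::fetchInventory);
            } catch (CircuitBreakerOpenException ex) {
                // Calls 4 and 5 fail fast without reaching the remote service.
                System.out.println("Circuit open, failing fast");
            } catch (RuntimeException ex) {
                System.out.println("Call failed: " + ex.getMessage());
            }
        }
    }

    // Hypothetical remote call that always fails, to demonstrate the circuit tripping.
    private static void fetchInventory() {
        throw new RuntimeException("inventory service unavailable");
    }
}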

Using the Netflix Hystrix Circuit Breaker

Spring Cloud Netflix Hystrix is a powerful open-source library provided by Spring Cloud for implementing the Circuit Breaker pattern in distributed systems. It is built on top of the Netflix Hystrix library and provides seamless integration with Spring Boot applications.

With Spring Cloud Netflix Hystrix, you can easily annotate methods in your service classes to define them as protected by the Circuit Breaker pattern. Hystrix intercepts these method calls and wraps them in a circuit breaker that monitors the execution. If the execution fails or exceeds certain thresholds, Hystrix opens the circuit and redirects subsequent calls to fallback methods or fallback logic.

Hystrix provides various features such as fallbacks, timeouts, request caching, and metrics gathering. It allows you to define custom fallback logic to handle failures and define timeouts for service calls to prevent long-running requests.

To create a Spring Boot project with Netflix Hystrix for implementing the Circuit Breaker pattern, follow these steps:

Step 1: Set up a new Spring Boot Project

  • Create a new Spring Boot project using your preferred IDE or Spring Initializr (https://start.spring.io).
  • Add the necessary dependencies for Spring Boot and Hystrix to your project’s build configuration file (e.g., pom.xml for Maven or build.gradle for Gradle).
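For reference, a minimal pom.xml entry might look like the following; the version is typically managed by the Spring Cloud BOM, so the exact release train (a Hystrix-compatible one such as Hoxton) depends on your setup:

<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
</dependency>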

Step 2: Create a Service with a Circuit Breaker

  • Create a new service class that will be protected by the Circuit Breaker pattern.
  • Annotate the service class with @Service to enable component scanning.
  • Add @HystrixCommand annotation to the methods that need Circuit Breaker protection.
  • Provide fallback methods that will be executed when the Circuit Breaker opens.
import org.springframework.stereotype.Service;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;

@Service
public class MyService {

    @HystrixCommand(fallbackMethod = "fallbackMethod", commandProperties = {
        @HystrixProperty(name = "execution.isolation.strategy", value = "THREAD"),
        @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "1000"),
        @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "5"),
        @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "5000"),
        @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50")
    })
    public String performOperation() {
        // Code for the operation to be performed, e.g. a call to a remote service
        return "Operation result";
    }

    public String fallbackMethod() {
        // Fallback logic to be executed when the Circuit Breaker is open
        return "Fallback response";
    }
}
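One detail worth noting: Hystrix locates the fallback by name, so fallbackMethod must be declared in the same class and its parameter list must be compatible with that of performOperation; a mismatched signature causes the command to fail at runtime instead of falling back.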

Step 3: Configure Hystrix

  • Create a configuration class to customize Hystrix settings.
  • Annotate the configuration class with @Configuration and @EnableHystrix.
import org.springframework.context.annotation.Configuration;
import org.springframework.cloud.netflix.hystrix.EnableHystrix;

@Configuration
@EnableHystrix
public class HystrixConfig {
    // Hystrix configuration if needed
}

Step 4: Use the Service in your Controller

  • Inject the service into your controller class using @Autowired.
  • Use the service methods in your controller endpoints.
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class MyController {
    
    @Autowired
    private MyService myService;
    
    @GetMapping("/perform")
    public String performOperation() {
        return myService.performOperation();
    }
}

Step 5: Run the Application

  • Run your Spring Boot application and test the endpoint for the protected service.
  • Hystrix will handle failures and open the Circuit Breaker when necessary, executing the fallback logic.

Note: Make sure to annotate your main application class with @EnableCircuitBreaker and to include the required Hystrix dependency in your build configuration file, such as spring-cloud-starter-netflix-hystrix for Maven or implementation 'org.springframework.cloud:spring-cloud-starter-netflix-hystrix' for Gradle.
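A minimal main class, assuming a standard Spring Boot setup, might look like this:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.circuitbreaker.EnableCircuitBreaker;

@SpringBootApplication
@EnableCircuitBreaker // enables circuit breaker support, backed here by Hystrix
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}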

Applicability

The Circuit Breaker pattern is applicable in various scenarios where communication between services or components in a distributed system is involved. It is particularly useful in the following situations:

  1. Fault tolerance is needed to prevent cascading failures.
  2. Multiple service dependencies are involved, and failures in one should not affect others.
  3. Interacting with unreliable or external services that require handling of timeouts and fallbacks.
  4. Graceful degradation is required to maintain a good user experience during service disruptions.
  5. Performance monitoring and metrics visibility are necessary for identifying and resolving issues.

Applications

  1. Microservices Architecture: In microservices, it manages service failures and prevents cascading issues.
  2. Distributed Systems: It ensures system stability by handling communication failures and service unavailability.
  3. Cloud Computing: Cloud-based applications heavily rely on services provided by cloud providers. The pattern copes with disruptions and failures in cloud services.
  4. Financial Systems: Financial applications often involve interactions with multiple external services, such as banks, payment gateways, and financial data providers. It handles failures and timeouts in financial services, ensuring uninterrupted transactions.
  5. E-commerce Systems: E-commerce platforms rely on various external services for payment processing, inventory management, and shipping. It manages failures in external services to ensure smooth online shopping experiences.
  6. Web Applications: It provides fallback options and maintains a responsive user experience during service disruptions.
  7. IoT (Internet of Things): In IoT systems, where devices communicate with each other and cloud services, the Circuit Breaker pattern ensures fault tolerance, handles intermittent connectivity issues, and maintains the responsiveness of the IoT ecosystem.

Benefits

The Circuit Breaker pattern offers several benefits when implemented in a distributed system:

  1. Fault Isolation: By opening the circuit and redirecting requests to fallback mechanisms, it contains the impact of failures and maintains system stability.
  2. Resilience: It allows applications to gracefully degrade their functionality instead of completely breaking down, ensuring continuous operation even in the presence of service disruptions.
  3. Fail-Fast Behavior: When a service is unavailable or experiencing issues, the Circuit Breaker quickly detects the failures and avoids wasting resources by immediately returning fallback responses or exceptions, rather than waiting for slow or unresponsive services.
  4. Performance Optimization: By implementing timeouts and retries, the Circuit Breaker pattern optimizes system performance. It helps prevent long response times or hanging requests by imposing maximum time limits for service calls. Retries allow for subsequent attempts, improving the chances of successful execution without causing excessive delays.
  5. Graceful Degradation: The pattern allows applications to gracefully degrade their functionality when services are unavailable or unreliable. By providing fallback mechanisms or alternative behavior, it ensures that users experience a reasonable level of service even in degraded conditions, maintaining a positive user experience.
  6. Service Monitoring: The Circuit Breaker pattern often includes monitoring and metrics gathering capabilities. It allows for tracking and reporting important metrics such as failure rates, response times, and state transitions of the Circuit Breaker. This monitoring helps in identifying and resolving issues, enabling proactive management of service dependencies.
  7. Scalability and Load Management: The Circuit Breaker pattern aids in managing system scalability and load distribution. By preventing requests from reaching overloaded or failing services, it protects critical system resources and helps maintain overall system stability during peak loads or sudden spikes in traffic.

Challenges

While the Circuit Breaker pattern offers numerous advantages, it is important to consider potential downsides when implementing it:

  1. Increased Complexity: The pattern adds complexity to the system architecture. The need to manage state transitions, fallback mechanisms, and configuration settings complicates the codebase and may require additional effort for maintenance and troubleshooting.
  2. Overhead and Performance Impact: The additional checks, state management, and potential fallback executions introduce overhead, which can affect overall system performance, especially in high-throughput scenarios. Careful consideration and testing are necessary to balance the benefits of fault tolerance against this cost.
  3. Fallback Limitations: Designing effective fallback mechanisms can be challenging. Fallbacks should provide alternative behavior or responses that are meaningful to the user and maintain system functionality.
  4. Fallback State Maintenance: Maintaining the state of the fallback mechanisms can be challenging. When the Circuit Breaker transitions between states (e.g., from Open to Half-Open), it may be necessary to reset or reevaluate the state of the fallback mechanisms. This state management can introduce additional complexity and potential issues.
  5. Dependency on Configuration: The effectiveness of the Circuit Breaker pattern heavily relies on appropriate configuration settings. Determining optimal thresholds, timeouts, and retry strategies can be challenging and may require careful tuning and testing to ensure the pattern behaves as intended. Incorrect or inadequate configuration can lead to false positives or negatives, impacting the overall system behavior.
  6. Limited Visibility: The pattern adds an extra layer of abstraction, making it difficult to pinpoint the root cause of failures or performance issues. Comprehensive monitoring and logging mechanisms are necessary to gain visibility into the behavior of the Circuit Breaker and underlying services.