The blueprint for highly available systems: Active-Active Architecture in Banking on Google Cloud

Mario Schaefer
January 15, 2025
DevOps

Uninterrupted service is essential in the banking sector, where strict compliance requirements, security restrictions and limited access to certain Google Cloud services add a level of complexity not found in less regulated environments. An Active-Active architecture provides a reliable solution for high availability and resilience, but implementing it in Google Cloud for banks comes with its own unique challenges.

In this blog, we'll explore five 'active-active' designs we implemented for a banking client, the obstacles we faced and the lessons we learned. If you are facing similar challenges with high availability, these insights will help you shape your journey more effectively.

Banks need an active-active architecture to avoid downtime and to retain customer loyalty.

Implementing an active-active architecture enables companies to synchronize their critical systems across multiple locations, thereby increasing resilience.

Why an active-active architecture in banking?

In the banking sector, reliability, security and compliance are non-negotiable - and failure to address these challenges can expose organizations to significant financial and reputational risk. An Active-Active architecture is not just an option, it is essential to ensure smooth operations and business continuity in an increasingly demanding landscape. The benefits of an Active-Active architecture are far-reaching and can be critical to an organization's success.

Here is why it matters:

Constant availability: Downtime is not only unpleasant for customers, but also leads to financial losses and undermines trust. Active-Active setups eliminate single points of failure by distributing workloads across multiple regions to ensure uninterrupted service.
Reliability: Outages, hardware failures and disasters are inevitable, but by operating in multiple regions simultaneously, Active-Active architectures keep systems operational and mitigate the impact of such events.
Compliance with legal regulations: Strict data retention and disaster recovery regulations must be met, and Active-Active setups provide the cross-region replication needed to meet these requirements.
Low latency: Customers expect immediate responsiveness. By processing workloads closer to the user, Active-Active setups reduce latency and provide a superior user experience.
Customer confidence: Reliability is the basis for customer trust. Failure to meet expectations can irreparably damage relationships and the brand's reputation.

Implementing an active-active architecture is a strategic step that enables banks to optimize their operations and respond quickly to unexpected challenges.

While Google Cloud offers powerful tools for Active-Active architectures, the unique constraints of banking - such as restricted access to certain services - require customized solutions. A well-planned Active-Active architecture can help minimize risk and increase operational efficiency.

The risks of operating without an active-active architecture are significant and can jeopardize a company's operational resilience.

Why a lack of active-active architecture could harm your company

Active-Active Architecture - Regional deployments

In a non-active-active configuration, such as a regional deployment, a single region is responsible for the entire workload. While this approach is easier and cheaper to implement, it has significant drawbacks. Availability is inherently limited, as any failure - whether due to hardware failure, natural disaster or network issues - can lead to extended downtime. Without redundancy across regions, it is virtually impossible to maintain an availability target of 99.99 %, which leaves critical systems unprotected and increases recovery times. This has a direct impact on customer confidence, increases the risk of financial loss and leaves companies vulnerable in a highly competitive market.
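
To make the 99.99 % target concrete, the following minimal sketch converts an availability target into a yearly downtime budget and estimates the combined availability of several regions. The 99.9 % single-region figure is an assumed value for illustration, not a number from this project, and the formula assumes that regional failures are independent and that failover is instantaneous, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget, in minutes, for a given availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(per_region: float, regions: int) -> float:
    """Availability of N regions when one healthy region is enough.

    Assumes independent regional failures and instant failover.
    """
    return 1.0 - (1.0 - per_region) ** regions


if __name__ == "__main__":
    # 99.99 % leaves roughly 52.6 minutes of downtime per year.
    print(f"99.99 % target: {allowed_downtime_minutes(0.9999):.1f} min/year")
    # Assumed single-region availability of 99.9 % (illustrative only).
    print(f"99.9 % region:  {allowed_downtime_minutes(0.999):.1f} min/year")
    print(f"two such regions combined: {combined_availability(0.999, 2):.6f}")
```

Under these assumptions, a single region at 99.9 % has a budget roughly ten times larger than the 99.99 % target allows, while two such regions together exceed the target on paper.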

5 Active-Active Architecture Designs

The Active-Active architecture for the banking sector is designed to ensure high availability, resilience and regulatory compliance. It uses a combination of multi-regional deployments, synchronous and asynchronous replication strategies and split-horizon DNS to efficiently manage both internal and external traffic. External clients are routed to public endpoints for seamless customer access, while internal clients are routed to private endpoints to optimize backend communications. This approach balances performance, consistency and failover capabilities and overcomes the unique challenges of a highly regulated and security-sensitive environment.
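
The split-horizon idea can be sketched in a few lines: the same service name resolves to a different endpoint depending on where the client sits. The network range and both endpoint addresses below are hypothetical placeholders, and a real setup would rely on private and public DNS zones rather than application code.

```python
import ipaddress

# Hypothetical values for illustration only.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]
PUBLIC_ENDPOINT = "203.0.113.10"   # address from a documentation range
PRIVATE_ENDPOINT = "10.20.0.10"


def resolve(client_ip: str) -> str:
    """Return the endpoint a split-horizon resolver would hand to this client."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in INTERNAL_NETWORKS):
        return PRIVATE_ENDPOINT  # internal clients stay on private endpoints
    return PUBLIC_ENDPOINT       # everyone else gets the public endpoint


if __name__ == "__main__":
    print(resolve("10.1.2.3"))      # internal client
    print(resolve("198.51.100.7"))  # external client
```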

The architecture for external data traffic uses the Google Kubernetes Engine (GKE) Multi-Cluster Gateway Service of the Google Cloud Platform to ensure seamless, reliable and secure access to customer-facing applications. This design provides a robust and scalable solution for managing external traffic across multiple active regions.

The internal traffic architecture in an Active-Active configuration requires robust designs to ensure reliable service-to-service communication across multiple regions. Below are five approaches that utilize Google Kubernetes Engine (GKE) gateways and various load balancing methods to manage internal traffic.

1. GKE Internal Global Multi-Cluster Gateway

Overview: A centralized internal gateway provided by GKE to manage service traffic across multiple clusters worldwide.

Active-Active Architecture: GKE Internal Global Multi-Cluster Gateway

Advantages: Simplifies traffic management with a single control plane for routing across clusters.
Disadvantages: At the time of writing, the GKE Internal Global Multi-Cluster Gateway had not yet been fully implemented in Google Cloud. According to the Google team, the release is planned for 2025.
Use case: Suitable for organizations that prioritize simplicity and unified management in a global Active-Active configuration.

2. DNS-based global load balancing with regional GKE gateways

Overview: DNS resolves internal traffic to regional GKE gateways and distributes requests to the nearest cluster.

Active-Active Architecture: DNS-Based Global Load Balancing with Regional GKE Gateways

Investing in a powerful active-active architecture is of great importance for banks in order to remain competitive.

Advantages: Offers flexibility in routing logic and can be combined well with split-horizon DNS for internal traffic resolution.
Obstacles: DNS changes depend on the TTL, which can lead to failover delays in the event of disruptions. The extended failover time undermines the concept of an active-active configuration, as traffic cannot be rerouted between regions fast enough to ensure seamless operation.
Use case: Ideal for workloads that require lightweight traffic distribution with regional autonomy.
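
The TTL obstacle described above can be estimated with simple arithmetic. The sketch below is a rough upper bound under stated assumptions: the parameter names and the example values (a 300-second TTL, a 10-second health check, three failed checks before the record is switched) are hypothetical, and resolvers that ignore TTLs or clients that cache connections are not modeled.

```python
def worst_case_failover_seconds(ttl: int, check_interval: int, failures_to_trip: int) -> int:
    """Upper bound on how long clients may keep hitting a failed region.

    detection time: check_interval * failures_to_trip
    cache expiry:   ttl (a client may have cached the record just before the switch)
    """
    detection = check_interval * failures_to_trip
    return detection + ttl


if __name__ == "__main__":
    # Illustrative values only: 30 s to detect the outage plus 300 s of cached DNS answers.
    print(worst_case_failover_seconds(ttl=300, check_interval=10, failures_to_trip=3))
```

Lowering the TTL shrinks this window but increases DNS query load, which is the trade-off behind the obstacle named above.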

3. Cross-regional internal L7 load balancing with regional GKE gateways

Active-Active Architecture: Cross-Regional Internal L7 Load Balancer with Regional GKE Gateways

Overview: A Layer 7 load balancer (application layer) manages data traffic across regions and forwards requests to regional GKE gateways.

Advantages: Provides advanced routing functions based on HTTP/HTTPS headers, paths or other metadata.
Obstacles: Increased complexity and potential latency when routing traffic between regions. In addition, the internal TLS certificate management in the customer environment was not compatible with L7 cross-region load balancers.
Use case: Best suited for microservice architectures with different routing requirements or complex logic at application level.

4. Cross-regional internal L4 load balancing with regional GKE gateways

Overview: Layer 4 load balancing (transport layer) processes the cross-regional data traffic and distributes it to regional GKE gateways.

Active-Active Architecture: Cross-Regional Internal L4 Load Balancer with Regional GKE Gateways

Advantages: Simplifies routing at network level with lower latency compared to L7 load balancing.
Obstacles: Limited application-oriented routing capabilities. The customer uses the RFC 6598 network segment (100.64.0.0/10) for their GCP environment, but at the time of writing GCP supported only RFC 1918 network segments (192.168.0.0/16, 10.0.0.0/8, 172.16.0.0/12).
Use case: Effective for performance-critical applications where simplicity and speed are paramount.
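
An address-range mismatch like the one above is easy to detect before a design is chosen. The following minimal sketch uses Python's standard `ipaddress` module to check whether a planned subnet falls inside the RFC 1918 ranges; the function name and the example subnets are illustrative and not part of the customer setup.

```python
import ipaddress

# The three private ranges defined by RFC 1918.
RFC_1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(cidr: str) -> bool:
    """True if the given IPv4 network lies entirely inside an RFC 1918 range."""
    network = ipaddress.ip_network(cidr)
    return any(network.subnet_of(block) for block in RFC_1918)


if __name__ == "__main__":
    # 100.64.0.0/10 is the RFC 6598 shared address space, not RFC 1918.
    for candidate in ("100.64.0.0/10", "10.128.0.0/20", "172.16.5.0/24"):
        print(candidate, is_rfc1918(candidate))
```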

5. Regional internal L4 load balancing with regional GKE gateways

Overview: Each region uses its own internal L4 load balancer to manage traffic forwarding between regions in an Active-Active configuration.

Active-Active Architecture: Regional Internal L4 Load Balancer with Regional GKE Gateways

Advantages: Low latency for regional communication while supporting cross-regional data traffic as part of the Active-Active architecture.
Disadvantages: Requires additional mechanisms for cross-regional failover and coordination.
Obstacles: None - this design has been successfully implemented and adapted to the constraints of the environment.
Use case: Ideal for scenarios that require seamless intra- and inter-regional communication within a compliant and regulated environment.
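
One possible shape for the additional failover mechanism mentioned above is a selection rule that prefers the local regional load balancer and falls back to another region only when the local one is unhealthy. This is a sketch under assumptions, not the mechanism used in the project: the `RegionalEndpoint` type, the addresses and the health flags are hypothetical, and in practice the health information would come from real health checks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RegionalEndpoint:
    region: str
    address: str   # hypothetical address of the regional internal L4 load balancer
    healthy: bool  # assumed to be supplied by an external health check


def pick_endpoint(endpoints: List[RegionalEndpoint], local_region: str) -> Optional[RegionalEndpoint]:
    """Prefer a healthy local endpoint, else the first healthy one elsewhere."""
    healthy = [endpoint for endpoint in endpoints if endpoint.healthy]
    for endpoint in healthy:
        if endpoint.region == local_region:
            return endpoint
    return healthy[0] if healthy else None  # None means every region is down


if __name__ == "__main__":
    endpoints = [
        RegionalEndpoint("europe-west3", "10.10.0.5", healthy=False),
        RegionalEndpoint("europe-west4", "10.20.0.5", healthy=True),
    ]
    # The local region is unhealthy, so traffic moves to the other region.
    print(pick_endpoint(endpoints, local_region="europe-west3"))
```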

Important considerations for setting up an Active-Active architecture

Implementing an Active-Active architecture in a regulated environment like banking requires careful planning and consideration of several critical factors. Here are the key considerations that influenced the design and implementation of the architecture:

1. Compliance and security

Regulatory requirements: Ensure that the architecture complies with industry regulations, including data residency, encryption standards and disaster recovery mandates.
Network security: Use firewalls and secure endpoints, and encrypt data both in transit (TLS) and at rest.
Access control: Implement role-based access control (RBAC) and enforce strict identity and access management (IAM) policies.

2. Latency and performance

Proximity-based traffic routing: Minimize latency by routing traffic to the closest region using DNS or load balancers.
Cross-regional communication: Optimize data synchronization between regions to balance latency and consistency.
Service response times: Monitor and optimize services continuously to meet performance SLAs.

3. Data consistency and synchronization

Synchronous vs. asynchronous replication: Select the appropriate replication strategy based on the criticality of the data and the latency tolerance.
Conflict resolution: Design mechanisms for conflict handling in asynchronous replication scenarios.
Data partitioning: Consider geographic segmentation of data to reduce synchronization effort.
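
As an illustration of conflict handling in asynchronous replication, the sketch below implements last-write-wins with a deterministic tie-break. This is only one simple strategy, named here as an example rather than as the approach used in the project: it silently discards the losing write and depends on reasonably synchronized clocks, so it is generally unsuitable for data such as account balances. The `VersionedValue` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedValue:
    value: str
    timestamp_ms: int  # write time as recorded by the originating region
    region: str        # region that accepted the write


def resolve_conflict(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Last-write-wins: keep the newer write.

    Ties are broken by region name so that every region picks the same winner.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


if __name__ == "__main__":
    west3 = VersionedValue("limit=500", 1_700_000_000_000, "europe-west3")
    west4 = VersionedValue("limit=750", 1_700_000_000_250, "europe-west4")
    # Both regions converge on the same value regardless of argument order.
    print(resolve_conflict(west3, west4) == resolve_conflict(west4, west3))
```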

4. Scalability and reliability

Load balancing: Use a combination of global and regional load balancers to distribute traffic efficiently.
Automatic scaling: Activate automatic scaling for GKE clusters and services to dynamically handle traffic peaks.
Failover mechanisms: Implement robust failover processes to maintain service availability in the event of regional outages.

5. Infrastructure restrictions

Service availability: Account for GCP services that may be unavailable or restricted in the banking environment.
Network configuration: Align the network segmentation with GCP-supported IP ranges (e.g. RFC 1918) and ensure compatibility with the architecture.
Resource quotas: Monitor and manage cloud resource quotas to avoid capacity-related disruptions.

6. Operational complexity

Deployment pipelines: Automate deployment pipelines to ensure consistent management of infrastructure across multiple regions.
Monitoring and observability: Use centralized monitoring tools to gain insight into traffic patterns, service health and anomalies.
Configuration management: Ensure consistent configuration across all regions using tools such as Terraform or Config Connector.
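
Configuration consistency across regions can also be verified after deployment. The sketch below compares per-region settings and reports every key whose value differs; it is a tool-agnostic illustration rather than a Terraform or Config Connector feature, and the setting names and values are hypothetical.

```python
from typing import Any, Dict


def find_drift(configs: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Return the settings whose values differ between regions.

    The result maps each drifting key to {region: value}; a key that is
    missing in a region is reported with the value None.
    """
    all_keys = set()
    for settings in configs.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        values = {region: settings.get(key) for region, settings in configs.items()}
        # Compare via repr so that unhashable values such as lists also work.
        if len({repr(value) for value in values.values()}) > 1:
            drift[key] = values
    return drift


if __name__ == "__main__":
    regions = {
        "europe-west3": {"min_replicas": 3, "tls_min_version": "1.2"},
        "europe-west4": {"min_replicas": 2, "tls_min_version": "1.2"},
    }
    print(find_drift(regions))
```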

7. Cost optimization

Traffic distribution: Distribute data traffic intelligently to minimize the cost of cross-regional data transfer.
Right-sizing resources: Regularly review resource allocation to optimize cloud spend.
Billing transparency: Use the GCP billing tools to monitor and forecast the costs associated with the active-active setup.

By taking these considerations into account, the Active-Active architecture can meet the high standards of availability, performance and compliance required in the banking sector.

Final recommendations

Implementing an Active-Active architecture in a regulated and constrained environment such as the banking sector requires strategic decisions and a clear understanding of the constraints. Based on our experience, the final recommendations are as follows:

Evaluate the need for an active-active configuration: Not all applications require a 99.99 % availability target, and an active-active configuration is not always necessary. Many applications can function effectively with slightly lower availability targets, and a single-region configuration with robust failover and disaster recovery mechanisms can provide sufficient resilience.
- For example, a local company canteen web application that employees use during business hours to view menus or order lunch does not need to be available 99.99 % of the time. In such cases, deployment in a single region with uptime aligned with business hours is both practical and cost-effective. This approach allows critical resources to be focused on systems that truly require high availability.
- When deciding on an architecture, consider the criticality of the application, user expectations and cost implications. For non-critical workloads, the added complexity and expense of an active-active configuration may not be justified, so resources can be focused on other priorities. Always align the architecture with the specific availability requirements and business objectives of the application.
Understand the environment early on: Perform a detailed assessment of compliance requirements, network constraints and service availability to identify potential roadblocks before designing the architecture.
Prioritize simplicity where possible: While multi- and cross-regional setups are critical, an excess of technology can lead to unnecessary complexity. Opt for simpler solutions if they meet performance and compliance requirements. This approach is in line with the concept of satisficing, which Herbert A. Simon introduced in Models of Man: Social and Rational and which proposes finding a solution that meets the criteria for adequacy rather than striving for an optimal but overly complex one.
Benefit from regional independence: Minimize cross-regional dependencies where latency, performance or compliance are paramount, while keeping enough cross-regional coordination to ensure failover, consistency and the benefits of an active-active configuration.
Plan for scalability and reliability: Plan for traffic spikes and unexpected outages by using automatic scaling and robust failover mechanisms. Ensure that monitoring and observability are at the heart of the architecture.
Optimize costs carefully: Balance performance requirements and cost efficiency by right-sizing resources and minimizing cross-region data transfers. Regularly review and optimize cloud spend.
Use a hybrid approach if required: Combine multiple designs to meet different requirements, e.g. regional load balancing for local workloads and cross-regional solutions for critical global services.

The implementation of an active-active architecture enables banks to react effectively to changes in the market and minimize risks at the same time.

Conclusion

Developing an Active-Active architecture in Google Cloud for the banking sector is both challenging and rewarding. While the constrained environment presented significant limitations, it also resulted in innovative solutions tailored to the client's individual needs. The successful implementation of the fifth design using regional internal L4 load balancers demonstrates the value of a thoughtful and adaptable approach.

By taking into account important aspects such as compliance, performance and scalability, this architecture provides the resilience and high availability required in the banking sector. The insights and experiences shared here can serve as a guide for other organizations facing similar challenges and enable them to build robust and future-proof systems.

To remain competitive and relevant, banks must accelerate their transformation and continuously evolve their architectures to integrate new technologies and meet ever-evolving regulatory requirements. The time to act is now.
