Physical Cluster Replication: Your Disaster Recovery Plan

If you own an enterprise application, whether as an architect or an operator, regional survivability is at the top of your list of must-have database features.

Meanwhile, CockroachDB is purpose-built to handle regional failures with zero RPO and near-zero RTO by replicating data across nodes, availability zones, regions, or clouds while consistently serving traffic from each location. Sounds like the perfect match, right?

Maybe not quite perfect. If, like many large orgs, your application was built around a legacy failover strategy, your environment may only have two regions instead of three. Generally available in v24.1, physical cluster replication allows you to benefit from the full resilience of CockroachDB, even with limited regional or data center distribution.

What is physical cluster replication and how does it work?

Physical cluster replication, also known as PCR, is an asynchronous, byte-level, and consistent replication tool that keeps one CockroachDB cluster up to date with another.

Physical cluster replication is a flexible disaster recovery strategy that works by creating an exact replica of a primary cluster, called the standby cluster. The primary cluster continuously replicates to the standby cluster, and users can monitor replication in the DB Console or through metrics tools such as Grafana. If your primary cluster fails, you can cut over to the standby cluster. Once you cut over in a disaster recovery scenario, replication stops, and the standby cluster is left in a transactionally consistent state.
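
To make the cutover behavior concrete, here is a minimal toy model in Python. It is a sketch under stated assumptions, not CockroachDB's implementation or API: ToyCluster, replicate, and cut_over are hypothetical names, and a cluster is reduced to an ordered log of committed transactions. It only illustrates that a standby which applies whole transactions up to a replicated timestamp ends up consistent as of that timestamp after cutover.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a cluster: an ordered log of (commit_ts, writes)."""
    log: list = field(default_factory=list)

    def state(self) -> dict:
        # Replay committed transactions in order to build the key/value view.
        view = {}
        for _, writes in self.log:
            view.update(writes)
        return view


def replicate(primary: ToyCluster, standby: ToyCluster, up_to_ts: int) -> None:
    """Apply whole transactions the standby lacks, up to and including up_to_ts."""
    for commit_ts, writes in primary.log[len(standby.log):]:
        if commit_ts > up_to_ts:
            break  # the log is ordered, so everything after this is newer
        standby.log.append((commit_ts, dict(writes)))


def cut_over(standby: ToyCluster) -> int:
    """Return the timestamp the standby is consistent as of once replication stops."""
    return standby.log[-1][0] if standby.log else 0


primary, standby = ToyCluster(), ToyCluster()
primary.log += [
    (1, {"alice": 100, "bob": 100}),
    (2, {"alice": 90, "bob": 110}),   # one transfer, committed atomically
    (3, {"alice": 80, "carol": 10}),  # committed after the last replication pass
]

replicate(primary, standby, up_to_ts=2)  # asynchronous: the standby lags the primary
consistent_as_of = cut_over(standby)     # primary is lost, promote the standby

print(consistent_as_of, standby.state())
```

In this toy run the standby never sees half of a transfer: it holds every transaction up to timestamp 2 and nothing from timestamp 3, which is the asynchronous gap that the replication lag represents.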

Physical cluster replication runs effectively on various hardware configurations, and is extensible to multiple clouds and standby clusters.

When would you use physical cluster replication?

There are several scenarios that benefit from physical cluster replication.

First, let’s say your application is built around two data centers, with no access to the cloud or to a third data center. Physical cluster replication lets you unlock regional survivability. (In an ideal world, however, you would have access to a third data center, because then you could survive different failure modes with CockroachDB’s native Raft replication.)

Here’s a second scenario: even customers already running CockroachDB in three or more regions may still want to use physical cluster replication for a defense-in-depth approach — avoiding the potential for human errors that can accidentally take down an entire cluster (believe me, it happens). After recovery, PCR also helps avoid conflicts in data since the replication results in a transactionally consistent state.

Additionally, physical cluster replication provides an avenue to reduce load on the primary cluster. For example, if your primary cluster is managing thousands of user transactions per second, but you also want to analyze your data, you can use the standby cluster to manage a subset of the total workload.

Physical cluster replication vs traditional backup and restore

In a traditional backup and restore model, you back up everything and then restore into a clean environment. This takes a while: hours, or even days, depending on how large your dataset is. With a backup, you would also still be missing any data written since the last backup. Neither of these conditions is ideal for critical applications.

With physical cluster replication, on the other hand, all that data has already been steadily replicating to the standby cluster. At the highest level, the physical cluster replication process involves creating two clusters, starting replication, handling failover and cutover, and potentially backfilling missing data.

As a result, physical cluster replication offers significantly lower RPO and RTO compared to traditional backup and restore methods.
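
The comparison can be sketched as back-of-the-envelope arithmetic. The Python below is illustrative only: the function names are hypothetical, and every number (backup interval, dataset size, restore throughput, replication lag, cutover time) is an assumed input chosen for the example, not a measurement of CockroachDB.

```python
def backup_restore_estimate(backup_interval_s: float,
                            dataset_gb: float,
                            restore_gb_per_s: float) -> tuple:
    """Worst-case RPO and rough RTO for backup and restore, in seconds."""
    rpo_s = backup_interval_s              # failure just before the next backup
    rto_s = dataset_gb / restore_gb_per_s  # time to restore the full dataset
    return rpo_s, rto_s


def replication_estimate(replication_lag_s: float, cutover_s: float) -> tuple:
    """RPO and RTO for a standby that is already replicating, in seconds."""
    return replication_lag_s, cutover_s


# Assumed inputs: hourly backups, a 5 TB dataset, 0.5 GB/s restore throughput.
backup_rpo, backup_rto = backup_restore_estimate(3600, 5000, 0.5)

# Assumed inputs: 30 s of replication lag and a 2-minute cutover.
pcr_rpo, pcr_rto = replication_estimate(30, 120)

print(f"backup/restore: RPO {backup_rpo:.0f}s, RTO {backup_rto:.0f}s")
print(f"replication:    RPO {pcr_rpo:.0f}s, RTO {pcr_rto:.0f}s")
```

The point of the sketch is the shape of the formulas: backup and restore RTO grows with dataset size, while a replicating standby's RTO depends on the cutover step rather than on how much data there is.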

Why is physical cluster replication important?

It comes down to taking a comprehensive disaster recovery approach.

Physical cluster replication is important because, until the advent of distributed SQL for cloud native applications, the common architecture pattern for resilience was to run on two cloud regions (or two physical data centers, or one cloud region and one physical data center). If one of those regions becomes unavailable, parts of your application, or even the entire thing, will be unavailable to users until backup and restore is complete. This traditional way to architect for resilience is a disaster recovery approach.

Now that physical cluster replication with fast cutback is available in CockroachDB 24.1, users can switch back quickly from the standby cluster to the primary cluster once the primary is available again. Previously, cutting back to the primary could be quite time-intensive, because the primary cluster, potentially holding terabytes of data, had to be completely wiped and then backfilled. Architects and operators who design and manage applications in enterprise environments with limited regional distribution can continue to ensure high resilience with CockroachDB. After all, CockroachDB is built to survive regional outages in an automated and self-healing manner.

Watch physical cluster replication in action

You can see physical cluster replication in action for yourself in this video demo, in which Principal Engineer and CockroachDB Technical Evangelist Rob Reid puts it through its paces. Watch to see CockroachDB clusters self-heal while creating two clusters, starting replication, handling failover and cutover, and backfilling any data that has gone missing.