Ideal Isn't Real, But Improvement Is: CockroachDB's Resilience Enhanced in 25.2

Real-world database resilience is not about surviving under ideal conditions – it's about thriving when adversity strikes.

In our earlier exploration, based on the Performance under Adversity benchmark run on 25.1, we illustrated CockroachDB’s robust performance under diverse stress scenarios, from internal operational tasks to outright regional failures. Now, with the latest results from 25.2, we've taken CockroachDB's resilience a significant step further.

What’s new in 25.2?

The improvements in the latest benchmark are substantial. Across the board, SQL latencies improved by roughly 10x and workload latencies by about 2x compared to 25.1. CockroachDB 25.2 also delivers smoother, more consistent throughput, without the spikes in transactions per minute (tpmC) and queries per second (QPS) observed in 25.1:

_{SQL QPS and SQL Latency through all the phases for 15-node cluster in 25.2 vs 25.1.}

These improvements are the result of over 100 changes, both large and small, including new features like generic query plans and "buffered writes" introduced in 25.2.

Buffered writes are a significant advancement in CockroachDB's transaction flow. The feature temporarily holds a transaction's writes on the client side (the gateway) until the transaction commits. Bundling writes this way increases throughput and decreases latency while also reducing hardware requirements. Deferring writes until commit time also eliminates redundant writes, which improves throughput further, especially in write-heavy workloads.
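To make the idea concrete, here is a minimal toy model of write buffering in Python. It is only a sketch of the concept described above, not CockroachDB's implementation: the `BufferedTxn` class, its methods, and the plain dict standing in for the storage layer are all hypothetical names chosen for illustration.

```python
class BufferedTxn:
    """Toy transaction that holds writes in memory until commit (illustrative only)."""

    def __init__(self, store):
        self.store = store        # plain dict standing in for the storage layer
        self.buffer = {}          # pending writes, keyed by row key
        self.writes_issued = 0    # how many writes the client asked for

    def put(self, key, value):
        self.writes_issued += 1
        # A later write to the same key replaces the earlier buffered one,
        # so only the final value ever needs to be flushed.
        self.buffer[key] = value

    def get(self, key):
        # The transaction sees its own pending writes first.
        if key in self.buffer:
            return self.buffer[key]
        return self.store.get(key)

    def commit(self):
        # Flush everything in one step and report how many writes were sent.
        flushed = len(self.buffer)
        self.store.update(self.buffer)
        self.buffer.clear()
        return flushed


store = {}
txn = BufferedTxn(store)
txn.put("stock:42", 10)
txn.put("stock:42", 9)      # redundant with the previous write to the same key
txn.put("order:7", "new")
visible_before_commit = dict(store)   # nothing has reached the store yet
flushed = txn.commit()
print(txn.writes_issued, "writes issued,", flushed, "flushed at commit")
```

In this toy run the client issues three writes but only two reach the store, because the first write to `stock:42` is superseded before commit. That is the "redundant writes" saving in miniature.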

Enhanced baseline performance

Baseline testing showed that CockroachDB continues to deliver exceptional performance under normal "sunny day" conditions:

Multi-region (15-node): Achieved 88.1K tpmC with SQL QPS around 24.1K and SQL latency dramatically reduced from ~20ms in 25.1 to ~2.24ms in 25.2.
Single-region (9-node): Posted 63.1K tpmC, with SQL QPS at ~17.2K, maintaining CPU utilization around 50%. SQL latency improved from ~3ms in 25.1 to ~1.32ms in 25.2, ensuring consistent and smooth performance.
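The speedup factors implied by these baseline latencies can be worked out directly. The short Python sketch below uses only the approximate figures quoted above; the variable and function names are illustrative.

```python
# Approximate baseline SQL latencies in milliseconds, as quoted above: (25.1, 25.2).
baseline_latency_ms = {
    "multi-region (15-node)": (20.0, 2.24),
    "single-region (9-node)": (3.0, 1.32),
}


def speedup(before_ms, after_ms):
    """How many times lower the new latency is than the old one."""
    return before_ms / after_ms


speedups = {
    name: round(speedup(before, after), 2)
    for name, (before, after) in baseline_latency_ms.items()
}

for name, factor in speedups.items():
    print(f"{name}: ~{factor}x lower SQL latency")
```

Because the inputs are rounded, approximate values, the resulting factors should be read as rough estimates rather than precise measurements.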

Quicker and more stable internal operations

_{More stable SQL QPS and SQL Latency through internal stressors for 15-node cluster in 25.2 vs 25.1.}

Routine maintenance and internal operations, like rolling upgrades, backups, and index creation, were notably faster and less impactful:

Rolling upgrades completed faster (144 minutes vs. 165 minutes in 25.1), with minimal latency spikes.
Backup duration dropped significantly (15 minutes vs. 20 minutes in 25.1), and latency spikes were both shorter and lower.
Index creation showed minimal to negligible latency increases, maintaining stable throughput.
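For a sense of scale, the duration figures above can be converted to percentage reductions. This is a small arithmetic sketch using only the numbers quoted in the list; the names are illustrative.

```python
# Durations in minutes, as quoted above: (25.1, 25.2).
durations_min = {
    "rolling upgrade": (165, 144),
    "backup": (20, 15),
}


def percent_reduction(before, after):
    """Percentage by which the duration shrank relative to the old value."""
    return (before - after) / before * 100


reductions = {
    name: round(percent_reduction(before, after), 1)
    for name, (before, after) in durations_min.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% shorter in 25.2")
```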

External stressors met with greater ease

Disk stalls, network failures, and node restarts are real-world inevitabilities. CockroachDB 25.2 handles these conditions even better, as the charts below show. They zoom in on the phases from disk stalls through node outages.

_{Consistent SQL QPS and SQL Latency through disk stalls, network partitions and node restarts for 15-node cluster in 25.2 vs 25.1.}

Disk Stalls: Latency impacts were minimal, and throughput remained entirely unaffected, showcasing CockroachDB's enhanced storage handling capabilities.
Network Partitions: Even during full partitions, throughput remained unaffected, with shorter and less severe latency spikes.
Node Restarts: Recovery times improved, taking only ~30 seconds per node restart, with negligible latency spikes.

As with internal stressors, CockroachDB's throughput remains smooth and consistent in 25.2, in contrast to the small wavy variations in 25.1.

CockroachDB 25.2 survives network partitions using leader leases. In this architecture, every Raft leader also holds the range’s lease, ensuring a single, unequivocal authority for both reads and writes. By unifying leadership and leaseholding in a single role, it eliminates single points of failure and split-brain risks. So even when node communication is severed, the surviving leader continues serving traffic seamlessly, preserving consistency and availability while minimizing the impact on throughput.
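The core invariant, that one replica is both Raft leader and leaseholder and that only a majority side of a partition can keep serving, can be sketched with a toy model. The Python below is a simplified illustration and not CockroachDB's code: the function name and node names are hypothetical, and the rule for picking a replacement leader (first replica in sorted order) is a stand-in for a real Raft election.

```python
def after_partition(groups, leader):
    """Toy model: decide who serves a range after its replicas are split into groups.

    groups: list of sets of replica names, one set per side of the partition.
    leader: the replica that was Raft leader (and leaseholder) before the split.
    Returns a dict with the leader and leaseholder, or None if no side has a majority.
    """
    total = sum(len(group) for group in groups)
    # Raft needs a strict majority of replicas to make progress.
    majority = next((g for g in groups if len(g) > total // 2), None)
    if majority is None:
        return None
    # Keep the old leader if it sits on the majority side; otherwise stand in
    # for an election by picking a replica from that side (assumption).
    new_leader = leader if leader in majority else sorted(majority)[0]
    # Leader and leaseholder are always the same replica in this model.
    return {"leader": new_leader, "leaseholder": new_leader}


replicas_split = [{"n1", "n2"}, {"n3", "n4", "n5"}]
leader_on_minority = after_partition(replicas_split, leader="n1")
leader_on_majority = after_partition(replicas_split, leader="n4")
no_majority = after_partition([{"n1", "n2"}, {"n3", "n4"}, {"n5"}], leader="n1")
print(leader_on_minority, leader_on_majority, no_majority)
```

In every outcome the model returns at most one replica, and it is always both leader and leaseholder, so there is never a moment with two competing authorities for the range.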

Resilience against major outages

_{SQL QPS and SQL Latency through Zone outages and Regional Outages for 15-node cluster in 25.2 vs 25.1.}

Most importantly, CockroachDB's ability to withstand significant zone and regional outages improved substantially:

Zone Outages: SQL and workload throughput remained unaffected, with latency spikes both shorter in duration and significantly lower.
Regional Outages: Maintained full stability and availability, again with no impact on throughput and minimal latency disruption.

Resilience: Always essential, now enhanced

The Performance under Adversity benchmark on 25.2 further demonstrates CockroachDB's resilience in maintaining stable, reliable performance even as real-world adversity intensifies. Performance under ideal conditions remains vital, but unwavering stability during the unexpected is what qualifies a database as fit for modern enterprise demands.

Explore the interactive dashboard for the complete phase-by-phase metrics, and test CockroachDB’s enhanced resilience yourself by spinning up a free cluster today.

Try CockroachDB Today

Spin up your first CockroachDB Cloud cluster in minutes. Start with $400 in free credits. Or get a free 30-day trial of CockroachDB Enterprise on self-hosted environments.

Dipti Joshi is a Staff Product Manager at Cockroach Labs.