While not exactly envious of our current crop of interns (because, you know, the whole work-from-home thing), I’ll admit I find myself reminiscing about when I was one myself. I’m still surprised the engineering team let me anywhere near the stuff they did.
When I first interned four years ago, we had just declared a code yellow to focus our energy on stabilizing CRDB. Having joined the newly-formed distributed query execution1 team, but now with its focus directed elsewhere, what this meant for me was free rein to flesh out distributed hash and merge joins2, a few aggregation primitives (think SUM, COUNT, DISTINCT, etc.), and some sorting algorithms.
That was more than enough to rope me back in for a second internship. This time I brought my dear friend Bilal along, who also went on to intern twice. I even managed to sneak my brother (a strictly worse engineer) in as a two-time intern.
All of which is to say that I think internships here can be pretty great. CRDB is a cool system to work on, and we’re still at the point where we’re happy to let junior engineers take on work that elsewhere would only be accessible to someone more senior. This was true for me then, and I’d say the same applied for our most recent cohort.
We have hosted several interns over the years (and hired plenty of them into full-time roles), all working on projects deserving of full-length blog posts. Today, however, we’ll highlight two projects from our most recent batch and give a briefer treatment to the rest.
Read-based compaction heuristics
Aaditya Sondhi interned on our Storage team to work on Pebble, a storage engine based on log-structured merge trees3 (abbrev. LSMs). Aaditya worked on introducing read-based compactions to Pebble, but before diving into what that means, we’ll first need to understand what read-amplification and compactions are.
Compactions and read-amplification in LSMs
In LSMs, keys and values are stored as sorted strings in immutable blobs called SSTs (sorted string tables). SSTs are stacked across multiple levels (L1, L2, …), don’t overlap within a level, and when searching for a key that overlaps with multiple SSTs (necessarily across multiple levels), the one found at the higher level is considered authoritative. This brings us to read-amplification: the amount of physical work done (bytes read, number of disk seeks, blocks decompressed, etc.) per logical operation. When reading a key k from a two-level LSM, we may have to trawl through both levels if the key isn’t found in the first.
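To make read-amplification a bit more concrete, here’s a minimal sketch of a point lookup in a leveled LSM, assuming toy types rather than Pebble’s actual ones; the number of levels consulted before the key turns up is, roughly, the read-amplification of that lookup.

```go
// Package lsmsketch is an illustrative sketch, not Pebble's implementation.
package lsmsketch

import "bytes"

type sst struct {
	smallest, largest []byte
	get               func(key []byte) ([]byte, bool) // point lookup within this SST
}

// lookup scans levels top-down; higher levels hold newer data and are
// authoritative, so we can stop at the first hit. Every level consulted
// adds to read-amplification: more blocks to read and decompress.
func lookup(levels [][]sst, key []byte) (val []byte, levelsConsulted int) {
	for _, level := range levels {
		for _, s := range level {
			if bytes.Compare(key, s.smallest) < 0 || bytes.Compare(key, s.largest) > 0 {
				continue // key can't be in this SST
			}
			levelsConsulted++
			if v, ok := s.get(key); ok {
				return v, levelsConsulted
			}
			break // SSTs within a level don't overlap: at most one candidate
		}
	}
	return nil, levelsConsulted
}
```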
That in turn brings us to compactions4. As data flows into higher level SSTs, LSMs maintain a “healthy” structure by compacting them into (fewer but larger) lower level SSTs. At one level (sorry) this lets LSMs reclaim storage (range deletion tombstones and newer revisions mask out older values), but it also helps bound the read IOPS required to sustain a fixed workload. Like all things, this is counter-balanced56 against the need to maintain sane {write,space}-amplification, which the rate of compactions directly plays into.
Figure 1. An SST compaction; the L1 SST overlaps with two L2 SSTs and is compacted into them.
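Here’s a toy sketch of the compaction step itself, under the simplifying assumption that we’re compacting into the bottommost level (the only place it’s safe to drop tombstones outright); the types and helper are illustrative, not Pebble’s.

```go
// Package compactsketch is a toy illustration of a compaction, not Pebble's code.
package compactsketch

import "sort"

type entry struct {
	key     string
	value   string
	deleted bool // a tombstone masking older versions of key
}

// compact merges `upper` (newer, authoritative) with `lower` (older),
// keeping only the newest visible version of each key. Tombstones and the
// values they mask drop out, which is how compactions reclaim space.
// (In a real LSM a tombstone can only be dropped if no older versions
// survive in levels further down.)
func compact(upper, lower []entry) []entry {
	newest := make(map[string]entry)
	for _, e := range lower {
		newest[e.key] = e
	}
	for _, e := range upper {
		newest[e.key] = e // newer versions overwrite older ones
	}
	out := make([]entry, 0, len(newest))
	for _, e := range newest {
		if e.deleted {
			continue
		}
		out = append(out, e)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].key < out[j].key })
	return out // what we'd write out as the new, larger lower-level SST
}
```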
(Aside: there’s something to be said about how storage engines are characterized in terms of resource utilization7 as opposed to unqualified “throughput” or “latency”. System-wide measures like $/tpmC are another example of the same. These feel comparatively easier to reason about, more useful for capacity planning, and easily verifiable.)
Optimizing compactions for read-amplification
Compacting LSMs based on reads isn’t a novel idea. It was originally implemented in google/leveldb, and later dropped in facebook/rocksdb. As for the Go re-implementation (golang/leveldb, incidentally the project Pebble was originally forked from), the heuristic hasn’t been ported over yet. Part of the motivation for using a purpose-built storage engine was to let us pull on threads exactly like this.
We hypothesized that by scheduling compactions for oft-read key ranges, we could lower read-amplification for subsequent reads, thus lowering resource utilization and improving read performance. In implementing it, we borrowed from the ideas present in google/leveldb. For every positioning operation that returned a user key (think Next, Prev, Seek, etc.), we sampled the key range (mediated by tunable knobs). The sampling process checked for overlapping SSTs across the various levels in the LSM. If an oft-read SST was found to overlap with ones from lower levels, it was scored higher to prioritize its compaction.
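A rough sketch of that sampling loop might look like the following; the names, the channel of candidates, and the allowedSeeks budget are all illustrative stand-ins, not Pebble’s actual implementation.

```go
// Package readsampling sketches read-triggered compaction scoring.
package readsampling

import (
	"bytes"
	"math/rand"
)

type sst struct {
	smallest, largest []byte
	allowedSeeks      int // hypothetical per-SST budget before it becomes a candidate
}

type lsm struct {
	levels     [][]*sst  // levels[0] is the highest (newest) level
	candidates chan *sst // SSTs scheduled for read-triggered compaction
}

// sampleRead is invoked (probabilistically, to bound overhead) after an
// iterator positioning op lands on userKey. Only SSTs whose reads also
// have to consult lower levels are scored: those reads are the ones
// paying for extra read-amplification.
func (l *lsm) sampleRead(userKey []byte, samplingRate int) {
	if rand.Intn(samplingRate) != 0 {
		return // sample only a fraction of reads
	}
	for level, ssts := range l.levels {
		for _, s := range ssts {
			if bytes.Compare(userKey, s.smallest) < 0 || bytes.Compare(userKey, s.largest) > 0 {
				continue
			}
			if l.overlapsBelow(level, s) {
				s.allowedSeeks--
				if s.allowedSeeks <= 0 {
					l.candidates <- s
				}
			}
			return // higher levels are authoritative; stop at the first match
		}
	}
}

// overlapsBelow reports whether s overlaps any SST in a lower level.
func (l *lsm) overlapsBelow(level int, s *sst) bool {
	for _, ssts := range l.levels[level+1:] {
		for _, other := range ssts {
			if bytes.Compare(s.smallest, other.largest) <= 0 &&
				bytes.Compare(other.smallest, s.largest) <= 0 {
				return true
			}
		}
	}
	return false
}
```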
$ benchstat baseline-1024.txt read-compac.txt
name                  old ops/sec   new ops/sec   delta
ycsb/C/values=1024    605k ± 8%     1415k ± 5%    +133.93%  (p=0.008 n=5+5)

name                  old r-amp     new r-amp     delta
ycsb/C/values=1024    4.28 ± 1%     1.24 ± 0%     -71.00%   (p=0.016 n=5+4)

name                  old w-amp     new w-amp     delta
ycsb/C/values=1024    0.00          0.00          ~         (all equal)

$ benchstat baseline-64.txt read-compac.txt
name                  old ops/sec   new ops/sec   delta
ycsb/B/values=64      981k ±11%     1178k ± 2%    +20.14%   (p=0.016 n=5+4)

name                  old r-amp     new r-amp     delta
ycsb/B/values=64      4.18 ± 0%     3.53 ± 1%     -15.61%   (p=0.008 n=5+5)

name                  old w-amp     new w-amp     delta
ycsb/B/values=64      4.29 ± 1%     14.86 ± 3%    +246.80%  (p=0.008 n=5+5)
Figure 2. Benchmarks showing the effect of read-based compactions on throughput, read-amplification and write-amplification.
As expected, we found that read-based compactions led to significant improvements in read-heavy workloads. Our benchmarks running YCSB-C (100% reads) with 1KB values saw read-amplification reduced by ~71% and throughput increased by ~133%. With YCSB-B (95% reads) using small values (64 bytes), we reduced read-amplification by ~15%, which led to a throughput increase of ~20%. These benchmarks targeted Pebble directly, and there’s still a bit of legwork to be done around parameter tuning (we’re necessarily trading off some write-amplification in this process), but the results are encouraging.
Query denylists (and our RFC process)
Angela Wen interned on our SQL Experience team, which owns the frontier where SQL clients meet the database. During her internship, Angela worked on introducing a mechanism to gate certain classes of queries from being run against the database. This was motivated by our Cloud SREs, who run large CRDB installations and want the ability to deny queries(-of-death8) when emergent situations call for it (think “circuit breakers”9).
Angela’s experience captures the kind of broad leeway we accord to interns, which I’d argue we do a bit better than elsewhere. A general-purpose query denylist is a very open-ended problem, with many personas you could design it for, and one that took deliberate effort to build consensus on. The process we use to structure these conversations is the RFC, and we ended up authoring one here as well.
The RFC and the ensuing discussions clarified who the intended users were and what the “must-haves”/“nice-to-haves” were, catalogued the various classes of deniable queries, and, most importantly, outlined the actual mechanics of the denial itself. For all my gripes with RFCs, I find the process of actually writing one edifying. It can foster real agency over a component’s design, and it works decently well as a pedagogical tool (also, I imagine it’s cool to have public design documents to share with friends similarly into query denylists).
We ended up eschewing our original proposal to implement file-mounted regex-based denylists (the contentions here being around usability, deployment, etc.) in favor of cluster settings of the form:
SET CLUSTER SETTING feature.changefeed.enabled = FALSE;
SET CLUSTER SETTING feature.schema_change.enabled = TRUE;
Configuration changes are disseminated cluster-wide by means of gossip10. Individual nodes listen in on these updates and use the deltas to keep an in-memory block-cache (sorry) up-to-date, which is then consulted during query execution to determine whether or not an operation is allowed.
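To give a sense of the shape of this (the types and method names below are assumptions for illustration, not CRDB’s actual internals), the node-local cache ends up looking something like a gossip-fed map consulted at execution time:

```go
// Package featuregate sketches a cluster-setting-backed feature gate.
package featuregate

import (
	"fmt"
	"sync"
)

// Gates holds a node's local view of feature.*.enabled settings.
type Gates struct {
	mu      sync.RWMutex
	enabled map[string]bool // e.g. "feature.changefeed.enabled"
}

// Apply is invoked with deltas received via gossip whenever a
// SET CLUSTER SETTING changes one of the gates.
func (g *Gates) Apply(setting string, value bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.enabled == nil {
		g.enabled = map[string]bool{}
	}
	g.enabled[setting] = value
}

// CheckEnabled is consulted during query planning/execution before
// running a gated class of statement.
func (g *Gates) CheckEnabled(setting string) error {
	g.mu.RLock()
	defer g.mu.RUnlock()
	if enabled, ok := g.enabled[setting]; ok && !enabled {
		return fmt.Errorf("operation disabled by cluster setting %s", setting)
	}
	return nil
}
```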
As mentioned earlier, we scrapped several alternate designs during this process, and were better off for it. We narrowed our scope to focus on certain classes of queries as opposed to more granularly matching specific ones. This came after observing that the vast majority of problematic queries during prior incidents were well understood, and could be structurally grouped and gated wholesale. That said, we modularized our work to make it simple to introduce new categories as needed.
Observability, design tokens, data-loss repair, and more!
We hosted a few other interns this semester, and there’s much to be said about their individual contributions. We typically structure our programs to have folks work on one or two “major” projects, building up to them with “starter” ones. Here we’ll briefly touch on what those were.
Query runtime statistics
Figure 3. The query execution plan for a full table scan followed by an AVG.
Cathy Wang interned on our SQL Execution team and worked on improving observability for running queries. We have some existing infrastructure in place to surface various execution statistics. Cathy built upon this to include details about network latencies (useful for debugging queries run within geo-distributed clusters), structured our traces to break down how much time is spent across various layers of the system, and tacked on memory utilization to our traces to surface exactly how much memory is in use at any point mid-execution. This last bit is worth elaborating on: Go’s garbage collector doesn’t give us fine-grained control over allocations, and as a result we’ve had to design our own memory accounting/monitoring infrastructure to closely track usage during a query’s lifecycle. By exposing these internal statistics, we expect developers to better understand the memory footprint of individual queries and to tune them accordingly.
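For a rough idea of what such accounting looks like (the Account type and its methods below are illustrative assumptions, not CockroachDB’s actual APIs), operators grow and shrink a per-query account whose live usage is what we’d surface in traces:

```go
// Package memaccount sketches per-query logical memory accounting.
package memaccount

import (
	"fmt"
	"sync/atomic"
)

// Account tracks logical memory usage for a single query.
type Account struct {
	used  int64
	limit int64
}

func NewAccount(limit int64) *Account { return &Account{limit: limit} }

// Grow is called by operators before they allocate; it errors out if the
// query would exceed its budget, rather than letting the node fall over.
func (a *Account) Grow(n int64) error {
	if newUsed := atomic.AddInt64(&a.used, n); newUsed > a.limit {
		atomic.AddInt64(&a.used, -n) // roll back the reservation
		return fmt.Errorf("memory budget exceeded: %d of %d bytes", newUsed, a.limit)
	}
	return nil
}

// Shrink is called when operators release memory.
func (a *Account) Shrink(n int64) { atomic.AddInt64(&a.used, -n) }

// Used is what a trace would report as live usage mid-execution.
func (a *Account) Used() int64 { return atomic.LoadInt64(&a.used) }
```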
Design tokens
Pooja Maniar interned on the Cloud side of things, specifically on our Console team. One of the projects she worked on was consolidating and standardizing our “design tokens”. Think of these as abstractions over visual properties: variables to replace hardcoded color palettes, fonts, box shadows on pressed buttons, etc. The motivation here was to limit the number of design decisions developers had to make, whether it be choosing between specific hexcodes, UI components, etc. We wanted to hoist these guidelines into a centralized, shared repo and then integrate that repo into our various console pages (accessible both through the database itself and our cloud offering). We were also partway through a brand refresh at the time, and Pooja’s grand unification helped ensure brand consistency throughout.
Quorum recovery
Sam Huang interned on the KV team (they let me mentor this fellow), and one of the projects we worked on was introducing a quorum recovery mechanism within CRDB. Because CRDB is built atop raft-replicated key-ranges, when a cluster permanently loses quorum for a given set of keys (think persistent node/disk failures), it’s unable to recover from it. This necessarily entails data loss, but we still want the ability to paper over such keys and provide tooling for manual repair. Sam worked on introducing an out-of-band mechanism to “reset” the quorum for a given key-range, and somewhat cleanly, we were able to leverage existing Raft machinery to do so. This came from the observation that if we were to construct a synthetic snapshot (seeded using data from extant replicas, if any) and configure it to specify a new set of participants, we would essentially trick the underlying replication sub-system into recovering quorum for the key-range. Our synthetic snapshot incremented the relevant counters to “come after” the existing data, which in turn also purged older replicas from the system.
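Conceptually (and only conceptually; none of the types below are CRDB’s, and the real mechanism goes through actual Raft machinery), the synthetic-snapshot trick amounts to something like this:

```go
// Package quorumrecovery sketches the shape of the quorum-recovery idea.
package quorumrecovery

type ReplicaID int

type Snapshot struct {
	RangeID      int64
	Data         []byte      // seeded from a surviving replica, if any
	Participants []ReplicaID // the new, reachable replica set
	Index        uint64      // replication counters the snapshot claims to cover
	Term         uint64
}

// makeRecoverySnapshot constructs a snapshot that "comes after" whatever
// the survivors have, so applying it both installs the new participant
// set and causes stale replicas to be purged by the replication layer.
func makeRecoverySnapshot(rangeID int64, survivorData []byte,
	survivorIndex, survivorTerm uint64, newMembers []ReplicaID) Snapshot {
	return Snapshot{
		RangeID:      rangeID,
		Data:         survivorData,
		Participants: newMembers,
		Index:        survivorIndex + 1, // bumped so it supersedes existing state
		Term:         survivorTerm + 1,
	}
}
```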
Metamorphic schema changes
Jayant Shrivastava interned on our SQL Schemas team and spent his time here ruggedizing our schemas infrastructure. CRDB makes use of several advanced testing strategies to ensure correctness and stability, including fuzzers, metamorphic and chaos testing, Jepsen11, and much more. Having observed some latent fragility in this area recently, Jayant fleshed out an equivalent test harness focused instead on schema changes. We constructed a workload generator to execute randomly generated DDL statements within the confines of individual transactions. These statements generate and drop tables on the fly, do the same for columns with randomized types, and are executed concurrently with statements issued against those very tables/columns. We leveraged metamorphic methods here by asserting against the invariants of the system rather than specific outputs (things like “transactions that have read from a certain column should expect to always find it in subsequent reads”). Put together, we were able to cover a large space of possible interleavings and uncovered several critical bugs in the process.
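A stripped-down sketch of that metamorphic style might look like the following, with made-up table/column names and a single invariant check; the real harness is considerably more involved:

```go
// Package schemaworkload sketches a metamorphic schema-change workload.
package schemaworkload

import (
	"database/sql"
	"fmt"
	"math/rand"
)

var colTypes = []string{"INT", "STRING", "BOOL", "DECIMAL"}

// randomDDL picks one of a few statement shapes; the table/column names
// are made up for the sketch.
func randomDDL(rng *rand.Rand) string {
	col := fmt.Sprintf("col_%d", rng.Intn(100))
	switch rng.Intn(3) {
	case 0:
		return fmt.Sprintf("ALTER TABLE t ADD COLUMN IF NOT EXISTS %s %s",
			col, colTypes[rng.Intn(len(colTypes))])
	case 1:
		return fmt.Sprintf("ALTER TABLE t DROP COLUMN IF EXISTS %s", col)
	default:
		return fmt.Sprintf("CREATE TABLE IF NOT EXISTS t_%d (k INT PRIMARY KEY)", rng.Intn(100))
	}
}

// checkColumnStability asserts one invariant of the system rather than a
// specific output: a column observed earlier in a transaction should
// remain visible for later reads in that same transaction, regardless of
// concurrent DDL.
func checkColumnStability(tx *sql.Tx, column string) error {
	present, err := columnExists(tx, column)
	if err != nil || !present {
		return err // column never observed; nothing to assert
	}
	// ...the workload would run other statements on this txn here...
	stillPresent, err := columnExists(tx, column)
	if err != nil {
		return err
	}
	if !stillPresent {
		return fmt.Errorf("invariant violated: column %q observed earlier in txn is now gone", column)
	}
	return nil
}

func columnExists(tx *sql.Tx, column string) (bool, error) {
	var count int
	err := tx.QueryRow(
		`SELECT count(*) FROM information_schema.columns
		 WHERE table_name = 't' AND column_name = $1`, column).Scan(&count)
	return count > 0, err
}
```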
Import compatibility
Monica Xu took a brief hiatus from her aspiring music career to intern on our Bulk IO team. Her team is broadly responsible for getting data in and out of CRDB as fast as possible (specifically import/export and backup/restore). Monica made several contributions in this area, including enabling progress tracking for dump files, supporting dry-run imports, and improving pg_dump12 compatibility. There were kinks to be worked out with the latter, seeing as CRDB only supports a subset of Postgres syntax, which can be problematic when processing pg_dump files as is. The particular set of questions Monica helped address was what “reasonable behavior” looks like when chewing through potentially destructive import directives. Think DROP TABLE [IF EXISTS], or CREATE VIEW, which is particularly tricky given it stores the results of the query it was constructed from, results subject to change during the import process. Monica engaged with our product teams when forming these judgements (we now simply defer to the user with instructive messaging), and helped significantly ease the onboarding experience for developers migrating off of their existing installations.
Parting thoughts
If you’re still here and interested, check out our careers page. And don’t let the database-speak throw you off; most of us didn’t know any of it coming in.
Radu Berinde, Andrei Matei. 2016. Distributing SQL Queries in CockroachDB. ↩︎
Raphael Poss. 2017. On the Way to Better SQL Joins in CockroachDB. ↩︎
Arjun Narayan. 2018. A Brief History of Log Structured Merge Trees. ↩︎
Siying Dong. [n.d.]. Leveled Compactions in RocksDB. ↩︎
Mark Callaghan. 2018. Read, Write & Space Amplification – Pick Two. ↩︎
Mark Callaghan. 2018. Name that Compaction Algorithm. ↩︎
Nelson Elhage. 2020. Performance as Hardware Utilization. ↩︎
Mike Ulrich. 2017. Site Reliability Engineering: Addressing Cascading Failures. ↩︎
Martin Fowler. 2014. Circuit Breakers. ↩︎
Abhinandan Das, Indranil Gupta, et al. 2002. SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol. ↩︎
Kyle Kingsbury. 2016. Jepsen Testing CockroachDB. ↩︎