What’s the best way to do a database migration?
It’s a challenging question. No single approach is going to be the best fit for every use case. However, after speaking with LaunchDarkly Senior Distributed Systems Engineer Justin Caballero in a recent webinar, we think LaunchDarkly’s migration from Postgres and MongoDB to CockroachDB presents some great database migration best practices.
Let’s take a closer look at LaunchDarkly and their approach to database migration.
What is LaunchDarkly?
LaunchDarkly is a feature management platform that makes it easier for dev and ops teams to control the entire feature lifecycle using feature flags.
At the most basic level, feature flags are bits of code that enable developers to turn product features on and off with small changes to configuration files. This makes it possible to easily manage features without the need for constant code redeployments, and enables experimentation through “dark launching” (that’s where the name LaunchDarkly comes from) of features to a subset of users.
LaunchDarkly’s platform goes far beyond simply enabling feature on/off flags. It provides users with an end-to-end feature management solution, including everything from an intuitive UI to advanced analytics to make feature experiments easy to control and understand.
Tip 1: Evaluate and test your options thoroughly
LaunchDarkly’s internal database needs were originally being served primarily by MongoDB, with some other workloads in Postgres. But with heavy transactional use cases, a growing client base, and the need to scale smoothly while guaranteeing consistency, they became concerned that MongoDB might not be up to the challenge, and started looking for alternatives.
Senior Distributed Systems Engineer Justin Caballero had been aware of the distributed SQL approach for a long time, having read Google’s original Spanner paper. But Spanner is only a solution for companies willing to be tied exclusively to GCP. So Caballero was intrigued when a coworker mentioned CockroachDB, a database that combines the smooth-scaling cloud-native distributed SQL of Spanner with advanced multi-region features and support for multi-cloud and hybrid cloud deployments.
“We wanted to be able to tell a zero RTO/RPO story,” Caballero says. “When we evaluated Cockroach against other things we were looking at for multi-region like AWS Aurora or similar products, we just felt that this [CockroachDB] was the best.”
Specifically, Caballero says, LaunchDarkly was looking for a database system that enabled them to tell customers “We could lose the East Coast and you wouldn’t notice.”
They also wanted a system that was designed first for cloud-native multi-region architecture. “It’s just very hard to manage” systems that were initially designed for single-region with multi-region support tacked-on later, Caballero says, so LaunchDarkly wanted something that had been built from the ground up with multi-region and multi-cloud in mind.
After testing it against SQL-based competitors and NoSQL databases MongoDB and Cassandra, CockroachDB emerged as the winner.
This is the first lesson to be learned from LaunchDarkly: take the time to evaluate what you need, and test your options thoroughly! Even the smoothest database migrations take time and money when you’re operating at scale – it’s not a process you want to go through again in a year because you realize you made the wrong decision.
Tip 2: Phased migration is a database migration best practice
As a company that enables developers to take a controlled, gradual approach to rolling out features, it’s no surprise that LaunchDarkly is taking a similarly gradual approach to their database migration.
And that’s critical, because even when you know you’ve picked the right database technology for your needs, the process of actually transferring your workloads onto it usually comes with at least a few surprises.
To kick off their migration process, Caballero says “we chose what we thought would be the easiest and least exposed [workload] if something went wrong.” They had some data in Postgres, so they started with that since Postgres and CockroachDB are both SQL-based and relational. The theory was that that would be easier than starting with moving their MongoDB workloads.
“It turned out that the workload that we chose to start with ended up being one of the more complicated ones,” Caballero says, “just because the SQL queries we had were using some interesting Postgres features, and they were very complicated queries that were doing a lot of heavy lifting.”
That difficulty was a bit of a surprise. But because LaunchDarkly had chosen to start small, it wasn’t a problem. “I think it was actually a good move,” Caballero says, “because it helped us really dig into Cockroach with a feature that’s not as critical as the core flagging information.”
LaunchDarkly’s philosophy for their database migration is ultimately the same philosophy they advocate for software development: “Small, reversible steps.”
Tip 3: Allow time for learning
Another advantage of the phased migration approach is that it gave LaunchDarkly time to learn how to work with a new system. Although CockroachDB uses SQL syntax that will be very familiar to any developer, switching from a legacy system to a distributed, multi-region database does require some adjustment.
This is because distributed systems store data and handle queries differently. “We’ve been learning about how to make our queries performant and what types of things to avoid when you’re dealing with a more distributed database than something like Postgres, where you’re dealing with a single master node […] it’s just a different set of concerns.”
One example was switching to using UUIDs as primary keys – a key best practice for distributed systems that facilitates the distribution of keys across the cluster. This was something that LaunchDarkly “got right, right out of the gate.”
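To see why this matters, here is a toy sketch (not CockroachDB's actual range-splitting logic) that divides an ordered keyspace into a hypothetical four ranges and counts where new writes land. The function names and `NUM_RANGES` are made up for illustration. In a real CockroachDB schema, the usual way to get this behavior is a column declared as `id UUID PRIMARY KEY DEFAULT gen_random_uuid()`.

```python
import random
import uuid
from collections import Counter

NUM_RANGES = 4  # hypothetical number of key ranges in the toy cluster


def range_for_sequential_key(key, max_key):
    """Map an integer key to a range by its position in the ordered keyspace."""
    return min(key * NUM_RANGES // max_key, NUM_RANGES - 1)


def range_for_uuid_key(key):
    """Map a UUID to a range by its position in the ordered 128-bit keyspace."""
    return (key.int * NUM_RANGES) >> 128


def sequential_write_distribution(existing_rows, new_rows):
    """Count which ranges receive writes when keys come from a monotonic sequence."""
    max_key = existing_rows + new_rows
    keys = range(existing_rows, max_key)
    return Counter(range_for_sequential_key(k, max_key) for k in keys)


def uuid_write_distribution(new_rows, seed=0):
    """Count which ranges receive writes when keys are random (version 4) UUIDs."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    keys = (uuid.UUID(int=rng.getrandbits(128), version=4) for _ in range(new_rows))
    return Counter(range_for_uuid_key(k) for k in keys)
```

With a million existing rows, every one of the next thousand sequential keys lands in the last range, which is the hot spot a distributed database wants to avoid. The same thousand writes keyed by random UUIDs are spread across all four ranges.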
Another example was an issue LaunchDarkly encountered with cascading deletes. On their original database, they had a workload with a single API call that could result in “tens of thousands of rows being deleted.” On an older version of CockroachDB, running that as one cascading delete produced errors because the resulting command was too large. Breaking the single command into smaller batches, as CockroachDB’s documentation and support team advised, made it run faster and kept each command small enough to be passed between nodes.
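The batching idea can be sketched as follows. This is a minimal illustration, not LaunchDarkly's code: it uses Python's built-in `sqlite3` as a stand-in database so it runs anywhere, and the `flag_events` table, its columns, and both function names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in database with a hypothetical flag_events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flag_events (id INTEGER PRIMARY KEY, project_id INTEGER)")
    rows = [(1,)] * 2500 + [(2,)] * 10  # 2,500 rows for project 1, 10 for project 2
    with conn:
        conn.executemany("INSERT INTO flag_events (project_id) VALUES (?)", rows)
    return conn


def delete_in_batches(conn, project_id, batch_size=1000):
    """Delete every row for project_id in small transactions; return the batch count."""
    batches = 0
    while True:
        with conn:  # each batch commits as its own short transaction
            cur = conn.execute(
                "DELETE FROM flag_events WHERE id IN ("
                "SELECT id FROM flag_events WHERE project_id = ? LIMIT ?)",
                (project_id, batch_size),
            )
        if cur.rowcount == 0:  # nothing left to delete
            return batches
        batches += 1
```

Each loop iteration removes at most `batch_size` rows, so no single statement grows with the size of the data being deleted. The trade-off is that the delete is no longer atomic: a reader may briefly see a partially deleted set, which is acceptable only if the application can tolerate it.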
In short: shifting to a new database system can require some shifts in thinking, particularly if you’re moving from a legacy, single-node system to a modern, multi-cloud, multi-region distributed system. It’s essential that you give your team the time and space to experiment and make those shifts – and that, in turn, is only possible if you’re taking a phased migration approach and starting with less critical workloads that allow you a bit of a margin for error.
Tip 4: Use feature flags!
It should come as no surprise that LaunchDarkly’s rollout of CockroachDB has also made use of their own feature flagging technology to enable them to move gradually and perform experiments.
“For each of our workloads, we have a repository interface, which essentially is ‘these are the functions we’re using to access the data.’ We provide two implementations of that, one with Cockroach and one with our previous store, and then we use feature flags to control which one is run, and which ones are connected to the end user,” Caballero says.
This enables them to roll out their new CockroachDB database gradually, starting with a small percentage of users. And because both databases are running in parallel, they monitor both for consistency. If they do discover an issue with the new database, feature flags enable them to quickly switch it off without the need to deploy any new code. Then they can fix the issue and restart the experiment at their own pace, with no negative impact on their user experience.
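A rough sketch of that pattern is shown below. It is an illustration under assumptions, not LaunchDarkly's implementation or SDK: the class names, the `get_dashboard` method, and the hash-based `in_rollout` helper (a stand-in for a real feature flag evaluation) are all hypothetical.

```python
import hashlib
from typing import Protocol


class DashboardRepository(Protocol):
    """The functions the application uses to access the data."""

    def get_dashboard(self, dashboard_id: str) -> dict: ...


class InMemoryRepository:
    """Stand-in for a real store, either the previous one or the new one."""

    def __init__(self, rows):
        self._rows = dict(rows)

    def get_dashboard(self, dashboard_id):
        return self._rows[dashboard_id]


def in_rollout(customer_key, percentage):
    """Bucket a customer into 0..99 deterministically (stand-in for a flag check)."""
    digest = hashlib.sha256(customer_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percentage


class MigratingRepository:
    """Query both stores, record differences, and serve one based on the rollout."""

    def __init__(self, legacy, candidate, rollout_percentage):
        self.legacy = legacy
        self.candidate = candidate
        self.rollout_percentage = rollout_percentage
        self.mismatches = 0
        self.candidate_errors = 0

    def get_dashboard(self, customer_key, dashboard_id):
        legacy_result = self.legacy.get_dashboard(dashboard_id)
        try:
            candidate_result = self.candidate.get_dashboard(dashboard_id)
        except Exception:
            self.candidate_errors += 1  # a candidate failure never reaches the user
            return legacy_result
        if candidate_result != legacy_result:
            self.mismatches += 1  # in production this would be a logged metric
        if in_rollout(customer_key, self.rollout_percentage):
            return candidate_result
        return legacy_result
```

Setting `rollout_percentage` to 0 keeps every customer on the previous store while the new one is still exercised and compared in the background. Raising it moves customers over gradually, and dropping it back to 0 is the instant "switch it off" path, with no deploy needed.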
What’s next for LaunchDarkly?
At this point, LaunchDarkly has moved past their initial Postgres migration, and is well underway in the process of transitioning all of their MongoDB workloads to CockroachDB. To hear more about their migration process, check out the full webinar recording:
Full webinar transcript
Check out the page for this webinar, or check out our other tech talks.
Jim Walker (00:00): All right. Good afternoon. Good morning. Good evening. It’s lunchtime here, so this is Lunch with LaunchDarkly, building scale-out global applications is the name of our Cockroach Hour today. But first and foremost, I wanted to thank everybody for joining us, taking your time out of your busy day.
Jim Walker (00:16): I hope this is going to be a great session today. I personally love talking to customers, love talking to people that are in the trenches, making stuff happen. People that are distributed systems experts and interviewing them and talking to them, I learn tons in just basically, the build-up to these things, but hopefully, this is going to be really valuable to everybody here, talking about LaunchDarkly and how they’ve used Cockroach to actually scale out their application and what’s going on there. But before we get started, a bit of housekeeping. Of course, we get asked every time, “Will the recording be available after the event?”
Jim Walker (00:51): Absolutely. We’ll send a follow-up to everybody who’s registered. And then we post everything to our YouTube channel typically the day of. I don’t want to give an SLA, but it should be on our YouTube channel later today. Please do ask questions in the QA panel. Again, this is an interview with a practitioner. So, let’s all learn from each other and we’ll be monitoring the QA along the way, and then also engage in chat if you like that better. Sometimes, we get these crazy chat conversations where it’s back and forth and lots of things going on, so we encourage any and all of the above. Myself and Chris will be joining us in a second, we’ll be monitoring that and listening to the conversation and making sure that we get questions in.
Jim Walker (01:35): So if you have any questions, please along the way, do ask them in that form. That’d be great. It’d be really awesome. So, the session today, we often get asked, “What level is this?” This is an intermediate session. We are going to talk about some fairly complex things. This isn’t going to be just business value or these high-level things. It should be actually hopefully pretty useful from a technical point of view. We have a bunch of topics, like I said, we love questions. But we’re going to talk about LaunchDarkly and some of the things that they’re using CockroachDB for, but more importantly, the business as well.
Jim Walker (02:08): So with that, I want to invite my partners in crime here today. So Justin and Chris, do you want to bring on camera there? Awesome. All right. So thank you both for joining. I will introduce Justin first. Justin Caballero is a Senior Distributed Systems Engineer at LaunchDarkly. Justin, I have a whole bunch of questions about your title, first of all. So just hello, first. I’ll ask you to explain yourself, but hi, how are you?
Justin Caballero (02:40): Hello, good afternoon.
Jim Walker (02:41): All right, good. All right. Volume’s working, everybody’s good. And then my friend, Chris. Chris Casano, who is one of our leads and sales engineers who spends a whole lot of time with customers as well. You want to say hi, Chris?
Chris Casano (02:53): Hey, folks. Hey, Jim. Hey, Justin. Good to be on the panel with you here. And yeah, happy to help with any technical questions that folks might have today. Like Jim mentioned, there’s the Q&A in the chat. Feel free to post your questions there, and I’ll try to answer them as quick as I can.
Jim Walker (03:06): Awesome. Thank you. All right. Well, so Justin, thank you for doing this, first of all. And more importantly, thank you for being a customer of Cockroach Labs. We are just overjoyed that a company such as yours, I think a fast-moving, emerging startup who has… Gosh, tech that I really dig joining us. So Justin, just to start, what is your role at LaunchDarkly?
Justin Caballero (03:35): Yeah, so I’m on one of our backend services squads, if you will. And my squad in particular is known as the Flag Delivery Squad. So we’re in charge of basically the services and infrastructure that provide feature flagging info to our customers. SDK is the crux of what we do.
Justin Caballero (03:58): So this is the part of the infrastructure that’s getting feature flags down to end user devices or servers, and data centers across the world. And we’re not technically like a platform squad that’s providing infrastructure to the rest of the company, but the work on Cockroach landed with us. So we’re a dual purpose squad, in that extent.
Jim Walker (04:22): Well, that’s cool. And you’re a distributed systems engineer, I’m sorry. But I’m a big fan of all things distributed these days. I think it’s a big deal. How long have you been doing distributed systems, Justin?
Justin Caballero (04:35): So, yeah, I worked a lot on web services, I would say, for I don’t know, 20 years. So going back to SOAP stuff, which I’m glad to have left behind. So, yeah. Applications talking to each other is basically what I’ve been doing.
Jim Walker (04:56): That’s what it is, right? Well, I’ll go a little further back. I was an early days customer of BEA Tuxedo, which became WebLogic. And that was that whole beginning of that crazy. And man, when I first saw it, I was like, “Whoa, this is cool.” We were trying to build basically, Tuxedo way back then. It was nuts. Yeah. But it was awesome. We loved it.
Justin Caballero (05:17): In this space, it feels like there’s nothing new under the sun, in a way. It’s like, you’ll see something new that comes out, it’s like, “Oh, actually this is this thing that was from 15 years ago.”
Jim Walker (05:29): Well, you mean like namespaces and Kubernetes? And like-
Justin Caballero (05:37): Certainly, because Cockroach is a good example of something that does feel pretty new. But there tends to be trends that tend to cycle, I’ve noticed. Now things like RPC are very popular and how many IDLs have been created in the past 10 years for planning schemas and messaging? That goes back.
Jim Walker (06:04): Way back. Well, the funny thing is, I’m in marketing, so I don’t get to play with tech anymore. I usually tell the guys in the marketing team, “Somebody’s already said this. So, go find it because like…” And typically, huge companies have done half the things we’ve done. So I think it’s all around the board. But Justin, what is feature flagging?
Jim Walker (06:26): If anybody here doesn’t know what feature flagging is, check it out. Check out LaunchDarkly, because to me, when I first saw it, I was like, “Oh yeah, that’s the future, for sure.”
Justin Caballero (06:36): Yeah. Yeah. I had the same impression. Basically the mission of the company is to empower teams to deliver and control their software, and so what that means to us is a fundamental concept, of first being able to separate the act of deploying software from releasing features.
Justin Caballero (06:56): So you can put a software into production, but you control when certain features are available to which types of users. That’s very powerful, because it decouples those two things and gives you a lot of flexibility and control to make immediate runtime changes to how your software works. And this could be on someone’s device, or in someone’s car. A lot of use cases, we’re not just talking about server farms. There’s software everywhere. And so LaunchDarkly can enable all those use cases.
Justin Caballero (07:29): And then the other part is the control and the ability to see from there, from that point of we can modify how your software works at runtime, we can also track… do analytics based on that, so we can see which users are getting which features, and you can hook that into experimentation. We can provide you all the data associated with that if you want to do your own analytics. We can tie it into APM, so you can have automated… Like if I enable this feature and performance tanks, I can automatically have it switched off. There’s just a lot of different use cases that come out of that. So it’s really about giving developers control. And once you start using feature flags, you can’t live without them. It’s like-
Jim Walker (08:12): It’s one of those things, once you see it, you can’t unsee it. It’s like once you’ve been there, you just can’t come back.
Justin Caballero (08:18): Yeah, yeah. It’s like, “How did I get by without this? It seems like a necessary part of your toolkit.”
Jim Walker (08:24): And this is one of those things that I think like you just said, the database was new. This is new to me. I mean, I used to make files and builds in the middle of the night and oh my God, what a nightmare. This is just like heaven. Just here, take this feature and put it out there for me, please. Awesome. Yeah, so that’s awesome. So, that’s great. And like I said, if anybody, you haven’t seen LaunchDarkly, you’re a developer, go check it out. It’s pretty badass.
Jim Walker (08:49): So, let’s talk a little bit about LaunchDarkly and Cockroach and what you all are doing there. But let’s just start with how did you first get exposed to Cockroach? And what did you see as interesting in this database, Justin?
Justin Caballero (09:04): Yeah, I suppose it’s a cheat to say this, but I like to think of my first exposure as maybe reading the Spanner Paper from a decade ago.
Jim Walker (09:11): Fair. Fair. That’s totally fair. Me too, by the way.
Justin Caballero (09:15): Yeah. At the time, everyone was super excited about it, but it seemed like something, “Oh, this is great for Google. I can’t do this.” Now it’s available as part of GCP, but still, you’re not running on GCP. So I actually didn’t hear about Cockroach directly until it was through a colleague maybe a year, year and a half ago. We were looking for databases that could improve our multi-region story.
Justin Caballero (09:41): And I just started watching videos, your guys' videos about it online, and went to a meetup and it was like, “Oh, wow, this is cool. It’s in this lineage of Spanner and it’s bringing that to everybody.” It’s like, “This is great.” We wanted to be able to tell a zero RTO/RPO story. And so when we evaluated Cockroach against other things we were looking at for multi-region, like say AWS Aurora or similar products, is like… I don’t know. We just felt that this was the best for that.
Jim Walker (10:19): Well, I’m happy that you felt we were the best. I’m a little kind of a Homer, so I think we’re best. But that’s another story. But by the way, I too was first exposed to this with the Spanner whitepaper, and I was at Hortonworks at the time. And you could imagine, we were like, “Oh my God, this is distributed transactions. This is good.” We were working on Hive and we had Hive… What was it, Chris? LLAP? What was that thing called?
Chris Casano (10:40): Yeah, LLAP, yeah.
Jim Walker (10:41): Yeah, and [crosstalk 00:10:41] we were just starting to build that out. And I was just like, “Whoa, this is cool.” But it is, and I think once you get into these concepts, I think it’s really interesting. And that’s why I love your title honestly, Justin, is the distributed systems, once you get into the thinking of how these things work, it’s one thing to understand how it functions, but it’s a wholly other thing in reality, when you’re a practitioner using these sort of things.
Jim Walker (11:05): But I love that the resilience story and that comes back to our name, Cockroach. Can’t kill it. So the RTO/RPO thing. And so, how is LaunchDarkly using Cockroach today?
Justin Caballero (11:17): So right now, we’re in the middle of migrating all of our existing… what we call our core data. So things that aren’t in very specialized storage systems, where we’re migrating to Cockroach. So currently, we have a single region deployment, but we’re looking to expand that in the near term. And so, yeah. We’ve got basically all of our account data for customers, users, the organizations, all the feature flagging information, all that stuff is going to go into Cockroach, as well as things around integration. We have numerous integrations with other products.
Justin Caballero (11:59): So there’s information about that or there’s some authorization-type things. So, yeah. We’re in the middle of this migration. I’d say we’re maybe half done, something like that. So hopefully, our target is to be completely off of our old systems pretty soon.
Jim Walker (12:23): Yeah. I mean, phased migration is too risky to do everything at once, of course. I mean, you are the feature flagging company. So of course there’s the mitigate the risk thing. What are you guys migrating from? And do you know why? Is it the resilience thing mostly, or?
Justin Caballero (12:38): It was really the resilience thing. Yeah, the wanting to go multi-region and have a system that was designed for that first, not as an add-on. So I’ve seen some solutions where it’s like, “Yes, this works great in a single region setting.” But when we go to expand it, it’s like, “Oh, we’ve added this feature.” It’s just very hard to manage that. And you’ve got to deal with a lot more stuff.
Justin Caballero (13:05): So yeah, the design first for cloud native multi-region architecture was what we wanted. And then the high resilience, and being able to tell a customer the story of like, we could lose the East Coast and you wouldn’t notice. That’s compelling to some of our customers, so.
Jim Walker (13:25): Well, yeah. I mean, developers lose their code and they freak out. We just saw this yesterday, everybody. The whole Fastly thing, people were freaking out. Like, “Oh my God.” Yeah. So they’re a little finicky, I think, this audience. So it was a fun day yesterday to watch that.
Jim Walker (13:45): Yeah. That’s cool. So you’re doing a slower migration. Is it fair to say you’re taking I guess less risky workloads to start with, Justin? And then leading up the path?
Justin Caballero (13:58): Yeah, yeah. We chose what we thought would be the easiest and the least exposed if something went wrong, work to begin with. So we migrated some data from Postgres, which we thought, “Oh, this would be a good fit because at least it’s SQL.” The rest of our stuff is a document store. And that was interesting as well. We can talk about that, but we did begin with that. It turned out that the workload that we chose to start with ended up being one of the more complicated ones, just because the SQL queries we had were using some interesting Postgres features. And they were very complicated queries that were doing a lot of heavy lifting, that you might think sometimes, you would just do in application code. So they were pretty beefy queries.
Justin Caballero (14:45): And so I think it was actually a good move, because it helped us really dig into Cockroach, and also be with a feature that’s not as critical as the core flagging information. So yeah, we did learn quite a bit about that, and now that’s behind us. So we’ve been learning about how to make our queries perform, and what types of things to avoid when you’re using a more distributed database than something like Postgres, where you’re just dealing with a single master node. And everything is local to that node, is super fast. And it’s just a different set of concerns. So, yeah.
Justin Caballero (15:33): We haven’t moved onto our core flag stuff, but we’re doing things around the edges, simple things, like what we call our dashboards, which is basically a saved view for users of the UI, maybe filter. It’s pretty simple stuff. And these have been fine because they’ve been easier, and we get single-digit latencies, because the queries are so simple. So we’ve been very happy to see that we can get good performance out of Cockroach, even though [crosstalk 00:16:06].
Jim Walker (16:06): And that’s all single region though, right? At this point, Justin?
Justin Caballero (16:08): Yeah, it’s single region. So you expect when we go multi, as we’re looking at our multi-region architecture, we’re going to start to optimize for making the writes as quick as we can. So we’re going to have another region as a write buddy, that’s as close as we can to our main region today.
Justin Caballero (16:26): And then as we expand out our infrastructure, we’ll probably try to maintain that. How can we minimize the write latencies, and then be able to provide that multi-region [crosstalk 00:16:41]?
Jim Walker (16:41): The move to multi-region is all around the resilience side, right? I mean, that’s the core driver, right? I mean, [crosstalk 00:16:47] the latency thing will help eventually, right? I mean, I think. But I think it’s more of a resilience thing, right?
Justin Caballero (16:54): It is primarily, yeah. There may be a point where we start to look at the geo-partitioning thing more, but it just doesn’t fit that naturally right now with the way our systems work.
Jim Walker (17:05): Sure.
Justin Caballero (17:06): And so I am interested to see at some point if we can use the… I think it’s called the duplicated indexes topology for some of our very low frequency data or static data, just so we can trade-off some of that latency for… make the reads faster and the writes a bit slower. So when I saw that topology, I was really excited about it. It seemed very cool.
Jim Walker (17:30): Well, it just came from direct requests from customers like yourself. It’s honestly, the funny thing is, is when you work in a company like this and people start using your database at scale, we’ve learned a lot from our customers. And that’s one of those patterns that just kept coming up, and we had to figure out like, “Okay, look, in the Spanner paper, they don’t talk about that stuff.”
Jim Walker (17:49): Because that wasn’t a problem for them. It was Google, that could really control. But in reality, wow. There’s so many different weird cases that come up. I think it’s interesting to go through those. I can’t wait to see the non-voting replicas. Yeah, the duplicated indexes. Some really cool stuff, but it’s a little bit more advanced. You didn’t understand that stuff when you first started. Right, Justin? I mean, it was a little bit of learning curve, right?
Justin Caballero (18:17): There certainly is. I mean, we did an eval and played with some of these things, but the things that we learned when we were actually implementing were different than what we discovered in the evaluation. So yeah, it’s a very deep product, Cockroach. I’m so happy that the docs are so thorough. It’s really given us a big help, helping us even know what to ask if we need to go to support.
Jim Walker (18:43): Well, amen to our documentation team. And anybody’s ever seen me on one of these things before, at some point on every single public speaking thing I do, I say something nice about our docs team, because I think they do a phenomenal job. Jesse’s a poet, actually. The guy who leads all our documentation is actually a poet. So I think it actually comes off, but that whole team does a phenomenal job.
Jim Walker (19:08): I want to come back to something, though. You talked about this migration from Postgres to Cockroach, and you had to refactor queries. Were you refactoring queries because of the distributed nature of Cockroach, or the difference in syntax? Because there are things that we don’t do in Cockroach that are native to Postgres. And it’s because it’s distributed too, I think it’s a little bit of both. But can you talk to me about that refactoring a little bit? It’s actually very interesting.
Justin Caballero (19:35): Yeah. There were both. I mean, one of the things that we did was when we created the Cockroach schema, we made sure to use the UUIDs for our primary keys.
Jim Walker (19:44): Yeah, that’s a key best practice, yeah.
Justin Caballero (19:46): That’s a pretty obvious one we got right out of the gate. So we didn’t have a sequence, a monotonic sequence for our primary keys. But the other thing, because there was some dialect differences that were just… There’s a similar function in Postgres as there is in Cockroach, but it’s just got a different name. So we need to figure those out, and that’s not too bad.
Justin Caballero (20:11): What were some of the other things? One of the problems we had was around cascading deletes. So, we have this workload where we may have a single call to our API that results in tens of thousands of rows being deleted in the database. In one of the older versions of Cockroach, I think there’s been a fix since, but we would get “command too large” errors, because it was just trying to build a thing too big with these cascading deletes.
Justin Caballero (20:43): So, we did do some work using a feature flag to be able to control, do we want to cascade this delete, or do we want to do it in a batch fashion? And so we can decide on the fly which approach we want to do. And so we took your advice. It’s in the documentation and from support to try to break up those commands to make them smaller, so they’ll run a little faster and fit in a size that can be messaged between the nodes. That’s [crosstalk 00:21:13].
Jim Walker (21:13): That’s the funny thing about distributed systems. You have to actually think about those sort of things. I think it’s really interesting. Let me come back to that, but Chris, I know you deal a lot with customers. Can you explain the UUID problem and why you need to use that? I know, Chris, you’re out there talking to a lot of people about this stuff, right? Why do you need that on each table?
Chris Casano (21:35): Yeah, so it’s a great question. So one thing Cockroach likes is distribution. We don’t want you to get into the trap of having a hot range where all your traffic is going to one shard or one range in the database. So with things like UUIDs, they allow you to actually get more distribution of your keys across the cluster. So that way, all the ranges and all the nodes are performing parallel tasks for you, as opposed to just, “Hey, everything’s getting out into a hotspot.”
Chris Casano (22:03): So the more you can make use of them, typically the better off you are. And just to add to the conversation here too, so I’ve been working with Justin for some time, but the approach that I think LaunchDarkly took here is just a textbook approach. They just didn’t jump right into the deep end of the pool and say, “We’re going to do multi-regional, we’re going to do all these crazy things. Let’s go”
Chris Casano (22:28): They’ve walked into the shallow end, and been really patient and really methodical about how to go about using Cockroach in the right ways. And there’s some learning things across the way that Justin and team have found, but we’ve given our best as far as how to educate and help and support them on their journey. So I just want to say, I really appreciate… I’ve been working with LaunchDarkly for some time, just how their approach and methodology they’ve used here.
Jim Walker (22:57): Yeah. We’re seeing that time and time again. And that’s why I wanted to ask about that, Justin. Where did you start? Because I think that’s one of those things, you could understand all these concepts, but if you just jump into it feet-first or head-first, it’s too much, man. You’re going to run into these weird things, right?
Justin Caballero (23:12): Yeah, yeah. For sure. You want to take something small. The philosophy is basically, small, reversible steps that are like [crosstalk 00:23:19] individually. You always want a way out and you don’t want to impact your current SLOs, certainly. So yeah, finding what you think will be the easiest thing to start with, least impactful if you get it wrong.
Justin Caballero (23:37): The naive approach to doing this kind of thing, which might be like, you declare an outage window and you do an import from one database to the other, then you come back up. But that’s not something that’s going to work for a lot of companies nowadays. You can’t take downtime. And so, we followed an approach that it’s a pretty common approach, I think. You can find videos that talk about it, but basically, running both databases together and sending identical queries to each. And so that was also an important risk mitigation approach for us, because we can’t just all of a sudden start using Cockroach.
Justin Caballero (24:20): We want to see that it works in production without actually having it connected to users. And so having it run next to our original data store and being able to observe it, monitor it, check for performance, for correctness, we compare the results of the two databases and log metrics if we find differences. So we are pretty conservative.
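To make the dual-run idea concrete, here is a minimal Python sketch of a shadow read: the original store answers the request, the new store runs the same query on the side, and any difference is counted as a metric. The names `shadow_read` and `mismatch_metrics`, and the in-memory dictionaries standing in for the two databases, are assumptions made for illustration; this is not LaunchDarkly’s actual code.

```python
from collections import Counter
from typing import Any, Callable

# Hypothetical in-process metrics sink; a real system would emit to its
# monitoring stack instead.
mismatch_metrics: Counter = Counter()


def shadow_read(name: str, primary: Callable[[], Any], shadow: Callable[[], Any]) -> Any:
    """Return the primary store's result; run the shadow query only to compare."""
    expected = primary()
    try:
        actual = shadow()
    except Exception:
        # A failure in the shadow store must never affect the user-facing result.
        mismatch_metrics[f"{name}.shadow_error"] += 1
        return expected
    if actual == expected:
        mismatch_metrics[f"{name}.match"] += 1
    else:
        mismatch_metrics[f"{name}.mismatch"] += 1
    return expected


# Two dictionaries stand in for the old and new databases.
old_store = {"flag-a": True}
new_store = {"flag-a": False}

result = shadow_read(
    "get_flag",
    primary=lambda: old_store["flag-a"],
    shadow=lambda: new_store["flag-a"],
)
```

The caller always gets the original store’s answer, so the new database can be observed in production without being connected to users.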
Jim Walker (24:42): Go on, sorry. Yeah. And then Justin, correct me if I’m wrong, but you’re using LaunchDarkly to do that, right?
Justin Caballero (24:51): Yeah, exactly. For sure. So, for each of our workloads, we have what you can think of as a repository interface, is a common name for it, which essentially, these are the functions that we’re using to access the data and provide two implementations of that, one Cockroach and one with our previous store.
Justin Caballero (25:10): And then we use feature flags to control which one is connected, which ones run and which ones are connected to the end user, which is the source of truth. And so, we can gradually roll out Cockroach on a percentage basis. We can say, “We’re going to do this for 10% of customers. We’re going to watch it, see what it looks like. We’ll turn it back if it’s not looking good, and go do some fixes and then start again.” And so feature flags enable all of that. We can do that in an instant, without having to do deploys. So yeah, that’s certainly been a beginning.
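The repository-interface pattern with a percentage rollout could look roughly like the Python sketch below. It does not use the LaunchDarkly SDK; a simple hash bucket stands in for the flag evaluation, and `AccountRepository`, `LegacyRepository`, `CockroachRepository`, `bucket`, and `pick_repository` are hypothetical names chosen for this example.

```python
import hashlib
from typing import Protocol


class AccountRepository(Protocol):
    """The data-access functions the application depends on."""

    def get(self, key: str) -> str: ...


class LegacyRepository:
    def get(self, key: str) -> str:
        return f"legacy:{key}"  # placeholder for a query against the old store


class CockroachRepository:
    def get(self, key: str) -> str:
        return f"cockroach:{key}"  # placeholder for a query against the new store


def bucket(customer_id: str) -> int:
    """Map a customer to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_repository(
    customer_id: str,
    rollout_percent: int,
    legacy: AccountRepository,
    cockroach: AccountRepository,
) -> AccountRepository:
    """Stand-in for a flag evaluation: customers whose bucket falls below the
    rollout percentage are served by the new implementation."""
    return cockroach if bucket(customer_id) < rollout_percent else legacy


legacy, cockroach = LegacyRepository(), CockroachRepository()
repo = pick_repository("customer-42", rollout_percent=10, legacy=legacy, cockroach=cockroach)
value = repo.get("settings")
```

Because the bucket is derived from the customer ID, a given customer stays on the same side as the percentage is raised, and setting the percentage back to 0 returns everyone to the original store without a deploy.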
Jim Walker (25:46): That’s awesome. Every once in a while, somebody will say a sentence and it involves the word Cockroach and the 14-year-old in me just laughs. Like, I’m sorry, guys, but I’ve been here… I don’t know, almost three years? And it’s like, “Yeah, we deploy Cockroach out to our customers,” it just sounds funny to me sometimes. I don’t know. And nine out of 10 times, every time we get a hacker news post, somebody’s like, “Yeah, but that name, I don’t know.”
Justin Caballero (26:10): I love the name. When I heard it, it was like, “Oh, yeah. This is the totally perfect name for this product.”
Jim Walker (26:18): Love it or hate it, you don’t forget it. That’s for sure. So how many people are touching Cockroach? I mean, outside of developers, I mean, because developers are working with these things, but from an operations point of view and propping it up, dealing with it, how many people are managing the system now, Justin?
Justin Caballero (26:37): Right now, it’s just two of us. And so, we’ve started having other teams using it, so they’re migrating data sets or building new data sets on top of it, but actually, the infrastructure part, there’s just two of us. So yeah, we took a pretty simple approach. Similar to what I think I heard DoorDash describe in the previous livestream, where they’re just using more normal compute nodes. It’s something that’s good for two people to manage.
Justin Caballero (27:15): So, yeah. Yeah. I think that’ll probably shift over time. It’ll probably graduate from the team that I’m on now to a more platform-oriented team to take care of it.
Jim Walker (27:28): Cool.
Chris Casano (27:28): There was a good question in the chat that I thought we’d just bring up real quick. So I know we hit on a little bit of this before, but Justin, as you did that migration from Postgres to Cockroach, were there any other anti-patterns that you found along the way? I know we talked about UIDs before, but anything else that popped up with that effort?
Justin Caballero (27:49): So we had to look at the query plans for Cockroach. And this is a case-by-case thing. And we did have to rewrite some of the queries to make them… just to make sure that they were performing. I don’t know that there’s a general takeaway from that other than to look at the query plans and see if there’s a problem.
Justin Caballero (28:14): The other thing I would just say is try to keep your SQL simple. That goes a long way. You can get the transaction boundaries that you want or obviously, you don’t want to do a lot of round trips. So sometimes, you need to make a more complicated statement, but if you can keep your SQL small, it’s very easy to make it fast. Other than that, for our other less complicated workloads, it converted pretty easily. So one of the things, we were using some extensions for case insensitivity, which are available with Postgres. And so we had to do a little extra syntax on the Cockroach side using collate statements to make sure that the case-insensitive searches work. So there were some weird things like that, but they were all pretty solvable.
Jim Walker (29:13): Cool. Chris, do you want to ask this question from Peter? I think it’s a really good one here, about the bulk inserts and sizes. I’ll do it. I’m like-
Chris Casano (29:25): [crosstalk 00:29:25] need me, Jim. You got it.
Jim Walker (29:26): There was a good question that came in and actually, this came up in the DoorDash interview as well, Justin, when he was talking about inserts. And when you’re doing bulk inserts, how do you figure out the size, the number of rows for each one of those things that’s going to have the best performance? I think what Sean had said, “Yeah, we had this insert of 10,000 records. We made it 1,000 records each with 10 statements, to distribute that.”
Jim Walker (29:52): Have you guys been through that? And how did you guys figure out the trade-offs for performance versus size of bulk?
Justin Caballero (29:59): Yeah. I don’t think we’ve done that on the insert side. We did do that on the delete side, where again, we used the feature flag with an integer value to dial the size of the batch. And so we just tried some numbers to see which one seemed like it worked best for that case. I can’t say that we’ve come up with a general… I don’t think we’ve seen it enough to have general advice for that. I don’t know. Chris might have a better answer.
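A batched delete whose batch size is re-read on every pass, so it can be dialed while the job runs, might be sketched in Python as follows. Python’s built-in `sqlite3` module stands in for the real database so the example is runnable, and `get_batch_size` stands in for reading an integer-valued feature flag. The table name, column names, and function names are all assumptions made for this illustration.

```python
import sqlite3
from typing import Callable


def delete_in_batches(conn: sqlite3.Connection, get_batch_size: Callable[[], int]) -> int:
    """Delete expired rows a batch at a time and return the total removed."""
    total = 0
    while True:
        # Re-read the dial on every pass so a change takes effect mid-run.
        batch_size = max(1, get_batch_size())
        cur = conn.execute(
            "DELETE FROM events WHERE id IN "
            "(SELECT id FROM events WHERE expired = 1 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # each batch is its own small transaction
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, expired INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO events (id, expired) VALUES (?, ?)",
    [(i, 1 if i < 20 else 0) for i in range(25)],
)
conn.commit()

deleted = delete_in_batches(conn, get_batch_size=lambda: 7)
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Trying a few values for the batch size and watching latency and contention is the kind of experiment Justin describes; the sketch only shows the mechanics.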
Chris Casano (30:29): I mean, typically, what we’ll find is that a lot of small transactions work well. So, I mean, we just tested something with another customer where we needed to load I think about a hundred million records within 20 minutes. So we just found that just having really small batches, but a lot of them was the way to go.
Chris Casano (30:46): So it depends. Sometimes, your batches, they might have to hit multiple ranges. Maybe your batch just hits one range, and that’s going to be faster for you. So it’s a little bit of trial and error, but I would say just have in the back of your mind that a lot of little small transactions will typically help you scale up for the right performance.
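The "many small transactions" advice can be illustrated with a short Python sketch that splits a bulk load into fixed-size batches and commits each one separately. Again, `sqlite3` is only a runnable stand-in for the real database, and `chunked`, `insert_in_batches`, and the `records` table are hypothetical names. The right batch size is workload-dependent, so the 1,000 used here simply mirrors the figure mentioned in the question.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]


def chunked(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_in_batches(conn: sqlite3.Connection, rows: Iterable[Row], batch_size: int) -> int:
    """Insert rows batch by batch and return the number of batches committed."""
    batches = 0
    for batch in chunked(rows, batch_size):
        with conn:  # commits on success, rolls back if the batch fails
            conn.executemany("INSERT INTO records (id, payload) VALUES (?, ?)", batch)
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)")

rows = [(i, f"row-{i}") for i in range(10_000)]
batch_count = insert_in_batches(conn, rows, batch_size=1_000)
row_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

If one batch fails, only that batch is rolled back and can be retried, which is part of why smaller transactions tend to be easier to scale and operate.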
Jim Walker (31:05): Keep your SQLs simple, KISS. Keep your SQL simple. I’m going to start using that. I like it. I’m going to have bumper stickers, but I think it’s great though. The question’s tough to answer because if it’s just simple inserts, yeah, great. If it’s inserts that are cascading across multiple tables, well, you’re going to get into… So it really depends on what you’re trying to do. And I think we see this time and time again as well, Justin, and I think that’s what Sean was saying. It took a little bit of experimentation and you get used to it. You get in this different mindset, right?
Justin Caballero (31:40): Yeah. For sure. Yeah. One thing I wanted to mention that I forgot about, about the migration that has been good for us is that we’re starting to migrate document stored data now, JSON data. And it’s been nice to see that Cockroach can handle that pretty well.
Justin Caballero (31:57): So we’re just able to drop JSON documents into the database and write queries against those. And they’ve been performing well. So I’d say to folks, if you’re coming from a non-SQL database and looking at Cockroach, it could work.
Jim Walker (32:15): So Justin, from a deployment point of view, are you guys running just on raw containers on VMs? Are you in pods, in Kubernetes?
Justin Caballero (32:23): Oh, yeah. So there’s no Kubernetes. So we’re basically just using EC2 instances with traditionally managed infrastructure as code, like Terraform-type stuff. And yeah, that’s it. We may look at Kubernetes again, but until we feel like we want to invest more people in being able to operate that, because it is a level of complexity up.
Jim Walker (32:48): Yes, it is. That’s for sure. And was that really the decision? I mean, is that the reason why you guys landed in that? I mean, did you even consider anything else at that point, or?
Justin Caballero (32:59): Yeah, I mean, there’s Cockroach cloud, but for us, we felt like we wanted the control of running the database. And then with Kubernetes, we weren’t using Kubernetes a lot previously, so it just seemed like we didn’t want to do two new things. So, we have a way that we’re deploying all our other infrastructure, we’re just going to do that. I think the Kubernetes approach looks good, from what I’ve read about it. I like the rolling updates and those types of features, so I’m interested to play with it.
Jim Walker (33:39): Yeah, for us, I mean, it was a no-brainer. I mean, we’re just managing so many clusters. The trade-off, from a pure just cost of operations relative to just doing it on bare-metal or whatever we want to do, it just made a whole lot of sense for us. Yeah, Cockroach cloud runs on Kubernetes because yeah, we have an SRE team that was built for that.
Jim Walker (34:03): Like Spanner was to Borg, I love Kelsey Hightower’s tweets, Spanner was to Borg is like Cockroach is to Kubernetes. Yeah. You guys get it, you know what I mean. So same kind of paradigm, so.
Justin Caballero (34:15): Yeah. The number of nodes we’re running is not anywhere near what… We can still manage with a traditional approach. Yeah.
Jim Walker (34:24): And so Chris, there was a question. Is there a Cockroach Kubernetes operator?
Chris Casano (34:29): And there is. I’ll just respond to Mike who sent this out. Yeah. So there is a Kubernetes operator. If you go to our docs, it’s right in there, with support for GKE, and it’s certified in the Red Hat marketplace. So you can certainly make use of it. We have a whole engineering team around it, as far as giving it all the love that it needs.
Jim Walker (34:55): When I first landed here at Cockroach, I remember I had arguments internally about if we needed an operator or not, because we are so aligned with Kubernetes. There was big debate and it’s kind of like, it’s the day two stuff. It’s the rolling upgrades, some of the more manual stuff, but just purely deploying just us on top of stateful sets, it’s fairly straightforward to get Cockroach up on Kubernetes. But of course, yeah, we have an operator as well. So any other questions, Chris? Any other questions? Anybody else? I’m sorry, Justin, did you-
Justin Caballero (35:24): I was going to say, yeah, it’s very nice how easy it is to run Cockroach. I was just going to say, it’s like passing some arguments and get your execute line, and you’re off, so.
Jim Walker (35:36): Well, thanks, Chris. I mean, thanks, Justin. I’m looking at Chris and I said, “Justin.” What the hell am I doing here? Thanks, Justin. Yeah. I think the engineering team’s done a pretty good job of trying to make this as simple as possible. A lot of our focus here at Cockroach is how do we make even these more complex operations even more simple?
Jim Walker (35:57): We want to put this in the hands of every developer. And so, do I want developers thinking about indexes and keys? I think that’s where we’re headed. As we think about our roadmap, how do we just simplify it? I don’t want a developer to ever have to think about database.
Jim Walker (36:14): I just want a SQL API in the cloud, and I think that’s the ultimate vision. And then run these things as services, and feature flag things in and out. And we’re living in a different world than gosh, 25 years ago when I started in this whole thing. So it’s fun.
Justin Caballero (36:29): That’s true.
Jim Walker (36:30): So, cool. If there’s no more questions, again, Justin, thank you so much for doing this. It means a whole lot. I hope it was valuable for everybody. Chris, thanks for jumping on here and being part of the conversation as well. Any parting conversation or any parting words for anybody? Anybody? Chris, Justin? Put you on the spot.
Chris Casano (36:54): I mean, it’s crawl, walk, run. I think that was the lesson that I learned with LaunchDarkly, was their approach. They just took it one step at a time, as opposed to just jumping in and breaking everything, causing all kinds of havoc. Cockroach is very much a relational database and you want to treat it that way, but it’s also it’s a distributed system. So you have to think about both of those things together.
Chris Casano (37:23): Yeah, that was my largest lesson learned here. I don’t know if Justin, if you have anything to add?
Justin Caballero (37:29): Yeah, no, me too. I think that hits the nail on the head and thanks, you guys. Appreciate all your support helping us through this journey.
Jim Walker (37:35): Yeah. Well, you guys are a great customer and honestly, we’re all living in the same place together too, the same space. My takeaway is KISS. Keep Your SQL simple. Right? I’m going to carry bumper stickers inside the company now. So, all right. Well, thank you guys for doing this, very much. Thanks everybody for joining in as well. Again, recording will be up on our YouTube channel this afternoon, I’m pretty sure. I’m pretty sure.
Jim Walker (38:02): But I hope this was useful for everybody. There will be a survey when the event ends here. So gosh, please, please give us feedback in the survey. It’s really important for us to get these things right. We try to keep the right level of tech and business and try to make it entertaining along the way as well. So again, Chris, thank you, buddy. I’ll see you later. And Justin, thanks again for everything. And thanks for being such a great customer. On behalf of the whole company, thank you, buddy.