Building a payment system is a challenging proposition. The potential pitfalls can be subtle, and the price for getting something wrong is high.
Shipt is a company that has obviously gotten it right. You probably know Shipt as a grocery delivery service, but the ecommerce company has several other businesses and fulfills all of the orders placed on its parent company Target’s website.
Shipt’s payment service underpins almost all of the company’s business, so Cockroach Labs’s Jim Walker recently sat down with David Templin, a Senior Software Engineer on Shipt’s Member Payments Engineering team, to find out how they built it. We also had the pleasure of hosting Michael Ching, an engineer from Shipt, at our first annual customer conference. He gave a compelling presentation about the cloud provider outage that lead Shipt to CockroachDB:
Multi-region from day one
From the beginning, Templin says, the Shipt team knew that it wanted to build a multi-region payment system. In part, that was out of necessity:
“This [the payment service] is a really important service, it’s what we would call a tier zero service,” Templin says. It’s essential for operation at Shipt, so we wanted it to be highly available. The way you typically do high availability is to do some kind of geographical distribution.”
“We could easily just distribute it into multiple availability zones,” Templin added, “but several times just in my career at Shipt I’ve seen region failures in various cloud providers. So we wanted to distribute it across at least two regions.”
A payment as a state machine
To build the system itself, Templin says, the Shipt team started by trying to define an abstract model of a payment. That would help them define what would be required to ensure payment correctness, and that would define the requirements for the payment system.
Fairly quickly, the team settled on a basic model of a payment as a state machine. For example, payments might have states such as “authorized”, “captured”, “refunded”, etc. Any given payment either has one of those defined states or is in the process of transitioning between one state and another.
In this model, a payment can only have one state at a time – a payment cannot simultaneously be “captured” and also “refunded”, for example. Payments can’t have multiple concurrent states.
That introduced the issue of concurrency control – an area where, Templin says, small subtleties can turn into big problems if developers aren’t careful. (For a thorough deep dive into distributed payment system best practices check out this blog.)
“Bad things can happen.”
“In my experience with distributed systems,” Templin says, “if you don’t have some kind of concurrency control in place, bad things can happen.”
“You might think ‘Oh, well, that’s a theoretical concern. We’re not going to really have concurrent requests for the same payment in our system.’ But those are famous last words.”
In reality, Templin says, it’s easy for concurrency problems to creep in. One common example: a browser sends an HTTP request related to a payment, but the connection drops. A retry library retries, and that request is routed to a different server by the load balancer. “Now,” Templin explains, “you’ve got two concurrent processes trying to do something similar on the same payment. So concurrency just seeps in, you have to have concurrency control.”
Consider, for example, an active-passive setup with a traditional relational database technology such as Postgres. At first glance, it might appear that such a setup would have met a lot of Shipt’s requirements. The problem, Templin says, is “the lag when you transition to the passive system.”
“One of the things that’s subtle about relational database management systems like Postgres,” he said, is that while they do provide redundancy and durability, “what you don’t have is isolation.”
That poor, neglected letter
To guarantee correctness, transactions in a database need to have Atomicity, Consistency, Isolation, and Durability, or ACID. “We’re all familiar with the ACID acronym for databases,” Templin says. “‘I’ is that poor, neglected letter.”
Isolation may be under-emphasized in some corners of the development world, but Templin and the Shipt team knew that it would be critical for the heavy transactional workloads of the payment system they were building. And they knew that with an active-passive Postgres setup, for example, isolation could be violated during a failover, as the system transitions between the active and passive nodes.
“I was concerned about what could happen in that five minutes to an hour, or however long it takes just to failover,” Templin says. “If it happened at the wrong time it could affect hundreds of payments, and so then we get into this really bad situation where now I’ve got 100, 200, 300 payments and they’re messed up.”
“I’ve been in those situations before, and they’re really painful,” he says. “I have to go through and audit every one and correct them manually. I really wanted to build a system where I could avoid something like that.”
“That’s one of the reasons why we wanted a truly distributed database, not an RDBMS where you basically try to retrofit some kind of replication to it. We wanted something that’s fundamentally designed differently, where that kind of replication is part of the system from the beginning.”
CockroachDB and the distributed mindset
The search for a truly distributed database led Shipt to consider CockroachDB, Spanner, Yugabyte, and FaunaDB.
“We didn’t really want to use Spanner because of its somewhat proprietary nature. It only works on Google Cloud [and] it also relies on highly specialized hardware,” Templin says. And when Shipt compared CockroachDB to Yugabyte and Fauna, they felt that CockroachDB was the most mature product.
It also helped that Cockroach Labs offers a managed cloud solution. “It’s not really very hard to run CockroachDB yourself – there’s very good documentation on it – but we didn’t want to deal with that,” Templin says.
“We’re already dealing with a lot of new territory, so we wanted to be confident that [the database] will be done, it’ll be done quickly, and it’ll be done right. So we decided to use CockroachCloud, and it’s been very convenient to manage our clusters that way. We didn’t have to try to get a whole team of people here to try to maintain [the database].”
This, Templin says, is one of the things developers need to think about when developing distributed applications. There’s no way around the complexity, so the question becomes where do you put that complexity: in the database, or in your application? “One of the things we liked about cockroachDB is that most of that complexity goes in the database,” he says. “ We could have attempted to use other database systems, but then a substantial amount of the complexity goes into the application. It’s already a moderately complicated application, so we didn’t want to add more complexity to it.”
But, Templin says, “No matter what you do, you do have to shift your mindset. You can’t just naively take an application designed for a single region and scale it up to multi-region and everything works.”
“The biggest thing in our type of applications is data affinity — where does the data live?” CRDB is truly global and data is accessible everywhere, but it’s not equally performant everywhere, depending on where it’s located. So you need to think about “what are the patterns of access where I can access this data efficiently.”
So how did Shipt solve that problem? For the answer to that question, and more details on how the company built and scaled its payment system, check out the full webinar below, or read through the full case study!