The details in this post are based on The Netflix Tech Blog post titled “Towards a Reliable Device Management Platform”.
The Media & Entertainment (M&E) industry is extremely profitable – it has raked in billions of dollars each year for the last several years. The U.S. M&E industry is the largest in the world, valued at $660 billion (of the $2 trillion global market) despite seeing a 7.3% year-on-year decline in 2020 due to the pandemic.
While the pandemic accelerated some existing trends (e.g. the streaming subscription model), it halted others (e.g. box office sales). Many M&E companies had to pivot their business model to stay competitive. For example, we saw several studios release first-run movies directly to streaming services, which allowed them to reach an even larger audience.
The M&E industry is on the rebound from 2020, and major players are figuring out new ways to build relationships with customers that last years, not weeks. One thing that’s been clear is the importance of creating an agile business model that allows you to iterate on new ideas and quickly adapt to market changes driven by customer demands.
The two challenges of fast, flawless streaming UX
As you can imagine, this is an industry that is built on delivering an excellent customer experience. End-users expect a steady stream of high quality content, recommendations tailored to their preferences, and a fast, flawless UX.
The M&E companies that thrive build environments where they can test new ideas and respond to changing customer demands. The goal is to develop applications and services that delight current customers and attract new ones.
For their development teams, this creates two significant challenges:
How can they make their development process as agile as possible?
How can they build services for the industry’s massive scale?
Resolving these challenges requires looking carefully at the technologies that are powering their applications and services. And the most critical foundation of these technologies is the database.
Modern database solutions have to work with the cloud. They also have to be able to scale horizontally without compromising data correctness and availability. They can’t add any operational complexity because it will slow development cycles and frustrate teams. So, when it comes to the database, that’s the challenge: how can we build something that’s cloud-based and scalable but also simple to implement and operate?
How does Netflix work so well on so many devices?
Netflix is invested in “continually improving their products so they can deliver more joy and satisfaction to their members”. To make this possible, not only does Netflix foster a culture of experimentation, but they also invest in infrastructure that is built to support and scale successful experiments.
Let’s look at one example. One area where Netflix has succeeded in delighting users is device availability – you can log in and watch Netflix from almost any device with a screen. That’s thanks in part to Netflix’s Partner Infrastructure Team, which is in charge of building and maintaining the Device Management Platform. That platform is the foundation for Netflix Test Studio (NTS), a cloud-based automation framework that lets developers remote control “Netflix Ready Devices”. This platform is designed to handle device management at scale.
Netflix works with partners like Roku, Samsung, and LG to support hundreds of different device types such as streaming sticks and smart TVs. When new devices are introduced, Netflix makes sure that these devices are up to their standards before onboarding them to their application. Additionally, Netflix tests current devices every day through automation to ensure that new software releases continue to deliver the streaming quality and user experience customers expect.
It’s critical for the Netflix team to keep device information up to date so that their device tests work properly. Within the Device Management Platform, this is achieved by event-sourcing device updates through the control plane to the cloud. The challenge is ingesting and processing these events in a way that scales with the number of devices.
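To make the event-sourcing idea concrete, here is a minimal Python sketch of how a stream of device events could be folded into current device state. It is an illustration only, not Netflix's implementation: the `DeviceEvent` fields, the event kinds, and the per-device sequence number are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class DeviceEvent:
    """One hypothetical device update emitted at the edge."""
    device_id: str
    sequence: int  # assumed to increase with every event for a given device
    kind: str      # assumed kinds: "attached", "updated", "detached"
    attributes: Dict[str, str] = field(default_factory=dict)


def apply_event(state: Optional[dict], event: DeviceEvent) -> dict:
    """Return the device state that results from applying one event."""
    if state is not None and event.sequence <= state["sequence"]:
        return state  # stale or duplicate event: keep the current state
    attributes = dict(state["attributes"]) if state is not None else {}
    attributes.update(event.attributes)
    return {
        "device_id": event.device_id,
        "sequence": event.sequence,
        "connected": event.kind != "detached",
        "attributes": attributes,
    }


def materialize(events: Iterable[DeviceEvent]) -> Dict[str, dict]:
    """Fold a stream of events into the latest known state per device."""
    devices: Dict[str, dict] = {}
    for event in events:
        devices[event.device_id] = apply_event(devices.get(event.device_id), event)
    return devices
```

Because stale and duplicate events are ignored, replaying the same events produces the same state. That property is what makes it safe for an event to be delivered more than once after a failure.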
Here’s an overview of their Device Management Platform setup:
[Image credit: The Netflix Tech Blog]
At the edge, they have hardware in the form of an embedded computer called the Reference Automation Environment (RAE). The RAE functions as a router to which the devices under test (DUTs) are connected. On the RAE there’s a Local Registry that is responsible for detecting, onboarding, and maintaining information about all the devices. MQTT, a messaging protocol for the Internet of Things (IoT), forms the basis of the control plane.
On the cloud side, Netflix uses a bridge between the two protocols, MQTT and Kafka, to allow cloud-side services to communicate with the control plane. MQTT messages are converted to Kafka records, which are consumed by an Alpakka-Kafka-based processor. This configuration helps achieve fault tolerance on the consumer side of the control plane, which is key to enabling accurate and reliable device state aggregation within the platform.
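As a rough illustration of the MQTT-to-Kafka conversion step, the Python sketch below maps one MQTT message to one Kafka-style record. Everything specific in it is an assumption made for the example rather than a detail from Netflix: the MQTT topic layout `rae/<rae_id>/devices/<device_id>/updates`, the Kafka topic name `device-updates`, and the choice of record key. No broker or client library is involved; the function only shows the mapping.

```python
from typing import NamedTuple


class MqttMessage(NamedTuple):
    topic: str
    payload: bytes


class KafkaRecord(NamedTuple):
    topic: str
    key: bytes
    value: bytes


def to_kafka_record(message: MqttMessage,
                    kafka_topic: str = "device-updates") -> KafkaRecord:
    """Convert one MQTT message into a Kafka-style record.

    Assumes the hypothetical MQTT topic layout
    rae/<rae_id>/devices/<device_id>/updates.
    """
    parts = message.topic.split("/")
    well_formed = (
        len(parts) == 5
        and parts[0] == "rae"
        and parts[2] == "devices"
        and parts[4] == "updates"
    )
    if not well_formed:
        raise ValueError(f"unexpected MQTT topic: {message.topic!r}")
    rae_id, device_id = parts[1], parts[3]
    # Keying by device keeps all updates for one device on one partition.
    key = f"{rae_id}/{device_id}".encode("utf-8")
    return KafkaRecord(topic=kafka_topic, key=key, value=message.payload)
```

The key choice is the design point here. Kafka's default partitioner sends records with the same non-null key to the same partition, so keying by device would preserve the order of updates for each device while still letting different devices be processed in parallel.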
There’s also a service called Cloud Registry that ingests and processes device information updates and pushes the materialized data into CockroachDB. Netflix says it chose CockroachDB as the backend data store because it is designed from the ground up to be horizontally scalable. Additionally, the team liked that CockroachDB offers SQL capabilities that help them normalize their data model for device records.
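To show what a normalized data model for device records might look like, here is a small Python sketch. It is not Netflix's schema: the `device_types` and `devices` tables, their columns, and the `last_sequence` guard are hypothetical. To keep the example self-contained it runs against Python's built-in `sqlite3` as a stand-in. A real deployment would connect to CockroachDB through a PostgreSQL driver, since CockroachDB speaks the PostgreSQL wire protocol, and would likely choose different key types.

```python
import sqlite3

# Hypothetical normalized schema: device type details live in one table,
# and each device row references its type instead of repeating them.
SCHEMA = """
CREATE TABLE device_types (
    type_id INTEGER PRIMARY KEY,  -- auto-assigned by SQLite
    vendor  TEXT NOT NULL,
    model   TEXT NOT NULL,
    UNIQUE (vendor, model)
);
CREATE TABLE devices (
    device_id     TEXT PRIMARY KEY,
    type_id       INTEGER NOT NULL REFERENCES device_types (type_id),
    firmware      TEXT NOT NULL,
    last_sequence INTEGER NOT NULL
);
"""

# Insert a new device, or update an existing one only if the update is newer.
UPSERT_DEVICE = """
INSERT INTO devices (device_id, type_id, firmware, last_sequence)
VALUES (?, ?, ?, ?)
ON CONFLICT (device_id) DO UPDATE SET
    type_id = excluded.type_id,
    firmware = excluded.firmware,
    last_sequence = excluded.last_sequence
WHERE excluded.last_sequence > devices.last_sequence
"""


def open_registry() -> sqlite3.Connection:
    """Create an in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def get_or_create_type(conn: sqlite3.Connection, vendor: str, model: str) -> int:
    """Return the id of a device type, creating the row if it is missing."""
    conn.execute(
        "INSERT INTO device_types (vendor, model) VALUES (?, ?) "
        "ON CONFLICT (vendor, model) DO NOTHING",
        (vendor, model),
    )
    row = conn.execute(
        "SELECT type_id FROM device_types WHERE vendor = ? AND model = ?",
        (vendor, model),
    ).fetchone()
    return row[0]


def record_update(conn: sqlite3.Connection, device_id: str, vendor: str,
                  model: str, firmware: str, sequence: int) -> None:
    """Persist one device update in a single transaction."""
    with conn:  # commits on success, rolls back on error
        type_id = get_or_create_type(conn, vendor, model)
        conn.execute(UPSERT_DEVICE, (device_id, type_id, firmware, sequence))
```

With this shape, many devices of the same model share one `device_types` row, and an update that arrives late or twice leaves the stored record unchanged.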
This setup has allowed them to 1) achieve fault tolerance, which is key for reliable data, and 2) scale to support an increasing number of devices. They expect the Device Management Platform to keep scaling and to accommodate more workloads over time as they onboard more devices.
For full details on how building a reliable device management platform works, visit Netflix’s blog.
Build application architecture for the future
While device management is just one example, it demonstrates Netflix’s commitment to delivering an optimal customer experience regardless of the platform. With this model, not only can they adapt to changing demands, but they can scale their applications to their growing customer base.
The global M&E industry is forecast to grow substantially, with global revenue in 2028 projected to be $597 billion higher than in 2023. To support this growth, it’s wise for companies to take a page out of the Netflix playbook and build infrastructure that can scale now.
Since Netflix first started serving CockroachDB-as-a-Service, the company has implemented a number of use cases, including:
Data/ML Workflow Orchestration: Within the Maestro orchestration system, all services are stateless and can be horizontally scaled out. CockroachDB is implemented to persist workflow definitions and instance state.
Data Mesh: A data movement and real-time processing platform, in which a CockroachDB source connector has been added.
Gaming: The Gaming Team uses a 4-region cluster to ensure high availability and resilience against region failures.
These are excellent examples of how Netflix engineering teams push boundaries and lead the whole industry into the future. You can check out our RoachFest videos to hear Netflix software engineers Shengwei Wang and Ram Srivatsa Kannan share how their team provides CockroachDB-as-a-Service to Netflix developers.
If you identify with these use cases and you’re keen to learn more about CockroachDB, start with our architecture overview.