Netflix Delivery

I went to a super interesting trio of talks tonight about how the delivery teams at Netflix operate to serve the many millions of people who use the service every day. The volume is really staggering…on average, 16,000 years of content is streamed every day, and on a peak day it’s been as high as 29,000 years. Absolutely insane.

They do a number of things to make sure the site and user experience are delivered as seamlessly as possible for everyone, everywhere. The first talk was about how they introduce purposeful failures throughout the system as a method of testing. For example, they run hundreds of thousands of EC2 instances daily…they’ve built systems that deliberately fail some instances, clusters, etc. to test the fallbacks. Even beyond that, they build mirror systems into those failure injections so that they can compare results in the failing system to the normal system without affecting the user experience. Overall, the purpose is to find and flag anything that could cause issues before those issues ever reach the user.
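To make the failure-injection idea concrete for myself, here is a minimal Python sketch. It is only my own illustration, not Netflix's tooling: every name in it (`inject_failures`, `success_rate`, the fake services) is hypothetical, and the 30% failure rate is an arbitrary assumption. It compares a control group against a group with injected failures, once with a working fallback and once with a buggy one.

```python
import random


def call_with_fallback(primary, fallback):
    """Return primary(); if it raises, return fallback() instead."""
    try:
        return primary()
    except RuntimeError:
        return fallback()


def inject_failures(func, failure_rate, rng):
    """Wrap func so a fraction of calls raise, simulating a failing dependency."""
    def wrapped():
        if rng.random() < failure_rate:
            raise RuntimeError("injected failure")
        return func()
    return wrapped


def success_rate(service, fallback, requests=1000):
    """Fraction of requests that end with something to show the user."""
    served = sum(
        1 for _ in range(requests)
        if call_with_fallback(service, fallback) is not None
    )
    return served / requests


def recommendations():   # hypothetical healthy service
    return ["personalized row"]


def popular_titles():    # hypothetical working fallback
    return ["popular row"]


def broken_fallback():   # hypothetical fallback with a bug: returns nothing
    return None


rng = random.Random(42)  # seeded so the run is repeatable
failing = inject_failures(recommendations, 0.3, rng)  # 30% is an assumed rate

control_rate = success_rate(recommendations, popular_titles)  # no failures injected
good_rate = success_rate(failing, popular_titles)             # failures, good fallback
bad_rate = success_rate(failing, broken_fallback)             # failures, bad fallback

print(control_rate, good_rate, bad_rate)
```

With a working fallback the experiment group should match the control group, so the user sees no difference. With the buggy fallback the rate drops below the control, which is the kind of gap this sort of test is meant to flag before real users hit it.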

Some other quick highlights/key learnings:

Different teams run their own products: there’s no “developer team” and “DevOps team”…instead teams develop, deploy, and maintain their product from beginning to end, and they’re each responsible for making sure it works with other parts of the system
Spinnaker manages deployments automatically: there are very few manual events taking place throughout the system
Deploy regionally and test before next deployment: though automated, a lot of thought goes into the deployments to ensure nothing breaks, and that there’s time to make fixes if something does
Plan time of deployments: likewise, they don’t do deployments outside office hours…and certainly not between 7-10pm on weeknights when the service is in high use
Redundancy –> 3 regions & 3 availability zones in each: if something fails, the client can get the data it needs from another server, usually before the user even notices a problem has occurred with their initial connection
Failsafes at every level: they build resiliency into the apps to completely control the user experience…the app can get ahead of any issues before they become apparent to the user, same as the backend servers can
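
The redundancy point is easy to picture in code. Below is a rough Python sketch of a client that walks through endpoints in other zones and regions until one answers. This is my own simplification, not how Netflix's clients actually work: the region and zone names, `fetch_with_failover`, and `fake_fetch` are all made up for illustration.

```python
class EndpointError(Exception):
    """Raised when an endpoint cannot serve the request."""


def fetch_with_failover(endpoints, fetch):
    """Try each endpoint in order; return (endpoint, data) from the first success."""
    failures = []
    for endpoint in endpoints:
        try:
            return endpoint, fetch(endpoint)
        except EndpointError as exc:
            failures.append(f"{endpoint}: {exc}")
    raise EndpointError("all endpoints failed: " + "; ".join(failures))


# Hypothetical layout: 3 regions with 3 availability zones each.
REGIONS = ["region-a", "region-b", "region-c"]
ZONES = ["1", "2", "3"]
ENDPOINTS = [f"{region}-zone-{zone}" for region in REGIONS for zone in ZONES]


def fake_fetch(endpoint):
    """Stand-in for a network call; pretends all of region-a is down."""
    if endpoint.startswith("region-a"):
        raise EndpointError("unavailable")
    return f"data from {endpoint}"


served_by, data = fetch_with_failover(ENDPOINTS, fake_fetch)
print(served_by, data)
```

Here the whole first region is treated as down, so the client quietly ends up on the first zone of the second region. A real client would presumably add timeouts and smarter ordering, but the basic shape is the same: the retry happens before the user sees an error.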

That’s a huge simplification of the three talks but overall I learned a lot and found it very interesting!