Keep your cache in sync when streaming

A while ago Apple announced the iPhone 6 and the world was watching, or at least trying to. Apple's livestream performed poorly due to technical difficulties, and there has been a lot of speculation about what caused the stream to be so bad.

Some say it's because of the JavaScript on Apple's live page. It performed a JSON call every few seconds, and the server's response had a very short lifetime, making caching almost impossible. While this was indeed a bit strange and could have been done better, I don't believe it was related to the actual stream itself. The Apple TV connects to the streaming service directly, rather than loading the live page with its JavaScript, and it was offline as well.
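To get a feel for why a very short response lifetime hurts, here is a minimal sketch of a single cached response with a time-to-live in front of an origin. The class name, the TTL values and the request pattern are all made up for illustration; I don't know the actual numbers Apple used.

```python
class TTLCache:
    """Caches one response and refetches it from the origin once it expires."""

    def __init__(self, ttl_ms):
        self.ttl_ms = ttl_ms
        self.stored_at_ms = None   # time of the last origin fetch
        self.origin_hits = 0       # how often the origin had to answer

    def get(self, now_ms):
        expired = (
            self.stored_at_ms is None
            or now_ms - self.stored_at_ms >= self.ttl_ms
        )
        if expired:
            # Cache miss: go back to the origin and remember when we did.
            self.origin_hits += 1
            self.stored_at_ms = now_ms
        return "payload"


def count_origin_hits(ttl_ms, request_times_ms):
    """Replay a series of request timestamps and count origin fetches."""
    cache = TTLCache(ttl_ms)
    for now_ms in request_times_ms:
        cache.get(now_ms)
    return cache.origin_hits


# Hypothetical load: one request every 100 ms for a minute (600 requests).
requests_ms = list(range(0, 60_000, 100))

for ttl in (50, 1_000, 5_000):
    print(ttl, count_origin_hits(ttl, requests_ms))
```

With a lifetime shorter than the gap between requests, every single request ends up at the origin. Raising the lifetime to a few seconds lets the cache absorb almost all of them.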

Users refresh. A lot.

It is well known that users will refresh endlessly if things don't work. This is of course pretty heavy on your load balancers, but they should withstand it. They are designed from the ground up to route traffic, and that's all they do (OK, they also perform a little health check now and then). You can also scale up your load balancers, use latency-based routing to spread the load across the world, and much more. Then again, this architecture runs separately from the livestream, which was performing poorly on its own.

Cache out of sync?

Then where did it go wrong? I think it has something to do with the caching servers. Every 4 seconds (or so) the client asks for the next few seconds of video, and the caching servers provide them. But when the caching server you were assigned is running behind, it provides you with the wrong 4 seconds. This results in the stream constantly rolling back. All you can do then is kill the servers that are running behind and relaunch them once they are healthy again. If this happens a lot, you lose a lot of servers and are forced to prevent some users from watching at all.
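Spotting servers that are running behind could look roughly like the sketch below. It assumes every video chunk has an increasing sequence number and that each caching server can report the newest chunk it holds. Both are assumptions on my part, as are the names `CacheNode` and `find_lagging` and the tolerance of one chunk.

```python
from dataclasses import dataclass


@dataclass
class CacheNode:
    name: str
    newest_segment: int  # highest chunk number this cache can serve


def find_lagging(nodes, origin_newest, max_lag_segments=1):
    """Return the names of caches that trail the origin by too many chunks."""
    return [
        node.name
        for node in nodes
        if origin_newest - node.newest_segment > max_lag_segments
    ]


# Hypothetical snapshot: the origin has just published chunk 100.
nodes = [
    CacheNode("cache-a", 100),  # fully in sync
    CacheNode("cache-b", 99),   # one chunk behind, still tolerated
    CacheNode("cache-c", 95),   # far behind, would roll viewers back
]

print(find_lagging(nodes, origin_newest=100))
```

A small tolerance matters here: a cache that is one chunk behind is probably just about to catch up, and killing it would only shrink the pool further.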

The user refreshes and gets assigned a working server that provides them with some video. Every request for more video is routed through the load balancer and goes to a server. If that server was killed or has caching issues, the request will fail. It's therefore crucial not to let too many nodes be killed or go unhealthy, because your load balancers need to keep up.
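As a rough illustration of the routing side, here is a toy round-robin balancer that skips nodes marked unhealthy. It is not how any particular load balancer is implemented; the class and method names are invented for this sketch.

```python
class RoundRobinBalancer:
    """Hands out nodes in turn, skipping any that are marked unhealthy."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._next = 0  # index of the node to try next

    def mark_unhealthy(self, node):
        self.healthy.discard(node)

    def mark_healthy(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        if not self.healthy:
            raise RuntimeError("no healthy nodes left to serve the request")
        # At most one full lap is needed to find a healthy node.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.healthy:
                return node


balancer = RoundRobinBalancer(["cache-a", "cache-b", "cache-c"])
balancer.mark_unhealthy("cache-b")  # e.g. killed because it ran behind

print([balancer.pick() for _ in range(4)])
```

Every node taken out of rotation pushes its share of the traffic onto the ones that remain, and once the healthy set is empty there is nothing left to route to.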

Caching is key for a livestream as you want to protect your origin servers. Just as important is keeping the caching servers in sync.

To learn more about scaling from zero to millions of requests, check out the video by AWS and NASA. They used multiple 40 GB caching clusters.

[YouTube] AWS re:Invent CPN 205: Zero to Millions of Requests