How does Netflix manage to stay online when other services that use Amazon's cloud platform go down? It turns out that they employ an army of "virtual monkeys" that they let loose on the system to wreak havoc and try to kill it by terminating processes.
The basic idea is that if they keep trying to bring parts of the system down at random, every component has to be built to survive failure, so a real failure is far less likely to turn into a catastrophic outage.
They run a suite of software that launches processes which roam their network at random and cause problems. It's not a simulation: each process actually makes things fail in the production environment. One process, aptly named the "Chaos Monkey", runs at a random time between 9 am and 5 pm on weekdays, scans the environment, randomly picks a real production process, and terminates it.
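To make the schedule-and-kill loop concrete, here is a minimal Python sketch of the behaviour described above. It is not Netflix's code: the function names `pick_run_time` and `unleash_monkey`, the `terminate` callback, and the idea of passing in a set of instance names are all assumptions made for illustration.

```python
import random
from datetime import date, datetime, time, timedelta


def pick_run_time(day, rng):
    """Return a random moment between 9 am and 5 pm on `day`.

    Returns None on Saturdays and Sundays, since the monkey
    described in the post only runs on weekdays.
    """
    if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return None
    start = datetime.combine(day, time(9, 0))
    # Eight working hours, chosen to the nearest second.
    offset = rng.randrange(8 * 60 * 60)
    return start + timedelta(seconds=offset)


def unleash_monkey(instances, terminate, rng):
    """Pick one running instance at random and terminate it.

    `instances` is a hypothetical collection of instance names and
    `terminate` is a hypothetical callback that does the actual kill.
    Returns the name of the victim, or None if nothing is running.
    """
    if not instances:
        return None
    # Sort first so the choice is reproducible for a seeded rng.
    victim = rng.choice(sorted(instances))
    terminate(victim)
    return victim
```

A scheduler would call `pick_run_time` once per day, wait until the returned moment, and then call `unleash_monkey` with whatever the environment scan found. Passing `terminate` in as a callback keeps the random selection separate from the real, destructive action, so the selection logic can be exercised without killing anything.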
"The design premise there is that all of the architecture is resilient enough to retry and to begin re-serving the experience in a way that is completely transparent to the customer. You as the viewer should have no idea that the instance that was serving up your movie was just terminated."
It seems to be working well for them. Maybe more services should take up the idea to reduce outages.