There have been several useful discussion threads on the LinkedIn Site Reliability at Scale group (http://www.linkedin.com/groups?home=&gid=4200099) recently:
Is extreme high availability a bad thing?
- “…having a higher degree of PER ELEMENT failure, while allowing an architecture to get you to extremely high overall reliability, is one of the more transformative features of many cloud options…” (a back-of-the-envelope illustration of this point follows below)
- Several commenters noted that reliability is less immediately appealing to businesses than new features; but of course reliability is the slow-burning coal, whereas new features are often just so much kindling (and therefore “burn out” very quickly).
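To put rough numbers on the per-element point: with a handful of independent, individually unremarkable replicas, overall availability climbs very quickly. The Python sketch below uses illustrative figures of my own (99% per-element availability, up to four replicas), not numbers from the thread, and assumes failures are independent and any single replica can serve a request.

```python
# Rough availability arithmetic: cheaper, individually less reliable
# elements can still give very high overall availability, provided
# failures are independent and any single replica can serve traffic.
# Figures below are illustrative, not taken from the discussion.

def overall_availability(per_element: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent elements is up."""
    return 1 - (1 - per_element) ** replicas

if __name__ == "__main__":
    for n in range(1, 5):
        print(f"{n} replica(s) at 99% each -> {overall_availability(0.99, n):.6%} overall")
    # Three "two nines" (99%) replicas already reach roughly "six nines"
    # (99.9999%) overall -- assuming independence, which is the assumption
    # that usually breaks first in practice.
```

The independence assumption is, of course, the part that fails first in real systems (shared power, correlated software bugs, cascading load), which is why the architecture matters as much as the arithmetic.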
What are you using for server and network monitoring?
- Zabbix and Nagios are the usual suspects, with Splunk in the mix too (obviously not a direct comparison) and Zenoss making waves. Little input from anyone running SCOM for Windows; presumably those folks are also using Zabbix or Nagios(?!). A minimal check-script sketch follows below for readers who haven’t run these tools.
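Whatever the tool, the basic contract is similar: the monitoring server periodically runs a small check and acts on the result. As a rough illustration only (the threshold values and checked path are mine, not from the thread), here is what a minimal Nagios-style check script can look like in Python; Nagios plugins report state through exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus a one-line status message.

```python
#!/usr/bin/env python3
# Minimal sketch of a Nagios-style check script. Nagios plugins signal state
# via exit codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) and print a one-line
# status message. The checked path and thresholds here are illustrative.

import shutil
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_disk(path: str = "/", warn_pct: float = 80.0, crit_pct: float = 90.0) -> int:
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - cannot stat {path}: {exc}")
        return UNKNOWN
    used_pct = 100.0 * usage.used / usage.total
    message = f"DISK {{state}} - {path} is {used_pct:.1f}% full"
    if used_pct >= crit_pct:
        print(message.format(state="CRITICAL"))
        return CRITICAL
    if used_pct >= warn_pct:
        print(message.format(state="WARNING"))
        return WARNING
    print(message.format(state="OK"))
    return OK

if __name__ == "__main__":
    sys.exit(check_disk())
```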
Literature on website scalability
Some useful print and online resources for building reliable websites, including:
- Seven Databases in Seven Weeks
- High Performance Web Sites [although this probably needs updating now to include WebSockets, Node.js, etc.]
- Gigaspaces XAP architecture overview
- Experience with some Principles for Building an Internet-Scale Reliable System (Akamai – PDF, 160 Kb)