There have been several useful discussion threads on the LinkedIn Site Reliability at Scale group (http://www.linkedin.com/groups?home=&gid=4200099) recently:
Is extreme high availability a bad thing?
- “…having a higher degree of PER ELEMENT failure, while allowing an architecture to get you to extremely high overall reliability, is one of the more transformative features of many cloud options…” (a back-of-the-envelope illustration of this point follows below)
- Several commenters noted that reliability is less immediately appealing to businesses than new features; but of course reliability is the slow-burning coal, whereas new features are often just so much kindling (and therefore “burn out” very quickly).
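To put rough numbers on the per-element point: with a handful of independent, individually unremarkable replicas, overall availability climbs very quickly. The Python sketch below uses illustrative figures of my own (99% per-element availability, up to four replicas), not numbers from the thread, and assumes failures are independent and any single replica can serve a request.

```python
# Rough availability arithmetic: cheaper, individually less reliable
# elements can still give very high overall availability, provided
# failures are independent and any single replica can serve traffic.
# Figures below are illustrative, not taken from the discussion.

def overall_availability(per_element: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent elements is up."""
    return 1 - (1 - per_element) ** replicas

if __name__ == "__main__":
    for n in range(1, 5):
        print(f"{n} replica(s) at 99% each -> {overall_availability(0.99, n):.6%} overall")
    # Three "two nines" (99%) replicas already reach roughly "six nines"
    # (99.9999%) overall -- assuming independence, which is the assumption
    # that usually breaks first in practice.
```

The independence assumption is, of course, the part that fails first in real systems (shared power, correlated software bugs, cascading load), which is why the architecture matters as much as the arithmetic.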
What are you using for server and network monitoring?
- Zabbix and Nagios are the usual suspects, with Splunk in the mix too (obviously not a direct comparison) and Zenoss making waves. Little input from anyone running SCOM for Windows; presumably those folks are also using Zabbix or Nagios(?!). A minimal check-script sketch follows below for readers who haven’t run these tools.
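Whatever the tool, the basic contract is similar: the monitoring server periodically runs a small check and acts on the result. As a rough illustration only (the threshold values and checked path are mine, not from the thread), here is what a minimal Nagios-style check script can look like in Python; Nagios plugins report state through exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus a one-line status message.

```python
#!/usr/bin/env python3
# Minimal sketch of a Nagios-style check script. Nagios plugins signal state
# via exit codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) and print a one-line
# status message. The checked path and thresholds here are illustrative.

import shutil
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_disk(path: str = "/", warn_pct: float = 80.0, crit_pct: float = 90.0) -> int:
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - cannot stat {path}: {exc}")
        return UNKNOWN
    used_pct = 100.0 * usage.used / usage.total
    message = f"DISK {{state}} - {path} is {used_pct:.1f}% full"
    if used_pct >= crit_pct:
        print(message.format(state="CRITICAL"))
        return CRITICAL
    if used_pct >= warn_pct:
        print(message.format(state="WARNING"))
        return WARNING
    print(message.format(state="OK"))
    return OK

if __name__ == "__main__":
    sys.exit(check_disk())
```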
Literature on website scalability
Some useful print and online resources for building reliable websites, including:
- Seven Databases in Seven Weeks
- High Performance Web Sites [although this probably needs updating now to include WebSockets, Node.js, etc.]
- Gigaspaces XAP architecture overview
- Experience with some Principles for Building an Internet-Scale Reliable System (Akamai – PDF, 160 Kb)