I’ve just returned from UK Scale Camp 2010 (@scalecampuk), organised by The Guardian (and the indefatigable Michael Brunton-Spall). Here are some notes:
I liked the “unconference” format (no formal programme; attendees vote for their favourite sessions in advance), and ended up in four of the many sessions:
- DevOps on Windows
- Log Analysis for Search Results
- DB Changes without Downtime
- Handling Errors at Scale
DevOps on Windows
The Puppet guys are not really interested in Windows at the moment. OpsCode Chef is slightly better on Windows than Puppet, although Puppet seemed to be better than Chef in general (religion alert!). PowerShell is a bit clunky; is there a need for a nicer language (DSL?) running on top of PowerShell? rPath and Anthill are useful Windows-targeted tools.
Log Analysis for Search Results
The Guardian uses Solr for search. A key thing is to get the categories correct, and to use a linear equation for scoring results. Make sensible use of editorial results to prime the result set, and use human experts to help validate the search results.
When choosing a log file format, note that line-terminated is by far the easiest format to work with. Hadoop and Dumbo are great for analysis, along with Pig.
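To make the "linear equation" idea concrete, here is a minimal sketch of a weighted-sum score with an editorial boost. The field names, weights and boost value are all made up for illustration; this is not The Guardian's formula or anything built into Solr.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    title: str
    text_match: float      # relevance from the search engine, assumed scaled to 0..1
    recency: float         # assumed 0..1, newer is higher
    category_match: float  # assumed 1.0 if the category matches the query, else 0.0
    editorial_pick: bool = False

# Hypothetical weights; in practice these would be tuned with human experts.
WEIGHTS = {"text_match": 0.6, "recency": 0.1, "category_match": 0.3}
EDITORIAL_BOOST = 1.0  # hypothetical: lifts editor-chosen results to the top

def score(doc):
    """Linear score: a weighted sum of features, plus a flat editorial boost."""
    total = sum(weight * getattr(doc, name) for name, weight in WEIGHTS.items())
    if doc.editorial_pick:
        total += EDITORIAL_BOOST
    return total

def rank(docs):
    """Return the documents ordered from highest to lowest score."""
    return sorted(docs, key=score, reverse=True)
```

Because the score is a plain sum, it is easy to see why any given result ranked where it did, which helps when editors are validating the output.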
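The appeal of a line-terminated format is that each record can be handled on its own, which is also what makes it a good fit for map/reduce tools. A rough sketch, assuming a hypothetical tab-separated layout of timestamp, query and result count (the sample lines are invented):

```python
from collections import Counter

def top_queries(lines, n=3):
    """Count search queries in a line-per-record log and return the n most common."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines rather than failing the whole run
        _timestamp, query, _result_count = fields
        counts[query.strip().lower()] += 1
    return counts.most_common(n)

# Invented sample data in the assumed layout.
sample = [
    "2010-01-01T10:00:00\tscale camp\t12\n",
    "2010-01-01T10:00:05\tSolr\t40\n",
    "2010-01-01T10:00:09\tScale Camp\t12\n",
    "garbage line\n",
]
print(top_queries(sample))
```

The same per-line logic could sit inside a mapper for a Hadoop job, since a bad line only ever costs you one record.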
DB Changes without Downtime
Russ Garrett gave a whirlwind tour of some cool features of PostgreSQL allowing DB schema changes on-the-fly:
- In general, PostgreSQL is better than MySQL for non-blocking schema changes; PostgreSQL schema changes are generally O(1).
- Make sure you separate DB schema changes from code and data updates.
- Know how long certain operations will take, using production-size data.
- Commercial automated DB migration systems “are dangerous”: too black box.
- Have a picture of the ideal schema, and work towards this (the big picture).
- Have a DB “owner” in charge of the schema/DB.
- The data model should have some separation from the application needs.
- After a schema change, just dump the whole production schema back into version control.
- Treat the database as more of an organism than just a black box of data.
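For the "dump the schema back into version control" step, something like the following could work. It is only a sketch: the database name and output path are placeholders, and it assumes `pg_dump` is on the PATH with connection details coming from the environment.

```python
import subprocess

def schema_dump_command(dbname, outfile):
    """Build a pg_dump command that writes only the schema (no data) to outfile."""
    return ["pg_dump", "--schema-only", "--no-owner", "--file=" + outfile, dbname]

def dump_schema(dbname, outfile):
    """Run the dump; raises CalledProcessError if pg_dump exits non-zero."""
    subprocess.run(schema_dump_command(dbname, outfile), check=True)

# Hypothetical usage after a schema change, before committing the file:
# dump_schema("production_db", "db/schema.sql")
```

Committing the resulting file after each change gives you a diffable history of what the production schema actually looked like.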
Handling Errors at Scale
Andrew Betts of Assanka talked about handling errors and how to track the same kind of error over time (using a masked hash of the erroring line). The group also suggested several tools to help with monitoring and logging: Splunk, Loggly, Zabbix, Nagios, Scribe. The “pushing a Session ID back from the front-end” approach taken by Andrew was an interesting addition to the error tracing using GUIDs which I have used before.
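My reading of the "masked hash" idea is: replace the parts of an error line that vary between occurrences (numbers, addresses, quoted values) with placeholders, then hash what is left, so that repeats of the same error share one fingerprint. The masking rules below are my own guesses for illustration, not Andrew's actual implementation.

```python
import hashlib
import re

# Assumed masking rules, applied in order: hex addresses, quoted strings, then digits.
_MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"'[^']*'|\"[^\"]*\""), "<STR>"),
    (re.compile(r"\d+"), "<NUM>"),
]

def error_fingerprint(message):
    """Mask the variable parts of an error line and return a short stable hash."""
    masked = message
    for pattern, token in _MASKS:
        masked = pattern.sub(token, masked)
    return hashlib.sha1(masked.encode("utf-8")).hexdigest()[:12]

# Two occurrences of the same kind of error map to the same fingerprint.
first = error_fingerprint("Timeout after 30s fetching user 1234 at 0x7ffe01")
second = error_fingerprint("Timeout after 45s fetching user 99 at 0xdeadbeef")
print(first == second)
```

Counting fingerprints per hour or per day then gives a simple way to see whether a particular kind of error is growing or fading.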