monitoring

Tune logging levels in Production without recompiling code

Posted on December 5, 2012 (updated April 24, 2013) by Matthew Skelton (@matthewskelton)

This article first appeared in Software Development Practice, Issue 1, published by IAP (ISSN 2050-1455)

Abstract

When raising log events in code it can be difficult to choose a severity level (such as Error, Warning, etc.) which will be appropriate for Production; moreover, the severity of an event type may need to be changed after the application has been deployed based on experience of running the application. Different environments (Development (Dev), User Acceptance Testing (UAT), Non-Functional Testing (NFT), Production, etc.) may also require different severity levels for testing purposes. We do not want to recompile an application just to change log severity levels; therefore, the severity level of all events should be configurable for each application or component, and be decoupled from event-raising code, allowing us to tune the severity without recompiling the code.

A simple way to achieve this power and flexibility is to define a set of known event IDs by using a sparse enumeration (enum in C#, Java, and C++), combined with event-ID-to-severity mappings contained in application configuration, allowing the event to be logged with the appropriate configured severity, and for the severity to be changed easily after deployment.
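
As a rough illustration of the idea (a minimal sketch, not the code from the article itself), the Java below pairs a sparse enum of event IDs with severities read from configuration. The event names, the numeric codes, the `event.<code>` property keys, and the `SeverityConfig` class are all hypothetical and chosen only for this example; a real application would load the properties from its own configuration source and pass the resolved severity to its logging framework.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

public class ConfigurableSeverityDemo {

    enum Severity { DEBUG, INFO, WARNING, ERROR }

    // Sparse enumeration: gaps between codes leave room for new events
    // in each functional area (all names and codes here are hypothetical).
    enum EventId {
        DATABASE_CONNECTION_FAILED(1000),
        DATABASE_QUERY_SLOW(1010),
        CACHE_MISS(2000),
        PAYMENT_DECLINED(3000);

        final int code;

        EventId(int code) { this.code = code; }
    }

    // Holds the event-ID-to-severity mappings read from configuration.
    static final class SeverityConfig {
        private final Map<EventId, Severity> overrides = new EnumMap<>(EventId.class);
        private final Severity fallback;

        SeverityConfig(Properties props, Severity fallback) {
            this.fallback = fallback;
            for (EventId id : EventId.values()) {
                // Assumed key format: event.<numeric code>=<severity name>
                String value = props.getProperty("event." + id.code);
                if (value != null) {
                    overrides.put(id, Severity.valueOf(value.trim().toUpperCase(Locale.ROOT)));
                }
            }
        }

        // Events with no configured mapping fall back to the default severity.
        Severity severityOf(EventId id) {
            return overrides.getOrDefault(id, fallback);
        }
    }

    // The event-raising code names only the event; severity comes from config.
    static String format(SeverityConfig config, EventId id, String message) {
        return config.severityOf(id) + " [" + id.code + " " + id + "] " + message;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a per-environment configuration file.
        Properties props = new Properties();
        props.load(new StringReader("event.1000=ERROR\nevent.2000=DEBUG\n"));

        SeverityConfig config = new SeverityConfig(props, Severity.INFO);
        System.out.println(format(config, EventId.DATABASE_CONNECTION_FAILED, "primary unreachable"));
        System.out.println(format(config, EventId.PAYMENT_DECLINED, "card refused"));
    }
}
```

Because the calling code refers only to an `EventId`, changing `event.1000=ERROR` to `event.1000=WARNING` in the configuration changes how the event is reported without touching or recompiling the code that raises it.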

Fault tolerance, anomaly detection, and anticipation patterns by John Allspaw at QConLondon 2012

Posted on March 21, 2012 (updated May 15, 2012) by Matthew Skelton (@matthewskelton)

John Allspaw (@allspaw) from Etsy talked about the role that Anomaly Detection, Fault Tolerance and Anticipation play in producing highly scalable software systems (Fault tolerance, anomaly detection, and anticipation patterns, slides [PDF, 5MB]).

As head of technical operations at Etsy, whose web traffic is pretty substantial, John focused on resilience in software systems: what it is, and how to achieve it.

Site Reliability at Scale – Discussion Roundup

Posted on February 5, 2012 (updated April 24, 2013) by Matthew Skelton (@matthewskelton)

There have been several useful discussion threads on the LinkedIn Site Reliability at Scale group (http://www.linkedin.com/groups?home=&gid=4200099) recently:

UK Scale Camp 2010 – Braindump

Posted on December 10, 2010 (updated April 24, 2013) by Matthew Skelton (@matthewskelton)

I’ve just returned from UK Scale Camp 2010 (@scalecampuk), organised by The Guardian (and the indefatigable Michael Brunton-Spall). Here are some notes:

Overview

I liked the “unconference” format (no formal programme; attendees vote for their favourite sessions in advance), and ended up in four of the many sessions:

DevOps on Windows
Log Analysis for Search Results
DB Changes without Downtime
Handling Errors at Scale

You are invited to ScaleCamp 2010

Posted on November 19, 2010 (updated April 24, 2013) by Matthew Skelton (@matthewskelton)

Very pleased to receive this email today:

From: Michael Brunton-Spall
Sent: 19 November 2010 16:07
To: Matthew Skelton
Subject: You are invited to ScaleCamp 2010 – 10th December at the Guardian offices, London

Hey,

We are so pleased to be able to invite you to Scale Camp 2010 on the 10th December at the Guardian Offices here in London.

I’m looking forward to some great conversations and debate, particularly around DevOps and how that can contribute to scaling a software platform.
