Notes on ‘team responsibilities in cloud-native operations’ (Pete Mounce)

Summary:  Pete Mounce (@petemounce) from Just Eat gave a compelling talk at the London Continuous Delivery meetup group on ‘team responsibilities in cloud-native operations’. I found the talk hugely engaging, with loads of detail applicable to many organisations. Here are my notes from the meetup.

I captured my notes as slides:

Update: the video of Pete’s talk is here on Vimeo:

Pete Mounce video frame

There were several specific points made by Pete that were interesting for me:

Continue reading

Roundup: Patterns for Performance and Operability

Patterns for Performance and OperabilityI recently posted a review of Patterns for Performance and Operability by Ford et al on the SoftwareOperability website. I think that this book is exceptionally useful in its treatment of both performance and operability, and anyone who cares about how well software works in Production should buy and read a copy (there are paper and eBook editions).

Two other reviews might be useful too: my colleague Anant East (Head of Architecture and Infrastructure, wrote up a detailed review of Patterns for Performance and Operability on the tech blog at, and I posted a short review on Amazon.

Who Owns My Operability?

Matthew Skelton (@matthewpskelton):

Includes recommended reading for software operability:

Originally posted on Software Operability:

Operability is not something which can be ‘bolted on’ or retrofitted to software after it goes live; we need to design and build our software with operability as a first-class concern. You don’t build a bridge, then try to add load-bearing capabilities at the end of the project — but most software projects try to do exactly that, typically with costly results.

Ultimately, the product owner should be responsible for ensuring that operational requirements are prioritized alongside end-user features. If you are responsible for the software product or service, there is only one answer to the question

Who Owns My Operability?

Who Owns My Operability?

Update: the site now shows selected recommended reading on each page load.

(With a nod to

View original

Festive Graphite Line Art for the Masses

You have an installation of Graphite, a desire to learn more Ruby, and some festive spirit – what emerges?

An xmas tree drawn using proper metrics via the Ruby graphite gem, of course!

Xmas tree plotted using Graphite

The magic happens with the Graphite::Logger class, because we can log metrics at specific points in time:

logger.log(when,{"Branches1" => 3})

I calculated the points to plot on paper, and found decent Graphite rendering settings by experimentation:

Planning the Graphite Xmas tree

The code is on Github here: – fork away!

I’d love to hear or see any suggestions for improvements to the script. Candles? Snowflakes? Reindeer?! Also, as my Ruby-fu is limited, if there are better ways of interacting with Graphite, I’d love the hear about them (I tried and failed to get activesupport to work on my machine, for example).

Tune logging levels in Production without recompiling code

IAP Software Development Practice JournalThis article first appeared in Software Development Practice, Issue 1, published by IAP (ISSN 2050-1455) 


When raising log events in code it can be difficult to choose a severity level (such as Error, Warning, etc.) which will be appropriate for Production; moreover, the severity of an event type may need to be changed after the application has been deployed based on experience of running the application. Different environments (Development (Dev), User Acceptance Testing (UAT), Non-Functional Testing (NFT), Production, etc.) may also require different severity levels for testing purposes. We do not want to recompile an application just to change log severity levels; therefore, the severity level of all events should be configurable for each application or component, and be decoupled from event-raising code, allowing us to tune the severity without recompiling the code.

A simple way to achieve this power and flexibility is to define a set of known event IDs by using a sparse enumeration (enum in C#, Java, and C++), combined with event-ID-to-severity mappings contained in application configuration, allowing the event to be logged with the appropriate configured severity, and for the severity to be changed easily after deployment.

Continue reading

Fault tolerance, anomaly detection, and anticipation patterns by Jon Allspaw at QConLondon 2012

Jon Allspaw (@allspaw) from Etsy talked about the role that Anomaly Detection, Fault Tolerance and Anticipation play in producing highly scalable software systems (Fault tolerance, anomaly detection, and anticipation patterns, slides [PDF, 5MB]).

As head of technical operations at Etsy, whose web traffic is pretty substantial, Jon focused on resilience in software systems: what it is, and how to achieve it.

QConLondon 2012 blog posts

See all QConLondon 2012 blog posts…

Continue reading

Site Reliabililty at Scale – Discussion Roundup

There have been several useful discussion threads on the LinkedIn Site Reliability at Scale group ( recently:

Continue reading