Matthew Skelton delivered a talk titled “Practical ways to increase Operability within Continuous Delivery” at Continuous Lifecycle London 2019. The co-author of the “Team Guide to Software Operability” book shared practical experience gained from working on operability for many years.
Skelton says that if we are thinking about software services that need to last an extended period of time, we have to invest in the operability of our systems, not just push out the latest features that users can see. Operability is about optimizing for the long-term customer experience and the viability of the service, instead of short-term feature delivery.
According to Skelton, a modern approach to logging is one of the main techniques for increasing operability. We need to focus on the operator experience and carefully choose the things that need to be logged - logging is part of designing the operability of the system, and we have to make sure that the experience of the people running the system is first class so that they can do their jobs.
Apart from modern logging, Matthew Skelton made a case for using Run Book dialogue sheets, HTTP health checks, Correlation IDs and User Personas.
You can hear the full talk and read the full transcript below.
Slides
Transcript of the talk:
What is Operability and why do we care
Matthew Skelton:
Good morning, folks. Hi. Welcome. My name is Matthew Skelton. I'm from Conflux. And I'd like to share with you today some practical experience of increasing operability within a Continuous Delivery context. So today's talk will look a bit like this. It's like a sandwich: there'll be an experience report, then some practical stuff in the middle, and then more experience report at the end. Unfortunately, I've not brought any biscuits with me today, so don't expect to get any cookies, but the talk will look a bit like that.
By operability I mean, very simply, making software work well in production. And operability involves lots of things like these: diagnosing problems, clearing down data, reporting on whether things are going well or going badly, monitoring things, securing things, inspecting the state of things, making sure we can deal with failover, scaling things out, scaling things down, these kinds of things. These are the kinds of things that users, quite rightly, don't particularly care about - until the application fails. Until they can't use the application, and then they care very, very much.
I've been working on operability for quite a number of years, and I have come to see operability in these terms: we're really optimizing for the long-term customer experience and the viability of the service, instead of short-term feature delivery. So when we're thinking about software services that are going to last for a long time, or that we want to make sure are available to customers over an extended period, that's when we really have to be investing in the operability of our systems, not just pushing out the latest features that users can see. And effectively, it helps us to make the spend or revenue associated with our software much more sustainable. It also makes things sustainable for the human beings. So if you've got humans on call, working a pager, or looking after the systems overnight, their lives are more sustainable, if you like, because we've addressed these operational concerns much sooner. And for the organization that we're building these systems for, the outcomes are more predictable. And predictability, more than pure speed, is often what we actually need to be optimizing for in many of these situations.
Continuous Delivery Context
Matthew Skelton:
A Continuous Delivery context is how I've been working for many years. What do we mean by that? Well, I take as my starting point this book here by Dave Farley and Jez Humble, Continuous Delivery, published in 2010 and still utterly relevant today. Yes, they wrote the book prior to large-scale containerization, but all the techniques in here are still completely relevant to how we build software systems today. In fact, I challenge you to find a technique in the book that is not relevant in your context. For Continuous Delivery to work, we need good engineering practices. We need fast feedback from deployment pipelines. That's effectively what a deployment pipeline is for - yes, it's for deploying stuff, but it's there to give us a kind of sensing mechanism for how well our software is actually working. It's like an extra pair of eyes, or a pair of insect antennae.
It helps us to sense what's going on with how we're building the software. We also need to realign our software architecture to enable us to do this rapidly - so things like microservices; that's not the only thing we can do, but that kind of approach. And we also need to make sure that we have team ownership of different bits of software. We don't have multiple people fighting to update the same stuff; we've got very clear ownership boundaries within our organization, or within the people who are building the systems. Lots of people will know all this stuff, and that's fine. If it's new, then there are some useful things here for you to take away and consider. But these are the main points of my talk today. I actually wrote a book a few years ago called Continuous Delivery with Windows and .NET. Even if you're not in a Windows and .NET world, though, this book is actually quite useful. It was picked up by the course leader at a university in London called UCL.
And it's actually one of the key texts for the master's course in Software Engineering at that university, because the way in which we explain Continuous Delivery concepts is very straightforward and very transparent, and much of it is not really tied to .NET. So if you've got colleagues who you want to bring on board to this way of thinking, then you can download a copy of this book. It's actually on O'Reilly Safari, so if you've got access to that, you've got access to the book. You can also buy a printed copy: if you go to the website listed for Continuous Delivery with Windows and .NET, you can order one or more printed copies of the book. It's really useful for getting your colleagues in different departments to understand things more quickly. Okay. So here's the first biscuit part of the biscuit sandwich: a little experience report.
Experience Report - GOV.UK departmental division
Here's what I've been doing recently, to put some of these techniques in context. For most of 2018, I was engineering lead at a large department in the UK Government.
There were just over 700 people in total, depending on how you counted it. We had about 70 teams across several different locations. And because of some kind of event that's happening to the UK at the moment, which you may have heard of, we had some fairly time-critical delivery of software. So we were working in that kind of space; it's kind of a complicated environment. We needed to increase the speed and safety of delivery - pure speed is no use in that kind of context, where we're dealing with quite sensitive data that affects people's lives. We have to make sure we're optimizing for both speed and safety at the same time. This is a multi-year program of work - some of that work could be going for many, many years - with many, many teams involved. It's important to track and control infrastructure costs. The technology landscape is fairly standard stuff. They'd recently moved from traditional data centers, VMs and so on to more of a Kubernetes-based, container-based approach. So, alongside Conflux, there was another company called Axiologik - there were loads of other organizations involved as well - and we worked in partnership on our particular area.
And so, some of the dynamics that were in play at the time. The number of people and the size of the software system to be built were about seven times what they were just a few years before. So some of the dynamics are going to be different, because the same rules don't apply when things are nearly an order of magnitude larger. We had to spread awareness of effective software practices across these many hundreds of people. And we were crossing some internal boundaries and some external boundaries - internal boundaries across different departments, and external boundaries to different suppliers outside providing data, different countries and different private organizations supplying data into the systems. There were lots of different viewpoints on what Continuous Delivery actually means. This is why I mentioned the book before, because that's always my starting point. But lots of people think it's just about pushing things out many times a day, which might be one outcome of Continuous Delivery but it's not the aim, and so on.
How many people currently work in organizations where there's a strong drive to have just one way of doing everything? Raise your hand if you're in that unfortunate position now. Oh, not so many. Okay. So we had to contend with that kind of way of thinking in some quarters, to move towards a situation where we could actually explore multiple different ways of doing things. It's important to define the platform. We heard in the keynote this morning, from [Jobida 00:09:27], about Kubernetes being a kind of platform for building platforms. And one crucial thing that we found, as in many places, is that some people may have very different views about what the platform is, or whether a platform actually exists in that particular context. So being very clear about what the platform is that you're building on, and where that boundary sits, is incredibly important. So we did quite a bit of work around that, and various other things like improving logging quality and making the delivery model work with different suppliers and so on.
I want to work through a few things that we did to help in this context. We moved from an operating model that was very siloed, with lots of different organizational groups, into something that was more optimized for a flow of change through to production. That's still ongoing. A really important thing that relates to operability, then, is the operator experience. What is it like for people in the ops team, or the live services team, or the system support team - whatever you call it - for people who are on call or who are responsible for the live, production services? What is their experience? Actually assess what their experience is. The great thing these days with agile techniques like UX, user experience, is that we know how to do that. We don't have to invent it in the context of operability. We use the existing standard UX techniques, which we'll see in a little bit, to assess the experience of people who are actually running and supporting these systems.
And it's amazing how much progress you can make by starting with the experience of the people who are currently having to run these systems or look after them, and asking them: how nice is it, or how awful is it, for you to work with the software? If you can get those people on board, which is kind of what we did, it can be transformational in terms of the effect it has on enabling the software to run well. There were lots of people on site. We also did two really important things, I think - Tonya mentioned aspects of these in her keynote yesterday. We ran two related, slightly different groups every week. One was called the Guild, which was about sharing knowledge across multiple teams, quite technical. But it could also be people giving talks related to stuff they were doing on the program, or stuff they'd been doing outside - if someone was playing around with a Raspberry Pi, for example, or they'd built some kind of, I don't know, whatever.
They were tinkering around with something, they'd built a racing car or something at home - they'd come and do a talk on that, because we'd learn something about engineering techniques from people's own projects. And that was coupled with what we called the Engineering Working Group, which was a group of people who had a strong interest in things like operability and testability, coming together to steer the focus across all the teams. What do we need to improve across the board? Is logging working well? Exactly what's working well and what's not working well? When should we introduce a new metrics platform, for example Prometheus? When's the right time to do it? Who do we need to ask? Who do we need to get on board to do it? And so we ran each of these groups every two weeks, so they kind of overlapped. One week it was the Engineering Guild, then it was the Engineering Working Group, and back to the Engineering Guild again - that kind of cycle of talking about stuff and sharing knowledge.
But also coming together and really arguing and discussing what we needed to focus on next was a really good driver - a good way of driving effective attention to detail for things like operability. There were some weekly lunchtime tech talks as well. I spent some time helping to explain Continuous Delivery to people who don't write code, because it's quite important to make sure people are all in the same headspace. There's a set of slides online, which you're welcome to download and use if you want. The slides I'm presenting now will be online later, so you don't need to read the eight-point font that's at the bottom of the screen right now. But it's quite important to take non-techies through things like Continuous Delivery, operability, and how we do testing in this kind of context, because it's very, very different for many of these people. And we also introduced and helped to promote some team-first operability techniques, which I'm going to share with you now. Any questions about this first part? Does it all make sense?
5 Operability Techniques
1 - modern event-based logging
Matthew Skelton:
So the first technique that we spent quite a bit of time embedding was a modern approach to logging. I actually did a talk on these five operability techniques at Continuous Lifecycle last year, so there's a video and slides that expand on this middle section quite a bit more. If you want more detail on this middle section, have a look online on the Continuous Lifecycle site and you'll find more stuff. But I just wanted to call these out. Modern, event-based logging: there are some tools and services we need to use. We need to have a log aggregation system that brings all logs in from multiple machines. We should probably be using structured logging these days - so it's probably logging with JSON, where events are all tagged, and all that stuff. That's all fine as mechanics. The intent, at a human-systems level, of what I call modern logging is important as well. And that is: the way in which we communicate with other teams about what we are going to log is actually an important part of design.
So remember, these ops people, or the live services team, or the ITOps people, whatever you call them, have to run the systems. We make sure there's collaboration between the application development teams and the people running the systems on what to log. So we asked these people: "What do you want to see in the logs? What alerts do you expect to see? What's interesting from your point of view? What are the interesting events that happen in the software that we actually care about, and that we then need to raise as a log event?" Because if these poor people in ITOps or support are in a position where they just have pages and pages and pages, millions and millions of events in the logs that are effectively meaningless, it's impossible for them to do their jobs. Their operator experience at that point is terrible. So we focus on the operator experience and say, "What kinds of things are important?" We'll only log those things. Or we'll log them in a way which allows us to search and filter things effectively.
So in this particular context here, this is, let's say, an e-commerce shopping application, something like this. So we've got a basket, we've got "basket item added", "basket item removed". And we're using - what's the word - enums; my mind went blank. We were using enums here, in Java or .NET, or the equivalents in things like Python, to make sure that we've got unique, human-readable identifiers for interesting events. And if it's not interesting to the people who are going to run the system, then why are we logging it? Certainly, why are we logging it at an info or warning or error level, where they can see it? The people who run the system shouldn't be seeing non-interesting events. So logging is no longer about spewing stuff out that might be useful. Logging is part of the design of the operability of the system, because we have to make sure that the experience of the people running the system is first class, to enable them to do their jobs.
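To make that concrete, here is a minimal sketch in Python (the talk mentions Java, .NET and the equivalents in Python) of enum-based event identifiers feeding structured JSON logs. The event names echo the basket example above; the BasketEvent class, the log_event helper and the field names are illustrative assumptions, not code from the talk.

```python
# Minimal sketch only - event names, helper and field names are illustrative, not from the talk.
import json
import logging
from enum import Enum


class BasketEvent(Enum):
    """Unique, human-readable identifiers for the events the operators said they care about."""
    BASKET_ITEM_ADDED = "BasketItemAdded"
    BASKET_ITEM_REMOVED = "BasketItemRemoved"
    BASKET_CHECKOUT_FAILED = "BasketCheckoutFailed"


logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("shop.basket")


def log_event(event: BasketEvent, level: int = logging.INFO, **fields) -> None:
    """Emit one structured (JSON) log line, tagged with the human-readable event ID."""
    logger.log(level, json.dumps({"event_id": event.value, **fields}))


# The operators asked to see when items are added to or removed from a basket:
log_event(BasketEvent.BASKET_ITEM_ADDED, basket_id="b-123", sku="sku-42")
log_event(BasketEvent.BASKET_ITEM_REMOVED, basket_id="b-123", sku="sku-42")
```

Anything that isn't interesting to the people running the system simply never becomes an event in that enum, so it never shows up at info level or above in their log aggregation tooling.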
So collaboration on the nature and exact details of what we log becomes a really important kind of DevOps interaction between these two teams. And actually, in this specific case where I was working, due to some poor choices from some of the teams who had deployed some software just before I started, the ITOps people believed at that point that the Elastic stack - Elasticsearch, Logstash and so on - that the ELK stack was absolutely terrible, and they never wanted to see it again. That was slightly strange, I thought: "Well, why, why is that?" And it was exactly because of that reason. We had millions and millions and millions of log events, so it was impossible for them to actually do their job properly, to respond to problems with the software. So we went through and kind of rethought all of this stuff around logging. And we found a nice-sized, fairly small service to use this new approach on.
And we worked from the start with the ITOps people, the live services people, and said: "This will be your experience with this new way of logging using ELK. Don't worry. Here's what your experience will be - work with us, just have some belief in us, come along with us for a little while and we'll show you that it will be better." And when that particular service was finally deployed, they stood up by themselves at one of the Engineering Guild or tech talk sessions and said, "This is awesome. This is exactly what we want. Everyone else, please go ahead and implement this." So they changed from hating the logging solution to absolutely loving it. It's the same technology; the intent and the way in which the collaboration happened were different.
2 - Run Book dialogue sheets
Matthew Skelton:
So the second technique is this one here: Run Book dialogue sheets. This is an A1-sized sheet, which you put on a table. And you get the team around the room, or possibly two teams - you might get a team of application developers and a team of infrastructure or platform people.
And you lock them in the room until they've filled in the whole sheet. I'm just kind of joking, right? You don't actually lock the door. But what's on the sheet is a whole load of criteria for operability, effectively - things that you will definitely need, or someone will need, to deal with if this system is going to work well. None of it is magic, right? These are all things that we're very familiar with. What's the service level? Who is the service owner? How is it going to fail over? What happens about patching? All really standard stuff that lots of people - infrastructure and ops people - have known for a long time. The magic here is that it's on a team-sized sheet, for the people in the room, everyone with a Sharpie or a marker pen, and they can fill in little bits of information. If they don't know the answer - if you don't know who the service owner is - put a question mark. Identify it. Like Tonya said yesterday, "Write it down." Be very, very explicit about your decisions and about where you have gaps in the knowledge.
And so this is a vehicle for very good conversations about how the system is going to operate. If you've got gaps here, something might not work when you deploy to production. Because we're sponsoring conferences this year, we actually have some of these sheets on our sponsor table. So if you're interested, come along and grab one of these sheets and take it with you. I can see there are one or two people in the room who have already taken them - that's great. And all of this stuff is Creative Commons, open source; as of yesterday the GitHub repo had 168 forks. So there are at least 168 people who found it useful and have taken it and evolved it internally. So if we're missing something on one of the headings, one of the criteria for operational concerns, send us a pull request and we'll add it. This is fairly standard now.
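The sheet itself is a physical A1 template from runbooktemplate.info rather than code, but as a purely illustrative sketch, here is how a team might capture the outcome of a session digitally so the question marks don't get lost. The headings echo criteria mentioned above; the values and structure are assumptions.

```python
# Illustrative only: capturing Run Book dialogue sheet answers after the session.
# The headings echo criteria mentioned in the talk; "?" marks a knowledge gap to chase up.
run_book = {
    "service_owner": "?",                                 # nobody in the room knew - write it down
    "service_level": "99.5% availability, on-call 24x7",  # example values, not real ones
    "failover": "active/passive across two regions",
    "patching": "OS patches applied monthly by the platform team",
    "data_clear_down": "?",
}

gaps = [heading for heading, answer in run_book.items() if answer == "?"]
print(f"Operational knowledge gaps to resolve before go-live: {gaps}")
```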
3 - endpoint healthchecks
Matthew Skelton:
So the third technique to share is: make sure that each endpoint, each thing that's running, has got an HTTP-based health check endpoint.
This is kind of built in at the Kubernetes level. But if you're not running that, just make sure that everything running has some sort of HTTP health check. Even if you've got something like a database, which doesn't have that natively, just put a little helper service in front of it, because it means that we can build these kinds of dashboards and report on the health of an environment very, very quickly. Literally, you can get a little dashboard app off GitHub, git clone it, edit the config file to point it at an environment, within literally four minutes - that's the quickest time I've done it - if you have HTTP-based status endpoints running for each runnable thing that you've got in your environment. It's very, very powerful, and lots of monitoring applications support it implicitly or as a default. So it's a really useful thing to make sure you bake in. If you do it from the beginning it's very straightforward, and even retrofitting it is not particularly hard. Each service or component or runnable thing decides on its own health and reports it: yes, I'm healthy; no, I'm not healthy.
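Here is a minimal sketch of what such a health check endpoint can look like, using only Python's standard library. The /health path, the port and the checks inside service_is_healthy are assumptions; the same pattern works as the little helper service placed in front of something like a database.

```python
# Minimal sketch of an HTTP health check endpoint - path, port and checks are assumed.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def service_is_healthy() -> bool:
    """Each service decides on its own health - e.g. can it reach its database or queue?"""
    return True  # replace with real checks (DB ping, queue depth, disk space, ...)


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            healthy = service_is_healthy()
            body = json.dumps({"status": "healthy" if healthy else "unhealthy"}).encode()
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    # Dashboards and monitoring tools can now poll http://<host>:8080/health.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```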
4 - Correlation IDs and traces
Matthew Skelton:
The fourth practical technique is to use correlation IDs. So in the particular place I was working last year, which I talked about earlier, this was already happening, this was already in place - apart from the fact that in one or two cases a different identifier was used as the correlation identifier. I'll explain what I mean by that in a minute. So if you are shipping a parcel - this is from a parcel shipping service, like a web page from a parcel shipping company - you're sending a parcel and you want to track it. So you've got the tracking identifier, and the parcel has gone through various different states: it's been collected from the source, it's been shipped through various delivery centers, and it's arrived at the destination. And the thing that ties all that together is this tracking ID. So this is the equivalent of what we need to do in our software: when a new request comes in, we assign it what we call a correlation ID or tracking ID, and then at every hop through the system we make sure we log that and pass it downstream to the next component.
And that allows us to reassemble the request and see all the machines or nodes or containers that the request has gone through. It's fairly standard these days, but if it's something you're not doing, go and investigate it, because it's incredibly useful as a diagnostic technique. Another reason why it's important from an operability point of view is that it helps to facilitate interesting conversations between different teams. You have to agree on an identifier that represents "this is the correlation ID". So you have to choose that and make sure you're using the same identifier when you're logging in different services, different parts of the system. If you use this technique well, you can use it to generate useful conversations between different teams on how services written by different teams are going to present the information that you end up being able to see here. You're having to align that kind of diagnostic capability between different teams.
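Here is a small sketch of the idea in Python: accept a correlation ID if the caller supplied one, otherwise create one at the edge, log it at this hop, and pass it downstream. The header name X-Correlation-ID, the event name and the downstream URL are all assumptions - the point the talk makes is that the teams have to agree on one identifier and stick to it.

```python
# Sketch only: the header name, event name and downstream URL are assumptions.
import json
import logging
import urllib.request
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("parcel.collection")

CORRELATION_HEADER = "X-Correlation-ID"  # every team must log and forward exactly this field


def handle_request(incoming_headers: dict) -> None:
    # Reuse the caller's ID if present; otherwise this hop is the edge, so start a new trace.
    correlation_id = incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())

    # Log the ID at this hop so the whole journey can be reassembled later.
    logger.info(json.dumps({"event_id": "ParcelCollected", "correlation_id": correlation_id}))

    # Pass the same ID downstream to the next (hypothetical) component.
    request = urllib.request.Request(
        "http://delivery-centre.internal/dispatch",
        headers={CORRELATION_HEADER: correlation_id},
    )
    # urllib.request.urlopen(request)  # call left commented out so the sketch runs standalone


handle_request({})  # no incoming ID, so a new one is generated at the edge
```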
5 - Lightweight User Personas for Ops
Matthew Skelton:
And the final technique is this one: taking some of the learning from user experience, UX, and applying it internally to people who are not the primary users of the software but are people like live services, or IT support, or ops, or whatever you call them - maybe SRE - and doing a lightweight user persona for these people. The really key aspects that we're trying to get at by using a lightweight user persona are these three things: the motivations of those people - what's driving them? Their goals - what are they trying to achieve? And their frustrations - what things really annoy them about how the software works now, or how the software has worked in the past? And so we can get UX experts to help us with this: characterizing the needs of ITOps people, SREs, live services people, to help drive the right kind of operational features in the applications that we're building in the product teams - so the operator experience of these people is being met early on.
If you can arrange things so that these people working in ITOps - sorry, live services - can be champions for your way of building software, that's an incredibly powerful driver to make things work well. So these are the five things, just to summarize this little section - the filling in the middle of the cookies. Make sure you're using modern event-based logging; crucially, don't just focus on the tools, focus on the intent and the collaboration between different teams when using that event-based logging - that's where the power lies. Try the Run Book dialogue sheet technique or something similar: get people around the table, get hold of a list of stuff that you know is going to have to be addressed to make the systems work well, to give them good operability. Make sure you've got endpoint health checks - HTTP endpoint health checks are probably the best way. Make sure you're using correlation IDs, and align different teams to make sure that you can trace a call through different parts of the system. And focus on the operator experience using UX techniques - in this specific case, lightweight user personas.
These techniques and more are in this book here - a little sales pitch - which I'm a co-author of. We've got it on sale during Continuous Lifecycle with a 30% discount, so come to the stand if you're interested in taking advantage of that. You'll find a lot more detail in there: we've got case studies, and other techniques and things for operability.
Assessing Operability at Scale within Continuous Delivery
Matthew Skelton:
So how does this stuff work at scale - 700 or more people, 70 teams across lots of different locations? How can we assess this and help teams to improve? We did some work around cross-team engineering standards. Now, some of you might go, "Whoa, I don't want to go anywhere near engineering standards. That's really scary. Or some architects have written them and they're three years out of date and they're all irrelevant - I'm not interested." So that's a really important point. It took a little while to work out a dynamic that meant the standards were not going to be out of date right away.
What we ended up coming up with is something like this. We had a very small number of things that were mandatory. For example, the name of the correlation ID field when logging - that's mandatory. There's no value in having divergence on that, because we need to be able to trace across multiple services. You will use exactly this field to represent that thing. You will use, let's say, GitHub for source control - there's no value in someone picking something else. But there's a very, very small number of things in the very center of the onion: things that are unlikely to go out of date and have a high value in being mandated. The next layer was "expected". We expect you to use, let's say, this version of Java - I can't remember, version eight or whatever - we expect you to use this log-shipping component, we expect you to use something else. But it's not mandated. So if you use something else and can then demonstrate a really effective outcome from using that, great - come and share your experience at one of the lunchtime sessions.
And that gives us an interesting insight into, "Well, maybe actually that's worth exploring." Or someone might present it and other people say, "Actually, no, we've solved those problems before; you're going to head down an awkward path. Go and have a look at this previous talk that we've done, and the slides and details there - we really recommend you don't do that." So the "expected" group is kind of a happy path, tried and tested, and there's quite a large number of things in there, which helps teams get on board very quickly, and they'll get lots of support. But if they're very confident that they can do something differently, that's fine. There's a lot of "recommended" stuff as well, which is the kind of pattern that seems to work well: we recommend you use this, but actually it's probably equally valid to use something else. So this sets up an interesting dynamic, where to get into the mandatory bit, something has to be super high value.
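As a rough sketch of that "onion" in code form: the three rings come from the talk, while the specific entries are illustrative (for example, the talk mandates that the correlation ID field has one agreed name, but doesn't say what that name is).

```python
# Rough sketch of the standards "onion" - the entries are illustrative examples, not the real list.
engineering_standards = {
    "mandatory": [  # tiny core: high value, unlikely to go out of date, no value in divergence
        "Correlation ID is logged under one agreed field name (e.g. 'correlation_id')",
        "Source control is GitHub",
    ],
    "expected": [  # the tried-and-tested happy path; deviate only if you can show a better outcome
        "The agreed Java version",
        "The standard log-shipping component",
    ],
    "recommended": [  # patterns that seem to work well; equally valid alternatives exist
        "Prometheus for metrics",
    ],
}
```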
And it means that we're explaining the intent. Why is that thing mandatory? "Oh, I can see the reason for that. That's fine. I'm happy to go with that as a mandatory thing." And it sets up an interesting dynamic within, as I said, 70 engineering teams and lots of different suppliers; it helps to get more buy-in for aligning teams across multiple different places, multiple different suppliers, multiple different streams of work. So if you're operating at a similar kind of scale, it's worth thinking about something like this. Again, it's the intent behind it; it's the kind of conversations that this thing is driving. Combined with the weekly Engineering Guild and Working Group, where we could discuss what goes in mandatory and what goes in expected or recommended, they set up a great dynamic for learning from different people and learning from different teams - emphasizing this kind of team engagement, basically. And then we did a kind of multi-dimensional engineering assessment. It wasn't quite as psychedelic as this.
The dimensions we used for the assessments across all these teams were team health, deployment, Continuous Delivery, flow, operability and testing. We've pulled together the framework for doing this: if you go to softwaredeliveryassessment.com, it will redirect you to a GitHub repository where all of this stuff is available to see. And the criteria for these six dimensions are taken from existing sources. The team health bit was taken from the Spotify health check, which is already, I think, open source or Creative Commons. The book by Mirco Hering, DevOps for the Modern Enterprise. The Continuous Delivery book I mentioned before. The book Accelerate by Nicole Forsgren and colleagues, which was published last year. And then the Software Operability book that you've already seen, and its companion in the series called Team Guide to Software Operability. So the criteria we've got for the assessment are all taken from books that are already published, already well known, and sort of proven, if you like. There are a few more resources there which might be interesting.
So operability questions, testability questions, and a CD checklist are other ways of sharing some of these assessment criteria with people who maybe wouldn't understand how to read a GitHub repository, for example. So raise your hand - has anyone used the Spotify team health check? One, two. Do you find it useful? Good. It's very useful. Every team I've worked with has found the Spotify health check model very, very useful. So broadly how it works is: what we've done effectively is taken the Spotify health check model, which is a single dimension of team health, and we've added five more dimensions to have six dimensions in total. But we've taken the way in which Spotify recommend running it, and we've used it as a model, basically - partly because it works, but also partly because lots of scrum masters and other people have already used it. So it's a team self-assessment. The team self-assesses based on the criteria and comes up with a score for itself. We'll see the results of that in a second.
The sessions take about two hours to go across the six different dimensions, so it feels a bit like a kind of extended retrospective - or maybe it's not extended for you, depending on how long your retros last. And you typically have someone facilitating it, acting kind of like someone who would facilitate a retrospective and keep things moving along. Inside each assessment session there was another facilitator in training. That allowed us effectively to make the assessment kind of viral, because the person who was being trained in the assessment would then go on and do two more assessments and also train two more people. So that was the way we scaled it across multiple teams. We recorded the results on big sheets - we printed up big A1 sheets and put them on the wall. So here's one of the dimensions for one team; this one is operability. There are some additional details for each of these headings, which you'll find on the website I mentioned. The sheet here just lists the headings.
But the team is able to rate itself on how well it collaborates on operability, for example. They're able to rate themselves, in this case, on how much time and effort they spend thinking about and implementing operability within their team. In this particular case they gave themselves a middling three out of five for that one - that's fine. What else have we got? How well they deal with failure modes in their software or their service - again, this particular team thought, "Well, we're doing all right, we're about a three." You can see there's a lot of interesting stuff: testability, certificates, KPIs, logging, all sorts of stuff like this. There's more detail that you can see here. But the team was able to self-assess on these six dimensions. And then we brought the results together, and this allowed us, across multiple teams, to spot whether there were any consistent problems. If a team rated itself really poorly on one of these areas, what could we, as the core engineering group, do about it?
Is there something that is missing? Is some documentation missing? Is it too hard to use a particular infrastructure service, and so on? So we used these as signals to help work out where to invest time in the platform, and also where to invest time in assisting other teams to raise their game a bit. And this is typically the response we got from every session that we ran - can you see that at the back? There are two stickies: one says value and one says execution. So basically, how much value did you get out of the session, and how well was it executed? And pretty much all of the sessions came out with all smiley faces. The teams really valued this opportunity to reflect and self-assess on these different criteria. So as I mentioned, it's Creative Commons Share-Alike, it's open source. If you're interested in having a look, just go to softwaredeliveryassessment.com and you'll find all the details. And obviously, as before, send us a pull request if you think something should be changed or there's something missing.
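As a small illustrative sketch of that aggregation step - spotting consistently weak areas across many teams - here is one way it could look. The six dimension names come from the talk; the team names, scores and threshold are made up.

```python
# Illustrative only: dimension names are from the talk; teams, scores and threshold are made up.
from statistics import mean

DIMENSIONS = ["team health", "deployment", "continuous delivery", "flow", "operability", "testing"]

scores_by_team = {  # each team self-assesses 1-5 per dimension during its session
    "team-a": {"team health": 4, "deployment": 2, "continuous delivery": 3,
               "flow": 3, "operability": 3, "testing": 2},
    "team-b": {"team health": 3, "deployment": 2, "continuous delivery": 4,
               "flow": 4, "operability": 2, "testing": 3},
}

for dimension in DIMENSIONS:
    average = mean(scores[dimension] for scores in scores_by_team.values())
    if average < 3:  # weak across the board - a signal to invest platform or coaching time here
        print(f"Consistently weak across teams: {dimension} (average {average:.1f})")
```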
Experience Report: results so far
Matthew Skelton:
So what are the results so far? I'm no longer on that particular piece of work. For various reasons - probably because I'm writing a book, which I'll mention in a minute - I had to finish at the end of last year. But so far, we went from two successful release candidate builds per week, increasing the speed of that, through to seven or eight successful builds per day.
So that's pretty good - more than an order of magnitude change. And we're starting to see multiple independent routes to production, to live. In terms of operability, the ops teams, the live services teams, really love the new operator experience that we've been able to show them, particularly on the logging side, with some changes to how the dashboards work as well. And there were no major operational problems. We managed to get people to see that things like logging and correlation IDs were actually an opportunity to improve collaboration between teams, and I think that was a real revelation for lots of people.
And it's good to hear, from people I know who are still working there now, that the weekly tech talks and Guilds and so on continue to drive this awareness of good practice. Loads of other people were involved, right? It wasn't just me, it wasn't just my colleagues at Axiologik - lots of other people as well. All right.
Key Takeaways
Matthew Skelton:
Some key takeaways, then. Address operability early on - what some people call "shift left", though I've always found that a bit confusing. Address these operational aspects early on, and add checks for them into the deployment pipeline. Use techniques that are very team-focused, team-first, like the Run Book dialogue sheet that you saw - you'll find that at runbooktemplate.info - or similar techniques that get the team thinking about how effective their software is going to be when it runs in production. It's an absolute foundation to have good logging in place. Use a well-defined event base; you can lean on enums if you've got them in your language.
Being able to read an event ID as a human being and immediately understand it matters. As someone writing the software, yes, you might understand what 3475 is as an event identifier. Does the person in ops or live services immediately know that? If they can see that it says "Item added to basket", that's so much richer, more meaningful, cutting straight to the nature of the problem. And we need to make space for learning and sharing. You've got to invest in this time - like Tonya was saying yesterday, you've got to invest in opportunities for people to learn in different contexts and share knowledge and practice. Something that I've been thinking about a lot recently is the concept of a thinnest viable platform. What is the smallest, thinnest platform that could be provided, to enable application delivery teams to deliver rapidly and safely in your context? If the only platform you need is a web page that says, "We use these five Amazon or AWS services and we use this authentication mechanism."
If that's all you need in your organization, that is your platform. It's a wiki page that sits on top of AWS. You don't need to build anything more - that's awesome, you've saved a lot of time. Now, in your context it's probably a bit more than that. But define very, very clearly what that platform is, and make sure that the people who have to use it have a good experience when they use it. For me, that's a major takeaway from recent work that I've been doing. So to summarize this little closing section: address operability early on; make sure that you've got exceptional logging in place; make space for learning and sharing; make sure you've defined what your platform is, from the point of view of the people who have to consume it; and involve teams in these improvements, like the assessments that we've seen - get them bought into the improvements you're trying to make. There's a link to the operability book if you're interested; again, we've got a 30% discount today. And there are a few more links in here as well, so you can find all this stuff when you download the slides later.
On a slightly separate note, this is a book I've been working on, which is why I'm not working on that program anymore.
If you're interested in the relationship between teams and technologies and platforms, and how different teams inter-relate, this book is coming out in September. (The "Team Topologies" book was released in 2019.) It's published by IT Revolution Press, who published Accelerate, The DevOps Handbook, The Phoenix Project and so on. We've got some really amazing case studies from great organizations around the world. So if you're interested, sign up for the newsletter. You can actually pre-order it on Amazon now, I think. So if you're interested in that kind of stuff, head to teamtopologies.com or come and find me at the Conflux stand later on. And that's all we've got. So thank you very much.