Time flies. It's been almost a year since I published my "What's wrong with DevOps?" article. The attention it received came as quite a surprise to me. It seems that there are a number of like-minded individuals sharing my point of view. But what came as even more of a shock was that these individuals also seem to like my writing style. Awesome. No pressure then.
But truly, I've been wanting to write a follow-up for some time now. I finally got around to it. For better or for worse? I leave figuring that out to you.
One of the topics closely related to the "DevOps" buzzword (or even brought to daylight by its viral spread) is Agile Operations.
We all know that in today's business "agility is the key". There are so many wonderful acronyms, contractions and neologisms like MVP, NoOps, CI, CD, pipeline, XP, serverless, SOA, TDD, API, TL;DR and so on that one could write up a tech article consisting just of them! But what does that mean to "hard IT" folk like us, "the ops guys"?
With the rise of Agile software development methodologies, developers and companies started releasing apps much faster than they used to. An iterative approach towards creating code had to -- sooner or later -- revolutionise the delivery of this code to the systems it would run on. Continuous integration and continuous delivery are two concepts that most operations people are probably familiar with. While automating the "delivery pipe" completely may be frightening at first, it's useful not only for Devs, but also for Ops. Being able to deploy faster means being able to fix problems faster, for example by rolling back to a known-working application version.
Both CI and CD are, in my personal opinion, on the "bright" side of Agile Operations. I've done some automation work with Jenkins and Drone myself and loved the results I've achieved. What's even more important than agility is the communication and collaboration improvement that CI and CD bring to the table: from the shared visibility of the test and deployment success rates, to the clarity around the delivery process, to a single control console for the process itself.
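The rollback case is easy to sketch. What follows is only a minimal illustration in Python, not how Jenkins, Drone or any other tool actually does it: the Release record, its healthy flag and the rollback_target function are all made-up names, and I'm assuming the deployment history is kept oldest to newest with the current release last.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Release:
    """Hypothetical deployment record (not taken from any real tool)."""
    version: str
    healthy: bool  # assumed flag: did the release pass its post-deploy checks?


def rollback_target(history: List[Release]) -> Optional[str]:
    """Return the newest healthy version deployed before the current one.

    `history` is assumed to be ordered oldest to newest, with the
    currently running release as the last element.
    """
    for release in reversed(history[:-1]):  # skip the current release
        if release.healthy:
            return release.version
    return None  # nothing known-working to go back to


history = [
    Release("1.4.0", True),
    Release("1.4.1", True),
    Release("1.5.0", False),  # current release, failing
]
print(rollback_target(history))
```

With a history like this one, the function picks 1.4.1. The point is simply that the faster and more repeatable your deployments are, the cheaper this kind of decision becomes.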
But there's more to Agile Operations than just Agile Delivery.
The "Agile" in "Agile Operations" implies we should use some sort of agile methodology. Scrum is probably the best-known one and it's used by countless Dev teams across the globe. Probably with some success, since we still read about it. But in the Ops world we don't have only structured project work. Actually, most of the work we do (and it really doesn't matter if we're called "system administrators", "DevOps engineers" or "site reliability engineers") is day-to-day operational tasks, ongoing incidents and user-reported system problems. And trying to fit those into two-week sprints is not only pointless, but also dangerous. Just imagine what would happen if a critical security patch had to wait two weeks to be applied. Or if a CEO had to wait even a minute for help with a mailbox. And I won't even mention stuff like I/O contention or disk space increases. Scrum, with its sprints, user stories, point estimates, velocity charts and stand-ups, is a great way of organising work around a project (or a set of those). But what happens every day in Ops teams that use Scrum is that we constantly modify our sprint's task list to accommodate the unstoppable influx of user-submitted tickets. Incidents. Help requests. Questions. Sizing. Password change requests. Wi-Fi access problems. The list goes on and on. If you're an Ops person, you know what it's like.
Forcing Scrum on an operations team just because it's company policy for development teams isn't the best idea. Or it is, if what you want to achieve is decreasing issue visibility and increasing time to respond and time to resolve. That's why I've recently introduced a hybrid Scrum + Kanban workflow in the team I work in. Kanban differs from Scrum mainly when it comes to workload planning. While in Scrum the issues are picked from the backlog and organised in fixed-time sprints, in Kanban the flow is continuous and the notion of a sprint has no meaning. Another thing is that in Scrum we optimise for so-called story points (an abstract measure of a task's complexity), whereas in Kanban we optimise for time to resolve (and sometimes also time to respond). These two factors make Kanban much more suited to the type of work that operations teams usually do, or do in bulk. But sure, we do project work as well. Remember those wonderful acronyms? Implementing one of them can be a project for an operations team. And for stuff like that it absolutely makes sense to have Scrum. But what we found out in my team, just a week after we started doing stand-ups with two boards, was that it creates a lot of confusion; it's hard to know at first glance which board an issue is on. Fortunately, I've managed to create a single, universal board, though it required a compromise.
But still, this isn't everything that Agile Operations consists of in today's brave new IT Operations world.
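To make the Kanban side of this a bit more concrete, here is a rough Python sketch of the two numbers such a flow optimises for: time to respond and time to resolve. The ticket timestamps and function names are invented for illustration, and I'm using the median only as one reasonable way to summarise them; your ticketing system will have its own fields and its own reports.

```python
from datetime import datetime
from statistics import median

# Hypothetical tickets: (created, first_response, resolved).
tickets = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 10), datetime(2024, 1, 8, 11, 0)),
    (datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 30), datetime(2024, 1, 9, 10, 0)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 5), datetime(2024, 1, 9, 15, 0)),
]


def minutes_between(start, end):
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


def median_time_to_respond(tickets):
    """Median minutes from ticket creation to the first response."""
    return median(minutes_between(created, first) for created, first, _ in tickets)


def median_time_to_resolve(tickets):
    """Median minutes from ticket creation to resolution."""
    return median(minutes_between(created, resolved) for created, _, resolved in tickets)


print(median_time_to_respond(tickets))
print(median_time_to_resolve(tickets))
```

Notice there are no sprints and no story points anywhere in there. The only thing that matters is how long a ticket waited.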
Pets and cattle. Another brilliant example of our Newspeak. Another awesome idea to make us more agile.
The pets and cattle metaphor postulates that in modern IT, servers (well, systems really, or VMs) should be treated like cattle, not like pets. Pets you name, pets you take care of, pets you nurse back to health when they're sick. Cattle? You number it, you govern it, you shoot it in the head when it stops being perfect.
This difference in thinking about our systems comes mainly from the physical/virtual distinction, but also from the wide spread of automation and orchestration frameworks. It's a great thing that we can just run an Ansible playbook or apply a Puppet manifest and get a fully configured system in just a few minutes instead of typing magical incantations in the console for hours. I do love automation, I'm not saying I don't.
Carrots are good for your health, right? But try eating, like, 100 kg of carrots. The point is that everything needs to be properly balanced. Sometimes you just have to go and fix things. Especially if you're in operations. No matter how much your boss optimises you for agility, sometimes you just have to do that.
You know this viral post that spread over Reddit a few weeks back? I know, it was on the internet, so it sure as hell was true. But that's beside the point here. A guy told his story: he had automated 100% of his software testing job and had been doing literally nothing for quite a few years. But in the end he was let go. And now the poor thing doesn't remember how to code or do anything useful. Poor, poor thing.
You see? Everything needs to be balanced. Even if we have a perfect automation framework and we can always rebuild faulty systems in minutes with very little manual intervention, we should keep our skills sharp. What if the automation framework breaks? What if something goes wrong with the hypervisor's OS? What if hardware goes down?
We're the Operations Folk, we are here to resolve problems, we are here to gather knowledge and serve as subject matter experts in, well, operations! Of course, sometimes, maybe even most of the time, the quickest, easiest and most agile way of "resolving" problems is to go around them, but we have to be prepared for when things go so wrong that there's no other way than to go deep into the dark command line realm.
We rely on abstractions a lot. We have abstraction layers over abstraction layers. For example, we have bare metal, then the hypervisor OS, then we have VMs, on top of those we have Docker, on Docker we run Docker Compose, then finally we have containers and in containers the applications. So many abstraction layers. Each of which can fail. And your automation framework? How many of those abstraction layers does it account for?
"Individual servers don't matter. Individual VMs don't matter. It's the systems that run on top of them that matter. If the infrastructure underneath is failing, don't waste your time trying to figure out why. Throw it out and replace it -- preferably through automation." -- George Reese
What this quote says to me is basically: "Make your Operations Engineers less and less competent so that companies like mine can take more and more of your business".
Let me say that again: I do love automation and I think it's awesome that we have the ability to rebuild big chunks of our infrastructure. But our systems are complex and there are a lot of things that can go wrong. And things will go wrong. And you can't have support for every worst-case scenario in your automation framework. That's just not possible. I bet most of us can't even think of all the ways our systems can fail. And systems fail. And when complex systems fail, they fail hard. This is why we have to be prepared, why we have to keep our skills of dealing with failing and failed systems sharp. We always have to be able to rebuild our systems by hand. From the command line. Without any assistance from yet another system that can fail in yet another way.
I think that the pets and cattle metaphor is very useful and that treating your servers as the latter can bring good results, increase productivity and agility, and make Ops' daily, sometimes mundane tasks easier. But since we are the last line of defence against failures, we have to keep a good balance between how easy our job is and how competent we are when it comes to our systems. The more automation the better, but we have to remember that our knowledge should always be behind the framework. Ours, not anyone else's.
You know the famous five monkeys experiment? What it basically teaches us is that it's very easy to fall into patterns that we don't fully understand or don't know the basis for. Taking the pets and cattle metaphor too far can make us less competent in exactly the same manner. And, what's even worse, it can make our organisations like that too. If all we do is tear down old infra and replace it with new infra without a deep understanding of why we do this, the people who will -- sooner or later -- replace us at our desks will not be prepared for when things go really bad.
What's wrong with Agile Operations then? Well, the truth is that Operations was never meant to be as Agile as Development is. We are here to keep things stable. And to do that, we have to know our stuff. And to know our stuff, we have to, from time to time, go through a crisis and practise our skills of keeping stability and bringing it back.
If you like this article please consider sharing it with your friends.