DevOps

To be a #DevOps Leader…

…you have to do less work, and do more to improve work.

I know, it’s been said 1000 different ways already, but it sank in this week as I struggled to put out the bazillionth little forest fire that had crept up after our last deployment. It’s so easy to get back in the rhythm of doing whatever it takes to keep the system stable, even if it means sacrificing time with family, projects you want to make headway on, general health and well-being… IT work is built on the backs of heroes.

But it isn’t sustainable.  I look back over the last month and wonder what the hell it is I accomplished.  My team is worn out; I’m worn out.  We managed to squeak in a major project, but we had to drop the ball on 100 other little things to get it done.  And, to rub salt in the wound, the new project added new work to the backlog, which just means that we may have won the battle, but it’s an expensive win.

My job as a leader is to make sure people know how much wins like that cost; not to discourage them from trying to make a win, but to make better strategic decisions so that we don’t sacrifice something important in order to get some other important thing done.

Monday’s coming, and I’m going to start over. Again. And I’m going to keep starting over until things get better.

#DOES18 – Wednesday Takeaways

Last day of the DevOps Enterprise Summit 2018, and I had a few aha! moments that really clicked for me.

  1. Jon Hall’s presentation on swarming was a great practical explanation of how to manage swarms. My team needs to be more available to Tier 1 and Tier 2 to swarm on issues and resolve them. It’s not quite the model Jon presented, but it’s a step.
  2. As an organization, we don’t need a plan so much as we need a set of common principles: commitment to communicate (make work visible to teams, management, customers… etc.), and commitment to best practices in development (SOA, Agile, product not project, etc.). That’s the first step; once we’ve identified those principles, we need to establish processes that reflect them.
  3. Uncertainty is good. We don’t need to stay in a particular mode so much as we need to find the problem and solve it quickly.

Keynotes are up, if you want to check them out.  Great stuff in there.

#DOES18 – Tuesday Takeaways

Thoughts from yesterday’s DevOps Enterprise Summit in Las Vegas

  1. Managing workflow is different than managing configuration changes or code. It’s related, and you can have separate systems for individual teams’ processes and procedures (e.g., Azure DevOps for developers, ServiceNow for ops), but you want to make work visible to all affected teams.
  2. Making work visible was a theme I heard over and over again; so much of the “throw it over the wall” mentality stems from the fact that, to everyone else, what ops does looks like magic.
  3. Processes are necessary, but they must be as light as possible or they won’t meet the need (they get in the way of rapid response).  People often adapt by ditching the process anyway.

#DOES18 – Monday Takeaways

This is a quick post; I meant to get this pulled together last night, but hey, I’m in Vegas.  There were things to do, so I’m hammering it out before the morning sessions start.  As always, this is a fantastic conference; it really shifts the focus away from specific technological solutions to understanding why technology is important to delivering business value.  In no specific order, here are my key takeaways from yesterday:

  1. I need to do a better job of articulating my vision, and asking my boss (and his boss) to do the same.  What’s the message we want to send to our employees?  To our clients?  To our competition?
  2. Minimum Viable Compliance – we are so process-heavy; how do we convince governance to help us do the right thing, but do it in such a way that it reduces the strain on the system?  Governance as a Service.
  3. Return to our roots: build a service catalog, and use it to describe architecture. Every single component of the value stream (ideally) should be captured, and used like Lego blocks to define service delivery.
  4. Reliability Engineering needs to be visible.  Keep measurement and reporting simple; start with a rubric, and a pass/fail mentality.  Dedicate time to toil reduction as a team, not as individuals.
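To make the "rubric and a pass/fail mentality" idea in the last item concrete, here is a minimal sketch of what such a report could look like. Everything in it is an assumption for illustration: the check names, the `billing-db` service, and the rule that a service passes only when every check passes are mine, not something presented at the conference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One rubric line: a yes/no question about a service (hypothetical)."""
    name: str
    passed: bool


def score(checks):
    """Return (passed_count, total, overall_pass) for a list of checks."""
    passed = sum(1 for c in checks if c.passed)
    # Assumed rule: the service passes only if every check passes.
    return passed, len(checks), passed == len(checks)


def report(service, checks):
    """Build a plain-text report that is simple enough to post anywhere."""
    passed, total, ok = score(checks)
    lines = [f"{service}: {'PASS' if ok else 'FAIL'} ({passed}/{total})"]
    lines += [f"  [{'x' if c.passed else ' '}] {c.name}" for c in checks]
    return "\n".join(lines)


# Illustrative rubric; these checks are made up for the example.
checks = [
    Check("Has a documented owner", True),
    Check("Has a tested restore procedure", False),
    Check("Alerts reach the on-call engineer", True),
]
print(report("billing-db", checks))
```

The point of keeping it this plain is visibility: a three-line report that anyone can read beats a dashboard nobody opens.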

Lots more to come, but wanted to get these notes out the door.

Look Mom, I’m famous….

Well, moderately well-known 😛

I had the pleasure of being a guest on Andy Leonard and Frank LaVigne’s podcast, Data Driven. I was originally going to talk about #SQLFamily, Azure DataFest, and community life in general, but the topic quickly changed in light of the Azure outage on the day we recorded it (9/4/2018). So we rambled on, joked a lot, threw in some 80’s pop culture references, and generally had a good time. Give it a listen on your favorite podcasting app.

Hopefully this will inspire me to find time to write more. We touched on a lot of ideas, but didn’t really dive very deeply into any of them. I need to stop DOING, and WRITE IT DOWN (that’s a variant of the advice I give my team).

The Definition of Service

As I’ve blogged previously (The S in #SRE), I’m in the process of transitioning my team from database and system administration to a team focused on service reliability. As I’m continuing to evangelize DevOps and Service Reliability Engineering within my business unit, I’ve realized that I need to have a good strong definition of what exactly a service is. I figure if I’m going to work through this, might as well do it in my blog so I can find it later.

A Service:

  1. Is an abstraction of value. Services are containers for delivering value to a customer, either internal or external.
  2. Has a consistent definition. It’s composed of people, products, and processes, and while the relationship between those elements can change, the inputs and outputs of a service are relatively consistent.
  3. Requires a backup strategy. Disaster Recovery Plans and Business Impact Analysis are foundational tools for the management of quality associated with a service. While a DRP may contain multiple recovery strategies, a Service should be able to be recovered entirely.
  4. Should be continuously improvable. This means that a service is only fixed at a point-in-time; components should be versioned so that management processes (including recovery) are synchronized with the delivery of value at a given time.
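One way to think through the four points above is to write the definition down as a data structure. This is only a sketch under my own assumptions: the `Service` class, its field names, and the example values (`reporting`, `SQL Server`, `Power BI`) are hypothetical, and treating "recoverable" as "the recovery plan was written for the current version" is just one reading of point 4.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Service:
    """A service as a versioned container of value (hypothetical model)."""
    name: str
    version: str
    inputs: tuple        # what the service consumes
    outputs: tuple       # the value it delivers
    people: tuple = ()
    products: tuple = ()
    processes: tuple = ()
    recovery_plan_version: str = ""  # version the recovery plan was written for

    def same_contract(self, other):
        """Point 2: the parts may change, the inputs and outputs should not."""
        return self.inputs == other.inputs and self.outputs == other.outputs

    def is_recoverable(self):
        """Points 3 and 4: the recovery plan must match the current version."""
        return self.recovery_plan_version == self.version


v1 = Service(
    name="reporting",
    version="1.0",
    inputs=("sales data",),
    outputs=("monthly report",),
    products=("SQL Server",),
    recovery_plan_version="1.0",
)

# Improve the service: a new component, same inputs and outputs.
v2 = replace(v1, version="1.1", products=("SQL Server", "Power BI"))

print(v2.same_contract(v1))   # True: the contract is unchanged
print(v2.is_recoverable())    # False: the recovery plan still targets 1.0
```

The second result is the interesting one: the improvement shipped, but recovery was not versioned along with it, which is exactly the drift point 4 warns about.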

More to come.

#DevOps – Lead by example, but set the right example.

Last weekend, I missed a data center migration.

It was a scheduling conflict; for Christmas last year, my wife had bought me tickets to High Water music festival (which was great, btw), and when they set the dates for the data center migration, I was worried. The tickets were expensive, and we had booked hotels, etc; I couldn’t change plans to work with the schedule, and there were too many teams involved in the migration for them to pick a different date. We’d done this migration once before (6 months ago), and I was confident in my team’s ability, but still… I was worried. You see, missing an after-hours deployment or a maintenance window of this size wasn’t usually considered to be an option before (by me). I’ve always been a firm believer in the management rule of: Don’t Ask Others to Do Something You Won’t Do.

So, every migration, every deployment, every maintenance window… I was there. Weekends, mornings, evenings… I was there. When our first major data center migration blew up a year ago, I was there for 26 hours. I THOUGHT I was sending the message that “I’m here for you… I’m leading the way… I’m being a team player.”

That’s not the message I was sending.

What happened while I was away is that others stepped up and filled the void left by my absence. They didn’t do things exactly like I would have done, and they had to take on some additional responsibilities during the migration, so their timing wasn’t as efficient as if I had been there. But the work got done, and we survived without me. I could have looked at that and said “aha; I’m not really necessary; there’s some waste savings there!” Instead, I realized that what I thought was a four-person job was really a three-person job, and that meant that the fourth person could do what was more important than work: life.

You see, the message that I was sending by being at every after-hours activity was that I Expect Y’All to Give Up Your Free Time for Your Job, Just Like I Do. I didn’t mean it that way, but my employees picked up on it. I was there; they were there. Every time. And that’s no way to work.

What I realized this weekend is that Leading By Example also means Resting By Example. If the job really is a three-person job, then four people don’t need to show up to do it (or else work will expand to make it a four-person job; a variant of Parkinson’s law). And while I should still be willing to do the job, I need to be willing to do it when it’s my turn. I’m now scheduling rotations (I’m in one of those rotations as an engineer), and letting my team understand that it’s not just OK to not be at every maintenance window activity; it’s expected. A job is what you do to pay the bills and enjoy life. If I believe that for myself, then I need to set that example for my team as well.
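For what it’s worth, the rotation itself doesn’t need to be complicated. Here is a minimal sketch of a round-robin schedule, assuming a four-person team and a three-person crew per maintenance window; the `rotation` function, the names, and the window labels are all made up for illustration and are not the tool I actually use.

```python
from itertools import cycle, islice


def rotation(team, crew_size, windows):
    """Yield (window, crew) pairs, taking people from the team in turn."""
    if crew_size > len(team):
        raise ValueError("crew_size cannot exceed the size of the team")
    members = cycle(team)  # endless round-robin over the team
    for window in windows:
        # Take the next crew_size people; the cycle picks up where it left off.
        yield window, list(islice(members, crew_size))


team = ["Ana", "Ben", "Cy", "Dee"]          # hypothetical names
windows = ["W1", "W2", "W3", "W4"]          # hypothetical maintenance windows

for window, crew in rotation(team, 3, windows):
    resting = [person for person in team if person not in crew]
    print(window, "crew:", crew, "resting:", resting)
```

With four people and a crew of three, everyone (the manager included) sits out exactly one window in every four, and nobody has to ask for permission to do it.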

“Presenting” at #SQLSATATL – #LeanCoffee #DevOps

My supplies for my workshop!

On Saturday, May 19, 2018 at SQL Saturday Atlanta, I won’t just be an organizer; I’m a presenter! My session, “All (Data) Things Considered: The Lean Coffee Workshop”, is something I’m very excited to “present”. I use that term loosely, because the whole point of a lean coffee workshop is that it’s a structured but agenda-less discussion. I participated in one of these at the DevOps Enterprise Summit in 2017, and it was a fun and inspiring way to engage with other people who were facing problems very similar to mine.

The way it works is that there will be a brief introduction at the beginning of the session, and then people are expected to form several small groups. A seed topic will be presented, and each small group will have a moderator (thanks to my volunteers) who will make sure that their group stays on track. Every group will:

  1. Set up a personal Kanban board.
  2. Identify topics.
  3. Vote & discuss.

That’s it. Easiest presentation I’ve ever done, but the goals are really deep. First, I want to encourage people to engage with each other; that’s one of the original goals of SQLSaturday, and I think traditional classroom settings don’t do enough of that (conversations are usually instructor -> audience, or audience -> instructor). This puts people around a table in a small, safe environment, and that leads to long term possibilities for relationships.

Second, I’m more interested in conversations about improving work, rather than just how to do work. I think coffee talks foster that because you’re not looking at a tool or a piece of code; you’re talking to a person, and hearing what they think. That sharing of perspective can spark new ideas, and new ways of looking at the forest, rather than individual trees.

Looking forward to seeing you there!

One Weird Trick to Build a #DevOps Culture

I know it’s clickbait, but I really did want to reinforce the simplicity of this post. I think building a DevOps culture can sometimes be daunting for most folks from traditional IT backgrounds because, well, people aren’t systems. You see, both developers and system engineers are comfortable with technology; it’s usually predictable, and it’s relatively easy to manipulate.

People (and processes) are messy. The social aspect of a sociotechnical environment is full of friction: different backgrounds, different perspectives, differences of opinion. It all creates a challenge for solving business problems quickly because it takes energy to build and maintain a relationship. However, once those relationships are established, the benefits multiply. When you are challenged by a peer you trust to defend an idea, it makes the idea better. The old adage of “iron sharpens iron” is true; smart people make smart people smarter.

But how do you build trust? The first step is easy.

Be thankful.

That’s it; acknowledge when people take a risk, and thank them for stepping out of their comfort zone. Be thankful when they offer an opinion that’s different than yours, even if you don’t take the advice. Specifically acknowledge their contributions that make your job easier, and let them know why. It takes some time (and occasionally some effort), but it establishes a pattern of trust. This is particularly important when you’re in a leadership position, and they are not; the best ideas come from people who feel like they can contribute on a regular basis, especially to people in power.

Take the extra time to say “Thank You”, and establish a foundation of trust.

The S in #SRE

As I’ve blogged previously, my responsibilities at work have shifted to focus more on the application of Site Reliability Engineering principles to the delivery of our business services to our customers. Unofficially, we’re calling my team Service Reliability Engineering for a few reasons. I thought I’d take some time to explain what the differences are, and why I think the name matters. I realize I’m just one lonely guy in the wilderness, and I’m going up against Google, but I think one word in the title is wrong. Before I explain why, let me explain what I do like about the title.

Engineering defines consistency of methods.

I realize that engineering is an interesting term these days, with lots of different definitions; you can even be sued if you call yourself an engineer inappropriately in the wrong jurisdiction. However, the term itself is widely used in technology careers to describe the systematic design and operation of complex systems. Most modern applications are actually composed of several smaller applications, all in varying states of underlying complexity. Furthermore, the delivery of an application to an end user (particularly web applications) can span the entire spectrum from infrastructure to platform to software. Additionally, applications can vary in terms of scalability, configuration, and location. Engineering addresses complexity, not just complication, through systematic processes; engineers experiment, learn, and integrate consistent practices into their daily processes.

Reliability refers to purpose.

When your job title names reliability, it means that you have a specific goal in mind, and that goal is not limited to a technology. Reliability engineers work with networking equipment, operating systems, applications, middleware, and/or database systems. They may specialize in an area (e.g., database reliability engineering is now a thing), but a robust team is composed of the skill sets required to meet service level objectives across the entire technology stack. Reliability as a goal must first be defined, and then measured, and SRE teams are responsible for measuring and addressing reliability across the entire spectrum, from infrastructure to platform to software. However, reliability measurement must account for not only technological issues, but also the processes and people responsible for developing and operating the system. There’s a reason that a just culture is an integral part of the SRE experience (and the DevOps movement at large); people are responsible for how well technology performs, both in terms of defining expectations and day-to-day delivery of service. It only makes sense to look beyond technology when examining reliability, and that leads to where I disagree with the standard SRE nomenclature.

“Site” implies a technical focus; “Service” implies a business function.

The word “Site” in the IT domain typically refers to either a physical location (data center site) or an application (web site); however, the heart of the definition is sociotechnical, not strictly technological. From an undated (seriously, Google?) interview with Ben Treynor, the founder of the SRE movement: “… we have a bunch of rules of engagement, and principles for how SRE teams interact with their environment — not only the production environment, but also the development teams, the testing teams, the users, and so on.” While the previous paragraph of that interview specifically focuses on the type of work that’s being done by Google’s SRE team, these rules of engagement show that SREs should be concerned with the entire value stream of service delivery, including not only operations, but development, testing, and ultimately the end user experience. In other words, SREs are concerned with the reliability of the whole service, not just the technical parts.