SRE

Look Mom, I’m famous….

Well, moderately well-known 😛

I had the pleasure of being a guest on Andy Leonard and Frank LaVigne's podcast, Data Driven.  I was originally going to talk about #SQLFamily, Azure DataFest, and community life in general, but the topic quickly changed in light of the Azure outage on the day we recorded it (9/4/2018).  So we rambled on, joked a lot, threw in some '80s pop culture references, and generally had a good time. Give it a listen on your favorite podcasting app.

Hopefully this will inspire me to find time to write more.  We touched on a lot of ideas, but didn’t really dive very deeply into any of them.  I need to stop DOING, and WRITE IT DOWN (that’s a variant of the advice I give my team).

The Definition of Service

As I’ve blogged previously (The S in #SRE), I’m in the process of transitioning my team from database and system administration to a team focused on service reliability. As I’m continuing to evangelize DevOps and Service Reliability Engineering within my business unit, I’ve realized that I need to have a good strong definition of what exactly a service is. I figure if I’m going to work through this, might as well do it in my blog so I can find it later.

A Service:

  1. Is an abstraction of value. Services are containers for delivering value to a customer, either internal or external.
  2. Has a consistent definition. It's comprised of people, products, and processes; while the relationships between those elements can change, the inputs and outputs of a service are relatively consistent.
  3. Requires a backup strategy. Disaster Recovery Plans and Business Impact Analyses are foundational tools for managing the quality of a service. While a DRP may contain multiple recovery strategies, a service should be recoverable in its entirety.
  4. Should be continuously improvable. This means that a service is only fixed at a point in time; components should be versioned so that management processes (including recovery) stay synchronized with the delivery of value at a given time (a rough sketch of this follows the list).
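
To make that a little more concrete, here's a minimal sketch of how a service definition along those lines might be captured as data. The structure, field names, and example values are my own illustration (not a standard, and not something we run in production), but they map to the four points above: a value statement, consistent inputs and outputs, a pointer to a recovery plan, and versioned components.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, minimal model of a service definition -- illustrative only.

@dataclass
class Component:
    name: str
    version: str  # versioned so recovery and management stay in sync (point 4)

@dataclass
class Service:
    name: str
    value_statement: str   # the value delivered to a customer (point 1)
    inputs: List[str]      # relatively consistent inputs (point 2)
    outputs: List[str]     # relatively consistent outputs (point 2)
    recovery_plan: str     # pointer to the DRP / BIA artifacts (point 3)
    components: List[Component] = field(default_factory=list)

# A made-up example: a claims-billing service and its versioned components.
billing = Service(
    name="claims-billing",
    value_statement="Submit clean claims to payers within 24 hours",
    inputs=["encounter records", "payer rules"],
    outputs=["submitted claims", "rejection reports"],
    recovery_plan="drp/claims-billing.md",
    components=[Component("claims-api", "2.4.1"), Component("claims-db", "14.2")],
)
```

The point isn't the format; it's that the definition is explicit enough to version, recover, and continuously improve.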

More to come.

The S in #SRE

As I’ve blogged previously, my responsibilities at work have shifted to focus more on the application of Site Reliability Engineering principles to the delivery of our business services to our customers. Unofficially, we’re calling my team Service Reliability Engineering for a few reasons. I thought I’d take some time to explain what the differences are, and why I think the name matters. I realize I’m just one lonely guy in the wilderness, and I’m going up against Google, but I think one word in the title is wrong. Before I explain why, let me explain what I do like about the title.

Engineering defines consistency of methods.

I realize that engineering is an interesting term these days, with lots of different definitions; you can even be sued if you call yourself an engineer inappropriately in the wrong jurisdiction. However, the term itself is widely used in technology careers to describe the systematic design and operation of complex systems. Most modern applications are actually comprised of several smaller applications, all in varying states of underlying complexity. Furthermore, the delivery of an application to an end user (particularly a web application) can span the entire spectrum from infrastructure to platform to software. Additionally, applications can vary in terms of scalability, configuration, and location. Engineering addresses complexity, not just complication, through systematic processes; engineers experiment, learn, and integrate consistent practices into their daily work.

Reliability refers to purpose.

When your job title includes the word reliability, it means that you have a specific goal in mind, and that goal is not limited to a single technology. Reliability engineers work with networking equipment, operating systems, applications, middleware, and/or database systems. They may specialize in an area (e.g., database reliability engineering is now a thing), but a robust team is comprised of the skill sets required to meet service level objectives across the entire technology stack. Reliability as a goal must first be defined and then measured, and SREs are responsible for measuring and addressing reliability across the entire spectrum, from infrastructure to platform to software. However, reliability measurement must account not only for technological issues, but also for the processes and people responsible for developing and operating the system. There's a reason that a just culture is an integral part of the SRE experience (and the DevOps movement at large); people are responsible for how well technology performs, both in terms of defining expectations and in the day-to-day delivery of service. It only makes sense to look beyond technology when examining reliability, and that leads to where I disagree with the standard SRE nomenclature.

“Site” implies a technical focus; “Service” implies a business function.

The word “Site” in the IT domain typically refers to either a physical location (a data center site) or an application (a web site); however, the heart of the definition is sociotechnical, not strictly technical. From an undated (seriously, Google?) interview with Ben Treynor, the founder of the SRE movement: “… we have a bunch of rules of engagement, and principles for how SRE teams interact with their environment — not only the production environment, but also the development teams, the testing teams, the users, and so on.” While the previous paragraph of that interview specifically focuses on the type of work being done by Google’s SRE team, these rules of engagement show that SREs should be concerned with the entire value stream of service delivery, including not only operations, but development, testing, and ultimately the end user experience. In other words, SREs are concerned with the reliability of the whole service, not just the technical parts.

#DOES17 San Francisco – Things I Learned

I just spent the last few days at the DevOps Enterprise Summit, a technical conference that focused more on cultural change and workflow than bits and bytes. It was enlightening, and for the first time in a while, I’m leaving a professional conference energized and hoping to implement some of these ideas. The DevOps community reminds me a lot of the SQL Server community: passionate people who just want to help each other grow. There’s so much good content, and I think most of it will be available via YouTube later.

Armed with my handy dandy Rocketbook (I left my laptop in my hotel room each day purposefully), I scribbled notes fast and furiously. At the end of each day, I tried to capture three things that struck me as important, based on everything I’d heard. Here’s my list, broken apart by day:

Day 1 – Nov 13, 2017

  1. You are on the right path, do not fear. Change only happens when you take risks.
    I’m not sure why this struck me as so important, but it captured a general feeling of acceptance of ideas. I often struggle with being a leader because I think too much about the challenges ahead; that’s fear, pure and simple. The truth is that every challenge in technology has a solution; it may not be obvious, it may not be immediate, but it’s there. Fixate on the goals and successes, and don’t worry about the challenges.
  2. Value Stream Mapping is key to continuous improvement.
    This came out of the workshops, based on the Lean Coffee format; one of the discussion points that came up was the value of value stream mapping. I already liked the concept, but I tend to drift into thinking about the software components, whereas the discussion focused more on the people & process aspects of it. You can’t improve until you know where you are starting from.
  3. T-shapes are the best shapes.
    I had come across the term T-shaped professionals only a few days before flying out to San Francisco, and I was surprised to hear it mentioned in at least 4 sessions today. It aligns well with my vision for Service Reliability Engineering; people can (and should) be experts in a key area, while also having some breadth across the other areas they need.

One of the amazing graphic facilitation artifacts done by Christopher Fuller of Griot’s Eye at the conference.

Day 2 – Nov 14, 2017

  1. Start with what you can do, but don’t be afraid to ask what other people are doing.
    This ties back into the first message from Day 1, but with a twist. Often when I see good ideas, I think “I can’t implement that”, and I may be right; I don’t have control of my infrastructure anymore, for example. However, just because I can’t do it doesn’t mean that I shouldn’t tell people about something I’ve seen, and ask if it fits into their mental model of where we’re going.

  2. The SRE model works when you focus on people (not technology) and on reliability (what you can measure and improve).

    Site Reliability Engineering talks were in short supply at this conference, but the theme permeated throughout. The fundamental truth of focusing on people, rather than technology, still holds true with this model; technology breaks because of people. That’s not intended as a statement of blame, but rather an understanding that technology does what it’s told to do; when things go awry, there’s an opportunity to understand the human choices involved: was it a misunderstanding of purpose? Was there a missed signal of a pending change? Was a change implemented in a risky way?

  3. Service Level Objectives (SLOs) are crucial to monitoring.
    Operations folks are inundated with logs and other metrics; understanding the expectations for the business service is key to understanding which metrics to observe (and which to ignore). You cannot measure reliability effectively without some understanding of what uptime means (a rough sketch follows this list).
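
To illustrate why the SLO has to come first (and to keep the arithmetic honest), here’s a rough sketch of turning an availability target into an error budget. The numbers and function names are mine, not from any of the talks.

```python
# Rough sketch: an availability SLO implies an error budget.
# Figures below are illustrative only.

def allowed_downtime_minutes(slo: float, period_days: int = 30) -> float:
    """Minutes of downtime the error budget allows over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - slo)

def budget_remaining(slo: float, downtime_minutes: float, period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means blown)."""
    budget = allowed_downtime_minutes(slo, period_days)
    return (budget - downtime_minutes) / budget

# A 99.9% monthly availability target allows about 43 minutes of downtime...
print(round(allowed_downtime_minutes(0.999), 1))               # 43.2
# ...so 20 minutes of outages leaves roughly half the budget.
print(round(budget_remaining(0.999, downtime_minutes=20), 2))  # 0.54
```

Once the objective is explicit, the question of which metrics to watch mostly answers itself: watch whatever tells you how quickly you’re spending that budget.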

Day 3 – Nov 15, 2017

  1. The foundation of the journey is a three-legged stool: Service Level Objectives, Value Stream Mapping, and a Technical Architecture Map.
    I know, the whole “DevOps is a journey” thing is a bit tired, but it’s an attempt to represent the fact that everybody’s experience with these principles yields different results and paths. However, for my current work situation, we’re not going to get far without these three things. We need to understand what the service is supposed to do, how the people interact with the technology, and how the components of the technology are supposed to work together.
  2. DevOps brings joy to technology; don’t quibble over details, but focus on bringing humanity into technical work.
    Gene Kim’s closing comments reminded me a lot of my #sqlfamily; everybody brings a different perspective on technology, and a different method of solving problems, but the people that impact me most are the folks that do so with joy. They have a passion about what they do, and their goal is to encourage others to move forward (rather than focusing on rigid solutions based on their own experiences).
  3. Patience – all change takes time, and the road is uncertain. Don’t expect overnight successes.
    Finally, my head is full of things to think about, and books to read. However, there is no magic pill for implementing change in my organization. It has to happen slowly, and it has to be organized by every member of the team who wishes to contribute. It takes time to light a fire, but that fire is crucial to the success of any DevOps initiative.

Changes… #DevOps, #SRE, and #Management

As many of you know, I’ve been slowly changing focus from database administration to DevOps, management, and service operations. Over the last few years, I’ve been heavily involved with migrating our datacenter to a private cloud infrastructure, and focusing on methods to improve the overall reliability and scalability of my company’s service deliverables. I’ve been a bit of an evangelist for organizational changes, and it’s finally official.

On October 1, my job title and responsibilities will officially change from Manager of Database Administration to Senior Manager, System and Network Administration. I think that title’s a bit funny, because I won’t be managing systems or networks; our infrastructure is managed by another team, and my team is responsible for applying principles of Site Reliability Engineering to operating our complex business services. Unofficially, we’re calling ourselves Service Reliability Engineering in order to remind ourselves (and others) of our foci:

  • Identifying issues that impact the reliability of customer facing services;
  • Tracking and documenting complex relationships between applications used in that delivery; and
  • Coordinating efforts with other teams responsible for developing and operating those services.

The title change itself is a minor issue, but it’s an important distinction to me. It represents a significant shift in my career path. Although I’ve been acting as the IT Operations manager for the last two years, it’s always felt a little odd because I was still known as the DBA manager. With the move to the cloud, most of our administration (like backups, DR, and maintenance) is handled by another group, and we’re responsible for making sure that the applications (all 84 of them) perform appropriately. We’re not coders, but we need to be able to identify coding issues. We’re not network or sys admins, but we need to know how systems work, and be able to propose solutions to scalability issues. We’re not DBAs, but we need to be able to express data requirements to the new DBA team. In short, we’re an integration point for a loosely coupled set of applications.

I plan to blog more on Service Reliability Engineering in the future, but for now, I’m excited about what’s happening.

#DevOps – Two Books for Operations

Over the last couple years, there’s been a subtle shift in my responsibilities at my day job (and my interests in technology overall).  I’ve been doing much less database development and administration work, and more general system architecture work.  That’s harder to write up in blog posts than SQL code, so I’ve struggled with writing, but I want to get back into the habit.  So excuse the choppiness, and let me try to put some thoughts on digital paper.

I’m pushing very hard for my company to adopt DevOps principles. There’s a lot of material out there about DevOps from the developer perspective, but there are few resources for those of us on the operations side of the house. In a pure sense, there’s no such thing as sides, but in a regulated industry like healthcare or financial services, old walls are tough to break down, so they’re useful as organizational frameworks for general responsibilities. However, we are all developers, whether we sling code or manage infrastructure as code; the goal is to produce repeatable patterns and tools that allow growth and change.

Two great books that I’m reading right now are:

The Practice of Cloud System Administration by Limoncelli, Chalup, and Hogan. Tons of practical advice for building large-scale distributed processing systems, with DevOps philosophy woven throughout (and specifically highlighted in Chapter 8). This is one of those books where you’ll feel like diving into some sections and skimming over others; it’s a thorough examination of system administration from development through implementation, so there are lots of conceptual hooks to grab hold of (and conversely, things that you may not have experienced).

The second book that I’ve recently started reading is Site Reliability Engineering: How Google Runs Production Systems. This book is a collection of essays that explore Google’s method of approaching reliability; like most things Google, Site Reliability Engineering is similar to DevOps, but specific to the way Google does things. It’s also light on documentation (insert joke about Google and beta products here). However, it does offer several insights into day-to-day system administration at Google. While the SRE model is not exactly like DevOps, there’s lots of overlap, and the differences may be attributed more to practice than to concepts.

More to come.