Embrace Complexity; Tighten Your Feedback Loops


This post contains a transcript of the talk I wrote for and gave at QCon New York 2023 for Vanessa Huerta Granda’s track on resilience engineering.

The official talk title was “Embrace Complexity; Tighten Your Feedback Loops”. That’s the descriptive title for the talk that follows the conference’s guidelines about good descriptive titles. Instead I decided to follow my gut feeling and go with what I think really explains my perspective and the approach I bring with me to work and even my life in general:

I take what would probably be a sardonic approach to dealing with life and systems, and so “This is all going to hell anyway” is pervasive in my approach. Things are going to be challenging. There are going to always be pressures that keep pushing our systems to the edge of chaos. I don’t think this can be fixed or avoided. Any improvement will be used to bring it right to that edge. In complex systems, the richness and variability are often there for a reason. Trying to stamp them out in favour of stronger control is likely to create weird issues.

So the best I personally hope for is to have some limited influence in steering things the best I can to delay going to hell as long as possible, but that’s it. And my talk is going to focus on a lot of these approaches, but first, I want to explain why I feel things are that way.

In what is probably my favorite paper ever, titled Moving Off The Map, Ruthanne Huising ran ethnological studies by embedding herself into projects within many large corporations doing planned organizational changes. In supporting these efforts, they were doing “tracing” of their functions, which meant gathering a lot of data about what activities take place, what interactions and hand-offs exist, what information and tools are used and required, how long tasks take, and how people and teams deal with errors. Generally, they were asking the question “what do we do here?” and wondering with whom they do it.

To build these maps they generally reached out to experts within the organization who were supposed to know how things were working. Even then, they were really surprised.

One explained that “it was like the sun rose for the first time… I saw the bigger picture.” Participants had never seen the pieces (jobs, technologies, tools, and routines) connected in one place, and they realized that their prior view was narrow and fractured, despite being considered experts.

Others would state that “the problem is that it was not designed in the first place.” The system was not designed nor coordinated, but generally showed the result of various parts of the organization making their own decisions, solving local problems, and adapting in a decentralized manner.

The last quote comes from events when a manager at one of the organizations walked the CEO through the map, highlighting the lack of design and the disconnect between strategy and operations. The CEO sat down, put his head on the table, and said, “This is even more fucked up than I imagined.” He realized that the operation of his organization was out of his control, and that his grasp on it was imaginary.

One of the most surprising results reported in there was about tracking the people who participated in organizing and running the change projects, and seeing who got promoted, who left, and who moved around the org or industry they were in.

She found out there were two main types of outcome. The first group turned out to be filled with people who got promotions. They were mostly folks who worked in communications, training, who managed the costs and savings of the projects, or those who helped do process design. Follow-up interviews revealed that most of them attributed their promotions to having worked on a big project to put under their belt, and to frequently working with higher-ups, which both helped with getting promoted.

Another group however mostly contained people who moved to the periphery: away from core roles at the organization, sometimes becoming consultants, or leaving altogether. Those who fit this category happened to be the people who collected the data and created the map. They attributed their moves to either feeling like they finally understood the organization better, felt more empowered to change things, or became so alienated by the results they wanted to get out.

So the question of course became how come people who feel they understand how the organization truly works and who want to change it move away from the central roles and positions, and into the peripheral ones?

The fatal insight, according to Huising, is something sociologists knew for a good while: the culture and the order imposed on organizations, groups, and even societies is often emergent and negotiated. And while it’s obvious that these structures dictate a lot of actions, the actions themselves can preserve or change the structures around them.

The feelings of empowerment and alienation come in no small part because people realized that they could change a lot more than they thought they could, albeit often from outside the core decision-making that enforces the structure (while understanding how that core works), or because the ways they thought they were impacting things were shown not to be effective and they felt disembedded.

Another thing you have possibly experienced, and which isn’t in the paper, is the difference between the nominal structure of the org and its actual structure: the emergent one that depends on power dynamics, who knows what or whom, who likes or dislikes each other, and so on.

If you’ve ever worked in a flat organization, like the one in the middle here, you know that even though you have little management structure to speak of, power dynamics and decision-making authority still exist. People who have no power attached to their role are still going to be consulted or inserted in the decision-making flow of the organization, they’re still going to be influential and have the ability to make or break projects, but just with less obvious accountability.

The nominal structure is the one where each level of management within the organizational ladder specifies how information flows and how authority is applied. It’s what we see on the left in a more traditional org structure, and this way of organizing groups will simultaneously be useful to align efforts and to constrain them. It makes accountability more explicit and transparent, but structurally will prevent people from doing unspecified things, whether they would be harmful or useful.

The emergent structure is always there as well. It is implicit, always changing, and not necessarily constrained to your own organization either. Sometimes, people who know how to run, maintain, or operate components, or whom people listen to, are not even in your org anymore. They might have moved away (to a different team or even a competitor), retired, or never been in and they have just published a really influential piece of media and people look up to them.

But who knows what, works with whom, and who can move things around in specific contexts can be key to successful initiatives. Even if the organizational structure has often been put in place to constrain change, as a barrier to people working in mis-aligned ways, some folks central to the emergent structure, in key contexts, have earned enough trust to be allowed tacitly to bend and break the rules. They can choose not to enforce the rules, or the rules are not enforced as tightly for them with the hopes of positive outcomes—even if sometimes it can get you the opposite result.

I’m not here to argue in favor of one or the other structure, but mostly that in my experience, driving change or making initiatives succeed works best when catering to both structures at once, or rather fails when only looking at one and being blocked by the other. They’re both real, both distinct, and pretending only one of them exists is bound to cause you grief.

As a continuation of this, the way people work every day is often different from the way people around them imagine their work is being done. The gap between how work is thought to be done and how it is actually done is a major but generally invisible factor in how systems work out.

Based on flawed mental models of the work, procedures and prescriptions are given about how to do work, and will vary in inaccuracy. People will imagine, for example, that all the tests get written before writing or modifying any code, that code coverage will be ideal, and that everything will then be reviewed in depth by an expert, and they will enshrine this as a policy.

But the application of these policies is never perfect. Sometimes code doesn’t have an owner, or due to crunch time and based on how much the reviewer and author trust each other, the review won’t be as in-depth as expected.

When you see this mismatch causing people to ignore or bend rules, you can choose to apply authority and ask for a stricter rule-following. This pattern of enforcing the rules harder will likely drive these adaptations underground rather than stamping them out, because real constraints drive that behavior.

In turn, the work as disclosed will be less adequate, and the work as imagined progressively gets worse and worse.

This becomes a feedback loop of misunderstanding and at some point, like our devastated CEO, you’re not managing the real world anymore.

To demonstrate this, earlier this year I went to my local mastodon network—so you know this is super scientific—and ran a poll about time sheets. The question was “If you’re a software developer who ever worked for an employer who had you track your time hourly into specific projects/customer accounts and you were short on time budget, did you…”

Multiple answers were accepted. Fewer than 15% of people either stopped work, worked without tracking their time anymore (for free), or shifted their time into other projects with more buffer space.

Roughly a third of people reported billing anyway, some stating that it’s not their problem the time allocation wasn’t realistic or adequate.

But the vast majority of answers, nearly 60%, came from people saying “my time tracking was always fake and lies,” with some people stating they even wrote applications to generate realistic-looking time sheets.

What we can see here is an example of how work-as-imagined gets translated into policies (“people do their work in projects, and account for their time”), which at some point doesn’t get applied right anymore. If I were to suppose, it could be things like not being allowed to go over time, or just finding the practice useless. But the end result is that the time sheet data just isn’t trustworthy, and then it can get used again and again in further decision making.

The gap widens, and our CEO might also get to think “this is all fucked up.”

Part of the reason for this is that every day decisions are made by trying to deal with all sorts of pressures coming from the workplace, which includes the values communicated both as spoken and as acted out. People generally want to do a good job and they’ll try to balance these conflicting values and pressures as well as they can.

The outcome of that trade-off being a success or a failure isn’t known ahead of time, but these small decisions accumulate based on the feedback we get from each of these and can end up compounding and accumulating, either as improvements, or as erosion that makes organizations more brittle, or really anywhere in between. People adopt the organization’s constraints as their own, and this set of pressures is the kind of stuff that drives processes to the edge of chaos over and over again.

These accumulations of small decisions, these continuous negotiations, that’s one way your culture can define itself. Small common everyday acts and small amounts of social pressure you can apply locally have an impact, as minor as it might be, and they compound. You can easily foster your own local counterculture within a team if you want to. This can be either good (say in Skunkworks where you bypass a structure to do important work) or bad (normalizing behaviors that are counterproductive and can create conflict).

So while a lot of the work you can do to improve reliability or resilience as a whole can be driven locally, my experience is that you nevertheless get the best results by also aligning with or re-aligning some of the organizational pressures and values usually set from above.

The idea here is to start looking at the organization from both ends: how can we support the people dealing with the trade-offs in conflicting goals as they happen, how can we influence the higher-level values and pressures such that we can try to reduce how often these conflicts happen even though they will definitely keep happening, and how can we better carry context and feedback across both ends so that we constantly adjust as best as we can. A system perspective on interactions, rather than focusing on components is also something I’ve found useful. The rest of the talk is going to be spent on these ideas.

(as a note, the third drawing is Dimethylmercury, a highly volatile, reactive, flammable, and colorless liquid. It’s one of the strongest known neurotoxins, and less than 0.1 mL is enough to kill you through your skin, and gloves apparently do a bad job at protecting you)

So let’s start with negotiating trade-offs, with a bit more of an ops-y perspective, because that’s where I’m coming from.

This is a painful one sometimes, especially when you have highly professional people who take their jobs seriously.

Locally for you as a DevOps or SRE team, there is a need for the awareness of what the organization and customers actually care about. Some availability targets become useless metrics because they’re disconnected from what users want, and you’re just going to burn people out doing it.

I learned this lesson when talking to the SRE manager of one of these websites where people pick their favorite images, put them on boards, and get shown ads. He was telling me how their site was having a lot of reliability issues. It would keep going down, his team would do heroics to bring it back up, and it’d happen all over again.

He felt his team was burning out. They were losing people, and their call rotation was so painful they were also having issues hiring back into it. He was seeing the death spiral happening and was wondering what to do.

He added that there were perverse incentives at play: every time the site went down, they stopped showing images, but not ads. That meant that during incidents, they still earned money, but no longer paid for bandwidth. The site was more profitable when it failed than when it worked, and seemingly, users didn’t mind much.

They were not getting help, nobody seemed to consider it a problem. Not really knowing what to say, I just asked off-hand: “are you trying to deliver more reliability than people are asking for? What if you just stopped and let it burn more and rested your people?” He thought about it seriously, and said “yeah, maybe.”

I never actually found out what happened after this, but it still stuck with me as a really good question to ask from time to time.

In some cases, the answer will be “yes, we want to be this reliable”. But you just won’t be given the right tools to do it.

At Honeycomb, we want on-call rotations to have 5-8 people on them because that’s what we think gives a good pace that maintains a balance between how rested and how out-of-practice people can be. Not too often nor not often enough.

But many services are owned by smaller teams of 3-4 people. If we wanted rotations to be made of people who know all their components in depth, where they could build expertise and operate what they wrote, we couldn’t reach a sustainable frequency.

Instead, to keep the pace right, we tend to put together rotations made of multiple teams, for which people won’t understand many of the components they operate. This in turn makes us prepare to deal with more unknowns: fewer runbooks, more high-level switches and manual circuit breakers to gracefully degrade parts of the system to keep it running off-hours, and different patterns of escalation.

We started leaning more heavily on this when a big public product launch required shipping a new feature, which was to be operated by a team that didn’t have full time to get it operationally ready. When our SRE team was discussing with them what still needed to be done, we asked for a few simple things: a way to switch the feature off for a single customer, and a way to turn it off entirely, that wouldn’t break the rest of the product. The rest we could add as we went.

We ended up using these switches a few times, one of which prevented a surprising write-amplification bug that could have killed the whole system, and instead let us wait a few hours for the code owners to get up and fix it at a leisurely pace. We’re going to accept a bit of well-scoped, partial unavailability—something that happens a lot in large distributed systems—in order to keep the system stable.
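As a rough illustration of the two switches described above (one per customer, one for the whole feature), here is a minimal Python sketch. The `FeatureSwitch` class, its method names, and `handle_request` are hypothetical and are not Honeycomb’s actual implementation; a real system would likely back this with a flag service or shared configuration rather than in-memory state.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSwitch:
    """Hypothetical in-memory kill switch for a single feature."""

    enabled_globally: bool = True
    disabled_customers: set = field(default_factory=set)

    def disable_for(self, customer_id: str) -> None:
        # Turn the feature off for one customer only.
        self.disabled_customers.add(customer_id)

    def disable_everywhere(self) -> None:
        # Turn the feature off entirely.
        self.enabled_globally = False

    def is_enabled(self, customer_id: str) -> bool:
        return self.enabled_globally and customer_id not in self.disabled_customers


def handle_request(switch: FeatureSwitch, customer_id: str) -> str:
    # When the feature is off, fall back to the pre-existing behaviour
    # instead of failing the rest of the product.
    if switch.is_enabled(customer_id):
        return "new-feature"
    return "fallback"
```

The point of the sketch is that whoever is on call only needs to know the switch exists and what it degrades, not how the feature works internally.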

The understanding is that the person wearing the pager often does triage, and that weird issues will eventually be handled by code owners, just not right now.

This approach means that rather than working impossible hours and making inhuman efforts foreseeing the unforeseeable, we keep moving rather fast, gather feedback, find issues, and turn around a bit more on a dime. In order to do this though, there’s a general understanding that production issues may turn parts of the roadmap upside down, that escalations outside of the call rotation can disrupt project work, and so on.

That’s one of the complex trade-offs we can make between staffing, training/onboarding, capacity planning, iterative development, testing approaches, operations, roadmap, and feature delivery. And you know, for some parts of our infra we make different decisions because the consequences and mechanisms differ.

To make these tricky decisions, you have to be able to bring up these constraints, these challenges, and have them be discussed openly without a repression that forces them underground.

One of my favorite examples is from a prior job, where one of my first mandates was to try and help with their reliability story. We went over 30 or so incident reports that had been written over the previous year, and a pattern that quickly came up was how many reports mentioned “lack of tests” (or lack of good tests) as causes, and had “adding tests” in action items.

By looking at the overall list, our initial diagnosis was that testing practices were challenging. We thought of improving the ergonomics around tests (making them faster) and to also provide training in better ways to test. But then we had another incident where the review reported tests as an issue, so I decided to jump in.

I reached out to the engineers in question and asked about what made them feel like they had enough tests. I said that we often write tests up until the point we feel they’re not adding much anymore, and that I was wondering what they were looking at, what made them feel like they had reached the point where they had enough tests. They just told me directly that they knew they didn’t have enough tests. In fact, they knew that the code was buggy. But they felt in general that it was safer to be on-time with a broken project than late with a working one. They were afraid that being late would get them in trouble and have someone yell at them for not doing a good job.

When I went up to upper management, they absolutely believed that engineers were empowered and should feel safe pressing a big red button that stopped feature work if they thought their code wasn’t ready. The engineers on that team felt that while this is what they were being told, in practice they’d still get in trouble.

There’s no amount of test training that would fix this sort of issue. The engineers knew they didn’t have enough tests and they were making that tradeoff willingly.

(note: this slide was cut from the presentation since I was short on time)

Speaking of which, sometimes it’s also fine to drop reliability because there are bigger systemic threats.

Sometimes you can eat downtime or degraded service because it’s going to keep your workload manageable and people from burning out. Or maybe you take a hit because a big customer, one that makes you hit your targets as an org and can prevent layoffs, will put some things over the limit and a component’s performance will suffer. You can’t be the department of “no” and that negotiation has to be done across departments.

Conversely however, you have to be able to call out when your teams are strained, when targets aren’t being met and customers are complaining about it. It means you might be right, and some deadlines or feature delivery could be deferred to make room for others.

How do you deal with capacity planning when making your biggest customer renew their contract prevents you from signing up another one that’s as big? Very carefully, by talking it out with all the involved people.

And sometimes that trade-off is very reasonable. And good engineering requires you to move it earlier in the lifecycle of software than just around incidents. Sometimes it’s much simpler to change the shape of a product’s features than it is to deliver the perfect distributed system. Making your features take the ideal shape to deal with the reality of physics is one of the things a good collaborative approach can facilitate.

So we can make tradeoff negotiation simpler by having these honest discussions, but in many cases this ability to discuss constraints to influence how work takes place brings us to this next step, where we don’t only influence the decisions people make, but surface these challenges to influence how the organization applies its pressures. This is moving from the local level to the alignment to the broader org structure.

Metrics are good to direct your attention and confirm hypotheses, but not as a target, and they’re unlikely to be good for insights. They’re a form of compression, and that compression can be unreliable.

The thing you generally care about is your customer or user’s satisfaction, but there’s a limit to how many times you can ask “would you recommend us to a friend?” and still get a good signal. So you start picking a surrogate variable.

You assume that when the site is down or slow, people are mad, and you make being up and fast a proxy for satisfaction. But then that signal is a bit messy and not super actionable, because it can include user devices or bits of the network you don’t control, plus it’s hard to measure, so you’ll settle for response time at the edge of your infrastructure. This loses some fidelity in the signal, and it gets worse as you find that some teams have more data than others and that they use features differently, so you either need a ton of alarms or fewer, messier ones. Either way, you’re getting further and further away from whether people are actually satisfied.

This loss of context is a critical part of dealing with systems that are too complex to adequately be represented by a single aggregate. Whenever a signal is useful, an in-depth dive is usually worth it if you are looking to embrace complexity.
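To make that loss of context concrete, here is a small Python sketch with made-up numbers. The customer names and response times are hypothetical; the only point is to show how a single aggregate can look healthy while one customer is having a bad time.

```python
from statistics import median

# Made-up response times in milliseconds, keyed by hypothetical customer.
latencies_ms = {
    "customer-a": [80, 90, 100, 95, 85],
    "customer-b": [70, 75, 85, 80, 90],
    "customer-c": [900, 1200, 1100],  # fewer requests, all of them slow
}


def overall_median(samples):
    # Collapse everything into one number, the way a dashboard aggregate would.
    merged = [value for values in samples.values() for value in values]
    return median(merged)


def per_customer_median(samples):
    # Keep one number per customer so outliers stay visible.
    return {customer: median(values) for customer, values in samples.items()}


print(overall_median(latencies_ms))       # 90
print(per_customer_median(latencies_ms))  # customer-c sits at 1100
```

Here the overall median stays at 90 ms while the median for `customer-c` is 1100 ms, which is the kind of detail an in-depth dive would surface and the aggregate would not.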

The metric is better used to attract your attention than as a target or as something that tells you what to know. Seek to explain and understand the metric first, not to change it.

As a related concept, if you act on a leading indicator, it stops leading, particularly when it’s influenced by trade-offs.

Metrics that become their own targets and are gamed of course lose meaningfulness; this is one of the most common issues with counting incidents and then debating whether an outage should or shouldn’t be declared in a way that might affect the tally rather than addressing it directly.

But other metrics are of interest as well. If you evaluate your total capacity by some bottleneck’s value, and this bottleneck is a target of optimization work, you will lose the ability to easily know when or how to scale up because that bottleneck possibly hid something else. I believe this contributes to a non-negligible portion of our incidents at work. We fix a thing that acted as an implicit blocker and off we go into the great unknown.

Our storage engine’s disk storage used to be our main bottleneck. We drove scaling out and rebalancing traffic based on how close we were to heavy usage across multiple partitions. This was a useful signal, but it also drove costs up, and eventually became the target of optimization.

An engineer successfully made our data offloading almost an order of magnitude faster, and eliminated our most glaring scaling issues at the time. Removing this limit however messed with our ability to know when to scale, which then revealed issues with file descriptors, memory, and snapshotting times.

The only good advice I have here is to re-evaluate your metrics often, and change them. I guess there’s also a lesson to be learned that improvements can also cause their own uncertainty and that these successes can themselves lead to destabilizations.

We no longer needed to scale out as aggressively and were free to discover new issues, and so one of our best improvements to the system in recent memory is also a contributor to a lot of operational challenges.

Things that people think are useful are possibly going to happen even if you forbid them. If you forbid people from logging onto production hosts, and they truly think they’ll need it for emergency situations, they’ll make sure there’s still a way for it to happen, albeit under a different name.

On the other hand, things that people think are useless are likely to be done in a minimal way with no enthusiasm, such as lying in your timesheets.

This means that writing a procedure means little unless people actually see its value and believe it’s worth following. Conversely, it means that if you can demonstrate the usefulness and make some approaches more usable, they’re likely to get adopted regardless of what is written down as a list of steps or procedures.

A related concept here is that if you are tracking things like action items after incident reviews and they go in the backlog to die, it may not be that your people are failing to follow through; it might also be that it’s impractical to do so, or it could be that these action items never felt useful, and the process itself needs to be revisited rather than reinforced.

Seeing non-compliance is not necessarily a sign of bad workers. It may rather be a sign of a bad understanding of the workers’ challenges, and point to a need to adjust how work is prescribed.

Getting a small real buy-in into something voluntary may be better than getting fake buy-in into something you’re forcing people to do. Of course if you manage to write a good procedure that people believe is worth following, more power to you, this is going great.

The shortest feedback loop may be attained by giving people the tools to make the right decisions right there and then, and let them do it. Cut the middlemen, including yourself.

How do you make that work? We come back to goal alignments and top priorities being harmonized and well understood. If the pressures and goals are understood better, the decisions made also work better.

That does mean that you have to listen back about how these things have been going, and that not only do you need to trust your people, but they need to trust you back with critical and unpleasant information as well. The feedback flows both ways, and this hinges on psychological safety.

If you’ve ever talked to a contractor asked to help a big organization, the first thing they’ll tell you they do is go talk to the workers with boots on the ground, and ask them what they think needs changing. The workers will often have years of potential improvements backlogged that they’re ready to tell anyone about, either because management wouldn’t listen, or because the workers lost trust that voicing that feedback would yield any result.

Then the contractor brings it up to management as a neutral party, and suddenly it gets listened to and acted upon.

If you’ve lost that trust, then contractors can play that specific role of workers at the periphery of the organization helping drive change, and they can play a very useful function.

But if you have that trust already, maintaining it is crucial because that’s how you get all the good information to help orient and influence things.

Trust also means that if you want people to be innovative, you have to allow them to make mistakes. You can’t get it right the first time all the time; if people can’t be allowed to get it wrong here and there, they won’t be allowed to improve and try new things either.

Finally, let’s look at shifting perspective away from a bare analysis and onto a more systemic point of view. People in specific teams often have a more detailed expert view than you could ever have, but if you’re standing outside of it, your strength might be to understand how the parts interact in a way that isn’t visible from the inside.

The most basic point here is that you can’t expect to change the outcome of these small little decisions that accumulate all the time if you never address the pressures within the system that foster them.

I used to try and weed my lawn a whole hell of a lot and pull the weeds hours a week until someone explained to me that weeds grew easier in the type of soil I had (poor, dry, unmaintained soil) than grass, and pulling the weeds wasn’t the way to go, I needed to actually make the soil good for the grass to crowd out the weeds.

It’s similar when considering this whole idea of root cause analysis—of trying to find the one source of the problem and removing it. If your root cause is at the weed’s level, you’ll keep pulling on them forever and will rarely make decent progress. The weeds will keep growing no matter how many roots you remove.

If you foster good soil, if you create the right environment that encourages the type of behavior you want instead of the type of behaviour you dislike, you have hopes that the good stuff will crowd out the bad stuff. That’s a roundabout way of talking about culture change. And for these, deep dives based on richer narratives and thematic analysis prove more useful.

Also there’s a warning here about trying to change the decisions your people make with carrots and sticks—with incentives. They are not going to fundamentally change what pressures the employees negotiate. The pressures stay the same, all you’re doing is adding more of them, either in the form of rewards or punishments, which makes decision-making more complex and trickier.

Chances are people will keep making the same decisions as they were already, but then they’ll report it differently to either get their bonus or to avoid getting penalized for it. Surfacing, understanding, and clarifying goal conflicts can make things easier or shape work to give them more room. Adding carrots and sticks can make things harder.

But the tip here is probably: look into what are the behaviors you want to see happen, and give them room to grow.

My most successful initiative at Honeycomb is probably creating weekly discussion sessions about operational stuff and on-call. They range from “how do we operate new service X” into trickier discussions like “is it okay to be visibly angry in an incident”, “how do you deal with shit you don’t know or avoid burnout” or “are there times where code freezes are actually a useful thing?”.

Over time we looked into all sorts of weird interactions and the meeting became its own tool.

When we noticed incident reviews were difficult to schedule across departments and timezones, we decided that a good wide incident review is good operational talk and started making the optional time slot, which was already on every engineer’s calendar (and some other departments too), available for them. It became easier for people to run incident reviews, and over time their size grew from 7-8 people, scoped to 1 or 2 teams, to bigger events with 20 to 40 people in them.

We removed a huge but subtle blocker to good feedback loops existing within the organization.

These sorts of small changes are ones you can drive locally with almost no risk of having them run afoul of organizational priorities, and when you see them work, you can use the org structure to expand them everywhere.

I find it useful to keep focusing on what an indicator triggers as a behavior (the interaction) rather than only what it reports directly. This slide here is 4 error budgets from our SLOs, which combine how successful requests are both in terms of speed and errors, compared to an objective we express in terms of the desired fault rate.
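To make the idea of an error budget concrete, here is a minimal sketch of how one could be computed from a batch of requests. It is not how Honeycomb implements SLOs; the `Request` type, the latency threshold, and the function names are all hypothetical and chosen only for illustration. A request counts as good only if it both succeeded and was fast enough, and the budget is the number of bad requests the target fault rate allows.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical shape for a request record; real telemetry will differ.
    duration_ms: float
    is_error: bool


def is_good(req: Request, latency_threshold_ms: float) -> bool:
    """A request is good only if it succeeded AND was fast enough."""
    return (not req.is_error) and req.duration_ms <= latency_threshold_ms


def error_budget_remaining(requests, latency_threshold_ms, target_fault_rate):
    """Return the fraction of the error budget left (negative if overspent).

    target_fault_rate is the desired fault rate, e.g. 0.01 means that
    1% of requests are allowed to be slow or failed.
    """
    total = len(requests)
    if total == 0:
        return 1.0  # no traffic, nothing spent
    bad = sum(1 for r in requests if not is_good(r, latency_threshold_ms))
    allowed_bad = total * target_fault_rate
    if allowed_bad == 0:
        # A zero fault-rate target leaves no budget to spend at all.
        return 1.0 if bad == 0 else float("-inf")
    return 1.0 - bad / allowed_bad


if __name__ == "__main__":
    # 995 good requests, 3 errors, and 2 that were too slow.
    reqs = (
        [Request(100, False)] * 995
        + [Request(100, True)] * 3
        + [Request(5000, False)] * 2
    )
    remaining = error_budget_remaining(reqs, latency_threshold_ms=1000,
                                       target_fault_rate=0.01)
    print(f"error budget remaining: {remaining:.0%}")
```

The number this produces is only a signal. As the rest of this section argues, what matters is the conversation it triggers once the budget runs low or goes negative.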

When we have to pick targets for our platform, people often ask whether we could pick some key SLOs and turn them into the objective. My answer is almost always “I don’t care if we meet the SLOs or not”. I mean, I care, but not like that.

SLOs aren’t hard and fast rules. When the error budget is empty, the main thing that matters to me is that we have a conversation about it, and decide what it is we want to happen from there on. Are we going to hold off on deploys and experiments? Are we able to meet the objectives while on-call, with some scheduled corrective work, or with some major re-architecting? Can we just talk to the customers? Were our targets too ambitious or are we going to eat dirt for a while?

Kneejerk automated reactions aren’t nearly as useful as sitting down and having a cross-departmental discussion about what it is we want to do, as an organization, about these signals of unmet expectations. If it fits within on-call duty, like what is probably the case with the error budget on the top left, then fine.

But in other cases, such as the top right budget here, which seems to show a gradual decline, we have to choose whether to do corrective work (and how/when) to meet the SLO, because the decline wasn’t expected and is undesirable, or maybe to relax it, because it’s actually a natural consequence of new, more expensive features and we need to tweak definitions. Or we could temporarily ignore it because corrective work is already on the way, but not a top priority right now.

The two budgets at the bottom come from SLOs that may never page anyone. But from time to time, we re-calibrate them by asking support whether there are any issues users complain about that we aren’t already aware of. So long as we’re ahead of the complaints, we figure the SLOs are properly defined. But from time to time, we find out that we slipped by getting comments on things our alerting never properly captured. Or maybe we needed to better manage the user’s expectations—that’s also an option.

For any of these choices, we also have to know how this is going to be communicated to users and customers, and having these discussions is the true value of SLOs to me. SLOs that flow outside of engineering teams provide a greater feedback loop about our practices, further upstream, than those that are used exclusively by the teams defining them, regardless of their use for alerting.

Finally, this is where SREs can be placed in a great way to shine. You can be away from the central roles, away from the decision-making, on the periphery. By being outside of silos and floating around the organization’s structure, you are allowed to take information from many levels, carry it around, and really tie the loop at the end of so many decisions made in the organization by noting and carrying their impact back once they’ve hit a production system.

It is an iterative exercise, our sociotechnical systems are alive, and by carrying pertinent signals and amplifying them, you can influence how long it’s gonna take before it all goes to hell anyway.
