Why Production Monitoring Needs a Massive Overhaul

Most of the problems in production monitoring are at least doing us the courtesy of showing their faces while they try to destroy us from within.

But what about the hidden problems? The problems bubbling under the surface, undermining your production monitoring efforts from within, without showing up in the stats, without inciting angry phone calls?

When we go beyond the surface level, we face issues that may not initially look like problems at all, because they are disguised by fundamental axioms of our production monitoring toolset.

Emergent issues are created when a toolset designed to solve a problem creates a new set of problems all of its own. None of these problems can be derived by looking at any single part of the process, which is why they are easily overlooked.

The real leap in any system comes when the emergent problems are tackled, broken down to their elements, and politely destroyed.


Emergent Issue #1: Bias Bloat

The Anchoring Effect

Many tools require us to ask many questions before and during our work with them. But the Anchoring Effect is a bias that ensures we are setting ourselves up to fail in this. When you’re the one asking the questions, you’ll always ask questions that produce the answers you seek (it’s a yin-yang). Ideally, a system would recognize and report problems without asking you to define them yourself.

Additionally, since many services are very expensive, a lot of companies choose to give only some of the information to the system, or monitor only parts of their infrastructure. This drives them further into their own Anchoring Bias.
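To make the idea of a system that reports problems without being told what to look for a little more concrete, here is a minimal sketch. It assumes a single stream of numeric samples (latency, for instance) and flags any point that strays far from a baseline learned from the recent past. The function name `find_anomalies`, the window size, and the deviation factor are all illustrative assumptions, not taken from any particular tool, and a rolling z-score is only one of many possible techniques.

```python
from collections import deque
from statistics import mean, stdev


def find_anomalies(values, window=20, z_threshold=3.0):
    """Return indices of samples that deviate strongly from the recent baseline.

    Hypothetical helper: the baseline is learned from the previous `window`
    samples, so nobody has to define an absolute limit such as
    "latency above 250 ms" up front.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        # Only judge a sample once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A flat history (sigma == 0) gives no scale to compare against.
            if sigma > 0 and abs(value - mu) / sigma > z_threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Made-up latency samples: a steady pattern with one spike at index 30.
latencies = [100 + (i % 5) for i in range(40)]
latencies[30] = 400
spikes = find_anomalies(latencies)
```

The deviation factor is still a choice someone has to make, but it is relative to what the system itself has been doing, rather than an answer the operator anchored on in advance.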

Change Blindness

All types of monitoring tools fall victim to our innately human bug: Change Blindness. Change blindness occurs when a change in visual perception is so minute as to go undetected. When you’re faced with mountains of log lines, a constant barrage of alerts, or pretty yet ultimately interchangeable dashboard graphs, you end up missing a lot of what’s going on.

Monitoring is setting itself up to be a reactive field

The words we use to describe the world shape our perception of it. When we use a word like “monitoring”, we’re condemning everything under its umbrella to the passive, observing role in the DevOps cycle. Our tools themselves are painfully reactive, further driving us into a corner of reaction, rather than action.

In addition, production monitoring issues randomize your team’s schedule and pipeline, hurt morale, and keep them in a reactive state. This creates a vicious cycle – your team’s own reactiveness keeps them in a reactive state.

The illusion of having nailed it

People need to learn a new language with every new tool they pick up. This is very time consuming, and creates an additional layer of work on top of the “real” work that needs doing. This feels like a necessary evil – it’s not. It’s just plain wasteful.

People also tend to confuse “monitoring” something with “resolving” it: they think that because they have pretty graphs, shiny dashboards on a big office screen for everyone to ogle, and a big red flashing button, they’re safe.

Newsflash, they’re not safe. These are red herrings. What IT teams really need isn’t flashy tools, it’s quite the opposite.

Overreliance on numbers

Time-series dashboards and numbers-driven tools lead us to an overreliance on cold data. Just as the same numbers can be used to excuse different behaviors at different times due to their context, so we must keep in mind that our systems have become too complex to be fully understood using numbers.

You can track and display a mouse as a series of metrics, but that wouldn’t be a mouse. And it would give you a very partial view of what’s actually going on. We need tools that open our eyes to what our systems have now become, and find better ways to speak – and listen – to them.


Emergent Issue #2: Hysterical Homo Sapiens

Humans and their pesky variability

Everyone has their own ideas about what their ideal monitoring system actually does, and this creates weird Frankensteins – people come in, create amalgamated systems from a mesh of varied tools, and if they do a good job, they get promoted. Then other people take their job, and these people come in with their own set of ideas and ideals they’re itching to implement. Putting aside the time and effort every such cycle costs, the biggest risk is that soon your lovely boulevard is a hot steaming favela of a monitoring solution.

Rocinha favela in Rio de Janeiro
I’m thinking of putting up a patio.

Throw in the variability in production monitoring teams’ technical knowhow, hands-on experience with different types of tools, and general aptitude. Now you see that your monitoring team’s level of competence is likely to be as varied as the monitoring stack itself.

Adverse psychological effects

As companies scale up (mainly due to recent years’ virtualization bonanza), and run more and more micro-services that need tracking, we are seeing an explosion in adverse psychological effects caused by this onslaught of monitoring needs.

Some of these include:
Alarm Fatigue – workers’ senses and cognitive abilities are worn out by overstimulation.

Analysis Paralysis – being in a constant state of analysis causes workers to stop being able to come to any sensible decision, and finally any decision at all.

Information Pollution – workers are exposed to too much irrelevant data, which ends up obscuring the relevant data.

Information Overload – workers are swarmed by an overabundance of unactionable information, rendering them unable to properly process anything.

Ego Depletion – remaining in a prolonged state of constant decision-making, workers end up depleted and unable to come to a decision with a clear enough mind.
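One common way to take the edge off effects like Alarm Fatigue and Information Overload is to collapse repeated alerts into a single notification. The sketch below is only an illustration under stated assumptions: alerts are plain dictionaries with hypothetical `service`, `check`, and `timestamp` (seconds) fields, and the five-minute grouping window is an arbitrary choice, not a recommendation from any specific tool.

```python
def group_alerts(alerts, window_seconds=300):
    """Collapse repeats of the same alert into one group with a counter.

    Hypothetical helper: alerts sharing a (service, check) key are merged
    when they arrive within `window_seconds` of the group's first alert.
    """
    groups = []        # groups in order of first appearance
    open_groups = {}   # most recent group for each (service, check) key
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        key = (alert["service"], alert["check"])
        group = open_groups.get(key)
        if group is not None and alert["timestamp"] - group["first_seen"] <= window_seconds:
            # Same problem, still inside the window: count it, don't re-notify.
            group["count"] += 1
            group["last_seen"] = alert["timestamp"]
        else:
            # New problem, or the old group has expired: start a fresh group.
            group = {
                "service": alert["service"],
                "check": alert["check"],
                "first_seen": alert["timestamp"],
                "last_seen": alert["timestamp"],
                "count": 1,
            }
            open_groups[key] = group
            groups.append(group)
    return groups


# Made-up alerts: five raw alerts collapse into three notifications.
raw_alerts = [
    {"service": "api", "check": "latency", "timestamp": 0},
    {"service": "api", "check": "latency", "timestamp": 60},
    {"service": "db", "check": "disk", "timestamp": 90},
    {"service": "api", "check": "latency", "timestamp": 120},
    {"service": "api", "check": "latency", "timestamp": 1000},
]
notifications = group_alerts(raw_alerts)
```

Grouping like this reduces the number of interruptions, but it does not decide which of the remaining notifications actually matter. That part of the problem is still left to the humans.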

Tools as toys

We’re geeks. We like to go deep, dig into issues and not emerge until we have all the answers to all our questions. This is what makes us great at what we do.

Many of our monitoring tools tap into this part of our psyche and encourage us to go deep. There is a thrill of the chase and a sense of accomplishment in moving through our monitoring stack and breaking down every little thing that happens. The way many tools are constructed invites users to dig deeper, to look beyond the surface and embrace an “information is power” mentality. Unfortunately, this ends up harming the process, as it encourages people to spread wide instead of zeroing in on relevant issues immediately.

Simply put, we need to come to terms with the fact that there will always be too much information for us to reasonably analyse ourselves. In an age of big data, we need to let go and let the machines do the work – we need to make room for smart data. As DevOps and monitoring specialists, we need tools that respect our time and divert our skills to where they’re really needed.


Emergent Issue #3: Crushing Conatus

Micro-service entanglement

As outlined in the first chapter of this series, Conatus is the name given to any system’s innate inclination to continue to exist and enhance itself. That is, the power driving it not to change. When your monitoring stack is already heavily invested in a particular mode of operation, including all its limitations, it’s hard to break away and consider that there may be another way to do things. And the shift towards a world of micro-services is making this entanglement more and more severe.

Too much integration

I know what you’re thinking: in a previous section we’ve covered the exact opposite problem, of too few integrations. So what’s up with listing extensive integrations as a problem?

Well, it’s complicated. Integrations are great, but they do have a dark side: even when the tools you’ve integrated with work impeccably well (when do they ever?), you’ve tied yourself to the way THEY work; both wittingly and unwittingly, you’ve created dependencies which affect your product’s development cycle, limiting your ability to expand beyond the reach of the tools upon which you’ve come to depend. This makes for a strong case in favor of all-in-one suites. One coherent set of skills and of training, one tool in your stack to consider when you’re trying to evolve beyond your current state.

Monitoring is not about you – it’s about your customers

Monitoring has become too much about the needs of the company. It should be more about providing value to the company’s customers, and less about what the IT team thinks the issues are. In other words, it should focus on the outcome as seen through customers’ eyes – starting with what the customer cares about and seeing how to provide that most efficiently.

What is clear is that all the tools in our toolset are just that – tools.

None of these are monitoring solutions.

Ultimately, the industry as a whole strains under the weight of tools that were designed to manage the problem, rather than solve it.

We need to shift our perception and come to terms with the happy realization that technology has now enabled us to stop playing around with tools, and set up solutions in their place.
