Spiraling Ops Debt & the SRE Coding Imperative

Written by Rob Hirschfeld, CEO and co-founder of RackN

Every Ops team I know is underwater and doesn’t have the time to catch their breath.

Why does the load increase and leave Ops behind?  It’s because IT is increasingly fragmented and siloed by both new tech and past behaviors.  Many teams simply step around their struggling compatriots and spin up yet more Ops work adding to the backlog. Dashing off yet another Ansible playbook to install on AWS is empowering but ultimately adds to the Ops sustaining backlog.

That terrifying observation two years ago led me to create this graphic showing how operations is getting swamped by new demand for infrastructure.

Ops Tsunami
Ops Tsunami

 

It’s not just the amount of infrastructure: we’ve got an unbounded software variation problem too.

It’s unbounded because we keep rapidly evolving new platforms and those platforms are built on rapidly evolving components.  For example, Kubernetes has a 3-month release cycle.  That’s really fast; however, it built on other components like Docker, SDN and operating systems that also have fast release cycles.  That means that even your single Kubernetes infrastructure has many moving parts that may not be consistent in your own organization.  For example, cloud deploys may use CoreOS while internal ones use a Corporate Approved Centos.

And the problem will get worse because infrastructure is cheap and developer productivity is improving.

 Since then, we’ve seen a container fueled explosion in developer productivity and AI driven-rise in new hardware-flavored instances. Both are power drivers of infrastructure consumption; however, we have not seen a matching leap in operations tooling (that’s a future post topic!).

That’s why the Google SRE teams require a 50% automation vs Ops ratio.  

If the ratio is >50 then the team slowly sinks under growing operational load.  If you are not actively decreasing the load via automation, then your teams get underwater and basic ops hygiene fails.

This is not optional – if you are behind now then it will just get worse!

The escape from the cycle is to get help.  Stop writing automation that you can buy or re-use.  Get help running it.  Don’t waste time solving problems that other people have solved.  That may mean some upfront learning and investment but if you aren’t getting out of your own way then you’ll be run over.

This post is part of Rob Hirschfeld’s SRE series grounded in the ideas inspired by the Google SRE book.

One thought on “Spiraling Ops Debt & the SRE Coding Imperative

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s