You don’t need to be a developer or DevOps investor to know that software developers like to spend their time creating apps and systems, not maintaining them. The cloud has been an incredibly powerful enabler, letting developers build and automate workloads that were not previously possible, and do so faster and more thoroughly. But most software fanatics only see this side of the story. What people often forget is that every innovation, whether a giant leap or a step change, comes at a cost; it’s rarely free or without burden. As the enterprise continues its march toward a multi-cloud, hybrid and distributed world, we can’t forget to enable the maintainers: the people focused on keeping it all from crumbling into a million pieces.
Google first defined site reliability engineering in 2003. While their internal stack and engineering org were and still are unique, they were simply ahead of the curve in recognizing the need for reliability-focused engineers tasked with keeping increasingly complex cloud stacks from crashing and with identifying automation gaps. Google realized that while traditional software developers do spend a portion of their time on what they consider the less exciting exercises of debugging, building internal tools, and reviewing code, site reliability was not something your typical software developer could handle.
In the nearly 20 years since SRE became a defined role, site reliability engineering has become widely adopted across enterprises and cloud-first scale-ups, with demand for SRE jobs exploding in parallel. However, while most of these enterprises have embraced automated DevOps tools for building, testing, and deploying code, production ops remains largely manual. Aside from a handful of incident management and ticketing systems for detecting incidents and assigning them to the right person, SREs have little tooling to work with. They are managing ever-growing fleets and increasingly complex environments (VMs, containers, microservices, managed services, multiple clouds and accounts) with the technological equivalent of bats and shovels.
Enter Shoreline: Modern Production Operations
Shoreline was born out of the pain of broken production operations, which founder Anurag Gupta observed and experienced firsthand while launching and building AWS Database and Analytics to $5B+ ARR. When we first met Anurag in 2019, it was a few months after he’d left AWS to start Shoreline. He knew incident remediation in production cloud environments was monolithic and that SREs were the torch bearers: they were the ones he had to empower to build the world of reliable infrastructure he envisioned. It was clear to us that there was a massive brain (with first-mover network effects) attacking a colossal problem, and we were excited to follow the story.
It has been amazing to see the combination of deep enabling technology and product innovation that has led to where Shoreline is today. Shoreline started from the ground up, spending over 2 years (!) developing its underlying SQL-like ops programming language, which turns traditional observability logs and alerts into metrics expressed as code that commands and complex executables can be run against. This is the underlying technology of Shoreline’s “remediations-as-code” platform: a centralized place for SREs to query and debug infrastructure as code and, more importantly, a platform to create and deploy “set it and forget it” remediations that automatically detect and fix known incidents (e.g. bouncing a server, cluster management, data access control).
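To make the “remediations-as-code” idea concrete, here is a minimal sketch in plain Python. It is not Shoreline’s language or API; every name in it (`Remediation`, `run_remediations`, the metric keys, the thresholds and the action strings) is a hypothetical stand-in chosen for illustration. The point is only the shape of the idea: each known incident is written down once as a detection rule paired with a fix, and then applied across the whole fleet.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Remediation:
    """A known incident expressed as code: a detection rule plus a fix."""
    name: str
    detect: Callable[[Metrics], bool]  # returns True when the incident is present
    fix: Callable[[str], str]          # returns the action to take on a host


def run_remediations(fleet: Dict[str, Metrics],
                     remediations: List[Remediation]) -> List[str]:
    """Check every host against every rule and collect the resulting actions."""
    actions: List[str] = []
    for host, metrics in fleet.items():
        for remediation in remediations:
            if remediation.detect(metrics):
                actions.append(remediation.fix(host))
    return actions


# Hypothetical rules; thresholds and action strings are assumptions.
REMEDIATIONS = [
    Remediation(
        name="disk_full",
        detect=lambda m: m.get("disk_used_pct", 0.0) > 90.0,
        fix=lambda host: f"rotate-logs {host}",
    ),
    Remediation(
        name="unresponsive",
        detect=lambda m: m.get("heartbeat_age_s", 0.0) > 60.0,
        fix=lambda host: f"bounce {host}",
    ),
]

# Hypothetical fleet snapshot; a real system would pull live metrics.
FLEET = {
    "web-1": {"disk_used_pct": 95.0, "heartbeat_age_s": 5.0},
    "web-2": {"disk_used_pct": 40.0, "heartbeat_age_s": 120.0},
    "web-3": {"disk_used_pct": 30.0, "heartbeat_age_s": 2.0},
}

if __name__ == "__main__":
    for action in run_remediations(FLEET, REMEDIATIONS):
        print(action)
```

In this toy version the fix only returns a string describing the action; a production system would execute it, record the outcome, and guard against repeated or unsafe actions.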
From speaking with some of Shoreline’s customers, it was very clear their solution was both unique and powerful. Not only were customers raving about the platform’s ability to automatically resolve common incidents in production, but they also highlighted how Shoreline’s intuitive UI/UX allowed broader team members to safely repair incidents, in addition to enabling faster live-site debugging of new incidents. When companies can’t hire enough SREs as it is, the ROI of Shoreline is clear. Dataiku, for example, one of Shoreline’s customers, estimated that Shoreline saved each engineer about 10 days of work per month.
As enterprises continue to revamp their systems along their respective digital transformation journeys, we have little doubt SREs will become more critical than ever. Furthermore, it’s clear to us that the power of SREs lies not in their ability to manually resolve incidents and respond to toil, but in their role as the developers of self-healing, reliable infrastructure; and we can’t think of a more important enabler of this than Shoreline.