Site reliability engineering

Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to IT infrastructure and operations.^[1] The main goals are to create highly reliable and scalable software systems. Site reliability engineering is closely related to DevOps and has been described as a specific implementation of it.^[2]^[3]

History

The field of site reliability engineering originated at Google with Ben Treynor Sloss,^[4]^[5] who founded a site reliability team after joining the company in 2003.^[6] In 2016, Google employed more than 1,000 site reliability engineers.^[7] The concept subsequently spread into the broader software development industry, and other companies began to employ site reliability engineers.^[8] The position is more common at larger web companies, as small companies often do not operate at a scale that would require dedicated SREs.^[8] Organizations that have adopted the concept include Airbnb, Dropbox, IBM,^[9] LinkedIn, Netflix,^[7] and Wikimedia.^[10] According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.^[11]^[12]

Definition

Site reliability engineering, as a job role, may be performed by solo practitioners or by teams. Within a broader engineering organization, they are usually responsible for some combination of system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.^[13] Site reliability engineers often have backgrounds in software engineering, system engineering, or system administration.^[14] Focuses of site reliability engineering include automation, system design, and improvements to system resilience.^[14]

Site reliability engineering, as a set of principles and practices, can be performed by anyone. In this respect SRE resembles security engineering: everyone is expected to contribute to good security practices, but a company may eventually decide to staff specialists for the job. Just as companies may hire security engineers to secure their internet systems, they may hire SREs to define and ensure their reliability goals.

Site reliability engineering has also been described as a specific implementation of DevOps,^[2]^[3] but it focuses specifically on building reliable systems, whereas DevOps is more broadly focused on infrastructure.^[2]

Stephen Gossett wrote in Built In that some companies have rebranded their operations teams as SRE teams with little meaningful change.^[8] The same is perceived to be true of operations teams rebranded as DevOps teams.

Principles and practices

There have been multiple attempts to define a canonical list of site reliability engineering principles.^[15]^[16] Consensus is lacking, but most such definitions include the following characteristics:

Automation or elimination of any repetitive work that is cost-effective to automate or eliminate.
Avoiding the pursuit of much more reliability than is strictly necessary. Defining what is necessary is a practice in itself (see the list of practices below).
Systems design with a bias toward reducing risks to availability, latency, and efficiency.
Observability, meaning the ability to ask arbitrary questions about a system without having to know ahead of time what one wanted to ask.^[17]

Site reliability engineering practices also vary widely, but the following are commonly implemented, at least in part:

Toil management as the implementation of the first principle outlined above.
Defining and measuring reliability goals—SLIs, SLOs, and error budgets.
Non-Abstract Large Scale Systems Design (NALSD) with a focus on reliability.
Designing for and implementing observability.
Defining, testing, and running an incident management process.
Capacity planning.
Change and release management, including CI/CD.
Chaos engineering.
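
The relationship between SLIs, SLOs, and error budgets named in the list above can be illustrated with a small calculation. The following is a minimal sketch, not a canonical implementation: it assumes a request-based SLI (good events divided by total events) and treats the error budget as the fraction of events allowed to fail, that is, one minus the SLO target. The names `SLOReport` and `evaluate_slo` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLOReport:
    """Result of comparing a measured SLI against an SLO target (illustrative)."""
    sli: float              # measured fraction of good events
    error_budget: float     # allowed fraction of bad events (1 - SLO target)
    budget_consumed: float  # share of the error budget already used (1.0 = all)
    slo_met: bool           # True when the SLI is at or above the target


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SLOReport:
    """Compute a request-based SLI and how much of the error budget it consumes.

    Assumption: the SLI is good_events / total_events over one measurement
    window, and slo_target is a fraction strictly between 0 and 1.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events
    error_budget = 1.0 - slo_target
    allowed_failures = error_budget * total_events
    actual_failures = total_events - good_events
    return SLOReport(
        sli=sli,
        error_budget=error_budget,
        budget_consumed=actual_failures / allowed_failures,
        slo_met=sli >= slo_target,
    )


# Example: 1,000,000 requests, 500 of them failed, against a 99.9% target.
report = evaluate_slo(good_events=999_500, total_events=1_000_000, slo_target=0.999)
print(f"SLI={report.sli:.4%} budget consumed={report.budget_consumed:.1%} met={report.slo_met}")
```

In this example a 99.9% target over 1,000,000 requests allows roughly 1,000 failures, so 500 failures consume about half of the budget. A team following the principle of not pursuing more reliability than necessary might use the remaining budget to decide how much release risk it can still accept in the window.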

Implementations

Site reliability engineering teams engage with other teams in their companies, and with the SRE principles and practices, in various forms. The following is a high-level overview of common SRE team implementations:^[18]

Kitchen Sink, a.k.a. “Everything SRE”

Scope of services or workflows covered is usually unbounded.

Infrastructure

Focuses on the reliability of behind-the-scenes systems that help make other teams' jobs more efficient. These teams are often confused with "Platform" or "Platform Operations" teams. Infrastructure SRE teams may pair up with one or more platform engineering teams, but they differ in that Infrastructure SRE teams focus on performing most, if not all, of the work described in the principles and practices lists above. Platform teams tend to focus on building the platform; reliability is desirable, but it is not their sole priority.

Tools

Focuses on tools to measure, maintain, and improve system reliability, for example Nagios Core.

Product or application

An SRE team dedicated to a product or application. Some large companies staff several of these.

Embedded

Usually solo SRE practitioners or pairs staffed within a software engineering team to apply most of the principles and practices described above.

Consulting

Consults on how to implement SRE principles and practices. These teams usually consist of experienced SREs who have worked on teams in one or several of the implementations above. SREs on externally facing consulting teams are often called "Customer Reliability Engineers". They rarely, if ever, change a customer's configuration or code.

Large companies that have adopted SRE tend to combine the implementations described above, including multiple teams of the same implementation: for example, multiple Product/application SRE teams to meet the specific demands of several products, and an Infrastructure SRE team paired with a platform engineering group to meet the reliability goals of a platform common to those products or applications.

Industry

The USENIX organization has held an annual SREcon conference since 2014 for site reliability engineers in the industry, and also holds regional conferences with similar themes.^[19]

References

[1] "Evaluating where your team lies on the SRE spectrum". Google Cloud Blog. Retrieved 2021-06-26.
[2] Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-5118-7. OCLC 945577030.
[3] Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google.
[4] Hill, Patrick. "Love DevOps? Wait until you meet SRE". Atlassian. Retrieved June 17, 2021.
[5] "What is SRE?". Red Hat. Retrieved June 17, 2021.
[6] Treynor, Ben (2014). "Keys to SRE". USENIX SREcon14. Retrieved June 17, 2021.
[7] Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?". TechCrunch. Retrieved June 17, 2021.
[8] Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?". Built In. Retrieved June 17, 2021.
[9] "Site Reliability Engineering". IBM Cloud Education. IBM. November 12, 2020. Retrieved June 21, 2021.
[10] "SRE - Wikitech". wikitech.wikimedia.org. Retrieved 2021-10-17.
[11] Oehrlich, Eveline; Groll, Jayne; Garbani, Jean-Pierre (2021). Upskilling 2021 Enterprise DevOps Skills Report (PDF) (Report). DevOps Institute. Retrieved June 17, 2021.
[12] Oehrlich, Eveline (May 4, 2021). "What it takes to be a site reliability engineer". TechBeacon. Micro Focus. Retrieved June 17, 2021.
[13] Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
[14] Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers" (PDF). ;login:. Vol. 40, no. 3. pp. 35–39. Retrieved June 17, 2021.
[15] "The 7 SRE Principles [And How to Put Them Into Practice]". www.blameless.com. Retrieved 2021-06-26.
[16] "Evaluating where your team lies on the SRE spectrum". Google Cloud Blog. Retrieved 2021-06-26.
[17] "Learn about observability | Honeycomb". docs.honeycomb.io. Retrieved 2021-06-26.
[18] "SRE at Google: How to structure your SRE team". Google Cloud Blog. Retrieved 2021-06-26.
[19] "Usenix SREcon". USENIX. 2021. Retrieved June 17, 2021.

External links

Awesome Site Reliability Engineering: resources list
How they SRE: resources list
SRE Weekly: weekly newsletter devoted to SRE
SRE at Google: landing page for learning more about SRE in Google
Komodor K8s Reliability: learning center with resources for SREs working with Kubernetes
