Summary of Seeking SRE

Aram Koukia
Published in Koukia
4 min read · Apr 12, 2023

Seeking SRE is a book that explores the diverse and evolving field of Site Reliability Engineering (SRE), a discipline that aims to keep complex systems and applications reliable and performant. Curated and edited by David N. Blank-Edelman, it contains more than two dozen chapters written by practitioners and leaders in the SRE world.

The book covers a wide range of topics, such as:

- Different ways of implementing SRE and SRE principles in various settings, such as startups, enterprises, third parties, and non-dedicated teams.
- How SRE relates to other approaches such as DevOps, incident management, observability, and chaos engineering.
- Specialties on the cutting edge that will soon be commonplace in SRE, such as machine learning, security, ethics, and sustainability.
- Best practices and technologies that make practicing SRE easier, such as metrics, dashboards, automation, playbooks, and postmortems.
- The important but rarely explored human side of SRE, such as hiring, interviewing, diversity, inclusion, burnout, culture, and communication.

The book is intended to bring readers into some of the important conversations going on in the SRE field right now, and to inspire them to find their own ways of applying SRE principles in their contexts. It is not a prescriptive guide or a definitive reference, but a collection of perspectives and experiences that showcase the breadth and depth of SRE.

Some of the key takeaways from the book

  • SRE is not a one-size-fits-all solution. It can be adapted to different organizations, teams, systems, and goals. There is no single right way to do SRE.
  • SRE is not just about technical skills or tools. It also requires a mindset of continuous learning, improvement, collaboration, and ownership, and a willingness to balance trade-offs between reliability and innovation.
  • SRE is not only about systems. It is also about people. People are the ones who design, build, operate, maintain, and use systems. People are also the ones who can suffer from stress, fatigue, frustration, and dissatisfaction when systems fail or perform poorly. SRE should aim to improve both the quality of systems and the quality of life for people.

In short, Seeking SRE presents SRE as a field that can be adapted to different situations and goals, that requires both technical skills and a mindset of continuous learning and improvement, and that involves balancing trade-offs between reliability and innovation while caring for both systems and people.

Some examples of SRE in practice

  • Google, which pioneered the SRE model and has shared its best practices and case studies on topics such as service level objectives (SLOs), incident management, postmortems, and automation.
  • PagerDuty, which provides a platform for incident response and has adopted and adapted the Incident Command System (ICS) framework to coordinate and communicate during incidents.
  • Standard Chartered Bank, which has been implementing SRE as its primary support model and has improved its engineering culture and capabilities, as well as its reliability and performance metrics.
  • Moogsoft, which provides an observability platform that leverages artificial intelligence and automation to help SRE teams detect and resolve incidents faster and more efficiently.
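
To make the idea of a service level objective (SLO) a bit more concrete, here is a minimal Python sketch of an error budget calculation, which is one common way teams weigh reliability against shipping new changes. It is not taken from the book; the function name, the request-based SLO, and the sample numbers are all assumptions for illustration.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    Hypothetical helper: assumes a request-based SLO, where slo_target is the
    fraction of requests that must succeed (for example 0.999).
    A negative result means the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # Failures the SLO tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Made-up numbers: a 99.9% SLO over one million requests with 250 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
```

A team might treat a healthy remaining budget as room to release more aggressively, and a spent budget as a signal to slow down and focus on reliability work.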

Some challenges of implementing SRE

  • Following the methodology: Although there is no single right way to do SRE, adopting it still requires a clear understanding of its principles, practices, and tools. It also requires a commitment to a culture of reliability, ownership, and collaboration among different teams and stakeholders.
  • SRE training: SRE involves a combination of technical skills and soft skills that may not be readily available or familiar to existing engineers or operators. It may require investing in training programs, mentoring, or hiring to build up the SRE capabilities and mindset.
  • Getting stakeholders on your side: SRE may face resistance or skepticism from some stakeholders who are used to traditional ways of working or who are concerned about the costs or risks of changing the status quo. It may require convincing them of the benefits and value of SRE, as well as addressing their fears and expectations.
  • Simulations: SRE relies on simulations such as load testing, chaos engineering, and disaster recovery drills to test and improve the reliability and resilience of systems. However, these simulations may be challenging to design, execute, and analyze, especially for complex or distributed systems.
  • Finding practitioners: SRE is a relatively new and evolving field that may not have a large pool of experienced or qualified practitioners. It may be hard to find or recruit SREs who have the right mix of skills, knowledge, and attitude for the role.

Software Engineer, Engineering Leader and Manager @ AWS. Living my dream life. http://koukia.ca/