LinkedIn’s latest report, ‘Emerging jobs 2020,’ lists Site Reliability Engineer (SRE) among the top 10 in-demand jobs for 2020. Companies in the Telecom, Marketing, Advertising, and Information Technology and Services industries across the US, UK, and India are at the forefront of hiring SREs, a trend explained by their aggressive automation and digitization strategies. Site reliability engineers are also among the highest-paid IT professionals, which makes the job coveted by candidates.
What do Site Reliability Engineers do?
Site Reliability Engineers are ‘custodians’ of the hundreds or even thousands of software systems owned by a company. Their job is to ensure that these systems remain ‘reliable’ for customers as the company scales. SRE is a comparatively new role that combines software engineering with IT systems management. On a typical day at work, an SRE spends their time writing code to automate the management of software systems and resolving the infrastructural and operational challenges those systems throw up.
A little backstory
SRE was conceptualized by Google in 2003. Back then, Google put together a small team of software engineers to tackle its large-scale site problems. They performed the task so well that it captured the attention of other technology companies like Amazon, Dropbox, and Netflix. Over time, SRE became a distinct IT domain dedicated to developing automated practices for areas like disaster response, change management, latency, performance, and capacity planning. Today, SRE has a huge community and its own conference, SREcon.
SREs balance operations work with development work. They help teams decide which new features to launch, and when, by using service-level agreements (SLAs) to set out the required reliability of the system through service-level indicators (SLIs) and service-level objectives (SLOs). An SLI is a defined measure of a particular aspect of the level of service provided. An SLO is a target value or range for a service level, as measured by an SLI. Key SLIs include error rate, system throughput, and request latency. The SLO for the required system reliability then determines how much downtime is acceptable. That acceptable level of downtime is called the error budget: the maximum allowable amount of errors and outages.
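To make the relationship between an SLO and its error budget concrete, here is a minimal sketch of the arithmetic. The function name `downtime_budget`, the 99.9% availability target, and the 30-day window are illustrative assumptions, not figures from any particular company.

```python
from datetime import timedelta


def downtime_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the maximum downtime an availability SLO allows over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    # The error budget is whatever the SLO does not promise: 1 - target.
    return window * (1.0 - slo_target)


# Hypothetical example: a 99.9% availability SLO over a 30-day window.
budget = downtime_budget(0.999, timedelta(days=30))
print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

Under these assumed numbers the budget comes to roughly 43 minutes of downtime in the 30-day window. A stricter SLO shrinks the budget in proportion.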
Every time the development team releases a new feature, it spends part of the error budget. Using the SLO and the error budget, the team can decide whether a product or service can launch based on the budget that remains. If a service is running within its error budget, the development team can launch at any time. If the system records more errors, or is down for longer, than the error budget allows, no new launches can happen until the errors are back within the specified limit.
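The launch rule described above can be sketched as a simple check. This is an illustration under stated assumptions rather than any company's actual policy: the `ErrorBudget` class, its field names, and the request counts are hypothetical, and the budget here is counted in failed requests instead of minutes of downtime.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served in the current SLO window
    failed_requests: int  # requests that failed in the same window

    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return self.total_requests * (1.0 - self.slo_target)

    def remaining(self) -> float:
        # Positive: budget left to spend. Negative: budget overspent.
        return self.allowed_failures() - self.failed_requests

    def can_launch(self) -> bool:
        # Assumption: launches are allowed while failures stay within budget.
        return self.remaining() >= 0


# Hypothetical services measured against a 99.9% success-rate SLO.
healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
exhausted = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)

for name, service in (("healthy", healthy), ("exhausted", exhausted)):
    verdict = "launch allowed" if service.can_launch() else "launches frozen"
    print(f"{name}: {service.remaining():.0f} failures of budget left, {verdict}")
```

In practice the freeze is usually a team agreement rather than an automatic switch, so a check like this would inform the decision instead of enforcing it.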
How to become a Site Reliability Engineer
For an entry-level SRE role, companies prefer candidates with a bachelor’s degree in Computer Science or a related field. Certification as a Site Reliability Engineer or a Software Engineer is an added advantage. Those with work experience as a systems administrator, DevOps engineer, or software engineer have a significant edge over others. Before making an offer, the hiring manager thoroughly scrutinizes the candidate’s aptitude for programming languages, familiarity with operating systems, and knowledge of automation technologies. Here is the technical skill set that makes a candidate sought-after in the SRE job market:
- Operating systems like Linux and Windows
- Automation technologies
- Cloud computing technologies including Software as a Service, Platform as a Service and Infrastructure as a Service
- Container orchestration tools like Kubernetes or Docker Swarm, and configuration management tools like Ansible, Chef, Puppet, and SolarWinds
After getting through the door, an SRE has to work constantly to stay on top of technology shifts. They should become familiar with the front-end and back-end technologies that make up a software system so that they can understand a problem and expedite a solution.
From SRE to where?
Junior-level SREs who perform exceedingly well in their roles get a chance to work on larger and more complex computer systems. Their willingness to collaborate with other IT experts, enthusiasm for learning new technologies, ability to thrive under pressure, think on their feet, and solve problems, and excellent communication skills go a long way toward pushing them up the ladder. Senior site reliability engineers often choose to continue in the same role for a long time before moving into managerial positions. Shifting into a similar role, such as DevOps engineer or system administrator, is not an unusual move.
Some companies that frequently hire SREs are Google, Tesla, Microsoft, Twitter, Adobe, Slack, and Apple, as well as non-tech giants like The Walt Disney Co., Mastercard, and Capital One.