Senior Software Engineer, SRE, Cloud Incident Response
Google, London, UK
Minimum qualifications:
* Bachelor’s degree in Computer Science, a related field, or equivalent practical experience.
* 5 years of experience with software development in one or more programming languages.
* 5 years of experience with data structures or algorithms.
* 3 years of experience in designing, analyzing, and troubleshooting distributed systems, and 2 years of experience leading projects and providing technical leadership.
* Experience in SRE or incident management/response environments.
Preferred qualifications:
* Experience working in computing, distributed systems, storage, or networking.
* Experience with telemetry systems and with incident and risk management.
* Experience in designing, analyzing, and troubleshooting large-scale distributed systems.
* Ability to debug, optimize code, and automate routine tasks.
* Excellent problem-solving skills, with strong verbal and written communication abilities.
About the job
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services—both internally critical and externally visible—maintain reliability, uptime appropriate to customer needs, and a rapid rate of improvement. Additionally, SREs monitor system capacity and performance continuously.
Much of our development focuses on optimizing existing systems, building infrastructure, and automating tasks. On the SRE team, you'll tackle the unique challenges of scale in Google Cloud, leveraging your expertise in coding, algorithms, and large-scale system design. Our culture emphasizes curiosity, problem solving, and openness, fostering collaboration and innovation in a supportive environment.
Responsibilities
* Ensure Google Cloud Platform (GCP) stability and reliability through incident support, driving customer outcomes, and cross-team collaboration.
* Create training and processes for incident management, collaborating with Cloud Support leadership.
* Develop systems and tools to improve incident visibility, issue detection, and communication with stakeholders.
* Identify and escalate risks, and reduce the likelihood of major incidents through pragmatic approaches.
* Support system scalability and reliability throughout their lifecycle via design consulting, platform development, capacity planning, and automation.
Google is an equal opportunity employer committed to diversity and inclusion. We value a workforce that reflects our users and fosters a culture of belonging. We provide equal employment opportunities regardless of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy, or related conditions. See Google's EEO Policy and related resources.
Because Google is a global company, English proficiency is required for all roles unless otherwise specified.
Note: Google does not accept resumes from recruitment agencies and is not responsible for fees related to unsolicited resumes.