The SRE will play a pivotal role in modernizing IT operations by implementing observability practices and automating toil. The position requires a deep understanding of Site Reliability Engineering (SRE) principles, modern observability tools, and automation techniques to ensure scalable, reliable, and efficient IT systems. It calls for a strategic thinker with hands-on expertise who can lead modernization efforts while fostering a culture of reliability and innovation.
Key Skills:
Strong expertise in implementing Site Reliability Engineering (SRE) principles.
Advanced knowledge of establishing observability with Dynatrace and Datadog (primary skills).
Proficiency in automation and scripting using Python and Ansible (primary skills).
Strong experience with cloud platforms: AWS and Azure (primary skills).
Solid understanding of containerization and orchestration tools like Docker and Kubernetes.
Proficiency in cloud-native distributed systems and microservices architecture.
Exposure to AI/ML techniques for predictive analytics and automated problem resolution.
Familiarity with CI/CD pipelines and with enabling automated release and deployment solutions.
Nice to have: experience with chaos engineering tools such as Gremlin or Chaos Monkey, and with implementing automation frameworks for resilience tracking.
Ability to manage and prioritize multiple projects in a fast-paced environment.
Strong interpersonal and communication skills to work effectively across teams.
Excellent problem-solving, analytical thinking, and adaptability.
Strategic mindset balancing engineering excellence with business priorities.
Preferred Qualifications:
12+ years of experience in IT operations, SRE, or DevOps roles.
Proven track record as an SRE implementing observability and automation solutions in large-scale environments.
Certifications in cloud platforms, observability tools, and other SRE-related areas.
Hybrid: 2 to 3 days onsite per week