The SRE will play a pivotal role in modernizing IT operations by implementing observability practices and automating toil. The position requires a deep understanding of Site Reliability Engineering (SRE) principles, modern observability tools, and automation techniques to keep IT systems scalable, reliable, and efficient. It calls for a strategic thinker with hands-on expertise who can lead modernization efforts while fostering a culture of reliability and innovation.
Key Skills:
* Strong expertise in implementing Site Reliability Engineering (SRE) principles.
* Advanced knowledge of establishing observability with Dynatrace and Datadog (primary skills).
* Proficiency in automation and scripting using Python and Ansible (primary skills).
* Strong experience with the AWS and Azure cloud platforms (primary skills).
* Solid understanding of containerization and orchestration tools like Docker and Kubernetes.
* Proficiency in cloud-native distributed systems and microservices architecture.
* Exposure to AI/ML techniques for predictive analytics and automated problem resolution.
* Familiarity with CI/CD pipelines and with enabling automated release and deployment solutions.
* Nice to have: experience with chaos engineering tools such as Gremlin or Chaos Monkey, and with implementing automation frameworks for resilience tracking.
* Ability to manage and prioritize multiple projects in a fast-paced environment.
* Strong interpersonal and communication skills to work effectively across teams.
* Excellent problem-solving, analytical thinking, and adaptability.
* Strategic mindset balancing engineering excellence with business priorities.
Preferred Qualifications:
* 12+ years of experience in IT operations, SRE, or DevOps roles.
* Proven track record as an SRE implementing observability and automation solutions in large-scale environments.
* Certifications in cloud platforms, observability tools, and other SRE-related areas.
Hybrid: 2-3 days onsite per week