Responsibilities
* Serve as the primary technical point of contact during critical incidents, ensuring rapid resolution and minimal business impact.
* Lead and coordinate cross-functional teams (engineering, support, operations) during incident response, including root cause analysis, mitigation strategies, and post-mortem reviews.
* Monitor service health using tools such as CloudWatch, OpenSearch, Kibana, and Grafana, and proactively identify potential issues before they impact customers.
* Troubleshoot and debug production issues in web architecture, microservices, and cloud environments.
* Manage and maintain system reliability by implementing best practices in observability, monitoring, and alerting.
* Collaborate closely with Software Development, Infrastructure, and Operations teams to improve incident response processes and system resilience.
* Manage incidents related to AWS services such as EC2, S3, RDS, DynamoDB, Aurora, Redis, Memcached, Kafka, SNS, SQS, OpenSearch, and Elasticsearch.
* Use Agile tools (Jira, Confluence) to track incident tickets, document resolutions, and maintain a clear audit trail.
* Oversee system and application deployments, supporting automation pipelines (Jenkins, Git).
* Perform Linux/Unix administration tasks as needed during incident investigation and resolution.
* Continuously update and refine incident response playbooks, runbooks, and SOPs.
* Provide regular incident reports to leadership, including root cause analysis and long-term corrective actions.
Requirements
* Proven experience as an Incident Manager, Site Reliability Engineer (SRE), or Technical Operations Lead in cloud-native and microservices-based environments.
* Strong understanding of web architecture and microservices development principles.
* Deep hands-on experience with AWS Cloud Services: Compute (EC2, Lambda), Storage (S3), Databases (DynamoDB, RDS, Aurora), Messaging (Kafka, SNS, SQS), Caching (Redis, Memcached), and Search (OpenSearch, Elasticsearch).
* Expertise in Agile and DevOps tools: Jira, Confluence, Git, and Jenkins.
* Strong Linux/Unix system administration skills, including troubleshooting and performance tuning.
* Strong analytical skills with expertise in debugging complex distributed system issues.
* Experience with monitoring and observability tools like CloudWatch, Grafana, Nagios, and Kibana.
* Excellent communication and leadership skills to manage cross-functional incident response teams.
* Experience in writing detailed post-incident reports and driving continuous improvement.
* Strong scripting skills (Python, Bash, or similar) to automate diagnostic or remediation tasks.