* Own production on-call responsibilities, including incident response, mitigation, and post-mortem analysis.
* Troubleshoot complex system failures across distributed Linux/Unix environments.
* Design, deploy, and operate containerized applications on production infrastructure.
* Build and maintain highly available, scalable distributed services.
* Write, test, and release production-quality code in Python, Go, or similar languages.
* Improve observability through monitoring, logging, and alerting.
* Automate operational workflows to reduce manual intervention and mean time to recovery (MTTR).
* Collaborate with engineering teams to improve reliability, performance, and release readiness.
* Perform capacity planning, performance tuning, and resilience testing.
* Drive continuous improvements in reliability, operational excellence, and system stability.