Requirements
* Knowledge of Python
* Familiarity with cloud services (e.g. AWS)
* Experience managing or developing in Linux environments
* Understanding of CI/CD principles
* Experience using Kubernetes (k8s)
* (Desirable) Experience maintaining machine learning applications
* (Desirable) Experience deploying ML orchestration tools (e.g. NV Ray, KFP, SkyPilot)
* (Desirable) Experience managing ML accelerator hardware (e.g. DCGM)
* (Desirable) Experience with Infrastructure as Code (IaC) tools (e.g. Terraform/OpenTofu)
* (Desirable) Experience with GitHub Actions
* (Desirable) Experience with modern observability tooling (e.g. Prometheus)
* (Desirable) Experience with Grafana
* (Desirable) Knowledge of Go/Java/C++ (or similar language)
What the job involves
* Join our dynamic Software Infrastructure team and take a pivotal role in scaling and managing our infrastructure
* You will develop essential tools and services that empower our broader software team
* Your contributions will enhance the build, test, deployment, and productisation processes of our Machine Learning Software components
* Work with our High-Performance Computing (HPC) AI platforms and gain invaluable experience in distributed systems
* The Software Infrastructure team provides critical platforms and services for software development teams across the business
* Our responsibilities include managing the CI platform and services, build engineering, component integration, and packaging and release systems
* We operate in squads, fostering a culture of service ownership and empowerment for our engineers
* We focus on long-term engineering solutions and strive to eliminate toil wherever possible
* Develop, own, and maintain tools and services to support AI research and engineering teams
* Deploy and maintain services with Kubernetes and Docker
* Manage our cloud infrastructure using tools such as Terraform