Senior or Principal Software Engineer, ML Platform (Stability & Infrastructure)
Your Impact
We are building the largest foundation models in biotech and applying them immediately to cure disease. You will play a pivotal role in ensuring the reliability and scalability of the foundations that make this possible.
As a Principal Engineer, you will lead the effort to harden our systems so that our groundbreaking AI is built on an unshakeable base. You will work closely with the research team and the Applied ML teams to keep the infrastructure stable and reliable, and able to handle more data and larger models as we grow.
What You Will Do
* You will own the end-to-end strategy for platform reliability, with a specific focus on our accelerator (GPU/TPU) infrastructure and workload orchestration. You will move between high-level architectural design and hands‑on systems engineering to eliminate friction in the researcher experience.
* Lead the reliability work for our global job scheduler. You will design and implement a robust "test harness" to safely validate infrastructure upgrades without impacting live research.
* Architect and optimize our next‑generation inference services. You will solve core scaling limits, ensuring high‑throughput performance and feature parity across our model serving stack.
* Overhaul our logging and monitoring systems to provide radical visibility. You will build proactive alerting and telemetry that identifies systemic failures before they impact research workflows.
* Improve our internal CI/CD stability, targeting a significant reduction in failure rates and much faster feedback loops for the engineering organization.
* Contribute to core technical decisions on tooling and architectural design while partnering with science, product, and operations teams to align infrastructure with biotech R&D cycles.
Skills And Qualifications
Essential
* Proven experience in architecting and managing large‑scale AI/ML workloads in a production environment.
* Expertise in cloud compute design, specifically within Google Cloud Platform (GCP).
* Significant experience deploying and managing complex workloads within Kubernetes (GKE).
* Professional familiarity with NVIDIA GPU generations and the intricacies of high‑performance compute.
* Strong programming skills and a "reliability‑first" approach to software development.
Nice to Have
* A career history that spans both ML Software Engineering and Infrastructure SRE roles.
* Experience leading multi‑disciplinary projects and navigating complex stakeholder requirements in a fast‑paced environment.
* Familiarity with workload scheduling, ML efficiency research, and hardware benchmarking.
* Experience with Google TPU generations and specialized ML‑driven R&D cycles.
Hybrid Working
It’s hugely important for us to share knowledge and build strong relationships with each other, and we find it easier to do this if we spend time together in person. This is why we follow a hybrid model, and we would require you to be able to come into the office 3 days a week (currently Tuesday, Wednesday, and one other day depending on which team you’re in). If you have additional needs that would prevent you from following this hybrid approach, we’d be happy to talk through these if you’re selected for an initial screening call.
Equal Employment Opportunity
We are committed to equal employment opportunities regardless of sex, race, religion or belief, ethnic or national origin, disability, age, citizenship, marital, domestic or civil partnership status, sexual orientation, gender identity, pregnancy or related condition (including breastfeeding) or any other basis protected by applicable law. If you have a disability or additional need that requires accommodation, please let us know.