[Up to c. $425k Comp Package (or equivalent) | Hybrid Working]
We’re hiring on behalf of a top-tier, technology-driven trading firm known for its world-class infrastructure and scientific approach to real-time systems. As part of a specialist engineering team, you’ll help scale and optimise massive distributed GPU environments powering AI, research, and quantitative strategies. This is a rare chance to take ownership of petabyte-scale infrastructure across global data centres, shaping how data-intensive workloads are run and accelerated at scale.
Key Responsibilities
Design, deploy, and tune large-scale GPU-based compute environments used for AI and quant research workloads
Benchmark, analyse, and eliminate performance bottlenecks across compute, storage, and network layers
Automate system configuration, monitoring, and diagnostics across thousands of high-density nodes
Partner with researchers and engineers to align infrastructure improvements with evolving model and data demands
Manage end-to-end rollout of new hardware and software solutions, including hands-on testing and vendor coordination
Troubleshoot complex distributed systems across the full stack: hardware, OS, drivers, and container orchestration
Own critical projects that enhance performance, reliability, and observability at the fleet level
What You Bring
4-8 years' experience managing large-scale Linux infrastructure in high-performance, distributed, or AI-centric environments
Deep technical fluency with GPU architecture, deployment, and tuning (e.g. memory management, driver compatibility, hardware diagnostics)
Strong scripting and automation skills, especially in Python, with an infrastructure-as-code mindset
Hands-on experience resolving GPU workload issues across compute clusters and supporting technologies
Familiarity with performance tooling and debugging in live production environments
Practical experience with CUDA or systems-level programming in C/C++
Experience with config management frameworks like Salt, Ansible, or Puppet
(Preferred) Experience with GPU communication and interconnect technologies (e.g. collective communication libraries such as NCCL, low-latency solutions like GPUDirect RDMA, or high-speed GPU interconnects including NVLink)
...