[Up to c. $425k Comp Package (or equivalent) | Hybrid Working]
We’re hiring on behalf of a top-tier, technology-driven trading firm known for its world-class infrastructure and scientific approach to real-time systems. As part of a specialist engineering team, you’ll help scale and optimise massive distributed GPU environments powering AI, research, and quantitative strategies. This is a rare chance to take ownership of petabyte-scale infrastructure across global data centres, shaping how data-intensive workloads are run and accelerated at scale.
Key Responsibilities
* Design, deploy, and tune large-scale GPU-based compute environments used for AI and quant research workloads
* Benchmark, analyse, and eliminate performance bottlenecks across compute, storage, and network layers
* Automate system configuration, monitoring, and diagnostics across thousands of high-density nodes
* Partner with researchers and engineers to align infrastructure improvements with evolving model and data demands
* Manage end-to-end rollout of new hardware and software solutions, including hands-on testing and vendor coordination
* Troubleshoot complex distributed systems across the full stack: hardware, OS, drivers, and container orchestration
* Own critical projects that enhance performance, reliability, and observability at the fleet level
What You Bring
* 4-8 years' experience managing large-scale Linux infrastructure in high-performance, distributed, or AI-centric environments
* Deep technical fluency with GPU architecture, deployment, and tuning (e.g. memory management, driver compatibility, hardware diagnostics)
* Strong scripting and automation skills, especially in Python, with an infrastructure-as-code mindset
* Hands-on experience resolving GPU workload issues across compute clusters and supporting technologies
* Familiarity with performance tooling and debugging in live production environments
* Practical experience with CUDA or systems-level programming in C/C++
* Experience with configuration management frameworks such as Salt, Ansible, or Puppet
* (Preferred) Experience with GPU communication and interconnect technologies (e.g. collective communication libraries such as NCCL, low-latency solutions like GPUDirect RDMA, or high-speed GPU interconnects including NVLink)
...