Oriole is seeking a talented ML Systems/Infrastructure Engineer to help co‑optimize our AI/ML software stack with our cutting‑edge network hardware. You'll be a key contributor on a high‑impact, agile team focused on integrating middleware communication libraries and modeling the performance of large‑scale AI/ML workloads.
Key Responsibilities
* Design and optimize custom GPU communication kernels to enhance performance and scalability across multi‑node environments.
* Develop and maintain distributed communication frameworks for large‑scale deep learning models, ensuring efficient parallelization and optimal resource utilization.
* Profile, benchmark, and debug GPU applications to identify and resolve bottlenecks in communication and computation pipelines.
* Collaborate closely with hardware and software teams to integrate optimized kernels with Oriole’s next‑generation network hardware and software stack.
* Contribute to system‑level architecture decisions for large‑scale GPU clusters, focusing on communication efficiency, fault tolerance, and novel architectures for advanced optical network infrastructure.
Required Skills & Experience
* Proficiency in C++ and Python, with a strong track record in high‑performance computing or machine learning projects.
* Expertise in GPU programming with CUDA, including deep knowledge of GPU memory hierarchies and kernel optimization.
* Hands‑on experience debugging GPU kernels with tools such as cuda-gdb, Compute Sanitizer (formerly cuda-memcheck), and Nsight Systems, and the ability to read PTX and SASS.
* Strong understanding of communication libraries and protocols such as NCCL, NVSHMEM, Open MPI, and UCX, or experience building custom collective communication implementations.
* Familiarity with HPC networking protocols/libraries such as RoCE, InfiniBand, libibverbs, and libfabric.
* Experience with distributed deep learning and MoE frameworks such as PyTorch Distributed, vLLM, or DeepEP.
* Solid experience deploying and optimizing large‑scale distributed deep learning workloads in production environments, including Linux, Kubernetes, Slurm, Open MPI, GPU drivers, Docker, and CI/CD automation.