Systems Engineer
We are looking for a Systems Engineer to join our WorldQuant Aligned Infrastructure team. The team is composed of multidisciplinary individuals with unrestricted access across a large environment. We believe that one cannot build a truly great service without the ability to make changes across the stack. We take great care to focus on solving real business problems, reducing operational overhead and working together as a team.
This team is responsible for both engineering and operations in the following areas:
1. Data modelling, database tuning & query optimization
2. HPC job scheduling
3. Workflow management and batch processing
4. Container orchestration
5. Service discovery
6. POSIX and object storage systems
On-premises:
1. Bare metal compute (Linux)
2. System tuning
3. Configuration management and drift management
4. Performance tuning
5. Network configuration management
6. Compute, storage and network system purchases and evaluations
Cloud:
1. Environment provisioning and management
Qualifications/Skills Required:
We are looking for individuals with experience in two or more of the following areas:
HPC job scheduling
1. Experience in environments at scale (e.g. billions of jobs per week/month)
2. Understanding of cost metrics, preemption, job types, queuing, schedulers and optimizations
3. Experience with products like HTCondor, Slurm, Spectrum LSF, Nomad, AWS Batch
Container Orchestration (Kubernetes)
1. Experience with PSPs, Helm, admission/mutation controllers, PVs/PVCs, kube-router and BGP, and more generally a demonstrated ability to dig deep into the k8s projects to solve hard problems
2. Experience with Docker & registries (e.g. Harbor, Artifactory, GCP container registry, AWS container registry)
3. Mature approach to dealing with the operational complexities and gaps of the Kubernetes platform
Storage Systems
1. Experience deploying and managing petabyte-scale systems supporting varied workloads
2. Mature approach to assessing price/performance, tiering and backup requirements
3. Experience with products like GPFS, NetApp, Pure, Lightbits, Ceph, GCP PDs or other NVMe-specific products
4. Familiarity with NVMe over Fabrics, POSIX, object storage and various modes of permissioning data
Linux
1. Experience using configuration management systems (e.g. SaltStack, Ansible)
2. Understanding of Linux kernel components (e.g. VFS, scheduler, memory management, network)
3. Solid troubleshooting experience using gdb and OS/application tracing and profiling mechanisms
4. Experience with some of Docker, LXD/LXC, Kerberos, eBPF and virtualization technologies
Workflow management and batch processing
1. Experience with the challenges of workflow management in heavily multi-tenant environments
2. Mature approach to handling and avoiding task and system failures
3. Experience with products like Airflow, NiFi, GNUbatch, GCP Cloud Composer, AWS SageMaker
Software Engineering
1. Proficient in OO development (we use Python), Git and CI/CD concepts
2. Comfortable contributing to a large codebase with varied technologies
In addition to the above, the following qualifications always apply:
1. Ability to review and/or extend open source platforms to satisfy business requirements
2. A passion for technology and automation, a deep sense of curiosity and a willingness to always question
3. A passion for in-depth understanding of technology and for building large-scale systems
4. Excellent verbal and written communication skills