Fractile's mission is to enable a new chapter in the AI revolution. We're pioneering AI innovation where hardware and software join to create something truly extraordinary, unlocking the power of the world's largest language models with 100x speed increases. Our team is rapidly expanding, and we're searching for visionary engineers, scientists, and thinkers who share our passion for pushing boundaries and redefining what's possible. If you're ready to join a dynamic group of innovators shaping AI's future, we want to hear from you.
We are seeking an exceptional Infrastructure Engineer to develop and maintain the foundations of our silicon development, supporting a wide range of computationally intensive workloads while scaling up capacity by orders of magnitude. In this role you will work with our build flow lead to provide efficient, scalable processes and compute to our front-end and back-end silicon engineering teams. You will work closely with engineers across the organisation to resolve bottlenecks and optimise workflows, providing an environment that enables our team to execute at scale and with speed.
Key Responsibilities:
* Create and support tooling and workflows built around EDA tools, applying coding and build-system knowledge to help with tasks faced by different teams.
* Deploy and maintain compute infrastructure (in the cloud or on-premises) using an infrastructure-as-code (IaC) framework (Ansible/Terraform).
* Manage key network services such as a VPN, central authentication (LDAP), file/object storage, and license servers.
* Maintain a cluster compute solution capable of scheduling a wide variety of job types with large resource requirements.
* Set up and monitor observability tooling for resource utilisation, machine failures, and more (e.g. Prometheus/Zabbix).
* Work with the engineering team to build and optimise their workloads.
Preferred Qualifications:
* Proficient in modern software development language(s) and infrastructure-as-code frameworks.
* Proficient in the use and administration of Linux/Unix systems, and ideally management of shared compute environments.
* Experience diagnosing and resolving network/storage/CPU/RAM bottlenecks across complex workloads.
* Experience deploying and managing a grid compute system (Slurm/LSF/SGE).
* Proficiency with containerisation frameworks (Docker/Singularity).