Hardware Engineer/Specialist – GPU Compute Platforms
About Us
We are a fast-growing GPU cloud engineered for AI, providing cost-effective, high-performance infrastructure for start-ups and large enterprises. Our mission is to eliminate the complexity of AI development by delivering powerful, reliable, and scalable compute—enabling customers to innovate faster and operate more efficiently.
Our team thrives on ownership, innovation, and accountability. Here, transparency builds trust, and every team member is empowered to deliver excellence with urgency. Join us, and you’ll help build the technology powering the future of AI.
About the Role
We’re seeking a Hardware Engineer/Specialist to own the lifecycle of our GPU compute platforms—from working with OEMs to define the right systems, to powering, cooling, racking, networking, and maintaining high-density GPU infrastructure in production.
You’ll be the internal authority on NVIDIA GPU servers and surrounding data center hardware, and the primary liaison with vendors (Dell, Lenovo, etc.). Your work will directly influence performance, reliability, and our ability to scale.
What You’ll Do
Vendor & Platform Ownership
* Collaborate with hardware vendors (Dell, Lenovo, SIs/channel partners) to translate workload requirements into server configurations and BOMs.
* Maintain deep awareness of NVIDIA’s data center GPU portfolio, HGX/DGX systems, NVLink/NVSwitch, NICs/DPUs, and related components.
Power Requirements
* Produce accurate node and rack power budgets (typical/max) with A/B feed redundancy, PSU efficiency, and power-capping considerations (a worked sketch follows this list).
* Specify PDUs (single/three-phase), breaker sizing, UPS integration, and downstream distribution.
* Validate power draw during burn-in and acceptance testing; track capacity vs. plan.
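For illustration only, the sketch below shows the kind of node-to-rack power roll-up this role owns; the node count, per-GPU wattage, host overhead, and PSU efficiency are hypothetical placeholders, not platform specifications.

```python
# Illustrative rack power budget; every figure below is a hypothetical
# placeholder, not a real platform specification.
NODES_PER_RACK = 4        # assumed high-density GPU nodes per rack
GPUS_PER_NODE = 8         # assumed GPUs per node
GPU_MAX_W = 700           # assumed per-GPU max board power (W)
HOST_OVERHEAD_W = 1500    # assumed CPUs, DRAM, NVMe, NICs, fans per node (W)
PSU_EFFICIENCY = 0.94     # assumed PSU efficiency at load

# Wall-side draw per node and per rack at maximum load.
node_max_w = (GPUS_PER_NODE * GPU_MAX_W + HOST_OVERHEAD_W) / PSU_EFFICIENCY
rack_max_w = node_max_w * NODES_PER_RACK

# With A/B redundancy, each feed must be able to carry the full rack on its own.
per_feed_w = rack_max_w

print(f"Per-node max draw: {node_max_w / 1000:.1f} kW")
print(f"Rack max draw:     {rack_max_w / 1000:.1f} kW")
print(f"Per A/B feed:      {per_feed_w / 1000:.1f} kW")
```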
Cooling Requirements
* Define and validate cooling approaches for high-density GPU racks (air-cooled and liquid-ready).
* Support designs using RDHx or direct-to-chip liquid cooling systems (CDUs, facility water specs, quick-disconnects, leak detection).
* Monitor thermal telemetry and configure inlet/outlet temperature thresholds.
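As a hedged example of that telemetry work, the sketch below polls a chassis's temperature sensors over the standard Redfish Thermal resource; the BMC address, credentials, chassis ID, and inlet threshold are placeholders, and exact paths and sensor names vary by vendor BMC (iDRAC, XCC, iLO).

```python
# Minimal sketch: poll chassis temperature sensors over the Redfish Thermal
# resource. The BMC URL, credentials, chassis ID, and threshold are
# placeholders; exact paths and sensor names vary by vendor BMC.
import requests

BMC = "https://bmc.example.internal"   # hypothetical BMC address
AUTH = ("monitor", "password")         # hypothetical read-only account
INLET_LIMIT_C = 35                     # assumed inlet temperature alert threshold

resp = requests.get(f"{BMC}/redfish/v1/Chassis/1/Thermal",
                    auth=AUTH, verify=False, timeout=10)
resp.raise_for_status()

for sensor in resp.json().get("Temperatures", []):
    name = sensor.get("Name", "unknown")
    reading = sensor.get("ReadingCelsius")
    if reading is None:
        continue
    flag = "  <-- over inlet limit" if "Inlet" in name and reading > INLET_LIMIT_C else ""
    print(f"{name}: {reading} C{flag}")
```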
Firmware & BIOS Lifecycle
* Define and maintain golden profiles for BIOS/UEFI, BMC/iDRAC/iLO/XCC, NIC, NVMe, GPU, and switch OS firmware (a drift-check sketch follows this list).
* Automate updates using Redfish/IPMI/CLI/Ansible (or similar).
* Tune BIOS settings (NUMA, C-states, PCIe bifurcation, SR-IOV, Above 4G, etc.) to optimize GPU workloads.
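The sketch below illustrates one way a golden-profile drift check might look over Redfish: it reads a system's live BIOS attributes and compares them to a baseline. The BMC URL, credentials, system ID, and attribute names/values are hypothetical and differ between OEMs.

```python
# Minimal sketch: compare a node's live BIOS attributes against a "golden"
# baseline over Redfish. The BMC URL, credentials, system ID, and attribute
# names/values are hypothetical and differ between OEMs (Dell, Lenovo, etc.).
import requests

BMC = "https://bmc.example.internal"
AUTH = ("admin", "password")

# Hypothetical golden profile for a GPU node.
GOLDEN = {
    "SriovGlobalEnable": "Enabled",   # SR-IOV
    "MmioAbove4Gb": "Enabled",        # Above-4G MMIO decoding
    "ProcCStates": "Disabled",        # CPU C-states
}

resp = requests.get(f"{BMC}/redfish/v1/Systems/1/Bios",
                    auth=AUTH, verify=False, timeout=10)
resp.raise_for_status()
live = resp.json().get("Attributes", {})

for key, want in GOLDEN.items():
    got = live.get(key, "<missing>")
    status = "OK   " if got == want else "DRIFT"
    print(f"{status} {key}: expected {want}, found {got}")
```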
Network Integration
* Design and integrate host networking for AI/HPC clusters across Ethernet (RoCEv2/DCB/PFC/ECN) and/or InfiniBand.
* Specify NICs, optics, transceivers, and cabling; support LAG/LACP/MLAG/EVPN-VXLAN where appropriate.
Operations, Reliability & Documentation
* Define acceptance and burn-in processes (power/thermal soak, memory/disk tests, GPU diagnostics); a telemetry sketch follows this list.
* Track inventory, spares, failure rates, and vendor SLA performance.
* Produce runbooks, rack elevations, wiring diagrams, firmware matrices, and change documentation.
* Partner with Security on secure boot/TPM, firmware signing, and chain-of-custody.
* Own data-erasure standards for decommissioning.
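As a minimal illustration of the burn-in telemetry mentioned above, the sketch below samples GPU temperature, power, and utilization with nvidia-smi during a soak and flags readings above an assumed limit; the threshold, sample count, and interval are placeholders.

```python
# Minimal sketch of burn-in telemetry: sample GPU temperature, power, and
# utilization via nvidia-smi during a soak and flag readings over an assumed
# limit. The threshold, sample count, and interval are placeholders; a real
# soak runs for hours alongside a load generator.
import subprocess
import time

TEMP_LIMIT_C = 85.0   # assumed thermal ceiling during soak
SAMPLES = 5           # shortened for illustration
INTERVAL_S = 10

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,power.draw,utilization.gpu",
    "--format=csv,noheader,nounits",
]

for _ in range(SAMPLES):
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    for line in out.stdout.strip().splitlines():
        idx, temp, power, util = [field.strip() for field in line.split(",")]
        flag = "  <-- over temp limit" if float(temp) > TEMP_LIMIT_C else ""
        print(f"GPU{idx}: {temp} C, {power} W, {util}% util{flag}")
    time.sleep(INTERVAL_S)
```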
About You (Qualifications)
Required:
* 5+ years building and operating enterprise/x86 data center hardware.
* Hands-on experience with GPU-accelerated platforms.
* Proven vendor management experience with at least one major OEM (Dell, Lenovo, Supermicro, HPE, etc.).
* Expertise across:
  * Power: node/rack budgets, A/B feeds, PDUs
  * Cooling: air and liquid-ready rack designs
  * Rack layouts: elevations, space/weight planning, cabling standards
  * Firmware/BIOS: baselines, automation, lifecycle
  * Networking: InfiniBand/Ethernet for AI/HPC (RoCEv2, DCB/PFC, optics)
* Strong Linux server experience (Ubuntu, RHEL).
* Scripting/automation (Ansible, Python, shell).
* Excellent documentation and cross-functional communication skills.
Nice to Have:
* Experience with NVIDIA DGX/HGX systems, NVLink/NVSwitch, MIG/vGPU.
* Familiarity with liquid cooling systems (CDU, RDHx), facility water chemistry, and thermal engineering.
* Background in AI/HPC cluster deployments.
What We Offer
* Highly competitive compensation (base + equity)
* Rapid career growth in one of the fastest-growing AI infrastructure companies
* A remote-first, flexible work environment built on trust
* Opportunities to push boundaries, innovate, and take meaningful ownership
* A supportive, collaborative culture that puts people first