Responsibilities:
* Conduct pre-training of AI models on large distributed clusters with thousands of NVIDIA GPUs.
* Design, prototype, and scale innovative architectures to improve model intelligence.
* Execute experiments independently and collaboratively, analyze results, and refine methodologies for optimal performance.
* Investigate and debug training issues, and improve model efficiency and computational performance.
* Contribute to the development of training systems to ensure scalability and efficiency on target platforms.
Minimum Qualifications:
* A degree in Computer Science or a related field; a PhD in NLP, Machine Learning, or related fields is preferred, with a strong record in AI R&D and publications in top conferences.
* Hands-on experience with large-scale LLM training on distributed clusters with thousands of NVIDIA GPUs.
* Familiarity with large-scale distributed training frameworks, libraries, and tools.
* Deep knowledge of modifications to transformer and non-transformer architectures that enhance model intelligence, efficiency, and scalability.
* Expertise in PyTorch and Hugging Face libraries, with practical experience in model development, pre-training, and deployment.