Design, build, and rigorously optimize everything necessary for large-scale training, fine-tuning, and/or inference across different model architectures. This covers the complete stack, from data loading to distributed training to inference, with the goal of maximizing MFU (Model FLOPs Utilization) on the compute cluster.
Collaborate closely and proactively with research scientists, translating research models and algorithms into high-performance, production-ready code and infrastructure. Implement, integrate, and test the latest advancements from research publications or open-source code.
Relentlessly profile and resolve training performance bottlenecks, optimizing every layer of the training stack from data loading to model inference for speed and efficiency.
Contribute to technology evaluations and selection of hardware, software, and cloud services that will define our AI infrastructure platform.
Use MLOps frameworks (MLflow, W&B, etc.) to implement best practices across the model lifecycle (development, training, validation, and monitoring), ensuring reproducibility, reliability, and continuous improvement.
Create thorough documentation for infrastructure, data pipelines, and training procedures, ensuring maintainability and knowledge transfer within the growing AI lab.
Stay at the forefront of advancements in large-scale training strategies and data engineering, and proactively drive improvements and innovation in our workflows and infrastructure.
High-agency individual demonstrating initiative, problem-solving, and a commitment to delivering robust and scalable solutions for rapid prototyping and turnaround.

Bachelor's or master's degree in Computer Science, Engineering, or a related technical field.
5+ years of hands-on experience in a role focused on building and optimizing infrastructure for large-scale machine learning systems.
Deep practical expertise with AI frameworks (PyTorch, JAX, PyTorch Lightning, etc.). Hands-on experience with large-scale multi-node GPU training and related optimization strategies for developing large foundation models across various model architectures. Ability to scale solutions involving large datasets and complex models on distributed compute infrastructure.
Excellent problem-solving, debugging, and performance optimization skills, with a data-driven approach to identifying and resolving technical challenges.
Strong communication and teamwork skills, with a collaborative approach to working with research scientists and other engineers.
Experience with MLOps best practices for model tracking, evaluation, and deployment.

Desired skills

A public GitHub profile demonstrating a track record of open-source contributions to relevant projects in data engineering or deep learning infrastructure is a BIG PLUS.
Experience with performance monitoring and profiling tools for distributed training and data pipelines.
Experience writing CUDA/Triton/CUTLASS kernels.

Senior AI Engineer

Job Description

Required Skills