Senior AI Engineer

Siemens
Gurgaon
4-8 LPA
Posted 29 Apr 2025
FULL TIME
JAX
Debugging
PyTorch
MLOps
Problem Solving

Job Description

  • Design, build, and rigorously optimize everything necessary for large-scale training, fine-tuning, and/or inference across different model architectures. This covers the complete stack, from data loading to distributed training to inference, with the goal of maximizing MFU (Model FLOPs Utilization) on the compute cluster.
  • Collaborate closely and proactively with research scientists, translating research models and algorithms into high-performance, production-ready code and infrastructure. Implement, integrate, and test the latest advancements from research publications or open-source code.
  • Relentlessly profile and resolve training performance bottlenecks, optimizing every layer of the training stack from data loading to model inference for speed and efficiency.
  • Contribute to technology evaluations and selection of hardware, software, and cloud services that will define our AI infrastructure platform.
  • Use MLOps frameworks (MLflow, W&B, etc.) to implement best practices across the model lifecycle (development, training, validation, and monitoring), ensuring reproducibility, reliability, and continuous improvement.
  • Create thorough documentation for infrastructure, data pipelines, and training procedures, ensuring maintainability and knowledge transfer within the growing AI lab.
  • Stay at the forefront of advancements in large-scale training strategies and data engineering, and proactively drive improvements and innovation in our workflows and infrastructure.
  • High-agency individual demonstrating initiative, problem-solving, and a commitment to delivering robust and scalable solutions for rapid prototyping and turnaround.

  • Bachelor's or master's degree in Computer Science, Engineering, or a related technical field.
  • 5+ years of hands-on experience in a role specifically building and optimizing infrastructure for large-scale machine learning systems.
  • Deep practical expertise with AI frameworks (PyTorch, JAX, PyTorch Lightning, etc.). Hands-on experience with large-scale multi-node GPU training and other optimization strategies for developing large foundation models across various model architectures. Ability to scale solutions involving large datasets and complex models on distributed compute infrastructure.
  • Excellent problem-solving, debugging, and performance optimization skills, with a data-driven approach to identifying and resolving technical challenges.
  • Strong communication and teamwork skills, with a collaborative approach to working with research scientists and other engineers.
  • Experience with MLOps best practices for model tracking, evaluation and deployment.

Desired skills

  • Public GitHub profile demonstrating a track record of open-source contributions to relevant projects in data engineering or deep learning infrastructure is a BIG PLUS.
  • Experience with performance monitoring and profiling tools for distributed training and data pipelines.
  • Experience writing CUDA/Triton/CUTLASS kernels.