Siemens
Senior AI Engineer
Gurgaon ₹4-8 LPA Posted 29 Apr 2025
FULL TIME
JAX
Debugging
PyTorch
MLOps
Problem Solving
Job Description
- Design, build, and rigorously optimize everything necessary for large-scale training, fine-tuning, and/or inference across different model architectures. This covers the complete stack, from data loading to distributed training to inference, with the goal of maximizing MFU (Model FLOPs Utilization) on the compute cluster.
- Collaborate closely and proactively with research scientists, translating research models and algorithms into high-performance, production-ready code and infrastructure. Implement, integrate, and test the latest advances from research publications or open-source code.
- Relentlessly profile and resolve training performance bottlenecks, optimizing every layer of the training stack from data loading to model inference for speed and efficiency.
- Contribute to technology evaluations and selection of hardware, software, and cloud services that will define our AI infrastructure platform.
- Use MLOps frameworks (MLflow, W&B, etc.) to implement best practices across the model lifecycle (development, training, validation, and monitoring), ensuring reproducibility, reliability, and continuous improvement.
- Create thorough documentation for infrastructure, data pipelines, and training procedures, ensuring maintainability and knowledge transfer within the growing AI lab.
- Stay at the forefront of advancements in large-scale training strategies and data engineering, and proactively drive improvements and innovation in our workflows and infrastructure.
- High-agency individual demonstrating initiative, problem-solving, and a commitment to delivering robust and scalable solutions for rapid prototyping and turnaround.
- Bachelor's or master's degree in Computer Science, Engineering, or a related technical field.
- 5+ years of hands-on experience in a role specifically building and optimizing infrastructure for large-scale machine learning systems.
- Deep practical expertise with AI frameworks (PyTorch, JAX, PyTorch Lightning, etc.). Hands-on experience with large-scale multi-node GPU training and other optimization strategies for developing large foundation models across various model architectures. Ability to scale solutions involving large datasets and complex models on distributed compute infrastructure.
- Excellent problem-solving, debugging, and performance optimization skills, with a data-driven approach to identifying and resolving technical challenges.
- Strong communication and teamwork skills, with a collaborative approach to working with research scientists and other engineers.
- Experience with MLOps best practices for model tracking, evaluation, and deployment.
Desired skills
- Public GitHub profile demonstrating a track record of open-source contributions to relevant projects in data engineering or deep learning infrastructure is a BIG PLUS.
- Experience with performance monitoring and profiling tools for distributed training and data pipelines.
- Experience writing CUDA/Triton/CUTLASS kernels.