Senior Site Reliability Engineer

Chennai ₹5-8 LPA Posted 12 Jun 2025

FULL TIME

Ansible

Puppet

Terraform

Grafana

Prometheus

Job Responsibilities

Provision and manage physical and virtual Linux machines using tools such as Puppet, Ansible, and Terraform
Engage closely with sister teams to assume ownership of various system lifecycle tasks
Automate away toil, or create processes that let high-urgency work be handed off to the NOC's rapid response team
Build automated monitoring and observability using tools such as Prometheus/Alertmanager, Icinga, and Grafana
Participate in all Agile/Scrum ceremonies, including daily stand-ups, sprint planning, and backlog grooming
Participate in the team's on-call rotation (expected to begin late 2024 or early 2025)
Work closely with internal teams to integrate new monitoring and alerts into the NOC, using Perl scripts to author custom parsing and mapping rules
Develop metrics and observability dashboards that track success measures for the team and the business

Typical Qualifications

5+ years of professional experience delivering SaaS solutions, preferably in a hybrid cloud environment
Bachelor's or Master's degree in a Computer Science / Engineering program
Proven experience using query languages to deliver observability solutions
Proficiency working with one or more configuration management tools (Puppet, Chef, Ansible, etc.)
Admin-level expertise with a Unix-based operating system
Proven ops background using cloud-native best practices
Proficiency with one or more scripting or programming languages (Python, Ruby, Perl, Java, etc.)
Proficiency working with Git and the Atlassian suite, or similar tools
Proficiency working with containerized environments is a plus
Experience creating technical documentation & standard operating procedures (SOPs)