
Senior Site Reliability Engineer III - Ansible/Terraform

Gurgaon ₹6-8 LPA Posted 10 Jun 2025

FULL TIME

GCP

Automation

Responsibilities:

Define and enforce SLOs, SLIs, and error budgets across microservices
Architect an observability stack (metrics, logs, traces) and drive operational insights
Automate toil and manual ops with robust tooling and runbooks
Own incident response lifecycle: detection, triage, RCA, and postmortems
Collaborate with product teams to build fault-tolerant systems
Champion performance tuning, capacity planning, and scalability testing
Optimise costs while maintaining the reliability of cloud infrastructure

Must-have skills:

6+ years in SRE, infrastructure, or backend roles using cloud-native technologies
2+ years in an SRE-specific capacity
Strong experience with monitoring/observability tools (Datadog, Prometheus, Grafana, ELK, etc.)
Experience with infrastructure-as-code (Terraform/Ansible)
Proficiency in Kubernetes, service mesh (Istio/Linkerd), and container orchestration
Deep understanding of distributed systems, networking, and failure domains
Expertise in automation with Python, Bash, or Go
Proficiency in incident management, SLAs/SLOs, and system tuning
Hands-on experience with GCP (preferred)/AWS/Azure and cloud cost optimisation
Participation in on-call rotations and running large-scale production systems

Nice-to-have skills:

Familiarity with chaos engineering practices and tools (Gremlin, Litmus)
Background in performance testing and load simulation (Gatling, Locust, k6, JMeter)