
Senior Site Reliability Engineer III - Ansible/Terraform

GreyOrange
Gurgaon
6-8 LPA
Posted 10 Jun 2025
FULL TIME
GCP
Automation

Job Description

Responsibilities:

  • Define and enforce SLOs, SLIs, and error budgets across microservices
  • Architect an observability stack (metrics, logs, traces) and drive operational insights
  • Automate toil and manual ops with robust tooling and runbooks
  • Own incident response lifecycle: detection, triage, RCA, and postmortems
  • Collaborate with product teams to build fault-tolerant systems
  • Champion performance tuning, capacity planning, and scalability testing
  • Optimise costs while maintaining the reliability of cloud infrastructure

Must-have skills:

  • 6+ years in SRE, infrastructure, or backend-related roles using cloud-native technologies
  • 2+ years in an SRE-specific capacity
  • Strong experience with monitoring/observability tools (Datadog, Prometheus, Grafana, ELK, etc.)
  • Experience with infrastructure-as-code (Terraform/Ansible)
  • Proficiency in Kubernetes, service mesh (Istio/Linkerd), and container orchestration
  • Deep understanding of distributed systems, networking, and failure domains
  • Expertise in automation with Python, Bash, or Go
  • Proficient in incident management, SLAs/SLOs, and system tuning
  • Hands-on experience with GCP (preferred)/AWS/Azure and cloud cost optimisation
  • Experience participating in on-call rotations and running large-scale production systems

Nice-to-have skills:

  • Familiarity with chaos engineering practices and tools (Gremlin, Litmus)
  • Background in performance testing and load simulation (Gatling, Locust, k6, JMeter)
