GreyOrange
Senior Site Reliability Engineer III - Ansible/Terraform
Gurgaon | ₹6-8 LPA | Posted 10 Jun 2025
FULL TIME
GCP
Automation
Job Description
Responsibilities:
- Define and enforce SLOs, SLIs, and error budgets across microservices
- Architect an observability stack (metrics, logs, traces) and drive operational insights
- Automate toil and manual ops with robust tooling and runbooks
- Own the incident response lifecycle: detection, triage, root cause analysis (RCA), and postmortems
- Collaborate with product teams to build fault-tolerant systems
- Champion performance tuning, capacity planning, and scalability testing
- Optimise costs while maintaining the reliability of cloud infrastructure
Must-have skills:
- 6+ years in SRE, infrastructure, or backend roles using cloud-native technologies
- 2+ years in an SRE-specific capacity
- Strong experience with monitoring/observability tools (Datadog, Prometheus, Grafana, ELK, etc.)
- Experience with infrastructure-as-code (Terraform/Ansible)
- Proficiency in Kubernetes, service mesh (Istio/Linkerd), and container orchestration
- Deep understanding of distributed systems, networking, and failure domains
- Expertise in automation with Python, Bash, or Go
- Proficiency in incident management, SLAs/SLOs, and system tuning
- Hands-on experience with GCP (preferred), AWS, or Azure, and with cloud cost optimisation
- Experience participating in on-call rotations and running large-scale production systems
Nice-to-have skills:
- Familiarity with chaos engineering practices and tools (Gremlin, Litmus)
- Background in performance testing and load simulation (Gatling, Locust, k6, JMeter)