Nomiso
SRE Engineer
Gurgaon ₹5-8 LPA Posted 30 Jun 2025
FULL TIME
Docker
Kubernetes
Terraform
Grafana
Prometheus
Job Description
Position Overview:
- Take the lead in reliability engineering for large-scale, latency-sensitive applications serving thousands of users. You will drive automation-first operations and observability across all infrastructure layers while helping shape the reliability roadmap.
Roles and Responsibilities:
- Design and evolve comprehensive observability platforms, including distributed tracing.
- Lead resilience reviews, failure mode analysis, and risk mitigation plans.
- Define golden signals, dashboards, and alerts to drive proactive incident prevention.
- Develop and automate self-healing infrastructure patterns.
- Collaborate with DevOps and Development teams to ensure production readiness.
Must Have Skills:
- 7-10 years of experience, with a strong understanding of distributed systems, cloud networking, and capacity modeling.
- Deep experience with Kubernetes observability.
- Leadership in high-impact incident response, reliability planning, and RCA.
- Proven track record of managing infrastructure-as-code and CI/CD tooling.
- Strategic thinker with the ability to drive platform-level reliability objectives.
Qualification:
- BE/BTech/MCA/ME/MTech/MS in Computer Science or a related technical field, or equivalent practical experience.