SRE Engineer

Gurgaon | ₹5-8 LPA | Posted 30 Jun 2025

FULL TIME

Docker

Kubernetes

Terraform

Grafana

Prometheus

Job Description

Position Overview:

Take the lead in reliability engineering for large-scale, latency-sensitive applications serving thousands of users. You will drive automation-first operations and observability across all infrastructure layers while helping shape the reliability roadmap.

Roles and Responsibilities:

Design and evolve comprehensive observability platforms, including distributed tracing.
Lead resilience reviews, failure mode analysis, and risk mitigation plans.
Define golden signals, dashboards, and alerts to drive proactive incident prevention.
Develop and automate self-healing infrastructure patterns.
Collaborate with DevOps and Development teams to ensure production readiness.

Must Have Skills:

7-10 years of experience, with a strong understanding of distributed systems, cloud networking, and capacity modeling.
Deep experience with Kubernetes observability.
Leadership in high-impact incident response, reliability planning, and RCA.
Proven track record of managing infrastructure-as-code and CI/CD tooling.
Strategic thinker with ability to drive platform-level reliability objectives.

Qualification:

BE/BTech/MCA/ME/MTech/MS in Computer Science or a related technical field, or equivalent practical experience.
