Senior Member Technical Staff - ITOps

GreyOrange
Gurgaon
6-10 LPA
Posted 11 Jun 2025
FULL TIME
Docker
Programming Languages
Automation Tools
Unix Systems
Networking

Job Description

Roles and Responsibilities:

We are seeking a talented and motivated Lead Site Reliability Engineer (SRE) to join our organisation.

The SRE team at GreyOrange is responsible for monitoring the stability and availability of mission-critical production systems, managing incidents for quicker resolution, and establishing business-as-usual (BAU) operations. The team also manages and maintains internal tools and infrastructure that are consumed by other development teams.

The SRE will play a crucial role in capacity planning and in ensuring the reliability, scalability, and performance of our infrastructure and applications. The ideal candidate will have a strong background in software engineering, system administration, containerization, and cloud technologies.

Requirements:

  • Should have 7+ years of experience.
  • Well-versed in scripting/programming languages (Python, Bash, PowerShell, etc.) for automating manual work, particularly within cloud environments
  • Well-versed in observability tools (Grafana, Splunk, Dynatrace) for monitoring, alerting, and logging, used to identify and address potential issues, especially in cloud infrastructure
  • Working experience with automation tools (Jenkins, GitLab, Ansible/Chef for configuration management) and processes to streamline deployment, monitoring, and management of systems and applications in the cloud
  • Hands-on experience with containerization and orchestration technologies such as Docker, Kubernetes, or similar, particularly in cloud-native environments
  • Solid understanding of SLI, SLO, SLA, and error budget concepts and their implementation; willingness to provide on-call support and participate in incident management and response activities as needed
  • Expert at troubleshooting production issues and bugs
  • Good knowledge of Unix systems, networking, web technologies, and databases.
  • Incident management experience for production workloads, coupled with effective communication skills
  • Working knowledge of any one of the cloud platforms (AWS or GCP)

What you'll do:

  • Lead reliability engineering projects and drive them to closure.
  • Ensure system stability and high availability by proactively monitoring performance and troubleshooting issues
  • Design, build and maintain efficient, reliable, and scalable cloud-based infrastructure and services
  • Automate processes to reduce toil, and find opportunities to improve the observability and availability of the platform.
  • Implement and manage observability tools for comprehensive monitoring, alerting, and logging
  • Own end-to-end availability and performance of different services & tools.
  • Practice sustainable incident response and blameless postmortems.
  • Provide on-call support for incident management and participate actively in response activities