Senior Member Technical Staff - ITOps
Job Description
Roles and Responsibilities:
We are seeking a talented and motivated Lead Site Reliability Engineer (SRE) to join our organisation.
The SRE team at GreyOrange is responsible for monitoring the stability and availability of mission-critical production systems, managing incidents for quicker resolution, and establishing business-as-usual (BAU) operations. The team also manages and maintains internal tools and infrastructure that are used by other development teams.
The experienced SRE will play a crucial role in capacity planning and in ensuring the reliability, scalability, and performance of our infrastructure and applications. The ideal candidate will have a strong background in software engineering, system administration, containerization, and cloud technologies.
Requirements:
- 7+ years of relevant experience.
- Well-versed in scripting/programming languages (Python, Bash, PowerShell, etc.) for automating manual work, particularly in cloud environments.
- Well-versed in observability tools (Grafana, Splunk, Dynatrace) for monitoring, alerting, and logging, used to identify and address potential issues, especially in cloud infrastructure.
- Working experience with automation tools (Jenkins, GitLab, and Ansible/Chef for configuration management) and processes that streamline the deployment, monitoring, and management of systems and applications in the cloud.
- Hands-on experience with containerization and orchestration technologies such as Docker, Kubernetes, or similar, particularly in cloud-native environments.
- Solid understanding of SLI, SLO, SLA, and error budget concepts and their implementation; willingness to provide on-call support and participate in incident management and response activities as needed.
- Expert at troubleshooting production issues and bugs.
- Good knowledge of Unix systems, networking, web technologies, and databases.
- Incident management experience with production workloads, coupled with effective communication skills.
- Working knowledge of at least one cloud platform (AWS or GCP).
What you'll do:
- Lead reliability engineering projects and drive them to closure.
- Ensure system stability and high availability by proactively monitoring performance and troubleshooting issues.
- Design, build, and maintain efficient, reliable, and scalable cloud-based infrastructure and services.
- Automate processes to reduce toil, and find opportunities to improve the observability and availability of the platform.
- Implement and manage observability tools for comprehensive monitoring, alerting, and logging.
- Own end-to-end availability and performance of different services & tools.
- Practice sustainable incident response and blameless postmortems.
- Provide on-call support for incident management and participate actively in response activities.