Location: Hyderabad, L&T Metro, floors 1-9, 11 & 12

Job Title: Jr. Site Reliability Engineer (SRE) – Azure Storage

Role Overview

We are seeking a Site Reliability Engineer (SRE) to support Azure Storage deployments and operations across public, sovereign, and pre‑production environments. The role focuses on deployment reliability, incident response, infrastructure health, automation, and data‑driven operational insights.

Key Responsibilities

Reliability, Deployments & Operations

Execute Azure Storage (Classic / XPF / Direct Drive) tenant and infrastructure deployments across public, sovereign, and pre‑production environments.
Monitor and maintain server uptime, tenant stability, and overall environment health.
Track and reduce offline capacity and long‑running (long‑tail) deployments to improve deployment completion times.
Manage end‑to‑end release tracking for storage components and ensure deployment compliance.

Incident Management & Troubleshooting

Acknowledge, triage, and resolve deployment‑related incidents and operational alerts.
Apply technical mitigations (including node recovery) to unblock critical deployments.
Lead Severity‑2 bridge calls, coordinating with engineering, partner, and vendor teams through resolution.
Manually create and manage Incident Communication Management (ICM) records when required.

Root Cause Analysis & Stability Improvements

Perform root cause analysis (RCA) for hardware, infrastructure, and release‑related failures.
Analyze recurring deployment faults and failure trends; file defects with actionable remediation details.
Investigate and correct incorrect fault‑bucket assignments to improve diagnostic accuracy.
Collect and analyze hardware logs; deliver structured reports to engineering and vendor teams.

Process, Automation & Documentation

Identify and drive automation opportunities for repetitive or high‑risk operational tasks.
Develop, maintain, and publish SOPs, TSGs, troubleshooting playbooks, and KB articles.
Improve workflows through automation, procedural updates, and process optimizations.

Reporting & Stakeholder Communication

Publish daily operational status reports and defect summaries.
Deliver weekly dashboards and quality reports covering deployment health, reliability metrics, and SLO adherence.
Provide regular status updates to stakeholders and participate in daily syncs with on‑call teams.

Required Skills & Experience

Strong experience in Azure cloud operations, SRE, or large‑scale infrastructure support.
Hands‑on experience with incident triage, RCA, and production support.
Solid understanding of storage systems, hardware failures, and deployment pipelines.
Experience working in 24x7 on‑call / shift‑based operational environments.

Good‑to‑Have / Preferred Skills

Hyper‑V: Virtualization troubleshooting and host‑level diagnostics.
Azure DevOps: CI/CD pipelines, release tracking, automation, and operational workflows.
Kusto / Azure Data Explorer (ADX): Writing KQL queries for operational insights and building dashboards for deployment health, defects, capacity, and reliability metrics.
Experience with automation scripting (PowerShell, Python, or similar).

Work Model

Hybrid: 3 days work from office, 2 days work from home.

Note:

Resources are multi-parked, so allocation will be on a first-come, first-served (FCFS) basis only.
CI blocking is valid for 3 days; if no CI feedback is received within that period, the resource will automatically be made available for other requirements.
If there is no response on proposed profiles within 3 days, the RR will be marked on hold.
