LTM
Site Reliability Engineer
Hyderabad Posted 14 Mar 2026
FULL TIME
Storage Systems
Root Cause Analysis
Production Support
Automation Scripting
Azure DevOps
Job Description
Location: Hyderabad, L&T Metro, Floors 1-9, 11 & 12
Job Title: Jr. Site Reliability Engineer (SRE) – Azure Storage
Role Overview
We are seeking a Site Reliability Engineer (SRE) to support Azure Storage deployments and operations across public, sovereign, and pre‑production environments. The role focuses on deployment reliability, incident response, infrastructure health, automation, and data‑driven operational insights.
Key Responsibilities
Reliability, Deployments & Operations
- Execute Azure Storage (Classic / XPF / Direct Drive) tenant and infrastructure deployments across public, sovereign, and pre‑production environments.
- Monitor and maintain server uptime, tenant stability, and overall environment health.
- Track and reduce offline capacity and long‑running (long‑tail) deployments to improve deployment completion times.
- Manage end‑to‑end release tracking for storage components and ensure deployment compliance.
Incident Management & Troubleshooting
- Acknowledge, triage, and resolve deployment‑related incidents and operational alerts.
- Apply technical mitigations (including node recovery) to unblock critical deployments.
- Lead Severity‑2 bridge calls, coordinating with engineering, partner, and vendor teams through resolution.
- Manually create and manage Incident Communication Management (ICM) records when required.
Root Cause Analysis & Stability Improvements
- Perform root cause analysis (RCA) for hardware, infrastructure, and release‑related failures.
- Analyze recurring deployment faults and failure trends; file defects with actionable remediation details.
- Investigate and correct incorrect fault‑bucket assignments to improve diagnostic accuracy.
- Collect and analyze hardware logs; deliver structured reports to engineering and vendor teams.
Process, Automation & Documentation
- Identify and drive automation opportunities for repetitive or high‑risk operational tasks.
- Develop, maintain, and publish SOPs, TSGs, troubleshooting playbooks, and KB articles.
- Improve workflows through automation, procedural updates, and process optimizations.
Reporting & Stakeholder Communication
- Publish daily operational status reports and defect summaries.
- Deliver weekly dashboards and quality reports covering deployment health, reliability metrics, and SLO adherence.
- Provide regular status updates to stakeholders and participate in daily syncs with on‑call teams.
Required Skills & Experience
- Strong experience in Azure cloud operations, SRE, or large‑scale infrastructure support.
- Hands‑on experience with incident triage, RCA, and production support.
- Solid understanding of storage systems, hardware failures, and deployment pipelines.
- Experience working in 24x7 on‑call / shift‑based operational environments.
Good‑to‑Have / Preferred Skills
- Hyper‑V: Virtualization troubleshooting and host‑level diagnostics.
- Azure DevOps: CI/CD pipelines, release tracking, automation, and operational workflows.
- Kusto / Azure Data Explorer (ADX):
  - Writing KQL queries for operational insights
  - Building dashboards for deployment health, defects, capacity, and reliability metrics
- Experience with automation scripting (PowerShell, Python, or similar).
Work Model
- Hybrid: 3 days work from office, 2 days work from home.
Note:
- Resources are multiparked, so allocation will be on an FCFS (first come, first served) basis only.
- CI Blocking is valid for 3 days. If no CI feedback is received within 3 days, the resource will automatically be made available for other requirements.
- If there is no response on proposed profiles within 3 days, the RR will be marked on hold.