
Site Reliability Engineer at Alter Domus

Hyderabad | 5-7 LPA | Posted 25 Jun 2025
Full time
Skills: Windows, JavaScript, Event Management, Networking, Linux

Job Description

We are looking for an experienced and motivated DevOps Engineer to join our Site Reliability Engineering (SRE) team. This role will lead the Grafana Cloud and Backstage implementations that are part of our Observability project. The ideal candidate combines technical expertise in observability tools with strong problem-solving skills and a passion for building efficient, reliable systems.

Key Responsibilities:

  • Configure and manage data sources, including Prometheus and Azure Monitor, to build dashboards in Grafana.
  • Collaborate with DevOps engineers, system administrators, and software developers to understand monitoring requirements and design robust observability solutions.
  • Customize and extend Grafana functionalities by developing and implementing plugins and scripts.
  • Enhance visualizations for observability solutions to meet organizational needs.
  • Optimize dashboard performance and usability by fine-tuning data queries.
  • Troubleshoot and resolve issues related to Grafana configuration, data ingestion, and visualizations.
  • Participate in the administration, maintenance, and development of observability tools, including Grafana and the ELK stack.
  • Troubleshoot network communication problems and ensure smooth operations.
  • Support Backstage implementation to enhance developer experience within the organization.

Required Skills:

  • Familiarity with Event Management and Application Monitoring concepts.
  • Experience in building and enhancing visualizations for observability solutions.
  • Proficiency with observability tools such as Grafana, Prometheus, Dynatrace, Splunk, Azure Monitor, or AWS CloudWatch.
  • Expertise in scripting with one or more of the following languages: Unix Shell, Windows PowerShell, JavaScript, Python, or Go.
  • Strong problem-solving and analytical skills, with the ability to troubleshoot complex network communication issues.
  • Hands-on experience with the administration, maintenance, and development of Grafana or the ELK stack.
  • 5-7 years of domain experience in monitoring or a related field.
  • Comfortable working with both Windows and Linux command lines.
  • Excellent communication and collaboration skills, with the ability to work effectively within a team and interact with stakeholders.

Core/Must-Have Skills

  • Observability Subject Matter Expertise (SME)
  • Prometheus
  • Azure Monitor
  • Grafana
  • OpenTelemetry

Good-to-Have Skills

  • Proficiency in Unix Shell, Windows PowerShell, JavaScript, Python, or Go.
  • Familiarity with Backstage implementation.
  • Experience troubleshooting network communication problems.
