We are looking for an experienced and motivated DevOps Engineer to join our Site Reliability Engineering (SRE) team. This role involves spearheading the Grafana Cloud and Backstage implementations as part of our Observability project. The ideal candidate will bring a blend of technical expertise in observability tools, strong problem-solving skills, and a passion for creating efficient, reliable systems.

Key Responsibilities:

Configure and manage data sources, including Prometheus and Azure Monitor, to build dashboards in Grafana.
Collaborate with DevOps engineers, system administrators, and software developers to understand monitoring requirements and design robust observability solutions.
Customize and extend Grafana functionalities by developing and implementing plugins and scripts.
Enhance visualizations for observability solutions to meet organizational needs.
Optimize dashboard performance and usability by fine-tuning data queries.
Troubleshoot and resolve issues related to Grafana configuration, data ingestion, and visualizations.
Participate in the administration, maintenance, and development of observability tools, including Grafana and the ELK stack.
Troubleshoot network communication problems and ensure smooth operations.
Support Backstage implementation to enhance developer experience within the organization.

Required Skills:

Familiarity with Event Management and Application Monitoring concepts.
Experience in building and enhancing visualizations for observability solutions.
Proficiency with observability tools such as Grafana, Prometheus, Dynatrace, Splunk, Azure Monitor, or AWS CloudWatch.
Expertise in scripting with one or more of the following languages: Unix Shell, Windows PowerShell, JavaScript, Python, or Go.
Strong problem-solving and analytical skills, with the ability to troubleshoot complex network communication issues.
Hands-on experience with the administration, maintenance, and development of Grafana or the ELK stack.
5-7 years of domain experience in monitoring or related fields.
Comfortable working with both Windows and Linux command lines.
Excellent communication and collaboration skills, with the ability to work effectively within a team and interact with stakeholders.

Core/Must-Have Skills:

Observability Subject Matter Expertise (SME)
Prometheus
Azure Monitor
Grafana
OpenTelemetry

Good-to-Have Skills:

Proficiency in Unix Shell, Windows PowerShell, JavaScript, Python, or Go.
Familiarity with Backstage implementation.
Experience troubleshooting network communication problems.

Site Reliability Engineer at Alter Domus
