Observability SRE

Ifintalent Global Private Limited
Hyderabad
3-6 LPA
Posted 14 Nov 2025
FULL TIME
platform development
Kubernetes
AppDynamics
Google Cloud Platform
Prometheus

Job Description

Key Responsibilities:

  • Design, build, and maintain observability platforms including monitoring, logging, tracing, and alerting systems.
  • Implement and optimize metrics collection using tools like Prometheus, Grafana, OpenTelemetry, or similar.
  • Develop and maintain centralized logging infrastructure (e.g., Datadog, OpenTelemetry, Splunk, or Google Cloud Logging).
  • Implement distributed tracing solutions using tools such as Jaeger, Zipkin, AppDynamics, or OpenTelemetry.
  • Collaborate with engineering teams to define SLIs, SLOs, and alerting thresholds.
  • Automate observability workflows and integrate observability into CI/CD pipelines.
  • Analyze and interpret telemetry data to proactively identify system issues and performance bottlenecks.
  • Provide training and documentation to teams on best practices in observability.
  • Continuously evaluate and adopt new observability technologies and practices.

Tools & Technologies:

  • Skilled in AppDynamics, Splunk, ThousandEyes, and ITRS for instrumentation, monitoring, alerting, and incident response.
  • Deep hands-on knowledge of Terraform, Kubernetes (GKE), and GitLab CI/CD.
  • Familiar with modern observability tools such as OpenTelemetry, Grafana, and Datadog.
  • Strong knowledge of data platforms: BigQuery, Cassandra, Kafka, PostgreSQL, MySQL.
  • Experience with AI/ML-based operations tools for automation, anomaly detection, and predictive alerting.

Qualifications:

  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
  • Proven experience as an SRE or DevOps engineer, particularly on Google Cloud Platform (GCP).
  • Expertise in designing and managing observability platforms and tools.
  • Hands-on experience with monitoring systems such as Prometheus, Grafana, Datadog, or New Relic.
  • Proficient in logging solutions such as ELK, Splunk, Fluentd, or Google Cloud Logging.
  • Familiarity with distributed tracing tools such as OpenTelemetry, Jaeger, or Zipkin.
  • Strong scripting and automation skills using Python, Go, Bash, or similar.
  • Experience with cloud platforms (AWS, GCP, Azure) and their observability services.
  • Solid understanding of Kubernetes and observability in containerized environments.
  • Deep knowledge of networking, application performance, and distributed systems.
  • Exposure to AI/ML-based observability or anomaly detection tools.
  • Excellent troubleshooting, debugging, and analytical capabilities.
  • Strong communication and cross-team collaboration skills.
