Ifintalent Global Private Limited
Observability SRE
Hyderabad | ₹3-6 LPA | Posted 14 Nov 2025
FULL TIME
platform development
Kubernetes
AppDynamics
Google Cloud Platform
Prometheus
Job Description
Key Responsibilities:
- Design, build, and maintain observability platforms including monitoring, logging, tracing, and alerting systems.
- Implement and optimize metrics collection using tools like Prometheus, Grafana, OpenTelemetry, or similar.
- Develop and maintain centralized logging infrastructure (e.g., Datadog, OpenTelemetry, Splunk, or Google Cloud Logging).
- Implement distributed tracing solutions using tools such as Jaeger, Zipkin, AppDynamics, or OpenTelemetry.
- Collaborate with engineering teams to define SLIs, SLOs, and alerting thresholds.
- Automate observability workflows and integrate observability into CI/CD pipelines.
- Analyze and interpret telemetry data to proactively identify system issues and performance bottlenecks.
- Provide training and documentation to teams on best practices in observability.
- Continuously evaluate and adopt new observability technologies and practices.
Tools & Technologies:
- Skilled in AppDynamics, Splunk, ThousandEyes, and ITRS for instrumentation, monitoring, alerting, and incident response.
- Deep hands-on knowledge of Terraform, Kubernetes (GKE), GitLab CI/CD.
- Familiar with modern observability tooling such as OpenTelemetry, Grafana, and Datadog.
- Strong knowledge of data platforms: BigQuery, Cassandra, Kafka, PostgreSQL, MySQL.
- Experience with AI/ML-based operations tools for automation, anomaly detection, and predictive alerting.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
- Proven experience as an SRE or DevOps engineer, particularly in Google Cloud Platform (GCP).
- Expertise in designing and managing observability platforms and tools.
- Hands-on experience with monitoring systems like Prometheus, Grafana, Datadog, New Relic, etc.
- Proficient in logging solutions such as ELK, Splunk, Fluentd, or Google Cloud Logging.
- Familiarity with distributed tracing tools like OpenTelemetry, Jaeger, or Zipkin.
- Strong scripting and automation skills using Python, Go, Bash, or similar.
- Experience with cloud platforms (AWS, GCP, Azure) and their observability services.
- Solid understanding of Kubernetes and observability in containerized environments.
- Deep knowledge of networking, application performance, and distributed systems.
- Exposure to AI/ML-based observability or anomaly detection tools.
- Excellent troubleshooting, debugging, and analytical capabilities.
- Strong communication and cross-team collaboration skills.