DevOps Monitoring and Management

Teamware Solutions
Hyderabad | 2-4 LPA | Posted 25 Jul 2025
FULL TIME
Jenkins
DevOps
Docker
Splunk
Kubernetes

Job Description

Key Responsibilities:

Monitoring and Alerting:

  • Set up, configure, and maintain monitoring systems (e.g., Prometheus, Grafana, Nagios, Datadog) for infrastructure, applications, and services.
  • Implement real-time monitoring and alerting systems to proactively identify and respond to issues related to performance, availability, and security.
  • Define and implement alerting thresholds for system and application performance to minimize downtime and service disruptions.
  • Continuously monitor the health and availability of key services and infrastructure components to ensure optimal performance.

Incident and Issue Management:

  • Work closely with engineering teams to investigate and resolve production incidents, performance bottlenecks, and security vulnerabilities.
  • Respond to incidents promptly and accurately, conduct root cause analysis (RCA), and deliver post-mortem reports.
  • Lead efforts to maintain application uptime and reliability, managing incidents so that they are resolved within established SLAs.

Automation and Optimization:

  • Automate monitoring, alerting, and reporting tasks to reduce manual intervention and improve operational efficiency.
  • Work closely with DevOps and Engineering teams to automate infrastructure provisioning, configuration management, and deployment pipelines using tools like Ansible, Terraform, Chef, or Puppet.
  • Implement performance tuning and optimization strategies for infrastructure and applications in production environments.

Cloud Infrastructure Management:

  • Manage and monitor cloud-based infrastructure (e.g., AWS, Azure, Google Cloud Platform), including compute, storage, and network services.
  • Integrate cloud monitoring tools to track resource utilization, costs, and security compliance.
  • Manage cloud-based logging and metrics collection to gain insights into cloud system performance and optimize cloud resources.

Collaboration with Development and Operations Teams:

  • Work closely with development and operations teams to integrate monitoring and alerting systems with the overall CI/CD pipeline.
  • Assist developers in troubleshooting application performance issues and identifying bottlenecks or failures in the deployment pipeline.
  • Ensure seamless collaboration between DevOps, development, and IT operations teams for continuous improvement and effective issue resolution.

Reporting and Documentation:

  • Produce detailed operational reports and dashboards to provide visibility into system health, performance trends, and incident resolution metrics.
  • Develop and maintain runbooks and documentation for monitoring processes, incident management, and response strategies.
  • Provide actionable insights into system and application performance to help leadership make informed decisions.

Security and Compliance:

  • Monitor system compliance with security policies, ensuring data protection and secure access controls across cloud environments and applications.
  • Collaborate with security teams to detect and mitigate security threats using integrated security monitoring tools.
  • Ensure audit logs are maintained, accessible, and compliant with relevant standards and regulations.

Continuous Improvement:

  • Stay up-to-date with emerging DevOps practices and monitoring technologies, continuously enhancing the capabilities of the monitoring environment.
  • Identify areas for improvement in monitoring systems, processes, and tools to enhance operational efficiency and reduce downtime.
  • Propose and implement new tools and technologies to further improve the effectiveness and scalability of the monitoring infrastructure.

Required Qualifications:

  • Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
  • 3-5 years of hands-on experience in DevOps, Monitoring, and Infrastructure Management.
  • Strong expertise in monitoring tools such as Prometheus, Grafana, Datadog, Nagios, Zabbix, or similar.
  • Experience with cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • Proficiency in automation tools like Ansible, Chef, Terraform, Puppet, or SaltStack.
  • Familiarity with CI/CD tools like Jenkins, GitLab CI, CircleCI, or Travis CI.
  • Experience with containerization and orchestration tools like Docker and Kubernetes.
  • Strong experience with infrastructure as code (IaC) practices and tools.
  • Familiarity with logging tools such as ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, or Fluentd.
  • Strong knowledge of Linux and Windows systems administration.
  • Solid understanding of networking concepts, firewalls, and load balancing.