Teamware Solutions
DevOps Monitoring and Management
Bangalore ₹5-7 LPA Posted 16 Jul 2025
FULL TIME
Splunk
ELK Stack
Azure
AWS
Job Description
- Key Responsibilities:
- Monitoring and Alerting:
- Set up, configure, and maintain monitoring systems (e.g., Prometheus, Grafana, Nagios, Datadog) for infrastructure, applications, and services.
- Implement real-time monitoring and alerting systems to proactively identify and respond to issues related to performance, availability, and security.
- Define and implement alerting thresholds for system and application performance to minimize downtime and service disruptions.
- Continuously monitor the health and availability of key services and infrastructure components to ensure optimal performance.
- Incident and Issue Management:
- Work closely with engineering teams to investigate and resolve production incidents, performance bottlenecks, and security vulnerabilities.
- Provide timely and accurate incident response for issues that arise, including conducting root cause analysis (RCA) and providing post-mortem reports.
- Lead efforts to maintain application uptime and reliability, managing incidents through to resolution within established SLAs.
- Automation and Optimization:
- Automate monitoring, alerting, and reporting tasks to reduce manual intervention and improve operational efficiency.
- Work closely with DevOps and Engineering teams to automate infrastructure provisioning, configuration management, and deployment pipelines using tools like Ansible, Terraform, Chef, or Puppet.
- Implement performance tuning and optimization strategies for infrastructure and applications in production environments.
- Cloud Infrastructure Management:
- Manage and monitor cloud-based infrastructure (e.g., AWS, Azure, Google Cloud Platform), including compute, storage, and network services.
- Integrate cloud monitoring tools to track resource utilization, costs, and security compliance.
- Manage cloud-based logging and metrics collection to gain insights into cloud system performance and optimize cloud resources.
- Collaboration with Development and Operations Teams:
- Work closely with development and operations teams to integrate monitoring and alerting systems with the overall CI/CD pipeline.
- Assist developers in troubleshooting application performance issues and identifying bottlenecks or failures in the deployment pipeline.
- Ensure seamless collaboration between DevOps, development, and IT operations teams for continuous improvement and effective issue resolution.
- Reporting and Documentation:
- Produce detailed operational reports and dashboards to provide visibility into system health, performance trends, and incident resolution metrics.
- Develop and maintain runbooks and documentation for monitoring processes, incident management, and response strategies.
- Provide actionable insights into system and application performance to help leadership make informed decisions.
- Security and Compliance:
- Monitor system compliance with security policies, ensuring data protection and secure access controls across cloud environments and applications.
- Collaborate with security teams to detect and mitigate security threats using integrated security monitoring tools.
- Ensure audit logs are maintained, accessible, and compliant with relevant standards and regulations.
- Continuous Improvement:
- Stay up to date with emerging DevOps practices and monitoring technologies, continuously enhancing the capabilities of the monitoring environment.
- Identify areas for continuous improvement in monitoring systems, processes, and tools to enhance operational efficiency and reduce downtime.
- Propose and implement new tools and technologies to further improve the effectiveness and scalability of the monitoring infrastructure.
- Required Qualifications:
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
- 3-5 years of hands-on experience in DevOps, monitoring, and infrastructure management.
- Strong expertise in monitoring tools such as Prometheus, Grafana, Datadog, Nagios, Zabbix, or similar.
- Experience with cloud platforms such as AWS, Azure, or Google Cloud Platform.
- Proficiency in automation tools like Ansible, Chef, Terraform, Puppet, or SaltStack.
- Familiarity with CI/CD tools like Jenkins, GitLab CI, CircleCI, or Travis CI.
- Experience with containerization and orchestration tools like Docker and Kubernetes.
- Strong experience with infrastructure as code (IaC) practices and tools.
- Familiarity with logging tools such as ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, or Fluentd.
- Strong knowledge of Linux and Windows systems administration.
- Solid understanding of networking concepts, firewalls, and load balancing.