Responsibilities:
* Design and implement comprehensive observability strategies and architectures for AWS cloud environments, including metrics, logs, and distributed tracing.
* Configure and maintain observability tools and platforms, ensuring their proper integration with our systems and applications (cloud-native and monolithic).
* Develop custom dashboards and alerts to monitor key performance indicators (KPIs) and overall system health.
* Automate the deployment and management of observability infrastructure using Infrastructure as Code (IaC) tools.
* Work closely with development, operations, and engineering teams to understand their observability needs and provide effective solutions.
* Participate in incident resolution, providing observability data and analysis to identify root causes and facilitate recovery.
* Implement and manage observability solutions for containerized environments orchestrated with Amazon Elastic Kubernetes Service (EKS).
* Evaluate and recommend new observability tools and technologies to enhance our capabilities.
* Document observability configurations, processes, and best practices.
* Train and support other teams in the use of observability tools and techniques.
* Stay up to date on the latest trends and best practices in observability and cloud technologies.
Requirements:
* Cloud Knowledge and Experience (AWS):
* Proven experience (minimum 6-8 years) working with the Amazon Web Services (AWS) cloud platform.
* In-depth knowledge of AWS services relevant to observability, such as CloudWatch (Logs, Metrics, Alarms) and X-Ray, and potentially other AWS observability services.
* Understanding of the architecture and design principles of applications in the AWS cloud.
* Infrastructure as Code (IaC):
* Practical experience deploying and managing infrastructure using Infrastructure as Code (IaC) tools such as Terraform or similar.
* Ability to write, maintain, and improve IaC code to automate the creation and configuration of observability infrastructure.
* Elastic Kubernetes Service (EKS):
* Significant experience in the deployment, management, and observability of containerized applications using Amazon EKS.
* Deep understanding of Kubernetes concepts and how Kubernetes interacts with AWS.
* Hands-on experience configuring observability tools for Kubernetes environments within EKS, such as Prometheus, Grafana, the ELK Stack (Elasticsearch, Logstash, Kibana), and Jaeger.
* General Observability Experience:
* Solid understanding of observability principles and best practices (metrics, logs, distributed tracing).
* Experience with various observability and monitoring tools.
* Ability to develop effective dashboards and alerts based on observability data.
* Capacity to analyze observability data to identify performance and availability issues.
* Additional Technical Skills:
* Ability to develop scripts and automate tasks using languages such as Python or Bash.
* Knowledge of Linux operating systems.
* Familiarity with Agile and DevOps methodologies.
* Interpersonal Skills:
* Strong problem-solving skills and the ability to analyze complex data.
* Excellent communication and collaboration skills.
* Ability to work independently and as part of a team.
Nice to have:
* Relevant AWS certifications (e.g., AWS Certified DevOps Engineer – Professional).
* Experience with other container orchestration platforms (e.g., vanilla Kubernetes).
* Knowledge of Site Reliability Engineering (SRE) principles.
* Experience in implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs).