Responsibilities:

* Design and implement comprehensive observability strategies and architectures for AWS cloud environments, including metrics, logs, and distributed tracing.

* Configure and maintain observability tools and platforms, ensuring their proper integration with our systems and applications (both cloud-native and monolithic).

* Develop custom dashboards and alerts to monitor key performance indicators (KPIs) and overall system health.

* Automate the deployment and management of observability infrastructure using Infrastructure as Code (IaC) tools.

* Work closely with development, operations, and engineering teams to understand their observability needs and provide effective solutions.

* Participate in incident resolution, providing observability data and analysis to identify root causes and facilitate recovery.

* Implement and manage observability solutions for containerized environments orchestrated with Amazon Elastic Kubernetes Service (EKS).

* Evaluate and recommend new observability tools and technologies to enhance our capabilities.

* Document observability configurations, processes, and best practices.

* Train and support other teams in the use of observability tools and techniques.

* Stay up to date on the latest trends and best practices in observability and cloud technologies.

Requirements:

* Cloud Knowledge and Experience (AWS):

* At least 5 years of proven experience working with the Amazon Web Services (AWS) cloud platform.

* In-depth knowledge of AWS services relevant to observability, such as CloudWatch (Logs, Metrics, Alarms) and X-Ray, and ideally other AWS observability offerings.

* Understanding of the architecture and design principles of applications in the AWS cloud.

* Infrastructure as Code (IaC):

* Practical experience deploying and managing infrastructure using Infrastructure as Code (IaC) tools such as Terraform or similar.

* Ability to write, maintain, and improve IaC code to automate the creation and configuration of observability infrastructure.

* Elastic Kubernetes Service (EKS):

* Significant experience in the deployment, management, and observability of containerized applications using Amazon EKS.

* Deep understanding of Kubernetes concepts and of how Kubernetes interacts with AWS.

* Hands-on experience configuring observability tools for Kubernetes environments within EKS, such as Prometheus, Grafana, the ELK Stack (Elasticsearch, Logstash, Kibana), and Jaeger.

* General Observability Experience:

* Solid understanding of observability principles and best practices (metrics, logs, distributed tracing).

* Experience with various observability and monitoring tools.

* Ability to develop effective dashboards and alerts based on observability data.

* Capacity to analyze observability data to identify performance and availability issues.

* Additional Technical Skills:

* Ability to develop scripts and automate tasks using languages such as Python or Bash.

* Knowledge of Linux operating systems.

* Familiarity with Agile and DevOps methodologies.

* Interpersonal Skills:

* Strong problem-solving skills and the ability to analyze complex data.

* Excellent communication and collaboration skills.

* Ability to work independently and as part of a team.

Nice to have:

* Relevant AWS certifications (e.g., AWS Certified DevOps Engineer – Professional).

* Experience with other container orchestration platforms (e.g., vanilla Kubernetes).

* Knowledge of Site Reliability Engineering (SRE) principles.

* Experience in implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs).