Join us as a Senior Site Reliability Engineer
As a Senior Site Reliability Engineer, you'll act as a hands-on expert responsible for ensuring the reliability, availability, and performance of critical production platforms. You'll lead the adoption of Site Reliability Engineering (SRE) practices, embedding resilience, observability, and operational excellence into distributed systems running on AWS and Kubernetes. You'll also take ownership of 24/7 production support models, ensuring systems are highly available and that incidents are effectively managed and learned from.
We'll also expect you to design and operate highly resilient AWS-based Kubernetes platforms (EKS) aligned with enterprise standards, while owning and continuously improving production reliability, availability, and Service Level Agreement or Service Level Objective (SLA/SLO) frameworks. You'll lead incident management, escalation, and 24/7 on-call practices, including post-incident reviews, and embed SRE principles such as error budgets, toil reduction, and reliability engineering into delivery teams. Furthermore, you'll implement infrastructure and platform automation using Terraform and GitOps methodologies, and drive self-healing, auto-scaling, and failure recovery mechanisms using tools such as Karpenter.
In addition to this, you'll be:
We're looking for a highly experienced Site Reliability Engineer with a strong background in operating large-scale, business-critical platforms and a passion for reliability engineering. You must also have deep expertise in managing production systems on AWS and Kubernetes (EKS), along with strong experience in 24/7 support models, incident management, and on-call leadership.
Moreover, you'll need to demonstrate advanced knowledge of SRE principles such as Service Level Indicators (SLIs), SLOs, error budgets, and toil reduction, as well as proficiency in Terraform, GitOps, and cloud automation practices. Hands-on experience with GitLab continuous integration and continuous delivery (CI/CD) pipelines and Argo CD is also essential.
In addition, you'll have to bring: