it job board logo
  • Home
  • Find IT Jobs
  • Register CV
  • Career Advice
  • Contact us
  • Employers
    • Register as Employer
    • Pricing Plans
  • Recruiting? Post a job
  • Sign in
  • Sign up
Sorry, that job is no longer available. Here are some results that may be similar to the job you were looking for.

64 jobs found

Current Search
sre site reliability engineer
Partnerscale
Site Reliability Engineer / SRE
Partnerscale Manchester, Lancashire
Site Reliability Engineer / SRE Manchester City Centre / Hybrid £55k - £65k DOE + Bonus We are working with a leading Tech Brand in Manchester who operate with a large and mature in-house engineering function responsible for a platform used by Millions of consumers, daily. It is a company that takes its technology seriously, with mission-critical systems that need to perform reliably at all times. They are looking for a Site Reliability Engineer to join a well-funded, settled SRE team, working across multiple engineering squads to build tooling and automation, improve observability practices and drive a culture of continuous improvement. It's a great opportunity to do genuinely impactful work within a business that invests heavily in its people and its technology. Our ideal SRE / Platform Engineer will have a Software Development background and around 5 years+ commercial experience in a similar role. Key requirements: A software development background, with commercial experience writing and contributing to production code Strong SRE knowledge across SLIs, SLOs and reliability frameworks Hands-on experience with Splunk, New Relic, Grafana etc Experience with IaC tools including Ansible or Terraform Background in a large-scale, 24/7 enterprise environment Interest in Platform Engineering and modern observability practices If you're a passionate SRE looking for a step up into a well-resourced, fast-paced environment, apply now.
03/04/2026
Full time
83Zero Ltd
Principal Site Reliability Engineer
83Zero Ltd
Principal Site Reliability Engineer - Active SC Required! Up to £100,000 + benefits Wokingham - Hybrid (UK-based) We're looking for a Principal Site Reliability Engineer to provide technical leadership across large-scale, complex platforms. This is a strategic role where you'll shape reliability engineering practices, influence architecture, and drive operational excellence across the organisation. What you'll be doing: Defining and driving SRE strategy, standards, and best practices Architecting highly resilient, scalable, and secure systems Leading major incident reviews and driving organisational improvements Influencing platform and application design at an architectural level Championing automation, self-healing systems, and reliability by design Acting as a mentor and technical authority across multiple teams What we're looking for: Extensive experience in SRE, DevOps, or platform engineering Proven track record of designing and operating large-scale distributed systems Deep expertise in cloud platforms and cloud-native architectures Strong experience with Kubernetes, infrastructure as code, and automation Excellent stakeholder management and leadership skills Ability to operate at both strategic and hands-on levels Why apply? High-impact role with influence across engineering and architecture Opportunity to shape reliability strategy at scale Work with cutting-edge technologies in a complex environment
02/04/2026
Full time
Cambridge University Press & Assessment
Principal Developer Team Lead
Cambridge University Press & Assessment Cambridge, Cambridgeshire
Job Title: Principal Developer Team Lead Salary: £51,400 - £68,800 Location: Cambridge/Hybrid Contract: Permanent This Principal Developer Team Lead position offers a pivotal opportunity to shape the technical future of a world-renowned academic organisation. You'll spearhead the migration of enterprise systems to cutting-edge cloud-native AWS architectures, while balancing hands-on technical leadership with people management responsibilities. We are Cambridge University Press & Assessment, a world-leading academic publisher and assessment organisation and a proud part of the University of Cambridge. About the role We're seeking a hands-on Principal Developer Team Lead to drive the technical transformation of our Exam Technology Organisation as we migrate legacy enterprise applications to modern, cloud-native architectures on AWS. You'll balance technical leadership with people management, leading a team of 4-8 developers while establishing the foundations for our future technology stack. Your initial focus will be on two strategic priorities: Evolving our SRE function - Building the DevOps infrastructure, automation, and tooling that enables Site Reliability Engineering practices across development and operations teams Advancing our AI development practice - Establishing standards, frameworks, and best practices for responsibly integrating AI capabilities into our education platforms. What You'll Do Technical Leadership Lead migration of legacy applications to cloud-native AWS architectures Build DevOps automation to support SRE practices Establish AI/ML development standards and frameworks Set observability, monitoring, and incident response standards Promote best practices in web, event-driven, and cloud-native technologies Provide technical expertise and oversee code reviews People Leadership Manage and mentor a team of 4-8 developers, providing coaching, development plan Identifying training needs in AI/ML and SRE. 
Support recruitment and foster a culture of continual improvement and wellbeing. Delivery & Collaboration Deliver software in agile squads Collaborate with architects, SREs, product owners, and infrastructure teams Liaise with stakeholders to identify education sector needs Plan and estimate migrations and feature delivery Coordinate with service management, security, and AWS experts About you Essential experience Degree or equivalent Proven technical team leadership Skilled in two or more modern programming languages Experience with AWS cloud and infrastructure DevOps skills: automation, CI/CD, infrastructure-as-code Understanding of SRE and observability Experience in web-apps and modern frameworks Strong communicator with technical and non-technical audiences Technical Expertise CI/CD pipelines, automation frameworks, and developer tooling Observability tools, monitoring, logging, and alerting systems Responsible AI practices and governance Event-driven architecture and microservices patterns Software design patterns and scalability best practices Security principles in cloud environments Leadership Qualities Ability to set technical standards and provide thought leadership Experience balancing people management with hands-on contribution Strong mentoring and coaching skills Collaborative approach that builds trust across teams Passion for continuous learning in AI/ML and DevOps Promotes inclusion and continuous improvement You'll be instrumental in our digital transformation, establishing the foundations for reliable, innovative systems that serve millions of learners, teachers, and researchers worldwide. By evolving our SRE function and advancing our AI practice, you'll empower teams to deliver high-performance solutions while responsibly harnessing cutting-edge technologies. If you would like to know more about this opportunity and what will make you successful, please see the full job description attached to the bottom of this vacancy on our careers site. 
Rewards and benefits We will support you to be at your best in work and to live well outside of it. In addition to competitive salaries, we offer a world-class, flexible rewards package , featuring family-friendly and planet-friendly benefits including: 28 days annual leave plus bank holidays Private medical and Permanent Health Insurance Discretionary annual bonus Group personal pension scheme Life assurance up to 4 x annual salary Green travel schemes We are a hybrid working organisation, and we offer a range of flexible working options from day one. We expect most hybrid-working colleagues to spend 40-60% of their time at their dedicated office or location. We will also consider other work arrangements if you wish to work more flexibly or require adjustments due to a disability. Ready to pursue your potential? Apply now. We review applications on an ongoing basis, with a closing date for all applications being 16th April 2026. As part of the application process you can expect: Two questions to select one answer from multiple options. A 15-minute screening call with the Hiring Manager. First stage interview via MS Teams or in person. You will be provided with a brief to complete a role related task which will need to be returned by email in advance of your interview. Please note that successful applicants will be subject to satisfactory background checks including DBS due to working in a regulated industry. Cambridge University Press & Assessment is an approved UK employer for the sponsorship of eligible roles and applicants under the Skilled Worker visa route. Please refer to the gov.uk website for guidance to understand your own eligibility based on the role you are applying for. Why join us Joining us is your opportunity to pursue potential. You'll belong to a collaborative team that's exploring new and better ways to serve students, teachers and researchers across the globe - for the benefit of individuals, society and the world. 
Sharing our mission will inspire your own growth, development and progress, in an environment which embraces difference, change and aspiration. Cambridge University Press & Assessment is committed to being a place where anyone can enjoy a successful career, where it's safe to speak up, and where we learn continuously to improve together. We welcome applications from all candidates, regardless of demographic characteristics (age, disability, educational attainment, ethnicity, gender, marital status, neurodiversity, religion, sex, gender identity and sexual identity), cultural, or social class/background. We believe better outcomes come through diversity of thought, background and approach. We welcome applications from people from all backgrounds and communities, actively seeking to employ people from a wide range of different communities.
02/04/2026
Full time
Head Resourcing
Senior Site Reliability Engineer (Public Cloud)
Head Resourcing Edinburgh, Midlothian
I'm partnered with a major organisation that's going through a huge SRE modernisation, and they're growing a brand-new, engineering-focused SRE function across their cloud platforms. We're now looking for an experienced Senior Site Reliability Engineer to join the team. This is real SRE work: reducing toil, building automation, improving system reliability and observability, and supporting large-scale cloud environments across Azure and GCP. The Role You'll be part of a unified SRE team supporting multiple cloud teams, working on: Reliability, performance and observability across Azure/GCP Automation to reduce repeat incidents, tickets, and manual processes Improving SLOs, SLIs, error budgets and platform health Building and maintaining Terraform modules, GitHub pipelines and IaC Supporting app teams as they migrate large workloads to cloud 1-in-4 on-call (enhanced pay) What They're Looking For 5+ years' experience as an SRE in large/complex environments Strong Azure and/or GCP capability Terraform + CI/CD experience (GitHub, IaC, scripting) Deep understanding of observability, data, logs and alerting Someone who wants to help shape a modern SRE culture - not just keep the lights on Why It's a Great Move Massive modernisation programme Opportunity to influence tooling, processes and culture Multi-cloud exposure (Azure + GCP) Proper engineering autonomy Clear progression opportunities as the team scales What they are offering: Hybrid working environment with a requirement to be in the office 2 days per week (Leeds, Halifax, Manchester, Bristol or Edinburgh). Enhanced benefits package which includes flexible cash sum, private medical, enhanced pension contribution, 28 days + bank holidays and more. If you are interested in finding out more, please send across an updated version of your CV, clearly demonstrating your relevant experience!
02/04/2026
Full time
IntaPeople
Senior Site Reliability Engineer
IntaPeople Nottingham, Nottinghamshire
We are partnering with a leading organisation in the data and analytics space to recruit an experienced Senior Site Reliability Engineer. This is an opportunity to join a highly collaborative, technically strong SRE function working on large-scale, cloud-native platforms that support high-volume, high-speed data services. The team is expanding due to increased workload, and this role will become the eighth member of an established, supportive engineering group. You'll play a key part in driving cloud automation, improving system reliability, and supporting critical production environments. Key Responsibilities Build, maintain, and improve AWS cloud infrastructure Develop automation using Terraform, Ansible, and Python Support incident response and troubleshoot performance issues Deliver routine maintenance, including patching and upgrades Enhance CI/CD pipelines (GitLab CI, GitHub CI) Contribute to Agile ceremonies and take ownership of user stories Implement new technologies and solutions to improve system reliability What You Will Bring Strong commercial experience with AWS (essential) Solid understanding of Linux systems (RHEL, CentOS or similar) Scripting skills, ideally Python Hands-on experience with Terraform and/or Ansible Proficiency with Docker Exposure to CI/CD tooling and Agile ways of working Background in software engineering, systems engineering, or previous SRE roles Minimum 4 years' experience in a relevant technical discipline Please note, this role is not suitable for candidates with Windows-only experience or engineers without hands-on AWS or Linux exposure. Remote working is supported, with an on-site presence in Nottingham preferred, ideally once per week.
01/04/2026
Contractor
Junior Site Reliability Engineer
Revybe IT Recruitment Ltd
Junior Site Reliability Engineer Central London (3 days a week in the office) Up to £55,000 per annum + Bonus + Generous Benefits Package We are working with an exciting technology company that are looking to bring in a Junior Site Reliability Engineer to help scale their cloud infrastructure and DevOps capability. They've built a high-performing engineering team and are now investing further into the platform side of things as demand grows. Think modern, cloud-native architecture, and a real emphasis on automation, scalability, and developer enablement. You'll join an experienced team you can learn and grow from. Tech stack AWS (Core services - EC2, RDS, S3, IAM, etc.) Monitoring and Observability Grafana, Prometheus, Datadog Kubernetes (building and managing production clusters) Terraform (IaC provisioning) Python, Bash or Go (scripting, automation) GitHub Actions (CI/CD pipelines) What They're Looking For Experience in AWS cloud infrastructure Previous experience working with Monitoring and Observability Tools - Datadog, Grafana or Prometheus Knowledge on how Kubernetes works. Understanding of IaC - Terraform. Experience with CI/CD (GitHub Actions or similar) A good communicator who enjoys working collaboratively across product and engineering. The client is willing to consider candidates without all the required skills and provide an environment to learn and grow on the job. Training and development is at the forefront of the business, where you will get plenty of opportunities to progress your career in whatever path you want. Junior Site Reliability Engineer Central London (3 days a week in the office) Up to £55,000 per annum + Bonus + Generous Benefits Package Click APPLY NOW to be considered for this position! AWS, SRE, Cloud, Kubernetes, EKS, Terraform, CI/CD, Automation etc.
01/04/2026
Full time
83Zero Ltd
Senior Site Reliability Engineer
83Zero Ltd Wokingham, Berkshire
Senior Site Reliability Engineer - Active SC Required! Up to £75,000 + benefits Wokingham - Hybrid (UK-based) We're seeking a Senior Site Reliability Engineer to play a key role in designing and operating highly reliable, scalable systems in a fast-paced environment. You'll act as a technical leader within the team, driving best practices across reliability engineering, automation, and system performance. What you'll be doing: Designing and improving system reliability, scalability, and observability Leading incident management and driving root cause analysis Building and maintaining robust CI/CD pipelines and automation frameworks Partnering with development teams to embed SRE principles into the SDLC Mentoring junior engineers and promoting engineering best practices What we're looking for: Strong experience in SRE, DevOps, or platform engineering roles Deep understanding of cloud infrastructure (AWS, Azure, or GCP) Hands-on experience with Kubernetes and containerised environments Strong scripting/programming skills (Python, Go, or similar) Experience with monitoring, alerting, and observability tooling Proven ability to troubleshoot complex distributed systems Why apply? Opportunity to influence technical direction and best practices Work on large-scale, mission-critical systems Leadership exposure with clear progression to principal level
01/04/2026
Full time
83Zero Ltd
Site Reliability Engineer
83Zero Ltd Wokingham, Berkshire
Site Reliability Engineer (SRE) - Active SC required! Up to £55,000 + benefits Hybrid (UK-based) We're looking for a Site Reliability Engineer to join a growing technology team delivering highly scalable, resilient systems across a range of enterprise environments. This is a fantastic opportunity for someone with a solid foundation in DevOps/SRE practices who wants to deepen their expertise in automation, reliability, and cloud-native technologies. What you'll be doing: Supporting the reliability, availability, and performance of production systems Monitoring applications and infrastructure, responding to incidents and driving resolution Automating manual processes to improve efficiency and reduce risk Collaborating with engineering teams to improve system design and resilience Contributing to CI/CD pipelines and infrastructure-as-code practices What we're looking for: Experience in an SRE, DevOps, or similar engineering role Knowledge of cloud platforms (AWS, Azure, or GCP) Familiarity with monitoring/logging tools (e.g. Prometheus, Grafana, ELK) Scripting or programming skills (e.g. Python, Bash, Go) Understanding of containers and orchestration (Docker/Kubernetes is a plus) Why apply? Work with modern, cloud-native technologies Supportive environment with strong learning and development opportunities Clear progression path into senior SRE roles
01/04/2026
Full time
Morson Edge
IT Manager
Morson Edge Manchester, Lancashire
IT Manager (CDN, AWS & SRE Focus) Manchester (Hybrid - 2 days in office) Up to £80,000 + Benefits Permanent, Full-Time The Opportunity Morson Edge are looking for an experienced IT Manager to lead and evolve a high-performing infrastructure and reliability function. This is a key leadership role where you'll shape strategy, improve system resilience, and drive best practices across CDN, AWS cloud environments, and Site Reliability Engineering (SRE). You'll work at the intersection of infrastructure, performance, and reliability, ensuring systems are scalable, secure, and always available. What You'll Be Doing Lead, mentor, and develop a team of engineers across cloud infrastructure and SRE Own and optimise AWS environments, ensuring scalability, cost-efficiency, and security Manage and enhance CDN performance and delivery strategies Drive adoption of SRE principles including SLIs, SLOs, and error budgets Improve system observability through monitoring, logging, and alerting Collaborate with engineering and product teams to support high-availability services Oversee incident management, root cause analysis, and continuous improvement Define and implement infrastructure best practices and automation What We're Looking For Proven experience in an IT Manager/Infrastructure Manager/SRE Lead role Strong expertise in AWS (EC2, Lambda, CloudFront, VPC, etc.) 
Solid understanding of Content Delivery Networks (CDN) and performance optimisation Experience implementing or working within SRE frameworks Knowledge of Infrastructure as Code (eg, Terraform, CloudFormation) Strong background in monitoring tools (eg, Prometheus, Grafana, Datadog) Excellent leadership and stakeholder management skills Nice to Have Experience with containerisation (Docker, Kubernetes) Exposure to DevOps culture and CI/CD pipelines Security and compliance awareness in cloud environments What's in It for You Salary up to £80,000 Hybrid working (2 days per week in Manchester office) Pension scheme Training and development opportunities A chance to shape and lead a modern, cloud-first infrastructure function
01/04/2026
Full time
Moorepay
Site Reliability Engineer (CloudOps)
Moorepay Manchester, Lancashire
The Site Reliability Engineer plays a critical role in ensuring that our AI-driven, cloud-native platform is reliable, observable, secure, and able to scale with the organisation's growth. As we adopt intelligent agents, autonomous workflows, and increasingly complex distributed systems, the SRE ensures that resilience, performance, and operational excellence are built into everything we deliver. By partnering closely with Engineers, Architects, and the Engineering Manager, the SRE defines the patterns, tooling, and automation that enable fast, safe, and repeatable deployments. This role safeguards our production environment, drives continuous improvement across CI/CD and observability, and establishes the reliability practices that empower autonomous squads to move quickly without compromising stability. The SRE is essential to maintaining customer trust, supporting AI-first innovation, and ensuring our platform remains robust, secure, and highly available at scale. In this position you will ensure the reliability, scalability, and security of our engineering systems. Working closely with the Engineering Manager and Head of Engineering, the SRE will identify priorities to remove friction from engineering teams, streamline processes, and enhance operational excellence. This role combines software engineering principles with systems administration to deliver robust, automated, cost-effective, and secure-by-design solutions. Key Responsibilities Reliability, Performance & Security: Design and implement strategies to improve system reliability, availability, and security. Ensure all solutions follow secure-by-design principles, incorporating cybersecurity best practices from inception through deployment. Conduct regular security reviews and collaborate with security teams to address vulnerabilities. CI/CD Management: Own and optimise Continuous Integration and Continuous Deployment pipelines. 
Embed security checks (e.g., static analysis, dependency scanning) into CI/CD workflows. Ensure secure, efficient, and automated deployment processes across environments. Monitoring & Observability: Implement and maintain monitoring solutions for infrastructure and applications. Develop dashboards and alerting systems to ensure proactive incident and security event management. Evaluate and integrate new observability tools as needed. Automation & Tooling: Automate repetitive tasks to improve efficiency and reduce human error. Build and maintain internal tools that support engineering productivity and security compliance. Champion Infrastructure as Code (IaC) practices using tools like Terraform or ARM templates. Cloud Infrastructure Management: Manage and optimise services across AWS and Azure environments. Ensure scalability, resilience, and security of service-based architectures. Implement cost management strategies to optimise cloud spend without compromising performance or security. Incident Response & Root Cause Analysis: Lead incident response efforts, including security incidents, and conduct post-mortem reviews. Drive continuous improvement through lessons learned and preventive measures. Skills & Experience Proven experience in AWS and Azure cloud environments. Strong background in CI/CD tools (e.g., Azure DevOps, Pipelines, GitHub Actions, Jenkins). Expertise in monitoring and observability platforms (e.g., Prometheus, Grafana, Datadog). Proficiency in scripting and automation (Python, Bash, PowerShell). Familiarity with containerisation and orchestration (Docker, Kubernetes). Solid understanding of networking, security, and cost optimisation in cloud environments. Knowledge of cybersecurity principles, secure coding practices, and compliance frameworks. A problem-solver with a proactive mindset. Comfortable working in fast-paced, evolving environments. Strong communicator who can bridge gaps between operations, development, and security teams. 
Passionate about automation, scalability, cost efficiency, and security.
01/04/2026
Full time
Randstad Technologies
SRE - Site Reliability Engineer
Randstad Technologies
Senior Site Reliability Engineer (Observability) Location: London/UK (Remote) Contract: 12 Months Initial Rate: £55 - £62 per hour (Inside IR35) Job Overview We are looking for a Senior Site Reliability Engineer with strong experience in Observability, Monitoring and Distributed Systems to support large-scale cloud infrastructure supporting millions of devices globally. The role focuses on building and scaling monitoring, logging and alerting platforms to ensure high availability and performance of cloud services. Responsibilities Design, deploy and scale observability platforms Manage and scale Prometheus monitoring systems Deploy and maintain large Elasticsearch clusters Build and maintain data pipelines using Kafka Develop alerting and monitoring frameworks Automate infrastructure using Terraform and Ansible Develop tools and scripts using Python, Go, Ruby or Bash Work with Linux systems (Debian/Ubuntu) Participate in on-call rotation Improve system reliability, performance and scalability Required Skills 5+ years experience in Site Reliability Engineering / DevOps Strong Linux systems experience Observability and Monitoring tools experience Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana) Kafka Terraform / Infrastructure as Code Ansible / Configuration Management Programming experience (Python, Go, Ruby or Bash) Distributed systems and cloud infrastructure experience This is an urgent vacancy where the hiring manager is shortlisting for an interview immediately. Please apply with a copy of your CV or send it khushboo. Co. uk Randstad Technologies is acting as an Employment Business in relation to this vacancy.
01/04/2026
Contractor
Site Reliability Engineer
Revybe IT Recruitment Ltd
Site Reliability Engineer Central London (3 days a week in the office) Up to £70,000 per annum + Bonus + Generous Benefits Package We are working with an exciting technology company that are looking to bring in a Site Reliability Engineer to help scale their cloud infrastructure and DevOps capability. They've built a high-performing engineering team and are now investing further into the platform side of things as demand grows. Think modern, cloud-native architecture, and a real emphasis on automation, scalability, and developer enablement. You'll have the autonomy to make technical decisions and help shape how platform engineering is done as the team continues to scale. Tech stack AWS (Core services - EC2, RDS, S3, IAM, etc.) Monitoring and Observability Grafana, Prometheus, Datadog Kubernetes (building and managing production clusters) Terraform (IaC provisioning) Python, Bash or Go (scripting, automation) GitHub Actions (CI/CD pipelines) What They're Looking For Experience in AWS cloud infrastructure (ideally in a regulated or high-traffic environment) Previous experience working with Monitoring and Observability Tools Hands-on Kubernetes know-how, specifically with EKS. Solid IaC experience with Terraform. Experience with containerisation (Docker, Helm) and CI/CD (GitHub Actions or similar) Solid scripting/Automation experience with Python, Bash or Go A good communicator who enjoys working collaboratively across product and engineering. Desirable Certifications - CKA, CKAD, AWS Solutions Architect etc. The client is willing to consider candidates without all the required skills and provide an environment to learn and grow on the job. Training and development is at the forefront of the business, where you will get plenty of opportunities to progress your career in whatever path you want. 
Site Reliability Engineer Central London (3 days a week in the office) Up to £70,000 per annum + Bonus + Generous Benefits Package Click APPLY NOW to be considered for this position! AWS, SRE, Cloud, Kubernetes, EKS, Terraform, CI/CD, Automation etc.
01/04/2026
Full time
itecopeople
eDV DevOps Engineer / Site Reliability Engineer
itecopeople Cheltenham, Gloucestershire
eDV DevOps Engineer / Site Reliability Engineer (SRE) - AWS, Kubernetes - Contract Outside IR35. We are supporting a specialist engineering consultancy delivering secure technology platforms to high-profile UK government organisations. They are seeking an eDV Cleared DevOps Engineer / Site Reliability Engineer (SRE) with strong experience across AWS, Kubernetes, Terraform, CI/CD and Linux environments to support the continued growth of critical cross-domain systems. This contract role will focus on improving platform reliability, automation, infrastructure as code, observability and DevOps practices across both cloud and on-premise environments. You will work closely with software engineers, platform engineers and operations teams to ensure highly secure, scalable and resilient systems supporting sensitive government programmes. Location: Cheltenham (Hybrid - 3 days onsite) Rate: £500 - £650 per day Outside IR35 Security Clearance: Active eDV Clearance required Start Date ASAP As a DevOps / Site Reliability Engineer, you will be responsible for ensuring the availability, performance, and reliability of services supporting sensitive government programmes. You will collaborate with multiple feature development teams and BAU/support teams to evolve both cloud and on-premise infrastructure, delivery pipelines, and observability tooling. The role will focus on improving system reliability, monitoring, automation, and performance, while proactively identifying and mitigating operational risks. This position may also involve participation in an on-call rota, which could include occasional 24/7 call-out support. Key Responsibilities: Collaborate with software engineering teams to improve subsystem reliability and performance. Work with system administrators to automate operational processes and reduce manual effort. Enhance monitoring and observability capabilities to proactively detect and resolve issues. 
Support development environments to improve delivery speed and quality. Contribute to the evolution of infrastructure, DevOps practices, and CI/CD pipelines. Research and evaluate new technologies and tools to support engineering decisions. Develop expertise across multiple technical and business domains. Required Skills & Experience Active eDV clearance is essential configuration management tools such as Ansible, Chef, or similar Strong Terraform Docker containers and container orchestration platforms (Kubernetes, OpenShift, Docker Swarm) maintaining and using CI/CD tooling such as Jenkins Monitoring and observability experience with Prometheus, Grafana, or InfluxDB event-driven integration and messaging systems such as RabbitMQ or other AMQP solutions Strong Linux command line, administration, and shell scripting experience Solid understanding of relational databases and SQL network security protocols Working with cloud platforms, ideally AWS (EC2, RDS, S3, Lambda) Azure a plus Please send your CV to Laura at (url removed) to progress matters. Services Advertised are those of Employment Business.
31/03/2026
Contractor
CBSbutler Holdings Limited trading as CBSbutler
Senior Site Reliability Engineer (SRE)
CBSbutler Holdings Limited trading as CBSbutler
Senior Site Reliability Engineer (SRE)
Remote
12-month contract (high chance of extension)

Job Description
Join a global pioneer in the video game industry and own the reliability of high-traffic, revenue-critical platforms used by millions worldwide. As a Senior SRE, you'll shape the architecture, improve platform-wide resiliency, and ensure services stay performant, scalable, and secure. This isn't just about maintaining a single system: you'll influence reliability across multiple services, driving improvements that touch the entire ecosystem.

Key Responsibilities
  • Lead incident response and troubleshooting for production systems, resolving high-severity issues and driving post-incident improvements.
  • Influence architecture to improve platform-wide reliability, resiliency, and operational efficiency, ensuring services remain available under heavy load.
  • Drive containerisation best practices and manage Kubernetes-based workloads at scale.
  • Build and maintain event-driven architectures that scale globally while ensuring fault tolerance and high availability.
  • Automate infrastructure provisioning, deployment, and monitoring using Infrastructure as Code (Terraform, CloudFormation, Ansible, CDK).
  • Collaborate with engineering, product, and security teams to define SLOs, SLIs, and error budgets across services.
  • Provide mentorship, advocate SRE best practices, and ensure teams are empowered to deliver resilient, reliable systems.

Experience / Must-Have Skills
  • Extensive experience in AWS and AWS-managed services (EC2, Lambda, S3, VPC, CloudWatch, CloudTrail, IAM, EKS, Service Catalog, multi-account environments).
  • Strong Kubernetes / container orchestration experience, including EKS, OpenShift, Docker, and service mesh.
  • Deep understanding of networking fundamentals: DNS, VPCs, routing, load balancing, TCP/IP, firewall policies.
  • Proven track record in incident response and troubleshooting at scale.
  • Hands-on experience with infrastructure automation and CI/CD pipelines.
  • Experience designing event-driven architectures and resilient systems.
  • High level of autonomy, able to influence platform-wide decisions and architect for reliability across services.
  • Ability and desire to mentor junior staff.
  • Bonus: experience in gaming, interactive entertainment, or other high-traffic, global-scale platforms.

If you are interested in this role, please feel free to submit your CV.
31/03/2026
Contractor
Pontoon
SRE Transformation Lead (Global Banking & Payments)
Pontoon
Job Title: SRE Transformation Lead / Senior SRE Engineer (Global Banking & Payments)
Contract Length: 12 months
Location: Bromley / London (3 days a week)
Working Pattern: Full Time

Are you ready to lead a transformative journey in the world of Global Banking and Payments? Our client is seeking a passionate and experienced SRE Transformation Lead to help shape and scale Site Reliability Engineering (SRE) practices across a highly regulated banking environment. This is your chance to drive innovation, foster collaboration, and make a real impact on service reliability!

Role Overview:
As the SRE Transformation Lead / Senior SRE Engineer, you will lead and accelerate transformation from traditional L2 production support toward an SRE operating model. Your hands-on experience will be crucial in defining and implementing SRE practices across critical banking and payment services, ensuring measurable reliability outcomes and streamlined operations.

Required Skills:
  • Significant experience in Site Reliability Engineering and implementing SRE practices across large-scale, complex services in essential
  • Demonstrated experience leading an SRE transformation in a corporate banking environment (or similarly regulated financial services enterprise).
  • Proven ability to implement and scale SLO/SLI and error budget approaches, and to operationalize them across multiple teams and services.
  • Strong engineering background with the ability to drive automation and reduce manual toil through code, tooling, and process redesign.
  • Deep knowledge of incident response, problem management, root cause analysis, and operational resilience practices in mission-critical environments.
  • Strong stakeholder management skills, able to influence technology and business partners and communicate effectively at senior levels.

Key Responsibilities:
  • SRE Operating Model & Transformation: Lead the design and execution of the SRE adoption strategy, transitioning teams to a reliability engineering mindset.
  • Reliability Measurement: Drive the implementation of Critical User Journeys, SLIs, SLOs, and error budgets to align metrics with user experience and business objectives.
  • Toil Reduction & Automation: Identify and eliminate operational toil through automation, enhancing engineering practices and operational tooling.
  • Incident & Problem Management: Strengthen incident response frameworks and improve production outcomes through effective root cause analysis and preventive engineering.
  • Observability & Tooling: Establish observability standards to enhance service monitoring, partnering with teams to align SRE needs with enterprise tooling.
  • Stakeholder Management: Influence leaders across operations and engineering, driving the adoption of SRE principles and fostering a culture of reliability.

Preferred Qualifications:
  • Experience with high-availability banking platforms and 24x7 operational expectations.
  • Familiarity with observability tools and building SRE communities of practice.

Why Join Us?
  • Be a Pioneer: Lead the charge in transforming how reliability engineering is approached in the banking sector.
  • Collaborative Environment: Work with a diverse team that values innovation, teamwork, and excellence.
  • Professional Growth: Take on a pivotal role that will challenge and expand your skills in a dynamic and fast-paced industry.

Are you ready to take the next step in your career and make a lasting impact? If you have the expertise and enthusiasm for driving SRE transformation, we want to hear from you! Apply Now! Join our client in revolutionising the Global Banking & Payments landscape. Your journey toward making a difference starts here!

Pontoon is an employment consultancy. We put expertise, energy, and enthusiasm into improving everyone's chance of being part of the workplace. We respect and appreciate people of all ethnicities, generations, religious beliefs, sexual orientations, gender identities, and more. We do this by showcasing their talents, skills, and unique experience in an inclusive environment that helps them thrive. If you require reasonable adjustments at any stage, please let us know and we will be happy to support you.
31/03/2026
Contractor
Randstad Technologies Recruitment
SRE - Site Reliability Engineer
Randstad Technologies Recruitment
Senior Site Reliability Engineer (Observability)
Location: London/UK (Remote)
Contract: 12 Months Initial
Day rate: 55 Per Hour - 62 Per Hour Inside IR35

Job Overview
We are looking for a Senior Site Reliability Engineer with strong experience in Observability, Monitoring and Distributed Systems to support large-scale cloud infrastructure supporting millions of devices globally. The role focuses on building and scaling monitoring, logging and alerting platforms to ensure high availability and performance of cloud services.

Responsibilities
  • Design, deploy and scale observability platforms
  • Manage and scale Prometheus monitoring systems
  • Deploy and maintain large Elasticsearch clusters
  • Build and maintain data pipelines using Kafka
  • Develop alerting and monitoring frameworks
  • Automate infrastructure using Terraform and Ansible
  • Develop tools and scripts using Python, Go, Ruby or Bash
  • Work with Linux systems (Debian/Ubuntu)
  • Participate in on-call rotation
  • Improve system reliability, performance and scalability

Required Skills
  • 5+ years' experience in Site Reliability Engineering / DevOps
  • Strong Linux systems experience
  • Observability and Monitoring tools experience
  • Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana)
  • Kafka
  • Terraform / Infrastructure as Code
  • Ansible / Configuration Management
  • Programming experience (Python, Go, Ruby or Bash)
  • Distributed systems and cloud infrastructure experience

This is an urgent vacancy where the hiring manager is shortlisting for an interview immediately. Please apply with a copy of your CV or send it khushboo. Co. uk

Randstad Technologies is acting as an Employment Business in relation to this vacancy.
31/03/2026
Contractor
Cambridge University Press & Assessment
Principal Developer Team Lead
Cambridge University Press & Assessment
Principal Developer Team Lead
Salary: £51,400 - £68,800
Location: Cambridge/Hybrid
Contract: Permanent

This Principal Developer Team Lead position offers a pivotal opportunity to shape the technical future of a world-renowned academic organisation. You'll spearhead the migration of enterprise systems to cutting-edge cloud-native AWS architectures, while balancing hands-on technical leadership with people management responsibilities.

We are Cambridge University Press & Assessment, a world-leading academic publisher and assessment organisation and a proud part of the University of Cambridge.

About the role
We're seeking a hands-on Principal Developer Team Lead to drive the technical transformation of our Exam Technology Organisation as we migrate legacy enterprise applications to modern, cloud-native architectures on AWS. You'll balance technical leadership with people management, leading a team of 4–8 developers while establishing the foundations for our future technology stack. Your initial focus will be on two strategic priorities:
  • Evolving our SRE function - Building the DevOps infrastructure, automation, and tooling that enables Site Reliability Engineering practices across development and operations teams.
  • Advancing our AI development practice - Establishing standards, frameworks, and best practices for responsibly integrating AI capabilities into our education platforms.

What You'll Do

Technical Leadership
  • Lead migration of legacy applications to cloud-native AWS architectures
  • Build DevOps automation to support SRE practices
  • Establish AI/ML development standards and frameworks
  • Set observability, monitoring, and incident response standards
  • Promote best practices in web, event-driven, and cloud-native technologies
  • Provide technical expertise and oversee code reviews

People Leadership
  • Manage and mentor a team of 4–8 developers, providing coaching and development plans
  • Identify training needs in AI/ML and SRE
  • Support recruitment and foster a culture of continual improvement and wellbeing

Delivery & Collaboration
  • Deliver software in agile squads
  • Collaborate with architects, SREs, product owners, and infrastructure teams
  • Liaise with stakeholders to identify education sector needs
  • Plan and estimate migrations and feature delivery
  • Coordinate with service management, security, and AWS experts

About you

Essential experience
  • Degree or equivalent
  • Proven technical team leadership
  • Skilled in two or more modern programming languages
  • Experience with AWS cloud and infrastructure
  • DevOps skills: automation, CI/CD, infrastructure-as-code
  • Understanding of SRE and observability
  • Experience in web apps and modern frameworks
  • Strong communicator with technical and non-technical audiences

Technical Expertise
  • CI/CD pipelines, automation frameworks, and developer tooling
  • Observability tools, monitoring, logging, and alerting systems
  • Responsible AI practices and governance
  • Event-driven architecture and microservices patterns
  • Software design patterns and scalability best practices
  • Security principles in cloud environments

Leadership Qualities
  • Ability to set technical standards and provide thought leadership
  • Experience balancing people management with hands-on contribution
  • Strong mentoring and coaching skills
  • Collaborative approach that builds trust across teams
  • Passion for continuous learning in AI/ML and DevOps
  • Promotes inclusion and continuous improvement

You'll be instrumental in our digital transformation, establishing the foundations for reliable, innovative systems that serve millions of learners, teachers, and researchers worldwide. By evolving our SRE function and advancing our AI practice, you'll empower teams to deliver high-performance solutions while responsibly harnessing cutting-edge technologies.

If you would like to know more about this opportunity and what will make you successful, please see the full job description attached to the bottom of this vacancy on our careers site.

Rewards and benefits
We will support you to be at your best in work and to live well outside of it. In addition to competitive salaries, we offer a world-class, flexible rewards package, featuring family-friendly and planet-friendly benefits including:
  • 28 days annual leave plus bank holidays
  • Private medical and Permanent Health Insurance
  • Discretionary annual bonus
  • Group personal pension scheme
  • Life assurance up to 4 x annual salary
  • Green travel schemes

We are a hybrid working organisation, and we offer a range of flexible working options from day one. We expect most hybrid-working colleagues to spend 40-60% of their time at their dedicated office or location. We will also consider other work arrangements if you wish to work more flexibly or require adjustments due to a disability.

Ready to pursue your potential? Apply now.
We review applications on an ongoing basis, with a closing date of 18 February 2026 for all applications. If you are shortlisted and progressed through the stages, you can expect:
  • A 40-minute screening call with the Hiring Manager.
  • First stage interview via MS Teams or in person. You will be provided with a brief to complete a role-related task which will need to be returned by email in advance of your interview.

Please note that successful applicants will be subject to satisfactory background checks including DBS due to working in a regulated industry.

Cambridge University Press & Assessment is an approved UK employer for the sponsorship of eligible roles and applicants under the Skilled Worker visa route. Please refer to the gov.uk website for guidance to understand your own eligibility based on the role you are applying for.

Why join us
Joining us is your opportunity to pursue potential. You'll belong to a collaborative team that's exploring new and better ways to serve students, teachers and researchers across the globe – for the benefit of individuals, society and the world. Sharing our mission will inspire your own growth, development and progress, in an environment which embraces difference, change and aspiration.

Cambridge University Press & Assessment is committed to being a place where anyone can enjoy a successful career, where it's safe to speak up, and where we learn continuously to improve together. We welcome applications from all candidates, regardless of demographic characteristics (age, disability, educational attainment, ethnicity, gender, marital status, neurodiversity, religion, sex, gender identity and sexual identity), cultural, or social class/background.

We believe better outcomes come through diversity of thought, background and approach. We welcome applications from people from all backgrounds and communities, actively seeking to employ people from a wide range of different communities.
04/02/2026
Full time
Cambridge University Press & Assessment
Site Reliability Engineer Team Lead
Cambridge University Press & Assessment Cambridge/Hybrid (with 2-3 days per week in office)
Job Title:  English Technology Platform SRE Team Lead Salary:  £68,600 - £91,700 Location:  Cambridge/Hybrid (with 2-3 days per week in office) Contract:  Permanent  Hours:  Full time Are you ready to shape the future of technology platforms at the heart of Cambridge's academic excellence? Join us as our English Technology Platform SRE Team Lead and help drive innovation, reliability, and intelligent automation in a world-class environment. We are Cambridge University Press & Assessment, a world-leading academic publisher and assessment organisation and a proud part of the University of Cambridge.  About the role   The SRE Team Lead will lead a mature Site Reliability Engineering function within the Platform Operations Team, working closely with Platform Support and Engineering teams. This role demands strong thought leadership, technical depth, and strategic direction for the discipline, with a particular emphasis on leveraging AI-driven operations (AIOps) and FinOps practices to optimise reliability, performance, and cloud spend. Although this is a hands-on technical role, the SRE Team Lead will also manage a small team of SRE, providing clear direction and ensuring consistent, data-driven, AI-enhanced service delivery across the platforms while working collaboratively with existing support and engineering groups. Apply core SRE and DevOps principles—culture, automation, testing, measurement, and continuous improvement—to build and optimise pipelines focused on rapid, reliable software delivery. Integrate AIOps capabilities, such as automated anomaly detection and intelligent alerting, to further enhance operational excellence. Work with Solutions Architecture, Development, and QA teams to automate processes wherever possible, creating and improving stable CI/CD pipelines for both software and infrastructure. Develop tools that enable rapid provisioning of environments and resources across all teams, incorporating AI-assisted automation where beneficial. 
Use automation, observability, and monitoring tools to improve site reliability and proactively identify issues. Support development teams with troubleshooting, particularly in infrastructure, networking, and multi-tier application design. Serve as a subject matter expert for cloud services—especially AWS PaaS—while applying FinOps practices to ensure cloud cost transparency, optimisation, and efficient resource usage. Create and maintain robust technical documentation for the infrastructure of the English platforms, including operational runbooks enhanced with predictive and AI-supported insights. Stay engaged with developments in the SRE, DevOps, AIOps, and FinOps communities, continually introducing new practices and technologies to improve reliability, performance, automation, and cloud cost efficiency   This position has been classified as a hybrid role, requiring the selected candidate to typically spend 40-60% of their time collaborating and connecting face-to-face at their dedicated location. Aside from our hybrid principles, other flexible working requests will be considered from the first day of employment, including other work arrangements should you require adjustments due to a disability or long-term health condition.    About you A passion for Site reliability engineering and driven to understand, anticipate, and counter platform related issues before they become problems and staying up to date with the latest technological trends and developments Great communication allowing effective collaboration across technical leadership and various business stakeholders with the ability to present ideas and strategies clearly and persuasively. Demonstratable soft skills in motivating, inspiring and leading a team (direct line management is not part of the roles remit) Educated to degree level or equivalent and with a minimum of 5 years proven experience in a systems administration or dev-ops blended role. 
  • Experience implementing technologies such as Terraform, GitHub Actions and containerisation/orchestration, e.g. Kubernetes and Docker.
  • Expertise in monitoring tools like New Relic, Grafana, Alert Manager and Site24x7.
  • Extensive knowledge of cloud computing infrastructure, especially Amazon Web Services (EKS, ECS, RDS, Route53 etc.).
  • Excellent troubleshooting, debugging, communication and documentation skills.
  • Experience of working within an Agile product development environment.

For a detailed job description, please refer to the link at the bottom of the advert on our careers site.

We are a Disability Confident (DC) employer that is committed to equality and inclusion, ensuring our recruitment process is accessible to all. The DC scheme's Offer of an Interview commitment applies to applicants who opt in, disclose a disability or a long-term health condition, and best meet the minimum criteria for the role. In instances where interviewing all qualifying candidates is not practicable, we prioritise those who best meet the minimum criteria, as we would for applicants who do not have a disability or long-term health condition.

Cambridge University Press & Assessment is an approved UK employer for the sponsorship of eligible roles and applicants under the Skilled Worker visa route. Please refer to the gov.uk website for guidance to understand your own eligibility based on the role you are applying for.

Rewards and benefits

We will support you to be at your best in work and to live well outside of it. In addition to competitive salaries, we offer a world-class, flexible rewards package, featuring family-friendly and planet-friendly benefits including:

  • 28 days annual leave plus bank holidays
  • Private medical and Permanent Health Insurance
  • Discretionary annual bonus
  • Group personal pension scheme
  • Life assurance up to 4 x annual salary
  • Green travel schemes

Ready to pursue your potential? Apply now.
We aim to support candidates by making our interview process clear and transparent. The closing date for all applications will be 4th February. We will review applications on an ongoing basis, and shortlisted candidates can expect interviews to take place shortly after it closes. If you are shortlisted and progress through the stages, you can expect:

  • A 15-minute screening call with the Hiring Manager.
  • Final stage virtual interview via MS Teams.

If you require any reasonable adjustments during the recruitment process due to a disability or a long-term health condition, there will be an opportunity for you to inform us via the online application form. We will do our best to accommodate your needs.

Please note that successful applicants will be subject to satisfactory background checks, including DBS, due to working in a regulated industry.

We are committed to an equitable recruitment process. As such, applications must be submitted via our official online application procedure. Please refrain from sending your CV directly to our recruiters. If you experience technical difficulties or require additional support with submitting your online application, contact the Recruiter.

Why join us

Joining us is your opportunity to pursue potential. You will belong to a collaborative team that is exploring new and better ways to serve students, teachers and researchers across the globe, for the benefit of individuals, society and the world. Sharing our mission will inspire your own growth, development and progress, in an environment which embraces difference, change and aspiration.

Cambridge University Press & Assessment is committed to being a place where anyone can enjoy a successful career, where it is safe to speak up, and where we learn continuously to improve together.
We welcome applications from all candidates, regardless of demographic characteristics (age, disability, educational attainment, ethnicity, gender, marital status, neurodiversity, religion, sex, gender identity and sexual identity), culture, or social class/background.

We believe better outcomes come through diversity of thought, background and approach. We welcome applications from people from all backgrounds and communities, and actively seek to employ people from a wide range of different communities.

If you are ready to take the next step in your Cambridge journey, we welcome your application. Together, we continue to shape a culture where everyone feels empowered to succeed and motivated to make a difference: for ourselves, for each other, and for learners worldwide.
21/01/2026
Full time
Scope AT Limited
AVP Infrastructure Cloud Support - AWS, Terraform, Python, DevOps, SRE - Permanent
AVP Infrastructure Cloud Support - AWS, Terraform, Python, DevOps, SRE - Permanent

Job purpose

This role supports the AWS public cloud infrastructure and the implementation of Infrastructure as Code using Terraform. The role will work closely with the SRE and Engineering teams to ensure that the Cloud environment has sufficient observability and is appropriately managed.

What you will be doing:

  • Responsible for ensuring the Production service is prioritized, with all service incidents, problems and requests for cloud hosted services responded to and actioned.
  • Responsible for maintaining the reliability and security of the Cloud Hosted environments.
  • Improve Observability and Telemetry in the Cloud Hosted environments, utilizing SRE methodology to give SLAs, SLOs and SLIs.
  • Ensure risks within the Cloud hosted environment are documented and regularly reviewed, and that identified operational risk issues are captured with appropriate actions tracked to agreed timelines.
  • Define and implement standards and procedures to adhere to current best practice and drive continual service improvement.
  • Responsible for ensuring Security standards are implemented and maintained in the Cloud hosted environment, including delivery of upgrades and security updates to minimise risk and ensure stability for all cloud hosted services.
  • Responsible for maintaining service resilience for all cloud hosted services, including backup and disaster recovery processes. Where necessary, plan and conduct quarterly DR tests for all cloud hosted services, ensuring any findings are captured and addressed promptly.

What we're looking for:

  • Must have strong technical operational skills in supporting AWS Cloud Hosted environments and at least 3 years in an Infrastructure support role.
  • Strong understanding of Infrastructure as Code technologies, ideally including Terraform and Ansible.
  • Operational risk and control management processes, including an understanding of Security best practice and how to apply this safely within a Production environment.
  • Asset management and life cycle (EOS/EOL) process management.
  • Planning and leading disaster recovery fail-overs of IT systems and services.
  • Preferably experience of working in a regulated financial services/banking organization.
  • Able to understand and use AWS, including an understanding of AWS services, security and networking.
  • Knowledge of at least 1 programming language, preferably Python.
  • Knowledge of CI/CD specifically relating to Cloud Hosted environments, including an understanding of some of the Infrastructure as Code tools: Git, Terraform, Ansible, Jenkins.

Permanent Role - Hybrid working (Central London based) - Candidate must be eligible to work in the UK

By applying to this job you are sending us your CV, which may contain personal information. Please refer to our Privacy Notice to understand how we process this information. In short, in order to supply you with work finding services, we will hold and process your personal data, and only with your express permission will we share this personal data with a client (or a third party working on behalf of the client) by email or by upload to the client's or third party's vendor management system. Giving us permission to send your CV to a client constitutes permission to share the personal data that would be necessary to consider your application, interview you (phone/video/face to face) and, if successful, hire you.

Scope AT acts as an employment agency for Permanent Recruitment and an employment business for the supply of temporary workers. By applying for this job you accept the Terms and Conditions, Data Protection Policy, Privacy Notice and Disclaimers which can be found at our website.
06/10/2025
Full time
ARM (Advanced Resource Managers)
Senior Site Reliability Engineer
Senior Site Reliability Engineer
6 months
Remote
£Negotiable - INSIDE IR35

Tech Stack

  • Multiple Platforms and Applications
  • AWS and Azure - Cloud
  • Mainframe skills would be handy
  • Latest applications on Cloud
  • DevOps skills would be helpful
  • Attitude of being part of the team and owning the outcomes
  • Advocate - to change the culture to SRE

Disclaimer: This vacancy is being advertised by either Advanced Resource Managers Limited, Advanced Resource Managers IT Limited or Advanced Resource Managers Engineering Limited ("ARM"). ARM is a specialist talent acquisition and management consultancy. We provide technical contingency recruitment and a portfolio of more complex resource solutions. Our specialist recruitment divisions cover the entire technical arena, including some of the most economically and strategically important industries in the UK and the world today. We will never send your CV without your permission. Where the role is marked as Outside IR35 in the advertisement, this is subject to receipt of a final Status Determination Statement from the end Client and may be subject to change.
06/10/2025
Contractor

© 2008-2026 IT Job Board