Browse IT Jobs | IT Job Board

Senior Software Engineer / Reliability Engineering - Real-time Data

BLOOMBERG L.P.

Location: London
Business Area: Engineering and CTO
Ref #:

Description & Requirements

Our department is responsible for efficiently distributing financial data from its source to interested users all around the world. This includes, for example, stock prices and foreign exchange rates. Data can either be served in response to a request or streamed in real time.

The group owns:
- The distribution software and infrastructure
- A range of different sources of data
- Supporting services to administer and manage the system, including permissioning and metering

The team is also responsible for the Enterprise endpoint ("B-PIPE"), which allows end-users to programmatically consume data via our SDK. Data is also available through the Bloomberg Terminal and Microsoft Excel.

The main challenge faced by the group is one of scale. Data is sourced from more than 370 global exchanges, with a combined volume in excess of 60 billion messages each day. We deliver this data to hundreds of thousands of terminals and thousands of B-PIPEs. Handling this volume requires significant infrastructure: we manage multiple clusters in our main data centres, as well as a network of many thousands of servers around the world.

Group Overview

The RD Reliability Engineering group comprises three sub-teams located in Tokyo, London, and New York, providing follow-the-sun support. Our mission is to ensure systems are reliable, scalable, and observable through software engineering, while continuously improving how systems behave under load and failure conditions.

We work in an outcome-driven model, focusing on measurable improvements in availability, latency, capacity, and recovery. Our goal is to ensure systems meet defined service level objectives while minimising manual operational effort through automation and software solutions.

The systems we support must behave predictably under extreme load, recover quickly from failures, and continue to evolve without compromising stability - these are the core challenges we solve.

London Team Focus - Availability & Resiliency

The London team plays a key role in ensuring the availability and resiliency of RD infrastructure globally. We focus on:
- Detecting and preventing failures across large-scale distributed systems
- Ensuring infrastructure demonstrates sufficient capacity and failover capability during site-loss scenarios
- Reducing time to detect, diagnose, and recover from incidents
- Ensuring systems behave predictably under both normal and adverse conditions

This role provides the opportunity to influence how reliability is engineered across the platform, working closely with teams globally to improve system behaviour and design.

What You'll Do
- Build and maintain production-grade software supporting Bloomberg's global distribution infrastructure
- Design and implement scalable, fault-tolerant systems with a focus on observability, performance, and automation
- Analyse system behaviour under real-world and failure scenarios to validate that capacity, failover, and recovery meet resilience objectives
- Identify bottlenecks, scaling limits, and reliability risks across distributed systems
- Improve detection, diagnosis, and prevention of production issues
- Build tools and frameworks to increase system visibility and reduce time to detect and resolve incidents
- Automate operational workflows to reduce manual effort and improve system reliability
- Partner with application and infrastructure teams to improve system design, resilience, and performance
- Contribute to design discussions, incident reviews, and reliability improvements across the platform

Systems You'll Work With
- Configuration systems serving thousands of servers across the global network
- Service discovery and clustering systems for distributed infrastructure
- Monitoring and observability frameworks for large-scale server estates
- Tooling for diagnosing data quality and distribution issues

Ownership of systems may evolve over time as the team focuses on areas of highest impact.

What Success Looks Like
- Systems consistently meet defined reliability, latency, and capacity objectives
- Issues are detected and mitigated before significant customer impact
- Systems are demonstrably resilient, with proven failover capability and sufficient capacity under failure conditions
- Operational processes are automated and scalable
- Reliability is achieved through engineering improvements rather than manual intervention

What We're Looking For

We're not a traditional SRE team. We engineer reliability through software, building solutions that automate operations and improve system resilience by design.
- Experience with an object-oriented programming language (preferably Python or C++)
- Strong focus on building reliable, observable distributed systems
- Experience working with SLOs, SLIs, and production reliability metrics
- Proven ability to triage and resolve live production problems
- A mindset focused on automation and reducing operational toil
- A strength in collaborating within an inclusive team environment
- The ability to work across departments and build strong relationships with both technical and non-technical partners

Why Join Us

You'll work on systems that sit at the core of Bloomberg's real-time data platform, operating at global scale and under demanding performance and reliability requirements. This is an opportunity to:
- Solve complex distributed systems problems with real-world impact
- Influence how reliability is engineered across a critical platform
- Work with teams across multiple regions and technical domains
- Build systems that are resilient by design and operate at massive scale

If indicated, please note that years of experience are a guide; we will consider applications from all candidates who can demonstrate the skills necessary for the role.

Discover what makes Bloomberg unique - watch our for an inside look at our culture, values, and the people behind our success.

Bloomberg is an equal opportunity employer and we value diversity at our company. We do not discriminate on the basis of age, ancestry, color, gender identity or expression, genetic predisposition or carrier status, marital status, national or ethnic origin, race, religion or belief, sex, sexual orientation, sexual and other reproductive health decisions, parental or caring status, physical or mental disability, pregnancy or parental leave, protected veteran status, status as a victim of domestic violence, or any other classification protected by applicable law.

Bloomberg is a disability inclusive employer. Please let us know if you require any reasonable adjustments to be made for the recruitment process. If you would prefer to discuss this confidentially, please email

26/06/2026

Full time

Tech Lead, Site Reliability Engineering

London Stock Exchange Group

The successful candidate will be responsible for building team resilience, overseeing day-to-day operations, and ensuring effective resource allocation across support demands. They must demonstrate strong business acumen and confidently collaborate with stakeholders during critical incidents. The role requires ownership of incident calls, leading them independently until Subject Matter Experts (SMEs) are engaged, and coordinating resolution efforts effectively.

Flexibility is crucial, as the role includes out-of-office availability, including public holidays, and managing on-call responsibilities. The ideal candidate will possess excellent self-leadership, prioritization skills, and the ability to drive team performance while maintaining high service standards.

Key Responsibilities
- Manage BAU support activities for Risk Intelligence applications during UTC morning hours
- Oversee shift planning to ensure 24x5 onsite coverage and 24x7 on-call support
- Provide out-of-hours and on-call support, including overnight production monitoring and weekend release support
- Ensure team adherence to performance metrics and promote continuous improvement
- Maintain effective documentation and processes to meet support and audit requirements
- Develop and regularly review 24-hour runbooks for each business line
- Lead incident management and resolution within the SRE team
- Deliver daily, weekly, and monthly status reports to stakeholders
- Collaborate with technical and business teams to manage stakeholder expectations

People Leadership & Site Management
- Lead and mentor a team of SRE engineers, fostering a culture of ownership, learning, and engineering excellence
- Drive career development, performance management, and technical capability growth
- Partner with HR and Talent teams to support hiring and workforce planning
- Promote diversity, equity, inclusion (DEI), well-being, and psychological safety
- Represent the site in global SRE forums and contribute to offshore strategy and location planning
- Support Business Continuity Planning (BCP) and Disaster Recovery (DR) readiness

Person Specification

Education
- Bachelor's degree or equivalent experience, preferably in a technical discipline

Required Skills and Experience
- Experience with Linux (Amazon Linux AMI) and Windows Server 2019 in cloud environments
- Proficient in MySQL, PostgreSQL, MongoDB, and Aurora RDS
- Familiarity with AWS DocumentDB, DynamoDB, and SQLite
- Knowledge of MS SQL Always On Availability Groups and migration to Azure SQL Managed Instances
- Hands-on experience with AWS SQS and AWS SES
- Exposure to Amazon MSK, Coviant, and Cerberus
- Solid understanding of AWS S3 and EFS, including frontend integration
- Experience with Synapse Analytics and D365
- Skilled in development using Spring Boot, Node.js, Python (Django, Flask, Apache Airflow), Java (Java 11, Lambdas), React, Angular, JavaScript, C# (.NET Framework), and PHP
- Proficient in containerization and orchestration using Docker, Amazon ECS, EKS, and EC2

Additional Attributes
- 8+ years in production operations, SRE, or DevOps roles, with 3+ years in a leadership capacity
- Demonstrable experience managing SRE or production support teams in complex environments
- Strong collaboration and knowledge-sharing mentality
- Experience in investment banking and familiarity with financial products
- Excellent analytical and problem-solving skills
- Effective communicator with technical and business stakeholders, including senior management

LSEG is committed to encouraging a diverse, equitable, and inclusive work environment, ensuring equal opportunities for all employees, regardless of their background. We offer great employee benefits to make sure everyone performs to the best of their abilities.

Join us and be part of a team that values innovation, quality, and continuous improvement. If you're ready to take your career to the next level and make a significant impact, we'd love to hear from you.

LSEG is a leading global financial markets infrastructure and data provider. Our purpose is driving financial stability, empowering economies and enabling customers to create sustainable growth.

Our purpose is the foundation on which our culture is built. Our values of Integrity, Partnership, Excellence and Change underpin our purpose and set the standard for everything we do, every day. They go to the heart of who we are and guide our decision making and everyday actions.

Working with us means that you will be part of a dynamic organisation of 25,000 people across 65 countries. However, we will value your individuality and enable you to bring your true self to work so you can help enrich our diverse workforce.

We are proud to be an equal opportunities employer. This means that we do not discriminate on the basis of anyone's race, religion, colour, national origin, gender, sexual orientation, gender identity, gender expression, age, marital status, veteran status, pregnancy or disability, or any other basis protected under applicable law. Conforming with applicable law, we can reasonably accommodate applicants' and employees' religious practices and beliefs, as well as mental health or physical disability needs.

You will be part of a collaborative and creative culture where we encourage new ideas. We are committed to sustainability across our global business and we are proud to partner with our customers to help them meet their sustainability objectives. Our charity, the LSEG Foundation, provides charitable grants to community groups that help people access economic opportunities and build a secure future with financial independence. Colleagues can get involved through fundraising and volunteering.

LSEG offers a range of tailored benefits and support, including healthcare, retirement planning, paid volunteering days and wellbeing initiatives.

Proud to share that LSEG in India is Great Place to Work certified (Jun '25 - Jun '26). Learn more about life and purpose of our company directly from India colleagues' video:

Career Stage: Senior Associate

26/06/2026

Full time

Senior AWS Engineer SRE

Spectrum IT Recruitment

London (Hybrid - 2 Days Onsite near Barbican Station)

Looking for an opportunity to make a real impact on a major government programme? We're recruiting a Cloud Operations Engineer SRE to join a large-scale project focused on onboarding a major public sector client and delivering highly secure customer contact solutions.

This is a key role within a growing engineering team, helping to build, support and optimise cloud platforms that underpin critical services. You'll work across AWS cloud infrastructure, Linux environments, container platforms and databases, helping to ensure secure, scalable and highly available systems.

What we're looking for:
- Strong AWS cloud experience (EKS, ECS, EC2, RDS, IAM, VPC)
- Linux systems administration expertise
- Containerisation and Kubernetes experience
- Terraform and Infrastructure as Code knowledge
- Database administration experience (PostgreSQL, MySQL, Aurora or similar)
- A passion for reliability, security and continuous improvement

Why apply?
- Join a major long-term government project
- Work on secure, mission-critical technology
- Excellent opportunity to influence architecture and operational excellence
- Be part of a global technology organisation investing heavily in growth and innovation

If you enjoy solving complex infrastructure challenges and want to play a key role in a high-profile programme, we'd love to hear from you.

Spectrum IT Recruitment (South) Limited is acting as an Employment Agency in relation to this vacancy.

26/06/2026

Full time

Support Engineer

Ordnance Survey Southampton, Hampshire

Support Engineer Full time Salary £43,918.00 - £51,238.00 (dependent on experience) Southampton, Hybrid Working Ordnance Survey (OS) is the national mapping agency for Great Britain, and a world-leading geospatial data and technology organisation. As a reliable partner to government, business and citizens across Britain and the world, OS helps its customers in virtually all sectors improve quality of life. OS expertise and data supports efficient public services and infrastructure, new technologies in transport and communications, national security and emergency services and exploring the great outdoors. By being at the forefront of geospatial capability for more than 230 years, we've built a reputation as the world's most inspiring and trusted geospatial partner. About the role You'll join one of our Technology Support Teams, which supports and maintains a range of internal ETL services that produce data for our customer delivery platforms. As a Support Engineer, you will help strengthen the resilience, reliability and quality of OS's ETL services by: Maintaining, managing and supporting software and infrastructure that underpin key business activities Designing and implementing improvements to service performance, including automating deployments, right-sizing systems, and extending monitoring and alerting capabilities Safeguarding critical services by continually assessing and improving observability, resilience and security Investigating and resolving root cause issues, identifying why failures occur, and working with subject matter experts when necessary to fully resolve problems Applying DevSecOps and SRE principles to improve both services and team capability Proactively delivering service improvements, ensuring they align with development cycles and Agile workflows You'll work closely with Engineers and wider stakeholders every day, collaborating within an Agile environment. 
To thrive in this role, you'll bring: Strong software development and operational knowledge Experience supporting scalable cloud-based services, ideally in Azure Effective communication and teamwork skills, enabling smooth collaboration across multiple teams What we're looking for You will need to demonstrate your track record against the following essential criteria: Experience in coding and engineering practices, for example, creating infrastructure-as-code, software engineering, DevOps methodologies or test automation Genuine passion for continually improving the reliability and stability of operational services Supporting and maintaining complex, cloud-based services working within an Agile framework Solid understanding of Site Reliability Engineering (SRE) and software engineering best practices Cloud technologies and best practice - ideally in Azure Infrastructure-as-Code - ideally using Bicep A track record of continually identifying and implementing service improvements or observability Experience of coaching and mentoring other team members and providing consultancy to other teams Additionally, you will provide expert technical consultancy to enable the business to successfully use IT systems, supporting Project Teams by advising on the performance, configuration and functionality of existing systems. You will ensure that new or updated systems can be effectively supported from day one. You will also play a key role in developing the technical capability of the team - mentoring, teaching and strengthening shared knowledge, skills and engineering practice. Here is a snapshot of the technologies that we use Azure Cloud (AppServices, Function Apps, DataFactory, Batch, EntraID, AzureML, etc.) Azure DevOps (Pipelines, Bicep templates, CLI, PowerShell, Artefacts) Python/C# Azure Databricks Git, YAML PowerBI ESRI ArcGIS, FME, QGIS, etc. 
Key details Closing date: Thursday 2nd July at 23:59 Interview Information: Candidates are required to complete a SOVA personality assessment and a practical exercise in advance of the interview. These will form part of the interview discussion.

26/06/2026

Full time

Senior Lead Software Engineering - AI/ML Engineer

JPMorgan Chase & Co.

Join us to shape the future of AI/ML data platforms, where your expertise will help create resilient and market leading solutions. You will have the opportunity to collaborate with innovators across our global network, driving strategic change and mentoring others. We value your skills in solving complex challenges and fostering a culture of reliability and growth. At JPMorganChase, your impact will reach far beyond your team, opening doors to career advancement and meaningful relationships. As a Site Reliability Engineer in the AI/ML Data Platforms team, you will play a key role in building scalable and resilient data solutions. You will engage in root cause analysis, production changes, and operational improvements, while supporting budgetary and staffing decisions. You will mentor team members and partner with colleagues across the organization to drive strategic change. Your contributions will help shape a collaborative, innovative, and high performing team culture. Job Responsibilities Demonstrate expertise in application development and support across technologies such as Databricks, Snowflake, AWS, and Kubernetes Coordinate incident management coverage to ensure effective resolution of application issues Collaborate with cross functional teams to perform root cause analysis and implement production changes Develop and support AI/ML solutions for troubleshooting and incident resolution Mentor and guide team members to foster growth and drive strategic change Build and maintain scalable, resilient, and market leading data solutions Support budgetary and staffing considerations to optimize team performance Engage in operational stability and disaster recovery planning Implement automation tools to reduce toil and improve efficiency Ensure compliance with risk controls and company wide standards Build meaningful relationships across teams to achieve common goals Required Qualifications, Capabilities, and Skills Proficient in site reliability culture and 
principles, with experience implementing site reliability within applications or platforms Skilled in running production incident calls and managing incident resolution Experienced in observability, including white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk Strong understanding of SLI/SLO/SLA and Error Budgets Proficient in Python or PySpark for AI/ML modeling Able to reduce toil by building automation tools for repeated tasks Hands on experience in system design, resiliency, testing, operational stability, and disaster recovery Awareness of risk controls and compliance with departmental and company wide standards Collaborative team player with the ability to build meaningful relationships Preferred Qualifications, Capabilities, and Skills Experience in an SRE or production support role with AWS Cloud, Databricks, Snowflake, or similar technologies AWS and Databricks certifications Advanced knowledge of AI/ML troubleshooting and incident resolution Familiarity with budgetary and staffing optimization Experience mentoring and guiding team members Strong communication and interpersonal skills Demonstrated ability to drive strategic change across teams

26/06/2026

Full time

Senior AI/ML Data Platform SRE Lead

JPMorgan Chase & Co.

JPMorgan Chase & Co. is seeking a Site Reliability Engineer for their AI/ML Data Platforms team in Greater London. In this role, you will develop scalable and resilient data solutions, manage incident resolutions, and foster team collaboration. Your experience with observability tools and site reliability principles will be key, along with proficiency in Python or PySpark. The position offers the chance to drive strategic change in a dynamic and innovative environment, opening doors for career advancement.

26/06/2026

Full time

Principal Cloud Architect

TXP Southampton, Hampshire

3-month initial contract, scope for extension Inside IR35, (Apply online only) a day Location: Southampton - 2 x a week on site Role Overview We are seeking an experienced Principal Cloud Platform Engineer with strong DevOps leadership capability to support the design, delivery and continuous improvement of secure, scalable and resilient cloud platforms across Azure and AWS environments. The role will focus on building and governing cloud architecture patterns, landing zones, infrastructure standards and automation practices, while working closely with engineering, security, product and delivery teams. The successful candidate will provide manager-level technical leadership across DevOps, cloud platforms, Infrastructure as Code, CI/CD, networking, security, observability and reliability engineering. They will help shape enterprise-scale transformation, hybrid cloud strategy and platform services aligned to the Azure Well-Architected Framework, ensuring solutions are robust, cost-effective and operationally ready. Own and evolve Azure & AWS cloud architecture, platform patterns, guardrails, and design principles. Provide architectural oversight across Terraform modules, CI/CD pipelines (Azure DevOps, GitHub Actions), networking patterns, and compute/storage design. Evaluate platform changes including major provider upgrades (AzureRM / Cloudflare), DR and high availability improvements, cost optimisation strategies, and observability frameworks. Lead technical designs for large-scale refactoring and provider upgrades, environment creation pipelines, secure container registry access, identity integration and Zero Trust patterns, and event-driven architectures with caching strategies. Drive Cloudflare integration (CDN, WAF, edge, traffic management) aligned to Azure networking and security architecture.
Provide engineering governance through review of RFCs, design documents, and technical decisions, and collaborate with delivery teams, security, shared services, and product groups to maintain aligned architecture. Requires 8-12+ years in cloud engineering/architecture, with deep expertise in Azure architecture (networking, App Services, Functions, APIM, PaaS, ACR, VNets, Private Endpoints, identity/security), strong Terraform experience (modules, pipelines, state management), and strong CI/CD background. Experience designing for SRE (DR, failover, monitoring, logging, autoscaling, resilience) and working across multiple engineering teams. Previous Cloudflare experience preferred. Architecture-led platform role (not a pure Solution Architect); deep scripting expertise not a primary requirement.

26/06/2026

Contractor

Site Reliability Engineer

Spectrum IT Recruitment Basingstoke, Hampshire

Site Reliability Engineer - Fully Remote What We're Looking For We're looking for someone who enjoys solving complex operational challenges through engineering rather than manual intervention. You'll be proactive, collaborative, and passionate about improving reliability through automation and continuous improvement. If you're excited about building resilient cloud platforms and making a measurable impact on service reliability, we'd love to hear from you. Key Responsibilities Incident Management & Operations Participate in a 24/7 on-call rota as a primary or escalation point. Lead or support major incident response, including triage, mitigation, and resolution. Coordinate with Engineering, Infrastructure, Security, and Product teams during incidents. Develop, maintain, and continuously improve operational runbooks and playbooks. Conduct blameless post-incident reviews and drive follow-up improvements. Monitoring & Alerting Monitor the health of infrastructure, applications, and services. Design and optimise alerting strategies aligned with service reliability objectives (SLIs/SLOs). Reduce alert fatigue through continuous tuning and optimisation. Build and maintain dashboards using technologies such as: Grafana, Prometheus, Datadog, Splunk, AWS CloudWatch. Reliability Engineering & Automation Automate repetitive operational tasks to minimise manual effort. Improve Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR). Develop automation tools and scripts using Python, Bash, Go, or similar languages. Implement self-healing and auto-remediation where appropriate. Work closely with engineering teams to improve application and platform reliability. Platform & Infrastructure Support and troubleshoot Linux-based production environments. Manage cloud infrastructure, primarily within AWS. Support containerised environments using Docker and Kubernetes. Assist with capacity planning, availability reviews, and production readiness for new releases.
Skills & Experience Essential Strong Linux systems administration experience. Experience supporting production environments and managing incidents. Hands-on experience with AWS cloud infrastructure. Experience with Docker and Kubernetes. Scripting or programming experience with Python, Bash, Go, or similar. Solid understanding of networking fundamentals, including DNS, TCP/IP, and load balancing. Experience working in a 24/7 operations or NOC environment. Ability to remain calm and effective during high-pressure production incidents. Excellent communication and stakeholder coordination skills. Desirable Experience working with Service Level Objectives (SLOs) and Service Level Indicators (SLIs). Previous experience helping organisations transition from traditional NOC operations to an SRE model. Infrastructure as Code experience using Terraform, Ansible, or similar tools. Exposure to security, compliance, or regulated environments. Spectrum IT Recruitment (South) Limited is acting as an Employment Agency in relation to this vacancy.

26/06/2026

Full time

Platform Engineer / DevOps Engineer / Site Reliability Engineer (SRE) / Cloud Engineer

hackajob Gloucester, Gloucestershire

Platform Engineer Location: Gloucester Type: Permanent, full time Work arrangement: Hybrid (min 3 days a week on site) Security clearance: Must be willing to obtain SC and eDV clearance. Responsibilities Deploy applications and software to cloud or on-prem environments for various business areas. Build and set up development tools and infrastructure. Understand project stakeholder needs. Automate and improve development and release processes. Ensure systems are safe and secure against cyber security threats. Identify technical problems and develop software updates and solutions. Collaborate with other engineers to ensure development follows established processes and works as intended. Required experience Experience working in an Agile/SCRUM/DevOps delivery model. Proficiency with cloud technologies (AWS or Azure). Experience with infrastructure-as-code tools (e.g., Terraform, Puppet, Chef, Ansible). Experience building and deploying large-scale applications in Continuous Integration/Delivery pipelines. Experience with container platforms and orchestration systems (ECS, AKS, Kubernetes, Helm, Docker). Experience with automation and integration tools such as Jenkins, Concourse CI, or cloud equivalents. Experience with scripting languages and source control.

26/06/2026

Full time

Site Reliability Engineer (AWS)

hackajob

Site Reliability Engineer (AWS) Reporting to: Director of Engineering Location: London (Hybrid - we're flexible) Job Type: Permanent About Us Camascope is a fast-growing technology company focused on empowering the care and medication sector with technology. We are a team of talented, caring, and ambitious individuals who are committed to making a difference in care. Our ecosystem connects pharmacies, care homes, and doctors to improve the lives of many. There has never been a better time to join Camascope. Our team is growing and our product is reaching more users and partners every day. You will join a collaborative and passionate team. We love solving real problems and are committed to building the highest-quality solutions. If you are eager to make a meaningful impact in healthcare and thrive in a fast-paced startup environment, Camascope will be the perfect place for you. What You'll Do Own reliability - Maintain and improve our AWS infrastructure using Terraform, bringing your expertise and best practices Champion observability - Partner with developers to implement effective monitoring, logging, and tracing strategies Strengthen security - Work closely with the CISO to implement security best practices and ensure compliance Optimise costs - Monitor cloud spend and implement FinOps best practices Maintain CI/CD pipelines - Implement and maintain reliability and observability aspects of GitHub workflows and deployment pipelines Incident response - Lead incidents, run blameless post-mortems, and drive continuous improvement Enable developers - Mentor teams on SRE and observability practices, helping them quickly understand and resolve issues Leverage AI tooling - Use AI-assisted development tools (e.g.
GitHub Copilot) to accelerate infrastructure work, and explore AI-driven approaches to incident detection, root cause analysis, and remediation What We're Looking For Essential 3+ years in an SRE, Platform, or DevOps engineering role AWS services: CloudWatch, X-Ray, Lambda, API Gateway, S3, SQS, Aurora PostgreSQL, DynamoDB, CloudFront, VPC, IAM, Security Groups Python for scripting, tooling, and Lambda development Terraform for Infrastructure as Code GitHub (Actions, workflows, repository management) Strong understanding of observability - metrics, logs, and traces Good understanding of cloud security principles and best practices Experience with cloud cost management and optimisation Excellent communication skills for working with technical and non-technical colleagues Self-starter who can prioritise and organise their own workload Comfortable using AI-assisted development tools such as GitHub Copilot Bonus Points For Datadog for monitoring, APM, and log management Azure experience: Front Door, Storage Accounts, App Service, Azure SQL Database, Application Insights Previous experience in early-stage startups or scale-ups Having worked in Healthcare or Pharmacy tech previously Experience working in regulated environments or with compliance frameworks Experience with AI-driven DevOps tooling (e.g. AWS DevOps Agent or similar AI agents for incident resolution, root cause analysis, and operational improvement) Experience with SLIs, SLOs, and error budgets On Call We have a 24/7 customer support team who handle day-to-day issues. We don't have a formal on-call engineering rota, but our platform supports care homes around the clock - so we're looking for someone who is happy to occasionally jump on a call with the team if critical platform issues arise out of hours (and part of your job will be making sure this isn't necessary!). Why Join Us?
Join an established engineering team and have the opportunity to enhance and shape how we approach platform and reliability engineering Make a meaningful impact in healthcare technology Work with modern cloud-native infrastructure Influence engineering culture and platform practices Collaborate in an environment where your ideas matter Grow with us as we scale Benefits Competitive salary (dependent on experience) Pension scheme and healthcare benefits Ongoing training and professional development 25 days annual leave excluding bank holidays We welcome applications from candidates of all backgrounds. If you're excited about this role but don't meet 100% of the requirements, we encourage you to apply anyway.

26/06/2026

Full time


Senior SRE: Cloud Reliability & Incident Leader

Google Inc.

Google Inc. is looking for a Senior Software Engineer in Site Reliability Engineering to ensure reliable systems for Google Cloud services. Candidates should have extensive experience in software development, project leadership, and troubleshooting large-scale systems. This role involves engaging in all lifecycle stages of services, supporting prior to launch, and maintaining post-launch services through monitoring and tooling. The ideal candidate will possess a Bachelor's degree and experience working in technical environments.

26/06/2026

Full time


Senior Software Engineer, Site Reliability Engineering, Cloud IRT

Google Inc.

Senior Software Engineer, Site Reliability Engineering, Cloud IRT
Google, London, UK

- Bachelor's degree in Computer Science, a related field, or equivalent practical experience.
- 5 years of experience with software development in one or more programming languages.
- 3 years of experience in designing, analyzing, and troubleshooting large-scale distributed systems.
- 2 years of experience leading projects and providing technical leadership.
- Experience troubleshooting production incidents as part of an on-call rotation.

Preferred qualifications:
- Master's degree in Computer Science or Engineering.
- Experience in telemetry systems, incident and risk management.
- Ability to work across organizational boundaries.
- Excellent systematic problem-solving approach, coupled with effective communication skills and a sense of drive.

About the job
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services (both our internally critical and our externally visible systems) have reliability, uptime appropriate to customers' needs, and a fast rate of improvement. Additionally, SREs will keep an ever-watchful eye on our systems' capacity and performance. Much of our software development focuses on optimizing existing systems, building infrastructure, and eliminating work through automation.

On the SRE team, you'll have the opportunity to manage the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis, and large-scale system design. SRE's culture of intellectual curiosity, problem solving, and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage them to collaborate, think big, and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.

Responsibilities
- Engage in and improve the whole lifecycle of services, from inception and design through to deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Build systems and tooling to support the Cloud IRT team; improve visibility into the state of Cloud, detection of large-scale issues, and communications to customers, stakeholders, and customer-facing teams.
- Participate in an on-call rotation supporting critical incident response for Google Cloud Platform (GCP).

Google is proud to be an equal opportunity and affirmative action employer. We are committed to building a workforce that is representative of the users we serve, creating a culture of belonging, and providing an equal employment opportunity regardless of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition (including breastfeeding), expecting or parents-to-be, criminal histories consistent with legal requirements, or any other basis protected by law. See also Google's EEO Policy, Know your rights: workplace discrimination is illegal, Belonging at Google, and How we hire.

Google is a global company and, in order to facilitate efficient collaboration and communication globally, English proficiency is a requirement for all roles unless stated otherwise in the job posting.

Equity is granted exclusively and discretionarily by Alphabet Inc. on the basis of an agreement concluded between you and Alphabet Inc. Alphabet Inc. is your sole contractual partner with respect to equity grants. GSU grants are not guaranteed, are discretionary, are subject to approval by the Alphabet Inc. board of directors or its delegate, the terms of the relevant Alphabet Inc. stock plan, and your grant agreement. They have no impact on statutory payments. Current or past grants do not confer an acquired right.

26/06/2026

Full time


Site Reliability Engineer III

CME Group Inc. City, Belfast

Site Reliability Engineer III
Locations: Belfast - Millennium House
Time type: Full time
Posted on: Posted Today
Job requisition id: 33993

Site Reliability Engineer III/SRE III (Tue - Sat)
CME Group is seeking a Site Reliability Engineer III (Tue - Sat) to take a key role in building, operating, and scaling systems in our Markets portfolio. As an SRE III, you will apply your experience to the complex challenges of the CME Globex trading platform, where our systems deliver an exceptional combination of low-latency performance and rock-solid reliability. You will work with senior engineers on complex projects, take ownership of key reliability initiatives, and act as a mentor to junior colleagues, helping to shape the team's technical direction.

Key Responsibilities
- Own Observability: Design, build, and refine monitoring, alerting, and observability solutions. Drive the continuous improvement of our SLIs & SLOs to enable faster issue detection and resolution.
- Drive Reliability Projects: Take ownership of reliability-focused projects from design to implementation, collaborating with product teams to ensure new features are scalable, resilient, and safe.
- Lead Technical Solutions: Lead technical discussions for your work, presenting solution options and proposals with clear trade-offs.
- Automate Intelligently: Proactively identify and eliminate toil through robust automation, improving both system reliability and team velocity.
- Manage Incidents: Take a leading role in incident response, owning the resolution of significant incidents, ensuring rapid system recovery, and driving meaningful action from blameless post-mortems.
- Mentor & Coach: Act as a technical mentor and point of escalation for L1 and L2 SREs, fostering their growth through code reviews and paired work.
- Architect for the Future: Contribute your own ideas to the product backlog and play an active role in the architectural design for the migration to Google Cloud Platform (GCP).

What We're Looking For
- 3-5+ years of professional experience in a Site Reliability, DevOps, Software, or Systems Engineering role.
- Strong, hands-on experience administering and troubleshooting Linux-based production systems.
- Proficient programming skills in a language like Python or Go, with a track record of automating complex operational tasks.
- Proven ability to lead technical initiatives and solve complex problems with a high degree of autonomy.
- Excellent communication skills, with the ability to articulate complex technical concepts to diverse audiences.
- A proactive and ownership-oriented mindset.

Desirable Skills
- Cloud Platforms: Deep experience with Google Cloud Platform (GCP), especially GCE, GKE, and cloud networking.
- Monitoring Tools: Expertise in designing and managing monitoring stacks (e.g., Prometheus, Grafana, OpenTelemetry).
- Distributed Systems: Strong practical knowledge of building and maintaining large-scale distributed systems.
- Containerisation: Advanced experience with Kubernetes and Docker in a production environment.
- Networking: Solid understanding of networking protocols (HTTP, TCP/UDP, IP) and network architecture.
- Domain Knowledge: Experience in financial markets, low-latency systems, or with message-oriented middleware.

Why CME Group
- Be part of a global leader in financial services technology.
- Work on cutting-edge technology in a collaborative and innovative culture.
- Receive a competitive compensation and benefits package.
- Grow your career in SRE with an organisation committed to this modern approach.

Join CME Group and play a crucial role in ensuring the stability and performance of our global trading applications. Apply now to be a part of our dynamic SRE team!

Company Benefits:
- Bonus Programme
- Generous shift allowance
- Equity Programme
- Employee Stock Purchase Plan (ESPP)
- Private Medical and Dental coverage
- Mental Health Benefit Programme
- Group Pension Plan
- Income Protection
- Life Assurance
- Cycle To Work
- EV Car Benefit Scheme
- Gym Membership
- Family Leave
- Education Assistance - MBA/Advanced Degree/Bachelor Degree
- Ongoing Employee Development Training/Certification
- Hybrid Working

26/06/2026

Full time


Senior Cloud Engineer (AWS) (Multiple)

Cloudscaler

Senior Cloud Engineer (AWS)
Location: Central London - Hybrid - circa 3 days onsite in Central London
Cloudscaler are based in Central London, with customers across the UK. Travel to our customer sites is required; the frequency varies from customer to customer, up to 3 days per week onsite. The customer that these vacancies are signposted for requires an onsite presence of 1 day every 2 weeks in Bath.
Salary: £70,000 - £95,000
Eligibility: UK-based

Security Clearance
You must be eligible for SC clearance (you don't need active clearance, we'll sponsor it). Willingness to obtain DV clearance in the future is a bonus.

About the Role
We're looking for an experienced Senior AWS Cloud Engineer to lead the build and operation of secure, enterprise-scale cloud platforms for central government and private enterprise clients. You'll work closely with senior stakeholders on complex cloud transformation projects, applying AWS best practices, security-first thinking, and modern platform engineering principles.

What You'll Be Doing
- Designing and building AWS Landing Zones and multi-tenant cloud environments
- Creating infrastructure using Terraform (IaC)
- Implementing secure, scalable solutions in regulated environments
- Developing and maintaining CI/CD pipelines
- Applying SRE principles to ensure reliability, resilience, and operational excellence
- Collaborating with internal teams and external clients to drive cloud success

What We're Looking For
We're looking for someone who brings hands-on technical expertise and strategic thinking, with experience in:
- Enterprise-scale AWS platforms
- AWS Landing Zone implementation
- Infrastructure as Code (Terraform)
- Secure cloud architecture, environments, and operations
- Working with highly regulated central government or defence clients
- Communicating technical concepts to a range of stakeholders

Why Join Us
We're cloud specialists, AWS experts, and trusted advisors to government and enterprise clients. We combine deep technical knowledge with a collaborative, people-first culture. What you can expect:
- Discretionary bonus
- Discretionary security clearance bonus for those holding certain clearance
- Periodic offer of share options schemes
- 25 days' annual leave
- 5 additional days per year towards training, certifications, or charity work
- Option to buy additional annual leave up to 5 days per year
- Public holidays opt-out scheme: the option to work on public holidays, giving you the flexibility to enjoy your time off when it suits you
- Certifications and training expensed
- Life Assurance
- Long Term Disability cover
- Employee Assist Programme for employee advice and support (including legal and counselling helpline)
- Health, Mental Health, Wellbeing, Financial and Legal support
- 24/7 GP access
- Pension auto-enrolment and contribution
- Employee referral scheme
- Client referral scheme
- Cycle to work scheme
- Travel expenses policy

Interview Process
We keep things simple and transparent:
- Screening call - with our Talent Acquisition team
- First interview (30 mins) - with some of our engineering colleagues (remote)
- Technical interview (60 mins) - with some of our engineering colleagues (remote)
- Final interview (60 mins) - with our leadership team (in-person)

Apply Now
If you're ready to build impactful, secure, and scalable AWS solutions - we'd love to hear from you. Apply today or reach out with any questions.

Cloudscaler are proud to be an equal opportunity employer, committed to equal opportunities regardless of gender identity, sexual orientation, race, ancestry, age, marital status, disability, parental status, religion or medical history. If you require reasonable adjustments during the recruitment process or within the workplace, please let us know when you speak to our Talent Acquisition team or contact us at the earliest opportunity.

26/06/2026

Full time


Lead SRE - Azure & GCP

hackajob

Job Description
Lead Site Reliability Engineer (SRE) opportunity within our Google Cloud Site Reliability Engineering team at JPMorgan Chase, part of the Infrastructure Platform - Cloud Foundational Services SRE organization, operating within a global follow-the-sun support model.

Job Responsibilities
- Lead and implement SRE frameworks to support global Google Cloud environments and ensure the highest level of SLOs through operational excellence.
- Master application, data, infrastructure, and Agentic AI disciplines.
- Understand financial control and budget management, and partner with colleagues to lead collaborative teams to achieve common goals.
- Use enterprise-authorized AI capabilities within the work environment to accelerate major incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements.
- Provide support to develop and improve the quality of technical engineering documentation.
- Provide technical supervision, oversight, and problem resolution for engineering activities.
- Champion a DevOps model so that services are automated and elastic across all platforms.

Required Qualifications, Capabilities, and Skills
- Google and Azure cloud expertise in a mission-critical production environment.
- Strong understanding of container technologies such as Docker, Kubernetes, GKE, and Helm.
- Programming experience in Python, shell scripting, or Go, with good understanding of REST APIs.
- Hands-on experience with cloud-based technologies and tools, especially for deployment, monitoring, and operations, such as Google Observability, Azure Monitor, Datadog, Prometheus, Splunk, Elasticsearch, and Grafana.
- Experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows, with strong validation habits and awareness of data sensitivity.
- Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align with resiliency and security expectations.
- Strong understanding of Google Cloud governance, compliance, and cost management.
- Proficient with modern development technologies and tools such as Agile, CI/CD, Git, Infrastructure as Code, Terraform, and Jenkins.
- Google Cloud certification or equivalent technical experience in the public cloud.
- Good understanding of Agentic AI SDKs and GitHub Copilot skills.

Preferred Qualifications, Capabilities, and Skills
- Good understanding of operating systems such as Windows and Linux (RedHat/Ubuntu).
- Good understanding of LLM and other AI/ML frameworks that can be used in AIOps.

We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants' and employees' religious practices and beliefs, as well as mental health or physical disability needs.

26/06/2026

Full time

Senior SRE: Azure & GCP, AI-Driven Global Resilience

hackajob

hackajob is seeking a Lead Site Reliability Engineer (SRE) to join our Google Cloud team at JPMorgan Chase. The role involves implementing SRE frameworks to support global cloud environments and ensuring high service levels. The ideal candidate will have a strong background in Google Cloud, container technologies, and AI applications, driving operational excellence and team collaboration. You will work closely with diverse teams to achieve shared goals in a fast-paced environment.

26/06/2026

Full time

Senior Platform Engineer

Capgemini

About the job you're considering
Are you a Senior or Lead Platform Engineer who thrives on solving complex infrastructure problems and building production-grade platforms that operate reliably at scale while also influencing the design, strategy, and governance decisions that shape how teams deliver? Join our engineering team helping public sector clients build and continuously improve critical digital services using modern cloud-native and open-source tooling.
You'll be part of a strong, established community of digital specialists. Together, you will share your ideas, innovate and grow. Our team of engineers support each other to deliver and develop professionally and you'll get to work alongside amazing people in one of the best cultures you can find.
Hybrid working: Your working location will vary depending on the client engagement, delivery phase, base location, and security requirements. You should expect a mix of home working, Capgemini offices, and client sites. More onsite presence is typically needed for discovery, workshops, and secure environments. This is not a 100% remote appointment.

Your Role

What You'll Build & Deliver
You'll help design, build, run and improve the platforms behind critical government services, systems that need to stay secure, observable and resilient even under national-scale load.
You'll help build secure multi-cloud landing zones (majority AWS), GitOps-driven platforms and internal developer platforms that give engineers true self-service. You'll modernise legacy estates into cloud-native architectures, unify observability across complex environments using OpenTelemetry and modern APM platforms, and drive FinOps-focused optimisation.
You'll contribute to and help shape engineering standards, SRE-aligned operability, CI/CD performance, supply-chain security, platform-as-a-product thinking and event-driven automation. You'll collaborate closely with client teams, bringing clear thinking and bold ideas to co-create modern platforms that make a real national-scale difference.

Your scope & influence as a Senior/Lead Platform Engineer
Act as a senior technical voice across the engagement, influencing platform strategy, design choices and governance guardrails.
Lead technical workshops and design reviews with client stakeholders and engineering teams.
Set and evolve platform standards (security, reliability, operability, CI/CD, observability) and help teams adopt them in practice.
Define SLOs, improve reliability, and lead incident reviews.
Coach engineers through mentoring, pairing and pragmatic hands-on technical leadership.
Bring platform-as-a-product thinking: golden paths, reusable capabilities and sustainable operations.
Remain hands-on by actively contributing to designs, code, reviews and problem-solving while leading and mentoring others.

What this looks like in practice
Designing secure-by-default multi-account AWS landing zones (awareness of Azure/GCP patterns also relevant).
Building internal developer platforms (IDPs) and self-service automation that boost developer experience.
Deploying and operating hardened Kubernetes platforms (EKS/AKS).
Creating unified observability stacks with OpenTelemetry, Prometheus and Grafana.
Delivering cloud-native modernisation and transformation of legacy systems.
Engineering high-quality CI/CD pipelines with DevSecOps & supply chain integrity.
Implementing event-driven, automated infrastructure capabilities.
Driving FinOps and cloud optimisation across complex estates.
Applying platform-as-a-product thinking to create scalable, reusable capabilities.
Providing hands-on technical skills & leadership and collaborating with client teams.

We like engineers who are curious, opinionated about good engineering and comfortable taking ownership of systems that need to work every time, at national scale.
If you want to solve hard problems, push modern engineering forward and deliver platforms that genuinely make a difference - you'll fit right in.

We're a Disability Confident Employer
Capgemini is proud to be a Disability Confident Employer (Level 2) under the UK Government's Disability Confident scheme.
As part of our commitment to inclusive recruitment, we will offer an interview to all candidates who declare they have a disability and meet the minimum essential criteria for the role.
Please opt in during the application process.

Your Security Clearance
Baseline Personnel Security Standard (BPSS)
To be successfully appointed to this role you will need to undergo Baseline Personnel Security Standard checks.
There are certain criteria and checks required for BPSS, and throughout the recruitment process, you will be asked questions about your security clearance eligibility such as, but not limited to, country of residence and nationality.
In addition to BPSS, you will also need SC (Security Check) Clearance or to be eligible for this level of clearance (by being a UK resident for at least 5 years and not having left the country for more than 28 consecutive days during this period).

Make it real - what does it mean for you?
You'd be joining an accredited Great Place to Work for Wellbeing in 2023. Employee wellbeing is vitally important to us as an organisation. We see a healthy and happy workforce as a critical component for us to achieve our organisational ambitions. To help support wellbeing we have trained 'Mental Health Champions' across each of our business areas, and we have invested in wellbeing apps such as Thrive and Peppy.
You will reimagine what's possible: creating value for the world's leading organisations through technology to build a sustainable, more inclusive future. You will work with a range of clients all with a unique set of business, technological and societal ambitions, which will make a real impact across the UK.
You will be empowered to explore, innovate, and progress. You will benefit from Capgemini's 'learning for life' mindset, meaning you will have countless training and development opportunities from think tanks to hackathons, and access to 250,000 courses with numerous external certifications from AWS, Microsoft, Harvard ManageMentor, Cybersecurity qualifications and much more.
You'll be bringing your unique skills and perspectives to the team, inspiring and taking inspiration from your teammates as you unlock value in everything you do. You'll be joining a professional community of experts, who have got your back and will support you, every step of the way.

Why you should consider Capgemini
Growing clients' businesses while building a more sustainable, more inclusive future is a tough ask. But when you join Capgemini, you join a thriving company and become part of a diverse collective of free-thinkers, entrepreneurs and industry experts.
A powerful source of energy that drives us all to find new ways technology can help us reimagine what's possible. It's why, together, we seek out opportunities that will transform the world's leading businesses. And it's how you'll gain the experiences and connections you need to shape your future. By learning from each other every day, sharing knowledge and always pushing yourself to do better, you'll build the skills you want. And you'll use them to help our clients leverage technology to grow their business and give innovation that human touch the world needs. So, it might not always be easy, but making the world a better place rarely is.

About Capgemini
Capgemini is a global business and technology transformation partner, helping organisations to accelerate their dual transition to a digital and sustainable world, while creating tangible impact for enterprises and society. It is a responsible and diverse group of 340,000 team members in more than 50 countries. With its strong heritage of over 55 years, Capgemini is trusted by its clients to unlock the value of technology to address the entire breadth of their business needs. It delivers end-to-end services and solutions leveraging strengths from strategy and design to engineering, all fuelled by its market-leading capabilities in AI, cloud and data, combined with its deep industry expertise and partner ecosystem. The Group reported 2023 global revenues of €22.5 billion.

26/06/2026

Full time

Site Reliability Engineer - AWS & Observability

hackajob

Site Reliability Engineer (AWS)
Reporting to: Director of Engineering
Location: London (Hybrid - we're flexible)
Job Type: Permanent

About Us
Camascope is a fast-growing technology company focused on empowering the care and medication sector with technology. We are a team of talented, caring, and ambitious individuals who are committed to making a difference in care. Our ecosystem connects pharmacies, care homes, and doctors to improve the lives of many. There has never been a better time to join Camascope. Our team is growing and our product is reaching more users and partners every day. You will join a collaborative and passionate team. We love solving real problems and are committed to building the highest-quality solutions. If you are eager to make a meaningful impact in healthcare and thrive in a fast-paced startup environment, Camascope will be the perfect place for you.

What You'll Do
Own reliability - Maintain and improve our AWS infrastructure using Terraform, bringing your expertise and best practices
Champion observability - Partner with developers to implement effective monitoring, logging, and tracing strategies
Strengthen security - Work closely with the CISO to implement security best practices and ensure compliance
Optimise costs - Monitor cloud spend and implement FinOps best practices
Maintain CI/CD pipelines - Implement and maintain reliability and observability aspects of GitHub workflows and deployment pipelines
Incident response - Lead incidents, run blameless post-mortems, and drive continuous improvement
Enable developers - Mentor teams on SRE and observability practices, helping them quickly understand and resolve issues
Leverage AI tooling - Use AI assisted development tools (e.g. GitHub Copilot) to accelerate infrastructure work, and explore AI driven approaches to incident detection, root cause analysis, and remediation

What We're Looking For
Essential
3+ years in an SRE, Platform, or DevOps engineering role
AWS services: CloudWatch, X-Ray, Lambda, API Gateway, S3, SQS, Aurora PostgreSQL, DynamoDB, CloudFront, VPC, IAM, Security Groups
Python for scripting, tooling, and Lambda development
Terraform for Infrastructure as Code
GitHub (Actions, workflows, repository management)
Strong understanding of observability - metrics, logs, and traces
Good understanding of cloud security principles and best practices
Experience with cloud cost management and optimisation
Excellent communication skills for working with technical and non-technical colleagues
Self starter who can prioritise and organise their own workload
Comfortable using AI assisted development tools such as GitHub Copilot

Bonus Points For
Datadog for monitoring, APM, and log management
Azure experience: Front Door, Storage Accounts, App Service, Azure SQL Database, Application Insights
Previous experience in early-stage startups or scale ups
Having worked in Healthcare or Pharmacy tech previously
Experience working in regulated environments or with compliance frameworks
Experience with AI driven DevOps tooling (e.g. AWS DevOps Agent or similar AI agents for incident resolution, root cause analysis, and operational improvement)
Experience with SLIs, SLOs, and error budgets

On Call
We have a 24/7 customer support team who handle day to day issues. We don't have a formal on call engineering rota, but our platform supports care homes around the clock - so we're looking for someone who is happy to occasionally jump on a call with the team if critical platform issues arise out of hours (and part of your job will be making sure this isn't necessary!).

Why Join Us?
Join an established engineering team and have the opportunity to enhance and shape how we approach platform and reliability engineering
Make a meaningful impact in healthcare technology
Work with modern cloud-native infrastructure
Influence engineering culture and platform practices
Collaborate in an environment where your ideas matter
Grow with us as we scale

Benefits
Competitive salary (dependent on experience)
Pension scheme and healthcare benefits
Ongoing training and professional development
25 days annual leave excluding bank holidays

We welcome applications from candidates of all backgrounds. If you're excited about this role but don't meet 100% of the requirements, we encourage you to apply anyway.

26/06/2026

Full time

Senior SRE: Cloud, Kubernetes & Automation

FNZ (UK) Ltd Edinburgh, Midlothian

A global wealth management platform provider is seeking a Lead Site Reliability Engineer to ensure the reliability and performance of its platforms. The role involves deploying and integrating mission-critical systems, collaborating with various engineering teams, and using modern automation practices. Candidates should have deep expertise in Kubernetes, Terraform, and public cloud environments like AWS or Azure, along with strong problem-solving skills. The role offers a full-time contract based in Edinburgh, with a strong emphasis on teamwork and automation.

26/06/2026

Full time

Lead Site Reliability Engineer

FNZ (UK) Ltd Edinburgh, Midlothian

Locations: Edinburgh - United Kingdom; London - United Kingdom. Time type: Full time. Posted on: Posted Today. Job requisition ID: REQ-16143. Role Purpose The Site Reliability Engineer will work closely with Application, Infrastructure, and Network Engineering teams to ensure the reliability, scalability, and performance of FNZ platforms. This role focuses on deploying, integrating, and providing ongoing operational support for mission-critical systems, leveraging modern automation and cloud-native practices. Key Responsibilities Maintain high availability and performance of FNZ platforms. Implement monitoring, alerting, and observability solutions to proactively detect and resolve issues. Collaborate with engineering teams to design and implement robust deployment pipelines. Ensure smooth integration of applications with infrastructure and network components. Use Terraform for provisioning and managing infrastructure across environments. Operate and optimize workloads on-prem and in public cloud. Manage and troubleshoot application delivery networks, load balancing, and traffic routing. Configure and support F5 Distributed Cloud or similar CDN/ADC technologies. Participate in on-call rotations, perform root cause analysis, and implement preventive measures. Work cross-functionally with Application, Infrastructure, and Network Engineering teams to deliver reliable services. Required Skills & Experience Kubernetes (K8s): Deep understanding of container orchestration and cluster management. Terraform: Strong experience in Infrastructure as Code for cloud and on-prem environments. Public Cloud: Hands-on experience with AWS, Azure, or GCP. F5 Distributed Cloud or Similar: Knowledge of CDN/ADC platforms and their integration. Networking Fundamentals: Expertise in application delivery networks, load balancing, traffic routing, and troubleshooting.
Observability Tools: Familiarity with Splunk, NewRelic, or similar. Scripting & Automation: Proficiency in Terraform, Bash, or similar languages. Desirable Skills Experience with CI/CD pipelines and GitOps workflows. Knowledge of SRE principles. Familiarity with security best practices. Key Attributes Strong problem-solving and troubleshooting skills. Ability to work collaboratively across multiple teams. Passion for automation and reducing operational toil. Reporting Line Reports to: Head of Platform Operations/Application Engineering. Works closely with the Application Engineering, Infrastructure Engineering, and Network Engineering teams. About FNZ FNZ is committed to opening up wealth so that everyone, everywhere can invest in their future on their terms. We know the foundation to do that already exists in the wealth management industry, but complexity holds firms back. We created wealth's growth platform to help. We provide a global, end-to-end wealth management platform that integrates modern technology with business and investment operations. All in a regulated financial institution. We partner with the world's leading financial institutions, with over US$2.2 trillion in assets on platform (AoP). Together with our clients, we empower nearly 30 million people across all wealth segments to invest in their future.

26/06/2026

Full time

256 jobs found
