Sorry, that job is no longer available. Here are some results that may be similar to the job you were looking for.

210 jobs found

Current Search
site reliability engineer sre
Senior SRE Engineer
ARM (Advanced Resource Managers)
Senior Site Reliability Coach 6 months Remote £Negotiable - INSIDE IR35 Tech Stack Multiple Platforms and Applications AWS and Azure - Cloud Mainframe skills would be handy Latest applications on Cloud Dev Ops skills would be helpful Attitude of being part of the team and owning the outcomes Advocate - to change the culture to SRE Disclaimer: This vacancy is being advertised by either Advanced Resource Managers Limited, Advanced Resource Managers IT Limited or Advanced Resource Managers Engineering Limited ("ARM"). ARM is a specialist talent acquisition and management consultancy. We provide technical contingency recruitment and a portfolio of more complex resource solutions. Our specialist recruitment divisions cover the entire technical arena, including some of the most economically and strategically important industries in the UK and the world today. We will never send your CV without your permission. Where the role is marked as Outside IR35 in the advertisement this is subject to receipt of a final Status Determination Statement from the end Client and may be subject to change.
27/05/2026
Contractor
DV Cleared Cloud DevOps Architect - Hybrid
Experis - ManpowerGroup
Senior Cloud / DevOps Architect (DV Cleared) Clearance: Active DV clearance is essential for this role Sector: Defence & Security Location: Multiple locations + hybrid About the Role Play a key role in shaping the engineering strategy across Cloud Native and application development. You'll act as both a technical expert and an inspiring leader, driving DevOps excellence while building inclusive, high-performing teams. What You'll Be Doing Design, build, and operate cloud-based platforms and systems Drive best practices across DevOps and Site Reliability Engineering (SRE) Implement infrastructure as code, CI/CD pipelines, and automated deployments Collaborate closely with developers, testers, and business analysts Ensure systems are scalable, maintainable, and resilient Champion observability through monitoring, logging, and alerting Develop reusable frameworks and tooling for wider engineering teams Mentor junior engineers and contribute to a culture of continuous improvement Essential Skills & Experience DV (Developed Vetting) clearance Strong background in software engineering or IT operations Cloud expertise (AWS, Azure, GCP, or Oracle Cloud) Infrastructure as Code (e.g. Terraform, CloudFormation, AWS SAM) Strong understanding of DevOps practices: CI/CD pipelines Containerisation Deployment automation Experience in one or more of the following: Systems Administration Full Stack Development Virtualisation / Container platforms Configuration & environment management Scripting experience (Python, Bash, or Go) Experience implementing observability (monitoring, alerting, logging) Ability to establish and promote DevOps / SRE best practices
27/05/2026
Full time
Senior SRE: Data Platforms on AWS & Databricks
hackajob
Job Description We have an opportunity to impact your career and provide an adventure where you can push the limits of what's possible. The Chief Data & Analytics Office (CDAO) at JPMorgan Chase is responsible for accelerating the firm's data and analytics journey. This includes ensuring the quality, integrity, and security of the company's data, as well as leveraging this data to generate insights and drive decision making. The CDAO is also responsible for developing and implementing solutions that support the firm's commercial goals by harnessing artificial intelligence and machine learning technologies to develop new products, improve productivity, and enhance risk management effectively and responsibly. As a Site Reliability Engineer at JPMorgan Chase within the AIML Data Platforms and Chief Data and Analytics Team, you will develop and deliver advanced technology products focused on data and analytics. Tackle complex cloud data platform challenges, especially around DataLake Tools. In this role you will work in an agile environment, collaborating with cross functional teams. Job Responsibilities Maintains a managed AWS Databricks platform, and provides engineering and operational support for the platform to application teams. Performs platform design, set up and configuration, workspace administration, resource monitoring, providing engineering support to data engineering teams, Data Science/ML, and Application/integration teams. Leads evaluation sessions with external vendors, startups, and internal teams to drive outcomes oriented probing of architectural designs, technical credentials, and applicability for use within existing systems and information architecture. Drives continuous improvement in system observability, alerting, and capacity planning. Collaborates with engineering and data teams to optimize infrastructure and deployment processes, focusing on automation and operational excellence. 
Executes creative software solutions, design, development, and technical troubleshooting with ability to think beyond routine or conventional approaches to build solutions or break down technical problems. Develops secure high quality production code, and reviews and debugs code written by others. Identifies opportunities to eliminate or automate remediation of recurring issues to improve overall operational stability of software applications and systems. Adds to team culture of diversity, opportunity, and respect. Implements Site Reliability Engineering (SRE) best practices to ensure reliability, scalability, and performance of data platforms. Develops and maintains incident response procedures, including root cause analysis and post mortem documentation. Required qualifications, capabilities, and skills Formal training or certification on software engineering concepts and 10+ years applied experience. Extensive experience with AWS Databricks platform administration and engineering support is a must. Strong understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management. Experience with monitoring tools, automation frameworks, and CI/CD pipelines. Proficient in Python application program development with use of automated unit testing. Experience with Terraform development and understanding of Terraform Enterprise. Experience in delivering system design, application development, testing, and operational stability. Knowledge of Big Data distributed compute frameworks like Spark, Glue, MapReduce etc. Excellent troubleshooting, analytical, and communication skills. Experience in Data pipelines using Spark. Exposure to AWS & Databricks Platform administration. Knowledge of containerization (Docker, Kubernetes) and orchestration. Familiarity with distributed systems and large scale data processing. Preferred qualifications, capabilities, and skills Experience in Data pipelines using Spark. 
Exposure to AWS & Databricks Platform administration. Knowledge of containerization (Docker, Kubernetes) and orchestration. Familiarity with distributed systems and large scale data processing. We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants and employees' religious practices and beliefs, as well as mental health or physical disability needs.
27/05/2026
Full time
Cloud Network SRE: Python/Go & AWS
Golang Works
Network Site Reliability Engineer Location: London, United Kingdom Posted about 1 year ago Tech Stack Hardware Python Go Amazon AWS Operating systems Reliability Tools and Techniques The role involves collaborative work across various teams, exploring domains like hardware, operating systems, Python/ Go development, AWS, and storage. Responsibilities Develop network and datacenter infrastructure with consistent and straightforward Compensation Competitive Role type Full time Visa sponsorship Not provided
27/05/2026
Full time
Graduate DevOps Engineer - AWS, Automation & Linux
Amazon
Job ID: AWS EMEA SARL (UK Branch) AWS Utility Computing (UC) provides product innovations-from foundational services such as Amazon's Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2) to new product innovations that set AWS services apart in the industry. As a member of the UC organization, you'll support the development and management of Compute, Database, Storage, Internet of Things (IoT), Platform, and Productivity Apps services in AWS, including support for customers who require specialized security solutions for their cloud services. The region service team is a customer experience oriented team looking for a self motivated, talented engineer who can solve complex problems and improve service support. We need an engineer who brings a mix of operations and networking expertise and shares our passion to change the way our customers operate. A systems engineer will drive opportunities to automate and simplify daily operations and scale the organisational operations. Key Job Responsibilities Work proactively to solve potential problems and inefficiencies. Communicate clearly and collaborate with others to deliver results with minimal supervision. Participate in 24/7 on call rotation to troubleshoot high severity issues. Analyze dashboards and investigate metrics with a vision for improvements. Troubleshoot and diagnose problems and work on solutions. Create and maintain Standard Operating Procedures (SOPs) and runbooks for documentation. Discuss radical new approaches to automate operational issues, assess risks and develop creative solutions. You will need to be a UK national and be able to obtain and maintain a UK Government Security Clearance. Further details can be found here: A Day in the Life On a typical day, engineers dive deep into understanding the root cause of a customer issue, investigate why a metric is trending the wrong way, and consult with senior engineers. 
Engineers own their services and implement Operational Excellence best practices to make out of hours support painless, automating manual processes. Systems engineering roles focus on troubleshooting, innovating fixes and workarounds, maintaining software updates, and providing data and metrics that support capacity and efficiency. Engineers use Linux skills, networking knowledge, and clear communication to deliver results and thrive in an environment of ambiguity and change. Basic Qualifications Bachelor's degree in Computer Science or another technical degree or related experience. Knowledge of networking fundamentals. Experience working in a 24/7 production environment. Experience in Linux systems administration and/or development. Experience working in at least two of these languages: Python, Java, Perl, PHP, Ruby or Bash/Shell. Preferred Qualifications Knowledge of configuration management systems, such as Puppet, Chef, Ansible, or related systems. Experience in site reliability engineering (SRE), systems engineering, systems administration, DevOps, security administration, or network administration. Experience in network capture and systems troubleshooting. Experience building scripts, tooling, and automation for large scale computing environments. Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, please visit for more information. Please consult our Privacy Notice ( ) to know more about how we collect, use and transfer the personal data of our candidates. Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
27/05/2026
Full time
Software Engineer III, Site Reliability Engineering
WeAreTechWomen
Minimum qualifications: Bachelor's degree in Computer Science, a related field, or equivalent practical experience. 2 years of experience with software development in one or more programming languages. Preferred qualifications: Master's degree in Computer Science or Engineering. 2 years of experience designing, analyzing, and troubleshooting large-scale distributed systems. About the job Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services-both our internally critical and our externally-visible systems-have reliability, uptime appropriate to customer's needs and a fast rate of improvement. Additionally SRE's will keep an ever-watchful eye on our systems capacity and performance. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you'll have the opportunity to manage the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow. The ATS Matrix team is a unique SRE Product Group within the AI, Trust, and Security (ATS) SRE product area. Our mission is to proactively deliver risk assurance in Google's infrastructure and products through principled and innovative reliability engineering. 
We manage systemic security risks, from the network to the application layer, and research and mitigate emerging threats in areas like AI. Responsibilities Write product or system development code. Review code developed by other engineers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency). Contribute to existing documentation or educational content and adapt content based on product/program updates and user feedback. Triage product or system issues and debug/track/resolve by analyzing the sources of issues and the impact on hardware, network, or service operations and quality. Participate in, or lead design reviews with peers and stakeholders to decide amongst available technologies. Google is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or Veteran status. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. See also Google's EEO Policy and EEO is the Law. If you have a disability or special need that requires accommodation, please let us know by completing our Accommodations for Applicants form.
27/05/2026
Full time
Site Reliability Engineer. Job in London LilyLifestyle Jobs
United Cerebral Palsy of Georgia
Senior Site Reliability Engineer (AWS / CDK / TypeScript) Remote First Occasional travel to Leeds £40,000 - £50,000 + benefits No Sponsorship Available VIQU have partnered with a major UK technology led organisation undergoing a significant transformation following a large scale business merger. As part of a wider move away from contractor heavy delivery, they are investing heavily in permanent engineering talent and building out a high performing cloud and platform function. They are looking for a Site Reliability Engineer to help improve the reliability, scalability and automation of their AWS estate. This is a hands on engineering role working across cloud infrastructure, observability, CI/CD and platform tooling, helping development teams deliver faster and more reliably. You'll be joining a collaborative engineering environment with the opportunity to influence platform standards, improve operational resilience and support modern DevOps and SRE practices across the business. Key responsibilities: Build, maintain and improve scalable AWS infrastructure. Develop and manage Infrastructure as Code using AWS CDK. Support CI/CD pipelines and deployment automation. Improve monitoring, logging and observability across distributed systems. Support incident management, root cause analysis and platform reliability improvements. Work closely with engineering and architecture teams to improve operational performance and security. Contribute to cloud best practice, automation and platform engineering standards. Key requirements: Strong experience in a Site Reliability Engineering, DevOps or Platform Engineering role. Strong AWS experience within production environments. Experience with AWS CDK (TypeScript preferred). Strong TypeScript experience. Experience with CI/CD tooling such as Jenkins or GitLab CI. Containerisation experience with Docker, Kubernetes, EKS or ECS. Experience with observability tooling such as Prometheus, Grafana, AppDynamics or OpenSearch. 
Experience with scripting or development using Python, TypeScript or Java. Understanding of cloud security and reliability best practices. AWS certifications are desirable but not essential.
27/05/2026
Full time
Barclays
Site Reliability Engineer
Barclays Knutsford, Cheshire
Join us as a Site Reliability Engineer where you'll spearhead the evolution of our digital landscape, driving innovation and excellence. As a Microsoft SQL Database Site Reliability Engineer (SRE) at Barclays, you will assume a key technical role. You will assist in shaping the direction of our database administration, ensuring our technological approaches are innovative and aligned with the Bank's business goals. You will contribute to the completion of high-impact projects, collaborate with management, and implement SRE practices using software engineering and database administration to address infrastructure and operational challenges at scale. As part of the Database SRE team, you will be data-driven and work to eliminate TOIL through simplification, automation, and observability, thereby enhancing the reliability of our platforms. With a focus on database scalability, availability, security, and performance, you will work closely with the Engineering team, product managers, and other teams. You will ensure the seamless flow and robust security of information on our platforms, meeting high traffic volumes and demanding operational needs. To be successful as a Site Reliability Engineer: Technical specialization with MS SQL expertise on version - SQL for complex database related issues from availability, to tuning to architecture on enterprise scale. Contribute to shaping and designing the SRE practice for the MSSQL offering, delivering through the SRE team. Serve as the technical escalation for complex database related issues, providing expert solutions. Assist with the establishment and evolution of the SRE function and the implementation of advanced SLIs and SLOs. Some other highly valued skills may include: Experience in database automation with estate standardization. Expert knowledge in system configuration management tools such as Chef or Ansible for database server configurations. Expertise with scripting languages (e.g. PowerShell) for automation/migration tasks. 
You may be assessed on the key critical skills relevant for success in role, such as risk and controls, change and transformation, business acumen strategic thinking and digital and technology, as well as job-specific technical skills. This role will be based in our Knutsford campus. Purpose of the role To apply software engineering techniques, automation, and best practices in incident response, to ensure the reliability, availability, and scalability of the systems, platforms, and technology through them. Accountabilities Availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning. Resolution, analysis and response to system outages and disruptions, and implement measures to prevent similar incidents from recurring. Development of tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience. Monitoring and optimisation of system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning. Collaboration with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations. Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth. Assistant Vice President Expectations To advise and influence decision making, contribute to policy development and take responsibility for operational effectiveness. Collaborate closely with other functions/ business divisions. Lead a team performing complex tasks, using well developed professional knowledge and skills to deliver on work that impacts the whole business function. 
Set objectives and coach employees in pursuit of those objectives, appraisal of performance relative to objectives and determination of reward outcomes. If the position has leadership responsibilities, People Leaders are expected to demonstrate a clear set of leadership behaviours to create an environment for colleagues to thrive and deliver to a consistently excellent standard. The four LEAD behaviours are: L - Listen and be authentic, E - Energise and inspire, A - Align across the enterprise, D - Develop others. OR for an individual contributor, they will lead collaborative assignments and guide team members through structured assignments, identify the need for the inclusion of other areas of specialization to complete assignments. They will identify new directions for assignments and/or projects, identifying a combination of cross functional methodologies or practices to meet required outcomes. Consult on complex issues, providing advice to People Leaders to support the resolution of escalated issues. Identify ways to mitigate risk and develop new policies/procedures in support of the control and governance agenda. Take ownership for managing risk and strengthening controls in relation to the work done. Perform work that is closely related to that of other areas, which requires understanding of how areas coordinate and contribute to the achievement of the objectives of the organisation sub-function. Collaborate with other areas of work, for business aligned support areas to keep up to speed with business activity and the business strategy. Engage in complex analysis of data from multiple sources of information, internal and external sources such as procedures and practices (in other areas, teams, companies, etc.) to solve problems creatively and effectively. Communicate complex information. 'Complex' information could include sensitive information or information that is difficult to communicate because of its content or its audience. 
Influence or convince stakeholders to achieve outcomes. All colleagues will be expected to demonstrate the Barclays Values of Respect, Integrity, Service, Excellence and Stewardship - our moral compass, helping us do what we believe is right. They will also be expected to demonstrate the Barclays Mindset - to Empower, Challenge and Drive - the operating manual for how we behave.
27/05/2026
Full time
Senior Site Reliability Engineer - Scale & Resilience
WeAreTechWomen
Minimum qualifications: Bachelor's degree in Computer Science, a related field, or equivalent practical experience. 2 years of experience with software development in one or more programming languages. Preferred qualifications: Master's degree in Computer Science or Engineering. 2 years of experience designing, analyzing, and troubleshooting large-scale distributed systems. About the job Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services - both our internally critical and our externally-visible systems - have reliability, uptime appropriate to customers' needs and a fast rate of improvement. Additionally, SREs will keep an ever-watchful eye on our systems' capacity and performance. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you'll have the opportunity to manage the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow. The ATS Matrix team is a unique SRE Product Group within the AI, Trust, and Security (ATS) SRE product area. Our mission is to proactively deliver risk assurance in Google's infrastructure and products through principled and innovative reliability engineering. 
We manage systemic security risks, from the network to the application layer, and research and mitigate emerging threats in areas like AI. Responsibilities Write product or system development code. Review code developed by other engineers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency). Contribute to existing documentation or educational content and adapt content based on product/program updates and user feedback. Triage product or system issues and debug/track/resolve by analyzing the sources of issues and the impact on hardware, network, or service operations and quality. Participate in, or lead design reviews with peers and stakeholders to decide amongst available technologies. Google is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or Veteran status. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. See also Google's EEO Policy and EEO is the Law. If you have a disability or special need that requires accommodation, please let us know by completing our Accommodations for Applicants form.
27/05/2026
Full time
Senior Site Reliability Engineer
Tes
Norfolk St, Sheffield City Centre, Sheffield S1 2JE, UK Grays Inn Rd, London WC1X 8NH, UK Yeovil, UK Job Title: Senior Site Reliability Engineer Department: Technology Location: Sheffield, London, Talbot Green or Yeovil Working Pattern: Hybrid, includes 3 days each week in the office Contract Type: Full time, permanent Salary: Up to £90,000 per annum Role Overview As a Senior SRE Engineer, you will be pivotal in designing and implementing SRE best practices while fostering a culture of continuous improvement and optimization. You will collaborate closely with development and operations teams to improve platform stability and performance, ensuring that our systems are reliable, secure, and scalable. Key Responsibilities Infrastructure Management Manage and scale cloud-based infrastructure (e.g., AWS, Azure, GCP). Apply Infrastructure as Code (IaC) principles for provisioning and configuration management. Security and Compliance Collaborate with the security team to implement best practices for system and data security. Ensure systems comply with relevant industry standards and regulations. Monitoring and Performance Set up and maintain monitoring and alerting systems for early issue detection and resolution. Continuously optimize system performance and resource usage. Documentation Create and maintain thorough documentation for SRE/platform processes, tools, and practices. Exposure to Jira or an equivalent tool would be beneficial. Experience Proven experience in an SRE/DevOps/Platform role, with a strong background in either software development or operations. Knowledge of CI/CD tools (e.g., Jenkins, GitLab CI/CD, Travis CI). Proficiency in scripting and automation (e.g., Bash, Python, Ansible). Strong experience with containerization and orchestration technologies (e.g., Docker, Kubernetes). Strong hands on experience of at least one major public cloud platform (e.g., AWS, Azure, GCP). 
Strong problem solving and troubleshooting abilities in time-bound situations (major incidents). Clear communication and incident management experience. Demonstrable strong hands on experience with Terraform. Knowledge of microservices architecture. Familiarity with security best practices and tools. Demonstrable experience of monitoring / observability tools (preferred: Grafana, Prometheus, PagerDuty, uptime). Knowledge Cloud Platforms: Strong knowledge of AWS, Azure, or GCP, including cloud architecture, services, and security models. Containerization & Orchestration: In-depth understanding of Docker and Kubernetes for deploying and managing containerized applications. Infrastructure as Code (IaC): Knowledge of IaC frameworks, particularly Terraform, to manage cloud infrastructure via code. Microservices Architecture: Familiarity with microservices design patterns and deployment strategies in a cloud-native environment. Monitoring & Observability: Understanding of monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK) to ensure system performance and issue tracking. Skills CI/CD Tools: Hands on experience with Jenkins, GitLab CI/CD, Travis CI, or similar tools for building CI/CD pipelines. Scripting & Automation: Proficiency in scripting languages like Bash and Python, along with automation tools such as Ansible for managing configurations and deployments. Containerization & Orchestration: Practical skills in deploying and managing containers using Docker and orchestrating workloads using Kubernetes. Cloud Platform Management: Expertise in managing and scaling cloud environments on AWS, Azure, or GCP, leveraging services for compute, storage, networking, and security. Infrastructure as Code (IaC): Skilled in using Terraform to automate provisioning and management of cloud infrastructure. Troubleshooting & Problem Solving: Strong analytical skills for identifying and resolving complex system issues, especially in production environments. 
Collaboration & Communication: Excellent ability to work under pressure e.g. in a Major incident. Qualifications Certifications (Preferred): Holding certifications such as AWS Certified DevOps Engineer, CKA (Certified Kubernetes Administrator), or other relevant credentials. Benefits 25 days annual leave rising to 30 State of the art offices Access to a range of benefits via My Benefits World Free eye care cover Life Assurance Cycle to Work Scheme EAP (Employee assistance programme) Quarterly Tes Socials Access to an extensive Learning and Development menu
27/05/2026
Full time
Senior Cloud Support Engineer
BetterCloud
About the Team Customer & Product Support (C&PS) sits at the intersection of sales, customer success and R&D teams. Through close teamwork and collaboration, we drive positive customer outcomes and ensure seamless access and optimal utilization of AlphaSense's market leading platform and products. We are committed to enhancing every user's experience through consistent delivery of prompt and knowledgeable responses. The Customer & Product Support team is based globally across the US, UK, India and Singapore. About the Role We are looking for a proactive and experienced Senior Cloud Support Engineer to join our dynamic team. In this role, you will serve as a key technical resource and advocate for our customers, tackling complex technical challenges and driving customer satisfaction. You will leverage your expertise in cloud technologies, distributed systems and AI driven search platforms to ensure smooth deployment, troubleshooting, and optimization of customer solutions. This role is ideal for someone who thrives in a fast paced environment and enjoys solving high impact problems. For longer term growth opportunities within AlphaSense, we see this role as a stepping stone for those interested in growing into our Site Reliability Engineering (SRE) team. To make this possible, we have a program to facilitate ongoing training and tight collaboration with Engineering. Other options for progression pathways will be dependent on skill sets and individual performance goals. What You'll Do Deliver exceptional technical support: Act as the second line of defense for technical queries, ensuring timely and effective resolution of customer issues through close collaboration between L1 support and R&D teams. 
Troubleshoot and resolve technical challenges: Investigate, diagnose, and resolve issues related to both SaaS and private cloud environments, including AWS infrastructure and Kubernetes workloads, agentic systems and MCP connectors, providing necessary detail for Product and Engineering to carry forward complex cases. Collaborate cross functionally: Partner with the wider Customer & Product Support organization, Product and Engineering and Customer Education teams to relay customer feedback, identify opportunities for improvement, and contribute to strategic product enhancements. Contribute to knowledge sharing: Create and update support documentation, KEDBs, FAQs, and tutorials to empower customers and internal teams. Be an advocate for customers: Represent the voice of the customer in internal discussions and initiatives to maximize the value of our platform and products. Be an expert on our product and continuously build your knowledge: Proactively learn, stay up-to-date on new features and fill any gaps in your knowledge. Who You Are A technically skilled professional with 3-5 years in technical support, IT operations, or a related field. A strong communicator with the ability to distill complex technical concepts for diverse audiences. A problem solver with a proactive and customer first mindset. Adaptable and eager to continuously learn and improve. Enjoy being part of an entrepreneurial team and work diligently to help others when needed. Minimum Qualifications Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent practical experience). Proficiency in CLI tools (e.g., kubectl, awscli) and scripting skills in Python, JavaScript, or similar. Understanding of GraphQL, REST APIs and MCP related troubleshooting. Experience and knowledge of connecting LLMs to external tools, including multi agent systems. Experience with alerting/logging systems (e.g., Prometheus, Grafana, FireHydrant). 
Demonstrated ability to communicate complex technical concepts to customers and team members. Experience in managing customer support cases throughout their lifecycle, including inquiry, triage, bug reporting, and resolution. Ability to create and maintain runbooks that ensure clear and actionable documentation for deeper troubleshooting procedures and Level 1 and 2 support training. Preferred Qualifications Experience in networking and troubleshooting complex network issues. Experience troubleshooting in at least one major cloud platform (preferably AWS) and containerized environments using Kubernetes or Docker. Extensive experience working with GraphQL and other web APIs. Hands on experience with Infrastructure as Code tools, such as Crossplane, and related troubleshooting. Experience with Search Technologies and Data Storages (e.g., Vespa, ElasticSearch, MongoDB, MySQL). Experience or familiarity with the Java programming language. Experience with standard software release lifecycles. Equal Opportunity Employer AlphaSense is an equal opportunity employer. We are committed to a work environment that supports, inspires, and respects all individuals. All employees share in the responsibility for fulfilling AlphaSense's commitment to equal employment opportunity. AlphaSense does not discriminate against any employee or applicant on the basis of race, color, sex (including pregnancy), national origin, age, religion, marital status, sexual orientation, gender identity, gender expression, military or veteran status, disability, or any other non merit factor. This policy applies to every aspect of employment at AlphaSense, including recruitment, hiring, training, advancement, and termination. In addition, it is the policy of AlphaSense to provide reasonable accommodation to qualified employees who have protected disabilities to the extent required by applicable laws, regulations, and ordinances where a particular employee works.
27/05/2026
Full time
Site Reliability Engineer, K8s
PulsePoint
WebMD and its affiliates is an Equal Opportunity/Affirmative Action employer and does not discriminate on the basis of race, ancestry, color, religion, sex, gender, age, marital status, sexual orientation, gender identity, national origin, medical condition, disability, veterans status, or any other basis protected by law. Position Overview Our BI team runs a set of GCP-based APIs and data services that a lot of internal products depend on. As we've grown, keeping things running has increasingly been a side responsibility for engineers who are primarily building features - and that's not sustainable. We're looking for an SRE to own that space: service health, incident response, infrastructure monitoring, and making sure we're not blindly burning cloud budget. Responsibilities Monitor and maintain uptime of GCP-hosted APIs and services, keeping performance within agreed targets Lead incident response for BI platform services - triage, resolve, and follow up with post-mortems that actually prevent recurrence Build and manage observability infrastructure: dashboards, alerts, and logging across GCP services Track GCP cloud spend and set up cost alerting to flag anomalies before they become problems Review and fix security gaps - IAP configs, service account permissions, API access controls Work with data and backend engineers to shore up reliability of data pipelines and BigQuery workflows Contribute to infrastructure-as-code and help keep deployments documented and reproducible Qualifications 2+ years in a Site Reliability, DevOps, or Cloud Infrastructure role in a production environment Bachelor's degree in Computer Science, Engineering, or related field, or equivalent hands on experience Practical experience with GCP - Cloud Run, API Gateway, and BigQuery in particular Experience with monitoring and observability tooling (Cloud Monitoring, Datadog, or similar) Solid grasp of cloud security fundamentals - IAM, network controls, access management Proficiency with Git 
and version control in a team setting Preferred Skills CI/CD pipelines and deployment automation (GitHub Actions, Cloud Build, or similar) Terraform or other infrastructure-as-code tools Python for scripting or automation MySQL, Spanner, or BigQuery at any meaningful depth GCP cost management and spend optimization Experience with dbt or Looker Comfortable working across CET/EST hours in a distributed team
27/05/2026
Full time
Director, Site Reliability Engineer Senior Engineering Team Director
LGBT Great
About this role We're seeking a Site Reliability Engineering (SRE) Lead to design, build, and maintain resilient, high scale systems supporting BlackRock's Private Markets platform. In this hands on leadership role, you'll apply deep engineering expertise to solve complex challenges, guide a global team, shape technical direction, and communicate effectively with senior stakeholders-ensuring the reliability of mission critical systems that power private market investment workflows and decision making. You will drive the adoption of AI driven solutions to accelerate incident detection and triage, reduce toil, improve forecasting and capacity planning, and strengthen end to end observability and resilience. Role Responsibilities Take ownership of project priorities, deadlines and deliverables using Agile methodologies, with clear outcomes around reliability automation and AI enabled operations Understand and refine business and functional requirements, translating them into SLOs/SLIs and AI assisted observability and support capabilities Hands on approach to getting work done-this role requires a "roll your sleeves up" mentality, including building and operationalizing reliability tooling and automation that measurably reduces toil and improves stability Be a leader with vision and a partner in brainstorming solutions for team productivity and efficiency to improve engineering effectiveness Drive priority setting of the engineering teams, balancing foundational reliability work with delivery of new product features Improve Engineering culture by encouraging continuous focus on reliability across the entire application lifecycle, and by adopting AI enabled SRE practices (e.g., intelligent alerting, automated diagnosis, and self healing where appropriate) Proactive participant in architectural and design decisions, including AI ready telemetry, data quality, and model integration patterns for operational analytics Design and implement end to end monitoring solutions 
for application and infrastructure components, leveraging modern observability platforms plus AI/ML techniques for anomaly detection, correlation, and alert noise reduction Drive the engineering of capacity management and demand forecasting solutions, including predictive analytics/ML approaches where they add measurable value Act as a culture carrier and leader, passing on SRE knowledge and best practices to the engineering team Drive detailed root cause investigations for production incidents with rigorous focus on issue avoidance, using AI assisted correlation/analysis to accelerate time to insight Create/coordinate retros for significant incidents, ensuring learnings are captured in automated/AI assisted runbooks and embedded into prevention mechanisms Additional core engineering functions, such as adding custom telemetry metrics/logs/traces to the code base of in scope applications to enable AI/ML driven operational insights Anticipate new opportunities to continuously evolve the resiliency profile of scoped applications and infrastructure Skills/Qualifications Must Have B.S. / M.S. 
degree in Computer Science, Engineering or a related discipline with 10+ years of experience Experience leading high performing engineering/SRE teams, with a track record of driving continuous improvement through automation and AI enabled operations Demonstrated ability to represent engineering/SRE priorities, status, and risk to senior leadership stakeholders with clear, executive ready communication Hands on experience building or operating AI assisted capabilities (AIOps, ML based anomaly detection, or GenAI workflows) in an engineering/production environment A passion for providing engineering support for highly available, performant full stack applications with a "Student of Technology" attitude Experience with relational database and NoSQL Database (e.g., Redis, Apache Cassandra) Benefits We offer a wide range of employee benefits including retirement investment and tools designed to help you build a sound financial future; education reimbursement; comprehensive resources to support your physical health and emotional well being; family support programs; and Flexible Time Off (FTO) so you can relax, recharge, and be there for the people you care about. Hybrid Work Model BlackRock's hybrid work model is designed to enable a culture of collaboration and apprenticeship that enriches the experience of our employees, while supporting flexibility for all. Employees are required to work at least 4 days in the office per week, with the flexibility to work from home 1 day a week. Some business groups may require more time in the office due to their roles and responsibilities. Equal Opportunity Employer BlackRock is proud to be an Equal Opportunity Employer. We evaluate qualified applicants without regard to age, disability, race, religion, sex, sexual orientation and other protected characteristics at law.
27/05/2026
Full time
Intellegens
Site Reliability Engineer
Intellegens Cambridge, Cambridgeshire
We are seeking a Site Reliability Engineer to maintain and develop our cloud infrastructure and monitoring systems Key features Location: Cambridge Fantastic opportunity to help the business develop and thrive Full time hybrid working The opportunity We are seeking a Site Reliability Engineer to maintain and develop our cloud infrastructure and monitoring systems and processes, helping to ensure the reliability and security of the service we provide to our customers. This role will report into our Head of Platform. Main duties and responsibilities You will be responsible for the continued development of our monitoring systems and use them to proactively identify and communicate performance, reliability, security and cost issues. You will assist in responding to incidents and the remediation of vulnerabilities in our platform. You will also identify, plan and implement improvements to our cloud infrastructure and deployment processes in a secure and robust way, working alongside other engineers to support our product roadmap. As part of the wider product engineering team you will advocate throughout the design process for effective monitoring to ensure the performance, stability and security of our products in line with our commitment to ISO 27001 compliance. What makes you our next Site Reliability Engineer? 
Minimum Bachelor 2:1 degree in computer science or a related field 2+ years experience in a professional DevOps, SRE, Platform Engineering or similar role Self-motivated with strong problem-solving and analytical skills Experience using and configuring monitoring tools, ideally Grafana and Prometheus, to identify insights and alert to potential issues Experience using and configuring cloud infrastructure (ideally GCP but Azure also desirable) Experience with IaC tools (ideally Terraform) Experience with Docker, Kubernetes and Helm Knowledge of security and reliability best practices for cloud infrastructure and application deployments to kubernetes Experience using Python and Bash for scripting or small CLI applications Experience using Git for professional software development Experience responding to and investigating security or reliability incidents in a distributed cloud environment The ability to communicate technical challenges and opportunities to people outside your area of expertise Some familiarity with the applications in our tech stack: NGINX, Flask (Python), React (TypeScript), PostgreSQL, Opensearch, Valkey, Keycloak Knowledge of administering Linux based systems Experience using CI tools, ideally CircleCI, to manage application deployments Experience applying and monitoring compliance with information security policies Experience applying Agile methodologies and working in sprints The above is not an exhaustive list and you are required to be flexible in your approach to carrying out your duties which may change from time to time in order to reflect business needs or the company's continuous improvement. What can we offer you? 
A competitive financial package - salary and share options 5 weeks annual leave pro rata, flexible leave policy Salary sacrifice pension, with company savings being paid into the scheme A collaborative work environment with neither red tape nor bureaucracy Scope for career development as an early team member Support and resources to develop your skills and succeed in the role Hybrid working arrangements and a great team culture Access to an EAP, wellbeing champion, and financial advice Enhanced sickness policy Regular social and team building events
27/05/2026
Full time
DevOps Architect
Experis - ManpowerGroup
Senior Cloud / DevOps Architect (DV Cleared) Clearance: Active DV clearance is essential for this role Sector: Defence & Security Location: Multiple locations + hybrid About the Role Play a key role in shaping the engineering strategy across Cloud Native and application development. You'll act as both a technical expert and an inspiring leader, driving DevOps excellence while building inclusive, high-performing teams. What You'll Be Doing Design, build, and operate cloud-based platforms and systems Drive best practices across DevOps and Site Reliability Engineering (SRE) Implement infrastructure as code, CI/CD pipelines, and automated deployments Collaborate closely with developers, testers, and business analysts Ensure systems are scalable, maintainable, and resilient Champion observability through monitoring, logging, and alerting Develop reusable frameworks and tooling for wider engineering teams Mentor junior engineers and contribute to a culture of continuous improvement Essential Skills & Experience DV (Developed Vetting) clearance Strong background in software engineering or IT operations Cloud expertise (AWS, Azure, GCP, or Oracle Cloud) Infrastructure as Code (e.g. Terraform, CloudFormation, AWS SAM) Strong understanding of DevOps practices: CI/CD pipelines Containerisation Deployment automation Experience in one or more of the following: Systems Administration Full Stack Development Virtualisation / Container platforms Configuration & environment management Scripting experience (Python, Bash, or Go) Experience implementing observability (monitoring, alerting, logging) Ability to establish and promote DevOps / SRE best practices
26/05/2026
Full time
OMC Security Systems Specialist
Vantage Data Centers UK Management Company Ltd Newport, Gwent
About Vantage Data Centers
Vantage Data Centers powers, cools, protects and connects the technology of the world's well-known hyperscalers, cloud providers and large enterprises. Developing and operating across six markets in North America and six markets in Europe, Vantage has evolved data center design in innovative ways to deliver dramatic gains in reliability, efficiency and sustainability in flexible environments that can scale as quickly as the market demands. Vantage's business is growing exponentially, through a combination of greenfield market expansion and acquisitions across North America, Europe, Africa and Asia-Pacific.
Vantage is committed to being a workplace of inclusion, equity, respect and acceptance. We celebrate diversity and intentionally seek out opportunities to learn from one another's experience.
Security Department
The Physical Security Department for Vantage Data Centers EMEA is very hands-on. In most cases, we manage full-spectrum system specifications, site and system design, procurement, installation, configuration and maintenance of all network and server hardware, and provide live-site physical security. We also identify and select trusted partner vendors and suppliers, working closely with them to learn about the latest technology changes so that we can make informed purchase decisions. We are always looking for ways to strike the best balance between human-delivered physical security, technology, performance, and cost.
The Vantage Physical Security Department also participates in designing the security infrastructure of each of our new data centre buildings. If you like getting your hands dirty and helping to design, build and maintain Physical Security infrastructure in a modern data centre, then come work at Vantage. We are expanding with many new builds!
Position Overview
We are seeking an Operations Management Centre Systems Specialist to join our newly established Operations Management Centre in EMEA. This role is crucial in monitoring and managing security systems across 17+ sites, focusing on critical incident response and troubleshooting.
Essential Job Functions, Key Responsibilities
  • Monitor Genetec security systems across 17+ EMEA sites from the Operations Management Center
  • Respond to and manage critical security incidents according to predefined processes
  • Provide remote technical support and troubleshooting for security systems
  • Collaborate with local Site Security (Security Office Center Operators, Security Systems Technicians/Engineers, Security Integrator Subcontractors) during incidents and system issues
  • Generate and analyse reports for security audits and customer requirements
  • Maintain and update security procedures and related documentation
  • Participate in the ongoing improvement of security protocols and systems
Required Qualifications
  • 3+ years of experience with Access Control Systems (ACS) and Video Surveillance Systems (VSS)
  • In-depth knowledge of Genetec security systems, or willingness to learn
  • Strong understanding of IP networking and security technologies
  • Excellent problem-solving and analytical skills
  • Ability to work in shifts, including weekends and holidays
  • Fluent in English; additional European languages are a plus
Preferred Qualifications
  • Genetec certification or ability to obtain within 90 days of hire
  • Experience in data center security operations
  • Familiarity with intercom systems, biometric systems, and key management systems
  • Background in IT or cybersecurity
Key Competencies
  • Strong communication skills, both verbal and written
  • Ability to remain calm and focused under pressure
  • Detail-oriented with excellent organisational skills
  • Proactive approach to identifying and resolving potential security issues
  • Adaptability to changing technologies and security landscapes
Working Conditions
  • Full-time position based in the CELT Operations Management Center
  • Shift work required, including nights, weekends, and holidays
  • No travel or on-site hardware maintenance required
Don't meet all the requirements? Please still apply if you think you are the right person for the position. We are always keen to speak to people who connect with our mission and values.
We operate with No Ego and No Arrogance. We work to build each other up and support one another, appreciating each other's strengths and respecting each other's weaknesses. We find joy in our work and each other, actively seeking opportunities to inject fun into what we do. Our hard and efficient work is rewarded with an above-market total compensation package. We offer a comprehensive suite of health and welfare, retirement, and paid leave benefits exceeding local expectations.
Throughout the year, the advantages of being part of the Vantage team are evident with an array of benefits, recognition, training and development, and the knowledge that your contribution adds value to the company and our community.
Vantage Data Centers is an Equal Opportunity Employer.
Vantage Data Centers does not accept unsolicited resumes from search firm agencies. Fees will not be paid in the event a candidate submitted by a recruiter without an agreement in place is hired; such resumes will be deemed the sole property of Vantage Data Centers.
26/05/2026
Full time
DevOps Engineer
Onyx-Conseil Manchester, Lancashire
Role: DevOps Engineer
Role Type: Manchester - UK - Hybrid (2 Days on-site)
Salary / Rate: Negotiable
Type: Contract Inside IR35
About the client Project
The Core Banking Platform is responsible for the systems that process payments, manage accounts, and deliver essential banking functionality. We serve all areas of the Bank that consume transactional data from our Core Banking Systems, supporting colleague and customer journeys, regulatory reporting, and real-time event streaming. Our platform is moving towards a modern, event-driven architecture, leveraging APIs, cloud-native services, and advanced DevOps tooling to deliver secure, scalable, and high-performing solutions. We work closely with partners across the Group, including Data & Machine Learning, Consumer Lending, and Cloud Services, to deliver strategic account data products and enable the next generation of banking services. We're committed to continuous improvement, modernizing our tooling (e.g., migrating from Jenkins to Harness for CI/CD, enabling blue-green deployments, and adopting Backstage as the Internal Developer Portal) and ensuring our engineers have the skills and support to deliver at pace and scale.
About the Role
As a DevOps Engineer in Core Banking, you'll play a pivotal role in designing, building, and supporting the tools and processes that enable our teams to deliver high-quality software at pace. You'll work collaboratively across engineering, product, and operations teams to ensure our platform is secure, resilient, and continuously improving.
What you'll do
  • Design, implement, and maintain CI/CD pipelines using industry-standard tools (e.g., Jenkins, Harness, Git, Nexus, SonarQube, Docker, Kubernetes).
  • Champion DevOps best practices, including Infrastructure as Code, automated testing, monitoring, and observability, ensuring compliance with banking regulations and internal security standards.
  • Support the deployment and operation of microservices on both private and public cloud environments, leveraging container orchestration (Kubernetes) and service mesh (Istio) where appropriate.
  • Drive continuous improvement by identifying bottlenecks, automating manual processes, and sharing knowledge across engineering teams.
  • Ensure robust monitoring, alerting, and incident response processes are in place to maintain high availability and reliability of banking services.
What we're looking for
  • Strong engineering background with a deep understanding of DevOps practices, processes, and supporting tools.
  • Experience designing and implementing CI/CD pipelines and associated DevOps tooling (Jenkins, Harness, Backstage, Git, Nexus, SonarQube, Docker, Kubernetes).
  • Proficiency with source control systems (Git), including branching strategies and code review practices.
  • Hands-on experience with cloud platforms (especially GCP), Kubernetes, Docker, and Infrastructure as Code.
  • Familiarity with monitoring, logging, alerting, and SRE concepts (Prometheus, Grafana, ELK, etc.).
  • Experience with test automation tools (JUnit, Cucumber, Selenium, Postman).
  • Knowledge of messaging and streaming platforms (Kafka) and service mesh (Istio).
  • Understanding of network protocols, security best practices, and compliance requirements for banking.
  • Curiosity, adaptability, and a commitment to continuous improvement and inclusive teamwork.
  • Experience with network protocols and security.
  • Exposure to large-scale, regulated environments (preferably financial services).
  • Knowledge of additional DevOps tools and emerging technologies.
26/05/2026
Full time
SRE III- Data & AWS
hackajob
Job Description
We have an opportunity to impact your career and provide an adventure where you can push the limits of what's possible. The Chief Data & Analytics Office (CDAO) at JPMorgan Chase is responsible for accelerating the firm's data and analytics journey. This includes ensuring the quality, integrity, and security of the company's data, as well as leveraging this data to generate insights and drive decision making. The CDAO is also responsible for developing and implementing solutions that support the firm's commercial goals by harnessing artificial intelligence and machine learning technologies to develop new products, improve productivity, and enhance risk management effectively and responsibly.
As a Site Reliability Engineer at JPMorgan Chase within the AIML Data Platforms and Chief Data and Analytics Team, you will develop and deliver advanced technology products focused on data and analytics. You will tackle complex cloud data platform challenges, especially around DataLake Tools. In this role you will work in an agile environment, collaborating with cross-functional teams.
Job Responsibilities
  • Maintains a managed AWS Databricks platform, and provides engineering and operational support for the platform to application teams.
  • Performs platform design, set-up and configuration, workspace administration and resource monitoring, and provides engineering support to data engineering, Data Science/ML, and Application/integration teams.
  • Leads evaluation sessions with external vendors, startups, and internal teams to drive outcomes-oriented probing of architectural designs, technical credentials, and applicability for use within existing systems and information architecture.
  • Drives continuous improvement in system observability, alerting, and capacity planning.
  • Collaborates with engineering and data teams to optimize infrastructure and deployment processes, focusing on automation and operational excellence.
  • Executes creative software solutions, design, development, and technical troubleshooting with the ability to think beyond routine or conventional approaches to build solutions or break down technical problems.
  • Develops secure, high-quality production code, and reviews and debugs code written by others.
  • Identifies opportunities to eliminate or automate remediation of recurring issues to improve overall operational stability of software applications and systems.
  • Adds to team culture of diversity, opportunity, and respect.
  • Implements Site Reliability Engineering (SRE) best practices to ensure reliability, scalability, and performance of data platforms.
  • Develops and maintains incident response procedures, including root cause analysis and post-mortem documentation.
Required qualifications, capabilities, and skills
  • Formal training or certification on software engineering concepts and 10+ years applied experience.
  • Extensive experience with AWS Databricks platform administration and engineering support is a must.
  • Strong understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management.
  • Experience with monitoring tools, automation frameworks, and CI/CD pipelines.
  • Proficient in Python application program development with use of automated unit testing.
  • Experience with Terraform development and understanding of Terraform Enterprise.
  • Experience in delivering system design, application development, testing, and operational stability.
  • Knowledge of Big Data distributed compute frameworks like Spark, Glue, MapReduce etc.
  • Excellent troubleshooting, analytical, and communication skills.
Preferred qualifications, capabilities, and skills
  • Experience in Data pipelines using Spark.
  • Exposure to AWS & Databricks Platform administration.
  • Knowledge of containerization (Docker, Kubernetes) and orchestration.
  • Familiarity with distributed systems and large-scale data processing.
We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants' and employees' religious practices and beliefs, as well as mental health or physical disability needs.
26/05/2026
Full time
Senior Site Reliability Engineer
hackajob
Professional Services was formed as a progressive development towards the convergence of multiple domains across BT. We pride ourselves on providing expert third-line support to an extensive range of services, ensuring the required levels of availability are maintained. The team is widely recognised for getting things done, while making transformational improvements along the way. We do this by ensuring we have the right people to achieve our high ambitions.
This role is pivotal in driving automation and operational excellence within Professional Services. You'll lead the automation strategy, evolve development practices and champion an SRE mindset to improve reliability, scalability and performance across our platforms, inclusive of our hosting infrastructure. By leveraging deep technical expertise and end-to-end ownership, you'll deliver flawless change, enhance service availability, and unlock time for the team to focus on solving business problems. Your work will enable smarter processes, stronger collaboration, and a "brilliant" customer experience, making automation and resilience the foundation of everything we do.
What you'll be doing - your accountabilities
  • Actively contributes to and drives the Professional Services automation strategy.
  • Owns and contributes to the roadmap for the APPIDs we manage, including EoSL planning, application lifecycle, compute requirements and BCM planning.
  • Manages compute infrastructure hosts, including patching and security vulnerability management.
  • Builds network automation frameworks that allow for efficient test, build and deployment of new automated processes, inclusive of CI/CD pipelines, ensuring we deliver robust code in an efficient manner.
  • Enacts the Professional Services driverless change ambition, actively looking for potential use cases, starting the conversation and driving for success.
  • Contributes to the development of the Professional Services team in the adoption of an SRE approach.
  • Participates in, and acts as technical lead on, incidents within the immediate domain, and contributes to the PIR write-up and solution.
  • Acts as a subject matter expert for the TSA guidelines, influencing working practices to ensure we remain compliant with industry standards.
  • Champions and builds effective working relationships, both internally and externally, to deliver business outcomes.
  • Is responsible for delivering complex network change and providing assurance of Fixed network platforms managed by Professional Services.
The skills you'll need to succeed
You must have:
  • A strong level of proficiency in at least one programming language, preferably Python.
  • An understanding of the fundamentals of Ansible and how to apply it.
  • A strong understanding of containerisation using Docker, Podman or a similar container engine.
  • A good knowledge of coding best practices including code structure, peer review and testing.
  • A good knowledge of CI/CD pipelines and familiarity with either GitLab CI/CD or Jenkins.
  • Confidence and professionalism in communicating with all stakeholders, both locally and with members of the Senior Management Team.
  • The ability to work in a high-pressure environment.
  • A good grasp of network engineering fundamentals.
  • A good understanding of HTTP / web APIs.
Desired:
  • Strong understanding of IP/MPLS networks.
  • Strong proficiency in building CI/CD pipelines.
  • Strong knowledge of Ansible; you will have built an Ansible repo from the ground up to perform heavy-lift deployment.
  • Strong understanding of container orchestration (e.g. Kubernetes).
  • OS Administration: Strong system administration skills across an array of operating systems such as Linux and Windows, with a focus on performance optimisation, security, and automation.
  • Cloud Computing: Strong understanding of cloud computing and best practices.
26/05/2026
Full time
Site Reliability Engineer (AWS)
hackajob
Site Reliability Engineer (AWS)
Reporting to: Director of Engineering
Location: London (Hybrid - we're flexible)
Job Type: Permanent

About Us
Camascope is a fast-growing technology company focused on empowering the care and medication sector with technology. We are a team of talented, caring, and ambitious individuals who are committed to making a difference in care. Our ecosystem connects pharmacies, care homes, and doctors to improve the lives of many. There has never been a better time to join Camascope. Our team is growing and our product is reaching more users and partners every day. You will join a collaborative and passionate team. We love solving real problems and are committed to building the highest-quality solutions. If you are eager to make a meaningful impact in healthcare and thrive in a fast-paced startup environment, Camascope will be the perfect place for you.

What You'll Do
  • Own reliability - Maintain and improve our AWS infrastructure using Terraform, bringing your expertise and best practices
  • Champion observability - Partner with developers to implement effective monitoring, logging, and tracing strategies
  • Strengthen security - Work closely with the CISO to implement security best practices and ensure compliance
  • Optimise costs - Monitor cloud spend and implement FinOps best practices
  • Maintain CI/CD pipelines - Implement and maintain reliability and observability aspects of GitHub workflows and deployment pipelines
  • Incident response - Lead incidents, run blameless post-mortems, and drive continuous improvement
  • Enable developers - Mentor teams on SRE and observability practices, helping them quickly understand and resolve issues
  • Leverage AI tooling - Use AI-assisted development tools (e.g. GitHub Copilot) to accelerate infrastructure work, and explore AI-driven approaches to incident detection, root cause analysis, and remediation

What We're Looking For

Essential
  • 3+ years in an SRE, Platform, or DevOps engineering role
  • AWS services: CloudWatch, X-Ray, Lambda, API Gateway, S3, SQS, Aurora PostgreSQL, DynamoDB, CloudFront, VPC, IAM, Security Groups
  • Python for scripting, tooling, and Lambda development
  • Terraform for Infrastructure as Code
  • GitHub (Actions, workflows, repository management)
  • Strong understanding of observability - metrics, logs, and traces
  • Good understanding of cloud security principles and best practices
  • Experience with cloud cost management and optimisation
  • Excellent communication skills for working with technical and non-technical colleagues
  • Self-starter who can prioritise and organise their own workload
  • Comfortable using AI-assisted development tools such as GitHub Copilot

Bonus Points For
  • Datadog for monitoring, APM, and log management
  • Azure experience: Front Door, Storage Accounts, App Service, Azure SQL Database, Application Insights
  • Previous experience in early-stage startups or scale-ups
  • Previous experience in Healthcare or Pharmacy tech
  • Experience working in regulated environments or with compliance frameworks
  • Experience with AI-driven DevOps tooling (e.g. AWS DevOps Agent or similar AI agents for incident resolution, root cause analysis, and operational improvement)
  • Experience with SLIs, SLOs, and error budgets

On Call
We have a 24/7 customer support team who handle day-to-day issues. We don't have a formal on-call engineering rota, but our platform supports care homes around the clock, so we're looking for someone who is happy to occasionally jump on a call with the team if critical platform issues arise out of hours (and part of your job will be making sure this isn't necessary!).

Why Join Us?
  • Join an established engineering team and have the opportunity to enhance and shape how we approach platform and reliability engineering
  • Make a meaningful impact in healthcare technology
  • Work with modern cloud-native infrastructure
  • Influence engineering culture and platform practices
  • Collaborate in an environment where your ideas matter
  • Grow with us as we scale

Benefits
  • Competitive salary (dependent on experience)
  • Pension scheme and healthcare benefits
  • Ongoing training and professional development
  • 25 days annual leave excluding bank holidays

We welcome applications from candidates of all backgrounds. If you're excited about this role but don't meet 100% of the requirements, we encourage you to apply anyway.
26/05/2026
Full time
© 2008-2026 IT Job Board