# Remote Solution Architect Job at Xapo Bank

Anywhere | 17 hours ago | Full Time | $25000 - $40000 USD | CI/CD, Python, AWS, GCP, architecture, cloud, security

"When applying, mention the word FarCoder to show you've read the job post completely. Employers can look for these words to identify genuine, thoughtful applicants and avoid spam."

## Job Overview

Xapo Bank is hiring a remote candidate for Solution Architect. This is a full-time position. Work location: Anywhere. The role typically involves technologies such as CI/CD, Python, AWS, GCP, architecture and cloud.

## Key Responsibilities

- Own and facilitate the Architectural Advice Forum (AAF), driving decentralised decision-making that is captured as ADRs. The forum is yours and challenge is expected, but you are not the single approver of every architectural change; the goal is for decisions to be made as close to the teams doing the work as possible.
- Personally build production-grade Proofs of Concept to de-risk new patterns, validate architectural ideas and demonstrate "what good looks like" ahead of broader adoption.
- Partner with the CTO on Xapo's AI strategy: identifying high-value opportunities, evaluating tooling, prototyping solutions and helping engineering teams adopt AI responsibly across the SDLC and our customer-facing products.
- Own cross-cutting standardisation across backend, web and mobile engineering (patterns, paradigms, observability, security, testing and delivery) so that quality and velocity scale together.
- Define and uphold the engineering quality bar across backend, web and mobile, covering testing, code review, observability, security, performance and production-readiness, and lead by example through your own POCs and code reviews.
- Embed Security-First Architecture across everything we build. For a regulated crypto bank, security is not a checkbox or a separate workstream; it is the DNA of our products. You will set the threat-modelling, secure-by-default and defence-in-depth patterns that every team builds against, and partner with our Security function to evolve them as our risk surface grows.
- Champion Domain-Driven Design and Event-Driven Architecture as foundations for how we model our domain and evolve our systems.
- Work closely with the SRE team on reliability, capacity, incident response, observability and the evolution of our internal developer platform.
- Coach and mentor engineers across SATs, platform and enabling teams, leading through inspiration and influence rather than authority, and helping the whole organisation raise its architectural bar.
- Surface and unblock systemic technical risks, technical debt and cross-team dependencies before they hit production.
- Maintain Xapo's engineering principles and Tech Radar, and contribute to architecture-relevant aspects of regulatory, security and risk conversations.
- Build a great place to work for talented and motivated people, and help develop innovative solutions with Bitcoin at their core.

## Required Skills

- Primary skills: CI/CD, Python, AWS, GCP
- Secondary skills: architecture, cloud, security

Skills required for this role include CI/CD, Python, AWS, and related tools for day-to-day development.

## Job Details

- Employment Type: Full Time
- Location: Anywhere
- Salary: $25000 - $40000 USD
- Tech Stack: CI/CD, Python, AWS, GCP, architecture, cloud, security

## Role details

### Work from anywhere, impact everywhere

Diversity is at the heart of who we are at Xapo Bank. We're a fully distributed team of over 160 Xapiens who work remotely from 50+ countries around the world.

Our beginning: A world that enjoys economic freedom and wealth protection, no matter where you live or who is running your country. To achieve that, we search the world for the best people for the job. We work hard, think globally, and inspire each other to learn and grow. We are committed to changing the way things are done.

Although we are headquartered in Gibraltar, this is a full-time, 100% remote position. Work from anywhere!

### Position overview

We're looking for a Solution Architect to join our engineering function, reporting directly to the CTO. At Xapo, we are building truly cross-functional teams with full ownership of design, architecture, building, testing, delivery, data, and operations, structured as Stream Aligned Teams (SATs), Platform teams and Enabling teams as per Team Topologies. You will collaborate closely with engineering leaders and the SRE function, as well as fellow members of the product, apps, and design teams.

This is a deeply hands-on role. You will not sit on top of an ivory tower drawing boxes and arrows; you will write code, build production-grade Proofs of Concept and demonstrate what good looks like by example. You will own the Architectural Advice Forum (AAF), facilitating decentralised architectural decisions captured as ADRs without becoming a bottleneck or single point of approval.

You will be the CTO's partner on AI strategy, drive standardisation across backend, web and mobile engineering, and work closely with our SRE team on reliability, observability and platform concerns.

This is an opportunity to shape how a regulated crypto bank designs, builds and operates software, having a real impact on how the future of finance looks.

### Skills needed

- 10+ years of software engineering experience, including 3+ years in an Architect, Staff or equivalent senior-IC role.
- A strong track record as a hands-on engineer-turned-architect: you still read, write and review production code regularly, and you enjoy doing so. The role is language-agnostic; we care how you think, not which language you write, though familiarity with our primary stack is a plus.
- Deep experience designing, building and operating distributed, event-driven systems in production, ideally at scale and under regulatory scrutiny.
- Hands-on experience with Domain-Driven Design, both strategic (bounded contexts, context mapping) and tactical patterns.
- Practical familiarity with decentralised architectural governance (Architectural Advice Forum, RFCs and ADRs), and a clear point of view on why architectural decisions should be made close to the teams doing the work.
- Demonstrated ability to lead by influence rather than authority: you have inspired engineers to adopt good practices because they wanted to, not because they were told to.
- Experience working in regulated fintech, payments or banking environments, with an appreciation for the constraints that come with serving customers' money.
- Deep experience designing security-first architectures (threat modelling, secure-by-default patterns, defence in depth, identity, key and secret management) in environments where a breach has material business and regulatory consequences.
- A strong perspective on modern AI/ML, LLMs, agents and applied AI in production, and how to integrate them responsibly into engineering workflows and customer-facing products.
- Excellent written and verbal communication: you can write a crisp ADR, run an effective architecture review, present to engineers and explain trade-offs to non-technical stakeholders.
- Solid understanding of cloud-native architecture, microservices, container-based 12-factor apps and patterns around fault tolerance, security and resilience.
- Strong CI/CD, automated testing (unit, service and end-to-end) and overall SDLC practices.

### Nice to have

- Hands-on experience in crypto, Web3, custody or blockchain-based systems.
- Familiarity with Team Topologies and how Stream Aligned, Platform and Enabling teams interact in practice.
- Multi-platform standardisation experience spanning backend, web and mobile.
- Background contributing to SRE or platform engineering work (SLOs, error budgets, developer experience).
- Open-source contributions or external thought leadership (talks, articles, OSS projects).
- Familiarity with Python: a meaningful portion of our stack is in Python, so it helps you hit the ground running, but the role is language-agnostic.

### Other requirements

- AWS as our cloud platform
- GCP as our data warehousing
- Containerised microservices
- A multi-language stack
- Based in or near a CET-compatible timezone
- A dedicated workspace
- A reliable internet connection with the fastest speed possible in your area
- Alignment with Our Values and the Xapo Values-Driven Leadership principles

### Why work for Xapo?

Impact Globally, Work Remotely.

- Shape the Future: Improve lives through cutting-edge technology, and work 100% remotely from anywhere in the world.
- Great work-life balance: Build amazing things with a balance of autonomy and collaborative teamwork. Set your own work schedule and make use of a flexible PTO plan when you need to recharge.
- Expect Excellence: Collaborate, learn, and grow with a high-performance team. Learn how you learn best: from books to conferences, you'll get a yearly budget for your individual learning and development goals.

At Xapo, we prioritize consumer protection and adhere to regulatory requirements by ensuring that all Xapiens are accountable for upholding principles of fair treatment, transparency, and ethical conduct in their interactions with customers and stakeholders.
27/06/2026
Full time
We have a Lead Site Reliability Engineer (SRE) opportunity within our Google Cloud Site Reliability Engineering team.

As a Lead Site Reliability Engineer at JPMorgan Chase within the Infrastructure Platform - Cloud Foundational Services SRE organization, you will join our Google Cloud Site Reliability Engineering team operating within a global follow-the-sun support model.

Job Responsibilities:

- Lead and implement SRE frameworks to support global Google Cloud environments and ensure the highest level of SLOs through operational excellence
- Mastery of application, data, infrastructure, and Agentic AI disciplines
- Keen understanding of financial control and budget management using expertise in working in partnership with colleagues throughout the firm, and in leading collaborative teams to achieve common goals
- Use enterprise-authorized AI capabilities within the work environment to accelerate major-incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements
- Provide support to develop and improve the quality of technical engineering documentation
- Provide technical supervision, oversight and problem resolution for engineering activities
- Champion a DevOps model so that services are automated and elastic across all platforms

Required qualifications, capabilities, and skills:

- Google and Azure cloud expertise in a mission-critical production environment
- Strong understanding of container technologies such as Docker, Kubernetes, GKE and Helm
- Experience programming in one of the following languages: Python, shell scripting or Go, along with a good understanding of REST APIs
- Hands-on experience with cloud-based technologies and tools, especially in deployment, monitoring and operations, such as Google Observability, Azure Monitor, Datadog, Prometheus, Splunk, Elasticsearch and Grafana
- Demonstrated experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows (e.g., incident investigation support and knowledge capture) with strong validation habits and awareness of data sensitivity
- Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align to resiliency and security expectations
- Strong understanding of Google Cloud governance, compliance and cost management
- Strong working knowledge of modern development technologies and tools such as Agile, CI/CD, Git, Infrastructure as Code, Terraform and Jenkins
- Google Cloud certification or equivalent technical experience in the Public Cloud
- Good understanding of Agentic AI SDKs and GitHub Copilot Skills

Preferred qualifications, capabilities, and skills:

- Good understanding of operating systems such as Windows and Linux (Red Hat / Ubuntu)
- Good understanding of LLM and other AI/ML frameworks which can be used in AIOps
27/06/2026
Full time
CMETS CME Technology and Support Services Ltd.
City, Belfast
# Site Reliability Engineer III (Tue - Sat)

CME Group is seeking a Site Reliability Engineer III to take a key role in building, operating, and scaling systems in our Markets portfolio. As an SRE III, you will apply your experience to the complex challenges of the CME Globex trading platform, where our systems deliver an exceptional combination of low-latency performance and rock-solid reliability. You will work with senior engineers on complex projects, take ownership of key reliability initiatives, and mentor junior colleagues, shaping the team's technical direction.

## Key Responsibilities

- Own Observability: design, build, and refine monitoring, alerting, and observability solutions; drive continuous improvement of SLIs and SLOs to enable faster issue detection and resolution.
- Drive Reliability Projects: take ownership of reliability-focused projects from design to implementation, collaborating with product teams to ensure new features are scalable, resilient, and safe.
- Lead Technical Solutions: lead technical discussions for your work, presenting solution options and proposals with clear trade-offs.
- Automate Intelligently: proactively identify and eliminate toil through robust automation, improving both system reliability and team velocity.
- Manage Incidents: lead incident response, own resolution of significant incidents, ensure rapid system recovery, and drive meaningful action from blameless post-mortems.
- Mentor & Coach: act as a technical mentor and point of escalation for L1 and L2 SREs, fostering their growth through code reviews and paired work.
- Architect for the Future: contribute ideas to the product backlog and play an active role in the architectural design for the migration to Google Cloud Platform.

## What We're Looking For

- 3-5+ years of professional experience in a Site Reliability, DevOps, Software, or Systems Engineering role.
- Strong, hands-on experience administering and troubleshooting Linux-based production systems.
- Proficient programming skills in a language like Python or Go, with a track record of automating complex operational tasks.
- Proven ability to lead technical initiatives and solve complex problems with a high degree of autonomy.
- Excellent communication skills, with the ability to articulate complex technical concepts to diverse audiences.
- A proactive and ownership-oriented mindset.

## Desirable Skills

- Cloud Platforms: deep experience with Google Cloud Platform, especially GCE, GKE, and cloud networking.
- Monitoring Tools: expertise in designing and managing monitoring stacks such as Prometheus, Grafana, and OpenTelemetry.
- Distributed Systems: strong practical knowledge of building and maintaining large-scale distributed systems.
- Containerisation: advanced experience with Kubernetes and Docker in a production environment.
- Networking: solid understanding of networking protocols (HTTP, TCP/UDP, IP) and network architecture.
- Domain Knowledge: experience in financial markets, low-latency systems, or message-oriented middleware.

## Company Benefits

- Bonus Programme
- Generous shift allowance
- Equity Programme
- Employee Stock Purchase Plan (ESPP)
- Private Medical and Dental coverage
- Mental Health Benefit Programme
- Group Pension Plan
- Income Protection
- Life Assurance
- Cycle to Work
- EV Car Benefit Scheme
- Gym Membership
- Family Leave
- Education Assistance - MBA/Advanced Degree/Bachelor Degree
- Ongoing Employee Development
- Hybrid Working

## Equal Opportunity Employer

As an equal opportunity employer, we consider all potential employees without regard to any protected characteristic.
27/06/2026
Full time
We're seeking a Lead Infrastructure Engineer covering designing, building and maintaining robust, resilient and scalable infrastructure automation systems. In the Technology division, we leverage innovation to build the connections and capabilities that power our Firm, enabling our clients and colleagues to redefine markets and shape the future of our communities. This is a Lead Cloud and Infrastructure Engineering position at Vice President level, which is part of the job family responsible for managing and optimizing technical infrastructure and ensuring the seamless operation of IT systems to support business needs effectively. What you'll do in the role: Design, implement and support new infrastructure solutions to enable our algorithmic trading business to operate in an efficient, effective and compliant manner. Collaborate closely with business-facing development groups, system administrators and enterprise infrastructure engineers to implement and maintain appropriate solutions to reduce costs and manage technology risk for our business. Act as a subject matter expert for our Linux-based electronic trading infrastructure platform. Mentor, coach and lead less experienced engineers. Provide technical leadership and vision for a small business-aligned team of engineers and SREs. Drive the continuous improvement of our infrastructure and working practices through regular reviews, blameless post-mortems and other SRE techniques. What you'll bring to the role: Advanced knowledge of the Python programming language and standard software engineering concepts such as common data structures, regular expressions, object-oriented programming, and advanced algorithms. Strong understanding of core Unix components - networking stack, daemon configuration, OS customisation. Experience with architecting large scale, robust, resilient and scalable systems. Domain expertise in infrastructure automation and integration. 
The ability to describe algorithmic and architectural trade-offs in an algebraic or quantitative fashion. Comfortable developing in a Linux and CLI-based environment. Knowledge of standard Linux command line debugging tools such as tcpdump and strace. Familiarity with modern development tools and practices including agentic workflows, Git, Jenkins, test-driven development, and continuous integration. Can take full-lifecycle ownership of components within a project, from initial architecture design to ongoing support. A proven track record of technical leadership, able to work independently on both technical problems and customer interactions. Familiarity with the Go programming language would also be useful. Certified Persons Regulatory Requirements: If this role is deemed a Certified role, it may require the role holder to hold mandatory regulatory qualifications or the minimum qualifications to meet internal company benchmarks. Flexible work statement: Morgan Stanley empowers employees to have greater freedom of choice through flexible working arrangements. Speak to our recruitment team to find out more. Morgan Stanley is an equal opportunity employer committed to building and maintaining a workforce that is diverse in experience and background. Our recruiting efforts reflect our strong commitment to a culture of inclusion, where individuals are hired, developed, and advanced based on their skills and talents. Our workforce reflects a broad cross-section of the global communities in which we operate, bringing a variety of backgrounds, talents, perspectives, and experiences. For more information, please visit:
27/06/2026
Full time
Metro Bank is looking for a Senior Engineer - Platform Engineering (Azure Architect and SRE) to enhance their cloud foundations. This role focuses on Azure architecture, ensuring secure, observable, and resilient platforms while offering hybrid working opportunities. The ideal candidate will bring strong Azure design and SRE principles to the team, with responsibilities including driving SRE practices and mentoring engineers across cloud disciplines. Metro Bank emphasizes personal growth and offers a competitive salary and comprehensive benefits.
27/06/2026
Full time
Overview We tackle the most complex problems in quantitative finance, by bringing scientific clarity to financial complexity. From our London HQ, we unite world class researchers and engineers in an environment that values deep exploration and methodical execution - because the best ideas take time to evolve. Together we're building a world class platform to amplify our teams' most powerful ideas. Role The Observability Engineering Team manages the doors - both entry and exit - to the telemetry backends at G Research, ensuring our engineers can effectively produce and consume telemetry for their services. As an Observability Engineer, you'll help make observability seamless for developers and platform teams by building pipelines to ingest and route data in predictable, composable ways, as well as visualising that data after the fact. You'll have deep experience across observability stacks, a clear understanding of the unique problems that come with moving cloud level volumes of telemetry data at scale, and excitement at the prospect of ensuring our customers have eyes into the telemetry data to run their services as efficiently as possible. Key Responsibilities Extending and maintaining OpenTelemetry, including collectors, SDKs and exporters Building scalable telemetry pipelines Contributing to Golden Path SDKs and auto instrumentation Ensuring Kubernetes workloads are fully observable and resilient Embedding observability standards across platform and application teams Improving incident response with better telemetry coverage Providing external industry observability experience and input to our long term roadmap Participating in the out of hours on call rotation Qualifications Operating/onboarding SaaS Observability platforms at scale, such as DataDog, NewRelic, Dynatrace etc. 
Strong hands on experience with OpenTelemetry Familiarity with public cloud infrastructure, ideally AWS Proficiency in Kubernetes and DevOps tooling (Terraform, ArgoCD, Helm, Jenkins) Experience with metrics, logs and tracing backends Coding in Go, Python, or similar Industry background in Observability or SRE Desirable Experience Profiling (eBPF, Pixie, Parca) Synthetic monitoring, AI observability tools and Kafka Benefits Highly competitive compensation plus annual discretionary bonus Lunch provided (via Just Eat for Business) and dedicated barista bar 35 days' annual leave 9% company pension contributions Informal dress code and excellent work/life balance Comprehensive healthcare and life assurance Cycle to work scheme Monthly company events Inclusion Statement G Research is committed to cultivating and preserving an inclusive work environment. We are an ideas driven business and we place great value on diversity of experience and opinions. We want to ensure that applicants receive a recruitment experience that enables them to perform at their best. If you have a disability or special need that requires accommodation please let us know in the relevant section.
27/06/2026
Full time
Machine Learning Platform Engineering We're on a mission to make money work for everyone, and our Machine Learning Platform team builds the systems that help teams across Monzo train, evaluate, deploy, and serve ML models and AI features safely and reliably. We work on backend services, Python libraries, model lifecycle tooling, evaluation workflows, and low latency serving systems. Our users are internal ML engineers, scientists, and product teams building with ML and LLMs. The work matters because machine learning powers many important decisions and experiences at Monzo, from fraud checks and credit decisions to customer operations. Location & Compensation London, UK (remote within the UK available). Salary £85,000 - £110,000 plus incentive awards tied to performance. Benefits include relocation support, visa sponsorship, flexible working hours, learning budget, and a full list of benefits. Responsibilities Develop backend services, platform APIs, and production systems using Go. Write Python libraries, workflows, and tooling used by our ML engineers and scientists. Implement feature platforms and data workflows with Chronon, Feast, and DataHub. Build model training pipelines and experiment tracking using Vertex AI and Comet. Maintain AI observability, evaluation, and tracing using Langfuse. Deploy and maintain real time serving on AWS and batch compute on GCP, including BigQuery data warehousing. Qualifications Strong backend engineering background with experience in Go and Python. Experience with ML or AI platforms, including pipelines, feature stores, model serving, experiment tracking, or LLM tooling. Designed and operated distributed systems that handle scale, concurrency, and failure. Focus on developer experience and removing friction for internal teams. Comfortable with ambiguity and ability to shape a platform as it grows. Experience with strongly typed languages and writing backend software. 
Curiosity about system behavior in production, including reliability, latency, quality, safety, and operational risk. This Might NOT Be the Right Fit If Your background is predominantly DevOps, SRE, or infrastructure operations. You are focused on data science or modelling rather than platform engineering. You have shipped AI product features but have not worked on the platform side (serving, evaluation, model lifecycle). Benefits Competitive salary £85,000 - £110,000 plus incentive awards. Relocation assistance to the UK and visa sponsorship. Flexible working hours and trust to work the hours that suit you. Annual learning budget of £1,000 for books, training courses, and conferences. Additional benefits available - see our full benefits list. Equal Opportunity Employer Diversity and inclusion are a priority for us. We are an equal opportunity employer and will consider all applicants without regard to age, ethnicity, religion, sex, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity, or disability status.
27/06/2026
Full time
Department Overview This opportunity is part of the Global Technology Infrastructure & Operations team (GTIO), where our mission is to deliver modern and relevant technology that supports the way McDonald's works. We provide outstanding foundational technology products and services including Global Networking, Cloud, End User Computing, and IT Service Management. It's our goal to always provide an engaging, relevant, and simple experience for our customers. The Site Reliability Engineer (SRE) - Edge Platform is a key member of the Edge Operations and SRE team within Global Technology Infrastructure & Operations. This role is responsible for ensuring the reliability, scalability, and operational excellence of the Edge computing platform that supports McDonald's global restaurant technology ecosystem. You will work closely with Architecture, Platform Engineering, and Security teams to implement observability, automation, and incident response strategies that ensure the Edge platform is resilient and maintainable. This is a unique opportunity to influence the operational maturity of a global platform and drive continuous improvement across infrastructure and services. Duties Operate and maintain Edge platform infrastructure to ensure 24x7x365 availability, reliability, and performance. Design and implement observability frameworks using tools such as Prometheus, Grafana, Jaeger, and Datadog. Collaborate with Platform Engineering and Edge Solution Delivery teams to ensure platform features are operable, maintainable, and supportable in production environments. Develop and maintain runbooks, playbooks, and automation scripts to streamline operations and reduce manual effort. Lead incident response, root cause analysis, and post-incident reviews to drive continuous improvement. Participate in capacity planning, performance tuning, and disaster recovery exercises. Implement and manage CI/CD pipelines and Infrastructure-as-Code (IaC) for operational tooling and automation. 
Architect and maintain self-healing and auto-scaling capabilities across Edge clusters. Partner with security teams to ensure compliance with enterprise standards and implement secure operational practices. Contribute to platform architecture discussions with a focus on operational readiness and supportability. Stay current with industry trends in SRE, edge computing, and distributed systems. Skills and experience required: Experience in Site Reliability Engineering, DevOps, or Platform Operations. Experience supporting Edge computing or hybrid cloud environments. Strong expertise in observability tools (Prometheus, Grafana, Jaeger, Datadog, ELK). Experience with container orchestration platforms (Kubernetes, GKE) and virtualization technologies. Proficiency in scripting and automation (Python, Bash, PowerShell). Hands-on experience with CI/CD tools (GitHub Actions, Jenkins, ArgoCD) and IaC (Terraform). Solid understanding of cloud platforms (GCP, AWS) and distributed systems. Strong problem-solving skills and ability to work in a fast-paced, collaborative environment. Excellent communication and documentation skills. GCP or AWS certification preferred. Experience with Agile methodologies is a plus. Qualifications Bachelor's degree in Computer Science, Engineering, or related field; or equivalent experience.
27/06/2026
Full time
Lead Site Reliability Engineer The Business Operations team is seeking a highly motivated and experienced Lead Site Reliability Engineer (SRE) to join our team. You will play a critical role in ensuring the reliability, scalability, and performance of our applications, supporting essential services that power Mastercard's global operations. As a thought leader in your field, you will bring technical expertise, a passion for automation, and the ability to mentor. Responsibilities Be a developing subject matter expert in the Site Reliability Engineering area, influencing stakeholders and applying advanced knowledge to drive achievement of area goals and initiatives by contributing to solution development and improvements for existing products, services, and/or processes. Implement and maintain high-availability system solutions, ensuring stability, performance, and operational continuity. Evaluate operational requirements to develop effective technical solutions within existing frameworks. Lead automation and scripting efforts to streamline operational processes and incident response workflows. Troubleshoot and resolve complex system issues, escalating as necessary to maintain system health and proactively address risks. Contribute to documentation, knowledge sharing, and best practices to improve team operational procedures. Conduct reviews and quality assurance activities to uphold organizational standards for system stability. Keep current with industry trends and emerging technologies relevant to system reliability and operational automation. Guide and mentor junior team members through on-the-job experiences, reviewing work and fostering a culture of continuous improvement to grow expertise around their discipline. 
Qualifications Observability - Ability to use scripting and tooling to implement observability solutions, enabling the collection, analysis, and visualization of metrics, logs, and traces to support incident detection, diagnosis, and continuous service improvement. Programming and Scripting - Ability to write and maintain code and scripts to automate tasks, build operational tools, and support monitoring, deployment, and incident response using languages such as Python, Go, Bash, or similar. Systems and Network Administration - Ability to configure, operate, and troubleshoot Linux/Unix systems and network components, applying knowledge of networking concepts, protocols, security, and system reliability. Cloud Computing and Infrastructure - Ability to design, deploy, and manage applications and infrastructure on cloud platforms (e.g., AWS, Azure, GCP), ensuring scalability, security, availability, and operational efficiency. Reliability and Scalability - Ability to design and operate systems for high availability, fault tolerance, and disaster recovery, while ensuring systems can scale to meet current and future demand. DevOps Practices - Ability to apply DevOps principles and practices, including CI/CD pipelines, containerization, and orchestration, to enable faster, more reliable software delivery and operations. Troubleshooting - Capability to systematically identify, diagnose, and resolve technical issues across systems, applications, and networks, using analytical methods and tools to restore functionality, minimize disruption, and ensure stable operations. Capacity Planning and Performance Optimization - Ability to monitor resource utilization, forecast future capacity needs, and optimize system performance to support growth, scalability, and efficient infrastructure usage. 
IT Service Management - Ability to apply IT service management principles to incident, problem, and change management, ensuring reliable service delivery, effective incident response, and continuous service improvement aligned to business needs. Proactive Monitoring and Improvement (SRE Applications) - The ability to use application reliability signals to anticipate issues, identify risks, and drive preventative improvements that enhance application performance and availability. Corporate Security Responsibility All activities involving access to Mastercard assets, information, and networks come with an inherent risk to the organization. Each person working for, or on behalf of, Mastercard is responsible for information security and must: abide by Mastercard's security policies and practices; ensure the confidentiality and integrity of the information being accessed; report any suspected information security violation or breach; and complete all periodic mandatory security trainings in accordance with Mastercard's guidelines.
27/06/2026
Full time
Senior Data Engineer - Reference Data (Assistant Vice President) London, United Kingdom Job Description Jefferies is looking for a highly experienced Senior Data Engineer to join the Reference Data Group within our Technology division. You will play a key role in designing, building, and managing the firm's critical reference data platforms - including Security Master, Account Master, and Counterparty Master - which underpin trading, risk, compliance, and operations across the firm. This is a high-impact, hands on engineering role. You will work closely with business stakeholders, data consumers, and cross functional technology teams to deliver robust, scalable, and well governed data pipelines and platforms on modern cloud infrastructure. Reference Data at Jefferies is foundational - the data you build and manage powers trading systems, regulatory reporting, risk models, and client facing applications globally. About the Team The Reference Data Group is responsible for the authoritative master data for securities, accounts, and counterparties at Jefferies. The team manages end to end data ingestion from vendors and internal systems, normalization, golden record creation, and distribution to downstream consumers across the firm. We operate on a modern cloud native stack centered on Snowflake, AWS, and Apache Airflow, and follow engineering best practices including CI/CD, code review, and automated testing. Key Responsibilities Design, build, and maintain scalable data pipelines for Security Master, Account Master, and Counterparty Master using Python and Apache Airflow. Develop and optimize complex data transformations, stored procedures, and views in Snowflake, ensuring high performance and data quality. Own the end to end lifecycle of reference data - from source ingestion and normalization through golden record creation and downstream distribution. 
Collaborate with data consumers across trading, risk, compliance, and operations to understand requirements and deliver reliable data products. Build and maintain infrastructure as code and deployment pipelines using AWS services, Git, and CI/CD tooling. Implement data quality frameworks, lineage tracking, and monitoring to ensure the accuracy, completeness, and timeliness of reference data. Participate in design and code reviews, contribute to engineering standards, and mentor junior engineers. Work with vendors and external data providers (e.g. Bloomberg, Refinitiv) to onboard and manage data feeds. Contribute to platform modernization initiatives and help drive adoption of best practices across the team. Troubleshoot production data issues, perform root cause analysis, and implement preventative measures. Required Skills and Experience Required: 7+ years of hands on data engineering experience Expert level Python for data engineering and automation Strong Snowflake experience - SQL, stored procedures, streams, tasks, and performance tuning Production experience with Apache Airflow - DAG design, scheduling, dependency management Solid AWS cloud experience - S3, Lambda, Glue, IAM, or equivalent services Proficient with Git, branching strategies, pull requests, and code review workflows Experience with CI/CD pipelines - GitHub Actions, Jenkins, or equivalent Strong understanding of data modelling - dimensional, relational, and hub spoke patterns Experience building and operating production grade data pipelines at scale Financial services experience is preferred but not required. Strong candidates from other industries with excellent data engineering credentials and a desire to learn financial domain concepts are encouraged to apply. 
Nice to have: Experience with financial reference data - Security Master, Counterparty, or Account data Knowledge of financial instruments - equities, fixed income, derivatives, or FX Familiarity with data vendors such as Bloomberg, Refinitiv, or FactSet Experience with data governance, lineage tools, or metadata management Familiarity with dbt or similar transformation frameworks Exposure to Kafka or event driven data architectures Experience in a regulated financial services environment Core Competencies Communication: Ability to clearly articulate technical concepts to non technical stakeholders including business analysts, traders, and senior management. Collaboration: Strong team player who works effectively across engineering, business, and operations teams in a fast paced environment. Problem Solving: Analytical mindset with a track record of diagnosing complex data quality and pipeline issues in production environments. Ownership: Takes end to end accountability for data products - from design through delivery, monitoring, and continuous improvement. Adaptability: Comfortable managing multiple priorities and adapting to changing business requirements in a dynamic financial services environment. What we offer Opportunity to work on high visibility, firm critical data infrastructure used across global trading and operations. Collaborative, engineering led culture with strong emphasis on code quality, testing, and continuous improvement. Access to modern cloud tooling and the opportunity to influence platform architecture decisions. Exposure to a wide range of financial products and business domains across a leading global investment bank. About Us Jefferies is a leading global, full service investment banking and capital markets firm that provides advisory, sales and trading, research, and wealth and asset management services. With more than 40 offices around the world, we offer insights and expertise to investors, companies, and governments. 
At Jefferies, we believe that diversity fosters creativity, innovation and thought leadership through the infusion of new ideas and perspectives. We have made a commitment to building a culture that provides opportunities for all employees regardless of our differences and supports a workforce that is reflective of the communities where we work and live. As a result, we are able to pool our collective insights and intelligence to provide fresh and innovative thinking for our clients. Jefferies is an equal employment opportunity employer, and takes affirmative action to ensure that all qualified applicants will receive consideration for employment without regard to race, creed, color, national origin, ancestry, religion, gender, pregnancy, age, physical or mental disability, marital status, sexual orientation, gender identity or expression, veteran or military status, genetic information, reproductive health decisions, or any other factor protected by applicable law. We are committed to hiring the most qualified applicants and complying with all federal, state, and local equal employment opportunity laws. As part of this commitment, Jefferies will extend reasonable accommodations to individuals with disabilities, as required by applicable law. Job Info Job Identification 4534 Job Category Information Technology Posting Date 06/23/2026, 04:02 PM Job Schedule Full time Locations 100 Bishopsgate, London, EC2N 4JL, GB
27/06/2026
Full time
We're changing the way we do things, and putting industry leading innovation at the heart of how we operate; we need a stellar engineering team to make it happen. You'll be joining one of the most iconic brands in the UK on its most exciting cycle yet. We're more integrated and product led in our tech teams than ever before: learning, changing, and adapting constantly, with millions of people benefiting from your work every single day. You'll be joining the M&S Platform team as a Software Engineering Manager. Our mission is to streamline development at M&S for 1000+ engineers and 30+ applications - covering both our customer-facing and colleague-facing applications. You will manage the team building the SRE function. You will support the team and help direct the technical vision for how we will build reusable pipelines, tooling, build plugins and frameworks that our many apps will harness to boost the developer experience, while contributing code every step of the way. We want to make M&S one of the best places for software development, target the latest tooling and enable our engineers to focus on building the best user experience so that they don't get held up by common engineering and continuous integration problems. What You'll Do Being accountable for engineering excellence within your teams, from behaviours to operations, from technical direction to solution in production and from skills and growth to reputation. Cultivating self-management and accountability throughout the team via leadership, clear sense of purpose and thoughtful talent management. Lead alignment with the overarching technical strategy and working with the wider Technology organisation to craft it. Act as a platform owner, apply accurate product thinking to what is being built with a view to enable and empower as much as possible the digital product(s) that it supports through data and customer centricity, driving the related partner management. 
Collaborate with the entire engineering leadership, to make us think strategically, to ensure maximum alignment and to maintain a healthy ability to "think big" within their teams. Line management of supporting Staff and Senior Engineers, as well as driving recruitment and retention within the team. Act as custodian of OKRs within your hub. Support our engineering communities to drive bar raising and strategic alignment, by creating the space and time within teams for the agenda of these communities to be progressed efficiently. When vendors are involved, the SEM is positioned as the key technical partner. They own it as a vital part of their platform with the same level of reliability and satisfaction as the in house capabilities they build and run. Who You Are Previous polyglot hands on senior software engineer Excellent knowledge in SRE concepts and trade offs Extensive background in software engineering with several years' experience in a variety of systems and technologies Experience building and leading teams of highly skilled, senior software engineers that deliver high quality software. Excellent understanding of system design, software architecture, cloud, and software engineering standard methodologies. Promoter of DevOps: you build it, you run it. Strong understanding of testing strategies and reliability engineering Excellent people management, interpersonal, analytical, and problem solving skills Ability to lead and line manage senior engineers and technical partners to a desired outcome, without prescribing it. Excellent communication skills, both written and spoken, and able to adjust for different audiences, including non technical ones. A servant leadership mentality that is willing to take ownership of problems. 
Able to influence people at senior levels and from the highly technical to non technical Our Tech Stack SLOs, SLIs, error budgets, and reliability strategy Observability with metrics, logs, traces, and alerting Incident response, on call operations, postmortems, and runbooks Cloud infrastructure, Kubernetes, containers, and distributed systems CI/CD, progressive delivery, rollback, and release safety Automation, toil reduction, and self healing systems Capacity planning, performance engineering, and cost awareness Security, access control, compliance, and operational governance What's In It For You Working at M&S means being part of something bigger - helping to deliver quality, value and service to millions of customers every day. We're inclusive, fast moving and always evolving, with a strong sense of purpose and a focus on doing the right thing. Here are just a few of the benefits that make working here even more rewarding: 20% colleague discount on all M&S products and many third party brands for you and someone in your household, available once you've completed your probation Competitive holiday allowance with the option to buy more Discretionary bonus schemes linked to your performance and ours Strong pension and life assurance to help plan for the future Tailored induction and training to support your development from day one Exclusive perks and savings through our M&S Choices portal Market leading family policies, including parental, adoption and neonatal leave 24/7 wellbeing support, including virtual GP access and mental health services One paid volunteer day a year to support a cause that matters to you Everyone's Welcome We are ambitious about the future of retail. We're disrupting, innovating and leading the industry into a more conscientious, inspiring digital era. We're transforming how we work together and offering our most exciting opportunities yet. 
Marks & Spencer strives to be an inclusive organisation, trusted and admired by our colleagues, customers and suppliers. Join us and make change happen. We are committed to building diverse and representative teams, where everyone can bring their whole selves to work and be at their best. We support each other and work together to win together. If you feel you'd benefit from any support or reasonable adjustments during any stage of the recruitment process, please don't hesitate to let us know when completing your application. This information will be picked up by our team, so we can try and put steps in place to help you be at your best through this process.
27/06/2026
Full time
Reddit is a community of communities. It's built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet. Every day, Reddit users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and approximately 126 million daily active unique visitors, Reddit is one of the largest and most influential platforms on the internet. As Reddit continues to scale globally, reliability and performance are more critical than ever. The Site Experience SRE team sits at the intersection of infrastructure, product engineering, and user experience - ensuring that every interaction across web, mobile, APIs, feeds, media delivery, and real time systems is fast, reliable, and resilient. We are looking for a Staff Site Reliability Engineer to lead reliability engineering initiatives for critical user facing systems at internet scale. In this role, you will partner closely with product and infrastructure teams to improve availability, latency, scalability, and operational excellence across Reddit's most business critical experiences. This is a highly technical leadership role for someone who thrives in large-scale distributed systems, enjoys solving complex reliability challenges, and can influence engineering culture across the organization. What you'll do: Lead Reliability Engineering for User Experience Drive reliability, scalability, and operational excellence for critical user facing systems and services. Improve performance and resiliency across APIs, content delivery, feed generation, search, messaging, and real time experiences. Architect for Scale Partner with product and infrastructure engineering teams to design systems that remain highly available and performant under massive global load. Guide architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning. 
Reduce Operational Risk Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure. Build proactive mitigation strategies and drive engineering improvements that reduce incidents and improve service health. Drive Automation Eliminate repetitive operational work through automation and tooling. Build systems that improve deployment safety, incident response, remediation workflows, and reliability guardrails. Incident Management Lead complex incident response efforts across engineering teams. Drive blameless postmortems, identify root causes, and ensure sustainable long-term fixes are implemented. Influence Engineering Standards Define and champion best practices around reliability engineering, SLIs/SLOs, capacity management, release engineering, and operational maturity across the company. Mentor and Multiply Impact Provide technical leadership and mentorship to engineers across SRE and software engineering teams. Help shape reliability culture and raise the operational excellence bar across the organization. What We're Looking For 8+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large scale distributed systems. Strong collaboration and communication skills with the ability to influence technical direction across teams. Strong experience supporting high traffic, user facing production environments. Deep understanding of one or more: distributed systems, networking, Linux systems, cloud native architectures. Experience designing highly available systems with strong operational and reliability practices. Strong programming skills in languages such as Go, Python, or similar. Strong understanding of observability systems including metrics, logging, tracing, and alerting. Experience improving reliability through SLOs, automation, incident management, and performance optimization. 
Demonstrated ability to troubleshoot complex issues across applications, infrastructure, networking, and services. Nice to Have Experience operating systems at internet scale traffic volumes. Experience with Kubernetes, containers, cloud infrastructure, and modern deployment platforms. Familiarity with technologies such as Prometheus, Grafana, OpenTelemetry, Envoy, Kafka, ClickHouse, Cassandra, Redis, or similar distributed infrastructure technologies. Experience with CDN optimization, edge reliability, traffic engineering, or global infrastructure. Contributions to open source software or participation in technical communities. Experience leading large scale incident response and operational transformation initiatives. Why Join Reddit? You'll help shape the reliability and performance of one of the internet's largest platforms, influencing experiences used by millions of people every day. This is an opportunity to solve deeply complex engineering problems at massive scale while helping define the future of reliability engineering for a modern consumer platform. Benefits Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support. Family Planning Support. Gender-Affirming Care. Mental Health & Coaching Benefits. Group Personal Pension Scheme with Employer match. Private Medical and Dental Scheme. Income Replacement Programs. Bike to Work scheme. Flexible Vacation & Paid Volunteer Time Off. Generous Paid Parental Leave.
27/06/2026
Full time
We are looking for a highly skilled Senior Platform Engineer to join our dynamic Platform Engineering & Technology team. In this role, you will be a key player in designing, building, and maintaining the foundational infrastructure that powers our internal platforms and enables software development teams to deliver applications more efficiently, securely, and reliably, all within an Agile and Kanban-driven environment. You will champion our 'platform as a product' philosophy by creating robust, scalable, and automated solutions. As a senior member of the team, you will provide technical leadership, mentor other engineers, and drive the adoption of best practices in infrastructure management and automation.

Wickes is a home improvement retailer which offers a wide range of products for DIY and home improvement, with a strong focus on serving both the local trade and general public, including kitchen and bathroom installations. We are currently undergoing a technical transformation replacing many of the key systems used to run and operate our business.

What you'll be doing

Infrastructure Automation

- Design, implement, and manage highly available, scalable, and secure cloud infrastructure on Amazon Web Services (AWS) using Infrastructure as Code (IaC) principles.
- Deep expertise in core AWS services such as EC2, ECS/EKS, Lambda, VPC, S3, RDS, DynamoDB, SQS, SNS, CloudWatch, CloudTrail, IAM, KMS, etc.
- Optimize AWS resource utilization and costs while maintaining performance and reliability.
- Implement and enforce security best practices across all AWS environments.
- Troubleshoot and resolve complex infrastructure and application issues in production and non-production environments.
- Contribute to architectural decisions and long-term cloud strategy.

CI/CD Pipeline Development & Management

- Architect, develop, and maintain robust, automated, and efficient CI/CD pipelines for various applications and services (e.g., Jenkins, GitLab CI/CD).
- Implement strategies for continuous integration, continuous delivery, and continuous deployment, including blue/green deployments, canary releases, and rollbacks.
- Integrate security scanning tools (SAST, DAST, SCA) and quality gates into CI/CD pipelines.
- Drive the adoption of best practices for version control (Git), branching strategies, and pull request workflows.
- Automate testing, build, deployment, and release processes to accelerate software delivery and improve reliability.

DevOps & Automation

- Champion a DevOps culture within the organization, promoting collaboration, shared responsibility, and continuous improvement.
- Develop and maintain automation scripts and tools (e.g., Python, Bash, Go) to streamline operational tasks and reduce manual effort.
- Implement robust monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK Stack, Datadog, CloudWatch) to ensure platform health and performance.

Collaboration & Mentorship

- Collaborate closely with development teams, product managers, and other stakeholders to understand requirements and deliver effective solutions.
- Provide technical guidance and mentorship to junior engineers, fostering their growth and development.
- Participate in code reviews, design discussions, and architectural decision-making processes.
- Stay up to date with emerging AWS services, cloud technologies, and industry best practices.

Security & Compliance

- Ensure that security and compliance are built into the product roadmap from the outset.
- Work with security teams to ensure any changes or new deployments meet data protection and compliance requirements.
- Add all security recommendations from our vendor and hosting providers to the backlog.

Governance & Continuous Improvement

- Define KPIs and success metrics to measure platform effectiveness.
- Create and maintain comprehensive documentation for our infrastructure, processes, and tools in Confluence.
- Continuously assess and refine existing products to support evolving business needs.
Does this sound like you?

- You'll have a Bachelor's degree in business, computer science, computer engineering, system analysis or a related field of study, or equivalent.
- You'll have significant experience (ideally 10+ years) in a DevOps, Site Reliability Engineering (SRE), or Platform Engineering role.
- We also expect at least 7 years of hands-on experience designing, deploying, and managing complex infrastructure on AWS.
- You'll be able to demonstrate strong experience building and maintaining robust CI/CD pipelines from scratch and optimizing existing ones.
- Expert-level proficiency with AWS services and a deep understanding of cloud architecture principles.
- Extensive experience with Infrastructure as Code (IaC) tools, primarily Terraform.
- Strong expertise with Jenkins and GitLab CI/CD.
- Proficiency in at least one scripting language (e.g., Python, Bash, Go).
- Solid understanding of containerization technologies (Docker) and orchestration (Kubernetes/EKS/ECS).
- Experience with monitoring and logging tools (Datadog preferred).
- Strong understanding of networking concepts (TCP/IP, DNS, VPN, Load Balancing) and security best practices in the cloud.
- Familiarity with database technologies (relational and NoSQL).
- Experience with Git and collaborative development workflows.
- Familiarity with other Agile frameworks like Scrum, in addition to Kanban.
- Poise and ability to act calmly and competently in high-pressure, demanding situations.
- The ability to manage ambiguity and clarify requirements and objectives to ensure successful outcomes.
- High degree of initiative, dependability, and ability to work with minimal supervision while being resilient to change.
- Motivation and drive to achieve long-term business outcomes.
- Ability to work effectively in a team environment and contribute to the overall engineering and company-wide technical direction.
- Excellent written and verbal communication and presentation skills, with the ability to articulate new ideas and concepts to technical and non-technical audiences.

What's in it for you

We'll also equip you with a benefits package that includes:

- Competitive package including an annual bonus
- 25 days holiday plus bank holidays
- Enhanced contributory pension and life assurance
- Flexible hybrid working (2-3 days in Watford)
- Save as you earn scheme
- Colleague discount
- Discount platform including savings and cash back at numerous retailers, savings on gym membership, cycle to work scheme
- Wellbeing strategy with Employee Assistance Programme, financial education & loans, and access to parental, menopause, and fertility support

We're a Disability Confident Employer and committed to building a diverse workforce that reflects the communities we serve. We welcome applications from disabled people and are committed to providing an accessible recruitment process and workplace for everyone. If you require any support or reasonable adjustments, please let us know.
27/06/2026
Full time
Machine Learning Platform Engineering

We're on a mission to make money work for everyone, and our Machine Learning Platform team builds the systems that help teams across Monzo train, evaluate, deploy, and serve ML models and AI features safely and reliably. We work on backend services, Python libraries, model lifecycle tooling, evaluation workflows, and low-latency serving systems. Our users are internal ML engineers, scientists, and product teams building with ML and LLMs. The work matters because machine learning powers many important decisions and experiences at Monzo, from fraud checks and credit decisions to customer operations.

Location & Compensation

- London, UK (remote within the UK available).
- Salary £85,000 - £110,000 plus incentive awards tied to performance.
- Benefits include relocation support, visa sponsorship, flexible working hours, and a learning budget, among others.

Responsibilities

- Develop backend services, platform APIs, and production systems using Go.
- Write Python libraries, workflows, and tooling used by our ML engineers and scientists.
- Implement feature platforms and data workflows with Chronon, Feast, and DataHub.
- Build model training pipelines and experiment tracking using Vertex AI and Comet.
- Maintain AI observability, evaluation, and tracing using Langfuse.
- Deploy and maintain real-time serving on AWS and batch compute on GCP, including BigQuery data warehousing.

Qualifications

- Strong backend engineering background with experience in Go and Python.
- Experience with ML or AI platforms, including pipelines, feature stores, model serving, experiment tracking, or LLM tooling.
- Experience designing and operating distributed systems that handle scale, concurrency, and failure.
- Focus on developer experience and removing friction for internal teams.
- Comfort with ambiguity and the ability to shape a platform as it grows.
- Experience with strongly typed languages and writing backend software.
- Curiosity about system behavior in production, including reliability, latency, quality, safety, and operational risk.

This Might NOT Be the Right Fit If

- Your background is predominantly DevOps, SRE, or infrastructure operations.
- You are focused on data science or modelling rather than platform engineering.
- You have shipped AI product features but have not worked on the platform side (serving, evaluation, model lifecycle).

Benefits

- Competitive salary £85,000 - £110,000 plus incentive awards.
- Relocation assistance to the UK and visa sponsorship.
- Flexible working hours and trust to work the hours that suit you.
- Annual learning budget of £1,000 for books, training courses, and conferences.
- Additional benefits available - see our full benefits list.

Equal Opportunity Employer

Diversity and inclusion are a priority for us. We are an equal opportunity employer and will consider all applicants without regard to age, ethnicity, religion, sex, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity, or disability status.
27/06/2026
Full time
Marks & Spencer Plc is seeking a Software Engineering Manager to lead their engineering team. In this role, you will streamline development for over 1000 engineers and manage the SRE function. Your mission will include enhancing the developer experience with reusable tooling and pipelines. The ideal candidate will have extensive software engineering experience and a proven track record in leading teams to deliver high-quality software. Join M&S as they innovate at the forefront of retail technology.
27/06/2026
Full time
Carbon3ai Limited in the United Kingdom is seeking Automation Engineers to build a modern AI Platform Operations function from scratch. You will develop workflows, build tooling for performance enhancement, and maintain observability platforms within a hybrid work environment. Ideal candidates will have strong Python skills, experience with automation and observability tools like Prometheus and Grafana, and a passion for operational excellence. Diversity and inclusion are emphasized in our work culture.
27/06/2026
Full time
Engine by Starling is on a mission to partner with leading banks worldwide to build rapid-growth businesses on our technology. Engine is Starling's SaaS business, the technology that powers Starling. Two years ago we split out as a separate business. Starling has seen exceptional growth and success, thanks to our modern, in-house built technology. The SaaS platform is now available to banks and financial institutions globally, enabling them to benefit from innovative digital features and efficient back-office processes.

Hybrid Working

We have a hybrid approach to working here at Engine. Our preference is that you're located within a commutable distance of one of our offices so we can interact and collaborate in person.

About Engineering at Engine by Starling

The Cross Cutting Engineering team is the backbone of our innovation. We build and maintain the reliable, scalable, and maintainable infrastructure and tooling that powers our entire software delivery pipeline, from the first line of code to seamless production deployment and ongoing operations. We own the lifecycle of our features, tackling complex challenges with a first-principles approach and fostering a multi-disciplinary environment where you're encouraged to explore and contribute across the platform.

Platform Engineer

As a Platform Engineer at Engine, you'll be at the forefront of building and scaling our cutting-edge cloud-native banking platform across multiple global cloud providers and regions. We are looking for engineers with a strong SRE mindset, who embrace ownership of the entire software delivery pipeline, and are passionate about building internal tooling that empowers our technology teams.

What you'll get to do

- Building and Scaling Cloud Infrastructure: design, build, and maintain our cloud infrastructure across multiple providers (including GCP) and regions, ensuring scalability, reliability, and security.
- Building on Google Cloud: contribute to the build-out and optimisation of our core "Engine" on Google Cloud Platform using Java and Kubernetes.
- Scaling our SaaS Release Tooling: enhance and improve our multi-tenant, multi-region SaaS release and continuous deployment systems using Java, Golang, and Terraform.
- Empowering Developers: develop and maintain internal tooling using Java and Golang to improve developer experience and on-call efficiency.
- Automating Compliance and Security: build automation solutions in Golang to enforce compliance and security controls across our platform.
- Driving Efficiency: optimise the performance and reliability of our cloud environment with a strong focus on cost effectiveness.
- Embracing Automation: identify and implement automation opportunities to minimise manual processes across the platform lifecycle.
- Ensuring Security: implement and maintain robust security practices to protect our platform and customer data.
- Championing Best Practices: stay abreast of new technologies and industry changes, particularly in SRE practices and deployment automation, and share your knowledge with the team.
- Maintaining Compliance: contribute to ensuring our platform adheres to relevant industry standards such as ISO27001, SOC2, and PCI DSS.
- Collaborating and Learning: work closely with cross-functional teams, share your expertise, and contribute to our vibrant learning culture.
- Aiming for Greatness: strive for excellence in everything you do, maintaining a curious and inquisitive mindset.
- Documenting Solutions: design and document scalable internal tooling clearly and comprehensively.
- Taking Ownership: own features and improvements throughout their entire lifecycle.
- Participate in on-call: the option to join our on-call rota (not mandatory!) to deal with interesting technical issues and gain deep insights into our platform's behaviour.

Your place within the team will depend on your individual strengths and interests.
27/06/2026
Full time
Engine by Starling is on a mission to partner with leading banks worldwide to build rapid growth businesses on our technology. Engine is Starling's SaaS business, the technology that powers Starling. Two years ago we split out as a separate business. Starling has seen exceptional growth and success, thanks to our modern, in-house built technology. The SaaS platform is now available to banks and financial institutions globally, enabling them to benefit from innovative digital features and efficient back office processes. Hybrid Working We have a hybrid approach to working here at Engine. Our preference is that you're located within a commutable distance of one of our offices so we can interact and collaborate in person. About Engineering at Engine by Starling The Cross Cutting Engineering team is the backbone of our innovation. We build and maintain the reliable, scalable, and maintainable infrastructure and tooling that powers our entire software delivery pipeline-from the first line of code to seamless production deployment and ongoing operations. We own the lifecycle of our features, tackling complex challenges with a first principles approach and fostering a multi disciplinary environment where you're encouraged to explore and contribute across the platform. Platform Engineer As a Platform Engineer at Engine, you'll be at the forefront of building and scaling our cutting edge cloud native banking platform across multiple global cloud providers and regions. We are looking for engineers with a strong SRE mindset, who embrace ownership of the entire software delivery pipeline, and are passionate about building internal tooling that empowers our technology teams. What you'll get to do Building and Scaling Cloud Infrastructure: design, build, and maintain our cloud infrastructure across multiple providers (including GCP) and regions, ensuring scalability, reliability, and security. 
Onyx-Conseil is looking for a Platform Engineer to enhance our cloud banking platform. In this role, you will design and scale cloud infrastructure using Java and Golang, contributing to our continuous deployment systems and internal tooling. You will embrace a strong SRE mindset and participate in a hybrid working approach, collaborating closely with cross-functional teams. If you are passionate about cloud technologies and automation, this position offers significant growth opportunities.
27/06/2026
Full time
Linux Technical Support Engineering Jobs
Technical support engineers (TSEs) provide expert assistance to customers of software and infrastructure products. At Linux-centric companies (from kernel-adjacent distributions to observability platforms), TSEs need genuine Linux expertise to diagnose complex production issues and guide customers to resolution. The role is customer-facing with deep technical content, and many TSEs progress into SRE, solutions architecture, or product engineering.
27/06/2026
Full time
Overview Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. Converting legacy industrial and energy sites into modern data centre facilities, Era4 is combining brownfield regeneration opportunities with cleaner, efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public sector organisations. Role Summary We are seeking Automation Engineers who sit at the intersection of Site Reliability Engineering and modern AI-driven operations. Embedded within Era4's engineering-led Operations Centre, this role exists to build a modern AI Platform Operations function from scratch, designing tooling and agentic workflows. No legacy to deal with. Key Responsibilities Runbook Automation & Agent Development: Build agentic, executable workflows capable of triaging, diagnosing, and, where appropriate, autonomously remediating known failure patterns. Build and maintain LLM-backed agents targeting the observability stack, ITSM platform, and infrastructure APIs (e.g. DCIM, IPAM, hypervisor layers). Develop auditable, client-focused automations for client interactions and workflows, with appropriate controls. Develop safe, auditable automation with appropriate controls for higher-risk platform actions. Operational Tooling & Self-Service Enablement: Build internal tooling that empowers engineers and service desk analysts: CLI utilities, ChatOps integrations (Slack/Teams bots), status dashboards, and self-service automation hooks. Reduce dependency on DevSecOps and engineering teams for routine operational tasks through automation. Maintain and contribute to a library of automation assets, agent prompts, and runbook-as-code artefacts, version-controlled and peer-reviewed. Develop the automation layer around monitoring and event management: alert suppression logic, enrichment pipelines, correlation rules, and alert-to-ticket integrations. 
Continuously tune signal-to-noise ratios across monitoring tooling (Prometheus, Mimir, Grafana, or equivalent) to improve situational awareness. Design and implement event correlation and deduplication logic to reduce alert storms and improve incident context. Identify common operational patterns and tasks as candidates for automation; maintain and prioritise a toil-reduction backlog. Participate in post-incident reviews and translate findings into updated automation, runbooks, or agent logic. Contribute to the evolution of Era4's operational standards, tooling architecture, and agent framework. Operational: Prior experience in an SRE, Senior Operations, or Platform Engineering environment, with exposure to on-call operations and incident management processes. Experience in converting narrative runbooks into executable automation or codified decision trees. Understanding of ITIL-aligned incident and change management principles and ITSM tooling. Technical - Core Element Strong Python development skills, including scripting for automation, API integration, and data processing. Hands-on experience with observability and monitoring platforms: Prometheus, Grafana, Mimir, or equivalent. Experience integrating with ITSM platforms (ServiceNow, Halo, Jira Service Management, or similar) via API. Solid understanding of event-driven architectures, message queues, and webhook-based automation patterns. Strong understanding of managing GPU infrastructure in production, including key signals and metrics and the automation of workflows. Familiarity with Infrastructure as Code principles and cloud-native environments (Kubernetes, Terraform, or similar). Comfort operating in an API-first environment, integrating agents with infrastructure APIs, DCIM, IPAM, and hypervisor control planes. One or more would be an advantage Exposure to data centre or colocation operations, particularly high-density compute or GPU infrastructure environments. 
Experience with ChatOps tooling: building Slack or Microsoft Teams bots for operational workflows. Familiarity with DCIM platforms and telemetry pipelines (power, thermal, network). Knowledge of OpenTelemetry, distributed tracing, or log aggregation platforms (Loki, ELK, Splunk). Contributions to open-source observability or automation tooling. Experience in a start-up or scale-up environment where tooling is built from scratch. Why Join Era4 You'll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale. Diversity & Inclusion Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. United Kingdom - Hybrid (Visit to office / site required)
27/06/2026
Full time