Lead Site Reliability Engineer

  • Mastercard
  • Dunstable, Bedfordshire
  • 27/06/2026
  • Full time
  • Information Technology, Telecommunications

Job Description

The Business Operations team is seeking a highly motivated and experienced Lead Site Reliability Engineer (SRE). You will play a critical role in ensuring the reliability, scalability, and performance of our applications, supporting essential services that power Mastercard's global operations. As a thought leader in your field, you will bring technical expertise, a passion for automation, and the ability to mentor others.

Responsibilities
  • Act as a developing subject matter expert in Site Reliability Engineering, influencing stakeholders and applying advanced knowledge to achieve area goals and initiatives by contributing to solution development and to improvements in existing products, services, and processes.
  • Implement and maintain high-availability system solutions, ensuring stability, performance, and operational continuity.
  • Evaluate operational requirements to develop effective technical solutions within existing frameworks.
  • Lead automation and scripting efforts to streamline operational processes and incident response workflows.
  • Troubleshoot and resolve complex system issues, escalating as necessary to maintain system health and proactively address risks.
  • Contribute to documentation, knowledge sharing, and best practices to improve team operational procedures.
  • Conduct reviews and quality assurance activities to uphold organizational standards for system stability.
  • Keep current with industry trends and emerging technologies relevant to system reliability and operational automation.
  • Guide and mentor junior team members through on-the-job experience, reviewing their work and fostering a culture of continuous improvement that grows their expertise in the discipline.
Qualifications
  • Observability - Ability to use scripting and tooling to implement observability solutions, enabling the collection, analysis, and visualization of metrics, logs, and traces to support incident detection, diagnosis, and continuous service improvement.
  • Programming and Scripting - Ability to write and maintain code and scripts to automate tasks, build operational tools, and support monitoring, deployment, and incident response using languages such as Python, Go, Bash, or similar.
  • Systems and Network Administration - Ability to configure, operate, and troubleshoot Linux/Unix systems and network components, applying knowledge of networking concepts, protocols, security, and system reliability.
  • Cloud Computing and Infrastructure - Ability to design, deploy, and manage applications and infrastructure on cloud platforms (e.g., AWS, Azure, GCP), ensuring scalability, security, availability, and operational efficiency.
  • Reliability and Scalability - Ability to design and operate systems for high availability, fault tolerance, and disaster recovery, while ensuring systems can scale to meet current and future demand.
  • DevOps Practices - Ability to apply DevOps principles and practices, including CI/CD pipelines, containerization, and orchestration, to enable faster, more reliable software delivery and operations.
  • Troubleshooting - Ability to systematically identify, diagnose, and resolve technical issues across systems, applications, and networks, using analytical methods and tools to restore functionality, minimize disruption, and ensure stable operations.
  • Capacity Planning and Performance Optimization - Ability to monitor resource utilization, forecast future capacity needs, and optimize system performance to support growth, scalability, and efficient infrastructure usage.
  • IT Service Management - Ability to apply IT service management principles to incident, problem, and change management, ensuring reliable service delivery, effective incident response, and continuous service improvement aligned to business needs.
  • Proactive Monitoring and Improvement (SRE Applications) - Ability to use application reliability signals to anticipate issues, identify risks, and drive preventative improvements that enhance application performance and availability.
Corporate Security Responsibility

All activities involving access to Mastercard assets, information, and networks come with an inherent risk to the organization. Each person working for, or on behalf of, Mastercard is responsible for information security and must: abide by Mastercard's security policies and practices; ensure the confidentiality and integrity of the information being accessed; report any suspected information security violation or breach; and complete all periodic mandatory security training in accordance with Mastercard's guidelines.