Senior Lead Site Reliability Engineer

  • JPMorgan Chase & Co.
  • 21/05/2026
  • Full time
  • Information Technology, Telecommunications

Job Description

Be an integral part of an agile team that's constantly pushing the envelope to enhance, build, and deliver top-notch reliability and observability for our most critical platforms.

As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Commercial & Investment Bank, you are an integral part of an agile team that works to enhance, build, and deliver trusted market-leading technology products in a secure, stable, and scalable way. Drive significant business impact through your capabilities and contributions, and apply deep technical expertise and problem-solving methodologies to tackle a diverse array of reliability, observability, and performance challenges that span multiple technologies and applications.

Job responsibilities
  • Regularly provides technical guidance and direction on site reliability practices to support the business and its technical teams, contractors, and vendors
  • Develops secure and high-quality production code for reliability tooling and telemetry pipelines, and reviews and debugs code written by others
  • Drives decisions that influence reliability design, observability architecture, application functionality, and technical operations and processes
  • Serves as a function-wide subject matter expert in one or more areas of site reliability, observability, or telemetry engineering
  • Leads resiliency design reviews and breaks down complex reliability problems into digestible work for other engineers, acting as a technical lead for large products
  • Acts as the main point of contact during major incidents, demonstrating the skills to identify and solve issues quickly to avoid financial losses, and champions blameless postmortem culture
  • Collaborates with team members and stakeholders to define comprehensive service level indicators, service level objectives, and error budgets
  • Designs, implements, and maintains operationally reliable, large-scale OpenTelemetry pipelines in hybrid on-premises/cloud environments, supporting telemetry ingestion, processing, and export to backends such as InfluxDB, Prometheus, Elasticsearch, and OpenSearch
  • Drives the assessment, refactoring, and incremental migration of custom legacy telemetry collection code to standardized OpenTelemetry instrumentation, reducing technical debt while maintaining system stability
  • Actively contributes to the engineering community as an advocate of firmwide frameworks, tools, and practices, and influences peers and project decision-makers to consider the use and application of leading-edge observability and reliability technologies
  • Adds to the team culture of diversity, opportunity, inclusion, and respect

Required qualifications, capabilities, and skills
  • Formal training or certification in software engineering concepts and advanced applied experience delivering system design, application development, testing, and operational stability
  • Advanced knowledge of reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices, with in-depth knowledge of one or more technical disciplines (e.g., cloud, observability, distributed systems)
  • Advanced proficiency in one or more programming languages (e.g., Java, Python, Go)
  • Advanced proficiency and experience in observability, including white-box and black-box monitoring, SLO-based alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and Elasticsearch
  • Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform)
  • Experience with containers and container orchestration (e.g., Docker, Kubernetes, ECS)
  • Hands-on experience with the design, deployment, and operation of OpenTelemetry collectors in production environments, including configuring, optimizing, and troubleshooting OTLP endpoints and receivers
  • Ability to tackle reliability design and functionality problems independently with little to no oversight
  • Practical cloud-native experience
  • Ability to build relationships and collaborate across different levels and stakeholder groups

Preferred qualifications, capabilities, and skills
  • Knowledge of distributed tracing, metrics, and logging best practices
  • Certification in AWS, Kubernetes, or relevant technologies
  • Proven track record in system health monitoring, capacity management, and blameless postmortems for high-availability services
  • Deep understanding of distributed system design principles, networking (TCP/IP, DNS, load balancing), and Linux internals
  • Contributions to open-source observability or telemetry projects
  • Experience working with agent control planes and management protocols; hands-on knowledge of OpAMP (Open Agent Management Protocol) is highly desirable