Job Description
We are seeking a Lead Site Reliability Engineer (SRE) to join the Google Cloud Site Reliability Engineering team at JPMorgan Chase. The team is part of the Infrastructure Platform - Cloud Foundational Services SRE organization and operates within a global follow-the-sun support model.
Job Responsibilities
- Lead the implementation of SRE frameworks that support global Google Cloud environments and uphold service level objectives (SLOs) through operational excellence.
- Master application, data, infrastructure, and Agentic AI disciplines.
- Understand financial control and budget management, and partner with colleagues to lead collaborative teams to achieve common goals.
- Use enterprise-authorized AI capabilities within the work environment to accelerate major incident triage, troubleshooting, and post-incident analysis, validating outputs and handling operational data according to sensitivity and security requirements.
- Help develop and improve the quality of technical engineering documentation.
- Provide technical supervision, oversight, and problem resolution for engineering activities.
- Champion a DevOps model so that services are automated and elastic across all platforms.
Required Qualifications, Capabilities, and Skills
- Google Cloud and Azure expertise in a mission-critical production environment.
- Strong understanding of container technologies such as Docker, Kubernetes, GKE, and Helm.
- Programming experience in Python, shell scripting, or Go, with a good understanding of REST APIs.
- Hands-on experience with cloud-based technologies and tools, especially for deployment, monitoring, and operations, such as Google Observability, Azure Monitor, Datadog, Prometheus, Splunk, Elasticsearch, and Grafana.
- Experience using enterprise-authorized AI capabilities within the work environment to improve SRE workflows, with strong validation habits and awareness of data sensitivity.
- Ability to evaluate AI-assisted operational recommendations for correctness and risk, define appropriate guardrails for team usage, and ensure outcomes align with resiliency and security expectations.
- Strong understanding of Google Cloud governance, compliance, and cost management.
- Proficient with modern development practices and tools such as Agile, CI/CD, Git, Infrastructure as Code, Terraform, and Jenkins.
- Google Cloud certification or equivalent technical experience in the public cloud.
- Good understanding of agentic AI SDKs and proficiency with GitHub Copilot.
Preferred Qualifications, Capabilities, and Skills
- Good understanding of operating systems such as Windows and Linux (Red Hat/Ubuntu).
- Good understanding of LLMs and other AI/ML frameworks that can be used in AIOps.
We recognize that our people are our strength and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and place a high value on diversity and inclusion at our company. We do not discriminate on the basis of any protected attribute, including race, religion, color, national origin, gender, sexual orientation, gender identity, gender expression, age, marital or veteran status, pregnancy or disability, or any other basis protected under applicable law. We also make reasonable accommodations for applicants' and employees' religious practices and beliefs, as well as mental health or physical disability needs.