HPC SRE - Quant Research, HFT

  • La Fosse Associates Limited
  • 16/03/2026
Full time Information Technology Telecommunications Python Software Engineer Testing

Job Description

Network Site Reliability Engineer - Python/GO, Observability, Monitoring, HPC

Within the Network Engineering Team, this role is critical in ensuring our clients High-Performance Computing (HPC) environments are supported by a resilient, data-driven, and software-defined network foundation.

We are seeking a Networks focused Site Reliability Engineer (SRE) with a focus on Observability, Telemetry, and Monitoring. In this role, you will apply a software engineering mindset to network operations, bridging the gap between traditional networking and modern Site Reliability Engineering (SRE).
You will be responsible for ensuring our high-performance network infrastructure is not just functional, but deeply visible. You will build the tooling and automation that allow the team to move from reactive troubleshooting to proactive, automated remediation and "self-healing" infrastructure.

Key Responsibilities:

  • Reliability Engineering: Apply SRE principles to the network; define and maintain SLIs, SLOs, and Error Budgets for network latency, packet loss, and availability.
  • HPC Connectivity & Performance: Support low-latency, high-throughput network architectures (eg, RDMA, RoCE) designed for intensive HPC and financial data workloads.
  • Advanced Telemetry: Design and manage high-cardinality telemetry pipelines to collect and analyze flow logs, metrics, and traces at scale.
  • Network Automation (Python/Go): Build and maintain internal software tools, APIs, and "self-healing" scripts to automate routine operations and complex failure recoveries.
  • Infrastructure-as-Code (IaC): Use Terraform to manage complex network configurations and observability stacks (Prometheus, Grafana, OpenSearch) as code.
  • Observability & Monitoring: Implement automated alerting and dashboarding that provide Real Time insights into network health and traffic patterns.
  • Incident Management & Post-Mortems: Lead technical troubleshooting for complex outages and conduct "blameless post-mortems" to drive systemic improvements.

Your Present Skillset
  • 3+ years of experience in a Network Reliability (NRE), SRE, or Network Operations role within a high-performance environment.
  • Software Engineering Mindset: Strong proficiency in Python and Go for building automation, custom exporters, or network management tools.
  • Observability Stack Expertise: Hands-on experience with Prometheus, Grafana, OpenSearch/Elasticsearch, and distributed tracing.
  • Networking Fundamentals: Deep knowledge of TCP/IP, BGP, EVPN, and routing/switching concepts in a high-bandwidth environment.
  • Infrastructure as Code: Proven experience using Terraform to ensure scalable, repeatable, and version-controlled network deployments.
  • HPC Awareness: Familiarity with the networking requirements of high-performance computing, such as non-blocking fabrics and low-latency interconnects.

Desirable Experience
  • Streaming Telemetry: Experience with gNMI, gRPC, or Kafka for Real Time network data streaming.
  • CI/CD for Networking: Familiarity with "NetDevOps" workflows, including automated testing (Pytest/Go test) and pipeline validation for network changes.
  • Container Networking: Knowledge of Kubernetes networking, CNI plugins, and Service Mesh (eg, Istio or Cilium).
  • Traffic Engineering: Experience with segment routing or advanced load-balancing strategies for high-performance workloads.