Infrastructure Engineer (Disaster Recovery and Capacity)

  • Deepstreamtech
  • Manchester, Lancashire
  • 19/05/2026
Full time Information Technology Telecommunications

Job Description

Requirements
  • Strong experience in IT operations/Production Engineering
  • Experience of Linux administration
  • Experience with Kubernetes administration
  • Proficiency in a scripting language suitable for automation tasks
  • Understanding of current recovery solutions and high-availability architectures for cloud and on-premises environments
  • Understanding of current Capacity Management & Planning scenarios and tooling
  • Experience with Agile principles and practices
  • Expertise in problem diagnosis across complex, distributed systems
  • (Desirable) Experience supporting SaaS products
  • (Desirable) Experience with Incident Management, Post Mortems and related practices
  • (Desirable) Knowledge of observability and monitoring best practices
  • (Desirable) Experience operating within one or more public clouds (AWS, GCP, Azure)
  • (Desirable) Experience with configuration management and infrastructure as code
What the job involves
  • The Disaster Recovery & Capacity Engineer builds solutions that support Platform disaster response and crisis management activities in compliance with Engineering and Customer requirements, and helps provide and coordinate disaster preparedness for the organisation's Platform to help ensure business continuity
  • They also ensure we have enough resources to meet current and future Platform demand efficiently, which involves forecasting needs, capacity planning, monitoring performance (KPIs), managing risks (shortages/overloads), and developing strategies for optimisation
  • Work with Engineering & Service Management to ensure that the disaster recovery (DR) and capacity plans drive DR strategy and procedures in both Cloud and data centre (DC) venues
  • Build out tooling that supports the DR plans and tracks progress and maturity against set KPIs and metrics
  • Work with Engineering & Service Management to ensure that disaster recovery solutions are adequate, in place, maintained, and tested as part of the regular operational life cycle
  • Provide ongoing feedback for risk management, mitigation, and prevention
  • Develop and implement capacity planning tooling, frameworks, policies, and strategies
  • Provide capacity requirements and impact assessments for new services or changes
  • Collaborate with other Platform managers to deliver objectives on our platform evolution roadmap