Infrastructure Engineer (Disaster Recovery and Capacity)

Full time Information Technology Telecommunications

Requirements

Strong experience in IT operations/Production Engineering
Experience of Linux administration
Experience with Kubernetes administration
Comfortable working in a scripting language suitable for automation tasks
Understanding of current recovery solutions and high availability architectures for cloud and on-premises environments
Understanding of current Capacity Management & Planning scenarios and tooling
Experience with Agile principles and practices
Expertise in problem diagnosis across complex, distributed systems
(Desirable) Experience supporting SaaS products
(Desirable) Experience with Incident Management, Post Mortems and related practices
(Desirable) Knowledge of observability and monitoring best practices
(Desirable) Experience operating within one or more public clouds (AWS, GCP, Azure)
(Desirable) Experience with configuration management and infrastructure as code

What the job involves

The Disaster Recovery & Capacity Engineer builds solutions that support Platform disaster response and crisis management activities in compliance with Engineering and Customer requirements. They also help provide and coordinate disaster preparedness for the organisation's Platform, helping ensure business continuity.
They also ensure we have enough resources to meet current and future Platform demand efficiently. This involves forecasting needs, capacity planning, monitoring performance (KPIs), managing risks (shortages and overloads), and developing strategies for optimisation.
Work with Engineering & Service Management to ensure that the disaster recovery (DR) and capacity plans drive DR strategy and procedures in both Cloud and data centre (DC) venues
Build out tooling that supports the DR plans and tracks progress and maturity against set KPIs and metrics
Work with Engineering & Service Management to ensure that disaster recovery solutions are adequate, in place, maintained, and tested as part of the regular operational life cycle
Provide ongoing feedback for risk management, mitigation, and prevention
Develop and implement capacity planning tooling, frameworks, policies, and strategies
Provide capacity requirements and impact assessments for new services or changes
Collaborate with other Platform managers to deliver objectives on our platform evolution roadmap