We are looking for a senior SRE / DevOps practitioner to design, standardise, and operate cloud platforms that support multiple AI-driven products and services. This role focuses on building opinionated, reusable infrastructure patterns that enable teams to rapidly deliver AI workloads while maintaining high standards for reliability, security, and cost control.
You will develop platform architecture across multiple concurrent projects, ensuring consistency in how services are deployed, integrated, and operated. This includes shaping how workloads are built, deployed, and monitored, as well as defining clear patterns for service communication, API exposure, and infrastructure provisioning.
This is a hands-on role for someone who is comfortable making strong architectural decisions, reducing variability across teams, and balancing flexibility with standardisation in a fast-moving environment.
Platform Architecture & Standardisation
- Define and implement opinionated architecture patterns for cloud-native and AI-enabled services on AWS
- Establish reusable blueprints for these services
- Drive consistency across multiple projects through shared modules, templates, and platform tooling
Infrastructure as Code & Automation
- Build and maintain Terraform-based infrastructure, using modular and reusable design
- Define CI/CD patterns for:
  - infrastructure deployment
  - application and model delivery
- Enforce best practices through pipelines and automation rather than documentation
Reliability, Observability & Operations
- Embed SRE principles across all services:
  - monitoring, logging, tracing
  - SLIs/SLOs and alerting
- Continuously improve reliability, performance, and cost efficiency
- Operate API gateway/data plane technologies (e.g. Kong)
Required Skills & Experience
- Strong experience operating AWS-based platforms in production
- Proven experience with Terraform, including module design and CI/CD integration
- Hands-on experience with container platforms (ECS preferred; EKS acceptable if adaptable)
- Experience operating API gateways (Kong or equivalent)
- Solid understanding of cloud networking and service discovery patterns
- Experience supporting multiple teams or projects on a shared platform
- Strong troubleshooting and production operations experience
AI / Data Platform Experience (Required)
- Practical experience running or supporting AI/ML workloads in production, such as:
  - model inference services
  - batch processing pipelines
  - integration with LLM APIs or hosted models
- Understanding of:
  - scaling characteristics of AI workloads
  - cost considerations (compute-heavy workloads, GPU usage, etc.)
- Familiarity with tooling such as:
  - model serving frameworks
  - data processing pipelines
  - managed AI services on AWS
Competitive day rate