Managed Services Engineer
Operate and optimise production AI platforms for enterprise customers — monitoring, incident response, governance updates, and continuous improvement across governed, observable infrastructure.
About Deliverance AI
Deliverance AI is the production AI platform company. We exist because 94% of enterprises fail to scale AI — not for lack of ambition or budget, but for lack of platform, governance, and delivery capability. We close the gap between AI investment and AI production for enterprises across regulated industries including pharma and biotech, financial services, retail, telecommunications, and logistics.
Our proprietary nine-layer platform — built around three core capabilities: Clarity (see everything), Govern (control everything), and Accelerate (ship everything) — is deployed inside customer environments with live workloads running on it from day one. We are not a consultancy that writes strategy decks. We are not a staffing firm that lends contractors. We are an engineering-led company with proprietary platform IP, a growing agent marketplace, and 15+ pre-built AI blueprints that cut months off delivery timelines.
Our engagement model is simple: Assess (4 weeks), Deploy (12–16 weeks), Operate (ongoing). Dedicated engineering pods own delivery end-to-end. Every deployment compounds the platform. Every use case ships faster than the last. Governed, observable, and delivering value from day one.
About the role
The Managed Services Engineer is the operational backbone of our Operate phase — the recurring, compounding revenue stream where production AI platforms need continuous monitoring, governance updates, inference management, and performance optimisation. Once engineering pods have deployed the platform and onboarded 2–3 production workloads, you ensure the platform keeps running at peak performance, meets its SLAs, and delivers value month after month.
This is not a traditional NOC role. You will work with cutting-edge AI infrastructure across our nine-layer platform — AI Gateway routing, Agent Governance policies, Inference Platform autoscaling, GPU infrastructure scheduling, Data & RAG pipelines, and the Observability & FinOps layer that provides complete visibility into cost, performance, and compliance. You will use the Clarity capability to see everything running across a customer's AI estate, the Govern capability to ensure compliance-as-code stays current, and the Accelerate capability to optimise inference performance and cost.
As we scale our customer base, the Managed Services team will grow into a critical function — and early hires will shape the tools, runbooks, automation, and operational practices that define how we support enterprise AI in production at scale.
What you will do
- Monitor and maintain production AI platforms across enterprise customer environments, ensuring uptime, performance, and compliance SLAs are met through our Observability & FinOps layer.
- Manage agent deployments, model updates, rollbacks, and A/B testing in production — working closely with ML Engineers and customer teams via our Five Registries and Agent Governance layers.
- Build and maintain observability infrastructure — dashboards, alerting, log analysis, and telemetry — that leverages the Clarity capability to provide visibility into inference latency, throughput, GPU utilisation, and cost attribution per team and use case.
- Respond to and resolve production incidents, performing root cause analysis and implementing preventive measures across the platform stack.
- Optimise inference costs and GPU utilisation for customers — identifying underperforming workloads, right-sizing deployments, tuning AI Gateway caching and routing, and recommending configuration changes (the second sketch after this list shows the kind of cost-attribution logic involved).
- Ensure ongoing compliance by maintaining governance policies, ARMOR security framework configurations, and audit trails as regulatory requirements evolve (EU AI Act, NIST AI RMF, ISO 42001, GxP).
- Produce regular performance reports and ROI dashboards for customers, demonstrating the compounding value of their production AI platform investment.
- Develop and maintain operational runbooks, automation scripts, and self-healing infrastructure patterns that improve reliability and reduce manual intervention (the first sketch after this list gives a flavour).
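To give a flavour of the self-healing patterns this role builds, here is a minimal, purely illustrative Python sketch: it polls a Prometheus-compatible endpoint for p95 inference latency and adds a replica to a Kubernetes serving deployment when an SLA threshold is breached. The metric name, SLA value, namespace, and deployment name are all invented for the example; they are not our platform's real interfaces.

```python
"""Illustrative self-healing check: watch p95 inference latency and
scale the serving deployment when it breaches an SLA threshold."""

import requests
from kubernetes import client, config

PROMETHEUS_URL = "http://prometheus:9090"  # hypothetical in-cluster address
# Hypothetical histogram metric; real deployments expose their own names.
LATENCY_QUERY = (
    "histogram_quantile(0.95, "
    "sum(rate(inference_request_duration_seconds_bucket[5m])) by (le))"
)
SLA_SECONDS = 0.5   # example SLA: p95 under 500 ms
MAX_REPLICAS = 8    # illustrative guardrail against runaway scaling


def p95_latency() -> float:
    """Query Prometheus for the current p95 inference latency."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": LATENCY_QUERY},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def scale_up(namespace: str, deployment: str) -> None:
    """Add one replica to the inference deployment, up to the guardrail."""
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(deployment, namespace)
    replicas = dep.spec.replicas or 1
    if replicas < MAX_REPLICAS:
        apps.patch_namespaced_deployment(
            deployment, namespace, {"spec": {"replicas": replicas + 1}}
        )


if __name__ == "__main__":
    config.load_incluster_config()  # e.g. run as a CronJob in the cluster
    if p95_latency() > SLA_SECONDS:
        scale_up("ai-platform", "inference-server")  # hypothetical names
```

In practice this logic would live in the Observability & FinOps layer's alerting pipeline rather than a standalone script, but the check, decide, act loop is the shape of the work.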
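For the cost-optimisation side, a similarly hedged sketch of per-team cost attribution: it sums a hypothetical per-team GPU-seconds counter over 24 hours and converts it to a daily cost at an assumed blended rate. The metric name, the `team` label, and the rate are illustrative assumptions, not our platform's actual schema.

```python
"""Illustrative FinOps rollup: attribute daily GPU cost to teams
from a per-team utilisation counter in Prometheus."""

import requests

PROMETHEUS_URL = "http://prometheus:9090"  # hypothetical address
GPU_RATE_PER_HOUR = 2.50                   # assumed blended $/GPU-hour
# Hypothetical counter of GPU-seconds consumed, labelled by team.
QUERY = "sum(increase(gpu_seconds_total[24h])) by (team)"


def daily_gpu_cost_by_team() -> dict[str, float]:
    """Return each team's GPU spend for the last 24 hours."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": QUERY},
        timeout=10,
    )
    resp.raise_for_status()
    costs: dict[str, float] = {}
    for series in resp.json()["data"]["result"]:
        team = series["metric"].get("team", "unattributed")
        gpu_hours = float(series["value"][1]) / 3600
        costs[team] = round(gpu_hours * GPU_RATE_PER_HOUR, 2)
    return costs


if __name__ == "__main__":
    for team, cost in sorted(daily_gpu_cost_by_team().items()):
        print(f"{team}: ${cost:.2f}/day")
```

Numbers like these feed the per-team cost-attribution dashboards and customer ROI reports described above.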
What we are looking for
- 3+ years of experience in infrastructure engineering, site reliability engineering (SRE), DevOps, or managed services with a focus on production systems.
- Strong knowledge of Kubernetes, containerisation, and cloud-native infrastructure (AWS, GCP, or Azure).
- Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, ELK, or similar) and incident management practices.
- Familiarity with GPU infrastructure, inference workloads, and AI model serving frameworks is a strong advantage, as is a genuine eagerness to learn quickly in this space.
- Scripting and automation skills in Python, Bash, or Go, with experience building infrastructure-as-code (Terraform, Ansible, Helm).
- Understanding of security best practices, access management, and compliance requirements for regulated environments.
- Excellent problem-solving skills and the ability to remain calm and methodical under pressure during production incidents.
- Strong communication skills — you will produce customer-facing reports and participate in operational reviews with enterprise stakeholders. You represent the ongoing quality of our platform.