About the Role

We are working with a leading international consultancy that is building scalable, production-grade AI SaaS products within their dedicated AI Lab. This is a greenfield opportunity: you will combine deep technical expertise with strategic vision to design and build AI-powered platforms that transform enterprise clients' business models.

The AI Lab is developing cutting-edge, large-scale AI products delivering sustained commercial impact. The team operates with a startup mindset: agile, flat hierarchies, and a genuine bias for experimentation and ownership.

The Opportunity

This is a rare full-stack platform engineering role that spans infrastructure architecture through to LLM operationalisation. You will own the platform layer end-to-end, from Kubernetes cluster operations and IaC through to model serving, RAG pipelines, and LLMOps.

Key themes of the role:

- Design and evolve a multi-tenant SaaS architecture with tenant isolation, per-tenant controls, and enterprise security
- Build automated tenant provisioning, safe rollouts (canary/feature flags), and noisy-neighbor protection
- Operationalise LLMs end-to-end: fine-tuning, evaluation, high-performance serving, monitoring, and embeddings workflows
- Drive MLOps foundations: automated training pipelines, experiment tracking, and scalable model deployment
- Manage Kubernetes clusters, GPU-heavy workloads, and autoscaling on AWS
- Build unified CI/CD pipelines shipping ML and application code seamlessly
- Implement comprehensive observability: logs, metrics, traces, model/data drift detection
- Embed enterprise security and compliance (IAM, RBAC, VPC design, secrets management, encryption) at every layer
- Design well-architected ETL/ELT pipelines, streaming systems, feature store integration, and workflow orchestration

Technical Requirements

Platform & Multi-Tenancy

- Proven patterns for tenant isolation (DB-per-tenant, schema-per-tenant, row-level security), tenant-aware caching, noisy-neighbor protection
- OIDC/OAuth2, tenant-aware RBAC/ABAC, SCIM provisioning, and audit logging for B2B SaaS

Kubernetes & Infrastructure

- Deep Kubernetes: cluster ops, HPA/VPA, node pools, GPU scheduling, Karpenter, PDBs, network policies, multi-AZ design
- Service mesh (Istio/Linkerd), ingress patterns (ALB/Nginx), secure egress, mTLS
- Infrastructure as Code beyond basics: Terraform modules, Terragrunt, policy-as-code (OPA/Conftest), secrets automation
- GitOps (ArgoCD/Flux), progressive delivery (Argo Rollouts/Flagger), feature flags, canary and blue/green deployments

MLOps & Model Lifecycle

- Model lifecycle tooling: MLflow/W&B, model registry, experiment tracking, reproducible training, dataset versioning (DVC/lakeFS)
- Pipeline orchestration (Airflow, Prefect, or Dagster) and artifact stores
- Model serving with KServe, Seldon, BentoML, or Ray Serve: online and async/batch inference, autoscaling, rollback

LLMOps

- Prompt and version management, offline/online evaluation harnesses, RAG evaluation (retrieval metrics, groundedness), guardrails, red-teaming basics
- Streaming inference (SSE/WebSockets), caching, routing, fallback models
- Vector DB experience (pgvector, Pinecone, Weaviate, or Milvus): embedding lifecycle, backfills, re-embedding, indexing strategies

Observability & Security

- OpenTelemetry, tracing, SLOs (Prometheus/Grafana, Loki/ELK, Datadog/New Relic)
- Incident management: postmortems, runbooks, error budgets
- GDPR, encryption at rest/in transit, secrets management (AWS Secrets Manager/Vault), KMS, key rotation
- SOC 2 / ISO 27001 familiarity, vulnerability scanning (Trivy/Grype), SBOMs, SAST/DAST

About You

- You have shipped and operated customer-facing SaaS products at scale with real users
- You have owned end-to-end ML/AI infrastructure, from data ingestion through to production monitoring
- You enable engineers and data scientists to move faster through self-service platforms and automated workflows
- You have a track record of designing systems that scale globally across regions and traffic patterns
- You are comfortable with incident response, on-call rotations, and stabilising critical production systems
- You think with a product mindset: customer value, reliability, and speed-to-market over technology for its own sake
- You have a strong bias for automation and eliminating manual operational toil
- You have excellent communication skills: async collaboration, documentation, and explaining technical decisions to non-technical audiences

What's on Offer

- Genuine greenfield platform engineering ownership: build it from scratch
- Startup atmosphere with flat hierarchies within a globally established firm
- Hybrid working, international mobility across a wide office network
- Extensive learning and development programmes
- Competitive package including bonus