Role Overview
We are looking for a highly skilled Senior AI/ML Ops Engineer (d/f/m) to join our Cloud Excellence team. In this role you will be responsible for designing and building AI infrastructure and tooling for the company, from agent frameworks in AWS to Vertex AI playgrounds in GCP, and from quota management to cost attribution integration. You will collaborate closely with Data Scientists, Software Engineers and the local AI champion team, all of whom contribute to the company's AI adoption goals. You will report to the Head of Cloud & Infrastructure.
The Team
The mobile.de Cloud and Infrastructure team provides tools, services and support to enable our software engineers within the public cloud. We manage infrastructure as code for containerised workloads on Kubernetes, integrate third-party services, run observability and manage generic cloud resources for products of mobile.de. We watch over our platform 24/7 to detect and tackle incidents in time. Following the company's anchor day approach, our team meets every Monday in the Berlin office for collaboration, meetings and lunch together.
Responsibilities
* You will contribute to the success and future of Germany’s biggest automotive marketplace with your knowledge and experience
* You will co-own operations and constantly improve our solutions by driving automation, simplicity, reliability and observability
* You will share your knowledge and mentor team members as well as software engineers in SRE principles
* You will heavily (re)use and design automated workflows in our GitHub ecosystem
* You will simplify our stack, choose cloud SaaS offerings where they provide the right value, and stay agile in replacing outdated solutions
* This role is all about making AI/ML successful for the company and we want you for this!
Your Profile
* You are able to support and help plan AI/ML projects from the operational and infrastructure perspective with best-in-class IaC standards
* You are proficient with cloud AI models, related frameworks (AgentCore), APIs (Bedrock, Vertex AI) and IAM integration
* You can optimize AI/ML workloads with your extensive knowledge of cloud scaling options and mechanisms
* You are experienced in supporting the deployment and scaling of agentic frameworks (LangChain, LangGraph) and retrieval-augmented generation (RAG) pipelines, including vector database management (Elastic/OpenSearch, S3 Vectors) and tool orchestration
* You can manage tooling like inference profiles and quotas for cost allocation
* You can debug build pipelines and spot software issues
* You like to investigate cloud costs and consult with software engineers on optimization options for their AI/ML products
* You can automate toil via scripting or programming languages (Python, Go)
* You have experience with different IaC stacks such as AWS CDK, Pulumi and Terragrunt
* You are proficient in public cloud (AWS, GCP) and containerisation concepts
* You champion observability and compliance by implementing robust monitoring, logging, and alerting for agentic and ML services
* Consulting and helping others with your expertise makes your day
* You are able to communicate fluently in English
* AI certifications (AWS, GCP) are a plus
Nice-to-Have Skills
* Experience deploying and managing self-hosted LLMs (for instance on K8s nodes with GPUs)
* Experience operating vector databases (Milvus, Qdrant, S3 Vectors, Elastic/OpenSearch, etc.)
* Experience in corporate-scale environments (using ADRs, Tech Radar, KPIs, etc.)
* You drive AI-first operational excellence by identifying and automating repetitive operational tasks using AI-powered tools and scripting, proactively improving efficiency, reliability, and cost-effectiveness
* Experience enforcing runtime guardrails and policy controls