
GPU Cluster Engineer (human)

Metzingen
Neura Robotics
Engineer
Posted 5 hours ago
Description

Your mission & challenges

You are the go-to expert for NEURA's GPU cluster infrastructure: a large-scale AWS HyperPod environment running cutting-edge GPU instances for foundation model training and customer fine-tuning workloads. You design the operational framework, build self-service tooling for ML teams, and work directly with AWS to influence the platform at the hyperscaler level. Your focus is on cluster engineering and operations: not on ML research itself, but on making sure the people doing that research have rock-solid, efficient, and accessible infrastructure under them.

  • Setting up, configuring, and continuously evolving NEURA's HyperPod clusters, including the HyperPod/Slurm and HyperPod/EKS orchestration models.
  • Designing and implementing strategies for cluster stability: node failure detection, automated job recovery, checkpoint coordination, and fault-tolerant multi-node training workflows.
  • Providing a workload priority management framework that allows multiple teams and use cases, such as foundation model pretraining, fine-tuning, and customer workloads, to share cluster capacity efficiently and fairly.
  • Optimizing end-to-end GPU utilization: identifying and resolving bottlenecks across compute, GPU memory, EFA networking, and storage throughput.
  • Working directly and closely with the AWS HyperPod product and solutions engineering teams: escalating operational issues, sharing learnings from one of the platform's largest deployments, and placing concrete requirements on the roadmap.
  • Providing self-service tooling that allows ML researchers and engineers to launch, monitor, and manage training jobs independently, without requiring infrastructure intervention for routine operations.
  • Developing onboarding documentation, training materials, and internal workshops that enable users to operate efficiently, follow best practices, and understand the cost implications of their workloads.
  • Treating Infrastructure as Code as a given: every cluster configuration, every operational change, and every new environment is code first.
  • Owning the cost and capacity strategy: Spot instance management, Reserved Instance planning, Savings Plans, and ongoing commitment negotiations with AWS.

What we can look forward to

  • 5 years of experience in infrastructure or systems engineering, with a strong focus on GPU cluster or HPC operations.
  • Deep hands-on experience with AWS HyperPod and AWS instances; direct prior experience with HyperPod is a strong differentiator.
  • Solid understanding of both Slurm and Kubernetes as cluster orchestration layers, and the ability to evaluate their trade-offs for large-scale GPU workloads.
  • Practical knowledge of distributed training: you understand what affects throughput and how to debug it.
  • Experience building self-service tooling and operational documentation for technical end users. You make complex infrastructure accessible, not just functional.
  • Strong understanding of cloud cost management at scale: Spot interruption handling, capacity reservations, and cost attribution across teams and workloads.
  • Comfort working across organizational boundaries: your primary partners are ML researchers, but you'll also work closely with product, finance, and cloud vendor teams.
  • Strong English communication skills. German is a plus.


© 2026 Jobijoba - All rights reserved
