Your Role:
Design, build, and operate the software and platforms that power our high-performance computing (HPC) and AI workloads. You will own core services, developer tooling, and workflow orchestration that enable researchers and Scientific Applications Engineers to deliver faster, more reliable results across our on-prem and cloud HPC infrastructure. This role complements our existing Scientific Applications Engineers by focusing on the software engineering, DevOps, and platform capabilities that improve how users interact with the HPC cluster.
The Scientific Computing team sits within the Data & AI Products group of our Data & AI organization. We lead oneHPC, the modernization of our company's compute stack across Life Science, Healthcare, and Electronics. The team operates an integrated on-prem and cloud platform used for simulation, data-driven research, and advanced machine learning.
Key Responsibilities:
1. Own the self-service platform (FastAPI/Step Functions backend, React portal, and CLI workflows) that lets researchers self-onboard, manage projects, and use the LDAP/Slurm/VAST integrations.
2. Implement Infrastructure as Code and configuration management for hybrid HPC + AWS environments.
3. Engineer container strategies for CPU/GPU workloads, including base images, CUDA/NCCL stacks, and reproducible builds.
4. Extend internal services, APIs, and SDKs (Python/TypeScript) that provide standardized access to HPC schedulers, data stores, and AI/GPU resources.
5. Design and implement CI/CD pipelines, artifact/version management, and automated testing for scientific software and internal tools.
6. Model and operate SQL and NoSQL data backends, including schema design and migrations.
7. Ensure transparency, monitoring, and performance insights for platform services and batch workloads (Prometheus/Grafana, structured logging, alerting, SLOs).
8. Partner with Scientific Applications Engineers and domain scientists to productionize ML/AI and simulation workflows; provide code reviews and documentation.
9. Help set up training sessions, workshops, and onboarding material that enable users to make effective use of HPC and cloud resources.
Minimum Qualifications:
1. Degree in Software Engineering, another STEM discipline, or a related field; equivalent practical experience accepted.
2. 3+ years building and operating production software/platforms; strong software engineering fundamentals and code quality practices.
3. Expert in Python; proficient in TypeScript.
4. Proven DevOps capability: Git-based workflows, automated testing, CI/CD (e.g., GitHub Actions), container registries, package publishing.
5. AWS proficiency: EC2/EKS/Batch, S3/EFS/FSx for Lustre, VPC/IAM, CloudWatch; Infrastructure as Code with Terraform or CloudFormation.
6. Proficiency working on Linux-based clusters.
7. Experience integrating with HPC schedulers (Slurm) and/or Kubernetes for batch/ML workloads.
Preferred Qualifications:
1. Experience with AI/ML infrastructure: GPU cluster operations, model training at scale (PyTorch/TensorFlow), experiment tracking (MLflow), model serving and artifact storage.
2. Data movement at scale for science: object storage strategies, parallel filesystems, data transfer tooling (e.g., Globus), checksum/lineage practices.
3. Exposure to scientific domains or packages (e.g., ORCA, VASP, GROMACS, LAMMPS, AlphaFold, RFDiffusion, EDEM, STAR‑CCM+).
Department: EF-DS-DPL Data Platform Portfolio
Level: Expert 3