You will play a key role in developing, debugging and maintaining software to operate a large-scale compute platform. Your duties will include:

- Close collaboration with teams within and across organizations to support their workflows or integrate their technology into our platform
- Writing software to automate operations processes by developing services and tools
- Designing, implementing, and maintaining robust, scalable, and highly available services that support Infrastructure as Code (Terraform, Pulumi)
- Developing configuration management and fleet orchestration solutions powered by Ansible, Puppet, or others
- Monitoring on-server system performance, identifying bottlenecks, and implementing solutions to improve efficiency
- Conducting root cause analysis for on-server system failures and implementing preventive measures
- Writing and reviewing code, as well as generating and reviewing design documentation
- Participating in qualifications and rollouts of software to production clusters
- Participating in a business-hours rotation where engineers respond to platform issues for same-day resolution

Familiarity with the mechanics behind infrastructure management
Familiarity with node management systems like Ansible, Puppet or similar solutions
User-centred thinking and strong problem-solving skills with attention to detail
Strong systems programming skills and knowledge of operating systems (macOS and Linux) administration and troubleshooting
Experience with large-scale server provisioning and maintenance
Proficiency in Swift, Python, or similar languages in a systems context
Strong proficiency in Linux/Unix internals, administration, and troubleshooting
Operational knowledge of Kubernetes clusters
Solid understanding of networking protocols
Proven experience in systems software development