Site Reliability Engineer
About the Role
We're looking for a Site Reliability Engineer to join a team in the energy sector. The role focuses on the performance, reliability, and scalability of critical infrastructure. You'll help shape middleware architecture, drive automation, and improve service resilience, and your work will directly affect operational efficiency and user experience.
Key Responsibilities
* Improve system reliability, availability, and performance across infrastructure and services
* Design and maintain middleware solutions including Kafka, message queues, and API gateways
* Define, monitor, and uphold SLAs and SLOs in line with business priorities
* Lead incident response, root cause analysis, and post-incident review processes
* Manage infrastructure as code with Terraform and Ansible, and containerized workloads with Kubernetes and Helm
* Automate deployments, workflows, and recovery procedures
* Set up and manage observability tooling such as Prometheus, Grafana, and the ELK stack
* Write internal tools and scripts using Python, Bash, or Go
What We're Looking For
* Experience in Site Reliability Engineering, DevOps, or a related infrastructure role
* Strong background in cloud platforms (AWS, GCP, or Azure)
* Hands-on experience with middleware systems such as Kafka and other message brokers
* Proficiency with infrastructure automation tools (Terraform, Ansible, Kubernetes, Helm)
* Scripting skills in Python, Bash, or Go
* Familiarity with observability tools and practices
* Understanding of event-driven systems and distributed architectures
* Strong troubleshooting skills and ability to work effectively under pressure
Additional Information
This is a hybrid role that requires regular on-site collaboration. Candidates must have valid work authorization in the relevant country; visa sponsorship is not available.