Site Reliability Engineer

About the Role

You will keep services online and performant around the clock. The work spans optimizing Kubernetes clusters (service mesh, metrics, and logging), benchmarking services to identify infrastructure bottlenecks, improving observability and alerting so that issues are caught before they impact users, scaling services to minimize downtime under load, and developing CI/CD pipelines for new and existing services.

Requirements

  • Hands-on experience with Kubernetes in production environments
  • Proficiency with Golang for systems and infrastructure tooling
  • Familiarity with confidential virtual machines (CVMs)
  • Experience with Prometheus, Loki, and Grafana for monitoring and observability

Responsibilities

  • Ensure services stay online and performant, including during off hours
  • Optimize Kubernetes clusters, including service mesh, metrics, and logging
  • Benchmark services and identify infrastructure bottlenecks
  • Improve observability and alerting systems to catch issues before they impact users
  • Scale services to minimize downtime under load
  • Develop CI/CD pipelines for new and existing services