DevOps Site Reliability Engineer
About the Role
You will own the reliability, stability and operational support of production systems. You will lead incident response end-to-end, troubleshoot and mitigate outages, and perform deep-dive root cause analysis to drive corrective actions. You will operate production Kubernetes (EKS), manage scaling and capacity using KEDA, Karpenter and HPA, and evolve infrastructure as code with Terraform and Helm. You will support GitLab CI/CD pipelines, design observability systems with Prometheus, Grafana and EFK, and resolve networking issues involving TLS, load balancing, VPCs, NAT and VPN. You will also respond to security-related incidents, support compliance initiatives, use AI-powered tools to automate tasks, and participate in on-call rotations to provide operational coverage.
Requirements
- 3+ years of hands-on DevOps / SRE experience
- Strong production experience with Docker and Kubernetes
- Solid knowledge of AWS (EKS, EC2, IAM, RDS, S3, CloudWatch, Lambda)
- Experience with monitoring, logging and alerting systems
- Proficiency with Terraform, Helm and GitLab CI
- Strong troubleshooting skills across infrastructure, CI/CD and networking
- Scripting experience with Bash and Python
- Fluent English and willingness to participate in on-call rotations
- Familiarity with message queues and pub/sub systems such as SQS, RabbitMQ or Kafka
- Nice to have: Experience with Redis, Airflow, Databricks, Spark/EMR
- Nice to have: GitOps workflows and advanced Git usage
- Nice to have: Experience supporting Postgres, Snowflake or ClickHouse
- Nice to have: Proficiency in Mandarin
Responsibilities
- Own reliability, availability and performance of production environments
- Lead incident response end-to-end including troubleshooting, mitigation and resolution
- Perform deep-dive root cause analysis and drive long-term corrective actions
- Operate production Kubernetes (EKS) including cluster upgrades and Helm deployments
- Manage scaling and capacity using KEDA, Karpenter and HPA
- Evolve infrastructure as code using Terraform and Helm following security best practices
- Support GitLab CI/CD pipelines and resolve deployment issues
- Design and maintain observability using Prometheus, Grafana and EFK
- Troubleshoot networking issues involving TLS, load balancing, VPCs, NAT and VPN
- Support compliance initiatives and respond to security-related incidents
- Leverage AI-powered tools to automate tasks and improve productivity
- Participate in on-call rotations to provide consistent operational coverage
