DevOps Site Reliability Engineer

Posted 1 day ago | Senior | Singapore | Hybrid | Solidus Labs

About the Role

You will own the reliability, stability, and operational support of production systems. You will lead incident response end-to-end, troubleshoot and mitigate outages, and perform deep-dive root cause analysis to drive corrective actions. You will operate production Kubernetes (EKS), manage scaling and capacity using KEDA, Karpenter, and HPA, and evolve infrastructure as code with Terraform and Helm. You will support GitLab CI/CD pipelines, design observability systems with Prometheus, Grafana and EFK, and resolve networking issues involving TLS, load balancing, VPCs, NAT and VPN. You will respond to security-related incidents, support compliance initiatives, leverage AI-powered tools for automation, and participate in on-call rotations to provide operational coverage.

Requirements

3+ years of hands-on DevOps / SRE experience
Strong production experience with Docker and Kubernetes
Solid knowledge of AWS (EKS, EC2, IAM, RDS, S3, CloudWatch, Lambda)
Experience with monitoring, logging and alerting systems
Proficiency with Terraform, Helm and GitLab CI
Strong troubleshooting skills across infrastructure, CI/CD and networking
Scripting experience with Bash and Python
Fluent English and willingness to participate in on-call rotations
Familiarity with messaging and pub/sub systems such as SQS, RabbitMQ or Kafka
Nice to have: Experience with Redis, Airflow, Databricks, Spark/EMR
Nice to have: GitOps workflows and advanced Git usage
Nice to have: Experience supporting Postgres, Snowflake or ClickHouse
Nice to have: Proficiency in Mandarin

Responsibilities

Own reliability, availability and performance of production environments
Lead incident response end-to-end including troubleshooting, mitigation and resolution
Perform deep-dive root cause analysis and drive long-term corrective actions
Operate production Kubernetes (EKS) including cluster upgrades and Helm deployments
Manage scaling and capacity using KEDA, Karpenter and HPA
Evolve infrastructure as code using Terraform and Helm following security best practices
Support GitLab CI/CD pipelines and resolve deployment issues
Design and maintain observability using Prometheus, Grafana and EFK
Troubleshoot networking issues involving TLS, load balancing, VPCs, NAT and VPN
Support compliance initiatives and respond to security-related incidents
Leverage AI-powered tools to automate tasks and improve productivity
Participate in on-call rotations to provide consistent operational coverage
