DevOps SRE

About the Role

You are responsible for maintaining the reliability, availability, and performance of production systems. You will operate a modern cloud-native stack, work with Kubernetes and AWS, implement infrastructure as code with Terraform and Helm, support CI/CD pipelines, design observability with Prometheus, Grafana, and EFK, troubleshoot networking issues, respond to security-related incidents, and participate in on-call rotations to provide continuous operational coverage. You will also use AI-powered tools to automate work and improve productivity.

Requirements

  • 3+ years of hands-on DevOps / SRE experience
  • Strong production experience with Docker and Kubernetes
  • Solid knowledge of AWS (EKS, EC2, Organizations, RDS, S3, CloudWatch, Lambda, DynamoDB)
  • Experience with monitoring, logging, and alerting systems
  • Proficiency with Terraform, Helm, and GitLab CI (or similar)
  • Strong troubleshooting skills across infrastructure, CI/CD, and networking
  • Scripting experience with Bash and Python
  • Willingness to participate in on-call rotations
  • Familiarity with pub/sub systems (SQS, Kafka, or similar)
  • Experience with Redis, Airflow, Databricks, Spark/EMR
  • GitOps workflows and advanced Git usage
  • Experience supporting databases such as Postgres, Snowflake, or ClickHouse

Responsibilities

  • Own the reliability, availability, and performance of our production environments.
  • Operate production Kubernetes (EKS), including cluster upgrades and Helm deployments.
  • Manage scaling and capacity using KEDA, Karpenter, and HPA for resource optimization.
  • Manage AWS Cloud environments including EC2, Lambda, AWS Batch, ElastiCache, RDS, and more.
  • Evolve infrastructure as code using Terraform and Helm with security best practices.
  • Support GitLab CI/CD pipelines, resolving deployment issues and improving stability.
  • Design observability systems using Prometheus, Grafana, and EFK to reduce alert fatigue.
  • Solve networking issues involving TLS, load balancing, VPCs, NAT, and VPNs.
  • Support compliance initiatives and respond to security-related incidents.
  • Leverage AI-powered tools as a standard part of your workflow for automation and productivity.
  • Lead incident response end-to-end, including troubleshooting, mitigation, and resolution.
  • Perform deep-dive root cause analysis (RCA) to drive long-term corrective and preventive actions.
  • Participate in on-call rotations to provide consistent operational coverage.