DevOps / SRE
About the Role
You will be responsible for maintaining the reliability, availability, and performance of production systems. You will operate a modern cloud-native stack on Kubernetes and AWS, implement infrastructure as code with Terraform and Helm, support CI/CD pipelines, and design observability with Prometheus, Grafana, and EFK. You will also troubleshoot networking issues, respond to security-related incidents, and participate in on-call rotations to provide continuous operational coverage, using AI-powered tools to automate work and improve productivity.
Requirements
- 3+ years of hands-on DevOps / SRE experience
- Strong production experience with Docker and Kubernetes
- Solid knowledge of AWS (EKS, EC2, Organizations, RDS, S3, CloudWatch, Lambda, DynamoDB)
- Experience with monitoring, logging, and alerting systems
- Proficiency with Terraform, Helm, and GitLab CI (or similar)
- Strong troubleshooting skills across infrastructure, CI/CD, and networking
- Scripting experience with Bash and Python
- Willingness to participate in on-call rotations
- Familiarity with pub/sub systems (SQS, Kafka, or similar)
- Experience with Redis, Airflow, Databricks, Spark/EMR
- Experience with GitOps workflows and advanced Git usage
- Experience supporting databases such as Postgres, Snowflake, or ClickHouse
Responsibilities
- Own the reliability, availability, and performance of our production environments.
- Operate production Kubernetes (EKS), including cluster upgrades and Helm deployments.
- Manage scaling and capacity using KEDA, Karpenter, and HPA for resource optimization.
- Manage AWS cloud environments, including EC2, Lambda, AWS Batch, ElastiCache, and RDS.
- Evolve infrastructure as code using Terraform and Helm with security best practices.
- Support GitLab CI/CD pipelines, resolving deployment issues and improving stability.
- Design observability systems using Prometheus, Grafana, and EFK to reduce alert fatigue.
- Solve networking issues involving TLS, load balancing, VPCs, NAT, and VPNs.
- Support compliance initiatives and respond to security-related incidents.
- Leverage AI-powered tools as a standard part of your workflow for automation and productivity.
- Lead incident response end-to-end, including troubleshooting, mitigation, and resolution.
- Perform deep-dive root cause analysis (RCA) to drive long-term corrective and preventive actions.
- Participate in on-call rotations to provide consistent operational coverage.
