DevOps SRE

1 day ago · New York, USA · DevOps Jobs by Solidus Labs

About the Role

You are responsible for maintaining the reliability, availability, and performance of production systems. You will operate a modern cloud-native stack, work with Kubernetes and AWS, implement infrastructure as code with Terraform and Helm, support CI/CD pipelines, and design observability with Prometheus, Grafana, and EFK. You will troubleshoot networking issues, respond to security-related incidents, and participate in on-call rotations to provide continuous operational coverage. You will also use AI-powered tools to automate work and improve productivity.

Requirements

3+ years of hands-on DevOps / SRE experience
Strong production experience with Docker and Kubernetes
Solid knowledge of AWS (EKS, EC2, Organizations, RDS, S3, CloudWatch, Lambda, DynamoDB)
Experience with monitoring, logging, and alerting systems
Proficiency with Terraform, Helm, and GitLab CI (or similar)
Strong troubleshooting skills across infrastructure, CI/CD, and networking
Scripting experience with Bash and Python
Willingness to participate in on-call rotations
Familiarity with pub/sub systems (SQS, Kafka, or similar)
Experience with Redis, Airflow, Databricks, Spark/EMR
Experience with GitOps workflows and advanced Git usage
Experience supporting databases such as Postgres, Snowflake, or ClickHouse

Responsibilities

Own the reliability, availability, and performance of our production environments.
Operate production Kubernetes (EKS), including cluster upgrades and Helm deployments.
Manage scaling and capacity using KEDA, Karpenter, and HPA for resource optimization.
Manage AWS cloud environments, including EC2, Lambda, AWS Batch, ElastiCache, RDS, and more.
Evolve infrastructure as code using Terraform and Helm with security best practices.
Support GitLab CI/CD pipelines, resolving deployment issues and improving stability.
Design observability systems using Prometheus, Grafana, and EFK to reduce alert fatigue.
Solve networking issues involving TLS, load balancing, VPCs, NAT, and VPN.
Support compliance initiatives and respond to security-related incidents.
Leverage AI-powered tools as a standard part of your workflow for automation and productivity.
Lead incident response end-to-end, including troubleshooting, mitigation, and resolution.
Perform deep-dive root cause analysis (RCA) to drive long-term corrective and preventive actions.
Participate in on-call rotations to provide consistent operational coverage.
