DevOps Lead
About the Role
You will own the infrastructure platform end-to-end: drive full Infrastructure as Code coverage, design and operate ephemeral development environments, and build deployment and GitOps pipelines. You will architect Terraform across multi-region AWS, scale ECS clusters, embed security and observability into infrastructure, manage incidents, and mentor the DevOps team.
Requirements
- Proven experience leading a DevOps or platform engineering team
- Extensive Terraform and Infrastructure as Code experience, including state management and module design
- Experience owning IaC migrations or greenfield IaC buildouts at scale
- Experience designing and operating ephemeral or on-demand environment platforms
- Strong familiarity with AWS Control Tower, VPC, Transit Gateway, PrivateLink, IAM, RDS, and cost management
- Familiarity with AI-assisted development workflows and tooling
- Proficiency with Python or Go for automation, Bash scripting, and GitHub Actions
- Experience with observability tooling such as New Relic, Prometheus, or CloudWatch
- 13+ years of experience in DevOps, SRE, or infrastructure engineering, including at least 2 years leading a team
Responsibilities
- Manage, mentor, and develop the DevOps team
- Define the team's roadmap and communicate infrastructure strategy
- Own architectural and tooling decisions end to end
- Deliver full Infrastructure as Code coverage and eliminate manual provisioning
- Architect and maintain Terraform codebases for multi-region, multi-account AWS
- Establish IaC governance, module standards, and automated drift detection
- Design and implement ephemeral development environments
- Ensure ephemeral environments are fast to provision, cost-efficient, and identical to production
- Integrate ephemeral environments into development workflows and AI-assisted coding toolchains
- Scale and manage ECS clusters, deployments, and autoscaling
- Own GitOps delivery via ArgoCD, including secure pipelines and rollback strategies
- Build fast, reliable CI/CD pipelines with AI tooling integration
- Oversee cloud operations for uptime, failover, and incident response
- Implement infrastructure security, including secrets management, least-privilege IAM, and network segmentation
- Establish observability across logging, metrics, and distributed tracing
