SRE Manager
About the Role
In this role, you will lead and grow the SRE team to maintain the high availability, scalability, and reliability of production systems. You will own AWS cloud infrastructure operations, monitoring, security, resource management, and cost optimization in a 24/7 environment. You will lead incident management, troubleshooting, RCA, and post-incident improvements, and you will participate in and improve on-call and incident response processes.
You will ensure that infrastructure, cloud environments, and operational processes comply with security, audit, and regulatory requirements such as MAS TRM and ISO 27001. You will drive SRE best practices, including observability, alerting, SLA/SLO/SLI, capacity planning, disaster recovery, and high availability, and improve system performance, reliability, and operational efficiency through automation and architecture optimization.
You will build and maintain CI/CD, IaC, and GitOps workflows to improve deployment efficiency and system consistency, manage Kubernetes and EKS platforms and containerized infrastructure, and build and improve monitoring and observability platforms such as Grafana, ELK, CloudWatch, Zabbix, and Nagios.
You will collaborate closely with the Backend, Data, Security, and Product teams on architecture design and operational improvements. You will mentor team members, support their technical growth, drive cross-functional collaboration, and maintain operational documentation, SOPs, and incident reports.
Requirements
- 8+ years of Linux system administration and large-scale infrastructure experience
- 2+ years of team management or Tech Lead experience
- Hands-on experience operating high-traffic, high-availability cloud platforms in a 24/7 environment
- Strong experience with AWS services, including EC2, API Gateway, AppSync, VPC, IAM, networking, Lambda, Aurora, ElastiCache (Redis), CloudFront, CloudWatch, EKS, security services, SNS, Parameter Store, and Secrets Manager
- Strong Kubernetes and container infrastructure experience, including EKS administration and troubleshooting
- Experience with Infrastructure as Code and configuration management tools such as Terraform, Helm, and Kustomize
- Experience with CI/CD and GitOps tools such as Jenkins, GitHub Actions, Argo Workflows, and Argo CD
- Familiarity with observability and monitoring tools, including Grafana, ELK, Zabbix, and Nagios
- Experience managing distributed systems and related technologies such as MongoDB, Kafka, load balancers, and HA architectures
- Strong understanding of SRE / DevOps practices, including Incident Management, Capacity Planning, Disaster Recovery, and SLA/SLO/SLI
- Proficient in scripting or programming languages such as Bash, Python, or Golang
- Knowledge of cloud security, infrastructure security, and technical risk management
- Strong communication, collaboration, and problem-solving skills in fast-paced environments
- Experience in FinTech, Crypto, or high-availability platforms is a plus
- Familiarity with compliance and security frameworks such as MAS TRM and ISO 27001 is a plus
Responsibilities
- Lead and manage the SRE team to ensure high availability, scalability, and reliability of production systems
- Own AWS cloud infrastructure operations, monitoring, security, resource management, and cost optimization in a 24/7 environment
- Lead incident management, troubleshooting, RCA, and post-incident improvements
- Ensure infrastructure, cloud environments, and operational processes comply with security, audit, and regulatory requirements (e.g. MAS TRM, ISO 27001)
- Drive SRE best practices including observability, alerting, SLA/SLO/SLI, capacity planning, disaster recovery, and high availability
- Improve system performance, reliability, and operational efficiency through automation and architecture optimization
- Build and maintain CI/CD, IaC, and GitOps workflows to improve deployment efficiency and system consistency
- Manage Kubernetes / EKS platforms and containerized infrastructure
- Collaborate closely with Backend, Data, Security, and Product teams on architecture design and operational improvements
- Build and improve monitoring and observability platforms such as Grafana, ELK, CloudWatch, Zabbix, and Nagios
- Mentor team members, support technical growth, and drive cross-functional collaboration
- Maintain operational documentation, SOPs, and incident reports
- Participate in and improve on-call and incident response processes
