SRE Manager

Posted 1 month ago | Taipei, Taiwan | Full Time | Engineering Management | XREX

XREX

XREX helps users access financial services by providing crypto-fiat exchange, escrow services, and blockchain compliance tools globally.

Taiwan, Province of China

Funding

Unknown ($19M)

Investors

Tether

Projects

XREX: CEX

XREX Exchange: Centralised Exchange

BitCheck: Peer to Peer & Remittance

XRAY: Onchain Compliance and Investigations

XREX Earn: Savings & Yield

XREX Clubs: Social Trading

XREX Marketplace: Education and Training

About XREX

XREX helps individuals and businesses access secure financial services by providing a regulated crypto-fiat exchange with USD support, escrow payment guarantee services, and blockchain compliance tools. Users can trade digital assets, transfer funds securely through BitCheck escrow, and access staking rewards through community clubs.

About the Role

In this role, you will lead and grow the SRE team to maintain high availability, scalability, and reliability of production systems. You will own AWS cloud infrastructure operations, monitoring, security, resource management, and cost optimization in a 24/7 environment. You will lead incident management, troubleshooting, RCA, and post-incident improvements. You will ensure that infrastructure, cloud environments, and operational processes comply with security, audit, and regulatory requirements such as MAS TRM and ISO 27001. You will drive SRE best practices, including observability, alerting, SLA/SLO/SLI, capacity planning, disaster recovery, and high availability. You will improve system performance, reliability, and operational efficiency through automation and architecture optimization. You will build and maintain CI/CD, IaC, and GitOps workflows to improve deployment efficiency and system consistency. You will manage Kubernetes and EKS platforms and containerized infrastructure. You will collaborate closely with Backend, Data, Security, and Product teams on architecture design and operational improvements. You will build and improve monitoring and observability platforms such as Grafana, ELK, CloudWatch, Zabbix, and Nagios. You will mentor team members, support technical growth, and drive cross-functional collaboration. You will maintain operational documentation, SOPs, and incident reports. You will participate in and improve on-call and incident response processes.

Requirements

8+ years of Linux system administration and large-scale infrastructure experience
2+ years of team management or Tech Lead experience
Hands-on experience operating high-traffic, high-availability cloud platforms in a 24/7 environment
Strong experience with AWS services, including EC2, API Gateway, AppSync, VPC, IAM, Networking, Lambda, Aurora, ElastiCache (Redis), CloudFront, CloudWatch, EKS, Security Services, SNS, Parameter Store, and Secrets Manager
Strong Kubernetes and container infrastructure experience, including EKS administration and troubleshooting
Experience with Infrastructure as Code and configuration management tools such as Terraform, Helm, and Kustomize
Experience with CI/CD and GitOps tools such as Jenkins, GitHub Actions, Argo Workflows, and ArgoCD
Familiar with observability and monitoring tools including Grafana, ELK, Zabbix, and Nagios
Experience managing distributed systems and related technologies such as MongoDB, Kafka, Load Balancers, and HA architecture
Strong understanding of SRE / DevOps practices, including Incident Management, Capacity Planning, Disaster Recovery, and SLA/SLO/SLI
Proficient in scripting or programming languages such as Bash, Python, or Golang
Knowledge of cloud security, infrastructure security, and technical risk management
Strong communication, collaboration, and problem-solving skills in fast-paced environments
Experience in FinTech, Crypto, or high-availability platforms is a plus
Familiarity with compliance and security frameworks such as MAS TRM and ISO 27001 is a plus

Responsibilities

Lead and manage the SRE team to ensure high availability, scalability, and reliability of production systems
Own AWS cloud infrastructure operations, monitoring, security, resource management, and cost optimization in a 24/7 environment
Lead incident management, troubleshooting, RCA, and post-incident improvements
Ensure infrastructure, cloud environments, and operational processes comply with security, audit, and regulatory requirements (e.g. MAS TRM, ISO 27001)
Drive SRE best practices including observability, alerting, SLA/SLO/SLI, capacity planning, disaster recovery, and high availability
Improve system performance, reliability, and operational efficiency through automation and architecture optimization
Build and maintain CI/CD, IaC, and GitOps workflows to improve deployment efficiency and system consistency
Manage Kubernetes / EKS platforms and containerized infrastructure
Collaborate closely with Backend, Data, Security, and Product teams on architecture design and operational improvements
Build and improve monitoring and observability platforms such as Grafana, ELK, CloudWatch, Zabbix, and Nagios
Mentor team members, support technical growth, and drive cross-functional collaboration
Maintain operational documentation, SOPs, and incident reports
Participate in and improve on-call and incident response processes
