Senior DevOps / SRE Engineer
About the Role
You will build and maintain the infrastructure that runs concurrent AI trading agents, including their cron schedules, state files, and trailing-stop processes. This covers deploying and managing agent environments with workspace persistence, designing and operating CI/CD pipelines, and executing zero-downtime deployments. You will build monitoring, alerting, and observability across metrics, logs, and traces; operate and scale Kubernetes/EKS clusters and containerized workloads; and manage Redis, Postgres/RDS, ClickHouse, Kafka, and blockchain node infrastructure. You will also own logging, security, backups, and disaster readiness, lead on-call practices, run incident response and postmortems, and drive long-term reliability improvements for production trading workloads.
Requirements
- Professional DevOps, SRE, or infrastructure engineering experience
- Strong Kubernetes experience, ideally on AWS EKS
- Hands-on experience with Docker and Helm
- Proficiency with infrastructure as code such as Terraform or Ansible
- Experience with CI/CD and deployment automation (GitHub Actions, ArgoCD, or similar)
- Strong AWS infrastructure experience; multi-cloud is a plus
- Experience operating Redis, Postgres/RDS, ClickHouse, and Kafka in production
- Observability experience with Prometheus, Grafana, Datadog, Loki, ELK/OpenSearch/Kibana, or OpenTelemetry
- Ability to build dashboards, alerts, and operational visibility
- Ability to debug across Python, Node.js, and Go
- Experience with access management, secrets handling, production hardening, and operational controls
- Experience with incident management, on-call operations, and backup/recovery planning
- Understanding of real-time systems and low-latency reliability requirements for trading
- Familiarity with blockchain node infrastructure, exchange APIs, wallet operations, and on-chain monitoring
- Experience with, or willingness to learn, MCP server deployment and auth management
- Hyperliquid experience is a plus
- OpenClaw and multi-agent orchestration experience is strongly preferred
Responsibilities
- Build and maintain infrastructure for concurrent AI trading agents
- Deploy and manage OpenClaw agent environments with workspace persistence and cron orchestration
- Design and operate CI/CD pipelines for production agent updates
- Define and execute zero-downtime deployment and safe rollback strategies
- Ensure active positions remain protected through infrastructure changes
- Build monitoring, alerting, and observability across metrics, logs, traces, and dashboards
- Manage cloud infrastructure using infrastructure as code
- Operate and scale Kubernetes/EKS clusters and containerized workloads
- Operate and maintain Redis, Postgres/RDS, ClickHouse, and Kafka
- Operate blockchain node infrastructure and ensure reliable exchange API and wallet connectivity
- Own logging and security across the full stack
- Lead incident response, on-call practices, debugging, mitigation, and postmortems
- Own backup, recovery, and disaster-readiness for critical infrastructure
