Senior Site Reliability Engineer, CCIP
About the Role
You will ensure the reliability, scalability, and operational excellence of the CCIP platform. You will strengthen production resilience by improving deployment safety, establishing distributed tracing for observability, eliminating operational toil through automation, driving adoption of meaningful SLOs, SLIs, and error budgets, and increasing platform scalability and readiness as CCIP grows. You will also be responsible for keeping production systems highly available while reducing operational overhead.
Requirements
- Demonstrated experience in Site Reliability Engineering, Production Engineering, or a similar role operating large-scale distributed systems.
- Deep expertise in defining, implementing, and driving adoption of SLOs, SLIs, and error budgets across engineering organizations.
- Experience building and operating production Kubernetes environments that support critical services.
- Experience applying OpenTelemetry to improve observability across distributed systems.
- Experience improving the reliability, scalability, and operability of production infrastructure.
- Demonstrated technical leadership influencing reliability practices across engineering teams.
- Experience performing capacity planning and performance tuning for high-throughput distributed services.
- Previous experience working on Web3 infrastructure or within a crypto-native engineering organization.
- Experience applying chaos engineering or fault-injection techniques to improve production resilience.
- Experience partnering with software engineering teams to conduct production-readiness reviews before service launches.
- Experience leading on-call operations, including defining rotations, escalation policies, and improving alert quality.
Responsibilities
- Improve deployment safety and increase delivery velocity by advancing production engineering practices.
- Establish distributed tracing across the platform to improve observability and accelerate incident investigation.
- Eliminate operational toil through automation that increases engineering efficiency and platform reliability.
- Drive adoption of meaningful SLOs, SLIs, and error budgets that guide engineering decisions and improve service health.
- Increase platform scalability and operational readiness as CCIP continues to grow.
- Strengthen Chainlink's reputation through highly available production systems while reducing operational overhead.
Benefits
- Long-term incentives
- Comprehensive benefits
