Network Operations Engineer
About the Role
You will serve as the front line of reliability for production infrastructure. You will detect and respond to incidents, triage alerts, coordinate incident response, and document decisions and outcomes in real time. You will also improve observability, refine alerting, build dashboards, and create runbooks to scale operational coverage. This is a shift-based role where you will validate system and user-facing functionality and support ecosystem participants during incidents.
Requirements
- Foundational experience with Linux systems, including filesystem navigation, log reading, and process awareness
- Understanding of core networking concepts such as DNS, HTTP, and TCP/IP, and the ability to troubleshoot connectivity issues
- Basic scripting ability (Python, Bash) to automate tasks and analyze system data
- Exposure to monitoring and observability tools such as Datadog, Grafana, or Prometheus
- Strong written communication skills, with the ability to write clear incident documentation and procedures under pressure
- Willingness to work shift-based, follow-the-sun schedules, and a structured approach to troubleshooting
- Familiarity with blockchain infrastructure, including node operation or EVM-based systems (preferred)
- Experience with Datadog or similar observability platforms in production (preferred)
- Exposure to infrastructure-as-code tools such as Terraform or configuration management tools like Ansible (preferred)
- Previous experience in a network operations center, incident response team, or on-call rotation (preferred)
- Experience working in a remote, globally distributed team (preferred)
Responsibilities
- Monitor the health and performance of blockchain networks, bridges, RPC services, staking systems, and user-facing products
- Track third-party dependencies and identify degradation that may impact the ecosystem
- Validate and triage alerts by distinguishing signal from noise, assessing severity, and determining impact
- Escalate confirmed issues to the appropriate SRE or engineering teams with clear, structured context
- Coordinate incident response by engaging stakeholders, maintaining timelines, and ensuring consistent communication
- Document incidents in real time, including decisions, actions, and outcomes
- Build and improve dashboards, alerting systems, and monitoring coverage to enhance visibility
- Create and maintain runbooks for common failure modes and triage workflows
- Support validators and infrastructure providers when issues intersect with systems
- Validate user-facing product functionality during incidents
Benefits
- Remote-first global workforce
- Medical insurance
- Dental insurance
- Vision insurance
- 401(k) with 3% company match (United States employees only)
- $1,500 home office setup allowance (lifetime max)
- $200 annual AI allowance
- $75 monthly internet or phone reimbursement
- Flexible Time Off
- Company-issued laptop
- Egg freezing benefits
- Mental health and employee wellness benefits
