VP of Engineering

2 days ago | Head | San Francisco, USA | Hybrid | Full Time | Engineering Management | Jobs by Hyperbolic

Skills

Scheduler, Slurm, Open Source, Distributed Training, GPU Scheduling, GPU, ML, Architecture, Security, Cloud Infrastructure, Observability, CI/CD, Ray, Kubernetes, Distributed Systems, AI

About the Role

You will lead the cloud infrastructure architecture and define the architecture for GPU orchestration, compute scheduling, networking, storage, and distributed systems. You will build and scale large GPU clusters supporting customer workloads. You will drive reliability and performance for AI training and inference workloads. You will establish best practices for Kubernetes, observability, CI/CD, security, and operational excellence. You will build SRE and Platform Engineering functions from the ground up and recruit and develop infrastructure, platform, and SRE capabilities. You will partner with stakeholders on strategy and investments for scalable AI infrastructure.

Requirements

12+ years building and operating large-scale infrastructure systems
Experience leading infrastructure organizations while remaining hands-on technically
Experience building GPU infrastructure or AI/ML compute platforms
Proven track record scaling infrastructure in high-growth startup environments
Expert-level Kubernetes knowledge
Experience designing and operating multi-region cloud infrastructure
Strong understanding of Linux, networking, distributed systems, and storage architecture
Experience with Infrastructure-as-Code and automation frameworks
Deep expertise in observability, monitoring, and reliability engineering
Experience building highly available production systems
Experience with GPU scheduling, Slurm, Kubernetes GPU operators, Ray, or distributed training systems
Experience managing thousands of GPUs in production environments
Background supporting AI training and inference platforms

Responsibilities

Lead the design and evolution of a scalable AI cloud platform
Define the architecture for GPU orchestration, compute scheduling, networking, storage, and distributed systems
Make critical decisions regarding cloud infrastructure, bare-metal deployments, and platform scalability
Personally participate in architecture reviews and key technical initiatives
Build and scale large GPU clusters supporting customer workloads
Design systems for GPU provisioning, scheduling, utilization optimization, and capacity management
Drive platform reliability and performance for AI training and inference workloads
Partner with stakeholders on infrastructure requirements for next-generation AI systems
Establish best practices for Kubernetes, observability, CI/CD, security, and operational excellence
Build SRE and Platform Engineering functions from the ground up
Define reliability standards, including SLOs, SLIs, incident response processes, and capacity planning
Drive automation across infrastructure operations
Recruit and develop infrastructure, platform, and SRE capabilities
