Senior AI Infrastructure Engineer
Skills
Placement, Configuration Management, Topology, InfiniBand, RoCE, Redfish, Object Storage, NVIDIA, GPU Infrastructure, Ceph, Multi-Tenancy, BMC, Ansible, GPU, Bare-Metal, Distributed Systems, CUDA, Terraform, Orchestration, Memory Management, Observability, CI/CD, IPMI, PXE, OS Deployment, GPU Scheduling, Distributed File System, Cloud-Init, Hardware Vendor Management, API, Secrets Management, Pulumi, Block Storage
About the Role
You will design, build, and operate the infrastructure that transforms raw GPUs into a programmable, orchestrated pool for AI workloads. This includes implementing bare-metal provisioning and lifecycle management, developing GPU scheduling and placement strategies, automating provisioning with infrastructure as code, integrating storage solutions for training data, designing APIs and cloud-init workflows for automated configuration, optimizing GPU compute (CUDA), and working directly with hardware vendors to troubleshoot and improve integrations.
Requirements
- Bare-metal provisioning and lifecycle management (IPMI, Redfish, BMC, PXE, automated OS deployment)
- GPU scheduling and orchestration with awareness of GPU type, memory, and topology
- Experience with Terraform or Pulumi and CI/CD for infrastructure
- Secrets management and configuration management experience
- Observability stack implementation experience
- Storage and data infrastructure for AI/ML including object storage, high-IOPS block storage, and distributed file systems
- API design and cloud-init for automated provisioning
- Solid understanding of GPU architecture, CUDA, and GPU compute optimization
- Experience building and scaling cloud infrastructure or distributed systems in production
- Proven ability to work with hardware vendors and vendor engineering teams
- Strong communication skills
Responsibilities
- Build and scale a multi-tenant GPU cloud marketplace
- Design and implement multi-tenant provisioning and virtualization solutions
- Transform raw GPUs into a programmable, orchestrated resource pool
- Implement bare-metal provisioning and lifecycle management
- Develop GPU scheduling, placement strategies, and fragmentation minimization
- Automate infrastructure using Terraform or Pulumi and CI/CD pipelines
- Implement secrets management, configuration management, and observability
- Design APIs and cloud-init workflows for automated provisioning
- Integrate and operate storage solutions for AI/ML workloads
- Collaborate with hardware vendors to troubleshoot and optimize integrations
