Senior AI Infrastructure Engineer

About the Role

You will design, build, and operate the infrastructure that transforms raw GPUs into a programmable, orchestrated pool for AI workloads. The work spans bare-metal provisioning and lifecycle management, GPU scheduling and placement, infrastructure-as-code automation, storage integration for training data, API and cloud-init workflow design for automated configuration, and GPU compute (CUDA) optimization. You will also work directly with hardware vendors to troubleshoot and improve integrations.

Requirements

  • Bare-metal provisioning and lifecycle management (IPMI, Redfish, BMC, PXE, automated OS deployment)
  • GPU scheduling and orchestration that accounts for GPU type, memory, and topology
  • Experience with Terraform or Pulumi and CI/CD for infrastructure
  • Secrets management and configuration management experience
  • Experience implementing an observability stack
  • Storage and data infrastructure for AI/ML including object storage, high-IOPS block storage, and distributed file systems
  • API design and cloud-init for automated provisioning
  • Solid understanding of GPU architecture, CUDA, and GPU compute optimization
  • Experience building and scaling cloud infrastructure or distributed systems in production
  • Proven ability to work with hardware vendors and vendor engineering teams
  • Strong communication skills

Responsibilities

  • Build and scale a multi-tenant GPU cloud marketplace
  • Design and implement multi-tenant provisioning and virtualization solutions
  • Transform raw GPUs into a programmable, orchestrated resource pool
  • Implement bare-metal provisioning and lifecycle management
  • Develop GPU scheduling and placement strategies that minimize fragmentation
  • Automate infrastructure using Terraform or Pulumi and CI/CD pipelines
  • Implement secrets management, configuration management, and observability
  • Design APIs and cloud-init workflows for automated provisioning
  • Integrate and operate storage solutions for AI/ML workloads
  • Collaborate with hardware vendors to troubleshoot and optimize integrations