Senior AI Infrastructure Engineer
Skills
Placement, Configuration Management, Topology, InfiniBand, RoCE, Redfish, Object Storage, NVIDIA, GPU Infrastructure, Ceph, Multi-Tenancy, BMC, Ansible, GPU, Bare-Metal, Distributed Systems, CUDA, Terraform, Orchestration, Memory Management, Observability, CI/CD, IPMI, PXE, OS Deployment, GPU Scheduling, Distributed File System, Cloud-Init, Hardware Vendor Management, API, Secrets Management, Pulumi, Block Storage
About the Role
You will design, build, and operate the infrastructure that transforms raw GPUs into a programmable, orchestrated pool for AI workloads. This includes implementing bare-metal provisioning and lifecycle management, developing GPU scheduling and placement strategies, automating provisioning with infrastructure as code, integrating storage solutions for training data, designing APIs and cloud-init workflows for automated configuration, optimizing GPU compute (CUDA), and working directly with hardware vendors to troubleshoot and improve integrations.
Requirements
- Bare-metal provisioning and lifecycle management (IPMI, Redfish, BMC, PXE, automated OS deployment)
- GPU scheduling and orchestration with awareness of GPU type, memory, and topology
- Experience with Terraform or Pulumi and CI/CD for infrastructure
- Secrets management and configuration management experience
- Observability stack implementation experience
- Storage and data infrastructure for AI/ML including object storage, high-IOPS block storage, and distributed file systems
- API design and cloud-init for automated provisioning
- Solid understanding of GPU architecture, CUDA, and GPU compute optimization
- Experience building and scaling cloud infrastructure or distributed systems in production
- Proven ability to work with hardware vendors and vendor engineering teams
- Strong communication skills
Responsibilities
- Build and scale a multi-tenant GPU cloud marketplace
- Design and implement multi-tenant provisioning and virtualization solutions
- Transform raw GPUs into a programmable, orchestrated resource pool
- Implement bare-metal provisioning and lifecycle management
- Develop GPU scheduling, placement strategies, and fragmentation minimization
- Automate infrastructure using Terraform or Pulumi and CI/CD pipelines
- Implement secrets management, configuration management, and observability
- Design APIs and cloud-init workflows for automated provisioning
- Integrate and operate storage solutions for AI/ML workloads
- Collaborate with hardware vendors to troubleshoot and optimize integrations
