Machine Learning Infrastructure Engineer

1 month ago | Senior | San Francisco, USA | Hybrid | Full Time | AI Jobs | by TRM Labs

About the Role

You will design, build, and operate GPU-backed infrastructure to run production ML and LLM workloads. You will optimize inference systems for throughput and cost, implement model optimization and compilation workflows, and support distributed inference patterns such as model and tensor parallelism. You will schedule heterogeneous workloads across accelerators, instrument systems for GPU load, memory, batching, and token throughput, and work with engineering and ML teams to transition models from experimentation to reliable production services.

Requirements

Bachelor's degree or equivalent in Computer Science or a related field
5+ years of experience building and operating distributed systems or infrastructure in production
Experience deploying and operating ML/LLM inference workloads on GPU clusters in cloud environments (AWS and/or GCP)
Deep understanding of high-throughput inference systems including batching strategies and token throughput optimization
Experience with ML serving frameworks such as Triton Inference Server, vLLM, Ray Serve, ONNX Runtime, or Hugging Face Optimum
Experience optimizing GPU load, memory efficiency, and production performance bottlenecks
Familiarity with distributed inference strategies including model parallelism and tensor parallelism
Experience working with Kubernetes or equivalent orchestration systems
Familiarity with heterogeneous accelerators (e.g., Inferentia) is a plus
CUDA familiarity and experience debugging GPU-related issues are a plus
Adaptable and autonomous with excellent communication and collaboration skills

Responsibilities

Design and operate GPU cluster infrastructure
Optimize high-throughput inference
Enable distributed inference strategies
Implement model optimization and compilation workflows
Schedule heterogeneous workloads
Build observability into ML infrastructure
Partner across engineering teams to transition models to production
