AI Inference Engineer

About the Role

You will optimize the latency and throughput of model inference, design and build reliable production serving systems, and accelerate research on scaling test-time compute. You will implement batching, caching, load balancing, and model parallelism; develop low-level GPU kernels and code generation; apply algorithmic optimizations such as quantization, distillation, and speculative decoding; and test, benchmark, and improve inference reliability for large-scale, high-concurrency deployments.
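A minimal sketch of the request-batching idea mentioned above, assuming a toy async server: requests are queued, grouped up to a size cap or a short wait deadline, and run through one batched forward pass. run_model, MAX_BATCH_SIZE, and MAX_WAIT_MS are illustrative placeholders, not a description of any particular serving stack.

  import asyncio
  import time

  MAX_BATCH_SIZE = 8   # assumed batch cap, for illustration only
  MAX_WAIT_MS = 10     # assumed wait budget before flushing a partial batch

  async def run_model(batch):
      """Stand-in for a real batched forward pass; returns one output per request."""
      await asyncio.sleep(0.005)  # pretend the accelerator takes ~5 ms per batch
      return [f"output-for:{item}" for item in batch]

  class DynamicBatcher:
      """Groups concurrent requests into batches, trading a small bounded
      queueing delay for higher accelerator utilization."""

      def __init__(self):
          self.queue: asyncio.Queue = asyncio.Queue()

      async def submit(self, request):
          """Called by each client coroutine; resolves when its result is ready."""
          future = asyncio.get_running_loop().create_future()
          await self.queue.put((request, future))
          return await future

      async def serve_forever(self):
          """Collect up to MAX_BATCH_SIZE requests or wait at most MAX_WAIT_MS,
          then run one batched forward pass and fan results back out."""
          while True:
              request, future = await self.queue.get()
              batch, futures = [request], [future]
              deadline = time.monotonic() + MAX_WAIT_MS / 1000
              while len(batch) < MAX_BATCH_SIZE:
                  timeout = deadline - time.monotonic()
                  if timeout <= 0:
                      break
                  try:
                      request, future = await asyncio.wait_for(self.queue.get(), timeout)
                  except asyncio.TimeoutError:
                      break
                  batch.append(request)
                  futures.append(future)
              for fut, result in zip(futures, await run_model(batch)):
                  fut.set_result(result)

  async def main():
      batcher = DynamicBatcher()
      server = asyncio.create_task(batcher.serve_forever())
      # 20 concurrent requests collapse into a handful of batched forward passes.
      outputs = await asyncio.gather(*(batcher.submit(i) for i in range(20)))
      print(outputs[:3])
      server.cancel()

  if __name__ == "__main__":
      asyncio.run(main())

The latency/throughput trade-off lives in MAX_WAIT_MS: a longer wait grows batches and throughput at the cost of per-request latency.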

Requirements

  • Experience with system optimizations for model serving, including batching, caching, load balancing, and model parallelism
  • Experience with low-level inference optimizations such as GPU kernels and code generation
  • Experience with algorithmic inference optimizations such as quantization, distillation, and speculative decoding (see the quantization sketch after this list)
  • Experience with large-scale, high-concurrency production serving
  • Experience with testing, benchmarking, and reliability of inference services
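
As one concrete instance of the algorithmic-optimization requirement above, here is a minimal sketch of symmetric per-tensor int8 weight quantization in NumPy. It is a toy under simplifying assumptions (a single per-tensor scale, no calibration data, weights only); production schemes are typically per-channel or per-group and fused into the serving kernels.

  import numpy as np

  def quantize_int8(weights: np.ndarray):
      """Symmetric per-tensor quantization: map floats to [-127, 127]
      with a single scale so matmuls can run in integer arithmetic."""
      scale = np.abs(weights).max() / 127.0
      q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
      return q, scale

  def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
      """Recover an approximation of the original float weights."""
      return q.astype(np.float32) * scale

  if __name__ == "__main__":
      w = np.random.randn(256, 256).astype(np.float32)
      q, scale = quantize_int8(w)
      w_hat = dequantize_int8(q, scale)
      # The reconstruction error should be small relative to the weight scale.
      print("max abs error:", np.abs(w - w_hat).max())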

Responsibilities

  • Optimize model inference latency and throughput
  • Build reliable production serving systems
  • Accelerate research on scaling test-time compute
  • Implement batching, caching, and load balancing for model serving
  • Develop model parallelism and low-level GPU kernel optimizations
  • Implement code generation for inference
  • Apply algorithmic optimizations such as quantization, distillation, and speculative decoding (see the speculative decoding sketch after this list)
  • Test, benchmark, and improve inference service reliability for high-concurrency deployments
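
As a sketch of the speculative decoding responsibility above, the code below shows the standard draft/verify accept-reject loop with placeholder models. draft_probs, target_probs, VOCAB, and DRAFT_LEN are assumptions for illustration; a real implementation verifies all draft positions in one batched target forward pass and samples a bonus token when every draft token is accepted.

  import numpy as np

  VOCAB = 16        # toy vocabulary size (assumed)
  DRAFT_LEN = 4     # draft tokens proposed per step (assumed)
  rng = np.random.default_rng(0)

  def draft_probs(context):
      """Placeholder cheap draft model: a random next-token distribution."""
      logits = rng.standard_normal(VOCAB)
      return np.exp(logits) / np.exp(logits).sum()

  def target_probs(context):
      """Placeholder expensive target model; called per position here only to
      keep the toy loop simple."""
      logits = rng.standard_normal(VOCAB)
      return np.exp(logits) / np.exp(logits).sum()

  def speculative_step(context):
      """One step: the draft proposes up to DRAFT_LEN tokens, the target
      accepts each with probability min(1, p/q); on the first rejection we
      resample from the corrected residual distribution and stop."""
      accepted = []
      for _ in range(DRAFT_LEN):
          q = draft_probs(context + accepted)
          token = int(rng.choice(VOCAB, p=q))
          p = target_probs(context + accepted)
          if rng.random() < min(1.0, p[token] / q[token]):
              accepted.append(token)                 # target agrees: keep the draft token
          else:
              residual = np.maximum(p - q, 0.0)      # correct toward the target distribution
              accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
              break
      return accepted

  if __name__ == "__main__":
      print("accepted tokens:", speculative_step([1, 2, 3]))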