AI Inference Engineer
About the Role
You will optimize the latency and throughput of model inference, design and build reliable production serving systems, and accelerate research on scaling test-time compute. You will implement batching, caching, load balancing, and model parallelism; develop low-level GPU kernels and code-generation tooling; apply algorithmic optimizations such as quantization, distillation, and speculative decoding; and test, benchmark, and improve inference reliability for large-scale, high-concurrency deployments.
Requirements
- Experience with system optimizations for model serving, including batching, caching, load balancing, and model parallelism
- Experience with low-level inference optimizations such as GPU kernels and code generation
- Experience with algorithmic inference optimizations such as quantization, distillation, and speculative decoding
- Experience with large-scale, high-concurrency production serving
- Experience with testing and benchmarking inference services and improving their reliability
Responsibilities
- Optimize model inference latency and throughput
- Build reliable production serving systems
- Accelerate research on scaling test-time compute
- Implement batching, caching, and load balancing for model serving
- Develop model parallelism and low-level GPU kernel optimizations
- Implement code generation for inference
- Apply algorithmic optimizations such as quantization, distillation, and speculative decoding
- Test, benchmark, and improve inference service reliability for high-concurrency deployments
