Senior MLOps Engineer, LLMOps
About the Role
You will build and maintain the infrastructure and pipelines that enable production AI systems. You will design CI/CD workflows for model training, evaluation, and deployment, automate model versioning and approval workflows, and implement compliance and observability tooling. You will integrate and evaluate state-of-the-art LLM and agent tools, deploy scalable model serving, monitor cost, latency, and performance, and run offline and online evaluations, including human-in-the-loop processes. You will provide reproducible sandboxes and dashboards so researchers and engineers can iterate quickly and reliably.
Requirements
- Write high-quality, maintainable software, primarily in Python
- Experience with containerization and orchestration, such as Docker and Kubernetes
- Experience with infrastructure-as-code and deployment tooling, such as Terraform and CI/CD pipelines
- Experience with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
- Experience implementing MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
- Experience with scalable model and agent serving infrastructure such as vLLM, Triton, and BentoML
- Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance and capturing traces
- Strong ownership, pragmatism, and the ability to balance infrastructure elegance with iterative delivery
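To illustrate the kind of cost/latency monitoring this role covers, here is a minimal sketch using only the standard library; the decorator name `track_llm_call` and the in-process `METRICS` dict are hypothetical stand-ins (a production system would export these measurements to Prometheus or OpenTelemetry instead):

```python
import time
import functools

# In-memory metrics store; a real deployment would export these
# to Prometheus or OpenTelemetry rather than keep them in-process.
METRICS = {"calls": 0, "total_latency_s": 0.0, "total_tokens": 0}

def track_llm_call(fn):
    """Record latency and token usage for each wrapped LLM call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)  # expected to return (text, token_count)
        METRICS["calls"] += 1
        METRICS["total_latency_s"] += time.perf_counter() - start
        METRICS["total_tokens"] += result[1]
        return result
    return wrapper

@track_llm_call
def fake_completion(prompt: str):
    # Stand-in for a real model call; returns (text, token_count).
    return f"echo: {prompt}", len(prompt.split())

fake_completion("hello world from mlops")
print(METRICS["calls"], METRICS["total_tokens"])  # 1 4
```

The same wrapper pattern extends naturally to recording per-call dollar cost or emitting OpenTelemetry spans for trace capture.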
Responsibilities
- Build reusable CI/CD workflows for model training, evaluation, and deployment
- Automate model versioning, approval workflows, and compliance checks
- Build modular and scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
- Embed AI models and agents into real-time applications and workflows
- Continuously evaluate and integrate state-of-the-art AI tools
- Drive AI reliability, governance, and uptime
- Ensure data accuracy, consistency, and reliability for training and inference
- Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
- Provide sandboxes, dashboards, and reproducible environments for researchers
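As a concrete example of the regression-testing side of automated evaluation, a candidate model version can be gated against a baseline before promotion; the names `EvalResult` and `passes_regression_gate` below are hypothetical, and the scores and tolerance are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    model_version: str
    score: float  # e.g. mean accuracy from an offline eval suite

def passes_regression_gate(baseline: EvalResult,
                           candidate: EvalResult,
                           tolerance: float = 0.01) -> bool:
    """Approve a candidate only if its eval score has not dropped
    more than `tolerance` below the baseline's score."""
    return candidate.score >= baseline.score - tolerance

baseline = EvalResult("v1.4.0", 0.872)
candidate = EvalResult("v1.5.0-rc1", 0.866)

# 0.866 >= 0.872 - 0.01, so this candidate clears the gate.
print(passes_regression_gate(baseline, candidate))  # True
```

In CI, a check like this would run after the offline eval suite and block the deployment (or the model-registry approval step) when it returns False.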
Benefits
- Equity plan eligibility
