Senior MLOps Engineer, LLMOps

About the Role

You will build and maintain the infrastructure and pipelines that power production AI systems. You will design CI/CD workflows for model training, evaluation, and deployment; automate model versioning and approval workflows; and implement compliance and observability tooling. You will integrate and evaluate state-of-the-art LLM and agent tools, deploy scalable model serving, monitor cost, latency, and performance, and run offline and online evaluations, including human-in-the-loop processes. You will also provide reproducible sandboxes and dashboards so researchers and engineers can iterate quickly and reliably.

Requirements

  • Ability to write high-quality, maintainable software, primarily in Python
  • Experience with containerization and orchestration such as Docker and Kubernetes
  • Experience with infrastructure-as-code and deployment tooling such as Terraform and CI/CD pipelines
  • Experience with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
  • Experience implementing MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
  • Experience with scalable model and agent serving infrastructure such as vLLM, Triton, and BentoML
  • Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance, and capturing traces
  • Strong ownership, pragmatism, and the ability to balance infrastructure elegance with iterative delivery

Responsibilities

  • Build reusable CI/CD workflows for model training, evaluation, and deployment
  • Automate model versioning, approval workflows, and compliance checks
  • Build modular, scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
  • Embed AI models and agents into real-time applications and workflows
  • Continuously evaluate and integrate state-of-the-art AI tools
  • Drive AI reliability, governance, and uptime
  • Ensure data accuracy, consistency, and reliability for training and inference
  • Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
  • Provide sandboxes, dashboards, and reproducible environments for researchers

Benefits

  • Equity plan eligibility