Senior or Staff ML Systems Engineer, LLMs

About the Role

You will build and scale the technical infrastructure that powers large language models and agentic systems. You will create reusable CI/CD workflows for model training, evaluation, and deployment, automate model versioning and approval workflows, and implement compliance checks. You will design and operate modular AI infrastructure—vector databases, feature stores, model registries, and observability tooling—and embed models and agents into real-time applications. You will continuously evaluate and integrate state-of-the-art tools, monitor cost, latency, and performance, and run offline and online evaluation pipelines including regression tests and human-in-the-loop workflows. You will enable researchers by providing sandboxes, dashboards, and reproducible environments, and ensure data accuracy and reliability for model training and inference.

Requirements

  • Write high-quality, maintainable software, primarily in Python
  • Strong background in scalable infrastructure, including containerization and orchestration (Docker, Kubernetes)
  • Experience with infrastructure as code and deployment (Terraform, CI/CD pipelines)
  • Familiarity with monitoring and logging frameworks (Datadog, Prometheus, OpenTelemetry)
  • Knowledge of MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
  • Experience with scalable model and agent serving infrastructure (vLLM, Triton, BentoML)
  • Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance
  • Ability to capture traces for analysis and optimize prompt-response flows with real-time data access
  • Strong ownership, pragmatism, and the ability to balance infrastructure elegance with iterative delivery

Responsibilities

  • Build reusable CI/CD workflows for model training, evaluation, and deployment
  • Automate model versioning, approval workflows, and compliance checks
  • Design and maintain modular, scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
  • Embed AI models and agents into real-time applications and workflows
  • Evaluate and integrate state-of-the-art AI tools and libraries
  • Drive AI reliability and governance, and ensure compliance, security, and uptime
  • Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
  • Provide sandboxes, dashboards, and reproducible environments to accelerate research
  • Ensure data accuracy, consistency, and reliability for model training and inference

Benefits

  • Equity plan
  • Remote work