Staff MLOps Engineer, LLMOps

About the Role

You will build and operate production-grade infrastructure for large language models and agentic workflows: CI/CD pipelines for training, evaluation, and deployment; automated model versioning and approval workflows; and model serving and monitoring systems. You will integrate vector databases, feature stores, and model registries; instrument observability and cost monitoring; and provide reproducible sandboxes and human-in-the-loop evaluation for researchers. You will also continuously evaluate and adopt state-of-the-art LLM tooling while ensuring the reliability, compliance, and performance of our AI systems.

Requirements

  • Strong software engineering skills, primarily in Python
  • Experience with containerization and orchestration (Docker, Kubernetes)
  • Experience with infrastructure-as-code and CI/CD (Terraform, GitHub Actions or similar)
  • Experience with monitoring and logging frameworks (Datadog, Prometheus, OpenTelemetry)
  • Knowledge of MLOps best practices including model versioning, rollback, and automated evaluation
  • Experience with scalable model serving (vLLM, Triton, BentoML or similar)
  • Experience integrating vector databases, feature stores, and model registries
  • Experience with experiment tracking and evaluation frameworks (MLflow or similar)
  • Ability to optimize prompt and response flows and monitor cost, latency, and performance

Responsibilities

  • Build reusable CI/CD workflows for model training, evaluation, and deployment
  • Automate model versioning, approval workflows, and compliance checks
  • Design and implement a modular, scalable AI infrastructure stack including vector databases and feature stores
  • Integrate and maintain model registries and experiment tracking
  • Partner with engineering and data science to embed models and agents into applications
  • Evaluate and integrate state-of-the-art LLM tools and frameworks
  • Drive AI reliability and governance, including monitoring, testing, and drift detection
  • Deploy infrastructure for offline and online evaluation including regression testing and human-in-the-loop workflows
  • Provide sandboxes, dashboards, and reproducible environments to enable rapid iteration

Benefits

  • Equity plan eligibility