Senior MLOps Engineer, LLMOps
About the Role
You will build and maintain the infrastructure and pipelines that enable production AI systems. You will design CI/CD workflows for model training, evaluation, and deployment, automate model versioning and approval workflows, and implement compliance and observability tooling. You will integrate and evaluate state-of-the-art LLM and agent tools, deploy scalable model serving, monitor cost, latency, and performance, and run offline and online evaluations, including human-in-the-loop processes. You will provide reproducible sandboxes and dashboards so researchers and engineers can iterate quickly and reliably.
Requirements
- Write high-quality, maintainable software, primarily in Python
- Experience with containerization and orchestration, such as Docker and Kubernetes
- Experience with infrastructure-as-code and deployment tooling, such as Terraform and CI/CD pipelines
- Experience with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
- Experience implementing MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
- Experience with scalable model and agent serving infrastructure such as vLLM, Triton, and BentoML
- Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance and capturing traces
- Strong ownership, pragmatism, and the ability to balance infrastructure elegance with iterative delivery
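To illustrate the kind of cost/latency monitoring this role covers, here is a minimal sketch using only the standard library; the decorator name `track_llm_call` and the in-process `METRICS` dict are hypothetical stand-ins (a production system would export these measurements to Prometheus or OpenTelemetry instead):

```python
import time
import functools

# In-memory metrics store; a real deployment would export these
# to Prometheus or OpenTelemetry rather than keep them in-process.
METRICS = {"calls": 0, "total_latency_s": 0.0, "total_tokens": 0}

def track_llm_call(fn):
    """Record latency and token usage for each wrapped LLM call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)  # expected to return (text, token_count)
        METRICS["calls"] += 1
        METRICS["total_latency_s"] += time.perf_counter() - start
        METRICS["total_tokens"] += result[1]
        return result
    return wrapper

@track_llm_call
def fake_completion(prompt: str):
    # Stand-in for a real model call; returns (text, token_count).
    return f"echo: {prompt}", len(prompt.split())

fake_completion("hello world from mlops")
print(METRICS["calls"], METRICS["total_tokens"])  # 1 4
```

The same wrapper pattern extends naturally to recording per-call dollar cost or emitting OpenTelemetry spans for trace capture.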
Responsibilities
- Build reusable CI/CD workflows for model training, evaluation, and deployment
- Automate model versioning, approval workflows, and compliance checks
- Build modular and scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
- Embed AI models and agents into real-time applications and workflows
- Continuously evaluate and integrate state-of-the-art AI tools
- Drive AI reliability, governance, and uptime
- Ensure data accuracy, consistency, and reliability for training and inference
- Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
- Provide sandboxes, dashboards, and reproducible environments for researchers
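As a concrete example of the regression-testing side of automated evaluation, a candidate model version can be gated against a baseline before promotion; the names `EvalResult` and `passes_regression_gate` below are hypothetical, and the scores and tolerance are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    model_version: str
    score: float  # e.g. mean accuracy from an offline eval suite

def passes_regression_gate(baseline: EvalResult,
                           candidate: EvalResult,
                           tolerance: float = 0.01) -> bool:
    """Approve a candidate only if its eval score has not dropped
    more than `tolerance` below the baseline's score."""
    return candidate.score >= baseline.score - tolerance

baseline = EvalResult("v1.4.0", 0.872)
candidate = EvalResult("v1.5.0-rc1", 0.866)

# 0.866 >= 0.872 - 0.01, so this candidate clears the gate.
print(passes_regression_gate(baseline, candidate))  # True
```

In CI, a check like this would run after the offline eval suite and block the deployment (or the model-registry approval step) when it returns False.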
Benefits
- Equity plan eligibility
