Senior or Staff ML Systems Engineer, LLMs

About the Role

You will build and scale the technical infrastructure that powers large language models and agentic systems. You will create reusable CI/CD workflows for model training, evaluation, and deployment, automate model versioning and approval workflows, and implement compliance checks. You will design and operate modular AI infrastructure—vector databases, feature stores, model registries, and observability tooling—and embed models and agents into real-time applications. You will continuously evaluate and integrate state-of-the-art tools, monitor cost, latency, and performance, and run offline and online evaluation pipelines including regression tests and human-in-the-loop workflows. You will enable researchers by providing sandboxes, dashboards, and reproducible environments, and ensure data accuracy and reliability for model training and inference.

Requirements

  • Write high-quality, maintainable software, primarily in Python
  • Strong background in scalable infrastructure, including containerization and orchestration (Docker, Kubernetes)
  • Experience with infrastructure as code and deployment (Terraform, CI/CD pipelines)
  • Familiarity with monitoring and logging frameworks (Datadog, Prometheus, OpenTelemetry)
  • Knowledge of MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
  • Experience with scalable model and agent serving infrastructure (vLLM, Triton, BentoML)
  • Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance
  • Ability to capture traces for analysis and optimize prompt-response flows with real-time data access
  • Strong ownership, pragmatism, and the ability to balance infrastructure elegance with iterative delivery

Responsibilities

  • Build reusable CI/CD workflows for model training, evaluation, and deployment
  • Automate model versioning, approval workflows, and compliance checks
  • Design and maintain modular, scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
  • Embed AI models and agents into real-time applications and workflows
  • Evaluate and integrate state-of-the-art AI tools and libraries
  • Drive AI reliability and governance, and ensure compliance, security, and uptime
  • Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
  • Provide sandboxes, dashboards, and reproducible environments to accelerate research
  • Ensure data accuracy, consistency, and reliability for model training and inference

Benefits

  • Equity plan
  • Remote work