Senior or Staff AI Infrastructure Engineer
About the Role
You will design, build, and operate the infrastructure that supports large-scale AI and agent systems. You will create reusable CI/CD workflows for training, evaluation, and deployment; automate model versioning, approvals, and compliance checks; and assemble modular stacks including vector databases, feature stores, and model registries. You will integrate and evaluate cutting-edge LLM tools, instrument observability and monitoring, and deploy online and offline evaluation pipelines with regression testing, cost monitoring, and human-in-the-loop workflows. You will collaborate with engineers and data scientists to embed models and agents into real-time applications, provide sandboxes and reproducible environments for researchers, and continuously improve model performance, reliability, and governance.
Requirements
- Write high-quality, maintainable software, primarily in Python
- Strong background in scalable infrastructure, including Docker and Kubernetes
- Experience with infrastructure-as-code and deployment tools such as Terraform and CI/CD pipelines
- Familiarity with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
- Knowledge of MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
- Experience with scalable model and agent serving infrastructure such as vLLM, Triton, and BentoML
- Experience deploying and maintaining LLM and agentic workflows in production, including cost, latency, and performance monitoring
- Experience capturing traces for analysis and debugging, and optimizing prompt-response flows with real-time data access
- Strong ownership, pragmatism, and the ability to balance infrastructure design with iterative delivery
Responsibilities
- Build reusable CI/CD workflows for model training, evaluation, and deployment
- Automate model versioning, approval workflows, and compliance checks
- Build modular, scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
- Partner with engineering and data science to embed AI models and agents into real-time applications and workflows
- Continuously evaluate and integrate state-of-the-art AI tools and frameworks
- Drive AI reliability and governance to ensure compliance, security, and uptime
- Ensure data accuracy, consistency, and reliability for model training and inference
- Deploy infrastructure to support offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
- Enable researchers with sandboxes, dashboards, and reproducible environments
- Continuously improve AI and ML model performance and reliability
Benefits
- Remote work
- Eligibility to participate in TRM's equity plan
