Senior or Staff AI Infrastructure Engineer

About the Role

You will design, build, and operate the infrastructure that supports large-scale AI and agent systems. You will create reusable CI/CD workflows for training, evaluation, and deployment; automate model versioning, approvals, and compliance checks; and assemble modular stacks including vector databases, feature stores, and model registries. You will integrate and evaluate cutting-edge LLM tools, instrument observability and monitoring, and deploy online and offline evaluation pipelines with regression testing, cost monitoring, and human-in-the-loop workflows. You will collaborate with engineers and data scientists to embed models and agents into real-time applications, provide sandboxes and reproducible environments for researchers, and continuously improve model performance, reliability, and governance.

Requirements

  • Write high-quality, maintainable software, primarily in Python
  • Strong background in scalable infrastructure, including Docker and Kubernetes
  • Experience with infrastructure-as-code and deployment tooling, such as Terraform and CI/CD pipelines
  • Familiarity with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
  • Knowledge of MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
  • Experience with scalable model- and agent-serving infrastructure such as vLLM, Triton, and BentoML
  • Experience deploying and maintaining LLM and agentic workflows in production, including cost, latency, and performance monitoring
  • Experience capturing traces for analysis and debugging, and optimizing prompt-response flows with real-time data access
  • Strong ownership, pragmatism, and the ability to balance infrastructure design with iterative delivery

Responsibilities

  • Build reusable CI/CD workflows for model training, evaluation, and deployment
  • Automate model versioning, approval workflows, and compliance checks
  • Build modular, scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
  • Partner with engineering and data science to embed AI models and agents into real-time applications and workflows
  • Continuously evaluate and integrate state-of-the-art AI tools and frameworks
  • Drive AI reliability and governance to ensure compliance, security, and uptime
  • Ensure data accuracy, consistency, and reliability for model training and inference
  • Deploy infrastructure to support offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
  • Enable researchers with sandboxes, dashboards, and reproducible environments
  • Continuously improve AI and ML model performance

Benefits

  • Remote work
  • Eligibility to participate in TRM's equity plan