Senior or Staff AI Infrastructure Engineer
About the Role
You will design, build, and operate the infrastructure that supports large-scale AI and agent systems. You will create reusable CI/CD workflows for training, evaluation, and deployment; automate model versioning, approvals, and compliance checks; and assemble modular stacks including vector databases, feature stores, and model registries. You will integrate and evaluate cutting-edge LLM tools, instrument observability and monitoring, and deploy online and offline evaluation pipelines with regression testing, cost monitoring, and human-in-the-loop workflows. You will collaborate with engineers and data scientists to embed models and agents into real-time applications, provide sandboxes and reproducible environments for researchers, and continuously improve model performance, reliability, and governance.
Requirements
- Write high-quality, maintainable software, primarily in Python
- Strong background in scalable infrastructure, including Docker and Kubernetes
- Experience with infrastructure-as-code and deployment tools such as Terraform and CI/CD pipelines
- Familiarity with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
- Knowledge of MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
- Experience with scalable model and agent serving infrastructure such as vLLM, Triton, and BentoML
- Experience deploying and maintaining LLM and agentic workflows in production, including cost, latency, and performance monitoring
- Experience capturing traces for analysis and debugging, and optimizing prompt-response flows with real-time data access
- Strong ownership, pragmatism, and the ability to balance infrastructure design with iterative delivery
Responsibilities
- Build reusable CI/CD workflows for model training, evaluation, and deployment
- Automate model versioning, approval workflows, and compliance checks
- Build modular, scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
- Partner with engineering and data science to embed AI models and agents into real-time applications and workflows
- Continuously evaluate and integrate state-of-the-art AI tools and frameworks
- Drive AI reliability and governance to ensure compliance, security, and uptime
- Ensure data accuracy, consistency, and reliability for model training and inference
- Deploy infrastructure to support offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
- Enable researchers with sandboxes, dashboards, and reproducible environments
- Continuously improve AI and ML model performance and reliability
Benefits
- Remote work
- Eligibility to participate in TRM's equity plan
