Founding Infrastructure Engineer

Metagov

Metagov is a laboratory for digital governance and a community of research and practice gathered around the mission to cultivate tools, practices, and communities to enable self-governance in the digital age. It operates as a nonprofit research organization.

Distributed
About Metagov

Metagov's mission is to cultivate tools, practices, and communities that enable self-governance in the digital age. Its vision is to work toward a governance layer for the internet that is empowering, creative, interconnected, and accountable. Metagov was founded in 2019 and spun out as an independent nonprofit in January 2020. Its original goal was to describe, support, and expand the right to self-governance in online communities, a right that is both enabled and circumscribed by the platform's architecture. Metagovernance describes two related roles: (1) enabling and constraining users’ ability to create their own institutions, and (2) governing the interaction between separate institutions, from small chat groups to billion-dollar blockchain protocols. The community convenes primarily online, in the Metagov Slack, to coordinate projects, discuss research on online governance, plan seminars, run governance experiments, collaborate with other organizations, and advance research outputs.

About the Role

You will be the technical owner of the platform's operational backbone. You will harden the platform for major launches, perform load testing, and build fallback routing and per-agent monitoring. You will implement end-to-end observability and integrated trace analysis across heterogeneous infrastructure, ship downtime warnings and fallback behavior, and implement routing transparency and endpoint provenance so users can verify which backend served their inference. You will improve performance of public endpoints, integrate programmatic infrastructure interfaces such as an MCP server, and make the utility more transparent and contributable. You will set priorities autonomously, operate production inference and ML serving infrastructure, and coordinate with cloud providers, HPC centers, and other infrastructure partners. Occasional travel for team workshops may be required.

Requirements

  • Significant experience operating production inference or ML serving infrastructure (vLLM, model routing, multi-region deployments, GPU-backed services)
  • Strong distributed systems and SRE instincts including observability, incident response, fallback design, and capacity planning
  • Comfort working across heterogeneous infrastructure partners including cloud providers and HPC centers
  • Experience orchestrating many stacks and integrating open-source projects
  • Maintainer and integrator experience with pride in operational excellence
  • Ability to work autonomously in a small team and travel occasionally for workshops

Responsibilities

  • Harden platform for launches
  • Perform load testing
  • Build fallback routing
  • Set up per-agent monitoring
  • Build end-to-end observability across stacks
  • Ship downtime warnings and fallback behavior
  • Implement routing transparency and endpoint provenance
  • Improve production service performance
  • Integrate MCP server or programmatic infrastructure interfaces
  • Make infrastructure transparent and contributable
  • Operate and maintain production inference and ML serving infrastructure
  • Coordinate with heterogeneous infrastructure partners
  • Orchestrate and integrate multiple open-source stacks
Founding Infrastructure Engineer at Metagov | JobStash