Member of Technical Staff, Infrastructure

Moonvalley AI

Completely Remote · Full Time · Information Technology
Posted Today

Job description

Responsibilities

  • Design and improve scheduling and resource allocation so inference and training workloads can coexist on shared GPU clusters
  • Build, operate, and scale GPU infrastructure across clusters of thousands of GPUs
  • Own GPU utilization and cost as first-class metrics
  • Build automated tooling and observability that reduces friction for the AI team
  • Participate in on-call rotation and drive reliability improvements
  • Serve as the primary point of contact for GPU providers, managing relationships and coordinating infrastructure needs
  • Set scheduling policy and drive architecture decisions for compute and storage systems over time

Requirements

  • Linux-native systems expertise with ability to debug at the kernel level
  • Deep understanding of networking and storage stacks
  • Experience operating and scaling GPU infrastructure (hundreds to thousands of GPUs)
  • Experience operating Kubernetes, Slurm, and distributed storage systems
  • Track record of running critical infrastructure reliably with monitoring, incident response, and automation
  • Sufficient understanding of training and inference workloads to collaborate with researchers

Preferred Qualifications

  • Experience in resource-constrained environments where allocation, scheduling, and prioritization of scarce resources were the core problem (e.g., HPC, trading, large-scale ML platforms)

Benefits

  • Competitive salary and equity
  • Private health coverage
  • Pension contribution (UK, Canada, US)
  • Fully distributed, async-first culture
  • Hardware setup of your choice
  • Stipends for phone, internet, and meals

About the Company

Moonvalley AI builds world models for media and entertainment, training and deploying AI systems at scale across thousands of GPUs.

Skills & tools

Kubernetes · GPU · Slurm · Linux · Distributed Systems · Python · AWS · Machine Learning

What the team is looking for

Use this list as a quick fit check before you apply.

  1. Linux-native systems expertise
  2. GPU cluster engineering at scale
  3. Kubernetes and Slurm operation
  4. Distributed systems design
  5. Production infrastructure reliability
  6. ML workload familiarity