Member of Technical Staff, Infrastructure

Moonvalley AI

Completely Remote · Full Time · Information Technology
Posted Today

Job description

Responsibilities

  • Design and improve scheduling and resource allocation so inference and training workloads can coexist on shared GPU clusters
  • Build, operate, and scale GPU infrastructure across clusters of thousands of GPUs
  • Own GPU utilization and cost as first-class metrics
  • Build automated tooling and observability that reduces friction for the AI team
  • Participate in on-call rotation and drive reliability improvements
  • Serve as the primary point of contact for GPU providers, managing relationships and coordinating infrastructure needs
  • Set scheduling policy and drive architecture decisions for compute and storage systems over time

Requirements

  • Linux-native systems expertise with ability to debug at the kernel level
  • Deep understanding of networking and storage stacks
  • Experience operating and scaling GPU infrastructure (hundreds to thousands of GPUs)
  • Experience operating Kubernetes, Slurm, and distributed storage systems
  • Track record of running critical infrastructure reliably with monitoring, incident response, and automation
  • Sufficient understanding of training and inference workloads to collaborate with researchers

Preferred Qualifications

  • Experience in resource-constrained environments where allocation, scheduling, and prioritization of scarce resources were the core problem (e.g., HPC, trading, large-scale ML platforms)

Benefits

  • Competitive salary and equity
  • Private health coverage
  • Pension contribution (UK, Canada, US)
  • Fully distributed, async-first culture
  • Hardware setup of your choice
  • Stipends for phone, internet, and meals

About the Company

Moonvalley AI builds world models for media and entertainment, training and deploying AI systems at scale across thousands of GPUs.

Skills & tools

Kubernetes · GPU · Slurm · Linux · Distributed Systems · Python · AWS · Machine Learning

What the team is looking for

Use this list as a quick fit check before you apply.

  1. Linux-native systems expertise
  2. GPU cluster engineering at scale
  3. Kubernetes and Slurm operation
  4. Distributed systems design
  5. Production infrastructure reliability
  6. ML workload familiarity