
Member of Technical Staff, Infrastructure
Moonvalley AI
Completely Remote, Full Time, Information Technology
Posted Today
Job description
Responsibilities
- Design and improve scheduling and resource allocation so that inference and training workloads can coexist on shared GPU clusters
- Build, operate, and scale GPU infrastructure across clusters of thousands of GPUs
- Own GPU utilization and cost as first-class metrics
- Build automated tooling and observability that reduces friction for the AI team
- Participate in the on-call rotation and drive reliability improvements
- Serve as the primary point of contact for GPU providers, managing relationships and coordinating infrastructure needs
- Set scheduling policy and drive architecture decisions for compute and storage systems over time
Requirements
- Linux-native systems expertise, with the ability to debug at the kernel level
- Deep understanding of networking and storage stacks
- Experience operating and scaling GPU infrastructure (hundreds to thousands of GPUs)
- Experience operating Kubernetes, Slurm, and distributed storage systems
- Track record of running critical infrastructure reliably with monitoring, incident response, and automation
- Sufficient understanding of training and inference workloads to collaborate with researchers
Preferred Qualifications
- Experience in resource-constrained environments where allocation, scheduling, and prioritization of scarce resources were the core problem (e.g., HPC, trading, large-scale ML platforms)
Benefits
- Competitive salary and equity
- Private health coverage
- Pension contribution (UK, Canada, US)
- Fully distributed, async-first culture
- Hardware setup of your choice
- Stipends for phone, internet, and meals
About the Company
Moonvalley AI builds world models for media and entertainment, training and deploying AI systems at scale across thousands of GPUs.
Skills & tools
Kubernetes, GPU, Slurm, Linux, Distributed Systems, Python, AWS, Machine Learning
What the team is looking for
Use this list as a quick fit check before you apply.
- 01 Linux-native systems expertise
- 02 GPU cluster engineering at scale
- 03 Kubernetes and Slurm operation
- 04 Distributed systems design
- 05 Production infrastructure reliability
- 06 ML workload familiarity

Job details
- Work model: Completely Remote
- Commitment: Full Time
- Category: Information Technology
- Posted: Today