DevOps/Site Reliability Engineer

Bespoke Labs

Completely Remote · Contract · Engineering & Architecture
Posted 1 week ago

Job description

Responsibilities

  • Own and scale cloud infrastructure on AWS, including EC2, EKS, RDS, S3, IAM, and VPC
  • Manage Kubernetes clusters and container orchestration end-to-end
  • Build and maintain CI/CD pipelines using GitHub Actions or similar tools
  • Implement monitoring, alerting, and observability stacks such as Prometheus, Grafana, or Datadog
  • Automate infrastructure using Terraform or other Infrastructure as Code (IaC) tools
  • Improve the reliability, performance, and security of production systems
  • Debug and resolve issues across complex, distributed systems
  • Participate in design reviews to help raise the infrastructure bar

Requirements

  • 3–5 years of experience in DevOps, SRE, or infrastructure engineering
  • Strong AWS experience with EKS, EC2, RDS, S3, and IAM
  • Proficiency with Kubernetes deployment, scaling, and troubleshooting in production
  • Experience building CI/CD pipelines with tools such as GitHub Actions or ArgoCD
  • Experience with Infrastructure as Code tools like Terraform, Pulumi, or CDK
  • Proficiency in Python or Go scripting
  • Experience working in production environments with real users
  • Ability to operate autonomously and handle ambiguity

Preferred Qualifications

  • Experience supporting ML training workloads or GPU clusters
  • Familiarity with distributed computing or large-scale data pipelines
  • Prior work experience at an AI, ML, or data-focused company
  • Open-source contributions or published technical writing

Benefits

  • Competitive compensation and meaningful equity
  • Flexible, remote-friendly environment with low bureaucracy
  • Health, wellness, and learning & development benefits
  • Opportunity to work with a high-caliber team on frontier AI infrastructure

About the Company

Bespoke Labs is an AI research and data company building the datasets, benchmarks, and evaluation infrastructure that power frontier AI models. Backed by leading investors and trusted by top AI labs, our small, fast-moving team has an outsized impact on how the next generation of AI is built.

Skills & tools

AWS, Kubernetes, Terraform, Python, Go, GitHub Actions

What the team is looking for

Use this list as a quick fit check before you apply.

  1. 3–5 years in DevOps, SRE, or infrastructure engineering
  2. Strong AWS experience (EKS, EC2, RDS, S3, IAM)
  3. Kubernetes deployment and troubleshooting
  4. CI/CD pipelines (GitHub Actions, ArgoCD)
  5. Infrastructure as Code (Terraform, Pulumi, or CDK)
  6. Python or Go scripting