Site Reliability Engineer

Chess.com

Completely Remote | Full Time | Engineering & Architecture
Posted Today

Job description

Responsibilities

  • Design and implement multi-regional resilient infrastructure to handle millions of concurrent sessions.
  • Lead hybrid cloud migration strategies, integrating bare-metal resources with cloud services.
  • Own on-call rotations and incident response procedures to maintain high-availability SLAs.
  • Architect monitoring and alerting systems to proactively identify performance bottlenecks.
  • Collaborate with development teams to implement infrastructure-as-code and CI/CD pipelines.
  • Optimize system performance through capacity planning, load testing, and resource allocation.
  • Drive automation initiatives to reduce manual operational overhead.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or a related technical field.
  • 5+ years of experience in SRE, DevOps, or infrastructure engineering.
  • Strong proficiency with UNIX/Linux operating systems and command-line administration.
  • Experience with cloud platforms (GCP, AWS, or Azure) and Infrastructure-as-Code (Terraform, CloudFormation).
  • Hands-on experience with configuration management (Ansible, Chef, or Puppet).
  • Solid understanding of networking fundamentals (TCP/IP, HTTP, DNS).
  • Experience with containerization and orchestration (Docker, Kubernetes).
  • Proficiency with monitoring and observability tools (Datadog, Prometheus, Grafana).

Preferred Qualifications

  • Experience managing bare-metal server infrastructure and datacenter operations.
  • Proficiency with scripting languages such as Python, Go, or Bash.
  • Background in high-availability architectures and disaster recovery planning.
  • Experience with game server infrastructure or real-time application hosting.
  • Previous experience working in a fully remote, distributed environment.

About the Company

Chess.com is one of the largest gaming sites in the world and the #1 platform for playing, learning, and enjoying chess. We are a team of over 600 fully remote people in 60+ countries working to support 250M+ chess players worldwide. We prize our mission-driven, flat, and no-corporate culture.

Skills & tools

Linux, Terraform, Kubernetes, Docker, AWS, GCP, Python, Go, Ansible

What the team is looking for

Use this list as a quick fit check before you apply.

  1. Bachelor's degree in Computer Science or related field
  2. 5+ years in SRE, DevOps, or infrastructure engineering
  3. Proficiency with UNIX/Linux
  4. Experience with cloud platforms (GCP, AWS, or Azure)
  5. Infrastructure-as-code expertise (Terraform, CloudFormation)
  6. Configuration management (Ansible, Chef, Puppet)
  7. Networking fundamentals (TCP/IP, HTTP, DNS)
  8. Containerization (Docker, Kubernetes)
  9. Monitoring tools (Datadog, Prometheus, Grafana)