Staff Site Reliability Engineer

Blink Health

Completely Remote | Full Time | Information Technology
Posted 1 month ago

Job description

Responsibilities

  • Establish and evolve SRE best practices across the organization, including reliability principles, error budgets, incident response, postmortems, and operational readiness standards.
  • Define and drive observability strategy for system health, performance, and reliability, including SLIs/SLOs, alerting quality, dashboards, and service health indicators.
  • Design and implement software-driven solutions within the infrastructure domain, automating manual processes and eliminating operational complexity and toil.
  • Act as a technical leader and force multiplier, helping set priorities and influencing decision-making across core cloud infrastructure, reliability tooling, and platform architecture.
  • Take ownership of large, ambiguous initiatives, driving them from concept to delivery while aligning stakeholders across engineering, security, and product.
  • Combine deep knowledge of software development, infrastructure, and security to improve platform resilience, scalability, performance, and compliance.
  • Proactively identify systemic risks and reliability gaps, recommending and leading platform upgrades and architectural improvements before they become incidents.
  • Partner with engineering teams to improve developer workflows, tooling, and operational maturity, increasing productivity while reducing cognitive load.
  • Provide technical mentorship, architecture guidance, and high-quality design and code reviews for engineers across infrastructure and product teams.
  • Lead by example in documentation and knowledge sharing, ensuring systems and processes are well-understood and not dependent on individual ownership.
  • Participate in and help mature incident response, escalation practices, and post-incident learning across the organization.

Requirements

  • Bachelor’s or Master’s degree in Computer Science or equivalent practical experience.
  • 7+ years of experience in site reliability engineering, infrastructure engineering, or platform engineering roles, with demonstrated impact at scale.

Preferred Qualifications

  • Expert-level, methodical troubleshooting across the entire stack, from application to kernel to network.
  • Strong command-line proficiency and deep expertise in Linux systems and operating system fundamentals.
  • Advanced understanding of networking concepts including load balancing, proxies, DNS, TCP/IP, NAT, and service-to-service communication.
  • Experience working across multiple languages (e.g., Python, Go, Bash) and troubleshooting application stacks built on frameworks such as React.
  • Strong track record of automating repetitive and complex operational work to reduce toil and increase reliability.
  • Ability to design and build internal tools (Python or Go) that standardize and scale engineering practices.
  • Comfortable operating in an agile environment, with disciplined testing and quality practices.
  • Deep experience with cloud platforms (AWS preferred, GCP/Azure acceptable), particularly managed services and production-grade architectures.
  • Strong expertise in Kubernetes and container orchestration (EKS, Helm), including lifecycle management and operational best practices.
  • Proven experience designing and implementing observability systems, including metrics, logging, tracing, dashboards, and alerting.
  • Deep understanding of container technologies, security scanning, secrets management, dynamic configuration, and microservices architectures.
  • Familiarity with service meshes and advanced traffic management concepts.
  • Experience designing and maintaining company-wide IaC codebases using tools such as Terraform, Pulumi, CloudFormation, or Ansible.
  • Ability to think holistically about infrastructure design, cost, reliability, security, and long-term maintainability.

About the Company

Blink Health is the fastest-growing healthcare technology company that builds products to make prescriptions accessible and affordable to everybody. Our two primary products – BlinkRx and Quick Save – remove traditional roadblocks within the current prescription supply chain, resulting in better access to critical medications and improved health outcomes for patients.

BlinkRx is the world’s first pharma-to-patient cloud that offers a digital concierge service for patients who are prescribed branded medications. Patients benefit from transparent low prices, free home delivery, and world-class support on this first-of-its-kind centralized platform. With BlinkRx, never again will a patient show up at the pharmacy only to discover that they can’t afford their medication, their doctor needs to fill out a form for them, or the pharmacy doesn’t have the medication in stock.

We are a highly collaborative team of builders and operators who invent new ways of working in an industry that historically has resisted innovation. Join us!

Skills & tools

SRE, Cloud Infrastructure, Kubernetes, Linux, Networking, Python, Go, Terraform, Observability, Automation

What the team is looking for

  1. Bachelor's Degree
  2. 7+ Years of Experience