Remote Systems Reliability Engineer

Introduction: Elevate Infrastructure Reliability from Anywhere

Are you a systems thinker with a passion for uptime, observability, and infrastructure at scale? We are looking for a Remote Systems Reliability Engineer who thrives in distributed environments and is excited to optimize the availability and performance of mission-critical services. In this remote-first role, you will work on globally scaled systems that demand reliability-first engineering. The position pays an annual salary of $178,470 and rewards technical excellence and proactive problem-solving.

Key Responsibilities

Systems Design & Automation

  • Design, implement, and maintain scalable system architectures that ensure continuous availability and fault tolerance across multiple environments.
  • Leverage infrastructure-as-code principles using tools like Terraform, Pulumi, or AWS CDK to build and manage resilient systems.

Observability & Monitoring

  • Enhance observability with proactive logging, metrics, and tracing strategies utilizing platforms such as Datadog, Prometheus, or OpenTelemetry.
  • Develop automated monitoring and alerting systems that minimize false positives and reduce mean time to detect (MTTD) and mean time to resolve (MTTR).

Incident Management & Reliability

  • Conduct thorough root cause analysis for incidents and drive long-term remediation through blameless postmortems.
  • Enhance system robustness through stress testing, chaos engineering, and load simulations.
  • Participate in an on-call rotation to address critical infrastructure issues and improve incident response workflows.

Collaboration & SLOs

  • Collaborate with software engineers, product teams, and infrastructure stakeholders to create service-level objectives (SLOs) and service-level indicators (SLIs).

Work Environment & Culture

Join a team of performance-obsessed engineers committed to transparency, collaboration, and engineering excellence. We promote a culture of ownership and continuous improvement, empowering you to take initiative and experiment with new approaches. As part of our globally distributed infrastructure group, you'll work with peers across time zones who value diverse thinking and respect deep focus. Our remote-first philosophy is grounded in trust, flexibility, and purpose-driven execution.

Tools, Frameworks, and Technologies

Cloud Platforms

  • AWS
  • Google Cloud Platform (GCP)
  • Azure

Infrastructure as Code

  • Terraform
  • CloudFormation
  • Ansible

CI/CD & Deployment

  • Jenkins
  • GitHub Actions
  • Argo CD

Monitoring & Observability

  • Grafana
  • Prometheus
  • New Relic
  • Honeycomb

Alerting & Incident Tools

  • PagerDuty
  • Opsgenie

Containers & Orchestration

  • Docker
  • Kubernetes
  • Helm

Scripting Languages

  • Python
  • Bash
  • Go

Performance Metrics & Data Impact

In the last 12 months, our engineering team has:

  • Reduced production downtime by 43% through targeted reliability enhancements.
  • Lowered incident resolution time by 61% using improved automation workflows.
  • Increased deployment frequency by 28% without compromising stability.

You’ll contribute to these measurable improvements and help raise the bar even higher, applying your infrastructure performance expertise to large-scale systems.

Qualifications and Expertise

  • Proven experience (5+ years) in SRE, DevOps, or systems engineering roles with a track record of maintaining high-availability platforms
  • Proficiency in at least one major cloud platform (AWS preferred)
  • Deep understanding of distributed systems, load balancing, and network protocols
  • Hands-on experience with CI/CD best practices and release automation
  • Demonstrated ability to diagnose production failures, improve observability, and prevent future issues
  • Strong scripting capabilities for building internal tooling and automation
  • Familiarity with compliance frameworks (SOC 2, ISO 27001) is a plus
  • A degree in computer science or an engineering discipline, or equivalent hands-on professional experience

Opportunities for Growth

This position offers direct influence over systems architecture decisions, capacity planning strategies, and cross-functional incident workflows. You’ll have a voice in setting operational standards and evaluating emerging technologies like eBPF, serverless platforms, and distributed tracing solutions. Team members regularly present their work at internal guilds and industry conferences, opening doors to thought leadership opportunities and advanced technical training.

Who You Are

You’re not just passionate about systems reliability—you live for continuous improvement, data-informed decision-making, and proactive risk mitigation. You understand the nuances of working remotely and are adept at asynchronous collaboration and structured communication. You appreciate clean code, scalable architecture, and metrics that tell a story. Whether you're deploying an automated canary release or refactoring alerting thresholds, you're driven by impact and clarity.

Visual Overview of Your Impact

Area of Focus            | Key Contribution                    | Tool/Method
Monitoring               | Build high-fidelity dashboards      | Grafana, Prometheus
System Resilience        | Implement self-healing workflows    | Kubernetes, Runbooks
Incident Response        | Drive reduced MTTR                  | PagerDuty, Postmortems
Automation               | Scale config management             | Ansible, Terraform
Cross-Team Collaboration | Shape SLIs and operational policies | Slack, Notion, GitHub

Call to Action: Build Resilience That Scales

Are you ready to bring your infrastructure skills to the next level and help build a future-proof platform for millions of users worldwide? Join us as a Remote Systems Reliability Engineer and be at the heart of operational excellence. Here, your ideas matter, your voice is heard, and your impact is global. Apply now to take the lead in engineering systems that endure.

Earn $178,470 annually while redefining reliability in the cloud era—from wherever you thrive most.