Remote Systems Reliability Engineer

Introduction: Elevate Infrastructure Reliability from Anywhere

Are you a systems thinker with a passion for uptime, observability, and infrastructure at scale? We are looking for a Remote Systems Reliability Engineer who thrives in distributed environments and is excited to optimize the availability and performance of mission-critical services. This remote-first role lets you work on globally scaled systems that demand reliability-first engineering. With an annual salary of $178,470, the position rewards technical excellence and proactive problem-solving.

Key Responsibilities

Systems Design & Automation

Design, implement, and maintain scalable system architectures that ensure continuous availability and fault tolerance across multiple environments.
Leverage infrastructure-as-code principles using tools like Terraform, Pulumi, or AWS CDK to build and manage resilient systems.

Observability & Monitoring

Enhance observability with proactive logging, metrics, and tracing strategies utilizing platforms such as Datadog, Prometheus, or OpenTelemetry.
Develop automated monitoring and alerting systems that minimize false positives and reduce mean time to detect (MTTD) and mean time to resolve (MTTR).

Incident Management & Reliability

Conduct thorough root cause analysis for incidents and drive long-term remediation through blameless postmortems.
Strengthen system robustness through stress testing, chaos engineering, and load simulation.
Be part of an on-call schedule to address critical infrastructure issues and improve incident response workflows.

Collaboration & SLOs

Collaborate with software engineers, product teams, and infrastructure stakeholders to create service-level objectives (SLOs) and service-level indicators (SLIs).

Work Environment & Culture

Join a team of performance-obsessed engineers committed to transparency, collaboration, and engineering excellence. We promote a culture of ownership and continuous improvement, empowering you to take initiative and experiment with new approaches. As part of our globally distributed infrastructure group, you'll work with peers across time zones who value diverse thinking and respect deep focus. Our remote-first philosophy is grounded in trust, flexibility, and purpose-driven execution.

Tools, Frameworks, and Technologies

Cloud Platforms

AWS
Google Cloud Platform (GCP)
Azure

Infrastructure as Code

Terraform
CloudFormation
Ansible

CI/CD & Deployment

Jenkins
GitHub Actions
Argo CD

Monitoring & Observability

Grafana
Prometheus
New Relic
Honeycomb

Alerting & Incident Tools

PagerDuty
Opsgenie

Containers & Orchestration

Docker
Kubernetes
Helm

Scripting Languages

Python
Bash
Go

Performance Metrics & Data Impact

In the last 12 months, our engineering team has:

Reduced production downtime by 43% through targeted reliability enhancements.
Lowered incident resolution time by 61% using improved automation workflows.
Increased deployment frequency by 28% without compromising stability.

You’ll contribute to these measurable improvements and help raise the bar even higher, applying your infrastructure performance expertise to large-scale systems.

Qualifications and Expertise

Proven experience (5+ years) in SRE, DevOps, or systems engineering roles with a track record of maintaining high-availability platforms
Proficiency in at least one major cloud platform (AWS preferred)
Deep understanding of distributed systems, load balancing, and network protocols
Hands-on experience with CI/CD best practices and release automation
Demonstrated ability to diagnose production failures, improve observability, and prevent future issues
Strong scripting capabilities for building internal tooling and automation
Familiarity with compliance frameworks (SOC 2, ISO 27001) is a plus
An academic foundation in computing or an engineering discipline, or equivalent hands-on professional experience

Opportunities for Growth

This position offers direct influence over systems architecture decisions, capacity planning strategies, and cross-functional incident workflows. You’ll have a voice in setting operational standards and evaluating emerging technologies like eBPF, serverless platforms, and distributed tracing solutions. Team members regularly present their work at internal guilds and industry conferences, opening doors to thought leadership opportunities and advanced technical training.

Who You Are

You’re not just passionate about systems reliability—you live for continuous improvement, data-informed decision-making, and proactive risk mitigation. You understand the nuances of working remotely and are adept at asynchronous collaboration and structured communication. You appreciate clean code, scalable architecture, and metrics that tell a story. Whether you're deploying an automated canary release or refactoring alerting thresholds, you're driven by impact and clarity.

Visual Overview of Your Impact

Area of Focus	Key Contribution	Tool/Method
Monitoring	Build high-fidelity dashboards	Grafana, Prometheus
System Resilience	Implement self-healing workflows	Kubernetes, Runbooks
Incident Response	Drive reduced MTTR	PagerDuty, Postmortems
Automation	Scale config management	Ansible, Terraform
Cross-Team Collaboration	Shape SLIs and operational policies	Slack, Notion, GitHub

Call to Action: Build Resilience That Scales

Are you ready to take your infrastructure skills to the next level and help build a future-proof platform for millions of users worldwide? Join us as a Remote Systems Reliability Engineer and be at the heart of operational excellence. Here, your ideas matter, your voice is heard, and your impact is global. Apply now to take the lead in engineering systems that endure.

Earn $178,470 annually while redefining reliability in the cloud era—from wherever you thrive most.
