Remote Systems Reliability Engineer
Description
Introduction: Elevate Infrastructure Reliability from Anywhere
Are you a systems thinker with a passion for uptime, observability, and infrastructure at scale? We are looking for a Remote Systems Reliability Engineer who thrives in distributed environments and is excited to optimize the availability and performance of mission-critical services. This remote-first role lets you work on globally scaled systems that demand reliability-first engineering. With an annual salary of $178,470, the position rewards technical excellence and proactive problem-solving.
Key Responsibilities
Systems Design & Automation
- Design, implement, and maintain scalable system architectures that ensure continuous availability and fault tolerance across multiple environments.
- Leverage infrastructure-as-code principles using tools like Terraform, Pulumi, or AWS CDK to build and manage resilient systems.
Observability & Monitoring
- Enhance observability with proactive logging, metrics, and tracing strategies utilizing platforms such as Datadog, Prometheus, or OpenTelemetry.
- Develop automated monitoring and alerting systems that minimize false positives and reduce mean time to detect (MTTD) and mean time to resolve (MTTR).
Incident Management & Reliability
- Conduct thorough root cause analysis for incidents and drive long-term remediation through blameless postmortems.
- Enhance system robustness through stress testing, chaos engineering, and load simulations.
- Participate in an on-call rotation to address critical infrastructure issues and improve incident response workflows.
Collaboration & SLOs
- Collaborate with software engineers, product teams, and infrastructure stakeholders to create service-level objectives (SLOs) and service-level indicators (SLIs).
Work Environment & Culture
Join a team of performance-obsessed engineers committed to transparency, collaboration, and engineering excellence. We promote a culture of ownership and continuous improvement, empowering you to take initiative and experiment with new approaches. As part of our globally distributed infrastructure group, you'll work with peers across time zones who value diverse thinking and respect deep focus. Our remote-first philosophy is grounded in trust, flexibility, and purpose-driven execution.
Tools, Frameworks, and Technologies
Cloud Platforms
- AWS
- Google Cloud Platform (GCP)
- Azure
Infrastructure as Code
- Terraform
- CloudFormation
- Ansible
CI/CD & Deployment
- Jenkins
- GitHub Actions
- Argo CD
Monitoring & Observability
- Grafana
- Prometheus
- New Relic
- Honeycomb
Alerting & Incident Tools
- PagerDuty
- Opsgenie
Containers & Orchestration
- Docker
- Kubernetes
- Helm
Scripting Languages
- Python
- Bash
- Go
Performance Metrics & Data Impact
In the last 12 months, our engineering team has:
- Reduced production downtime by 43% through targeted reliability enhancements.
- Lowered incident resolution time by 61% using improved automation workflows.
- Increased deployment frequency by 28% without compromising stability.
You’ll contribute to these measurable improvements and help raise the bar even further, applying your infrastructure performance expertise to large-scale systems.
Qualifications and Expertise
- Proven experience (5+ years) in SRE, DevOps, or systems engineering roles with a track record of maintaining high-availability platforms
- Proficiency in at least one major cloud platform (AWS preferred)
- Deep understanding of distributed systems, load balancing, and network protocols
- Hands-on experience with CI/CD best practices and release automation
- Demonstrated ability to diagnose production failures, improve observability, and prevent future issues
- Strong scripting capabilities for building internal tooling and automation
- Familiarity with compliance frameworks (SOC 2, ISO 27001) is a plus
- A degree in computer science, engineering, or a related discipline, or equivalent hands-on professional experience
Opportunities for Growth
This position offers direct influence over systems architecture decisions, capacity planning strategies, and cross-functional incident workflows. You’ll have a voice in setting operational standards and evaluating emerging technologies like eBPF, serverless platforms, and distributed tracing solutions. Team members regularly present their work at internal guilds and industry conferences, opening doors to thought leadership opportunities and advanced technical training.
Who You Are
You’re not just passionate about systems reliability; you live for continuous improvement, data-informed decision-making, and proactive risk mitigation. You understand the nuances of working remotely and are adept at asynchronous collaboration and structured communication. You appreciate clean code, scalable architecture, and metrics that tell a story. Whether you're deploying an automated canary release or tuning alerting thresholds, you're driven by impact and clarity.
Visual Overview of Your Impact
Area of Focus | Key Contribution | Tool/Method |
---|---|---|
Monitoring | Build high-fidelity dashboards | Grafana, Prometheus |
System Resilience | Implement self-healing workflows | Kubernetes, Runbooks |
Incident Response | Drive reduced MTTR | PagerDuty, Postmortems |
Automation | Scale config management | Ansible, Terraform |
Cross-Team Collaboration | Shape SLIs and operational policies | Slack, Notion, GitHub |
Call to Action: Build Resilience That Scales
Are you ready to take your infrastructure skills to the next level and help build a future-proof platform for millions of users worldwide? Join us as a Remote Systems Reliability Engineer and be at the heart of operational excellence. Here, your ideas matter, your voice is heard, and your impact is global. Apply now to take the lead in engineering systems that endure.
Earn $178,470 annually while redefining reliability in the cloud era—from wherever you thrive most.