Site Reliability Engineer - SRE

Washington, D.C., USA Full-time

Strategic Site Reliability Engineer: Global Network Orchestration Platform

The Opportunity: Design the core reliability platform for the final frontier of space mesh networking. This is a strategic, high-impact mandate at a high-growth, fast-paced startup that is building the next generation of software-defined networks for satellite megaconstellations and aerospace fleets. We seek technical leaders ready to architect mission-critical systems and drive platform maturity.

Technical Skills & Proficiencies Required

Observability Platform Mastery: Deep, hands-on expertise in the architecture, scaling, and management of production observability stacks: Prometheus, OpenTelemetry, Grafana, Loki, and distributed tracing systems.
Cloud & Orchestration: Expert-level production experience with Kubernetes and GCP. Expertise in multi-cloud (AWS) environments is highly preferred.
Reliability Engineering: Proven ability to define, implement, and manage robust SLOs, SLIs, and Error Budgets for high-availability distributed systems, crucial for mission readiness.
Automation & IaC: Mastery of Infrastructure as Code (Terraform) and GitOps (ArgoCD) for automated deployment and scaling across complex cloud environments.
Programming Proficiency: Strong command of systems programming; fluency in Go and/or Python is required for developing and optimizing platform tooling.
Preferred Domain Expertise: Experience with service meshes (Istio/Linkerd), instrumenting applications in Go/C++, and working with HPC environments (CPU/GPU workloads).

Mandatory Security Requirements

US Citizenship is required.
An active Secret security clearance or higher is strongly preferred.
