Site Reliability Engineer
Washington, D.C., USA
Full-time
Strategic Site Reliability Engineer: Global Network Orchestration Platform
The Opportunity: Design the core reliability platform for the final frontier of space mesh networking. This is a strategic, high-impact mandate within a high-growth, fast-paced startup building the next generation of software-defined networks for satellite megaconstellations and aerospace fleets. We seek technical leaders ready to architect mission-critical systems and drive platform maturity.
Technical Skills & Proficiencies Required
- Observability Platform Mastery: Deep, hands-on expertise in the architecture, scaling, and management of production observability stacks: Prometheus, OpenTelemetry, Grafana, Loki, and distributed tracing systems.
- Cloud & Orchestration: Expert-level production experience with Kubernetes and GCP. Expertise in multi-cloud (AWS) environments is highly preferred.
- Reliability Engineering: Proven ability to define, implement, and manage robust SLOs, SLIs, and Error Budgets for high-availability distributed systems, crucial for mission readiness.
- Automation & IaC: Mastery of Infrastructure as Code (Terraform) and GitOps (ArgoCD) for automated deployment and scaling across complex cloud environments.
- Programming Proficiency: Strong command of systems programming; fluency in Go and/or Python is required for developing and optimizing platform tooling.
- Preferred Domain Expertise: Experience with Service Mesh (Istio/Linkerd), instrumenting applications in Go/C++, and working with HPC environments (CPU/GPU workloads).
Mandatory Security Requirements
- US Citizenship is required.
- An active Secret security clearance or higher is strongly preferred.