Senior Site Reliability Engineer
About the role
No H1 or C2C. Must be a Permanent Resident or US Citizen.
Description and Requirements
About Our Team
We are building Quantum, a next‑generation hybrid AI platform that spans Windows, Android, and cloud. As part of this vision, we are expanding the reliability engineering organization that powers cross‑device Personal AI.
We are looking for Senior Site Reliability Engineers (SREs) to help us build and evolve the foundational reliability, observability, and operations capabilities that keep the platform fast, safe, and dependable for millions of users.
This role may support one of several teams within the SRE organization (e.g., Observability, Operations, or Service Reliability), depending on your strengths and interests.
We operate with the speed, ownership, and creative latitude of a startup, yet with scale, resources, and technical depth behind us. We are building new systems, new tooling, and new operational models from the ground up, and we are doing so with clarity, intention, and high engineering standards.
Location: Open to remote work in the US. The preferred work location is Chicago, IL.
What You Might Work On
As a Senior SRE, you may be responsible for a subset of the following, depending on team placement and skill alignment:
Reliability & Performance Engineering
Improving the availability, scalability, and performance of distributed systems across device, edge, and cloud.
Defining or refining SLIs, SLOs, and error budgets for critical services.
Leading initiatives to remove single points of failure, improve resilience, and reduce operational risk.
Operational Excellence
Participating in on‑call rotations and contributing to incident response, triage, and post-incident reviews.
Developing automation, runbooks, and self‑healing systems to reduce alert noise and MTTR.
Enhancing operational readiness and supporting incident prevention programs.
Observability & Insight
Designing or improving observability systems using OpenTelemetry, Grafana, and modern signal pipelines.
Building dashboards, analytics, and alerting that illuminate system health and AI service behavior.
Ensuring telemetry is reliable, actionable, and tied to real‑world outcomes.
Deployments & Change Safety
Improving reliability of CI/CD workflows, including phased rollouts, canaries, shadow testing, and safe rollback mechanisms.
Contributing to the evolution of deployment tooling for hybrid device + edge + cloud systems.
Systems Design & Collaboration
Influencing architectural decisions by injecting reliability, observability, and operational considerations early in design.
Collaborating with AI/ML engineers, platform engineers, firmware teams, and product partners to deliver robust, dependable user experiences.
Basic Qualifications
10+ years of experience in Site Reliability Engineering, Production Engineering, DevOps, or large‑scale distributed systems operations
Bachelor’s Degree in Computer Science, Engineering, or a related technical discipline
Strong experience running production distributed systems at scale
Proficiency in at least one modern programming language (e.g., Python, Go, Java, C++)
Strong understanding of Linux systems, networking fundamentals, and system performance tuning
Experience with monitoring/observability (metrics, logs, tracing)
Hands‑on experience with cloud environments (Azure, AWS, or GCP)
Experience in incident management, on‑call rotations, and postmortem processes
Preferred Qualifications
Deep experience with Azure cloud services
Experience with OpenTelemetry for end‑to‑end instrumentation
Strong familiarity with Grafana, Prometheus, Loki, Tempo, or similar tools
Experience supporting AI/ML systems, model serving, or data‑intensive workloads
Background with hybrid architectures (device + edge + cloud)
Experience improving deployment reliability and progressive delivery systems
Passion for automation, reliability engineering, and reducing operational friction
What Success Looks Like
Systems become more observable, reliable, and predictable.
Incidents are resolved quickly, and follow‑up improvements prevent recurrence.
Alerting becomes more accurate, actionable, and trusted.
Deployments become safer and more consistent.
Teams move faster because reliability foundations are strong and intuitive.