SRE
At a glance
Highlights
- Competitive compensation with equity
- 100% medical, dental, vision coverage for employees and dependents
- Flexible PTO and company-wide winter break
Why this role might suit you
The role offers a chance to shape reliability for a fast‑growing AI platform, working with cutting‑edge Kubernetes and observability tooling while enjoying strong compensation and comprehensive benefits.
About the role
ABOUT BASETEN
Baseten powers mission-critical inference for the world's most dynamic AI companies, like Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting-edge models into production. We're growing quickly and recently raised our $300M Series E (https://www.baseten.co/blog/announcing-baseten-s-300m-series-e/), backed by investors including BOND, IVP, Spark Capital, Greylock, and Conviction. Join us and help build the platform that engineers turn to when shipping AI products.
THE ROLE
As a Site Reliability Engineer at Baseten, you'll define and codify the gold standards of day 2 operations for our ML infrastructure platform. You'll envision and build robust systems, processes, automations, and observability tooling that keep our platform reliable at scale — and that empower the broader organization to operate confidently.
You'll work closely with engineering, forward-deployed, and product teams: learning from recurring failure patterns, turning tribal knowledge into automated mitigations, and raising the operational floor for the entire company.
EXAMPLE INITIATIVES
You'll work on projects like these as part of the SRE team:
- Improving Baseten's SRE practices by instrumenting SLOs and SLIs and by improving alerting and observability for all services.
- Building AI-assisted tooling for incident triage and response.
RESPONSIBILITIES
- Own the reliability of Baseten's multi-cloud Kubernetes infrastructure, including incident response, post-mortems, and remediation tracking.
- Build and maintain observability infrastructure — metrics, logging, dashboards, and alerting — as code.
- Author, validate, and improve runbooks for recurring failure patterns, ensuring they're structured for low-context, safe execution.
- Identify high-frequency failure patterns and convert them into automated mitigations or self-healing automations.
- Diagnose and resolve runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management.
- Define and instrument SLOs and SLIs across customer workloads and internal services.
- Navigate ambiguity, make principled tradeoffs, and avoid unnecessary complexity in the systems you build and the processes you define.
REQUIREMENTS
- Extensive hands-on experience with Kubernetes (multi-cloud experience across EKS, GKE, or similar is a strong plus).
- Experience in building and maintaining scalable infrastructure.
- Strong foundation in observability tooling: metrics (VictoriaMetrics, Prometheus), logging (Loki, ELK), dashboards (Grafana), and alerting pipelines. Observability-as-code experience is a plus.
- Experience with infrastructure-as-code (Terraform, Helm) and GitOps workflows (Flux CD, ArgoCD).
- Experience writing and improving runbooks, leading incident response, and doing post-mortem analysis.
- Comfort working at the intersection of engineering and operations — you write code, but you also think deeply about process, escalation paths, and operational leverage.
- Familiarity with incident management platforms (incident.io or similar) is a plus.
- No prior ML experience required, but curiosity about how ML models are deployed and served at scale will serve you well.
BENEFITS
- Competitive compensation, including meaningful equity.
- 100% coverage of medical, dental, and vision insurance for employees and dependents
- Flexible PTO policy, including a company-wide winter break (our offices are closed from Christmas Eve to New Year's Day!)
- Paid parental leave
- Fertility and family-building stipend through Carrot
- Company-facilitated 401(k)
- Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
At Baseten, we are committed to fostering a diverse and inclusive workplace. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.
Compensation
This DevOps / SRE role pays $135k-$285k/yr, which is within the typical range for DevOps / SRE roles in the United States.