Infrastructure Engineer and SRE
About the role
Software Engineer, Infrastructure / Site Reliability Engineer
Location: San Francisco, CA, 4 days per week onsite
Salary Range: $180K to $250K base plus equity
Company Overview
A client of Crosscheck Staffing is building the infrastructure layer for AI agents in production. The platform enables agents to write code, use browsers, interact with computers, and execute complex workflows safely and reliably for enterprise customers. The company was founded by former Google and Stripe engineering leaders, including one of the co-founders of Google Wallet, and the full founding team has deep infrastructure experience.
The company is small, technical, and operating at a high-ownership stage. They are already seeing strong enterprise demand, including regulated and defense-adjacent use cases, and are now hiring foundational infrastructure engineers who can help scale the platform.
This is a strong fit for engineers who want to work close to the metal on Kubernetes, containers, networking, cloud infrastructure, secure execution environments, observability, and distributed systems.
Role 1: Software Engineer, Infrastructure
The Infrastructure role is focused on building the core systems that power secure AI agent execution. This person will work on the platform layer that allows agents to run workloads safely, quickly, and reliably across cloud environments.
This role is a fit for someone who enjoys building foundational infrastructure, not just maintaining it. The ideal candidate has strong hands-on experience with Kubernetes, Docker, Linux, networking, AWS or GCP, Terraform or Pulumi, and distributed systems.
What you will work on
Build and scale secure infrastructure for AI agent workloads
Design and operate sandboxed execution environments, containerized systems, and distributed job orchestration
Improve performance across the platform, with a constant focus on speed, reliability, and efficiency
Build secure VPC deployments for enterprise and regulated customers
Work on infrastructure involving Kubernetes, Docker, Docker-in-Docker, microVMs, Terraform, Pulumi, AWS, GCP, Grafana, and Prometheus
Debug complex production issues across containers, networking, Linux systems, cloud primitives, and distributed services
Own systems from design through production deployment
Strong fit signals
Strong production experience with Kubernetes, Docker, cloud infrastructure, and distributed systems
Deep knowledge of at least one infrastructure layer, such as containers, networking, Linux, storage, or cloud primitives
Experience building infrastructure systems from scratch
Strong debugging ability below the surface of managed cloud tooling
Background from a strong infrastructure-heavy company or top engineering environment
Comfortable working directly with founders in a small, fast-moving startup
Role 2: Site Reliability Engineer
The SRE role is focused on keeping our client’s production infrastructure reliable, observable, secure, and scalable as customer demand grows. This person will own reliability practices, monitoring, alerting, incident response, deployment safety, and automation.
This role is a fit for someone who has operated production systems at scale and can improve reliability without adding unnecessary process. The ideal candidate has hands-on experience with Kubernetes, Terraform or Pulumi, observability, incident response, SLOs, cloud infrastructure, and automation.
What you will work on
Own production reliability across our client’s infrastructure platform
Build and improve monitoring, alerting, dashboards, and observability workflows
Lead incident response, root cause analysis, and postmortems
Automate deployments, scaling, provisioning, and recovery tasks
Improve developer experience through safer releases and better operational tooling
Work with Grafana, Prometheus, Terraform, Pulumi, Docker, Kubernetes, Python or Go, AWS, GCP, Azure, and PagerDuty-style workflows
Help keep infrastructure highly available, secure, and ready for enterprise customers
Strong fit signals
3+ years of dedicated SRE, production infrastructure, or platform reliability experience
Strong hands-on experience with Kubernetes, Docker, Terraform or Pulumi, Grafana, and Prometheus
Experience with incident response, on-call, SLOs, SLIs, alerting, and production debugging
Ability to automate reliability work with Python, Go, Bash, or infrastructure tooling
Experience scaling infrastructure, not just maintaining it
Background from a strong engineering company or infrastructure-heavy environment
Ideal Candidate Background
Our client is prioritizing candidates with strong recent full-time experience at respected infrastructure or engineering companies. Target backgrounds include companies such as:
Google, Meta, AWS, Apple, Stripe, Cloudflare, Datadog, Snowflake, Databricks, Nvidia, CoreWeave, Crusoe, Netflix, Uber, Airbnb, LinkedIn, Atlassian, HashiCorp, Docker, Grafana Labs, Fastly, Akamai, Confluent, Vercel, Render, Tailscale, Temporal, Box, Palo Alto Networks, Pure Storage, Affirm, Splunk, or similar engineering environments.
The strongest candidates will have both company pedigree and real hands-on ownership. A strong logo alone is not enough. They need to have built, operated, or debugged real production infrastructure.
Must Have
Hands-on IC mindset
Strong production infrastructure experience
Kubernetes and containerization experience
Terraform or Pulumi experience
Cloud infrastructure experience with AWS, GCP, or Azure
Strong debugging ability across Linux, networking, containers, and cloud systems
End-to-end ownership of production systems
Ability to work onsite in San Francisco 4 days per week
No current or future visa sponsorship requirement
Nice to Have
Experience with AI infrastructure, ML workloads, GPU clusters, or agent execution environments
Experience with secure execution environments, sandboxes, microVMs, or runtime isolation
Experience with regulated customers, defense, FedRAMP, SOC 2, or high-security enterprise environments
GitOps experience with ArgoCD or Flux
Go or Python experience for infrastructure automation or platform services
Experience with service mesh, CNI, container runtime, or low-level Kubernetes debugging
Compensation
This DevOps / SRE role pays $180K to $250K per year, which is within the typical range for DevOps / SRE roles in the United States.