
Infrastructure Engineer and SRE

Crosscheck Staffing

San Francisco, US | Onsite | $180k-$250k/yr | Posted May 18, 2026

Skills

Cloudflare, Kubernetes, Databricks, Prometheus, Snowflake, Terraform, PagerDuty, Grafana, Datadog, Vercel, Docker, Pulumi, Python, Azure, Google Cloud, AWS, Go, ML

About the role

Software Engineer, Infrastructure / Site Reliability Engineer

Location: San Francisco, CA, 4 days per week onsite

Salary Range: $180K to $250K base plus equity

Company Overview

A client of Crosscheck Staffing is building the infrastructure layer for AI agents in production. The platform enables agents to write code, use browsers, interact with computers, and execute complex workflows safely and reliably for enterprise customers. The company was founded by former Google and Stripe engineering leaders, including one of the co-founders of Google Wallet, and the full founding team has deep infrastructure experience.

The company is small, technical, and operating at a high-ownership stage. They are already seeing strong enterprise demand, including regulated and defense-adjacent use cases, and are now hiring foundational infrastructure engineers who can help scale the platform.

This is a strong fit for engineers who want to work close to the metal on Kubernetes, containers, networking, cloud infrastructure, secure execution environments, observability, and distributed systems.

Role 1: Software Engineer, Infrastructure

The Infrastructure role is focused on building the core systems that power secure AI agent execution. This person will work on the platform layer that allows agents to run workloads safely, quickly, and reliably across cloud environments.

This role is a fit for someone who enjoys building foundational infrastructure, not just maintaining it. The ideal candidate has strong hands-on experience with Kubernetes, Docker, Linux, networking, AWS or GCP, Terraform or Pulumi, and distributed systems.

What you will work on

Build and scale secure infrastructure for AI agent workloads

Design and operate sandboxed execution environments, containerized systems, and distributed job orchestration

Improve performance across the platform, with a constant focus on speed, reliability, and efficiency

Build secure VPC deployments for enterprise and regulated customers

Work on infrastructure involving Kubernetes, Docker, Docker-in-Docker, microVMs, Terraform, Pulumi, AWS, GCP, Grafana, and Prometheus

Debug complex production issues across containers, networking, Linux systems, cloud primitives, and distributed services

Own systems from design through production deployment

Strong fit signals

Strong production experience with Kubernetes, Docker, cloud infrastructure, and distributed systems

Deep knowledge of at least one infrastructure layer, such as containers, networking, Linux, storage, or cloud primitives

Experience building infrastructure systems from scratch

Strong debugging ability below the surface of managed cloud tooling

Background from a strong infrastructure-heavy company or top engineering environment

Comfortable working directly with founders in a small, fast-moving startup

Role 2: Site Reliability Engineer

The SRE role is focused on keeping our client’s production infrastructure reliable, observable, secure, and scalable as customer demand grows. This person will own reliability practices, monitoring, alerting, incident response, deployment safety, and automation.

This role is a fit for someone who has operated production systems at scale and can improve reliability without adding unnecessary process. The ideal candidate has hands-on experience with Kubernetes, Terraform or Pulumi, observability, incident response, SLOs, cloud infrastructure, and automation.

What you will work on

Own production reliability across our client’s infrastructure platform

Build and improve monitoring, alerting, dashboards, and observability workflows

Lead incident response, root cause analysis, and postmortems

Automate deployments, scaling, provisioning, and recovery tasks

Improve developer experience through safer releases and better operational tooling

Work with Grafana, Prometheus, Terraform, Pulumi, Docker, Kubernetes, Python or Go, AWS, GCP, Azure, and PagerDuty-style workflows

Help keep infrastructure highly available, secure, and ready for enterprise customers

Strong fit signals

3+ years of explicit SRE, production infrastructure, or platform reliability experience

Strong hands-on experience with Kubernetes, Docker, Terraform or Pulumi, Grafana, and Prometheus

Experience with incident response, on-call, SLOs, SLIs, alerting, and production debugging

Ability to automate reliability work with Python, Go, Bash, or infrastructure tooling

Experience scaling infrastructure, not just maintaining it

Background from a strong engineering company or infrastructure-heavy environment

Ideal Candidate Background

Our client is prioritizing candidates with strong recent full-time experience at respected infrastructure or engineering companies. Target backgrounds include companies such as:

Google, Meta, AWS, Apple, Stripe, Cloudflare, Datadog, Snowflake, Databricks, Nvidia, CoreWeave, Crusoe, Netflix, Uber, Airbnb, LinkedIn, Atlassian, HashiCorp, Docker, Grafana Labs, Fastly, Akamai, Confluent, Vercel, Render, Tailscale, Temporal, Box, Palo Alto Networks, Pure Storage, Affirm, Splunk, or similar engineering environments.

The strongest candidates will have both company pedigree and real hands-on ownership. A strong logo alone is not enough. They need to have built, operated, or debugged real production infrastructure.

Must Have

Hands-on IC mindset

Strong production infrastructure experience

Kubernetes and containerization experience

Terraform or Pulumi experience

Cloud infrastructure experience with AWS, GCP, or Azure

Strong debugging ability across Linux, networking, containers, and cloud systems

End-to-end ownership of production systems

Ability to work onsite in San Francisco 4 days per week

No current or future need for visa sponsorship

Nice to Have

Experience with AI infrastructure, ML workloads, GPU clusters, or agent execution environments

Experience with secure execution environments, sandboxes, microVMs, or runtime isolation

Experience with regulated customers, defense, FedRAMP, SOC2, or high-security enterprise environments

GitOps experience with ArgoCD or Flux

Go or Python experience for infrastructure automation or platform services

Experience with service mesh, CNI, container runtime, or low-level Kubernetes debugging

Compensation

This DevOps / SRE role pays $180k-$250k/yr, which is within the typical range for DevOps / SRE roles in the United States.
