
Member of Technical Staff (AI Inference Engineer)

Perplexity

San Francisco, US | Onsite | $220k-$405k/yr | Posted Apr 13, 2026

At a glance

Highlights

  • Production-scale LLM inference
  • Rust-native serving runtime
  • GPU kernel migration to CuTe DSL
  • Performance optimization at low latency
  • End-to-end problem ownership

Why this role might suit you

The role offers the chance to shape a high‑performance inference engine for a leading AI platform, working with cutting‑edge GPU technologies and Rust‑based systems while influencing production reliability at scale.

Skills

rust, python, cuda, cute-dsl, nccl, rdma, kubernetes, nsight-compute, nsight-systems, pytorch, torch-compile, int8, fp8, fp4

About the role

We build and run the inference engine behind every Perplexity query and deploy dozens of model architectures at scale with tight latency and cost budgets. Our stack is Rust, Python, CUDA, and CuTe DSL - and we need another engineer to join us.

What you will work on

Examples of real work the team does:

- New model support. Support transformer-based retrieval, text-generation, and multimodal models in our inference infrastructure, from weight loading, request scheduling, and KV-cache management to API Gateway support.

- GPU kernel migration to CuTe DSL. Port our in-house CUDA kernels to NVIDIA's CuTe DSL so they run on GB200 today and are portable to Vera Rubin racks tomorrow.

- Rust-native serving runtime. Develop our internal Rust-based inference server to eliminate the pain points of Python and keep up with rapidly growing traffic.

- Performance optimization. Profile and fix bottlenecks from network ingress through continuous batching and GPU kernel interleaving.

- Reliability and observability. Build dashboards, alerts, and automated remediation so we catch regressions before users do. Respond to and learn from production incidents.

Who we're looking for

- Deep experience with GPU programming and performance work (CUDA, Triton, CUTLASS, or similar). Any other deep systems programming experience is a plus.

- You understand modern LLM architectures and are able to bring them up reliably in a production environment.

- You've built and operated production distributed systems under real load - ideally performance-critical ones.

- Comfortable working across languages and layers: Rust for the serving runtime, Python for model code, CUDA/CuTe DSL for kernels.

- You own problems end-to-end. You can read a research paper on Monday, write a kernel on Wednesday, and debug a production incident on Friday.

- Self-directed. You do well in fast-moving environments where the path forward isn't laid out for you.

Good if you touched any of

- ML compilers and framework internals: PyTorch internals, torch.compile, custom operators.

- Distributed GPU communication: NCCL, NVLink, InfiniBand, RDMA libraries, model/tensor parallelism.

- Low-precision inference: INT8/FP8/FP4 quantization, mixed-precision serving.

- Profiling and debugging tools: Nsight Compute/Systems, CUDA-GDB, PTX/SASS analysis.

- Container orchestration: Kubernetes, GPU scheduling, autoscaling inference workloads.

Qualifications

- 3+ years of professional software engineering experience with meaningful work on ML inference or high-performance systems.

- Familiarity with at least one deep learning framework (PyTorch, JAX, TensorFlow).

- Understanding of GPU architectures (memory hierarchy, warp scheduling, tensor cores).

- Understanding of common LLM architectures and inference optimization techniques (e.g. quantization, speculative decoding, prefill-decode disaggregation).

Compensation

This Software Engineer role pays $220k-$405k/yr, which is within the typical range for software engineer roles in the United States.

