Senior Product Manager, AI Factory Infra

NVIDIA

New York City, US | onsite | $208k-$380k/yr | Posted May 28, 2026

At a glance

Highlights

  • AI factory focus
  • Impact on resilient automation
  • Collaboration with hyperscalers
  • Competitive compensation
  • Opportunity to shape self-healing AI factories

Why this role might suit you

The role offers leadership over a critical AI factory automation platform, exposure to cutting‑edge GPU infrastructure, and collaboration with major cloud partners, making it an attractive opportunity for senior product leaders seeking high‑impact technical product management.

Skills

product-management, infrastructure, platform, mlops, distributed-systems, workflow-orchestration, automation, operator-ux, repair-queues, audit-trails, slo, time-to-drain, time-to-healthy, fleet-availability, rma-logistics, hardware-repair, reliability-engineering, chaos-testing, gpu-infrastructure, datacenter-operations, ai-factory, agentic-ai-workflow

About the role

NVIDIA is driving a vision for AI factories that convert tokens to intelligence at scale to power the AI demands of tomorrow. Maintaining AI infrastructure at scale takes more than human involvement; it demands smart automation. The orchestration engine for AI factory break-fix runs live in production at DGX Cloud. As the Product Manager leading all aspects of resilient automation at AI Factory, you will manage break-fix automation. You will develop the product strategy, improve the operator experience, and guide the roadmap for professionals. You will build a scalable, reliable product on a strong engineering foundation that NVIDIA Cloud Partners depend on to uphold their SLAs. This is your chance to shape how AI factories self-heal!

What You’ll Be Doing:

Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.

Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.

Build the operator UX for repair queues, workflow transparency, and audit trails — ensuring on-call engineers have the context they need to act quickly and confidently.

Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.

Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.

Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.

What We Need to See:

12+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.

BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.

Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.

Track record owning products with real-world operational consequences — you understand blast radius and build accordingly.

Strong operator UX instincts — proven ability to translate complex system state into workflows that on-call engineers can act on under pressure.

Ability to build alignment across engineering, SRE, and external vendor partner teams.

Ways to Stand Out from the Crowd:

Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments.

Experience with RMA logistics, vendor SLA oversight, and hardware repair processes on a large scale.

Background in reliability engineering, SLO design, or chaos/fault-injection testing.

Prior experience at a cloud service provider or on a hyperscaler infrastructure team.

Experience building agentic AI workflow software.

Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer to you and your family at www.nvidiabenefits.com/

#LI-Hybrid

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 208,000 USD - 327,750 USD for Level 5, and 240,000 USD - 379,500 USD for Level 6.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until May 31, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

Compensation

This Product Manager role pays $208k-$380k/yr, which is within the typical range for product manager roles in the United States.
