Deep Learning Performance Architect, CUTLASS DSL

NVIDIA

Shanghai, CN · Onsite · Posted Jun 1, 2026

About the role

Are you passionate about programming languages, compiler technology, and GPU performance? Do you want to help shape the future of high-performance kernel development for AI? We are looking for outstanding engineers to build CUTLASS DSL, a Python-native language for GPU kernel development, along with the MLIR dialects and lowering passes behind it. In this role, you will also help accelerate kernel compilation while delivering performance comparable to CUTLASS C++, enabling efficient hardware-software co-design for NVIDIA's next generation of AI platforms.

What you'll be doing:

Design, develop, and optimize CUTLASS DSL, a Python-native language for high-performance GPU kernel development

Build and advance the MLIR dialects, lowering passes, and code generation flows that power the CUTLASS DSL stack

Drive innovations that improve kernel compilation speed while maintaining performance on par with CUTLASS C++

Collaborate closely with architecture, research, and software product teams, as well as the open-source community, to bring cutting-edge optimizations into real products

What we need to see:

MS, PhD, or equivalent experience in Computer Science, Software Engineering, or a related field

2+ years of relevant work experience

Excellent programming skills in Python and strong proficiency in C++

Hands-on experience with DSLs, compilers, or code generation systems

Strong command of the MLIR/LLVM stack, including IR design and pass optimization

Strong communication skills and the ability to thrive in a highly collaborative environment

Ways to stand out from the crowd:

Deep understanding of the CUDA GPU programming model, GPU microarchitecture, and performance analysis and optimization techniques

Familiarity with key high-performance computing abstractions such as Layout, Tile, MMA, and TMA in the CuTe ecosystem
