DevOps Engineer, HPC and LSF at NVIDIA in Bengaluru, IN

About the role

NVIDIA is the leader in AI, machine learning and datacenter acceleration. NVIDIA is expanding that leadership into datacenter networking with Ethernet switches, NICs and DPUs. NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing.

As a member of the Hardware Infrastructure Farm team, you will provide leadership in the design and implementation of groundbreaking compute clusters that power all silicon development across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve engineers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success.

What you’ll be doing:

Manage and support workload and resource schedulers in a large-scale HPC environment.

Automate Everything: Develop scripts to automate deployment, configuration management, and operational monitoring.

Develop solutions for complex computing resource management requirements.

Extract and leverage grid performance metrics for troubleshooting and performance optimization.

Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.

Develop, define and document standard methodologies to share with internal teams.

Collaborate with domain experts to improve how our chip development process utilizes our infrastructure.

Directly contribute to the overall quality of our next-generation chips and improve their time to market.

What we need to see:

Extensive knowledge of job scheduler administration (e.g., IBM Spectrum LSF or Slurm).

Proficiency in administering CentOS/RHEL Linux distributions.

In-depth understanding of container technologies like Docker.

Proficiency in UNIX scripting languages and Python.

Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.

Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.

3+ years of experience in a large, distributed Linux environment.

BS in Computer Science or a similar degree, or equivalent experience.

Ways to stand out from the crowd:

Experience analyzing and tuning performance for a variety of HPC or EDA workloads.

Solid understanding of cluster configuration management tools such as Ansible.

Proficiency in Perl for maintaining legacy automation scripts.

Deep understanding of distributed system principles.

#LI-Hybrid

Questions about this role

How do I apply to this DevOps Engineer, HPC and LSF role at NVIDIA?
Click "Apply with AI Applyd" above. We auto-fill the application from your resume and answer screening questions in seconds. No copy and paste, no juggling tabs.
What's the typical salary for DevOps / SRE in India?
Compensation for DevOps / SRE roles in India varies widely by seniority, employer size, and remote vs onsite arrangement. Check the salary range on this listing when published, or browse our DevOps / SRE hub for India medians across recent openings.
How fast does AI Applyd auto-apply?
Most applications complete in under 90 seconds. You can track the status in your dashboard and watch the screenshot proof land the moment the application submits.
What ATS does NVIDIA use?
AI Applyd supports Greenhouse, Lever, Ashby, Workday, iCIMS, SmartRecruiters, LinkedIn Easy Apply, and most other ATS platforms. If we can submit through the platform, we do.
