Workload Orchestration Engineer at Roche in Madrid, ES

Skills

kubernetes

About the role

At Roche you can be yourself and are valued for your unique qualities. Our culture encourages personal expression, open dialogue, and genuine connections. Here you are valued, accepted, and respected for who you are. This creates an environment in which you can grow both personally and professionally. Together we aim to prevent, stop, and cure diseases and to ensure that everyone has access to healthcare, today and in the future. Join Roche, where every voice matters.

The Position

Job description

As a Workload Orchestration Engineer within the Accelerated Compute Engineering (ACE) team, you will be responsible for overseeing and advancing our workload orchestration tech stack across both our High-Performance Computing (HPC) and industry-leading AI Factory platforms. With the rapid expansion of our compute infrastructure, efficiently scheduling, managing, and maximizing the utilization of our CPU and GPU environments is paramount.

You will own the deployment, configuration, and fine-tuning of orchestration platforms that schedule massive, parallel computational workloads. By implementing robust scheduling policies for traditional scientific workflows and modern containerized AI workloads, you will bridge the gap between heavy compute capacity and efficient execution. Your work will directly ensure that Roche’s researchers, data scientists, and engineers can seamlessly run large-scale AI model training and computational science simulations.

Description of the area

Hosting and Infrastructure (HI) provides mission-critical on-premise infrastructure, cloud hosting, connectivity, and technology products that enable all functions at every Roche site to develop, innovate, connect, and deliver compliant digital products across the Roche Enterprise.

The Value Streams - Accelerated Compute Engineering (ACE) Team is focused on driving both customer success and platform success by acting as a center of excellence and delivery for the High Performance Compute and AI Infrastructure supporting AI and HPC use cases across Roche. This team facilitates seamless onboarding and adoption for business vertical customers needing accelerated compute. It helps those infrastructure consumers achieve rapid time-to-value by meeting their needs for high availability, seamless data transfer, flexibility, and speed as AI requirements change rapidly.

Job Responsibilities

Orchestration Stack Deployment & Governance

Design, implement, and maintain the SLURM Workload Manager ecosystem across our HPC cluster architectures, ensuring high availability and optimal resource distribution.

Deploy and manage Run:ai as the core orchestration and virtualization layer for the AI Factory, enabling fractional GPU allocation and dynamic resource allocation.

Evaluate, architect, and implement SLURM Slinky integrations where required to seamlessly bridge Kubernetes-based AI orchestration with traditional HPC cluster resources.

Containerization & Workload Optimization

Define best practices and frameworks for containerized scientific execution, utilizing Singularity/Apptainer and/or Enroot to provide secure, reproducible performance environments for HPC.

Translate user and workload requirements into optimized scheduling parameters (e.g., topology-aware scheduling, multi-node scaling).

Actively profile and tune scheduling queues, quality-of-service (QoS) parameters, and fair-share policies to maximize multi-tenant efficiency.

Platform Reliability & Telemetry

Partner with Observability Engineers to implement continuous monitoring, telemetry, and reporting dashboards to track scheduler efficiency, queue wait times, and hardware utilization rates.

Troubleshoot complex workload failures, including distributed training synchronization issues, MPI communication bottlenecks, and driver incompatibilities.

Maintain configuration-as-code models for the scheduling tier, leveraging automation to deploy cluster policies uniformly.

Qualifications

Education / Experience

Bachelor’s or an advanced degree in Computer Science, Applied Mathematics, Computational Engineering, or a similar technical discipline.

5+ years of systems engineering experience, with a heavy emphasis on workload scheduling, resource management, and cluster optimization for multi-tenant environments.

Deep technical familiarity with Enterprise Linux operating systems and distributed systems architecture.

HPC Scheduling & Tooling: Expert-level proficiency in administering SLURM, including complex partition designs, accounting, and plug-in management. Highly proficient with Singularity for container runtime execution.

AI Orchestration: Hands-on experience or deep architectural understanding of Run:ai, Kubernetes, and containerized GPU scheduling paradigms.

Infrastructure Literacy: Solid understanding of high-speed interconnects (InfiniBand, RoCE) and multi-node communication architectures (MPI, NCCL) as they relate to job placement.

Automation: Proficiency in automating scheduler configurations and telemetry gathering, or with infrastructure automation tooling.

Leadership & Mindset:

Lean & Agile Mindset: Highly focused on driving efficiency, reducing idle compute time, and creating frictionless pathways for user workload submissions.

Collaboration & Advocacy: Outstanding capability to translate scientific and AI model workflow challenges into scalable scheduler configurations.

Intellectual Curiosity: A strong passion for remaining ahead of industry trends regarding GPU slicing, fractionalization, and the convergence of AI workloads with traditional HPC schedulers.

Who we are

A healthier future drives us to innovate. More than 100,000 employees worldwide work together to advance science and ensure that everyone has access to healthcare, today and for future generations. Through our efforts, over 26 million people are treated with our medicines and more than 30 billion tests are performed with our diagnostics products. We encourage one another to explore new possibilities, foster creativity, and set high goals in order to deliver life-changing healthcare solutions.

Together, we can build a healthier future.

Roche is an Equal Opportunity Employer.
