Machine Learning Solutions Engineer (ML + Infrastructure Focus)

lightningai

New York City, US | onsite | $150k-$195k/yr | Posted Jan 24, 2024

Skills

kubernetes, pytorch, docker, python, css, helm, llm, ml

About the role

Who We Are

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction.

Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.

We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

What We're Looking For

Lightning is looking for a Machine Learning Solutions Engineer with a focus on ML and Infrastructure to join our Sales team in New York. As a Machine Learning Solutions Engineer, you will operate at the intersection of machine learning, distributed systems, and cloud infrastructure. You will partner with customers to design and deploy end-to-end AI systems, spanning:

Model development and training

GPU infrastructure and cluster design

Distributed inference and production deployment

This role goes beyond traditional ML solutions engineering—you will act as a technical architect, helping customers make critical decisions across compute, orchestration, and system design.

The role is hybrid out of our New York City office hub, with an in-office requirement of at least 3 days per week and occasional team and company offsites. We are not able to provide visa sponsorship for this role at this time.

What You’ll Do

Customer Architecture & Technical Leadership

Partner with customers to understand ML workloads, infrastructure constraints, and scaling requirements

Architect end-to-end solutions across:

Data pipelines (CPU → GPU workflows)

Distributed training (multi-node, multi-GPU)

High-throughput inference systems

Translate business goals (latency, cost, throughput) into technical system design decisions

GPU & Infrastructure Design

Design and optimize workloads across GPU clusters (H100, H200, B200, etc.)

Advise on:

Training vs inference cluster design

Interconnect choices (Ethernet vs InfiniBand / RDMA vs RoCE)

Storage strategies (local NVMe vs networked / object storage)

Model and optimize for:

Tokens/sec, tokens/$

Throughput vs latency tradeoffs

GPU utilization and scheduling efficiency

Kubernetes & Platform Systems

Design and support deployments on Kubernetes (EKS, GKE, on-prem clusters)

Work with:

GPU scheduling (time-slicing, MIG, bin-packing)

Autoscaling and workload orchestration

Helm-based deployments and multi-tenant environments

Help customers balance:

Raw Kubernetes flexibility vs platform abstraction (Lightning)

Demos, POCs, and Execution

Build and deliver technical demos and POCs that showcase:

Distributed training workflows

Scalable inference endpoints

End-to-end ML pipelines on Lightning AI

Scope and lead POCs aligned to customer success metrics (latency, cost, reliability)

Cross-Functional Impact

Act as the bridge between customers, product, and engineering

Provide feedback on:

Platform gaps in infrastructure, orchestration, and performance

Emerging patterns in GPU usage and distributed systems

Influence roadmap across ML workflows and infrastructure capabilities

Enablement & Thought Leadership

Create technical content:

Architecture guides (e.g., high-throughput LLM inference systems)

Best practices for GPU utilization and scaling

Educate customers on modern AI infrastructure patterns

What You’ll Need

ML + Systems Expertise

3–6+ years of experience in:

Machine Learning / AI Engineering

Solutions Engineering / Sales Engineering / ML Consulting

Strong understanding of:

Training vs inference workloads

Model optimization (quantization, batching, caching, etc.)

GPU & Distributed Systems

Experience working with:

GPU clusters (NVIDIA stack preferred)

Distributed training or inference systems

Familiarity with:

NCCL, CUDA, or GPU performance profiling

Networking concepts (RDMA, RoCE, InfiniBand, high-throughput systems)

Kubernetes & Cloud Platforms

Hands-on experience with:

Kubernetes (EKS, GKE, or on-prem)

Slurm

Containerization (Docker)

Exposure to:

GPU scheduling in Kubernetes environments

Multi-tenant or production ML deployments

Programming & Tooling

Strong Python skills (PyTorch preferred)

Experience building:

ML pipelines

APIs or inference services

Familiarity with Lightning AI, PyTorch Lightning, or similar frameworks is a plus

Customer-Facing Excellence

Ability to:

Explain complex infrastructure and ML tradeoffs clearly

Run technical discovery and uncover quantifiable success metrics

Experience working cross-functionally with:

Sales, product, and engineering teams

Compensation

The annual base pay range for this role is $150,000 - $195,000, in addition to a variable pay component and meaningful equity.

Benefits and Perks

We offer a comprehensive and competitive benefits package designed to support our employees’ health, well-being, and long-term success. Benefits may vary by location, team, and role.

Benefits include:

Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)

Retirement and financial wellness support (U.S.); Pension contribution (U.K.)

Generous paid time off, plus holidays

Paid parental leave

Professional development support

Wellness and work-from-home stipends

Flexible work environment

At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.

