Senior ML Engineer (Inference Serving)

Reference: SJE13
Location: San Francisco, CA, USA
Salary: $200,000 - $275,000
Contract Type: Permanent
Work Arrangement: In-Office (Full-Time)
Skill Requirements: Embedded, Electronics & Semiconductors

About the Opportunity

A rare chance to join a stealth, well-funded AI hardware start-up building a custom AI SoC and full inference serving stack from scratch. You will work directly alongside world-class hardware and software engineers, with genuine end-to-end ownership over how large-scale foundation models run on next-generation silicon.

What You'll Do

  • Serve as a core contributor on a small, senior team building state-of-the-art inference serving and cluster scheduling capabilities for a custom AI SoC
  • Architect high-performance multi-node inference stacks, designing and tuning for throughput and latency from the ground up
  • Implement advanced optimisation strategies across TP/PP/EP (tensor, pipeline, and expert parallelism) hybrids, continuous batching, and KV cache management at the intersection of compute, networking, and storage
  • Drive performance improvements directly inside leading inference frameworks including vLLM, SGLang, and PyTorch
  • Develop advanced cluster scheduling algorithms that push the frontier of efficiency for large-scale open-source models
  • Engage directly with the open-source community, upstreaming optimisations and influencing the roadmap of widely adopted AI infrastructure projects
  • Apply best practices in performance benchmarking, testing, and debugging to maintain a production-grade stack that runs on novel silicon

What We're Looking For

  • Strong Python, C++ and PyTorch engineering fundamentals with a track record of shipping high-quality software in a fast-moving environment
  • 1+ years as an active developer on LLM inference serving frameworks such as vLLM or SGLang
  • Deep understanding of LLM inference internals including KV cache, batching strategies, and attention mechanisms
  • Experience running and optimising large-scale workloads across heterogeneous clusters
  • Proficiency in performance analysis; GPU kernel development in CUDA, Triton, or ROCm is a plus
  • Familiarity with networking, storage management, or distributed scheduling technologies such as Orca or LMCache is a significant plus

Education

Master's or PhD in Computer Science, Computer Engineering, or Electrical Engineering preferred, or equivalent industry experience.

Get in touch
Sebastian Eyre