Role:
We’re looking for a Senior CUDA Developer to build and optimize high-performance GPU kernels for next-generation AI systems. You’ll focus on getting robust, maximum performance out of GPUs, including efficient GPU-to-GPU communication. This is a high-impact role, working close to the metal in a fast-moving startup environment.
Responsibilities:
● Design, implement, and optimize CUDA kernels for performance and scalability
● Build and tune GPU-to-GPU communication paths (e.g., NIXL, NCCL-style collectives, P2P)
● Profile, debug, and resolve memory, latency, and throughput bottlenecks
● Collaborate with compiler, systems, and hardware teams
Experience:
● 3+ years of experience in CUDA development and performance optimization
● Deep understanding of GPU architecture, memory hierarchies, and execution models
● Experience with multi-GPU communication and synchronization
● Triton experience is a plus
● Familiarity with AMD GPUs & ROCm is a strong plus
