Nicholas Wilt, Principal Architect of CUDA’s driver API and one of the foundational figures behind GPU computing at NVIDIA, reflects on the pivotal moments and tough decisions that shaped CUDA’s success. From early skepticism within NVIDIA to redefining GPUs as single-chip supercomputers, he shares insights on the challenges, vision and evolution of CUDA. Wilt also explores its unexpected trajectory, from dense linear algebra to powering modern AI, and offers advice for the next generation of developers looking to push performance boundaries in a rapidly changing landscape.
You were one of the key figures in CUDA’s development – can you share some of the biggest challenges you faced in its early days?
Even within NVIDIA, there was a lot of skepticism about whether CUDA would deliver a return on the hardware and software investments. On the hardware side, besides the extra design and verification costs, the extra die area would increase the per-unit manufacturing costs. And the software costs, of course, would be wasted if the product weren’t successful.
I got into many a lunchtime argument with graphics architects who believed the extra die area would be fatal to NVIDIA’s prospects in the market. You have to remember that the well-documented debacle that was NV30 was still fresh in everyone’s mind. NV30 hadn’t been remotely competitive with ATI’s R300 series, partly because NVIDIA had overspent its transistor budget. So from that perspective, the skepticism was well-founded.
What was the original vision for CUDA?
The idea was that with an incremental investment in hardware, GPUs could be transformed from graphics chips into single-chip supercomputers. Coupled with an initiative that Jensen called “CUDA Everywhere,” we also made sure that would-be adopters could intercept the technology on the platform of their choice. On the software side, that meant portability across operating systems and CPU architectures; on the hardware side, it meant that even small SoCs targeting mobile computing applications would be fully CUDA-capable.
When you were developing CUDA, did you and the team have a sense that you were creating something that would have such a lasting and transformative impact on computing, or did that realization come later?
I can only speak for myself, but I was certain that it would be a huge success. I didn’t expect it to become a multi-trillion dollar property, but that’s partly because there were no trillion-dollar companies when we started work on CUDA.
Part of the reason we knew we were onto something is that adoption of the platform was growing like a hockey stick. Later, at Amazon, I experienced a similar pattern with Amazon Web Services, which we knew would be huge simply because it was growing so quickly, and that steep adoption curve drove a steady stream of customer requests to innovate on the platform.
How did you and the team at NVIDIA decide on CUDA’s programming model?
CUDA’s programming model has a clear division of labor between the kernels and the APIs that manage the environment in which the kernels run. The kernel execution aspect was the purview of Erik Lindholm, Ian Buck, Steve Glanville, and others who co-designed the hardware and compiler. John Nickolls played the most important role, having been recruited specifically to build GPU computing at NVIDIA. One of his key insights was recognizing the importance of a software-controlled cache, also known as a scratchpad or “shared memory” in CUDA parlance.
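For readers who haven’t written a kernel, here is a minimal sketch of that idea (illustrative code, not an NVIDIA sample): each 256-thread block stages a tile of the input in __shared__ memory, the scratchpad Wilt refers to, and then reduces it without touching global memory again.

```cuda
// Minimal sketch of a software-controlled cache in use (illustrative code,
// not an NVIDIA sample). Assumes the kernel is launched with 256 threads
// per block.
#include <cuda_runtime.h>

__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float tile[256];               // the scratchpad / "shared memory"

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                          // wait until the tile is populated

    // Tree reduction performed entirely out of shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];            // one partial sum per block
}
```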
As for the execution environment – initialization, memory management, and abstractions such as CUDA streams and events that manage concurrent kernel execution and memory copies – I served as the principal architect of CUDA’s driver API, which was designed to be portable and language-independent. The design was influenced by my experience on Microsoft’s DirectX team.
The “Direct” in DirectX means “direct access to the hardware,” which game developers always said they wanted, but which is not possible in a robust operating environment. So the compromise we brokered in DirectX (and, later, CUDA) was to meet developers halfway: explain how the hardware works and build APIs that are usable by developers, implementable on the hardware, and future-proof, not only to your own roadmap, but to anticipated platform innovations; and make sure those APIs keep working as the platform evolves. I believed, and still believe, that the key ingredient that fueled Intel and Microsoft’s ascendancy to industry dominance from the mid-1980s to the mid-1990s was backward compatibility. CUDA incorporates all of those design sensibilities.
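To make that execution environment concrete, here is a hedged host-side sketch of the streams and events Wilt mentions (the function and variable names are illustrative, and overlapping the copies assumes the host buffers are pinned, e.g. allocated with cudaHostAlloc): work issued to different streams can run concurrently, and an event lets one stream wait on another.

```cuda
// Hedged host-side sketch of CUDA streams and events (illustrative names;
// overlap of the two copies assumes hostA/hostB are pinned host memory).
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void pipeline(const float* hostA, const float* hostB,
              float* devA, float* devB, int n)
{
    size_t bytes = n * sizeof(float);
    cudaStream_t s0, s1;
    cudaEvent_t done;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    cudaEventCreate(&done);

    // Work issued to s0 and s1 can run concurrently on the GPU.
    cudaMemcpyAsync(devA, hostA, bytes, cudaMemcpyHostToDevice, s0);
    scale<<<(n + 255) / 256, 256, 0, s0>>>(devA, 2.0f, n);
    cudaMemcpyAsync(devB, hostB, bytes, cudaMemcpyHostToDevice, s1);

    cudaEventRecord(done, s0);          // marks the end of s0's work so far
    cudaStreamWaitEvent(s1, done, 0);   // s1 waits for s0's kernel to finish
    scale<<<(n + 255) / 256, 256, 0, s1>>>(devB, 3.0f, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaEventDestroy(done);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```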
What do you think has been the most groundbreaking application of CUDA since its inception?
I don’t recall talking about machine learning much during CUDA’s early development, and that’s obviously become the most groundbreaking application. We knew the GPUs would excel at dense linear algebra and N-body simulations (any workload that computes forces exerted between particles – from the atomic level to the galactic), and we made significant investments to accelerate ray tracing via OptiX.
One surprise workload prompted an early request to add blocking event waits to CUDA. The bug report read (paraphrasing): “We have better things for the CPU to do than spin waiting for the GPU to finish.” We wondered about this workload that needed the CPU and GPU to run concurrently, and it turned out to be video transcoding!
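For context, the capability that request produced is still part of the runtime API today. Here is a minimal sketch (the helper name is illustrative) of how a blocking wait keeps the CPU from spinning while the GPU finishes.

```cuda
// Minimal sketch of a blocking wait (helper name is illustrative): with the
// cudaEventBlockingSync flag, cudaEventSynchronize() yields the CPU thread
// instead of spin-polling the GPU, so the host can do other work in the
// meantime, which is exactly what the transcoding customers wanted.
#include <cuda_runtime.h>

void waitWithoutSpinning(cudaStream_t stream)
{
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync | cudaEventDisableTiming);
    cudaEventRecord(done, stream);   // enqueued after the GPU work already in 'stream'
    cudaEventSynchronize(done);      // sleeps this thread until the GPU reaches 'done'
    cudaEventDestroy(done);
}
```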
What are some of the most underappreciated features of CUDA that developers don’t leverage enough?
CUDA arrays have been in the platform since CUDA 1.0, as a vessel for multidimensional data such as textures. As the memory subsystem moves multidimensional data between levels of the cache hierarchy, you generally want blocks or voxels of data, not rows or columns. The problem is that this complicates ABIs – instead of an address that stands alone, an element is identified by a tuple of the array and a 2D or 3D offset into it.
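As a rough illustration of that tuple-style addressing, here is a sketch using the runtime API’s texture objects (the names are illustrative): the CUDA array has an opaque, hardware-friendly layout, and the kernel reads an element as (object, x, y) instead of dereferencing a pointer.

```cuda
// Rough illustration of tuple-style addressing with a CUDA array
// (illustrative names; uses the runtime API's texture objects).
#include <cuda_runtime.h>

__global__ void sampleKernel(cudaTextureObject_t tex, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);  // element = (array, x, y)
}

cudaTextureObject_t makeTexture(const float* host, int w, int h)
{
    // Allocate the CUDA array and copy row-major host data into it.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, host, w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    // Bind the array to a texture object for cached, 2D-local reads.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.filterMode = cudaFilterModePoint;
    td.readMode   = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}
```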
I’ve always believed that linear address spaces are oversold. The conflation of pointers and arrays tends to be the first attribute of the C programming language discarded by designers of C-derived programming languages, and that should tell us something.
How do you see CUDA’s role evolving in the next 5-10 years?
CUDA has already transitioned from being the lingua franca of important workloads like machine learning to a baseline technology that undergirds those capabilities. NVIDIA has always aspired to make CUDA accessible to more programmers, and I expect that to continue.
What do you think is the next big step for CUDA? Are there any upcoming features or trends that excite you?
The main trend I have been following is the effort to make low-level AI/ML code easier to write and more performance-portable. DSLs such as Triton have attempted to bridge that gap from outside NVIDIA, and now NVIDIA appears to have joined the trend with cuTile.
What advice would you give to a junior developer looking to develop their CUDA skills?
NVIDIA’s developer education materials are an excellent place to start, with a caveat: Developers who aren’t already proficient with pointers should shore up their understanding of how they work in C or C++ before they take on CUDA.
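As a concrete illustration of why that caveat matters, here is a deliberately tiny, illustrative example: host and device pointers are indistinguishable to the C type system, but only device pointers may be dereferenced inside a kernel, and confusing the two is a classic first CUDA bug.

```cuda
// Tiny sketch of the pointer pitfall (illustrative code): host and device
// pointers look identical to the compiler, but a kernel may only dereference
// device pointers.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void addOne(int* data, int n)          // expects a *device* pointer
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 4;
    int host[n] = {1, 2, 3, 4};                   // host memory, CPU-visible

    int* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));            // device memory, GPU-visible
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    addOne<<<1, n>>>(dev, n);                     // pass 'dev', never 'host'
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("%d %d %d %d\n", host[0], host[1], host[2], host[3]);  // prints 2 3 4 5
    return 0;
}
```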
Looking back on your career, what project or achievement are you most proud of, and why?
After 35 years in the industry, building CUDA’s driver API across the first six years of CUDA’s existence is my proudest achievement by far. I lose patience with folks who intimate that luck was involved; we work in a field where you create your luck with good choices and solid execution. CUDA was successful because we built the best combination of hardware and software in a very competitive environment, and we built it well.
If you're looking to hire top-tier CUDA talent, explore CUDA job opportunities, or take part in our Behind the Code series, we'd love to hear from you! Reach out to Alex Ford via email at alexf@oho.us or message him on LinkedIn @AlexFord.