CUDA - NVIDIA GTC 2024 presentation
Steven Jones' presentation on CUDA at NVIDIA GTC 2024
Full transcript of the talk
Now let's get started with our session. This morning, I am proud to introduce Steven Jones, an NVIDIA distinguished software architect, presenting the latest and greatest on our CUDA parallel computing platform.
He's been with NVIDIA for over 14 years, starting as a senior software engineer on CUDA in 2008. In between he had a brief stint at SpaceX, meeting Elon, and he may share a little bit about that; it was a good diversion. He also brings his aeronautical and aerospace engineering background from the University of Cambridge, where he took his master's degree.
And with that, I'm gonna introduce Steven to talk about CUDA. Thanks very much, all, for coming. It's amazing. I'm sure everyone says this, but I haven't been to a talk yet; it's just amazing to actually talk to people instead of to a camera above your screen when you're recording a talk.
So it's really nice. Thank you all for coming here. I'm one of the architects of CUDA, so I spend my time thinking about CUDA, the platform, the language, all the systems and stuff that goes with it. And also, you know, I work closely with the hardware teams. Right?
I spend probably half my time, even though I'm a software guy, working with the hardware teams, working on what our next-generation GPU, and the one after that, is going to be. Because, you know, one of the magical things we get to do is we get to combine the hardware and the software, so that we can build the programming model we want and we can program the hardware that we build.
Now, I'm gonna start today with something which I've learned working with the hardware teams a lot, and I think it really drives a lot of the way of thinking about the things I'm talking about today, and also a lot of the way I think about how hardware is driven by the laws of physics and by the constraints that we are all living under, in terms of the hardware we design as well as the software that programs it. Right? So this is possibly a contentious thing to say at NVIDIA, that accelerated computing is not all about performance. But if you watched the keynote yesterday with Jensen, he was really talking about how it's all about energy.
Right? It's not just the performance, it's about the performance per watt. Because, ultimately, you've got to power these things, you've got to provide the energy into the machine, and so the efficiency is really the key metric that you have to care about. Yes, you want to be scaling performance, but you've got to scale it with efficiency as well. And the obvious place I went to look for this, just doing a bit of research for the introduction of the talk, you know, I went around looking at data centers.
Right? They're building data centers at an enormous rate in the world. They're standing up something like 6 or 7 data centers a day. But I went looking for how many data centers they build, and nobody lists data centers by number. Right?
They list data centers by megawatt. Because power is the thing that is important in a data center. Right? There's 5 gigawatts worth of data centers in North America right now, there's going to be another 3 gigawatts standing up in the next year. And when you go and you buy time on a data center, you rent them and they're charged by kilowatts per month.
Nobody cares how many servers you're buying. Nobody cares how many data centers you're renting. You are renting power per month, because power is the metric that really matters for computing. Right? And if you look at what a data center typically is, a medium sized data center runs you maybe 20 megawatts, and that's because I've got this big building with a giant power connection coming in, and that connection provides 20 megawatts of power.
So if I build a brand new chip that goes twice as fast, but it runs at twice the power, I can't magically put more power into this room. Right? I end up with half the number of chips. I've got 20 megawatts. The question is, what can I do with my 20 megawatts?
And again, if you were watching what what, what Jensen was saying about Blackwell at the keynote, he talked a lot about the energy, the power efficiency, and that is a really big focus. And on the hardware side, that's something that they're all thinking about. Every transistor really counts. But it's not just in the data centers, it's also in my own home. Right?
My desktop machine gets power from the wall. I can't put 10 GPUs in it. I can't run that out of my 1 and a half kilowatts if I'm in America, 3 kilowatts if I'm in the UK. My maximum system power on my laptop is even smaller. Right?
So everybody is constrained by power more than anything else. So it is really all about energy. And the challenge we have is that the energy equation, the energy balance, is getting worse. Right? On the left hand side here is the very well known chart, if you like, of Moore's law; that's a bunch of numbers which looks at the transistor density from when Moore stated his law back in about 1970.
And it's pretty much been going exponentially. It's a log plot on the left hand side. And on the right hand side, I went to TSMC's website and I just pulled up all the information about their different chips. And their transistor density, that orange line, has been, of course, increasing exponentially as well. But something else you see when you look through that data is that if you look at the power efficiency scaling, as I shrink my transistors, I need fewer electrons to turn them on.
And so they take less power, but the power is not scaling as fast as the transistor count. And that is a problem in a world which is energy constrained. Because when I keep adding transistors, I've got to do something about the power. And so, while obviously we look very closely at the hardware and the efficiencies of the hardware, it's critically important (obviously I'd say this, I'm a software guy) to look at this in terms of the efficiency of the software as well. And so I'm going to be talking about a couple of really key pieces where power comes from.
One is data movement and the other is computation, the two obvious users of electricity and power in one of these machines. Right? And so, starting with computation, let's talk for a moment about floating point arithmetic, because that's really the core of the computation that is being done. Right? In fact, in most of these data centers running GPUs, overwhelmingly a lot of the flops, and a lot of the watts, are being spent on things like matrix multiplication.
And so I went and I dug up a little table here. You know, on the left hand side we've got all these different precisions that NVIDIA offers. And there's lots of reasons for that, but I'm gonna get into a particular one, focused on power, right now. And on the right hand side, I can break this down into the energy cost of doing multiplications at the different precisions. The FMA is a fused multiply add; it's the fundamental unit of arithmetic operation in a computer.
And if you look at the top, you see the standard flop, the 32 bit single precision flop, to which I've normalized everything at about 1x: double precision is about 2 and a half times the power, and half precision is about half the power. The key here is that the higher precisions don't just scale linearly. Floating point multiplication scales as the square of the mantissa length. That's the blue section on the left hand side.
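A rough back-of-the-envelope check (my numbers, not from the slide): FP16, FP32 and FP64 carry significands of roughly 11, 24 and 53 bits, so a multiplier whose cost grows with the square of the mantissa length gives

$$
\left(\frac{53}{24}\right)^2 \approx 4.9, \qquad \left(\frac{11}{24}\right)^2 \approx 0.21
$$

relative to FP32. The full FMA lands nearer the quoted 2.5x and 0.5x because the adder, the exponent logic and the operand movement scale more gently than the multiplier array itself.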
Right? So the longer my number is, the more power it takes to compute that number, because I've got more decimal places, more bits to move around. And then you look at the tensor cores. Right?
The tensor cores are completely different. They take these single operations and group them together with an economy of scale, and you see dramatic changes and improvements in energy per flop. Right? So this is one of the reasons you see these investments in tensor cores and these dense processing units: because I've got 20 megawatts in my data center.
I wanna cram in all the flops that I can get, and this is the kind of place it's coming from. And so you can look at these interesting balances. Right? The tensor core at FP64, the double precision tensor core (and this is looking at Hopper H100 numbers), is more efficient than individual multiply adds.
And that's the economy of scale I was talking about. But if you look at the difference at, say, 16 bit: instead of being 1 and a half times more efficient, it's 4 times more efficient, because again, it's that square of the mantissa which is winning me the power. But I'm gonna look at something which I think is a more interesting thing to look at, which is the difference between these 64 bit operations, which are 2 and a half times the power of single precision, and these tensor cores at reduced precision, where there's a factor of about 20 in power efficiency sitting there. But of course there's a very big difference between a 64 bit precision and a 16 bit precision number.
So it turns out there's not as much of a difference as you would think. And this is not new work, but there's a lot that's been going on about it recently, and I really wanted to tell you about this because I think this, to me, is a really exciting way that we are attacking the power wall, the power problem that I see coming from those curves, from the silicon processes. Right? A couple of years ago I was telling you about some work from my colleague Azzam Haidar (there's a reference to the paper down below) using tensor cores to do an LU factorization matrix solve, where you would do the heavy lifting in 16 bit and then you'd have this iterative process, this GMRES process, to take your 16 bit results and progressively improve the accuracy back up to 64 bit values. Right?
And so, in the paper they look at this, and this graph shows on the bottom here the number of iterations. I take these 16 bit tensor cores, which actually output a 32 bit result, and then I iterate progressively 3, 4 times; there's that line in the middle of what FP64's accuracy is, and I get closer and closer to the final result, and finally I exceed the accuracy of a true native 64 bit number. So this is not in any way compromised on the results. This is exactly accurate to the bit of what you would get, except I've done the heavy lifting using those much more power efficient tensor cores.
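To make the refinement loop concrete, here is the classic mixed-precision iterative-refinement recurrence (my sketch of the general idea; the referenced paper uses a GMRES-based variant in which the low-precision LU factors act as a preconditioner). The factorization $A \approx LU$ is done once on the FP16/FP32 tensor cores, and each cheap correction step pushes the solution back toward full FP64 accuracy:

$$
r_k = b - A x_k \ \text{(FP64)}, \qquad LU\, d_k = r_k \ \text{(low precision)}, \qquad x_{k+1} = x_k + d_k .
$$

A handful of these iterations is typically enough to bring the residual down to the level of a native FP64 solve, which is exactly the crossover the graph shows.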
And that's a really big deal. Right? Because here's a chart, actually run for me just this weekend, of running, in this case, an LU decomposition solve, the thing that was in that flowchart before. The green line is the Grace Hopper, the GH200 (they're both the GH200), so the green line is the 16 bit plus 64 bit, the middle column on this chart. The blue line is the pure native double precision values.
And so you can see what you're getting. You're not only getting the power benefit on the bottom, which is huge, right? Almost a factor 6 improvement in flops per watt, which is unbelievable. I can now do 6 times more flops in that power limited data center using this technique than I could if I had natively got the exact same result using double precision numbers. And at the same time I'm going almost 4 times as fast.
Right? So I'm faster and I'm more efficient. This is amazing. This is huge. This algorithm is actually implemented in the cuSOLVER library.
But I just see this reaching out into everything. Right? If I can do work faster and more efficiently, this is one way we can attack this power wall that I see coming. And it's not just Azzam's work, either. There's other people: my long time friend Rio Yokota at the University of Tokyo, he and some colleagues wrote a paper looking at a completely different approach, but again using low precision, in his case integer tensor cores, to do matrix multiplication.
And some of our genius guys in NVIDIA implemented it. But what they did here, instead of just taking one of the GH200, like, mega powerful chips, they took the L40, the L40S. And that's sort of the lower power data center part, which does not have a native double precision tensor core in the first place. And using the 16 bit tensor cores that are in the L40S, they were able to run matrix multiplications at 6, 7 times the performance. Without even having a proper... well, there is a double precision unit, but it's not the high performance double precision unit that you would normally find in an H100.
They compared it, in fact (it was too busy, I didn't put it on the chart), to an A100, and it's half the performance of an A100 while using no double precision tensor cores at all, which is absolutely incredible. Right? This is opening the door to parts with much lower power.
Being able to achieve, you know, 50% of the performance is incredible. Right? And not only that, but the power savings. Right? At the same time as you're getting a factor of 6 or 7 in performance, you're getting a factor of 7 or 8 in power.
In power efficiency, sorry: performance per watt. So this is huge. And this fascinates me. Like, I'm very lucky with this talk: I get to just find these cool things that are going on around the company and tell you about them, because I just find them interesting.
And this is one of the things I think is really fascinating, because there are so many things we can do with this type of technique. Now, tensor cores themselves: a lot of people come and ask me, how do I program tensor cores? Tensor cores are a complex system, right? They have all these different precisions, they have these different ways of using them. But the three main flavors of ways you get access to the tensor cores are, first, through the cuBLAS math library.
That is your basic workhorse that has existed since the very beginning of CUDA. It's linear algebra APIs, and you call a matrix multiply and that automatically pipes through onto the tensor cores. cuBLAS actually calls the one in the middle, called cuBLASLt. cuBLASLt, which you can also access yourself (it's a public library), gives you a more advanced API where you can really control a lot more aspects of what the tensor cores are doing.
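As a concrete illustration of that productivity path, here is a minimal sketch (my example, not from the slides) of a single cuBLAS call with FP16 inputs and FP32 accumulation, which cuBLAS routes onto the tensor cores. Matrix initialization and error checking are omitted; link against cublas.

```cpp
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1024;
    __half *A, *B;   // FP16 inputs
    float  *C;       // FP32 result
    cudaMalloc(&A, n * n * sizeof(__half));
    cudaMalloc(&B, n * n * sizeof(__half));
    cudaMalloc(&C, n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha*A*B + beta*C: FP16 operands, FP32 accumulation on the tensor cores.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha,
                 A, CUDA_R_16F, n,
                 B, CUDA_R_16F, n,
                 &beta,
                 C, CUDA_R_32F, n,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cudaDeviceSynchronize();
    printf("GEMM done\n");

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

cuBLASLt exposes the same operation through cublasLtMatmul, with many more knobs for layouts, epilogues and algorithm selection, if you need that extra control.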
The tensor cores have a lot of different configurations, a lot of different modes, and you can really get access to them. And on the right hand side we have something called CUTLASS which, if you've seen me give this talk before, I talk about probably every year, because it really is the programmer's way of getting at the tensor cores. It lets you write tensor core code inside your own kernel and get access to all of the different knobs and configurations that the tensor cores have. So I drew this out for myself in a different way, because really there's a couple of different dimensions.
There's the productivity dimension of cuBLAS on the left, where I call one API and I get the peak acceleration. And then there's control on the right hand side, if I really want to start tweaking it and merging it and meshing it with my data. And so, one of the things that the math libraries have done is they've been working on what are called device extension libraries: the cuBLAS device extension library, cuBLASDx. This brings the cuBLAS productivity path on the left hand side into your device kernel.
So while CUTLASS is a hierarchy of C++ classes which give you incredibly fine grained control, there's a completely different approach on the cuBLASDx side, where the idea is that you can get your tensor cores activated in your kernel just with a single GEMM call, just like you would do with cuBLAS from the CPU. And so, why do you wanna do this? Well, you wanna do this because sometimes you don't just want a matrix multiplication, you want to then do something with the result. That's what we call a fusion. You take some data, you manipulate it in some way, you do a few big matrix operations and then you use the results in some way.
And by fusing all of these together (this is a chart of having taken a pre processing step, two matrix multiplies fused together, and a post processing step, all in one kernel), the difference between doing that in one kernel and sequencing it with a series of calls, using Thrust in this case with cuBLAS, is a factor of 3 in performance. Right? So being able to take the same simplified API, put it inside your kernel, and customize it in the way that you want, also comes out with performance. And again, I'm not showing perf per watt in these cases, but all of these cases are reaching peak performance, typically at lower energies. The same is true for FFT.
I actually showed this last year, because they've been working on device extension libraries for FFT for some time. FFT also, again, allows fusions of FFTs with the rest of your operations. In this case, I'm fusing 3 kernels into 1. And again, you see these speed ups. And so a lot of this comes from this fusion thing, where I've really customized my kernels in ways that give me the ability to string lots of work together.
I load data once. If you remember, I said there were two reasons for power cost: one is data movement and the other is compute. This is solving the data movement, so that my compute densely applies to it without the data movement going on in between. And so how does that work?
Well, the basic kernel fusion (and probably many of you in the room are aware of this): typically I'll have a sequence of operations. Right? Maybe I'll do some precision conversion, I'll multiply things in a matrix multiplication, and then I'll run an activation function, you know, a ReLU or something like that, on it.
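Here is a minimal fused-kernel sketch of that idea (my own toy example; a real fusion of a matrix multiply would pull cuBLASDx or CUTLASS into the kernel). One kernel does the precision conversion, a stand-in compute step and the ReLU in a single pass, so each element is loaded from and stored to global memory exactly once:

```cpp
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Fused: convert FP16 -> FP32, do some math, apply ReLU, all in one pass.
// The unfused alternative would be three kernels with a round trip through
// global memory between each one.
__global__ void convert_scale_relu(const __half *in, float *out,
                                   float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __half2float(in[i]);  // precision conversion
        v = a * v + b;                  // stand-in for the heavy math step
        out[i] = v > 0.f ? v : 0.f;     // activation (ReLU)
    }
}

int main() {
    const int n = 1 << 20;
    __half *in;
    float *out;
    cudaMalloc(&in, n * sizeof(__half));
    cudaMalloc(&out, n * sizeof(float));
    convert_scale_relu<<<(n + 255) / 256, 256>>>(in, out, 2.0f, 1.0f, n);
    cudaDeviceSynchronize();
    printf("fused kernel done\n");
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```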
Some very standard sequence of operations. And this is what those charts were showing just a moment ago. By fusing them together, you load your data once, you operate on it many times and then you store your data out the other end and you end up with these single fused kernels. And this is a great idea and everybody should do it if they can. The challenge is that I don't just have one thing to do.
Right? I might have a hundred different types of things to do. And so, you know, I've drawn 4 on this slide because that's all I could fit. But even with 4, I've got 64 possible combinations, and I can't build every single one of them ahead of time. You know, if I had a hundred in every row, I'd have a million different combinations.
That's just not feasible. So what I'm seeing is, as people build these codes which fuse things, people are also moving very often towards just in time compilation, runtime compilation, where you say: my program needs this, that and the other units; configure them precisely for what I need, and then compile it on the spot and run it. And so I see JIT compilation being more and more important in people's workflows inside CUDA. And so our compiler team has spent (this chart covers, what, 18 months, I think, from CUDA 11.8) their time consistently reducing the cost of this JIT compilation.
Because very often, as I'm showing down there on the bottom left, you've got this iterative loop. Right? You build a fused kernel, you run it, you get some data, you look at what you're doing next, you build the next one, and you've got this iterative thing. The compile time becomes part of your main program loop. And so they've worked really hard, and this is showing the compile time for hello world.
So it's basically just overhead. Hello world is the simplest program you can possibly write. And so the overheads of compilation have come down by a factor of 6 over the last 18 months. And so really there's this big focus on how fast can I iterate, how fast can I compile, because JIT compilation is showing up everywhere. Right?
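For CUDA C++ the runtime-compilation path that has been getting faster is exposed through NVRTC. A minimal sketch (my example; link against nvrtc and cuda, and pick the target architecture that matches your GPU):

```cpp
#include <cstdio>
#include <string>
#include <vector>
#include <nvrtc.h>
#include <cuda.h>

int main() {
    // Kernel source assembled at run time, specialized for what we need right now.
    const char *src = R"(
        extern "C" __global__ void saxpy(float a, const float *x, float *y, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) y[i] = a * x[i] + y[i];
        })";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "saxpy.cu", 0, nullptr, nullptr);

    const char *opts[] = { "--gpu-architecture=compute_80" };
    nvrtcResult res = nvrtcCompileProgram(prog, 1, opts);

    size_t logSize;
    nvrtcGetProgramLogSize(prog, &logSize);
    std::string log(logSize, '\0');
    nvrtcGetProgramLog(prog, &log[0]);
    if (res != NVRTC_SUCCESS) { printf("compile failed:\n%s\n", log.c_str()); return 1; }

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // Load and launch through the driver API.
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;   cuModuleLoadData(&mod, ptx.data());
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "saxpy");
    // ...allocate buffers and launch fn with cuLaunchKernel as usual...
    printf("JIT-compiled %zu bytes of PTX\n", ptxSize);
    return 0;
}
```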
Now, JIT compilation, these compilation tools: I talk often, if you see me talk about this, about how when I think of CUDA, I'm always thinking of the entire platform. Right? My job as one of the architects of CUDA is to think about how all of these things fit together, but nothing exists in isolation. And there's kind of an inverted pyramid here. Very few people are writing compilers.
There's a few of you, and we love you and we absolutely support you, and we have LLVM and all these other things that you can target. But fundamentally speaking, you can probably count on both hands the number of people who really get to sit down and start writing compilers. Above that there's kernels, libraries, host called libraries, and then this massive universe of frameworks and SDKs at the top. Right? Now, one of the things which I'm thinking a lot about these days, and that I've paid a lot of attention to certainly over the last several years, is Python.
Right? Because when I look at the world of Python developers, I think my pyramid is suddenly much, much wider. Instead of having, you know, a million users at the top, I've got 10 million users at the top. And so the gap between something that you can build at the bottom and the impact that it has at the top is even more broad. Right?
So making a change to compilers, like JIT compilation. JIT compilation is incredibly important in Python, because Python is this very runtime interpreted language and you're constantly generating data dynamically. And so a compiler in the loop is completely normal. In fact, the Python interpreter basically is one of those. And so these changes we make at the very bottom affect enormous ranges of people.
And so looking at the Python stack, you have to invest everywhere, all the way across it. And so I've listed a few things here in terms of places that we are really looking at. But really our goal (I put it as the subtitle of this slide) is the vision that we have in terms of where I think Python needs to be, where all of us in CUDA think it needs to be, which is, as I say, towards a complete NVIDIA experience for the Python developer. The whole CUDA ecosystem available and accessible to Python programming.
One of the aspects of that is that you're seeing our libraries and our tools start supporting Python more and more. And so the math library teams have put a ton of work into producing a Pythonic interface which natively and naturally connects Python applications to these accelerated libraries, which I think are fundamentally the most common way that people access GPU acceleration. And at the bottom here, by the way (through many of these slides I've got links to other people's talks), this is linked to my friends Ati and Harun's talk, where they're talking about everything to do with the libraries, and this is a big piece of it.
And so if you ever wanna know more, there's an index list at the end of this as well. You can just go and follow up and see what all the different talks are which I've drawn from for the material in this presentation. But the Python libraries: it's a full stack which goes all the way from your application, through JIT compilation, through the different APIs, both CPU side and GPU side, all the way down onto the underlying libraries. The GPU accelerated ones, NVPL, the NVIDIA Performance Libraries which target the ARM processor, MKL, anything else. Right?
So, sort of a universal front end for these accelerated libraries. The other aspect of tensor cores that I was talking about before was CUTLASS, which gives you detailed configuration control over the tensor cores. And CUTLASS as well has a Python interface. You can go install it, you can go find documentation for it, and so on. On the right hand side, they've integrated this with Python.
Right? Sorry, with PyTorch, apologies: a PyTorch extension. And so you can emit PyTorch extensions from CUTLASS, and you can automatically bring CUTLASS tensor core custom kernels written in Python into PyTorch. There's a CUTLASS talk that was on the previous slide.
I actually have the link for it. Go and have a look at the CUTLASS talk, which is gonna tell you a lot more about how this type of thing works. And as I said, we're not just investing in libraries, we're also investing in tools. And so the developer tools team for the CUDA platform, the Nsight guys, have been putting a lot of effort into being able to combine their output for both C++ code and for Python code, all at the same time in the same timeline. Right?
And so here on the right I've got an example of doing exactly that. Likewise, the code annotations, what we call NVTX, which allow you to identify code regions by annotating them, so you can have a green region and a blue region. So it's much easier for you to find the regions that you want in complicated profiler traces. This is all configurable through JSON files and it all works nicely with Python programs. There's just all of these different pieces.
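In the talk's context this is about Python (there is an nvtx package for that), but the underlying annotation API is small. A minimal C++ sketch of a named, colored region (my example) looks like this:

```cpp
#include <nvtx3/nvToolsExt.h>   // NVTX v3 is header-only

void preprocess() { /* ... work ... */ }
void solve()      { /* ... work ... */ }

int main() {
    // A named, colored range: shows up as a green "preprocess" block
    // on the Nsight Systems timeline.
    nvtxEventAttributes_t attr = {};
    attr.version       = NVTX_VERSION;
    attr.size          = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
    attr.colorType     = NVTX_COLOR_ARGB;
    attr.color         = 0xFF00FF00;            // green
    attr.messageType   = NVTX_MESSAGE_TYPE_ASCII;
    attr.message.ascii = "preprocess";
    nvtxRangePushEx(&attr);
    preprocess();
    nvtxRangePop();

    nvtxRangePushA("solve");                    // quick variant with default color
    solve();
    nvtxRangePop();
    return 0;
}
```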
That pyramid that I was showing: you've got to start putting the building blocks in all these different places, so that ultimately you end up with an ecosystem that works up and down, across the board. As I said, I look around and find these amazing things that people are doing, and one of the things that's really caught my eye inside NVIDIA is Warp. My friend Miles Macklin, who is normally in New Zealand but is up here to give a talk about Warp this week, runs the team that's built this thing called Warp, which is a very special thing. It lets you write GPU kernels in Python, but they are differentiable kernels. And it naturally and automatically takes the kernels that you have written and, with JIT compilation again (remember, as I said, JIT compilation is showing up everywhere),
it can automatically produce the reverse mode differentiated version of your flow. So you can have a forward pass; it records it and you can replay it as a backward pass. And so you can construct simulation code, physics code, computational code in the kernel, GPU accelerated. This compiles straight down onto the GPU and runs at full compiled GPU performance, but with this backward differentiable pipeline available as well.
And the things you can do with it are incredible. Right? So there's a whole compiler chain inside of here which takes in the Python and turns it into PTX and runs it on the GPU. But it lets you do these things, these amazing simulation things. His talk is down there.
Go and check it out, because first, it's incredible technology; second, he's doing it in the realm of computer graphics, so he's got beautiful videos and visuals as well. This is an example of modeling something incredibly complicated, like this plastic system of tearing bread apart. And, you know, the big one is the simulation, and the ground truth looks almost exactly the same. And being able to do this, and teach a neural model to follow how something like this, some plastic deformation, functions and works correctly.
Through auto differentiation, you can just run the simulation. The backwards differential path is used to train the model, and then the model can very, very quickly start producing amazing computer graphics like this, and simulation results like this. Go and check out his talk. So, last year (and I very rarely reuse slides, but this slide nicely summarizes it) I told you about something called Legate.
And I wanna tell you a bit more, because again, it fits into a lot of the stuff that I've been talking about. Legate is a framework which takes your basic single threaded code and distributes it very widely across a large number of machines. Right? These machines are getting bigger and bigger, you're processing more and more data. To program these things gets increasingly hard, and this is what something like Legate is for.
It's a layer, it's a stack where you have libraries on top, a runtime in the middle, and it runs across the accelerated libraries across your whole machine. And last year I showed you the basic stencil benchmark using NumPy. NumPy can talk to this thing we have called cuNumeric, which is a NumPy implementation based on Legate. It automatically scales your NumPy program, in this case across a thousand GPUs. It's a pretty straightforward stencil computation, but it's a very powerful tool.
And so what they've done with this is they've taken Legate and they've applied it to the JAX framework, another framework for differentiable computing. Many of you have probably heard of it. And the JAX framework is heavily used, of course, in machine learning and AI, but it actually is a framework that can run more or less arbitrary simulations. Another differentiable computing thing, similar to the Warp in Python that I was showing you a moment ago. And JAX is based on the XLA compiler, which takes in all the different layers of JAX and compiles it down to a particular target.
So what the Legate guys have done is they've integrated Legate into JAX at that compiler level, at the XLA level. So your JAX program does not change. The structure of your JAX program is the same. You mark up a few things and you indicate, with decoration and configuration, what the pipeline stages of your program are, which I think they'll be able to put fully in the compiler in the future. And then this plugin to XLA, the compiler for JAX, takes your code, maps it across the Legate runtime, and allows it to scale.
And so what they've done with that (and my friend Wonchan has a talk on this where he goes into way more detail, because I only get to give you 2 or 3 slides on every single topic) is run it, comparing it against PaxML and Alpa, which are common distribution frameworks inside of JAX. The scaling and the ease of use are very impressive. So go and check his talk out if you're a JAX programmer, because scaling is really such a powerful thing to be able to do, that scaling across these big systems. And again, oddly, reusing another slide from last year just because it's a good description.
The Nsight Systems team has spent an enormous amount of effort working on their distributed system analysis. Right? Putting a breakpoint on a GPU is hard enough with a quarter of a million threads. And figuring out how to make a tool break a quarter of a million threads and tell me useful information is incredibly difficult. And now I scale this up to thousands of machines.
There's just no possible way. So you need new tools, and they've really invested in these new tools. And I showed you some of those before (I've got a quick picture again), but a key piece that they've done with this is they've taken these large distributed tools, these multinode tools that they've got, and they can now embed them not just in the Nsight Systems main viewer, but in your Jupyter notebook as well. And so your tools are available at the place that you're writing this code. And again, it's all about those building blocks up and down the stack.
Right? And it's amazing; like, they take vast amounts of data and they can boil it down to a picture. In this case, I've got a heat map showing how the GPU utilization and the communication are or are not overlapping, so I can find compute only zones where I have opportunities for asynchronous communication. And again, all about energy. Right?
If I have my communication and my compute working together, everything moves faster than if I'm doing them one after another, with high power running for twice as long. So, at the other end of the scale from Legate, but still at a very large system scale, is something called NVSHMEM. And this is something we've had for quite some time. NVSHMEM's been around for several years. It evolves and has had a lot of new things come into it all the time.
There's a whole talk by my friend Jiri, who talks about all things multi GPU programming. And he is one of the best speakers I know, and his talk is absolutely worth going and seeing. But what NVSHMEM does is it gives you low latency, fine grained control over the overlap of compute and communication. It's one of those things that sits underneath a lot of the stuff that you use without you really knowing you're using it. But what I'm going to be telling you about is actually the thing that sits underneath that.
Because it's really interesting. You know, these NVSHMEM things, on my pyramid they fit down at the bottom level. Right? This is something that maybe a hundred people use, but which affects a million people through these different layers. And one of the technologies which is deep, deep down inside of this is something called GPUDirect. And I've told you about GPUDirect before, and I'll just go through a quick sequence explaining what it is.
Because when I've got data being produced by a GPU and I've got to get it to the network, and the network has historically been a peripheral attached to the CPU: in the past, before I had GPUDirect, my GPU would generate data and I'd have to go through 4 different steps to get that data out onto my network. I'd have to synchronize, copy a couple of different times, trigger some things. So there were 4 hops to go through in order to be able to get my data out of my GPU and onto my network. And so GPUDirect came along and said: this is ridiculous, especially for the amount of data that I'm moving. Let's just move my data directly to the network device.
And so I eliminated my 4th hop, and now, with a single direct copy, GPUDirect allowed me to generate my data and then send it directly from GPU to network card. And that's very powerful, but it still keeps the CPU in the loop. So they came up with a thing called GPUDirect Async. And these are the evolutions that happen over the years as they work on and improve these technologies. And so now I've kind of got a 2 and a half step process.
What the GPU Direct Async does is the CPU can do the setup, but it lets the GPU trigger it. And so the data moves automatically and directly and there's some CPU proxy that handles the triggering. But it's now fully controlled by the GPU, so the GPU program doesn't have to stop so the data can be sent. It can keep on going and just signal, now send it. Now send it.
Now send the next one. And now, finally, they've got this thing called GPUDirect Kernel Initiated. And this is where you take the CPU out of the picture entirely. This is a true 2 hop process. You can never get fewer than 2 hops.
You've gotta first prepare it and tell the network that it's coming, and the second step is to stream all the data off and onto the network. Two is the lowest number that you can get here. So we've gone from 4 to 3 to sort of 2 and a half to 2, and this embeds everything entirely in the kernel.
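What kernel-initiated communication looks like to the programmer is easiest to see through NVSHMEM's device API, which builds on this GPUDirect path. A heavily simplified sketch (my example; a real program would usually bootstrap NVSHMEM via MPI, build with relocatable device code, and add error handling):

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>

// Each thread pushes its element straight to the next PE's buffer.
// The put is issued by the GPU itself; no CPU involvement once the
// kernel is running.
__global__ void ring_put(float *dst, const float *src, int n) {
    int pe   = nvshmem_my_pe();
    int peer = (pe + 1) % nvshmem_n_pes();
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        nvshmem_float_p(&dst[i], src[i], peer);
}

int main() {
    nvshmem_init();
    const int n = 1 << 20;
    float *src = (float *)nvshmem_malloc(n * sizeof(float));  // symmetric heap
    float *dst = (float *)nvshmem_malloc(n * sizeof(float));

    ring_put<<<128, 256>>>(dst, src, n);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();   // make the puts visible on all PEs

    printf("PE %d done\n", nvshmem_my_pe());
    nvshmem_free(src);
    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```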
And the result is incredible. Right? This is a run of training on a graph neural network (I'll explain more about that later, actually), where that middle line is that 2 and a half step process. So the 2 and a half step process is still 20% faster than the vanilla, normal, non GPUDirect process. And in terms of the transfer, the feature transfer, the movement of the data that you're caring about, you're looking at an order of magnitude speed up or more.
Right? So it's the power of being able to make the communication more streamlined, more autonomous. I don't mean the power in watts in this case, but the potential of that is so enormous. And these things sit there and they quietly and silently plug into something like NVSHMEM, they plug into something like NCCL.
And NCCL rests on top of this. And for NCCL, on the left hand side (NCCL is the thing that moves all your data between GPUs when you're doing any kind of communicating multi GPU job), small messages are the hardest. Right? One byte messages are extremely difficult because you're sending a lot of overhead for a small amount of data. And here, what this does is cut your latency considerably.
And on the right hand side, you've got much more bandwidth and potential, because again, you're cutting out the overheads and you can really communicate much more efficiently. And again, tools integration everywhere; it's so important to be able to see what's going on. We've got NVSHMEM and NCCL traces built into the tools. So I wanna talk about the thing that I think I've had the most questions about over the last year, which is Grace Hopper, and the programming model and the way you program those machines.
Right? And the philosophy of CUDA has always been that we have a single program constructed of, effectively, a GPU function annotated by __global__ and a CPU function. And it's all in one program. It's a heterogeneous program. Right?
It's not 2 separate things. It's 1 program with functions running in 2 different places. Right? And this relates to something that Jensen was saying a few weeks ago: it's not that you're replacing serial work with parallel work, it's that the parallel work extends it.
You need both and you want to do both. Right? And so the idea is that CPU code runs on the CPU and GPU code runs on the GPU. Right? And between them, historically, we've had this PCI bus and so even though you've got these very high speed memories going on, the PCI bus has historically been a bottleneck.
And so the obvious thing to do, which we did with the Grace Hopper chip (and we talked about this last year), is that you can combine them together with this thing called the NVLink C2C connection, which is many, many times faster than PCI. And so my data transfer goes through much better. Right? And this is called Grace Hopper; this is what the machine is. But it's not just a device with a very fast interconnect.
In fact, it can be that, but I think that's really just missing the point of what this is all about, the reason that I love this thing. It's that you've got really one processor with 2 characteristics, natively two different things. Right? I've got 2 memory systems, each optimized for its own processor, but one is optimized for a latency system, right?
My CPU is a latency processor. It has deep caches, it cares about linear operations very fast. My GPU is a throughput machine. It has these very high bandwidth memories and it has very high bandwidth caches. And the way that you treat these things is different because the way these things run code is different.
And so, on one of these Grace Hopper machines, it's a single unified memory system with 2 different ways of executing. And if I've got something like a linked list, run it on the CPU; it's much better. If I've got something like a parallel reduction, run it on the GPU. That's what it's for.
And I can pick and choose. Just like my program was a hybrid of 2 things, I can literally run whatever I want at the right place for it, because these 2 systems are unified with 1 address space. So it's more than just the fast link. It's the fact the GPU can see and modify and touch the CPU memory. In doing so, we can detect it and we can move that data over to the GPU, so the GPU can get the benefit of its very high bandwidth caches. That could be as much as a factor of 10 improvement in performance if you're touching that data all the time.
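In code, this really does mean that an ordinary CPU allocation is usable from a kernel. A minimal sketch (my example; it assumes a system with this unified address space support, such as Grace Hopper; on other machines you'd reach for cudaMallocManaged instead):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(double *data, int n, double s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;   // GPU reads and writes CPU-allocated memory directly
}

int main() {
    const int n = 1 << 20;
    // Plain malloc: no cudaMalloc, no explicit copies.
    double *data = (double *)std::malloc(n * sizeof(double));
    for (int i = 0; i < n; ++i) data[i] = 1.0;      // first touched by the CPU

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0);  // same pointer, used on the GPU
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);              // the CPU sees the GPU's update
    std::free(data);
    return 0;
}
```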
And so the ability to both combine the single address space but also intelligently move things around while we're working on it is unbelievably powerful. That lets me put the compute and the data where it needs to be. And at the same time, of course, the migration doesn't affect the CPU, right? It can still access and touch and see that data. It's a little bit of extra latency, of course it's going over the bus.
But really, it's really one machine, and that's kind of the point I'm trying to get to. And this is some results from (and very generously I'm able to show some results from) Thomas Schulthess' talk. He's the director of CSCS, which runs the ICON code on the newly brought up Alps machine, a Grace Hopper machine in Switzerland. And this is just a fantastic example of exactly what I was talking about. There's a simulation here where you've got an ocean simulation running purely as CPU code, and you've got an atmosphere simulation on the GPU, in green.
And the coupling is extremely tight, and so you're moving data around a lot, and so historically you've more or less been limited to the performance of the CPU code. But when you move to something like Grace Hopper, you're really able to run both of these things at the same time: the CPU code on the CPU, the GPU code on the GPU, with the very close coupling and exchange of data automatic. And the result is, you know, a factor of 3 speed up. This is unbelievable.
And this is at the scale of 64 GPUs. And, you know, this is the kind of thing that is going to affect the number of days I can forecast in my weather forecast, and really important things like that which impact everybody. At the same time, other great examples: my colleague Matthias has a talk about this, looking at fine tuning of language models. A language model is a series of transformer layers, and when you go through your transformer layers, as you're processing these in your forward pass, training it, you generate these intermediate tensors.
And that can be a large number of layers and therefore a large amount of data. So typically what we do is we throw away the data and then on the way back we recalculate it all. So we double our computation in exchange for saving some memory. But with the Grace Hopper device, I can actually cache some of that instead of throwing it away. I'll keep some of it around on the GPU.
The small things are not worth throwing away; the blue things, instead, I will cache and save in the Grace memory. Because remember, the memory is just one giant memory system. And then on my way back, I can recall it back into the GPU, and so I don't have to do that recomputation. And the result is a 20% speed up in this particular example.
This is taking a 10,000,000 parameter mixture of experts model. And you can see on the left, the light green is offload and the dark green is recompute. The recompute time is, of course, the same for both. But on Grace Hopper, if I do the offload of data instead of the recompute, I'm gaining in time, because I've got this very tightly coupled memory system that lets me do it. Another example which I see a lot of these days is graph neural networks. Graph neural networks are the kind of things which financial institutions use to go and analyze whether your credit card has been used fraudulently.
Right? Things like that. Massive, massive, massive interconnections of information. And the GraphSAGE model is a primary model for going and using neural networks to solve graphs. And so this is a trivial example, a simple walk through of how it works.
And my friend Joe Eaton has a whole talk on this. So again, he's the expert, I'm just the messenger. But basically, you sample your neighborhood; you've got these little batches, these convolutional networks, that run at all these different types of nodes. The challenge of the graph network is that it's not just one single collection of data that I'm operating on.
My entire universe could be touched on any edge between any two nodes in the graph. So I have a massive pile of data which is completely randomly accessed. I might access only 10% of it at any one time. I don't know which 10%, and it's going to be different on every iteration. So what I need is this it's just a big pool of very fast memory, so I can randomly access and touch it, as I go through the flow of the GraphSage model.
And putting this on Grace Hopper has just been an incredible performance improvement. Where previously I spent a lot of my time fetching data and moving things in and out and on and off the GPU, now with this unified pool of memory, you're looking at a factor again, a factor of 2 speed up. These are huge. Like, a factor of 2 speed up is like a generational speed up in most codes. You will spend ages, a whole PhD, getting a 20% speed up in something.
This is a factor of 2 because now you have a new architecture that can do new things. So finally, from one form of graphs to another, and this is, I must admit, a little bit of just it's it's as an engineer, you know, you plan something, you design it, and CUDA graphs is something I started designing several ago. And you have all these ideas and it takes a lot longer than you think to get where you're going to go. And so the idea of CUDA graphs, which I've talked about a few times and hopefully you know, the idea is you define your workflow upfront and then I can issue a single launch operation to launch an arbitrary amount of work. So it can be a very fast way of putting work onto the GPU, and I can see really good improvements in speedups to launch.
But it's a lot more than just a fast way to launch work. So I actually went back and I found my slide deck from 2018. And this was for GTC, for conversations with developers, just saying, could this be useful to you? And I just thought I'd grab some of my slides from then, because it's so interesting to see what I was thinking at the time and where it's finally going. And so, you know, a quick description of task graphs where you had these nodes and they could be different things and this is largely what we built.
And I had this sequence where you say, you know, the task graph properties: they're reusable, I can launch them over and over again, I define it once and I run it many times. But then, cyclic: I wanted a graph not to just be a straightforward flow of dependencies.
Why not be able to jump back to the beginning? Why not be able to have a dynamic graph? Something where node B could decide it wanted to go to C or D based on some data that it came up with. Right? Data dependent, dynamic control flow.
And then finally hierarchy, which is a key part of any graph system. But these are literally from my very first slide deck on graphs, saying: here is what I want. And finally we've built it, 6 or 7 years later, however long it's been. And so let me tell you about this thing that we built, because it really is everything that I'd had in my mind about how these things would be used. And it opens the door to a lot of potential, I think.
So what I've got on the left here is an incredibly trivialized version of something called conjugate gradient. It's like a gradient descent type of thing. It's a very, very standard way of solving a system of linear equations, and it's just pseudo code on the left hand side. But the key part about it is there's an iterative loop. There's a main loop where I do something, and I run that loop over and over and over again until I have my solution.
And the loop body, typically, traditionally with CUDA graphs, the idea with my loop body is that I'm going to take that body and I'm going to turn it into a graph. And then I'm going to run that graph many times so my program starts looking very simple. Instead of having all of these different things that I have to do, I have one launch call. And this is great. This is how people use graphs today and it speeds things up very efficiently.
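As a baseline, here is roughly what that traditional pattern looks like with stream capture (my sketch, with a trivial stand-in kernel): the loop body is captured once into a graph, each iteration is a single cudaGraphLaunch, but the convergence check still forces a copy back to the CPU every time around.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void update(double *x, double *residual) {
    *x *= 0.5;            // stand-in for the real solver update
    *residual *= 0.5;     // stand-in for the real residual computation
}

int main() {
    double *x, *residual;
    cudaMalloc(&x, sizeof(double));
    cudaMalloc(&residual, sizeof(double));
    double one = 1.0;
    cudaMemcpy(x, &one, sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(residual, &one, sizeof(double), cudaMemcpyHostToDevice);

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Capture the loop body once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    update<<<1, 1, 0, s>>>(x, residual);
    cudaStreamEndCapture(s, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);

    // ...then relaunch it until convergence. The while check lives on the CPU,
    // so every iteration pays a device-to-host copy and a synchronize.
    double h_residual = 1.0;
    const double eps = 1e-6;
    while (h_residual > eps) {
        cudaGraphLaunch(exec, s);
        cudaMemcpyAsync(&h_residual, residual, sizeof(double),
                        cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);
    }
    printf("converged: residual = %g\n", h_residual);
    return 0;
}
```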
But the challenge is this data dependent execution, which is very common. Iterating till convergence, right? It's an almost universal pattern. The iteration requires reading the result back and deciding if I'm going to do my while loop again.
So I keep having to stop my program, copy the data back in order to evaluate my while residual is greater than epsilon, and then I can go back and do another launch. And so now we're moving data dependent execution to the GPU. Right? So I take the main loop and now I create a graph with these new nodes. We've created 2 new node types and I'll tell you about them in just a moment, an if node and a while node.
And now I can put the while on the GPU. So the convergence check, the while check, is done without having to come back to the CPU. And my program no longer has a main loop at all. The main loop is now completely moved dynamically to the GPU. And I can just launch a conditional graph, if you want to call it that.
And my program is much simpler. So now my CPU is out of the picture. I can run a 100 of these independently, all at the same time because I no longer need CPU threads to manage them. And the way it works is we've taken one of these conditional nodes. It's just another type of graph node.
But it's a graph node that's either an if or a while. And inside the if node of the graph, it evaluates the condition and either runs a sub graph or it doesn't. Remember, graphs are hierarchical; that was one of the things on my very, very early slides.
And so I've got these conditional nodes which encapsulate what to do if the condition is true. Now, because graphs are hierarchical, you can nest these. I can have a conditional node inside a conditional node, so any depth that I want. And so I can have a while node. And so I can have an if node that if something happens, go and run this.
And this contains a while, which iterates continuously. And all of this can be 100% described inside my task graph. A lot of people ask me: why did you make graphs control dependencies instead of data dependencies? And this is the reason. This is why we built it with control flow dependencies. Because you want to be able to say things like while and if, which data flow does not allow you to do.
And then there's other constructs you can do. This thing on the right is like a switch; it's like multiple ifs: if x, if y, if z. That's a switch with cases.
All of these types of things, if and while, are the key fundamental building blocks. And maybe we'll optimize switch ourselves later to make it more efficient. But, you know, fundamentally, you can now describe a fully dynamic, control flow driven workflow on the GPU without having to return to the CPU to be the hand holding controller. And that's very much a theme for the way we're moving things: to reduce the amount of communication, to keep the GPU busy, to keep the power as efficient and the computation as efficient as we possibly can get it. And this, finally, after 6 years, came out a few weeks ago with CUDA 12.4.
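A minimal sketch of what the while node looks like in code, based on the conditional node API as it appears in the CUDA 12.3/12.4 era (my toy example, with a one-kernel loop body and no error checking; treat the exact structures and flags as something to verify against the programming guide):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// One solver iteration, collapsed into a single kernel for brevity. It also
// decides, on the GPU, whether the while-node body should run again.
__global__ void iterate_and_check(double *x, double *residual, double eps,
                                  cudaGraphConditionalHandle handle) {
    *x *= 0.5;            // stand-in for the real update
    *residual *= 0.5;     // stand-in for the real residual computation
    cudaGraphSetConditional(handle, *residual > eps ? 1 : 0);
}

int main() {
    double *x, *residual;
    cudaMallocManaged(&x, sizeof(double));
    cudaMallocManaged(&residual, sizeof(double));
    *x = 1.0; *residual = 1.0;
    double eps = 1e-6;

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // Handle the device-side check uses to keep the loop going (default: run once).
    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph, 1, cudaGraphCondAssignDefault);

    // Add a WHILE conditional node; its body is a nested subgraph.
    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeWhile;
    cParams.conditional.size   = 1;
    cudaGraphNode_t whileNode;
    cudaGraphAddNode(&whileNode, graph, nullptr, 0, &cParams);
    cudaGraph_t body = cParams.conditional.phGraph_out[0];

    // Put the loop body (here just one kernel) inside the conditional subgraph.
    cudaKernelNodeParams kp = {};
    void *args[] = { &x, &residual, &eps, &handle };
    kp.func = (void *)iterate_and_check;
    kp.gridDim = dim3(1);
    kp.blockDim = dim3(1);
    kp.kernelParams = args;
    cudaGraphNode_t bodyNode;
    cudaGraphAddKernelNode(&bodyNode, body, nullptr, 0, &kp);

    // One launch runs the whole data-dependent loop with no CPU round trips.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);
    cudaGraphLaunch(exec, 0);
    cudaDeviceSynchronize();
    printf("converged: residual = %g\n", *residual);
    return 0;
}
```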
So it's so nice to be able to stand up and show you this thing that has been in my head forever, and we've just not been able to get to it until now. Turns out you have to build a lot of things before this can work. And so that's it. That's what I've got. Here's the list of all the references for everything that I've told you about, because I am just the messenger.
All I do is tell you about amazing work that everybody else around the company is doing, and I just get to stand up here and tell you about it. And so here is a list of all the fascinating stuff that I've dug up to find. This is shared in the PDF of the slides, and so if you wanna go back and stream some talks, or even attend them in person, these people are really worth listening to. Thank you very much. That's pretty awesome, Steven.
We have a few minutes to take some Q&A. We have microphones up at the front here. If you have any questions, feel free to ask. We have a couple of online questions for you. So I'm gonna start with the first one.
Are there CUDA APIs available for us to measure power consumption as our programs run on GPUs, and break down how much is due to compute, memory access, or networking? That's a hard question. The power consumption is a system level thing, so you need different system APIs. And so what we have is a monitoring system called DCGM, which allows you to monitor your data center and all these nodes in real time and see what the utilization and the power of these different things is.
But you would have to use that to collect the data across your system and extrapolate for a single CUDA function. There's no way to identify just the power purely from that, because power is an external factor that depends on not just compute but memories and buses and all sorts of things like that. Great. We have questions here. Go ahead.
I have a question about effectiveness. In the case of a company, for example, if we combine all workstations into one server, how much energy can we save? So if you combine what, sir, could you... For example, if we have multiple workstations, CPU, GPU, right, and we put everything in one server, CPU and GPU, how much energy can we save? So the energy saving is going to be very algorithm dependent, of course. But typically the most expensive thing in any system is communication.
It's moving electrons around. And so the more you combine into a single localized space, the better; this is why you see density increasing in data center racks, because it takes much less energy to move electrons a few inches instead of meters. Right? So, sorry,
I just mixed different units there. But in general, you will be saving energy; it really depends on your algorithm exactly how much. I think it's hard to predict that. You would need a model of your system.
The top ten ideas from the talk
Energy efficiency is crucial in accelerated computing. Performance per watt is a critical metric, as data centres are constrained by power limitations. Optimizing for energy efficiency allows for more computing power within a given power budget.
Tensor cores offer significant energy savings compared to traditional CUDA cores. By using mixed-precision arithmetic and leveraging the economies of scale of tensor cores, developers can achieve substantial performance gains and power savings.
Iterative refinement techniques, such as using lower precision tensor cores for the bulk of the computation and then iteratively refining the result to higher precision, can provide significant speedups and energy savings without compromising accuracy.
JIT (just-in-time) compilation is becoming increasingly important in CUDA workflows, particularly for dynamic languages like Python. Recent improvements in CUDA's JIT compilation performance have made it more feasible to include compilation as part of the main program loop.
Python is a rapidly growing language in the CUDA ecosystem. NVIDIA is investing heavily in providing a complete CUDA experience for Python developers, with a focus on productivity and performance.
Unified Memory and memory migration optimisations in Grace Hopper architectures allow for efficient utilisation of CPU and GPU memory spaces. This enables developers to run workloads on the most suitable processor without explicit data movement.
CUDA graphs, particularly with the introduction of conditional nodes (if and while nodes), enable fully dynamic control flow on the GPU without returning control to the CPU. This reduces communication overhead and keeps the GPU busy, resulting in more efficient computation.
NVIDIA's investment in enhancing CUDA libraries (e.g., cuBLAS, cuFFT) and tools (e.g., Nsight Systems) with Python interfaces and distributed computing capabilities democratises access to high-performance computing and enables developers to tackle large-scale problems more easily.
Innovations in interconnect technologies, such as NVIDIA's GPU Direct and NVSHMEM, reduce data movement overhead and latency in multi-GPU and multi-node systems. These optimisations are crucial for scaling performance in distributed computing environments.
CUDA's programming model, which combines CPU and GPU code in a single program, allows developers to leverage the strengths of both processors. This heterogeneous approach, coupled with the unified memory system in Grace Hopper, enables efficient execution of diverse workloads.