The following is a rough transcript which has not been revised by Vanishing Gradients or the guest. Please check with us before using any quotations from this transcript. Thank you. === hugo: [00:00:00] I'm so excited to be here today with Charles Frye. What is up, Charles? charles: Hey, Hugo. Thanks for having me. hugo: Such a pleasure. You've published this incredible GPU glossary recently. My background is in science, I'm a scientist, right? I'm doing a lot of ML and LLM data stuff these days. And it took me long enough to figure out all the programming stuff, let alone having to worry or be concerned with hardware stuff, and there just aren't enough hours in the day. So I really appreciate all the work you've done to demystify this pretty heavy hitting stuff, to be honest. So really excited to jump into this stuff. I would love everyone to introduce themselves in the chat. Let us know where you're watching from and why you're interested in such things. Charles, you're in New York City. I'm in Sydney, Australia. So it's Friday morning and it's looking beautiful here. It's getting chillier there this time of year, isn't it, Charles? charles: Yeah, not as chilly as I remember as a child, but that's a different problem. hugo: So yeah, introduce yourself in the chat. It'd be great to [00:01:00] know what you're interested in, whether you're a scientist or a developer, or someone who's interested in learning a bit more about hardware. I'll introduce Charles and probably get a few things wrong. And then, let me just make sure, yep, we're good. And then Charles can rectify any mistakes I've made. Charles is a developer advocate at Modal, where he and the team work on tools and resources to simplify GPU programming for developers. Charles has been at Weights & Biases, developing tools, technical docs, and educational content. If that's not enough, Charles also holds a PhD in computational neuroscience, and his technical interests span neural networks, and now we get to some pretty heavy hitting stuff, computational Bayesian methods and category theory for programming. And it's funny, because you and I chat about LLMs and that type of stuff all the time, but we haven't really delved into our shared Bayesian interests and our category theory interests. I'd love to do a bait and switch and just do a podcast live stream right now on spectral sequences or something like that, but that would be wild. But [00:02:00] hey, is there anything I left out about what you're up to now, or what your interests are, or what's happening at Modal? charles: No, that's all pretty good. Yeah, the one addendum I would make is that Modal's mission is to make cloud computing, and building compute and data intensive distributed applications, easier. And the thing that has taken off already, that we've found people really like, is our core serverless GPU offering. That's the hot thing right now. That's the place where people are spending a lot of money and developing new technologies, so that's where we're focused as a startup, but there's a bigger vision behind it, of a cloud that is not so painful to use. hugo: Amazing. And I'm pasting the link to the GPU glossary, which we're going to get into, in the chat, and people can check out Modal there as well. We have someone tuning in from Chile, so we've got Chile, Sydney, and New York City represented, which is pretty cool.
why don't we jump in and feel free to ask questions in the chat at everyone and [00:03:00] we'll try to get to them. but Charles, we, yeah. If we're doing some form of modern AI, whatever that means, and we know what it means. It's working with like large neural networks of some sort, right? whether that's, pinging and APIs that are self hosted or trying to train deep learning models ourself, trying to train deep training, deep learning models ourself. but GPUs are essential to, what we work with in, in modern AI. I'm just wondering what first got you super interested in GPUs in the very technical aspects you are now, and how, just what your understanding of their role in AI, how that's evolved over time. charles: Yeah. I think my first interest in it was, probably in grad school, I was, training. Neural networks, very small things, especially by today's standards, and, I was working on some sort of applied math stuff on like proving that neural networks would converge or, like exploring the geometry of neural network loss surfaces, and, that's, things were [00:04:00] slow, I was actually a running stuff almost entirely on CPUs. There was one or two steps that I would move on to a GPU to try and speed them up. but I was stuck on CPUs because of like limited software support for some of the more exotic. things that I was doing in that project. now they're well supported. The libraries are way better. But at the time, you couldn't get Jacobian vector products out of PyTorch, you couldn't get, so you couldn't approximate things about, the second order, second derivatives. the kind of curvature stuff I was looking into. it came up as oh, this is the way to speed things up, and then it was like, all,couldn't find a way to really speed stuff up with it, like the programming difficulty ramps up pretty quickly, and at the time, wasn't deeply immersed in GPU stuff, but then after grad school, asneural networks started to become more useful. There were more open source models for people to build off of. and I was working weights and biases, helping people like, teaching people about deep learning, teaching people about, how to run training and how to monitor it, [00:05:00] I ended up. now I wasn't a poor grad student, I had access to good compute resources, and I was working with people who were, like, operating at a scale where it was absolutely necessary. and so that's when I first started spending more time with them, worked on a feature for showing, profiles. Of PyTorch runs that was probably the moment where I started diving deep you run like a you run like a trace of something and it shows you every you write a program and you write 50 lines of code and then you're on a trace and you look at the you look at the flame graph or the icicle graph and all of a sudden you're like wait what are all these functions that are running and also a PyTorch like PyTorch program you Has a CPU component and a GPU component and they're like running concurrently with each other and this is all revealed by looking at the trace in a way that I was I like Intellectually knew this existed. 
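For readers who want to poke at the same thing Charles is describing, here is a minimal sketch of capturing a PyTorch profiler trace. The model and input are toy placeholders, not anything from the episode, and it assumes a machine with a CUDA GPU.

```python
# Minimal sketch of capturing the kind of trace Charles describes:
# the model and input are placeholders, and a CUDA GPU is assumed.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        y = model(x)
    torch.cuda.synchronize()  # make sure queued GPU work is included in the trace

# Top operators by GPU time, then a trace file you can open in chrome://tracing
# or Perfetto to see the CPU and GPU streams running concurrently.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")
```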
I knew a couple of like tips and tricks for how to use it more effectively, but I'd never seen it, I'd never poked at it, never done more principled optimization, [00:06:00] and working on that feature for Weights and Biases, for being able to store those profiles on the Weights and Biases platform and then show them in a UI, that was when I first was like, oh, this is pretty interesting. What's a CUDA kernel launch? And what are these CUTLASS GEMM kernels? What does that mean? And that's when I started learning and reading about it more. hugo: Amazing. And we're going to do a demo later in the live stream and maybe we'll see these types of things as well. But the basic idea there, it is very enlightening to look at a flame graph, for example, where you see as a function of time, all the resources and what's consuming them and using them, that type of stuff. It is peeking under the hood, and getting insight into what's actually happening. charles: Yeah. So that was a big moment. And then I would continue to do that, use profiling as part of performance debugging and troubleshooting. I was teaching for Full Stack Deep Learning in 2022 and 2023, and that was a lot [00:07:00] more deployment, a lot more engineering and application, even than when I was at Weights and Biases itself, a big jump from grad school. And for that it was also, okay, let me figure out what all these different types of GPUs are so I can give good recommendations on what to use for training and what to deploy on. That's when I wrote the cloud GPU pricing guide with Sergey Karayev at Full Stack Deep Learning. I think that's still live. Prices are maybe a little bit out of date, but the guide's up. And then, that was actually what got me eventually interested in Modal, is I was looking around at cloud deployment options on GPUs, cause Lambda wasn't cutting it anymore, AWS Lambda, and so I needed to find a better option, and Modal was one of the platforms I found. It was the one I liked the most, and I got excited enough to join it, join the team. But yeah, throughout that process, just continually uncovering problems, or running into issues, or trying to add features, and discovering, oh, there's quite a bit going on at the hardware level that I need to understand in [00:08:00] order to be able to do what I care about at the application level. hugo: I love it. And that really dovetails nicely into a lot of the things we're here to talk about. Just want to say, so we've got H Buller from Chile. We've got someone from Bulgaria as well. And Biswa Roop is watching from Ottawa, a huge fan of Modal, they've said in the chat, and they share the same interest in category theory as well. Wow. Yeah, exactly. So what you've done at Modal is recently launched this GPU glossary, which I've linked to in the chat and I'll link to in the show notes, of course. What inspired it, and what are the most important takeaways for developers? charles: Yeah. The GPU glossary actually began its life as a document called "I am fucking done not understanding CUDA", which was just a little Notion doc that I had. Yes. So that was the start of it.
Like, basically, I was at, joined Modal, had been there for a couple months, I think, and now found myself,a lot of people at Modal, they're like, Platform systems engineers who maybe have supported data and [00:09:00] ML teams, even, but they aren't ML researcher, it's not really like an ML researcher background, like the co founders have done some ML stuff, but it's not it's not bread and butter for the company, like the company is, is, And so you need platform engineers, so like Linux kernel hackers and people who are on their fifth or sixth file system, but not people who had done this GPU stuff. So I found myself doing a ton of like user support and also like engineering and advising on OK, what, should we update the, what should our upgrade policy be for the NVIDIA drivers? like, how do we help users,run CUDA programs and GPU accelerated things on Modal? So I find myself like, It's one thing to, hack your way at it yourself and just,cut your way through the swamp to try and understand things. It's another to try and be able to, quickly answer questions as they're, like, coming up in meetings or as they are, like, asked by users in the support channel. And so, I was like, okay, I need some place [00:10:00] where I can go to, find information all together. and like a good, I need to like, collect this stuff up, make it like searchable and discoverable, share it with people, so that, that ended up, then we had an intern at Modal, Matt Napo, who was working on some stuff with, multi node running like, multiple, machines, not just one CPU, one physical machine, but multiple physical machines, each with multiple GPUs. So he had, he was like, wading into another part of the,complexity of working with GPUs, like, where it starts to mix with networking. and he wrote a document that was called the Interconnect Glossary. and that, those two things came together, and we were like, Oh, wow, this is actually becoming, super useful for us. Let's, let's make this something that other people can use. Something that,um,is a, is like a public resource and that people can contribute to hugo: amazing. So I also love, some of the best [00:11:00] documentation and best tools, of course, the ones that are developed for you personally, but me and you figure it out or as much as possible and then share it with the world. So I really appreciate that origin story, so to speak. What do you think the most important takeaways for people working with LLMs today are from it? charles: Yeah, I think. There's a couple, I think one of them, and this is something that I said in my like, thread on, about it on Twitter is, Kind of the core of the CUDA stack is this, is actually this instruction set architecture called PTX, not anything called CUDA, and but, that's actually just, before diving all the way into that, detail and explaining what that means or why it's important, maybe zooming out a bit, is that, the word Cuda, people use Cuda in this very, loose way, in the way that, I don't know, parents in the 90s would use the word Nintendo. It's that's a Nintendo, and this is a Nintendo, and,it's no, that's actually a Sega Genesis, and this is an N64. these are different things. CUDA is like an approach to building GPUs, the, architecture is [00:12:00] the A in CUDA, and that's it's actually like a design principle for computing. It's like a computing architecture, and, there's that core principle. 
And then there's a programming model. That's an abstract programming model, not a programming language, not anything you can point to and be like, there's the binary for it or there's the definition of it. I guess there is a definition, but it's abstract. It's a way to write programs so that they transparently scale with the hardware that will run them. And that's not obvious. And then finally, there is CUDA C or CUDA Fortran, which are implementations of that programming model, as actual programming languages, and then a software platform for that, which includes drivers for hardware, it includes libraries, APIs, all that stuff. That's, like, when people say I'm installing CUDA, they're [00:13:00] talking about that. But when people say NVIDIA's moat is CUDA, they're talking about this architecture for GPUs and all the things around that, and also this programming model and how it supports the continued success and health of their business. hugo: Thanks for all of that. And the thing is, I love, and we'll get into this a bit later, I do love that you mentioned PTX, Parallel Thread Execution, because these are the types of things that I think a lot of developers find scary and only want to encounter if they feel the need to, and even then, at least me, my type of persona. So I am interested in just thinking through the types of things developers need to do, and then thinking through what they need to know about GPUs, and from the glossary, in order to do these. So I think there are several, and we can slice it in a number of ways. Developers want to do inference. Perhaps they want to do fine tuning. Some may want to start training from scratch. We've got RAG. Then we have even more compute intensive stuff like diffusion models and text-to-image, text-to-video, these types of things. So maybe we can [00:14:00] start with inference. One of the amazing things about the world currently is that we don't even need to worry about hosting our own models if we want to ping vendor APIs, but if we do want to start thinking about hosting our own models, and there'll be an increasing number of reasons to do this as time goes on, we need to start thinking about how we use GPUs and what we need to know there. So maybe we can start with an example from inference. Recently, Meta released Llama 3.3 70B, and this looks baller, man, like the reports of how good it is compared to previous, like several hundred billion parameter models. One of the first things I saw about it is it outperformed GPT-4, 15 percent higher performance in Python coding, slightly better results for grade school math. This is all incredible stuff that I want to test out myself. And I don't have a lot of experience in even loading a 70B model and thinking through, wait, what? How much VRAM? What is that? And how much of that do I [00:15:00] need to do this? And do I do floating point something or... So my real question is, for developers who want to first run models for inference, whether it's 2B, 7B, 30B, 70B, 405B, what should they know about hardware requirements?
charles: Yeah, so I think the focus on parameter counts is helpful, because it points you to the first thing that you need to worry about, which is, if you care at all about the time that it takes to finish your inference, then you very badly want to make sure that all the weights of the model fit in the memory associated with the GPU, so the basically on-device but off-chip memory, also called the VRAM or the GPU RAM. If you can't do that, you have to put weights onto the CPU RAM and then pull them in and out, or even onto disk. And in the same way that, [00:16:00] in theory, you could run programs on your CPU that use virtual memory to push bytes onto disk instead of keeping them in RAM, but when that happens, your computer basically slows to a crawl, there's a similar problem with GPUs, or with language model inference, if you can't keep all of the bytes in the RAM that's got that low latency, high bandwidth connection to the actual compute units of the GPU. And in order to actually run computations, the von Neumann model is, you can't run computation on stuff that's in storage, right? You take it out of storage, put it in some kind of register, and then you compute on it. And GPUs follow that same model, and what that means is, because the little registers where you do the actual calculations can't store an entire 2 billion parameter model, to say nothing of a 70 billion parameter model, you have to move the weights in and out of the compute units on [00:17:00] every inference. So in some ways, if you've worked with databases before, it's like needing to do a sequential scan every single time to respond to a query. And so it's very much about, you want to keep it in memory, and you want to be able to rip through it as quickly as possible. So that's right there at the core. What it means is, when you're picking out a GPU in the cloud, or especially if you're trying to pick out a GPU to purchase long term, the first thing to look at with contemporary models is, is there enough RAM on this GPU for me to be able to run the models that I want to run, without having to spill to CPU RAM or disk? Yeah. hugo: Great. And so with these sizes, what type of RAM requirements do I... Is it like, I remember that you were saying, double it and then something? charles: Yeah, double it, and 30 percent is not bad. With language models, once you start running large batches and large sequence lengths, those start to have a meaningful contribution to the amount of memory [00:18:00] that things take up. I want to say transformer inference, with a KV cache, with storing the activations of the network, you end up with O(N) space and O(N²) time, where N is the length of the sequence, the number of tokens in the context. So what that means is, at a certain point, at many thousands of tokens, you're storing a couple of gigabytes or more of data, of internal activations from the network, and that also is gonna need to come in and out of the compute units of the GPU on each future token generation. That makes it hard to do the percentage thing, like, oh, what do I add as my percentage on top of what the model weights require? But the rule of thumb is take the model weights and then multiply them by the precision that you're using.
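To make that rule of thumb concrete, here is a back-of-the-envelope sketch in Python. The 30 percent headroom figure is just the rough rule of thumb from the conversation, and the numbers are illustrative, not a sizing guarantee.

```python
# Napkin math for serving VRAM: weights = parameter count x bytes per parameter,
# plus rough headroom for KV cache, activations, and framework overhead.
# The 30% headroom is the conversational rule of thumb, not a precise figure.

def estimate_vram_gb(params_billions: float, bytes_per_param: float, headroom: float = 0.3) -> float:
    weights_gb = params_billions * bytes_per_param  # 1B params at 1 byte each is ~1 GB
    return weights_gb * (1 + headroom)

for label, nbytes in [("fp16/bf16", 2), ("int8/fp8", 1), ("4-bit", 0.5)]:
    print(f"70B at {label}: ~{estimate_vram_gb(70, nbytes):.0f} GB including headroom")
# 70B at fp16/bf16: ~182 GB, at int8/fp8: ~91 GB, at 4-bit: ~46 GB
```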
So if somebody says they have a 70 B model, that's 70 [00:19:00] billion parameters. If it takes 2 bytes to store each parameter, then that's 140 gigabytes. And that's already too big to fit in the RAM of any single GPU that,that you can find. and if you had, if you can shrink them down to 1 byte, so you represent them as 8 bit integers or 8 bit floating point numbers, then now it's 70 gigabytes and you can fit at least the model weights. In the memory of,H100 GPUs, or A100 80GB GPUs. hugo: Great. And then on top of that, you suggested adding 30 percent just to be safe. Yeah, that's this depends on how many tokens you're feeding it, essentially. But charles: yeah, you could, you can start to do a little bit more math that depends on like the size, latent space, and like the number of requests you want to handle, at the same time. yeah, so that That one is maybe a separate estimation. If you expect to handle very long contexts, like many thousands of tokens or like many requests at the same time, [00:20:00] then you should consider, then you need to do a separate napkin math session to think about that, or profiling session. hugo: Yeah, I appreciate that. And this is once again, let's just make clear that this is an iterative process. You don't really get this right the first time. And there's almost a zen of doing it, right? Like you do it a few times and you get a sense of things. You feel the pain to be honest as well. on that point though, actually, I've linked to the GPU glossary, on Modal's website, you can find Modal Slack as well. Charles and the Modal team are great at answering these types of questions on the Slack community as well. They got nearly 5, 000 people there who were super excited about having these conversations as well. definitely check it out. I, also I don't, I didn't really want to have this conversation, but, I do want to say your point about thinking through what happens when we like have longer contexts and that type of stuff is super important. I just do want to say once again to everyone, public service announcement. long contexts are incredible, but also think about giving your LLMs smaller [00:21:00] contexts that's well managed data as well, as opposed to just stuffing stuff in. We need to remember that Johannes Kepler built his laws of planetary motion on 2, 000 data points that had been, Measured by Tico Brahe, the astronomer, and if you collect your data well and express it well and know what's happening in there, you do not need gazillions of data points. There are forces at work trying to convince us that big data is necessary, and it's important in a lot of cases, but not always. charles: Yeah. Definitely. And I think the more time you spend with the hardware, the more sympathy you have, it's like, Oh, do I really, I, once you know how much is going on in between each token generation, it's man, do I really want to do I really want to exercise that whole stack and, schedule thousands or hundreds of thousands of threads, just because I'm too lazy to figure out how to clean up my context. Exactly. hugo: and if you do look at the, the prompts you send things, it can be the [00:22:00] wildest stuff. so I love that you mentioned certain, trade offs, and I do, I'm just wondering how developers can balance VRAM and latency and cost, when running models at different scales. 
charles: Yeah, the hardest part, yeah, memory is something you have to buy, and it's fixed, so it's like this fixed resource, it's like land or something. But the thing about land is you can buy more of it, and that's kind of how it feels with RAM. You can also collect up more GPUs. It's relatively painless now to run on up to 8 GPUs, not just because we make it easy with Modal to run on 8 H100s, but also because of the thicker software stack around that. People have resolved a bunch of obnoxious multiprocessing bugs and written production grade libraries and foolproof libraries and tools, like vLLM, parts of the Hugging Face stack, Torch, so you can just run on multiple GPUs, so you can get 640 [00:23:00] gigabytes of RAM pretty straightforwardly. And, relatedly, you can increase the throughput of your system very transparently. So if what you care about is chunking through as much data as possible, and maybe as much data per dollar as possible, and you want to buy throughput, that's also pretty transparent. It's like buying more pipes, or buying more internet bandwidth, it feels the same way, like you can just get more of it. And yeah, on Modal, that looks like scaling up to many copies of 8xH100 machines, each one handling thousands of requests a piece. The problem is actually latency. Latency is the hardest problem to solve. It's a famous problem in many domains of computer architecture and systems design. It's like, yeah, I can increase the throughput of reads from memory, but can I actually decrease the latency? [00:24:00] No. You can almost always just buy more lanes to solve throughput problems, but you can't bribe the speed of light, and you can't bribe heat, the flux that certain components can handle, which limits how quickly you can change the voltage of a clock. It's hard to co-locate things, as opposed to getting lots of them spread out. So the key, maybe the takeaway from that for LLM developers, is that actually these interactive chat experiences that everybody has jumped on in the wake of ChatGPT and other similar products being successful, that's actually hard mode for these things, because they're very latency sensitive at the scale of a few hundred milliseconds, because it's human attention that you have to worry about. And that's going to be the hardest thing to run economically at small scale or yourself, whereas, you know, running a nightly job to [00:25:00] enrich data that you've pulled into your data warehouse, where it's like, oh yeah, as long as that finishes in under an hour, it'll be ready for when the report goes out, that's way easier to do economically. You can just scale that very transparently. And I think a key takeaway is that you should search for throughput oriented problems with language models and anything on a GPU, because really GPUs are these highly throughput oriented machines. Everything down at the silicon level is oriented to maximizing throughput by hiding latency, by doing as many things at once as you can. And so it's in some ways not really a latency oriented hardware architecture at all, as opposed to something like Groq's LPU or the Cerebras wafer scale engine, these much more latency oriented computing architectures. hugo: Awesome. We have a great question from Biswa Roop, [00:26:00] our friend from Ottawa who's interested in category theory as well.
Um, BOP asks, does long context in model affect inference time, and they clarify if they have the same LAMA three with eight context, eight K context length in one with 16 k context length, and not using the whole context. Does it matter what their max contact context length is? charles: Yeah, I think for, in principle, if you aren't actually using that whole context length, it should not impact inference latency. in practice, there are effects from, just setting a larger max model length. Just whether you're using something like tensorRT that is, compiling ahead of time, like compiling kernels for running your inference, or whether you're using VLLM, which also which is like a little bit more dynamic, but still also will ask you to set like a max model length, I think you will end up with, like larger block memory allocations or other things that can slow you down. I don't know actually the full details of why those [00:27:00] things are so much slower when you set the large When you set a larger context length. I would guess maybe there's like aggressive padding going on or something So while it's like not in principle a problem. I think in practice It ends up being Something that will slow you down. So you want to tune like at a very large scale You'll have like distinct inference like distinct services That handle different context lengths, just because, if you can saturate all of them, then you, will, save a lot on latency and on cost by, sending to different,engines tuned for different,context lengths or batch sizes. hugo: so we've talked a bunch about VRAM latency. I'm just wondering how people should think about cost, because, things can seem pretty cheap for a small amount of time, small amount of inference, but costs can blow up pretty quickly. So how can we think about these things? charles: Yeah, I definitely, yeah, it's. Important to think about them as [00:28:00] something that I've like learned by spending a lot more time in the sort of like platform and hardware level it's like whenever you talk about constraints on a system there's always like a secret constraint of cost which is like oh yes we can't get any faster because we aren't willing to lay a fiber optic cable directly to this user's home from our well you know like that's that's like a cost issue primarily and extreme example but that that generally is a problem. I think, I would say, if you, my, highest level advice here is that costs are on the downward slope. GPUs, models are getting smaller. The, new GPUs are gonna come out, and the previous GPUs are still gonna be as expensive. They're not as performant as they were before, but now they'll be cheaper because there's like new supply to meet, to meet demand. And so like we've seen this long, like fat, like faster than Moore's law decrease in the like cost per token. and so to a certain [00:29:00] extent, unless you're trying to make money today with a language model service, you don't have to worry as much about cost. You want to forecast it, look out and be like, okay, do I expect costs to maybe go down, continue to go down by an order of magnitude every couple of years,or like every two years maybe, so when will it, when will it be economical to run this workload if it's not today? and then, as part of getting closer to cost of cognition going to zero, we get to the point where like it's like CPU cycles, which are also like, Almost too cheap to meter. 
And we'll get to the point where some one billion parameter model with very good fine tuning is actually able to solve a lot of chatbot tasks, and you can run that model on NVIDIA's Blackwell series inference chip that's now super cheap because the next generation is out, and you can run it in [00:30:00] 10 milliseconds, or whatever. So I have an optimistic view of the future of costs here, so I don't sweat them as much in the here and now. hugo: One final question on inference, and I'm glad you mentioned fine tuning, because that's what I'd like to get onto next. We talked in passing around how quantization can affect hardware demands and improve performance. Just wondering if you can speak explicitly to some different types of quantization. How can they help us here? charles: Yeah. So quantization has turned out to be super critical. My mental model here is that neural networks are fundamentally actually analog computers. They wish to be analog, to be implemented as a sequence of memristors with voltages passing through them, or maybe, dare I say, synapses with released vesicles. And so they don't actually require the precision that people usually associate with digital computers. You flip a couple of bits [00:31:00] in a kernel, and you cause the entire system to come crashing down. You flip a couple of bits in a transformer, and you won't even notice, most of the time, with a few exceptions. If you end up with NaNs, things are bad, but within the actual numerical parts, you can flip bits and things are fine. So you can actually very aggressively quantize neural networks without reducing their quality in a meaningful way. People used to do single precision, 32 bit floating point was the usual thing people would train and release models in. That's now moved down to half precision floating point, definitely for release, and I think also for training. And you can just transparently quantize. You definitely don't want to run at anything higher than 16 bit precision, that suggests some issue or mistake somewhere below that. And actually, just a quick point: the primary reason to do quantization is to reduce the impact on RAM, on GPU [00:32:00] storage capacity. Secondarily, it's also that it's stored at a lower precision, so it takes up less space, and then you do need to move it into the compute units, and so there's less pressure on the memory bandwidth, on the bandwidth of the channel that communicates between the memory and the actual compute. So there's a benefit there. Where it is slightly less important is quantization for the actual execution of the compute. You know what, if you reduce the VRAM, and now you can fit it on a smaller GPU, you just cut your costs, even if you didn't speed up inference at all. And if memory, moving things in and out of memory, is actually what's bottlenecking your compute, then you can work in a higher precision in the compute, and you can go down to 4 bit weights with 16 bit compute and sometimes end up faster than doing 8 bit weights, 8 bit compute, 8 bit activations, matrix multiplications. And that is less important than its impacts on memory.
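One common way to get the "4 bit weights with 16 bit compute" pattern Charles mentions is loading with Hugging Face transformers plus bitsandbytes. This is a minimal sketch of that pattern, not the only toolchain for it; the model id is just an example.

```python
# Sketch: 4-bit (NF4) weight storage with bf16 compute via transformers + bitsandbytes.
# The model id is an example; treat the memory numbers as rough, not benchmarks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # weights stored in 4 bits, matmuls in bf16
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
# An 8B model's weights drop from roughly 16 GB in bf16 to roughly 5 GB here,
# which is the VRAM and memory-bandwidth win described above.
```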
This is [00:33:00] maybe another kind of big takeaway for me, a surprising thing. People focus so much on algorithms, and people focus so much on compute, they don't focus as much on memory and storage, where it turns out that's where the important stuff is happening, or frequently that is where the bottleneck is. Yeah, so to close out on the different formats, the 8 bit formats are pretty robust these days. There's now lots of tooling around them, for creating 8 bit weights out of existing 16 bit weights. There's some magic going on under the hood there, Neural Magic, literally. They're the ones who make LLM Compressor and do a lot of the quantization stuff for vLLM. Yeah, those work reasonably well. We're starting to see good stuff with 4 bit quantization, from Neural Magic and others, and that will also become increasingly important with the Blackwell chips, which have native 4 bit floating point support. So that's a [00:34:00] little bit more on the edge, relative to eight bit quantization and 16 bit quantization, but it's there as an option, especially for the very largest models, like the 405 billion parameter Llama models. hugo: Very cool. This type of stuff is one of the reasons I wanted to come and chat with you. I think there's so much meat in there for people who want to self host models for inference. So anyone listening or watching, please let us know how you go with this. Before moving on to other things such as fine tuning and training from scratch, I think it's worth pausing for a second. We've mentioned A100s and H100s, and I'm just wondering if you can tell us a bit about what all the different types of GPUs are, and how people should choose if they wanted to do inference along the lines we've been talking about. That's overwhelming, man, to be honest. Like, how do we think about the space of just choosing what to use and when? charles: Yeah, so I'd say there's maybe three primary divisions of NVIDIA's GPU offerings from my perspective. [00:35:00] There's the one that matters the most to me, or that I think the most about, and they have mostly been the only ones I've mentioned so far, the data center GPUs. Previously known as the Tesla GPUs, these are the ones that are designed for being put in a data center. They frequently have better networking, cause some of them are designed for training, and that class of GPUs is where you'll find H100s, the L40S, which we're releasing on Modal either today or tomorrow, T4s, L4s, anything you'd see on Colab, like the V100. Yeah, all of those are these data center grade GPUs, or data center GPUs. So if you look at those, they always have a little letter at the front, and that letter is like a shorthand for a generation of GPUs. Actually, another maybe high level takeaway from spending time on platforms and spending [00:36:00] time thinking about hardware for the GPU glossary is: find the slowest moving parts of any system, and that is where you're gonna find the interesting things and the truth. So the slowest moving part of NVIDIA's whole enterprise is the construction of new chips. A chip is the actual business end of the GPU. It's not the card. When people say, oh, an H100, or an A100, an A10.
That's a card, which is like a silicon board that connects a bunch of things, where it has wires between components, but within that is a chip that does the actual computation, that's separated from an external storage, that's separated from everything else, the graphics output, all that. Those chips, that's the slowest moving part, that's the part that everything else flows from. So there's generally a family of chips that have a letter associated with them. It's like GA100, [00:37:00] GH100, or G80, which I think was the first CUDA architected chip. Those are maybe used across multiple cards, and they define a family of accelerators that will all have Ampere as the name. So the A usually means Ampere, and that's a microarchitecture of the components of the chip, associated with a particular set of actual physical chips you could buy, that are like GA100, GA10x, whatever. Yeah. So that's your core there. And the current widely available state of the art is Hopper and Ada Lovelace. Hopper is for basically just H100s, that's not entirely true, and then Lovelace for the at-home GeForce 4090 type GPUs or the L40S [00:38:00] GPUs, those generations. It takes a couple of years to tick over a generation. So there's been like 15 years or so, maybe not quite 15 years, of CUDA GPUs, so we've ticked over now through many iterations, like seven of these, but they only change every couple of years. So if you slow down and learn the ones that have already been around, it's not like there's a new one every two months. It's not a front end JavaScript framework or anything. So yeah, that's it, with Blackwell being the next generation, and that will be in these B200 chips, or sorry, B200 cards. Yeah. So that's all data center grade GPUs. Then, the thing you would buy that you would use to, for example, play video games in your home, if you use GPUs for such an exotic purpose as that, that would be these consumer cards that NVIDIA puts out, frequently using the same core, definitely the same microarchitecture, the same design of the internal [00:39:00] cores of the chip, but a slightly different combination of them, and different components around it, different memory. So the same generations are used there, but they have different names. It's a different part of the company that puts it together into a card, they have their own release cycles or whatever. So the 4000 series cards, like the 4060, the 4070, is there a 4070? The 4080, 4090, those use these Lovelace chips in these GeForce 4000 cards, and the future GeForce 5000 cards will use Blackwell chips. And yeah, those are actually quite good. The 4090 has a pretty large amount of VRAM, I think 24 gigabytes, so enough to be pretty useful. The 5090 is rumored, I think, to have 32 gigabytes, not quite the 48 that maybe people were hoping for. So they make reasonably good local inference machines, but NVIDIA has not really pushed trying [00:40:00] to capture the local Llama market the way you might expect.
There's a third group, which is like your Jetsons, these system-on-a-chips, where the GPU and CPU share memory, and those are aimed at embedded systems. They want to put them inside every robot as their brain. NVIDIA's vision with those, I think, is for them to be the brains of all these smart devices that have neural networks running in them in the future, but for now, that hasn't really taken off yet, and it's mostly more people tinkering, I think, with those, so far. hugo: Super cool. Thank you for breaking all of that down and slicing the space in a way that I hope will be very useful for me, and I hope will be useful for a lot of listeners as well. So we've talked a lot about inference. I was on the fence about whether to discuss fine tuning explicitly as well, but I think we've been talking about fine tuning for some time, and even earlier this year, it wasn't clear how important it would be. It seems like it's getting more [00:41:00] important and the tooling is getting a lot better as well, so it's becoming more accessible. And I think everyone probably has a sense of what it is, but fine tuning a model essentially allows you to update the weights in a way that you can personalize it a bit more with respect to your data. If you know of transfer learning in machine learning, that's one analog, I think. So one example is that a lot of the base models are completion models, in the sense that the question and answer we're accustomed to in chat interfaces doesn't happen in these completion models, they just complete sentences. So if you say, what is the capital of France? They'll likely say, then what is the capital of Germany? Because they're trained on lists of questions like that from textbooks. What you can then do is give it question and answer pairs, called instruction tuning, on top of that, thousands or tens of thousands of these types of pairs, so that it becomes a chat model, and you can fine tune it that way. You can also fine tune diffusion models on images to personalize them with respect to your [00:42:00] own data, and those types of things. Before getting into the hardware considerations, I do want to say, it's interesting, I was going to say modern approaches, but LoRA seems ancient now in a lot of ways. There's also QLoRA, which is quantized LoRA. To break it down, a lot of this stuff is essentially matrix multiplication. LoRA is an acronym for low-rank adaptation, and you may know that matrices can be decomposed in a variety of ways, so we reduce the rank, that's the number of dimensions essentially, which reduces the computation needs. And you can do pretty well if you quantize and do LoRA. There are other techniques like parameter efficient fine tuning. That's all to say that fine tuning models is getting easier, and these types of techniques are lowering hardware requirements as well. So with that rant in mind, what are the key hardware considerations for fine tuning, as opposed to inference, which we just talked about? charles: Yeah, I think the reason why there is so much interest in parameter efficient fine tuning methods is because of how greedy for [00:43:00] RAM training is, including fine tuning. When you do fine tuning, you push things through the model, you look at the outputs, and then you need to push the corrections back through the model. That is backpropagation.
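For orientation, here is a minimal, generic PyTorch training step. The comments mark where the extra memory Charles enumerates next comes from; the model and data are toy placeholders, not anything from the episode.

```python
# A toy training step, annotated with where fine-tuning memory goes.
import torch

model = torch.nn.Linear(1024, 1024)            # the parameters themselves
opt = torch.optim.AdamW(model.parameters())    # will hold extra per-parameter state
x, y = torch.randn(32, 1024), torch.randn(32, 1024)

pred = model(x)                                   # forward: intermediate activations are cached
loss = torch.nn.functional.mse_loss(pred, y)
loss.backward()   # backward: materializes a gradient the same size as the parameters
opt.step()        # AdamW: running averages of gradients and of squared gradients,
                  # each again the same size as the parameters
opt.zero_grad()
```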
And there's a number of things that lead to that being much more RAM intensive. The first one is that you always need to cache or hold on to the intermediate activations of the model, the same way you do with a KV cache, but for every activation in the model, not just the KV components, so also queries and also feed forward network activations. So that adds to your RAM requirements. You can trade off a little bit of space and time with activation checkpointing, but that adds more complexity. Then also, you compute a gradient, which is a vector, in the end. It's a [00:44:00] vector the same size as the parameters: take all the parameters, list them out, big vector, and you need to compute a gradient that's a vector of the same size. And you actually do need to frequently realize that vector in memory, because you don't just use the gradient. The contemporary optimization algorithms that people use aren't the second order, extremely memory hungry optimization algorithms that I studied in grad school, second order methods that involve calculating Hessians, but they do have additional linear memory requirements. So you need an extra thing just as big as the gradient to store your recent average gradients, like an exponentially weighted average of gradients, and an exponentially weighted average of the squares of gradients, so the gradient variability over time, the variance or standard deviation of the gradient. So you've got now multiple copies of the parameters that you are [00:45:00] training that you need to keep around. So these parameter efficient fine tuning methods help a lot with that, by meaning that you aren't training all the parameters in the model. You're just training these little sidecar adapters that, through a little bit of matrix math, as Hugo alluded to, allow you to only keep a few thousand numbers in memory to represent a matrix that can still operate on a 1000 dimensional vector in, 1000 dimensional vector out. And so then you only need to keep those extra copies, for gradients, average gradients, and squared gradients, for a very small, tiny matrix, and so that helps a ton. There's one other piece actually that I forgot to mention, which is that I talked about how neural networks are analog computers, you can quantize them a lot. That is very much true during inference. During training, it's much trickier, at the very least for a lot of the accumulation stuff. [00:46:00] I don't know what the state of the art is here, because I fell a little bit behind on this, but at one point it was definitely the case that you needed to keep 32 bit, so single precision, four bytes per parameter, for everything: for the parameters, for the gradients, the average gradients, the squared gradients. So now you have this multiplier, not just of the number of copies of the parameters, but also of the size of those copies. So it's pretty quick to go from, oh yeah, I can totally run inference on this thing in my current hardware setup, to, oh wow, I can't even come close to running fine tuning, like full fine tuning, on this. So parameter efficient fine tuning helps a ton with that.
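Roughly, that multiplier looks like the sketch below. The bytes-per-parameter recipe here (16 bit working weights plus 32 bit master weights, gradients, and two Adam moments) is one commonly cited mixed-precision setup, used as an assumption for illustration; exact numbers vary by framework and settings.

```python
# Napkin math for the full-fine-tuning memory multiplier versus LoRA-style
# parameter-efficient fine-tuning. Byte counts are an assumed mixed-precision
# recipe, not a spec; activations come on top of both estimates.

def full_finetune_gb(params_b: float) -> float:
    bytes_per_param = 2 + 4 + 4 + 4 + 4  # bf16 weight, fp32 copy, fp32 grad, Adam m, Adam v
    return params_b * bytes_per_param

def lora_finetune_gb(params_b: float, trainable_fraction: float = 0.01) -> float:
    frozen = params_b * 2                                       # bf16 base weights, no grads
    trainable = params_b * trainable_fraction * (2 + 4 + 4 + 4 + 4)
    return frozen + trainable

print(f"7B full fine-tune: ~{full_finetune_gb(7):.0f} GB before activations")   # ~126 GB
print(f"7B LoRA fine-tune: ~{lora_finetune_gb(7):.0f} GB before activations")   # ~15 GB
```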
I think there's a reason for the model providers, the Metas and Alibabas putting out Llama and Qwen, to try and make it possible for people to still do some amount of fine tuning on a single node, just because that's so much operationally simpler and the software is better. That actually drives them to try and release models that still will fit [00:47:00] in that size. But yeah, as with many other things, the number one consideration is going to be: how much RAM does it take to store all the information required for this algorithm? hugo: Totally, and I just want to say, we've got a lovely comment in the chat. Ramona has said she's already learned so much; all she knew about the A100 is that it's the one that costs money on Colab. She's definitely learned a lot, a huge amount. Also, Charles, nobody's spoken Hessian matrices to me in a long time, so I appreciate that grid of second order partial derivatives coming straight at me. charles: Hugo, you may be the perfect audience for a figure in my thesis, which is a commutative diagram about null spaces of Hessians. That, no, of the three people who read my thesis, I think only one of them liked it. hugo: Now you're just flirting with me, Charles. charles: You could double that, Hugo, by reading it. hugo: Oh, whoo. Heating up over here. [00:48:00] That's great. So I love that. I am wondering, we do have some similar considerations when thinking about inference, but maybe can you talk me through an example of if someone wanted to fine tune a 30B model, how they'd think about that? charles: Yeah, I think, oh yeah, maybe the other thing is, you have more control over training than you do over inference, right? Training is something where you know what the data is going to be ahead of time. It may be something really big, so it's hard for you to control it, like if it's trillions of tokens, it gets to the scale where you can't even shuffle it or search it. But you have control over it, and so you can control how big the batches are, and, yeah, it's a bit more of a high performance computing workload, and a lot less of a distributed systems cloud workload in a lot of ways, and so it looks and feels pretty different. But maybe the biggest one to think about, again [00:49:00] on our theme of the most important constraints, is you're going to want to boost the batch size as much as possible. So even once you've fit running a single input through a model in a GPU, it's like, oh, you know what? In training, I don't just have one input coming in from some random user at any given moment. I actually have this giant queue of a trillion tokens that I want to rip through, or a hundred thousand, for fine tuning it's probably much smaller than a trillion. Yeah. You're immediately going to want to turn up the batch size, and it's not just to fit on a single GPU, it's to fit as many elements in a batch as possible. And one of the reasons to do that, to do a little bit more talking about the hardware details, the reason to go for a large batch size, is that actually batch size one is a pretty wasteful use of GPUs. What you do is you load a weight in from memory, and then it interacts with an activation, and then you [00:50:00] multiply it, you add it to another number, and then that weight never gets used again. So you moved maybe several bytes in order to do three or four floating point operations, maybe even just two floating point operations, multiply and add.
And then, that was it. So, a one to one ratio of byte movement to floating point operations. And GPUs are actually designed, basically optimally, an H100 wants, I want to say, about 2,000 floating point operations per byte movement, I forget. You look at the ratio of how many bytes you can move in and out from RAM to chip, to how many floating point operations per second you can do. That number is called the ops to byte ratio. There's another sort of sexier name for it that I'm forgetting right now, but that's a key per-system parameter that tells you what workloads this system will really rip through. And the ops to byte ratio for batch size one inference is [00:51:00] terrible. And so you want to blow up the size of your input so that you hit a higher ops to byte ratio, and the efficiency, the dollars per gradient, will go way up, or the dollars per processed token will go way up. This has a penalty to latency. You can always buy more throughput, can't buy more latency, but it does substantially improve efficiency, the time it takes to finish a training run, for example. And you start to worry a lot more about not just, can I fit running this on this hardware, but how many inferences can I fit at once on one unit. hugo: That makes sense. So there's so much to, pun intended, keep in memory for me. Do developers really need to think about all of these things? Cause it is a huge amount, and it is an argument, like, I want to go and ping an API as opposed to think about this anymore. [00:52:00] charles: Yeah. There's something to that, I think. Yeah. I would say that you don't need to, like, a lot of this stuff that I'm talking about, I'm covering all kinds of different things here. The point isn't to have all this stuff at the tip of your tongue ready to go. It's to be able to understand it well enough that, when it comes up, you can configure a more complex software system that you're using that abstracts some of this away or gives you high level parameters. So you know, oh, you know what, the four bit quants would be really great here 'cause I don't have enough VRAM. Yep. And oh, I can actually combine quantization and calculation at the same time, 'cause they use different units on the chip, on the GPU, like one uses the CUDA cores, the other uses the tensor cores. That knowledge is helpful for understanding the performance parameters of the system, or having intuition about what are the best parameters to tune. I think again, databases provide a useful analog here. You don't [00:53:00] need to be able to implement a B plus tree to know how to configure indices on a Postgres instance. And just as there's quite a large market for managed Postgres, and many people run managed Postgres instances, there still are people who run databases themselves, not least of which people who run managed Postgres for other people or build cloud extensions of Postgres, like your Neons or Nile or whoever, Supabase. So all of those people will need to understand this stuff, build this stuff. And then also, even with managed Postgres, you're gonna have to get in there and configure indices, and you need to know a little bit about what are the problems that the database is solving for me, and the trade offs between memory and disk storage, and the nature of write ahead logs, to be able to be an effective user of that database or the database product.
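The quantity Charles can't recall the sexier name for is usually called arithmetic intensity. Here is a sketch of the comparison he walks through above; the hardware numbers are approximate public spec-sheet figures for an H100-class GPU, included as assumptions for illustration, and the exact ratio depends on precision and whether sparsity is counted.

```python
# Arithmetic-intensity napkin math for the ops-to-byte discussion above.
# Hardware figures are rough, assumed values for an H100-class GPU
# (~3.35 TB/s HBM bandwidth, ~1e15 dense bf16 FLOP/s), for illustration only.

hbm_bytes_per_s = 3.35e12
bf16_flops_per_s = 1.0e15

hardware_ratio = bf16_flops_per_s / hbm_bytes_per_s
print(f"hardware ops:byte ratio ~ {hardware_ratio:.0f} FLOPs per byte moved")  # roughly 300 here

# Batch size 1: each 2-byte weight is read once and used for one multiply-add.
batch1_intensity = 2 / 2
# Batch size 256: that same weight read is reused across 256 rows of the batch.
batch256_intensity = (2 * 256) / 2

print(f"batch 1 intensity:   {batch1_intensity:.0f} FLOPs/byte")     # 1
print(f"batch 256 intensity: {batch256_intensity:.0f} FLOPs/byte")   # 256

# Until the workload's FLOPs/byte gets near the hardware ratio, the compute
# units are mostly waiting on memory, which is why large batches are so much
# more efficient per dollar and per token.
```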
hugo: So were I to want to fine tune a 30B model, for example, can you just talk me [00:54:00] through how I'd think about that, and what GPU you'd recommend looking for? charles: Yeah. So, a 30 billion parameter model. Let's assume that you're going to train it in 16 bits and do low-rank adaptation. We're looking at about 60 gigabytes of VRAM to store just the model. You would maybe be able to fit this on a single H100 with small batch sizes, so 80 gigabytes of memory on an H100 or an A100. That would probably be my choice for doing my initial experiments of playing around with it. Also, by the way, the first step is to make sure you get your whole software stack running, and a bunch of your data management, your experiment management and monitoring, perhaps with a tool like Weights and Biases. You want to get all that running with a much smaller model, maybe not even a two billion parameter model, maybe literally a hundred million parameter model. Stas Bekman's ML Engineering book has a great section on how you can [00:55:00] reach into the configs of a Hugging Face model and just shrink the latent dimension down by a factor of a hundred, and now you've got a really tiny model. And that tiny model is what you would use to figure out the software stack, right? Do you like Axolotl? Or do you want to use Hugging Face Accelerate directly? Or maybe TorchTune, PyTorch's library for training models? Or maybe Unsloth? I think Unsloth is oriented to training on a single GPU at a time. But anyway, you're gonna do a lot of that initial, kind of software-level training debugging on the smallest thing that you can, not just because that saves you money on cloud bills. Obviously, I don't want you to save money on cloud bills, I want you to run stuff on Modal, but it saves you a lot of money and time to be able to iterate quickly on something small. And so then, once I got to the point where I was like, okay, I like my overall stack here, I feel like I understand my data, I can overfit a single batch on this [00:56:00] tiny model, okay, now I'm gonna go to this 30 billion parameter model and squeeze it into one GPU. And then do a little bit of, maybe overfit a single batch, try to get loss zero, literally loss zero, on a single batch. I'm probably gonna have to adjust a bunch of hyperparameters, gonna have to adjust a bunch of system parameters for tuning. And then once I'm happy with that, then scale up to eight GPUs at once, eight H100s or eight A100s. And at that point, unless you really want to get into it, there's not going to be any more scaling up, because now you're at the point where you need multi node training, and that's very hard, and the software's terrible, and there's problems at the hardware level, and it's very nasty. If I can, at that point, I then start trying to adjust my problem to fit into [00:57:00] that scale of hardware. So okay, maybe I need to shrink my model a little bit, or find a more efficient model, or I need data augmentation because this model is going to need way more data points to be able to learn to the level that I want, and I can't run a bigger one because that would kick me out of this scale that I can work at.
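Charles mentions the trick, from Stas Bekman's ML Engineering book, of shrinking a Hugging Face config to get a tiny debug model. A minimal sketch of that idea follows; the model id is just an example, and the attribute names are for Llama-style configs and differ across architectures.

```python
# Shrink a Hugging Face config to get a tiny, randomly initialized model for
# debugging the training stack. Attribute names are for Llama-style configs;
# the model id is an example and may require access approval to download.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B")
config.hidden_size = 128
config.intermediate_size = 256
config.num_hidden_layers = 2
config.num_attention_heads = 4
config.num_key_value_heads = 4

tiny_model = AutoModelForCausalLM.from_config(config)  # random weights, tiny by comparison
print(f"{sum(p.numel() for p in tiny_model.parameters()):,} parameters")
```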
And then I would do as little tuning and as little work as possible at that scale, just because of how long each iteration cycle takes. Even with Modal making it as fast as possible, it's still not fast; it's still minutes for many of these things to spin up before you finally get any useful information, and it's hours for a training run. So you really want to get as much out of the way as possible before you get there.

hugo: Man, that is so useful, particularly thinking about the iterative approach: start with something small, which helps you figure out the tooling as well, and then grow with your project. I was going to say you heard it here first, people, but you may not have; if you've heard it before, [00:58:00] you've heard it again here. Something you have heard here first, though, which I wanted to mention earlier: Charles said, you can't bribe the speed of light. That's something I'm going to take with me. Wonderful. So I hope everyone feels a bit more enabled to go and start fine-tuning their own models, for pleasure as well, even if you're not doing it at work. It makes me feel like I have a bit more agency when I'm able to do these types of things. And that leads nicely into the next thing I want to talk about, which is training from scratch. I think most people in the world will never have to train from scratch. Anyone who works with models, though, I would really encourage to have a bit of fun and try to train GPT-2 from scratch yourself. It is so empowering. I'll never train something at GPT-4 scale, or Llama 70B, whatever it is, but the ability to go and train a [00:59:00] GPT-2, among other things, is incredibly empowering. So for developers who want to do something like this, what do you think they need to know about compute requirements and scaling hardware?

charles: Yeah, this is the point at which the multi-node training line really strikes. I would say GPT-2 you can train in five minutes on Modal, using one of the recent nanoGPT speedrun codebases. I posted it on Twitter; DM me if you're interested in getting a hold of the code for it. At that scale, and I think those speedruns use the smallest GPT-2 model, which is about 170 million parameters, so very small by contemporary standards, you can train on pretty reasonable-scale datasets on a single node. And so again, if you can find a way to stay there, that's really nice. Those models also tend to end up being ones you might be able to run [01:00:00] on a high-quality CPU, if you have a way to run inference for them that takes advantage of the vector lanes in a CPU, so AVX instructions on Intel. Then you can actually deploy them on CPUs, on an AWS Lambda or a Google Cloud Function or whatever, and that makes your life a lot easier. But for training anything larger than that, you're quickly getting to the point of needing multiple collections of eight GPUs, which then need to be tightly interconnected with each other. I think the easiest-to-use product there is Lambda Labs' one-click cluster: they'll get you InfiniBand-interconnected H100s, up to 256, maybe more, and you can get a hold of one within a day, which is pretty fast for getting a hold of what is effectively a supercomputer. It's wild.
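To see why the interconnect becomes the story once you leave a single node, here is a rough data-parallel gradient-sync calculation. The bandwidth figures are assumptions for illustration (InfiniBand-class links in the hundreds of gigabits per second, ordinary Ethernet in the tens), not measurements.

```python
# Rough sketch: time to all-reduce gradients for data-parallel training.
# Bandwidth numbers are illustrative assumptions, not measurements.

params = 8e9              # an 8B-parameter model
grad_bytes = params * 2   # 16-bit gradients: ~16 GB per step
# A ring all-reduce moves roughly 2x the payload across the slowest link.
payload = 2 * grad_bytes

for name, gbits_per_s in [("InfiniBand-class link", 400), ("ordinary Ethernet", 25)]:
    bytes_per_s = gbits_per_s / 8 * 1e9
    seconds = payload / bytes_per_s
    print(f"{name:<22} ~{seconds:.1f} s per gradient sync")

# If a step's compute takes a second or two, a multi-second sync over plain
# Ethernet dominates the step time unless you overlap or shard cleverly, which
# is roughly the middle ground between single-node and InfiniBand clusters.
```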
Yeah, I [01:01:00] think maybe the biggest thing on the hardware side is that at that point the GPU hardware is staying the same and you're changing everything around it: how the GPUs talk to each other. So the interesting hardware there is that you need custom network interface cards and actually custom cabling. InfiniBand is the technology used for communication at the latency and throughput requirements of larger-scale training jobs. And much like CUDA, InfiniBand is actually a thick stack going all the way from the physical hardware up to, basically, almost the application layer. It's a messaging service, so maybe the same level as TCP. InfiniBand goes from InfiniBand verbs, which are very TCP-level, all the way down to a specification for the physical hardware and how bits should be transmitted as electrical impulses or [01:02:00] as light waves along an InfiniBand cable. That's what you're going to need to train anything like, let's say, an 8 billion parameter language model or larger.

There's actually an interesting middle ground that we're looking into at Modal right now: models larger than GPT-2, where training doesn't economically or feasibly fit on a single node, but where you don't need InfiniBand interconnect for the absolute highest throughput and you can get away with a regular internet stack, Internet Protocol and friends, Ethernet, things that people actually have in their house and have seen and touched. Training at a scale in between, say, GPT-2 and Llama 7B actually works reasonably economically on [01:03:00] that hardware. Watch this space with Modal, and also check out some of the cloud providers, which have reasonable offerings for that.

hugo: Thank you so much, and I just want to repeat that you can now train GPT-2 from scratch in five minutes if you have access to the right hardware. That's absolutely incredible.

charles: That's on Modal, by the way. You can do it on eight H100s with a simple modal run command. And there's a ten-minute compile step for torch.compile; it actually takes longer to run torch.compile than it does to train the model, which is wild. But even with that included, it's still under Modal's $30-a-month free compute limit, so you can train GPT-2, the smallest size, for free, now.

hugo: Amazing. And I just wanted to mention that Modal does have a free tier, so if people want to check it out, do. And as I said, Charles and team are super helpful on Slack. We've got a comment in the chat. Michael Liu [01:04:00] says, GPT-2 small only, though. Sure, dude. But this is still incredible, I think, right?

charles: Yeah, and it would just take longer to train some of the larger ones. I would believe that you could get maybe another order of magnitude of training in a way that didn't make you want to die on 8x H100s. The speedrunning stuff is oriented at the small, or slightly larger, end.
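For flavor, here is a hypothetical sketch of what launching a single-node, eight-GPU training job on Modal can look like. This is not the speedrun code Charles mentions; the image contents, the `train.py` entrypoint, and the exact GPU string are assumptions based on Modal's documented decorator style, so check Modal's docs before relying on it.

```python
# Hypothetical sketch of a single-node, 8x H100 training launch on Modal.
# Not the nanoGPT speedrun code itself; names and args are illustrative.
import subprocess
import modal

image = modal.Image.debian_slim().pip_install("torch", "numpy")
app = modal.App("gpt2-speedrun-sketch", image=image)

@app.function(gpu="H100:8", timeout=60 * 60)
def train():
    # torchrun spreads one process per GPU on this single node; train.py is a
    # stand-in for whatever training script you've baked into the image.
    subprocess.run(
        ["torchrun", "--standalone", "--nproc_per_node=8", "train.py"],
        check=True,
    )

@app.local_entrypoint()
def main():
    train.remote()  # `modal run this_file.py` kicks it off from your laptop
```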
hugo: Yeah. And to be clear, small now means hundreds of millions of parameters.

charles: Yeah. Raise your hands in the chat if you've trained a neural network with a thousand parameters or fewer.

hugo: Totally. So, there are a few other things I'd like to talk about in terms of the types of tasks people have to do. For information retrieval, a pretty hot way to do it is RAG, retrieval-augmented generation. The hardware requirements for that, in my understanding and in my [01:05:00] experience, don't really differ from inference that much.

charles: So, definitely, for a million different reasons, if you want to change the behavior of a language model, changing the inputs is way easier than changing the parameters. And retrieval from an external information system looks very similar to all the other software stuff that's happening around language model inference. Maybe one interesting point is tool use and agentic retrieval, where the language model decides what information it wishes to retrieve. There, at a software level, you want to be really good at asynchronous inference: basically, one request has reached the point where the language model needs to make a tool call, it's going out to a database, and just as you would in a web server, you want to put that one on pause and run other work at the same time, and then bring it back in once the results are available. That's the sort of thing that starts to make inference [01:06:00] too hard to run yourself, and that's where you would bring in a vLLM or SGLang, one of these inference servers. It's not just that they have a nice interface for setting up models and give you an OpenAI-compatible API for free and these other things; it's also that they have that kind of concurrency and scheduling logic built in. And scheduling is some large fraction of what operating system kernels and databases and web frameworks really hang their hats on. So I think there's a similar thing happening already with vLLM, for example.
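A toy sketch of that scheduling point: while one request is waiting on a tool call (a database lookup, a web search), the server should be doing other work. Everything here, `generate_until_tool_call`, `run_tool`, and `continue_generation`, is hypothetical scaffolding to show the shape of the idea, not any inference server's real API.

```python
# Toy sketch: overlapping tool calls with other requests via asyncio.
# All functions below are hypothetical stand-ins, not a real server API.
import asyncio

async def generate_until_tool_call(prompt: str) -> dict:
    await asyncio.sleep(0.1)   # pretend this is LLM decoding
    return {"tool": "search", "query": prompt}

async def run_tool(call: dict) -> str:
    await asyncio.sleep(0.5)   # pretend this is a slow external lookup
    return f"results for {call['query']!r}"

async def continue_generation(prompt: str, tool_result: str) -> str:
    await asyncio.sleep(0.1)
    return f"answer to {prompt!r} using {tool_result}"

async def handle_request(prompt: str) -> str:
    call = await generate_until_tool_call(prompt)
    # The await below parks this request; the event loop serves other requests
    # (or keeps GPU batches full) instead of blocking on the tool.
    tool_result = await run_tool(call)
    return await continue_generation(prompt, tool_result)

async def main():
    answers = await asyncio.gather(*(handle_request(f"q{i}") for i in range(8)))
    print("\n".join(answers))

asyncio.run(main())
```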
hugo: I'd also probably be subject to abuse on LinkedIn if I didn't ask you about agents. This is a very general and partially tongue-in-cheek question, but also a real one: when thinking about more agentic approaches, how do we think about hardware?

charles: A year ago or [01:07:00] so, we had Shunyu Yao and others on the LangChain webinars. I was hanging out with them a lot and talking about these things; he did ReAct and Tree of Thoughts, and it really seemed like we were going to make a lot of progress on that very quickly. I think it turned out that going from 90 percent to 99 percent with a lot of that stuff was harder than we thought, and there are any number of reasons for that. It could be that we just need to get better at prompting and steering models. It could be that we need another order of magnitude more data. It could be that we need these GPT-5-scale, multiple-data-center-scale training runs before we have that level of intelligence. Or it could be that we need a completely different approach to training: that just training on passive byte streams is insufficient, and that we need to train models as [01:08:00] agents, a.k.a., to pull out the RL textbook, something that takes actions that modify the state of an environment in order to achieve a goal. Maybe we have to train them so they go out on the internet and try to achieve goals, and they receive reward signals, and that's the only way to actually get reliable agentic behavior.

So there's that whole collection of interesting, difficult technical problems, and then slathered on top of that is a lot of LinkedIn-influencing that just migrated from crypto to AI: overenthusiasm or grifting on the idea of replacing human labor with machines, or automating all processes, or whatever else people are associating with the term agent now. So there's a bit of an association with some sloppy thinking about what to use language models for, or how to use them, or even sloppy [01:09:00] engineering in general: oh yeah, bytes in, bytes out, actions in an environment, it's a pretty generic API, you can slap it on anything without having to think about the problem very hard. So despite my general enthusiasm and interest in cognitive architectures, in creating minds out of these little artificial brains we've created, I feel a little bit let down by what's happened. And to your initial question, it feels like the only answer from hardware on that is: scale up.

hugo: Makes sense. And I can't believe I've never heard the term LinkedIn-influencing before, so I'm going to take that with me in everything I do.

charles: The siren call of LinkedIn influencing is there; it's hot and tempting. And I let Claude do all my LinkedIn influencing for me.

hugo: I do actually. If we can take a moment, though: I was trying to draft something with Claude, and it messed it up completely. I asked it [01:10:00] what about it it had messed up, and I wanted it to get into the pre-training and then RLHF side of things. I kept asking why, why, why, and I posted this on LinkedIn yesterday. It finally said: when you keep asking why, you're forcing me to be increasingly honest and simple. I'm just a language model that took your sharp, specific, technical content and regurgitated bland marketing speak instead. No deeper explanation needed, no human-like qualities or complex rationalizations, just an AI that made your content worse. And there's something very beautiful about that to me, some deep honesty coming from a language model, that touched me as a human.

charles: Yeah. I do miss a little bit the really unhinged era, the GPT-3 API era, the AI Dungeon era, when you saw a lot more poetry and insanity out of models; they've really been tamed. Claude occasionally is still capable of poetry like that, which I like. [01:11:00]

hugo: Absolutely.

charles: And the new fine-tunes of the Llama models: I think they inject enough interesting material into their continued pre-training corpora and their fine-tuning corpora that you end up with the occasional little bits of poetry.

hugo: Yeah, and some of the uncensored models can be pretty unhinged as well.

charles: Yeah. I was hoping to see more use of the sparse autoencoder stuff for steering. That's another one where it's a cool research idea, but people haven't been able to drive it to the point of general utility yet. All you can really do is orthogonalize away the model's refusal tendency, so it'll answer all your requests, or make it think it's the Golden Gate Bridge; there isn't much steering in a production context yet.
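A tiny numerical sketch of what "orthogonalize away the refusal tendency" means: take a direction in activation space that correlates with refusals and project it out of the hidden state. The vectors here are random stand-ins; finding the actual direction (via a sparse autoencoder or difference-of-means probing) is the hard part this skips.

```python
# Toy sketch of activation steering by ablating a "refusal" direction.
# The direction and hidden state here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
d_model = 4096
hidden = rng.standard_normal(d_model)        # one token's residual-stream activation
refusal_dir = rng.standard_normal(d_model)   # pretend this was found via an SAE / probing
refusal_dir /= np.linalg.norm(refusal_dir)   # unit vector

# Remove the component of the activation along the refusal direction:
#   h' = h - (h . d) d
ablated = hidden - (hidden @ refusal_dir) * refusal_dir

print("component before:", float(hidden @ refusal_dir))
print("component after: ", float(ablated @ refusal_dir))  # ~0 up to float error
```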
hugo: Totally, I love that. You also mentioned the Web3-meets-AI grifting vibe. I don't know if I talked to you about this: I went to Miami for the first time last year, for a conference, a data and AI conference, and went to some data science meetups, [01:12:00] and it was impossible to tell who was for real or not. I couldn't even tell, am I for real? Because a lot of people, particularly in Miami, were pivoting. I'd never been to Miami before; I've hosted and been to a lot of meetups in California and the Northeast, but you go to these meetups and, even to talk business, you're doing shots with people and stuff like that. It's a totally different vibe. That spun me out. Love Miami. Wish I'd been there in my 20s or 30s, not my early 40s, to be honest.

charles: Yeah. We're definitely at the point where, as a field, we need to start delivering on the things we said were possible a year or two ago. We've been promising robot girlfriends, and booking your flights for you, and a totally new user interface to machines, and it feels like a lot of that stuff has not yet been delivered. So maybe put the shot glass [01:13:00] down, pick up the GPU glossary, and start shipping.

hugo: This is going to be a pull quote for whatever we produce out of this: put the shot glass down and pick up the compute glossary. You can't bribe the speed of light. But to be clear, I'll push back a bit on the idea that we're one homogeneous field making promises. I'm not going to deliver on Sam Altman's promises, and I'm not going to deliver on Jensen's promises, and I don't have the vested interest that they do either.

charles: Yeah. We use the same words to refer to what we do as they do: we talk about artificial intelligence and language models and scaling. So we are inextricably bound up with our fellow humans in this field, whether we want to be or not, and we have benefited from, and may suffer from, the actions of others under our same banner.

hugo: Hey, I appreciate you've got a hard [01:14:00] stop in 20 minutes, right? Maybe 10, even. Okay, in that case I'd love to quickly hear about diffusion models and what hardware considerations come up there, and then, if we can, get into a quick demo. If we don't have enough time for both, we can just do the demo.

charles: Yeah, I'll say something real quick about diffusion models. The nice thing about diffusion models relative to transformers: transformers are autoregressive, so when you put out a bunch of tokens for a user, you have to pass through the model many times, and this is necessarily serial rather than parallel. So it's actually quite a bit harder to make effective use of a parallelism- and throughput-oriented architecture like a GPU to serve a transformer. During training they're incredible, they map onto the hardware very nicely, but during inference, not [01:15:00] so much. Diffusion models are very different: you output the whole image at once, effectively. They don't behave autoregressively, and that shows up as a much higher ops-to-byte ratio in a lot of very natural workloads.
And so that actually makes it easier to make effective use of GPUs with image generation or audio generation, with diffusion models. And relatedly, or perhaps distinctly, useful diffusion models can be quite small. There are many of them in the billion-or-so parameter range, and the best open models for image generation, like Flux Schnell at 12 billion parameters, are very good and very useful, whereas the equivalent language model is something like 70 billion parameters. So that also makes them [01:16:00] slightly easier to use. This is one of the features that has led to quite a bit more adoption of diffusion models on the Modal platform than language models: better open-source models that are smaller and that make more economical use of the hardware. So that's a surprising feature of diffusion models relative to the transformers.
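A rough comparison of why diffusion inference maps onto GPUs more happily than autoregressive decoding. The token counts, step counts, and latent sizes below are illustrative assumptions, not benchmarks.

```python
# Illustrative arithmetic: serial passes and work-per-pass for autoregressive
# decoding vs. diffusion sampling. Numbers are assumptions, not measurements.

# Autoregressive LLM: one full forward pass per generated token, in sequence.
new_tokens = 500
print(f"LLM decode: {new_tokens} serial passes, ~1 new position per pass")

# Diffusion image model: a fixed number of denoising steps, but every step
# processes the whole latent (thousands of positions) in parallel.
denoise_steps = 30
latent_positions = 64 * 64   # e.g. a 64x64 latent grid
print(f"Diffusion: {denoise_steps} serial passes, ~{latent_positions} positions per pass")

# Per pass, both read the model weights once; the diffusion pass amortizes that
# read over ~4096 positions instead of 1, so its ops-to-byte ratio is far higher
# even at batch size one -- the GPU-friendliness point above.
```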
hugo: Super cool, I appreciate all of that. I do want to do a demo, but we did say we'd quickly look at some specs; can we do that in two minutes? So, everyone: most MacBook Pros these days have GPUs as well, so you can actually do a bunch of cool stuff locally. There are also apps for this; I've got one called MLC Chat on my phone, which lets me load 3 billion parameter models and that type of thing on my cell phone. So check out MLC Chat or otherwise; it's a cool app. When Llama 3.2 came out and they released their 3B language model, I could just use it [01:17:00] there. People have heard me wax lyrical about this before: calling it Llama 3.2 is such a boss move, because it has so much cool stuff in it and it sounds like a mere increment to 3.1. But I saw recently that this M4 Max MacBook Pro, and I'm not being paid by Apple to say this, by the way, but I'm amazed they've released this chip, lets me buy a consumer-grade laptop with a 40-core GPU and a 16-core CPU. I can get it with 64 gigabytes of unified memory and a terabyte of SSD storage, all for something like 5k USD when spec'd out like that, which clearly isn't cheap, and you can get a lot of this stuff cheaper. But in terms of having a laptop from Apple [01:18:00] like this, I think that's pretty cool. So I'm wondering, and I'm a bit of a noob with this stuff, which is one of the reasons I wanted to have this chat: how would I think about what I can do with something like that, and about value for money?

charles: Yeah, definitely. I would say the value-for-money move is to wait until they release the M4 and then go one to two generations back and buy something from that generation. It actually looks fairly similar to the buying guide I would give for NVIDIA local hardware: the 5000 series is on the horizon and is going to come out soon, so look at the 4000 series, and then find the one with the largest memory. As has come up many times in this conversation, find the one with the biggest RAM that you can, and prefer buying more of that over anything else. With the M4s you can get up to 128 gigabytes of unified memory, meaning memory that both the GPU and the CPU address: it's physically contiguous and the virtual memory is the same. Because of that unified memory, and because of the scale of it, 128 gigabytes, you can run stuff economically on an M4 that you might not [01:19:00] be able to run economically on an H100. I don't think they have 128 gigabytes for the M3s, but if you can find an M3 Max with 128 gigabytes of memory, whose price has now gone down because of the M4 Max, that's a pretty solid maneuver.

The other thing is that with a lot of the stuff you'd want to do with language models and diffusion models, you're going to be memory bound, as we've discussed: moving data from RAM into the compute units is where the bottleneck is. And memory bandwidth has gone up quite slowly from M1 to M2 to M3 to M4. I think between M2 and M3 it might not have gone up at all; it did go up from M3 to M4, I'm pretty sure. We'd have to pull up the memory bandwidth specs, which [01:20:00] are not top-tier marketing material, but if we pulled those up, you would see they've been going up fairly slowly. And that's pretty important: it's the next constraining resource for local inference, immediately after RAM capacity. So that's another reason why you don't have to upgrade as aggressively with these machines if you're able to get one with a big memory.

Then, on the number of cores, GPU cores and Neural Engine cores: I don't know as much about the internals of Apple's chips as I do about NVIDIA chips. My understanding is that the Neural Engine is like a tensor core, in that it operates on a bunch of elements at once, a little bit like the vector lanes of a CPU, whereas a regular GPU core on Apple is a little bit more like a CUDA core in an NVIDIA GPU, which operates on one number at a time rather than a bunch of them. I don't know exactly the [01:21:00] relative strengths of the Neural Engine and the regular GPU cores, but I would say it's likely that for many of the things you'd want to do locally, RAM and the bandwidth of that RAM are going to be your constraining resources, and those change more slowly than the compute does.

And one final point: you are probably going to do other things with this computer. You might edit a video. You might browse the internet and open multiple tabs in a contemporary browser and download many copies of many JavaScript libraries into your RAM. You might run a RAM-hungry data analytics tool in a browser tab. The RAM on this system, the same RAM that supports all of those things and makes them run blazing fast and avoid spilling to disk, is also what you use for your local [01:22:00] inference. So that makes this pretty incredible hardware for many things you might want to do. I haven't done a deep dive on CPU hardware and what else is out there, so I won't say there's no other option, but it is a very compelling package.
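To make the "memory bandwidth is the next constraint after capacity" point concrete, here is the usual napkin estimate for batch-size-1 local decoding: an upper bound of roughly bandwidth divided by the bytes of weights read per token. The bandwidth and model-size numbers are assumptions for illustration; look up the real figures for whichever chip you're eyeing.

```python
# Napkin estimate: memory-bandwidth ceiling on local, batch-size-1 decoding.
#   tokens/sec  <~  memory bandwidth / bytes of weights read per token
# All numbers below are illustrative assumptions, not official specs.

models = {
    "8B model, 4-bit quant":  8e9 * 0.5,   # ~4 GB of weights
    "8B model, 16-bit":       8e9 * 2.0,   # ~16 GB
    "70B model, 4-bit quant": 70e9 * 0.5,  # ~35 GB
}
chips_gb_per_s = {
    "laptop-class unified memory (~400 GB/s)": 400,
    "laptop-class unified memory (~550 GB/s)": 550,
}

for chip, bw in chips_gb_per_s.items():
    print(chip)
    for name, weight_bytes in models.items():
        ceiling = bw * 1e9 / weight_bytes
        print(f"  {name:<24} <~ {ceiling:5.1f} tok/s")
```

Real throughput lands below these ceilings, but the shape of the answer, capacity decides what fits, bandwidth decides how fast it runs, is the takeaway.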
hugo: Super cool. And you're right, I actually do way more video editing than I'd like to at the moment. On top of that, who knows, I might become a gamer if I get enough GPUs. Probably not, and Apple's not really the platform for gaming anyway, which shows how much of a gamer I am, but I am interested. So, just to close this out: unified memory, max it out, is what I'm hearing. Having said that, it ships with 48 gigs; you can upgrade to 64 gigs of unified memory for 200 bucks, and you can then upgrade to 128 for around 800 USD. And what I'm hearing is that's probably 800 dollars well spent.

charles: I think it's probably worth it, yeah, especially if you think about the long term. [01:23:00] Every time I've traded in a machine, or given up on a machine, it's been fundamentally because there was not enough RAM: what seemed like a good idea at the time I bought it was outstripped. So I think it's worth it. Last time I bought a personal laptop, I think I went with 64 gigabytes of unified memory on an M3, which might not have been the largest available, but that was a couple of years ago, and I've learned even more about the power of memory since. I think I would make a different choice with my own dollars today.

hugo: Let's get into a demo now, because I appreciate you have a hard stop. And everyone, Charles has actually got another event, an in-person event, which is very old-fashioned, but I approve. I just want to say, Michael Liu has a bunch of great questions in the chat which we won't get to; I'll put up Charles's and my Twitter handles, and feel free to ping us on the Modal Slack and ask about them. One question Michael has, while you're getting this up and running, Charles, is about buying a [01:24:00] gaming PC with a 4090 and using it as a personal SSH box. I love that idea. One reason I'm not doing it: there's something about the ease of just being able to do things on my MacBook Pro without having to SSH in. I hate using terms like developer ergonomics, but that's essentially what it is; the barrier to entry is so much lower for me. But definitely, in terms of cost efficiency, getting a gaming PC and SSH-ing in is the move.
we talked about scaling up things, being able to handle more requests or larger models. and these are all problems that, Actually, Modal, the platform that I work on, helps solve. Modal makes it easy to take your Python code and run it in the cloud, to deploy Python applications. it's got a bunch of, nice primitives, remote dictionaries, distributed file systems, like, all this kind of stuff that you need, and GPUs. let me show you what it looks like to, run something on Modal. I've got a Python file. And I'm just going to use the modal command line [01:26:00] tool,modal run to run this script the same way I might run a Python script locally. Let me run this,script, on modal infrastructure. So this is like spinning up a bunch of containers, I guess just one container, and running something on modal, running in the cloud, getting me back results. And it happened faster than I could even say what was going on. so then, let's see, the nice thing about modal is that it's not just oh, I want to run one thing one time. It's like I have an embarrassingly parallel task that I want to run a bunch of. and so let's do print the square is xyeah, let's see, let's make this a for loop. For x in square dot map, range 42. So I'm just, now I'm saying I want to run this function a bunch of times. this function that runs on modal. And then I'm going to want to print the square is x. yeah, okay. Format that. Move it over, alright, print the square as x, blah blah blah. So like this [01:27:00] function here is a regular Python function. Nothing up my sleeves here. all we did was decorate it. This is decorators. Decorators in Python are like Python's version of higher order functions. They take in a function and return a function. Category theorists in the audience perhaps smell a monad. but they take a function and they give it some new feature. there's lots of decorators in Python. Modals decorators are like, run this in the cloud. so that's what's going on here. And it's not just run in the cloud, but also make it possible to run in parallel in the cloud. Which should happen now, we're running it 42 times. Oops, I missed, I thought I typed an F there. so we wrote, we printed the square as x 42 times. It's not, that's even less compelling than the demo normally is, in terms of its outputs. Let's run it 420 times. Now you can see 10 containers or so spun up to handle that workload. maybe let's do 4200 and I'm going to make it a little bit wider, a little bit wider there, so we can maybe see the [01:28:00] outputs a little bit more. So now you can see 4, 000 inputs. You can see the containers are spinning up. It's stabilizing at about 20, 25. printing the outputs. Printing this, printing logs and printing outputs here. all as easily as it would be to like run this script. and then lastly for the piece de resistance, let's do GPU equals H100. And this is running on a remote H100. So now let's import. Let's just do a quick NVIDIA SMI, as people like to do. subprocess. run NVIDIA SMI. That should be all I need to do. Let's run it. I don't need to make sure that the code is right, because I can just run it and see whether it works. which is my preferred way to do debugging. Alright, now we've got about ten H100 containers, each printing something, squaring on the CPU and returning it. Stop it. No, don't stop it. I can't stop it. no. You can stop it if you [01:29:00] want. and it'll spin right down. And you won't need to keep paying for it. you don't have to worry about, oh, did I spin up an instance? 
hugo: Totally. And of course this was a toy ex... sorry, just recap what you've done there. This is incredible. Give us the TL;DR on what just happened.

charles: The TL;DR is that I took a Python function, I ran it 4,200 times, and that was autoscaled onto however many H100s it took to consume those inputs, all without having to write a single line of YAML, and with only a few extra lines of Python on top of the original code.

hugo: Amazing. Sorry, Michael Liu, you're so funny: [01:30:00] Michael has just commented, "bro is trying to take AWS's lunch money," which I really love.

charles: Snowflake is in a similar situation: they make analytic SQL really nice on top of the clouds, and they pay AWS a bunch of money to use their cloud infrastructure. We have a similar situation; we're using cloud infrastructure and making it useful for people. If I were Jeff, I wouldn't be that mad that somebody found a new way to get people to give AWS money through us.

hugo: Not at all; I'd consider acquiring at some point. I joked about developer ergonomics before, but the API and the use of decorators: chef's kiss, my man. I know we've got to jump off in a minute, but that was a toy example; could you run us through some other examples? I love the examples on your website. Before you do, while you share your screen, I do want to say: Charles and Modal are very generously giving a thousand dollars in credits to every student in a course I'm [01:31:00] teaching in January, which I'll link to in the chat and the show notes, on the iterative software development life cycle for building software powered by LLMs. It's an $800 course, and Modal is providing a thousand dollars in credit for everyone, so you're making money. All promotion aside, and I feel slightly odd about this, Modal hasn't sponsored this podcast or me in any other way; I reached out to Charles because I saw his GPU glossary and I wanted to chat with him about it. That's how this evolved. So, you do have a free tier, but you're also very generous with credits for open science and those types of things. If you could show us some of those examples, and show us what you do for science as well, and how people can get free credits, that'd be awesome.

charles: Yeah, so we sponsored a team that just recently got second place on ARC-AGI-Pub, the MIT/Cornell team. We do a [01:32:00] bunch of those kinds of one-off academic projects. So if you've got something, it's important to us that you open-source it, because we're basically a platform company, and our stuff gets more valuable the more people can do with it. The more people build software, the better for us. So there's a nice alignment of incentives between us and academics and open-source people. That's where I see this coming from.
And that's what I look for in the people we give these grants to. We also have a startup program that's a little more directly transactional: we'll give you some compute credits, because we think you'll like the platform so much that when your startup blows up and you get to the scale of Suno, the music generation company, you'll want to run all your inference on Modal.

hugo: Very cool. So Suno uses Modal. Super cool. Okay, let's have a look at some of your other featured examples.

charles: Yeah. So, the best way to get started with Modal: it's a pretty [01:33:00] complex product, because it's a novel computing platform, so you can dive into the reference docs and grok it from there, but I think the faster path is usually to find something relatively similar in our examples and jump off from that. I maintain the examples; I try to keep them as state-of-the-art as possible, and they're monitored for correctness, continuously integrated, and tested. So they should work, and they should teach you the right way to use Modal and the right way to use the tools. To give some examples: OpenAI-compatible LLM inference with Llama 3.1 8B and vLLM, fairly recently updated; it's got the Machete-kernel 4-bit quant of Llama 3.1 from Neural Magic, just because that's a great small model to get running with. Then fine-tuning: we have a Flux.1 fine-tuning example. This is my former roommate's [01:34:00] dog, Qwerty; this is her in Elder Scrolls Morrowind, as imagined by a fine-tune that's been taught what Qwerty looks like, unlike other golden retrievers. And we've got a ton of other stuff on there. We've got protein folding. We've got sandboxing; this one's fun: an LLM coding agent that writes Hugging Face Transformers code, so it runs its own language models, and the code that it writes is run inside a sandbox on Modal, so you know it's not going to steal your secrets, or sudo rm -rf and nuke your service, or steal your lunch money. We've got a RAG example, you've got to have a RAG example. And if you're still doing big data, which is still important, maybe even more important now that we've got language models and other things that can crunch it, we've also got things like this [01:35:00] DuckDB-on-Parquet-files-from-S3 scale-out example. Not just GPUs, not just neural networks, but any data-intensive workload you want to run in the cloud: I think we can make them better, faster, stronger.

hugo: This is so cool. I love seeing DuckDB there, and I love seeing Gradio flash up. Gradio is one of my favorite tools in the ecosystem for quickly shipping products to share with friends and colleagues. I sent a Gradio app to a bunch of non-technical friends recently, just to play around with language models and multimodal models.
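As a flavor of that last, non-GPU example, here is a minimal local DuckDB query over Parquet on S3. The bucket path and column name are placeholders, and this is plain DuckDB rather than Modal's scale-out example code; it just shows the kind of query that example fans out.

```python
# Minimal sketch: DuckDB reading Parquet straight from S3 (placeholder bucket).
# Plain local DuckDB, not Modal's scale-out example code.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # extension that enables s3:// URLs
con.execute("LOAD httpfs")

rows = con.execute(
    """
    SELECT some_column, count(*) AS n
    FROM read_parquet('s3://your-bucket/path/*.parquet')  -- placeholder path
    GROUP BY some_column
    ORDER BY n DESC
    LIMIT 10
    """
).fetchall()

for value, n in rows:
    print(value, n)
```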
charles: I have to say, when I made this example originally, before I joined Modal, when they released A100 GPUs, I immediately went and made this fine-tuning example. It was around Christmas, two years ago, and I gave these as Christmas gifts: I would make one for people's pets and hand them the URL and say, go and make pictures of your dog, thanks for letting me crash on your couch. And [01:36:00] as somebody who was more of a researcher and an educator before, that was my first taste of the joy of making something and being able to deploy it onto the internet for anyone to use. That's what inspired me to eventually become a Modal developer advocate, because I want more people to experience that same joy.

hugo: Beautiful. What a great story, and I think that's a wonderful note to end on. Charles, as always, I appreciate your time, and I appreciate your vibe as well; it's always fun to chat. We've talked for nearly two hours and we could probably go for another two. I also appreciate everything you do for the community, man: putting this glossary out there, all the work you do at Modal, all of those examples. I think they're fundamental in so many ways. So I appreciate you and Erik and the Modal team for everything you do, and thank you for your time and wisdom, man.

charles: Thanks for reaching out.