The following is a rough transcript which has not been revised by Vanishing Gradients or the guest. Please check with us before using any quotations from this transcript. Thank you.
===
ravin: [00:00:00] So we're going 270M. And so where it fits into the family is it's the smallest size in the Gemma 3 family. Now, it's about a quarter of the size of the 1B, and we're really trying to establish this low-resource, super fast, highly fine-tunable side of the Gemma 3 text family.
hugo: That was Ravin Kumar, a researcher at Google DeepMind, whose work spans projects like NotebookLM, Project Mariner, and the Gemma 3 family of open-weight AI models. These models bring multimodal and agentic capabilities to developers and researchers, with sizes until now ranging from 1 billion to 27 billion parameters. This week, Ravin and his team are releasing the smallest Gemma yet: just 270 million parameters, built for speed, efficiency, and fine-tuning. You might be thinking, why would I ever use a model this small in production? As Ravin explains, the use cases are broader than you might expect: from on-device AI, where latency and privacy are critical, to [00:01:00] quick fine-tuning for highly targeted applications, to running multiple models in parallel on modest hardware. We get into how this model fits into the Gemma 3 lineup, what it unlocks for developers, and why small might just be the next big thing in AI. I'm Hugo Bowne-Anderson, and welcome to Vanishing Gradients. Hey there, Ravin, and welcome to the show.
ravin: Thank you, Hugo. I'm always excited to be here.
hugo: Such a pleasure. And you are in California currently?
ravin: I'm always in California, but this time I am in Los Angeles. I came back to visit the family.
hugo: Awesome. I'm on the other side. I'm in Brooklyn, New York City. Well, they paint murals of Biggie here. Um, but this isn't East Coast, West Coast rap battles today. You are launching something this week at Google DeepMind. What's happening, man?
ravin: I'm quite excited. We're gonna be launching our Gemma 270M model today. That sounds big, but it's smaller on the scale of things, and I'm quite excited to get this model out there.
hugo: As we'll get to, this is a model release I'm very excited about, and we'll get into all the reasons why us and the [00:02:00] community are excited, but it does start to build out even more of a fascinating family of models in the Gemma family, and I'm excited to get into that as well. But I'd just love to start big picture. This is an open-weight model once again. And I'm just wondering, with all the powerful proprietary APIs that we're seeing these days, why do open-weight models matter, and why do you spend so much of your time working on them?
ravin: So there's two questions there: open-weight models, and why do I spend my time. Open-weight models clearly still matter, for lots of reasons. Within Google, we've noticed that people use Gemini for very complex things that require frontier-scale models, and in that sense it just makes sense to host them on Google Cloud-level infrastructure, and they're just huge. They're models that require, I have to say, a lot of technical work just to get going and running efficiently and serving at scale. But on the open side of things, we've also noticed that people really enjoy having models they can run on their laptop-sized devices, their single-node devices.
[00:03:00] Like, there's a huge demand for models that people can put where they need them to be: in their cloud, on their laptops. And importantly, models that are at a size that they can fine-tune or modify and get working for their own purposes. There's been a number of cases, uh, where we've seen people take their own, it could be languages, it could be their own data sets, it could be the very specific use case that they need for their very, very specific application, which we'll talk about later for these sizes of models, and really craft that model to be what they need for that one specific purpose.
hugo: What types of use cases do you think are best served by open models today?
ravin: So, going back to answering your question a little bit: what we're seeing in the world right now is we have these large models that are good at many, many, many, many things. Like Gemini, for instance, is great at understanding UIs, it's great at creative writing, it's great at long reasoning, it's great at these much more complex agentic tasks that take multiple trajectories. And so you have these frontier-scale models, and as the name [00:04:00] implies, or even specifies, they're at the frontier of what models can do, and that is absolutely fantastic. We want to push the frontier, for all the reasons we want our computers to be better at all the tasks that I just mentioned. But we also have all these tasks that don't require, like, PhD-level, uh, understanding. Gemini, for instance, just won a gold-medal standard at the international, I think it was the International Math Olympiad.
hugo: The IMO, right?
ravin: Yeah, the IMO, yes. It just won gold-level standard. And just like people, sometimes you need that PhD-level person solving complex math equations, but oftentimes in the world you don't always need that. So for you and I, for instance, we do simple things on our own computers. We don't need a UI actions model, we just need a model that can take, in your case, maybe the transcripts of your course and just summarize them into a quick set of notes that you can distribute out to your students for your Maven course, for instance. Gemini will excel at that, but you don't always need a Gemini-scale model to do that. I think another thing: for you, you have a particular voice and style and a way [00:05:00] that you love saying things, but you could fine-tune a model to be like the Hugo model that really has your style and your tone and things like that. And that's another use case where an open model excels, because it's not gonna be the model that everybody in the world needs. No offense, you just need your Hugo model, and maybe your editors, and you're good to go.
hugo: Fantastic. And also, other things that I think about are latency, costs, these types of things, of course privacy. And for people who want to play around with using a Gemma model and a Gemini model in parallel, check out AI Studio from Google, if you haven't already, 'cause something really cool about AI Studio, there are lots of really cool things, but you can put in one prompt and look at the response from your Gemma model of choice and your Gemini model of choice, look at the reasoning, and also you can see the latency differences and estimate the cost differences, that type of stuff as well. And the point is that you can put in a prompt and see that Gemma 3 27B, or whatever it is, actually will be far quicker than Gemini as well.
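If you want to run a rough version of that comparison from code rather than in the AI Studio UI, here is a minimal sketch. It assumes the instruction-tuned small Gemma checkpoint is published on Hugging Face as google/gemma-3-270m-it, that the transformers and google-generativeai packages are installed, and that a GOOGLE_API_KEY environment variable is set; treat the model identifiers as assumptions to verify against the official model cards.

```python
import os
import time

from transformers import pipeline
import google.generativeai as genai

prompt = "Summarize in one sentence why small on-device language models are useful."

# Local open-weight Gemma (assumed Hugging Face id; check the official model card).
local_gemma = pipeline("text-generation", model="google/gemma-3-270m-it")

start = time.perf_counter()
local_out = local_gemma([{"role": "user", "content": prompt}], max_new_tokens=64)
local_seconds = time.perf_counter() - start
local_reply = local_out[0]["generated_text"][-1]["content"]

# Hosted Gemini via the google-generativeai client (needs an API key).
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini = genai.GenerativeModel("gemini-1.5-flash")

start = time.perf_counter()
remote_reply = gemini.generate_content(prompt).text
remote_seconds = time.perf_counter() - start

print(f"local Gemma   ({local_seconds:.2f}s): {local_reply.strip()}")
print(f"hosted Gemini ({remote_seconds:.2f}s): {remote_reply.strip()}")
```

The absolute numbers will depend entirely on your hardware and network, which is the point Ravin makes next: latency is mostly a model-sizing question, not an open-versus-closed one.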
ravin: And so I wanna point out, on the latency [00:06:00] side, that's less of an open-versus-closed thing and more of just a model-sizing thing, which we'll get to, I think, a little bit later in the podcast. But at least to give folks a preview and to talk about it a bit: this is something we think really hard about on the Gemma project. We think really hard about what it is that would be useful to open developers. Would a 2-trillion-parameter model be useful for open developers? And oftentimes we come to, no, it'd be challenging to use. Very few people have data-center-sized clusters. So as we're designing an open model for folks, what we really settled on, which we'll talk about more later, is an A100. That's about the right size for a lot of high-end developers or small businesses, right? They can reasonably get a single A100. And then for folks like you and I, when we work in the open model ecosystem, a MacBook, or a small laptop like that, seems to be about the right size. So how do we size models that fit really well at that size and excel at their capability at that size?
hugo: Totally. So I want to dive into what you're launching this week, but first: you've built out a wonderful [00:07:00] suite of models as part of the Gemma 3 family so far, so I'm wondering if you'll give us a quick tour. You've already got 1B, 4B, 12B, 27B, so perhaps you could give us a tour of the family and what the rationale is behind having all these different sizes.
ravin: Yeah, so what we're striving for, particularly on the Gemma project, like I mentioned before, is just: what would be the most useful for open developers? What sizes, what would fit? We're now on the third major iteration of Gemma models. There was a Gemma 1, where we established the project at, I think, 2B and 9B, I'm already starting to lose track, then Gemma 2, which pushed capabilities further, and with Gemma 3 it was like, you know what? Multimodal would be one thing. If we could really nail it on device for users, that would be great. And so 4B, 12B, and 27B are all multimodal. We found that we get really great multimodal performance at those sizes, along with great text generation performance and longer context as well. We also then put out a 1B model, and I'll point out the 1B is not multimodal. It also has a smaller context window, at 32K versus [00:08:00] 128K like the others. But we had this hypothesis: we have these great models that run on, let's say, high-end laptops and higher-level sizes, and we still think that users would want a text-only model that runs quickly on lower-end devices, like a lower-end laptop or a higher-end Pixel phone, let's say a Pixel 9 type of phone. And interestingly enough, I'll say, we don't know how the community's gonna react to things when we put them out there. We're building the best model we can at Google, but once we put it out, you really see how users are gonna use it. And we saw, and you could verify this on Hugging Face, that the 27B has a couple hundred thousand downloads, still a popular model, but the 1B model has 2 million downloads. Like, it's an order of magnitude more than the 27B. And it's fascinating, right? We also would think maybe people would just want the most capable model all the time, like, why wouldn't you download the 27B? It still costs $0, same as the 1B, but the 1B just turns out to be immensely popular.
So it leads to this question: if the 1B's popular, like, [00:09:00] why don't we make another really great model at a smaller size? Because there clearly is demand at the smaller end of the scale as well.
hugo: I love it. And something I'll link to in the show notes: I've been very fortunate to be friends with you and work with you on a series of online workshops showing people how to leverage all the cool things in the Gemma family, from multimodality to building agents using open-weight models. So I'll link to all of those notebooks and videos in the show notes. So today, this week though, we're going smaller than 1B. What's the size we're doing today, and where does it fit into the family, Ravin?
ravin: So we're going 270M. And where it fits into the family is it's the smallest size in the Gemma 3 family. Now, it's about a quarter of the size of the 1B, and we're really trying to, uh, again, establish this low-resource, super fast, highly fine-tunable side of the Gemma text family, the Gemma 3 text family.
hugo: Awesome. And why did you decide to release such a small model now, and what problems is it designed to solve?
ravin: So there's a couple of reasons, I would say. We always want [00:10:00] to produce the best models that we possibly can for folks out there. And so some of it is, the Gemma 3 family, for instance, was quite ambitious. When it launched, there were four model sizes. That's the most we've launched in one shot since we started the Gemma program. And one thing to think about is, when you're launching an open model, the more architectures you release, the more the community needs to do to uptake it, right? Like the Hugging Faces, the Ollamas, and all the partners. Uh, you're giving them a lot of stuff at once. So some of it is, we don't wanna release 20 sizes, because it'd just be too much and not useful for users. But we also wanna make sure we're covering the useful range. So with the Gemma 3 launch, as I mentioned before, we saw really great multimodal performance at the 4B and bigger sizes, and that's why we released those. And they really targeted consumer-sized laptops, or high-end laptop, high-end desktop, and then A100 size, like a cloud-resource level. And then we also released this Gemma 1B, and we noticed that a lot of people really enjoyed using the Gemma 1B, like, more than we had anticipated. [00:11:00] Once we launched those models, both within Google and externally, we noticed that a lot of people were asking, hey, the Gemma 1B is great, but it's just slightly too big for my use case, and some people were saying it's way too big. Or even, we've noticed in some of our courses, a Gemma 1B won't fit on everybody's laptops. It potentially doesn't fit on everyone's phones, for sure; it still requires a larger-sized phone. So a lot of internal teams at Google, and even a lot of folks, you know, on the courses that you and I are teaching, were asking: where can we get a smaller, more efficient model? Where can we get one that's more performant on the hardware sizes that I have? I'm looking to fine-tune a model, how can I do it with less resources? And so for all those questions, smaller models just tend to make sense. Another thing which we learned internally is we didn't know the capabilities at the smaller sizes as well, like, it was an open question even to us.
Could we release a model that was good at the 270M size? March is when we released the Gemma 3 series, and since then we've done additional research internally to see, if we really push the capabilities of small models, can we produce a good small model? Which is [00:12:00] why you're seeing this launch now. I'll be upfront and say, within Google, we have a lot of versions of Gemma less than 1B that we tried, and we left a lot of them within the Google ecosystem because they just weren't something that we think would've been the best performance for users. But after a couple of iterations in the last couple of months, architecture ablations, and other research, we found the recipe that produces a great 270M, and that's why we have a 270M, versus, let's say, a 500M or another size, that we're releasing publicly.
hugo: Super cool. So I think we've talked around this a bit, but I'm just wondering, when is a small, or dare I say tiny, language model, like 270 million parameters, actually better, more performant, whatever that may mean, than a 1B or 2B model?
ravin: Yeah, I think there's a couple of cases, and I wanna say, I'm always curious to see what the community does, because we really learn where it's better when we see lots of different folks take on the models. But the ones that come up consistently are when you need super fast [00:13:00] speed. Just because these models are smaller, they computationally run incredibly fast, and this is incredibly fast on a single node, incredibly fast on a laptop, incredibly fast on a single device. Like, they're just much, much quicker for folks, or use cases, that need extremely quick responses, sub-millisecond type of things. I think that's definitely one. Another use case, we talked briefly about Gemini and Gemma: when you don't need your model to know every fact of history, when you don't need your model to do everything that, you know, every human could possibly do. You need your model to do one or two things extremely well, or just be really reliable in that one use case. That's when a model like a 1B or 270M Gemma model makes sense. To be frank, I would not expect this model to know the president of the United States forever. I would not expect it to know the latest facts that come up in the news today. It's not a type of model where you're going to want to lean on it to be up to date, [00:14:00] but for Gemini-sized models, we do need those models to stay up to date, we do have those models continue to know the latest facts in history and things like that. When you don't need that sort of use case, a Gemma model makes a ton of sense. And the next one, which works for even large-scale researchers: if you need to iterate quickly on a new idea, it makes sense to try it on small models first before you scale up. Even within the Gemma family, you don't wanna burn all your TPU credits or your GPU credits on a Gemma 27B just iterating. It's probably better to just make sure you get all the bugs out on a 270M or 1B model as a researcher. And then lastly, for you and I, Hugo, it's education. We have a lot of students that come from all walks of life, and we want to teach them how to use AI models. Not necessarily Gemma models, it could be any AI model, but again, it's gonna be hard for them to download even a Gemma 27B and run it on their laptop. And you don't need a Gemma 27B to teach a lot of the things that you're teaching in your courses.
For instance, you could start with a 270M and say, this is how a transformer works, these are what tokens are, [00:15:00] this is autoregressive decoding. And all those fundamentals are the same from a 270M up to humongous-scale models.
hugo: Totally. I was at SciPy recently, teaching with Eric Ma, a mutual friend, among other people. But when we were trying to get people to download models, including Gemma models: conference wifi, right? So being able to have small models that you can pass around on a USB stick as well. I was actually also recently half joking with someone, and I don't wanna go down this rabbit hole too much, but there's gonna be a big difference between people who learn to code before AI and people who learn to code after AI, 'cause of course AI-assisted programming is huge for all of us. But we half joked that there may be, like, hermetically sealed bootcamps soon, in caves, where you actually learn to code without the internet around you, so you can actually learn to code before doing AI-assisted coding. And the ability to have small local models in that regime would be super powerful. I'm wondering if you can give a few concrete examples of tasks where this size of model, so under 1B, 270 [00:16:00] million or thereabouts, would really shine. And I know that you're excited about the community as well showing us new cases, but where do you think it will really shine?
ravin: Even I get to learn this from the community within Google. So I have a great colleague, his name's Duke Young, and he sits in Japan, and he's really big in the game development ecosystem in Japan, as part of Google but also individually. Something he's been doing with the models, even in this past week, is building models that are really good at non-playable-character dialogue on device. So you can imagine you have a video game on a low-powered device, maybe, let's say, a Pixel device or a Chromebook or a small laptop, right? And you're gonna deploy this model, you're gonna ship this game to millions of people. So you want this game to be responsive, you want your characters to have personality, and he's been finding that he can get this 270M model to be certain characters and act like certain characters quite readily, and within days of work. And so let's say he was gonna design an Android video game and he needed colorful characters for you to interact with, so you have a fun time playing this game. The 270M [00:17:00] model runs really quickly on device and produces these dialogues for you. And so you can create this rich game experience on a Pixel phone.
hugo: Incredible. What other types of use cases are you interested in?
ravin: So, the ones that I've been using, this goes back to a blog post that I did last year. You know that I used to work in the restaurant business as a data scientist. And something that would happen at this restaurant is customers would come in, or customers actually would text us, and tell us whether they liked certain things or not about this particular restaurant. And so something I would do is sentiment analysis, but at the time I would have to do it on cloud-level resources, because that's what it took at the time; I'd have to fire up, like, Google Cloud-level ML models. But now I'm finding that even these 270M models, in real time, are getting quite good at understanding customer sentiment. Because what you and I would have to do back in the day is we would train these sentiment classifiers from scratch. Like, we would start literally with random initialization, right? And we would have to train a specific sentiment model. But now this 270M model has already been trained on an incredible [00:18:00] corpus of resources, right? Like, in this case, folks like me at Google have done the work for you to already teach this model so many things about human language, what's good and what's bad, just by having the corpus that we have available internally and building this model on that. And so now, as external folks, if I was to redo this, I wouldn't have to train a sentiment model from scratch. I would just take a 270M model, I would just fine-tune it to be a really good classifier of sentiment, and I'd already be eons ahead of where I was when I was starting from a random model back in 2018, 2019.
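To make the sentiment example concrete, here is a minimal sketch of using the instruction-tuned 270M as an off-the-shelf sentiment classifier via prompting, before any fine-tuning at all. The Hugging Face model id google/gemma-3-270m-it and the customer messages are assumptions for illustration.

```python
from transformers import pipeline

# Assumed Hugging Face id for the instruction-tuned 270M checkpoint.
classifier = pipeline("text-generation", model="google/gemma-3-270m-it")

customer_messages = [
    "The ramen was incredible and the staff were so friendly!",
    "Waited 45 minutes and my order arrived cold.",
]

for message in customer_messages:
    chat = [{
        "role": "user",
        "content": (
            "Classify the sentiment of this customer message. "
            "Answer with exactly one word, positive or negative.\n\n" + message
        ),
    }]
    out = classifier(chat, max_new_tokens=4, do_sample=False)
    label = out[0]["generated_text"][-1]["content"].strip().lower()
    print(f"{label:10s} <- {message}")
```

If the off-the-shelf labels aren't reliable enough for your own data, the fine-tuning route Ravin describes later in the episode is the natural next step.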
hugo: Love it. And I actually think the whole space has been fascinated by the generative capabilities of large language models, and language models more generally, but a huge amount of the power we've seen is the ability to do classic ML and in-context learning as well. Um, I also do think something you've spoken to in there is using different models for different things. And I think there has been a worldview which is, like, one model to rule them all. Now, that's clearly not the case anymore, [00:19:00] but another worldview has been, you'll mostly use one big model and maybe not a variety of different models yourself. And that isn't the case either. As we've actually seen, a very interesting aspect of the GPT-5 launch has been that it has a router which will route to a variety of different models, right? Even, um, OpenAI is aware that using a variety of different models is useful. I don't necessarily think their model of having one proprietary router is where we'll all end up, but for a lot of people, I think using different models for different tasks makes sense. Like, even, let's say I were to weekly cluster my emails in terms of importance and all of these types of things, I'm not gonna use a super big model to do that, because I don't need to. So I do think the Unix philosophy of different computational units piped together, this type of stuff, as opposed to the Apple philosophy of one thing to rule them all, is something we're seeing play out really wonderfully here.
ravin: I would add to your point on that, like, even for me personally, with the Gemini series, you go to AI Studio, like you said before, and you'll see Gemini Pro and then you'll see Flash, and you'll see [00:20:00] different sizes of models. But on my side too, I try the 270M and sometimes I notice it just doesn't do what I need it to do. Maybe the task was too complex, or there's more than 32K input tokens. I'll flip up to a Gemini Flash at that point. It's not that the 270M is going to do everything that you could possibly need it to do, but it can do a lot. And the other point is you don't always have to reach for the biggest model. Not everything is a Gemini Flash or Pro-sized type of thing. If you just need some quick texting advice, then now you have a Gemma model for you as well. So I think it's really just about, the way I see it, and particularly one reason I enjoy working at Google, is that users have their choice.
Pick the model that works best for you. If it's a Gemini Pro and that works for you, just go for it, use it. But now you have a 270M model, you can also try it as well, and really make the selection so that you fit the model to your task, not your task to the model.
hugo: Totally. So in that case, if people have a task and they wanna try out a particular Gemma [00:21:00] model, do you have any heuristics or rules of thumb for how they should go about choosing which Gemma model size to use?
ravin: So the first thing, and I'll be upfront about this, is if your task is multimodal, then the 270M and 1B are not the model sizes for you. You're gonna want to reach for models that have multimodal understanding. That's one thing. I would also say, then it comes down to context length and things like that as well. For a lot of reasons, it would be challenging for folks to run models with humongous context on device. So if you have tons and tons of documents, like thousands of them, and you need them all in context at once, then you really, again, do want a Gemini-sized model that has the million-plus context window. But if you need a model that takes, like, for you and I, for instance, a bunch of your podcast episodes, 'cause you're quite a prolific podcaster, you have lots of content, and you need to just get JSON out, just the title, the duration, and a couple of other attributes and things like that, you could just throw that at the 270M model. Talking a bit about how we trained it, we really focused on instruction following and, like, JSON [00:22:00] understanding and things like that. And so you could, as a developer, take your corpus of knowledge and ask it to get the JSON out of it, and you'll have an on-device JSON parsing model that takes your unstructured data and structures it in a way that you can then use in your programmatic workflow. So that's something that I've found to be quite useful even with the 270M model right now.
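Here is a hedged sketch of that podcast-episode-to-JSON idea, using the instruction-tuned 270M through the Transformers text-generation pipeline. The model id and the episode blurb are illustrative assumptions, and a prompted small model can still emit malformed JSON, so real code should validate and retry.

```python
import json

from transformers import pipeline

# Assumed Hugging Face id for the instruction-tuned 270M checkpoint.
extractor = pipeline("text-generation", model="google/gemma-3-270m-it")

# Hypothetical episode description standing in for your unstructured content.
episode_blurb = (
    "Episode title: 'Tiny Models, Big Wins'. In this 62-minute episode, "
    "Hugo talks with Ravin Kumar about the Gemma 3 270M release."
)

chat = [{
    "role": "user",
    "content": (
        "Extract a JSON object with keys 'title', 'guest', and 'duration_minutes' "
        "from the episode description below. Return only the JSON.\n\n" + episode_blurb
    ),
}]

reply = extractor(chat, max_new_tokens=96, do_sample=False)[0]["generated_text"][-1]["content"]

# Strip optional Markdown code fences before parsing; json.loads raises if the output is invalid.
cleaned = reply.strip().removeprefix("```json").removeprefix("```").removesuffix("```")
record = json.loads(cleaned)
print(record["title"], record["duration_minutes"])
```

As Ravin notes later, a version fine-tuned on unstructured-to-JSON pairs would let you drop the instructions entirely and just feed it the blob.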
hugo: Super cool. And I would even caution people, when thinking about these models with, you know, for all intents and purposes infinite context, so let's say very large context, that putting everything in there can work, and it does a lot of the time, but it doesn't necessarily help you iterate or address failure modes, because if it's not working, you have no idea why. So clever chunking, clever types of retrieval, clever summarization, this type of stuff, and breaking things down into pipelines, can really help you improve any AI systems you're building, uh, as well, particularly with context rot and all of these types of things. Now, for a word from our sponsor, which is, well, me. I teach a course called Building LLM-Powered [00:23:00] Software for data scientists and software engineers with my friend and colleague Stefan Krawczyk, who works on Agentforce and AI agent infrastructure at Salesforce. It's cohort based, we run it four times a year, and it's designed for people who want to go beyond prototypes and actually ship AI-powered systems. The link's in the show notes. So, I love that I can run models on my cell phone, and I'm super excited to play around with Gemma 270M on my cell phone. I'm wondering, what kind of hardware is this model designed for? Phones, laptops, Raspberry Pis?
ravin: So one thing is, you're not totally constrained. I'd say this model can run on more things than a lot of the other models, so I think you could definitely deploy it on server-grade hardware and things like that. So you could have a lot of different model instances running at the same time, like, you could have one server that's running multiple instances of Gemma. But then, on the lower side, uh, we're releasing, four models actually, to be precise. So there's the 270M pre-trained at full precision, but then we're also releasing [00:24:00] a quantization-aware trained version, which cuts it down to four bit. So that's a 4x reduction in memory while maintaining the performance, and it's the same on the IT side as well. And so on lower and smaller devices that don't have gigs and gigs of memory, maybe one or two gigs or so, you could run the 270M quantization-aware trained version at four bit, uh, and it would run just as performantly, or nearly as performantly, as if you ran it on, like, a MacBook-sized or larger device. We've also done explorations where it runs pretty okay without an accelerator. Something that we take for granted now is that if you're gonna be running models locally, or even in the cloud, they're running on accelerators. These are, let's say, TPUs or GPUs, or Apple, I think it's called Apple Metal, the type of chip that is really good at matrix multiplication. Whereas, like, historically and even still today, right, our computers also have CPUs to do a different set of tasks. Not every laptop has a GPU or accelerator in it, and [00:25:00] so this model still runs reasonably well on devices that don't have matrix-multiplication accelerators and math accelerators and things like that.
hugo: Super cool. I'm excited to hear from the community and all our listeners and viewers what they play around with these models on, what they run them on, and what they build. Something you mentioned briefly that I kind of wanna dive into a bit more, 'cause I think this is one place where these models shine a lot, is fine-tuning. So I'm first wondering, and I definitely see it being, but is fine-tuning still relevant?
ravin: I think, yeah, 100%, fine-tuning is still relevant. In numerous cases, like you and I have seen, uh, there are use cases where people need models that do very specific things. With the Gemma models, we've seen particular teams and folks take the Gemma model and fine-tune it for their specific corpus of knowledge, their specific language. I'm failing to recall the specific use case off the top of my head, we could put it in the show notes, but I believe there were a couple of Indic-language fine-tunes, and I believe Korean fine-tunes and [00:26:00] Japanese fine-tunes of the Gemma models as well, where the Gemma model was taken and it just excels in these particular languages, because that's a need of that particular community that is unique to them, and they can take these models and make them their own. Do you wanna learn a bit about the architectural details as well, Hugo?
hugo: Love to.
ravin: So these models are using the same tokenizer as the larger-size Gemma models. It's a 256K-token vocabulary. What this means practically is, tokenizers essentially compress language into tokens, as the name implies, and you can have different tokenizer sizes; there's no actual rule that it has to be any particular size. Like, for instance, I'm building a model at home, outside of Google, that has a vocabulary size of around 30. It literally is just the ASCII characters, A through Z, lowercase, plus a start-of-sequence token, an end-of-sequence token, and a padding token. That's a super toy model that I use for my own learning at home. But when you go up [00:27:00] to more realistic models, you could pick a 32K tokenizer, you could pick 64K. We stuck with 256K because we've just seen that a lot of people, like I mentioned before, wanna use Gemma with their own languages, with their own character sets. And by having a larger tokenizer, the model more easily adapts to those different languages; it has better awareness for multilingual and other use cases than if we had a smaller tokenizer. So for that reason, for instance, we stuck with a larger tokenizer. If you look at the 270M in particular, it actually has a bit of an interesting architecture, in that the actual parameters that, like, get multiplied and added and things like that, that's about 100M, and the embedding table is 168M. So it's a weird model in the sense that the embedding table is larger than what we'll call the activated parameters in this case. But we still stuck with that decision because we wanted to give people the flexibility to fine-tune this model to where they needed to go, and a larger embedding size [00:28:00] and embedding table allowed for that.
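As a quick way to poke at the two things Ravin just described, the shared 256K-entry vocabulary and the embedding-heavy parameter split, here is a small sketch. The model id is an assumption, and the exact counts you see will depend on the released checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m"  # assumed id for the pre-trained checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
print("vocabulary size:", len(tokenizer))  # shared with the larger Gemma 3 models

# A large vocabulary tokenizes non-English text without exploding into many tiny pieces.
for text in ["How are you today?", "¿Cómo estás hoy?", "お元気ですか"]:
    ids = tokenizer(text)["input_ids"]
    print(f"{len(ids):3d} tokens <- {text}")

model = AutoModelForCausalLM.from_pretrained(model_id)
embedding_params = model.get_input_embeddings().weight.numel()
total_params = sum(p.numel() for p in model.parameters())
print(f"embedding table: {embedding_params / 1e6:.0f}M of {total_params / 1e6:.0f}M total parameters")
```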
hugo: Amazing. I'm also interested, for people who want to go out and play around with fine-tuning this model for a specific task, how would they do this?
ravin: So we're really lucky to have a bunch of really great ecosystem partners. One of them, for instance, is, uh, Hugging Face. There is a Hugging Face tutorial, which we'll have and I will put in the show notes, on how to fine-tune this model on a single Colab, and, an important point, on a free Colab. Part of the benefit of this, again, being small, is that you don't need to have crazy-level resources to fine-tune it. This is totally fine-tunable in a free-instance Colab, with your corpus, with the Hugging Face Transformers library, with free Google resources, and a small model. I think this enables an even greater variety of people to go try their fine-tuning experiments without having to spend any money at all.
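To give a flavor of what that free-Colab fine-tune can look like, here is a minimal sketch using the TRL library's SFTTrainer on a handful of sentiment examples, in the spirit of the restaurant use case from earlier. The model id is an assumption, the three training examples are made up, and for a real run you'd follow the official Hugging Face tutorial for the recommended hyperparameters, LoRA setup, and data formatting.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Tiny, made-up training set: prompt and target packed into a single text field.
examples = [
    {"text": "Review: The tacos were amazing and service was quick.\nSentiment: positive"},
    {"text": "Review: My order was wrong twice and nobody apologized.\nSentiment: negative"},
    {"text": "Review: Cozy spot, friendly staff, will come back.\nSentiment: positive"},
]
train_dataset = Dataset.from_list(examples)

config = SFTConfig(
    output_dir="gemma-270m-sentiment",  # where checkpoints land
    per_device_train_batch_size=4,
    num_train_epochs=3,
    logging_steps=1,
)

trainer = SFTTrainer(
    model="google/gemma-3-270m-it",     # assumed Hugging Face id
    train_dataset=train_dataset,
    args=config,
)
trainer.train()
trainer.save_model()                    # reload later with the Transformers pipeline
```

With a model this small, even a modest free GPU, or patience on a CPU, is enough to see the loss move on a toy dataset like this.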
hugo: And so we've talked about some use cases for fine-tuning. I'm wondering if there are any others that you see people fine-tuning small [00:29:00] models for.
ravin: So something I'll say, again going back to the embedding size, and this is more of a niche use case, but I think it'd be fun to talk on this podcast about some things that are more in depth: because this uses the same tokenizer as the larger models, you could do things where you cache a lot of data that's pre-tokenized and use a 270M to test out a bunch of ideas, and then you could flip up to the larger Gemma models and fine-tune those as well, because it's one-to-one with the tokens. This is an example of a particularly advanced use case, one I'll start with because I think it's mentioned less: you could use different-sized models in the Gemma series and flip between them for your different use cases, starting with a 270M to, again, just test out your ideas. I think other ideas that folks could use: we talked about the classifier use case, that's an easy fine-tune to start out with. People could fine-tune these models to be really good at JSON parsing, like we talked about before. We put a lot of effort in to make sure this model had a lot of, like, structured-output awareness, so folks could show the model a [00:30:00] lot of unstructured and structured output pairs, and then you don't even need to, say, prompt the model in the traditional sense anymore. You would just feed it, say, your podcast episodes and you would get JSON output at the end. This would be super fast, because you would save tokens on the context window, 'cause you wouldn't even need the instructions anymore. You would just give it a blob and the model would know what to do with it. That's another good example of fine-tuning. And then, like Duke Young is doing, making this model your personal chatbot model that speaks in your style or your tone, or the tone that you would like and really want. Then you have your own model on device that is just like your model, that's yours to chat with.
hugo: Incredibly cool. And of course we'll link to the tutorial on Hugging Face, which, as you say, you've got Google Colab, you have Transformers, and you have a small language model. I'm wondering if there are any other tools or workflows you'd recommend for people to use to get started with fine-tuning?
ravin: So we also have a Gemma library; it's on GitHub, and it has a lot of [00:31:00] integrations for things like LoRA fine-tuning and chat. It's built on the JAX stack as well, so it fits well into ecosystems where folks are using that technology in particular. And so that's another good one. I in particular like using that library myself, 'cause I think it's quite well contained and readable, and so if you're trying to understand how all these things work, it's a fun library to get a couple of layers deeper. And something that we've always been, again, super fortunate with the Gemma models, and really thankful to the community for as well, is we then start seeing the Gemma models get picked up in all sorts of other places too. You'll find them on Unsloth, you'll find them on Axolotl, you'll find them, of course, within the Google Vertex AI area. But it feels like anytime somebody comes up with a cool open-source framework, one that frankly might not even exist right now, they tend to include the Gemma models in it as well, without us even having to ask. Just organically, we see folks take up the Gemma models, and we're very thankful that the community [00:32:00] has such a positive reception for the models in particular.
hugo: Wonderful. And I'm wondering, have you seen, or would you, fine-tune this model size for production use cases, or are you thinking mostly prototyping scenarios?
ravin: I do think it's good for production use cases. Like, we talked about the restaurant stuff in particular before, and I could easily see myself using this at the previous organizations that I've worked at; there were specific tasks that I had. There's one organization in particular which had a lot of things that are not public, a lot of things that were only used within that organization. And so using a 270M model would've made a lot of sense,
even at that, what is now a multi-billion-dollar company, because we needed a super fast model that's really easy to change and doesn't take a lot of resources. So it definitely is a production-grade model. But I think, to the other question you're asking, it's not the model that you should use for everything, right? I will point out that we're not using the word frontier to describe the 270M. It's not going to have the same level of reasoning and multi-step trajectory performance that you're gonna [00:33:00] get out of a Gemini model. Here's a good example, Hugo: I would trust Gemini to refactor major code bases that I have, not just the model but also, like, Gemini CLI and the way these things work. I can point it at an old PyMC code base that I have and say, hey, build this complex Bayesian workflow for me. Absolutely, I would use Gemini for that, and I would trust Gemini with production-level results. I would not trust the 270M to code an entire PyMC production-grade model for me. It's not a model that's going to have that level of context, for instance, but also that level of reasoning and thinking that you're gonna get out of a Gemini Pro or Flash-sized model.
hugo: I love it. And I'm wondering, we've hinted at this in a variety of ways, but I'm wondering if you could speak explicitly to what trade-offs people should expect when fine-tuning a smaller model?
ravin: Yeah, so the hard trade-offs you're gonna get, which are just factually there: the 32K context window. You're not gonna be able to put in an entire novel, an entire, like, Shakespeare novel, and expect it to memorize all of Shakespeare and answer all questions about all of Shakespeare, if you're trying to one-shot write [00:34:00] an essay on Shakespeare, for instance. So it's not gonna be good at answering questions about all of Shakespeare. The other one, which we talked about a bunch, is multimodal: you can't pass it a picture of Romeo and Juliet and expect it to understand it; this model has zero multimodal capability, for instance. But you could train it to be a really good Shakespeare generator. If you really, for instance, enjoy the works of Shakespeare and you like the style and tone and you wanna get more of that, you could take Shakespeare, you could chunk it up in a fine-tuning workflow and batch it like any other fine-tuning job, and it could be a very performant Shakespeare generator. I have not tested this myself, so don't take this as fact, but I would wager it could reach frontier-level performance at Shakespeare generation if that's the only thing you had this model do, for instance. Now, of course, and actually let me dive into this point about fine-tuning as well: it's well known that when you fine-tune a model on one capability, you often lose capabilities in other areas. So, like, with this Shakespeare model, I would stop expecting it to be able [00:35:00] to do JSON generation, for instance; this fine-tune would lose its JSON ability. It probably would also lose its chat capability, to be quite frank. But it would be extremely good at generating novel poems for you each day. So if I needed a poem generator, and actually this is something I used to have on my Linux computer: I don't have it anymore, but there used to be this terminal thing I would have where, every time I started my terminal, it'd give me a quote of the day. And at the time, what it was doing was either referencing a bunch of, I think it was like 5,000 quotes that were in the code, or you could also have it make a call over the internet if you needed to, like it was going to a database and grabbing these things. I could now have a local 270M model that, every time I open my terminal, generates a positive phrase or rhyme or something that just cheers me up. And because it's a generative model, you'd have an infinite number of these things, right? You would never run out. And so that's a fun project that I could use a 270M for if I was to do it locally.
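A sketch of that terminal idea, assuming the same instruction-tuned checkpoint: save it as, say, quote_of_the_day.py and call it from your shell's startup file. Loading even a 270M model adds a moment to shell startup, so you may prefer to run it on demand.

```python
#!/usr/bin/env python3
"""Print one freshly generated upbeat line, e.g. from ~/.bashrc: python ~/quote_of_the_day.py"""
from transformers import pipeline

# Assumed Hugging Face id for the instruction-tuned 270M checkpoint.
generator = pipeline("text-generation", model="google/gemma-3-270m-it")

chat = [{
    "role": "user",
    "content": "Write one short, upbeat sentence to start my day. No preamble, no quotes.",
}]
out = generator(chat, max_new_tokens=40, do_sample=True, temperature=0.9)
print(out[0]["generated_text"][-1]["content"].strip())
```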
hugo: Fantastic. I love the idea of a Shakespeare generator, and you said you haven't explicitly done this yourself, but we've made [00:36:00] suggestions in workshops and podcasts before that listeners have gone out and implemented. So if you're listening and wanna build something like that, go and jump in and let us know how you go. And I'm also interested in how you think about these models from a product and performance perspective. Now, of course, I'm not asking for any secret sauce or trade secrets, but I'm just wondering, process-wise, how you think about, like, benchmarks when building, before release, and evals, and even vibe-checking how things are working.
ravin: Yeah, I'd say this is one reason I feel very fortunate I get to work with the Gemma team in particular. Every time we have a major Gemma cycle, we really sit down as a team and think about what we can possibly do to create the best open models with the technology that is available at this moment in time. And I don't think it has to be said on this podcast, but, like, Gemma 1 was released last year, I'm already losing track of time, but I think it was released last year or the year before, and AI technology as a whole has changed in that amount of time. There have been so many innovations and advancements from so [00:37:00] many different areas that, when we sat down for Gemma 2 and Gemma 3, we always sat down and said, with the technology that exists today, what's the best model we could release that would be useful for developers? And I'll point people to the Gemma 3 technical paper, but in Gemma 3 there were a number of changes: we added things like changes to sliding-window sizes, and we removed certain things within the architecture, like where certain normalization factors sit. There's all this upfront research that goes into building the best architecture. And then the same level of rigor goes into what the best data sources are that we could possibly create for these small model sizes. So, like, an incredible amount of work goes into creating really good curated data sets that pack in high-quality tokens at train time. And then the same level of rigor goes into, on the multimodal side, what's the best encoder, and then also what's the best post-training recipe and what we could possibly get out of post-training. And the whole team, I have to give credit to the incredible number of people that go into this, they think really hard about what it takes to make a super, super good [00:38:00] open model at that size.
So each one of these things I just mentioned, with the 270M we did all these same things. We had to decide on architecture, we had to decide on data, and let's just start with those before we even get to evals. Um, there's a number of things we do at that stage just to see: does this architecture look good? We look at loss curves. We look at how the model is picking up on data. Is this model actually learning the data that we're giving to it? Even down to the level of, should the embedding size be 256, should it be 300, should it be something else? Like, we are running a lot of these internal experiments to see what the right trade-off of width versus depth is at this particular size. And for this one, we chose less than 1B, and we ran a lot of experiments way at the beginning just to check the model loss curves, to see how well it was training. So we did a lot of work up front just there. Then we spent a lot of time thinking about what sort of data we should put in for a model at this size as well. We have to make trade-offs for a model of this size; we can't realistically teach it everything that exists in the entire universe, it wouldn't be able to learn all that. [00:39:00] One hypothesis we had, and the bet we took, is at this size, people are not going to be asking a lot of questions about, let's say, again, where is the 2028 Olympics gonna take place? I don't think this is the model that folks are gonna go to for a lot of, like, factuality-type questions. I'm not saying we don't care about factuality, but we don't think that's the primary use case. We do think instruction following, and, like we said, information structuring, is a key use case for small models, which is why we put a lot of tokens into those areas. And then from that, as you mentioned, that's where we pick the evals. This is the size and the capability we have. So we didn't pick, for instance, a SWE-bench eval. The team I used to work on at Google Labs, they have this product which goes out, you give it a GitHub PR, and it figures out what needs to be done; the eval for that is this eval called, uh, ATCH, I believe. The same thing with Mariner, another product I worked on, which is Google's web-browsing agent; there's an eval called WebVoyager, which is based around this web-task use case, right? We knew these were not gonna be evals that would be useful for a model this size, so we never [00:40:00] even ran a WebVoyager set on this. If someone external wants to, you totally can; I wouldn't expect great results, to be quite frank. But because it didn't make sense for this model, we didn't pick it. However, you know what's funny, actually, Hugo, is we did pick up evals that are, like, legacy evals now. There's evals that the frontier models don't really report on anymore because the models crush the benchmark, they've totally saturated them. Some evals have been crushed so hard by the large-scale models that they're just not reported, like, the models get a hundred percent every time, but a small model is not at that level yet. So we dusted off some of those quote-unquote legacy evals, um, and we started tracking them and assessed the performance on those things as well. So we picked evals that the frontier models crush and don't report on anymore, because they always score a hundred percent, but evals that fit a model this size, like emitting JSON, or IFEval.
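None of the official benchmarks are needed to vibe-check your own fine-tune in this spirit; here is a toy harness along the lines of the "does it emit valid JSON" evals just mentioned. The prompts are made up, and the model id is an assumption; swap in whatever checkpoint or local fine-tune you're testing.

```python
import json

from transformers import pipeline

# Assumed id; point this at your own fine-tuned checkpoint directory if you have one.
model_under_test = pipeline("text-generation", model="google/gemma-3-270m-it")

held_out_blurbs = [
    "A 48-minute chat about evals and vibe checks.",
    "Episode title: 'Context Rot'. Runs 75 minutes.",
]

valid = 0
for blurb in held_out_blurbs:
    chat = [{
        "role": "user",
        "content": "Return only a JSON object with keys 'title' and 'duration_minutes'.\n\n" + blurb,
    }]
    reply = model_under_test(chat, max_new_tokens=96, do_sample=False)[0]["generated_text"][-1]["content"]
    cleaned = reply.strip().removeprefix("```json").removeprefix("```").removesuffix("```")
    try:
        json.loads(cleaned)
        valid += 1
    except json.JSONDecodeError:
        pass  # count as a failure and move on

print(f"valid-JSON rate: {valid}/{len(held_out_blurbs)}")
```

A handful of held-out examples like this won't replace a real eval suite, but it is often enough to tell whether a small fine-tune is heading in the right direction.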
hugo: Awesome. So we're gonna have to wrap up in a minute, sadly. I'm wondering, though, are there any patterns you've seen in how [00:41:00] the community has adopted and is using Gemma models that surprised you?
ravin: Yeah, so it's the one that we started on a little bit at the beginning, which is, with the Gemma 3 family, we were like, people really want strong multimodal, like, top-tier performance, and that's the 4B, 12B, 27B. But something that surprised us was, we also put the 1B out there because we had a hypothesis that people would want a text-only, small, performant model, and it has been incredibly fascinating to us to see the adoption of the smaller model sizes. Even though they're not the best Gemma models on benchmarks, for instance, they still must be better for users, because you're seeing a couple million downloads for Gemma 1B. I'm looking at Hugging Face right now, and I'm seeing 1.3 million for Gemma 4B. So if you look at the benchmarks, you would say, why would I pick a Gemma 4B? It's not the top of this benchmark or that benchmark, it has less performance on a benchmark level than a 27B. But it must be doing a lot of great things for users if this many people are downloading it instead of a 27B, for instance, or in addition to a 27B, as often you've seen in our workshops as well, [00:42:00] that people download multiple models at once for their use cases. So, this is launch day, or launch week, as you and I are recording this, and it's gonna be super fascinating to me to see the download numbers from folks, and see how many folks download this even compared to other models within the Gemma family itself.
hugo: And I love that you really double-click on the fact that there are certain things it may not be the best at, but that it's incredibly useful, and used, due to its size as well. And I always think of embedding models: there are all types of hot, new, sexy, super performant embedding models, but if you look at, for example, and I'm looking at it now on Hugging Face, the Sentence Transformers embedding model all-MiniLM-L6-v2 last month had nearly 89 million downloads, which is incredibly telling, particularly when you have all these APIs that do embeddings automatically for you and that type of stuff. And it once again speaks to finding the best tool for the job and not using a jackhammer to make an [00:43:00] omelet, pardon the mixed metaphor there. As a final question, I'm wondering what you're excited to see developers do with this new Gemma 270M that they may not have done yet.
ravin: So the one thing I'm always excited to see is people publishing their fine-tunes. I have to say, I'm very happy that a lot of people get to use the models, but with the Gemma models, I'm very happy to see people use them, yet you don't always know what they're using them for. When people publish fine-tunes of models, they're really showing off what they did and they're really putting it out there into the world. One thing I hope to see more than ever, with the low cost of fine-tuning these models because of the smaller size, is that people fine-tune them and then put them out there for other people to use, and create this Gemmaverse of models that are not just
me putting one model out and a bunch of people downloading it, but somebody taking it, making it their own, putting it out there, another person fine-tuning that fine-tune and making it [00:44:00] their own, and seeing, like, chains of models, or multiple evolutions of these models. I would love to see that with the 270M, and I'm hoping, with the low cost of fine-tuning on Colabs or other lower-powered or smaller devices, that we'll see more Gemma variants out there than we've ever seen before. Another thing I'm always excited about is seeing how people use multiple models together. It's always fascinating to see how folks use a combination of small, large, multiple sets of models to do a particular task. So again, I'm hoping that with a smaller model that can run on device, we see more folks using different models, with different-sized tasks, in novel ways.
hugo: Amazing. Listeners, viewers, builders: please do let us know. Reach out on LinkedIn or Twitter or whatever it may be to let us know how you use these models. And actually, Gemma has a Discord as well, is that right, Ravin?
ravin: I think we have a Google Developers Discord, and that's a great one to join if you're looking to get [00:45:00] into the Google developers ecosystem. I think it's a good place to ask questions about Gemma as well.
hugo: Super cool. Thanks so much for coming on the show once again, Ravin. But most of all, congratulations on this launch, and I, for one, can't wait to play around with the new model and report back on how I go with it.
ravin: Thank you, Hugo. I'm excited for you to be one of our, uh, first users here.
hugo: Thanks for tuning in, everybody, and thanks for sticking around until the end of the episode. I would honestly love to hear from you about what resonates with you in the show, what doesn't, and anybody you'd like to hear me speak with, along with topics you'd like to hear more about. The best way to let me know is currently on LinkedIn, and I'll put my profile in the show notes. Thanks once again, and see you in the next episode.