The following is a rough transcript which has not been revised by Vanishing Gradients or the guest. Please check with us before using any quotations from this transcript. Thank you. === greg: [00:00:00] If you wanna make progress towards intelligence, then you need a definition of intelligence. And one of the crazy things that just shocked me about being in this industry now for a little while is that we don't have a formal definition of intelligence that we all rely on. And so Francois went about trying to formally define intelligence, and he has a very unique definition of intelligence, which is what leads into ARC here. So his definition of intelligence is your ability to learn new things. Beautiful. It's really quite beautiful. And just to put a small little addendum on that, it's your efficiency at learning new things. So how quickly can you learn new things? We're specifically measuring new things rather than just recall or knowledge. And so you won't see any questions that say, who's the fifth president of the United States? And this is in contrast to a lot of other benchmarks that may ask you to do what I call PhD-plus-plus problems, so extremely hard problems that really show where the ceiling is. But benchmarks like ARC here show you where the gap is between what humans can do and what AI can't. [00:01:00] Because by the way, every single data point on here has been human tested, so we know that humans can do these, but AI still has a really difficult time doing 'em. hugo: That was Greg Kamradt, president of the ARC Prize Foundation. What you just heard was Greg's description of the benchmark that might be the best shot we have at measuring actual intelligence in machines, not test scores, not memorization, not vibes. This episode is about ARC-AGI, a benchmark built on Francois Chollet's definition of intelligence: not how much you know, but how efficiently you can learn something new. In this episode, Greg and I get into why most benchmarks completely miss the point, how ARC is designed to expose that failure, and what happens when even the most powerful models, GPT-4-class models, can't solve tasks that humans get in one or two tries. We talk about the structure of ARC, how it's tested, and how the OpenAI o3 model [00:02:00] unexpectedly performed well, but only with massive, carefully engineered compute. We dig into whether that counts as generalization and what it reveals about where the field actually is. Greg lays out a simple but radical idea: we'll know we've hit AGI when humans stop being able to design tasks that stump AI. Until then, ARC might be the best tool we have for tracking progress. If you care about evaluation, if you've ever wondered whether these systems are really learning or just guessing, well, this conversation is for you. I'm Hugo Bowne-Anderson, host and producer of Vanishing Gradients. If you've been enjoying the podcast and wanna support the show, firstly, I'm teaching my course on building LLM-powered applications again. It's called Building LLM Applications for Data Scientists and Software Engineers. This time, I've got sessions lined up for both Europe and the US. So if you or your team are working with LLMs and want something practical and hands-on, I'd love to have you join, or feel [00:03:00] free to pass it along if you know someone it'd be right for. The links are in the show notes. Second, spread the word. If this episode resonated, send it to a friend or a colleague.
These conversations get better the more thoughtful people are part of them. And lastly, I'm in Europe for a while. I'll be in London, Berlin, Paris, and a few other places over the next few weeks. If you're local and want to connect, or you'd like me to come talk with your team about AI, just reach out. I'd love to hear from you. Also, a quick heads up: I recently dropped a new episode of my other podcast, High Signal, with Fei-Fei Li. We talked about foundation models, what human-centered AI actually looks like in practice, especially in areas like elder care and education, and why spatial intelligence might be the next big leap beyond language models. It's one of the more reflective conversations I've had on how AI is reshaping society and how we might shape it back. If that sounds up your alley, check it out. The podcast is called High Signal and the link is in the show notes. Finally, I'd love to hear from you about what [00:04:00] resonates, what doesn't, and who you'd like to hear me speak with. You can reach me on Twitter. The podcast handle is @vanishingdata; I'm @hugobowne. It would be great if you could subscribe to the show on your app of choice, and if you like, do write us a review on iTunes and or anywhere else. That's how we get discovered and how we can support creating even more wonderful content for you all. Thanks for listening. Hello everybody, Hugo Bowne-Anderson here for Vanishing Gradients. So excited to be here today with Greg Kamradt. What is up, Greg? greg: Yo, I am very excited to be here. Thank you very much for having me, Hugo. hugo: Dude, what a wildly exciting time to be talking about such things and working on such things. I'm not a religious man in any conventional sense, but it's Easter Friday here and everyone's at the beach and all that. I'm, yeah, I'm going later, but I was like, let's jump in. It's actually tough to sleep at the moment with all the exciting things happening in the space, isn't it? greg: I tell you what, as a benchmark company and benchmark provider, we go out there and a new model's released, we have to go rush around and go test it. There's too many models there. Too many are [00:05:00] coming out here. It is hard to keep up with it. We need to keep on investing in our testing infrastructure. hugo: Without a doubt. And actually, I just woke up to prepare for this, doing a few other things, and then saw Gemini 2.5 Flash, and I'm just like, come on, dudes. I haven't even recovered from o3 and o4-mini. Yeah, there's a lot going on here without a doubt. Greg, look, I'm interested: I mean, you've done so much already in your career, from your work at Salesforce as director of Strategy and Growth and Data Intelligence, to being a founder, to all the incredible content you produce, and now you're president of the ARC Prize Foundation. I'm just wondering if you could tell us a bit about your journey and what got you really interested in how we measure intelligence in AI systems. greg: That's a long journey. We'll get to the cool part last. Long story short, I was an Excel monkey in the beginning of my career for a long time. I didn't realize it at the time, but I really like spreadsheets and I really like pulling data out of a spreadsheet. That's what got me jazzed. I found that it was my skill. I liked doing it, and it was a great combination of things. I [00:06:00] quickly found out, though, that if I only let Excel be the ceiling of my skillset, then I was gonna be in trouble.
And so in 2014, I took one of the first data science bootcamps. It was called Zipfian Academy, a shout out to that crew. Very cool. They were acquired by Galvanize. And so all of a sudden in 2014, I move over to San Francisco and I take one of the first data science bootcamps, and there I took my skill level from Excel-as-a-ceiling to, all of a sudden, learning pandas. And I tell people pandas was my gateway drug into the rest of coding, 'cause then you have to start pulling from APIs to get data, then you're working in Python, and then Python's a gateway drug to so many other places and so many other languages from there. So I ended up doing that, went further down that rabbit hole. Went to Salesforce for a while, led growth for a couple billion-dollar products, which is just insane. Super cool. Worked with a really fun team there. And long story short, ended up working my way over to the AI space. And so, fast forward to November (many folks know November, [00:07:00] 2022; it's wild that we're almost coming up on a three-year anniversary here) and ChatGPT drops and you start playing with an API. You're like, holy cow, I can just immediately tell that this is gonna be transforming everything I was doing beforehand. And so I jumped into that and I took a learn-by-doing approach, or a do-by-learning approach. And so I was learning how to use ChatGPT, but from a developer perspective, and then also actual use cases. So not like, here's 50 prompts that you need, on LinkedIn or something like that. It was like, no, here's a tutorial, here's me using LangChain to go do something cool. And it was at a very unique time, and I'm sure you remember this, Hugo, too. All of a sudden the world had language models (not all of a sudden, of course, there's a lot of prehistory for it, but I'm saying the language models went mainstream) and it's, wait a minute, what do we do with this stuff? What does it do? And even just enumerating the use cases: oh, it summarizes, oh, it classifies, oh, it, you know, does sentiment analysis or whatever it may be. That was all still new. People were still figuring that stuff out. So it was a really fun time to be learning [00:08:00] alongside the community and sharing on Twitter and sharing on YouTube, just all this other good stuff. And so I was doing that for a while. That actually led me over to Mike Knoop. And so I heard Mike on an interview once and he was talking about Zapier's internal AI adoption. So one of the crazy things is, if you have a thousand employees at your company, how do you get them to adopt AI? What does that even mean in the first place? And I heard Mike talk about it in an offhanded way in an interview; maybe he said two sentences on it. And I was like, wait a minute, that is such a cool topic, 'cause with my background being in business, and now with the AI, I was like, I wanna talk to Mike about this. So I cold reached out to him, and either it was him or his assistant said, Mike loves talking about this topic. This is one of the topics near and dear to his heart. So I ended up doing an interview with him, which I had just a ton of fun doing. And next thing you know (I'm skipping over a lot of details) Mike comes to me and he says, Greg, I got a side project I'm gonna be starting up.
It was ARC Prize. So last year, in 2024, [00:09:00] we were mainly just a competition and we were a team of four. So it was Mike, Francois, myself, and a fellow colleague, Bryan Landers, who's amazing as well. And then we just had a ton of fun, and ARC Prize blew up a little bit more than we thought it was going to. And towards the end of last year, Mike turned to me and asked me if I wanted to basically run it this year. So not only did we upgrade our mission from just a competition to a proper nonprofit, but he asked me if I wanted to run it. And I'll pause here after I say this point, because this is just so much fun, but I couldn't think of a cooler leverage point for my skillset to be at right now. I'm so privileged and so honored to be doing what I'm doing. It was a one-time thing: if I said no to ARC Prize at the time, it wasn't coming back, 'cause it's such a timely period that we're in within the AI world. So I thought about it and I thought, you know what, being at this intersection is where I want to be, running this thing, having some fun. Let's go and do it. So it's been a really fun 2025 so far. hugo: Incredible. And funnily, people aren't here to hear about me, but I do wanna just for a minute talk about how my journey dovetails with yours, 'cause there are synchronicities and [00:10:00] differences. I used to work in scientific research, biophysics, cell biology, early 2010s, and I ended up discovering I needed to munge a lot of data and discovered something called an IPython notebook. There weren't Jupyter notebooks then. And then this thing where I could import pandas as pd and pd.read_csv. Yo, like, this is one of the biggest things. Inline matplotlib, the PyData ecosystem and the scientific Python community having this kind of modular set of tools that you could interact with. And they of course were in the US and Europe, but mainly in the US, building a lot of these tools and doing it in person, from scientists Fernando Perez and Brian Granger to Wes McKinney working in finance. So they were building all these tools together that we all found incredible, and then those tools went into industry. And then there were some big changes in the tools that we used, particularly the whole ops thing, which happened and is still happening. But then all these AI tools. And to your point, the LLMs weren't new, and in [00:11:00] fact it was a pretty basic product wrapper around a preexisting LLM that brought our ChatGPT moment about. We just had the Stable Diffusion moment as well, and interacting with these systems. And to your point, I think we do focus too much on the generative aspects of them and not necessarily the few-shot learning capabilities and in-context learning, that we can do things like sentiment analysis and summarization. And to your point as well, summarization we can do, but we're still figuring out how to do evaluation and evals on summarization, right? So it's really early days, and I compare it, and other people have done this of course, to the early days of being able to harness electricity, where we didn't have the light bulb or the grid. And I think Edison formed the innovation lab to figure out what's up, and they're like, okay, let's figure out how to use it. And a lot of people say this, but I do believe if development on foundation models stopped today, which clearly it won't, we'd be figuring out how to use them for decades, if not longer.
[00:12:00] I am interested in how you think about this, because I've always thought AGI was perhaps a red herring of some sort, and too many conversations focused on it and we didn't have good ways to measure it. But when Francois Chollet, who people may know from Keras and a lot of other work he's done and his beautiful books, one of the few people who's a deep technologist and a deep philosopher and thinker about our societies, when he started working on, or developed, ARC-AGI, I was like, this is something I wanna watch. And I've always been concerned that LLMs are basically stochastic parrots and memorizers, right? But I'm not so convinced anymore after a lot of the work you've done. So I'm wondering how you've thought about this. Maybe you can give us the TLDR on ARC-AGI and why it's different from just measuring, perhaps, memorization of these types of things. greg: Absolutely. So it started with Francois's paper, which I've read multiple times here. Francois is just [00:13:00] an amazing thinker throughout this whole thing, so it's really awesome to be able to work with him. So it starts with the definition of intelligence. He actually opens up his paper, like the first sentence in the abstract, I'm gonna misquote it here, apologies, but it says something along the lines of: if you wanna make progress towards intelligence, then you need a definition of intelligence. If you wanna make progress, then you need to clearly define where you're going. And it's just so true. And in the paper he talks about a survey of many people throughout the 1900s, and maybe even beforehand, talking about what the definition of intelligence is. And one of the crazy things that just shocked me about being in this industry now for a little while is that we don't have a formal definition of intelligence that we all rely on. Everybody has their own opinions on it. Sure, I want to hear those. None of them are formal. It's rare that you're gonna write those down in mathematical notation or computer science notation. And so Francois went about trying to formally define intelligence, and he has a very unique definition of intelligence, which is what leads into ARC here. So his definition of [00:14:00] intelligence is your ability to learn new things. Beautiful. It's really quite beautiful. And just to put a small little addendum on that: it's your efficiency at learning new things. So how quickly can you learn new things? Now, the way that this is defended here is, anybody, any system, can get good at a narrow task. You wanna get good at chess? Okay, cool. Go do a bunch of RL on it. Same thing for Go. Same thing for any classification measures. Generally, if you just have enough training data, you're gonna be able to get good at that one task. However, those systems that are getting good at that one task aren't necessarily gonna be able to go play chess. They're not gonna be able to drive a car, they're not gonna be able to go do all these other things. So intelligence, rather than just stuffing your model with more data so that it can do more memorization and interpolation within a traditional deep learning network, is: how well can you go learn new things? That's the real question. And so ARC-AGI is a benchmark that Francois proposed in order to test his measure of intelligence there. [00:15:00] Now keep in mind that this was 2019.
So this is a while ago, this is six years ago, right? And we've been through a global plague together since. It's been a long time. And he really introduced it the first time to challenge deep learning. So the way ARC works is you have input and output grids, and I have an example here if you want me to show my screen. Would that help? hugo: I would love that. And actually, we chatted about this briefly before, but while you pull that up, I'll say that the intuition behind a lot of these tasks is inherently visual as well. So getting us a sense of that would be useful. greg: Let's go to an easy one here. This is the one that I was... actually, no, we don't have that version up right now. But either way, the way ARC works is you have examples, and these are training examples, and humans are specifically very good at learning from just a few-shot examples. We don't need a ton of training data in order to get something down. So we have examples here. We have inputs and we have outputs: input, output. And the whole goal of this entire thing is you want to see how the input [00:16:00] maps to the output. So there's some transformation or rule that's required in order to turn the input into the output, and you want to see what it is. And spoiler alert, sorry for anybody at home, but the way that this one works is you count how many squares or pixels are on the left side, and you repeat that many spaces out on the right. So there's four squares, four pixels. Here you go: 1, 2, 3, 4, start over, 1, 2, 3, 4, start over, and you go from there. Now, the really interesting thing about ARC is we call this a skill, or some sort of rule or transformation that's needed to do this. If you go to another task, there is another skill that is novel and different and unique from the first task that you actually did. So here's another example: a different skill is required. Here's a third example: a different skill is required. Here's a fourth example: a different skill is required. Now, here's where it gets really interesting. On our public data, we have four public sets, and these are really meant for training and to see how your model does. But in order for us to verify a score about how your model's doing, we actually have two hidden test sets, [00:17:00] and the skills required on those hidden test sets have never been seen before on public data. So the AI model that beats those hidden test sets will have had no choice but to learn, on the fly, those new skills from those new examples, and to perform them at test time in and of itself. So what we're forcing the model to do is learn new things at test time. And that's the beautiful part about this entire benchmark: we're specifically measuring new things rather than just recall or knowledge. And so you won't see any questions that say, who's the fifth president of the United States? And this is in contrast to a lot of other benchmarks that may ask you to do what I call PhD-plus problems, so extremely hard problems that really show where the ceiling is. But benchmarks like ARC here show you where the gap is between what humans can do and what AI can't. Because, by the way, I forgot to mention, every single data point on here has been human tested. So we know that humans can do these, but AI still has a really difficult time doing 'em.
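To make the format Greg is describing concrete, here is a minimal sketch in Python. The train/test structure mirrors the public ARC-AGI JSON task files; the particular rule in toy_rule is an invented stand-in for the counting-and-repeating transformation he describes, not the actual task shown on screen.

```python
# A toy ARC-style task. Real ARC-AGI tasks are JSON files with "train" and "test"
# lists of {"input", "output"} grids (2D lists of integers 0-9, one integer per
# colored cell). The rule below is an illustrative stand-in, not the task Greg
# shows on screen: count the colored pixels in a row, then mark every n-th cell.

def toy_rule(grid):
    out = []
    for row in grid:
        n = sum(1 for cell in row if cell != 0)  # how many colored pixels?
        out.append([1 if n and (i + 1) % n == 0 else 0 for i in range(len(row))])
    return out

toy_task = {
    "train": [
        {"input":  [[1, 1, 0, 0, 0, 0, 0, 0]],
         "output": [[0, 1, 0, 1, 0, 1, 0, 1]]},      # n = 2: every 2nd cell
        {"input":  [[1, 1, 1, 0, 0, 0, 0, 0, 0]],
         "output": [[0, 0, 1, 0, 0, 1, 0, 0, 1]]},   # n = 3: every 3rd cell
    ],
    "test": [{"input": [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}],
}

# A solver only gets the few train pairs to infer the rule from, then must apply
# it to an unseen test input: the "learn a new skill at test time" requirement.
for pair in toy_task["train"]:
    assert toy_rule(pair["input"]) == pair["output"]
print(toy_rule(toy_task["test"][0]["input"]))  # [[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]]
```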
hugo: And humans can do it... greg: [00:18:00] Relatively easily, with it? Easily is a subjective term. Yeah. Here's the criteria that we hold. So for our V2 dataset that just came out, we went and tested 440 people. So that's part of what goes into running a benchmark, which I didn't know beforehand. But for us to be able to make the claim that humans can do this but AI cannot, we need to have first-party data and we need to be able to say that authoritatively. So we went down to San Diego and we tested about 400 people, and we only included tasks that at least two humans could do, in two attempts or less. So yes, multiple humans can do this. And then we went and tested, you know, frontier models. Top models right now are getting 4 and 5% on this, whereas a panel of humans got all of these correct. So yes. hugo: Super cool, and lovely that you get to go to San Diego for work as well. What a cool, chill place. greg: It's fun. I'm located just north of San Francisco here and I was born and raised in California, so San Diego's near and dear to my heart. hugo: Fantastic. And funnily, I used to live in the US, as you may [00:19:00] know and some listeners and viewers may know, but for Australians, San Diego's very similar to Sydney in a lot of ways, and we jokingly call it Sydney greg: with fish tacos, right next to the border there. Some of my favorite Mexican food in the entire state is down in San Diego, so I appreciate that fact. hugo: I appreciate that. What we've talked through is something that humans can do with a few tries, fantastic, and that AI, historically, AI systems, foundation models, have been horrible at. I want to just dig into this a bit deeper, because with a lot of benchmarks we've had previously, like MMLU and a variety of others, there are concerns around memorization and that there's some sort of leakage from training to test. So if you'll allow me to play devil's advocate gently a bit, and I know you can never, like, a hundred percent ensure these things, but how can you mostly ensure that these types of problems weren't somehow in the training data? greg: Yeah, for sure. So the first thing that I'll say is ARC 1 and ARC 2 are [00:20:00] very scoped domains. It's literally a 2D input-output JSON, right? It doesn't match reality at all. So we very well know that there are a lot of blog posts about solving ARC. There's a lot of code online about solving ARC. There's a lot of examples about solving ARC online. And what's really interesting is, if we look at the scores on public data versus the scores on private data, there's a clear drop-off between the two. So that already tells us that there's overfitting happening on the public eval set. Now, what we do is, first of all, we clearly label what is public and what is private. Then if we share any private data with any lab or do any testing like that, we have trust agreements in place, whether those are mutual NDAs or whatever it may be, so we can ensure that it doesn't leak. And then for our competition on Kaggle, where we have our private dataset, there's no internet access allowed on our Kaggle competition, so there's no data leakage happening from there. But here's the thing: we acknowledge that it's not a completely perfect system, and we're not doing state-level security on top of this thing. The spirit of the benchmark is that you're not gonna be [00:21:00] training on it, and it plays within the rules, but we acknowledge that it's not state-level security. hugo: Makes perfect sense, and I appreciate you elucidating that.
Look, dude, as you said, the ARC-AGI-1 benchmark was 2019, and now, in the past year... I don't even know where to start, to be honest. I think, maybe before jumping into what you've launched recently, let's talk about o3, man. What happened late last year with OpenAI's o3 system? greg: It's a really fun story, and I haven't told it publicly on a podcast like this yet, but long story short, it was early December and we get an email from one of our close contacts at OpenAI. I won't mention who, but one of our close contacts at OpenAI, and they say, more or less, hey Greg and Mike, we have a new model that we want to go test on this. I jump on a call with them and they say, we've tested this on the public eval so far. And I say, great, what's the claimed score? And it was quite high. And it was high enough where it's like, okay, it's OpenAI and they claim a high score; we're gonna take this very seriously right now. And [00:22:00] so we worked with them over the next two weeks to basically verify their score on our semi-private data. So this is the holdout set. And they were wondering the same thing that we were: how overfit was their model, if at all. I would say they were very diligent about making sure that overfitting was minimal, so I have to commend the OpenAI team for that. It was very cool to see. And then we go to OpenAI, their San Francisco office. I think it was Tuesday, December 18th or 20th or something like that. And we have a meeting with Sam and basically the top brass at OpenAI. So it was Sam, Mark, Jakub at the time, and then a few of the folks, which was very cool. And we wanted to show them the verified scores, and we wanted to show Sam specifically the verified scores. So we were in the room with them and we showed 'em, more or less, the blog post that we wanted to post on this thing. And then Sam turns to us and says, you know what, we think that you should come join us on the live stream on Friday. And this was two or [00:23:00] three days before, and we were not expecting that at all. Our one condition was that we wanted to write the script that we would say on that Friday. Now, of course, we were joining their live stream, and so we weren't gonna go on there and be disrespectful. If we wanted to be disrespectful, we wouldn't have done it on their turf. But we wanted to say, hey, here's the script that we wanna read. They were cool with it, they approved it. They said, yep, looks great, come join us. So we did a couple of rehearsals, and then next thing you know, we're on the live stream on Friday, and it was a great time. hugo: Incredible. And what do you put this down to? What were the key innovations in o3 that led to this jump? greg: I have to acknowledge upfront that my knowledge of this is pretty much all speculation now. It's informed speculation, being knee-deep in the space and trawling Twitter as much as I can and seeing all this stuff, but it's TBD exactly what the architecture changes are. What we do know as fact, and this is what we published beforehand, is that there was a lot of compute and a lot of tokens used on o3 specifically. And so for o3 low compute... but keep in mind, [00:24:00] o3 was just publicly launched this week, and so I'm gonna call the version that we tested in December o3-preview. And the only reason why I call it preview is because, actually, no, it was confirmed that the model we tested in December is not the same as the model that came out just this week. And this recording is April 17th. So the o3-preview that we tested used a lot of compute, even for the low end. It was on the order of $20 per task, and so that's around $10,000, which was eligible at the time for our public leaderboard. Now, there was a big open question as to what you actually price o3 at for that, and we used o1 pricing that was available in December of 2024. And then the o3 high compute used about 170x the compute that we used on low compute, so there was a substantial amount of extra compute going on for that. There have been open questions around: was it a single chain of thought? Was there a bunch of sampling that happened? Was there search happening within the CoTs at [00:25:00] test time, or whatever it may be? But all that is just pure speculation, and so TBD on what that exactly is.
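A quick back-of-the-envelope on the figures quoted above: the roughly $20 per task and the roughly 170x multiplier come from the conversation; treating cost as scaling linearly with compute, and reading the roughly $10,000 figure as the total for the low-compute run, are assumptions made purely for illustration.

```python
# Back-of-the-envelope arithmetic on the o3-preview figures quoted above.
low_cost_per_task = 20        # USD per task for the low-compute run, as quoted
compute_multiplier = 170      # high-compute vs low-compute, as quoted
total_low_cost = 10_000       # approximate total quoted for the low-compute run

# If cost scales roughly with compute (an assumption), the high-compute run is
# on the order of a few thousand dollars per task.
print(low_cost_per_task * compute_multiplier)   # 3400

# And ~$10,000 at ~$20 per task implies on the order of 500 evaluated tasks,
# if that total refers to the same run (an inference, not a quoted count).
print(total_low_cost // low_cost_per_task)      # 500
```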
hugo: Such an exciting time. I am interested: so 2019 was ARC-AGI-1, and in the past month we've seen the big next iteration of it. So can you tell us about this, Greg? greg: ARC-AGI-1 was, call it, I think it was 900 tasks, and it was all done just by Francois, which is crazy. Having made a bunch of ARC tasks to date now too, knowing that Francois made 900 of these in the very beginning is absolutely insane. Now, the second thing with that is, after five years you start to identify certain flaws with ARC-AGI-1, which is completely normal; this is just the way that these things usually go, and you have more learnings. So ARC-AGI-2 is taking everything that we learned within the first five years of ARC-AGI-1 and putting it into ARC-AGI-2. And not only that, it's a more sensitive measure to test out the nuances. So as AI gets better, we need more sensitive tools that can start to really piece apart what these models are doing and how their performance is actually going. So there's a [00:26:00] few key differences between ARC 1 and ARC 2, and we have an entire changelog up on our website if anybody wants to go check that out. Number one is that we've actually verified that a panel of humans can do every single one of these ARC-AGI-2 tasks. That's number one. Number two is the tasks require deeper application of rules in order to solve them. So if you look at ARC-AGI-1, there's a lot of simple rules in there, and it's maybe, fill in the corner of this square for the remaining test set. That's pretty easy for current AI right here. But as you start to get into multi-compositional rules or step-by-step rules, that is becoming more difficult for AI, and so there's a lot more of those within V2 as well. And then, let's see, I'll have to pull up the changelog a little bit later, but think of it as the upgraded version: what we learned from ARC 1, and we're applying it to ARC 2. hugo: Awesome. And so it launched a few weeks ago. Can you just tell us a bit about what the prize is and how people got involved? And then I'd love to jump into what happened. greg: We want to find an open-source [00:27:00] solution to ARC. So people may know ARC a lot for measuring models and seeing how that goes. Sure, that's a big part of what we do.
Another huge part of what we do is we actually guide a lot of research. So last year, in our competition, ARC Prize 2024, we had a paper track, and in this paper track we awarded $87,000 USD to top papers that were submitted for our competition. And we do that because we really want to push conceptual progress, and we use ARC as a tool; it's almost like a big target that says, hey, make conceptual progress towards this direction, we think it's really important. And we incentivize that with a lot of prize pool that comes from there. So that went well. We had a grand prize of $600,000 last year. The grand prize was not claimed; nobody hit the grand prize threshold that we wanted people to. And so we came back this year with ARC Prize 2025, and we still have our top score award, which goes to whoever gets the top score at the end of the year. We still have our paper prize, which is gonna be awarded for the top conceptual progress. And we actually increased our grand prize to [00:28:00] $700,000. So anybody that can beat ARC-AGI-2 with an 85% threshold and open-source their solution is gonna win 700 grand for that. hugo: Incredible. And what does open-sourcing mean to you, out of interest? 'cause we do live in a space where we have open weights, which isn't necessarily reproducible per se, and open source, which can mean many things. So it's a broad church, essentially. greg: It's a very broad term. The spirit of this is that somebody else could take the data or the system and they could go run with it. Basically, they could go reproduce it. Now, there's a very nuanced piece of this, which is: what about the training process? How'd you get those weights? Sure, you open the weights, but what does that mean from a training-process perspective? It's very well known that training a model can get really messy, and somebody may train a model over the course of the entire competition, which is going for eight months. And so we ask for an absolute best-effort attempt, to a certain sufficiency level, to describe how you actually did your training. Now, if somebody trained over eight months, we don't need to confiscate their computer here and go get an audit log of this entire thing. But we need somebody to be able to retrain the model if [00:29:00] they chose to go do that work themselves. And usually a written description is gonna be enough for us. hugo: Super cool. We have a great question in the chat from Natalia, and she's actually discovered it herself, but I'm kicking myself for not mentioning this earlier: what does ARC stand for? greg: Right. When Francois first introduced it in 2019, ARC stood for Abstraction and Reasoning Corpus. However, mid last year, call it maybe March or April, we realized that ARC was an overloaded term. There are a lot of ARCs out there. There's the Arc Institute, there are more ARC benchmarks, and so me, Mike, Francois, Bryan, we sat on a Zoom call one afternoon and said, all right, we gotta do something about this name here. And so we decided to keep the competition name, ARC Prize. The organization is now called the ARC Prize Foundation, and then the benchmark itself we dubbed ARC-AGI.
Now, it's a little bit of a poignant term, and it's not a marketing term per se, but it's kinda meant to be a little bit more visceral, because we're actually using these benchmarks to measure generalization, and [00:30:00] that's the ultimate North Star that we're going after here: how can we measure, and ultimately guide, progress towards AGI. hugo: What kind of submissions are you seeing so far, and any interesting strategies, anything that caught you off guard, unexpected? greg: I would say there's two. Okay, there's three. Not three classes, but three that I'll mention here. Number one is the least interesting to me. So we run a Kaggle competition, and what I've learned is that Kagglers are extremely apt at probing a leaderboard. So they will do everything they possibly can to go against the leaderboard and do a whole bunch of techniques for it. The only responses you get back from Kaggle are your runtime and your score, and people will go and try to encode information in the runtime. So if they see certain task attributes, they'll put in wait statements or sleep statements, and depending on how long it sleeps, that gives 'em information about their tasks and everything. And they'll use that information to incrementally build up their score over the year. Needless to say, that's not in the spirit of the competition. That's not what we're aiming for. But I [00:31:00] tell you what, that's just an alignment of incentives, and we have cash money out there for anybody who can beat the prize, so we understand why it's out there. But that's the first class; we call that brute forcing. The second class is going to be what a lot of top teams did last year, which was test-time compute, or test-time training. So what they actually did was, on the hidden test set, there were a hundred problems, a hundred tasks. At test time, meaning once they saw the task, they went and generated a ton of synthetic data that was special just to each individual ARC task. So they looked at task number one on the hidden test set, and keep in mind there's no internet access allowed, so this is all autonomous, and they generated a ton of synthetic data using popular DSLs that are out there right now and popular ARC task generators. And then they trained an entire language model a hundred different times, on each one of those ARC tasks. So you have a hundred different models, with a bunch of synthetic data on top, each specifically tuned just to go battle that one specific ARC task. And so people did that, and towards the end of last year they [00:32:00] scored in the fifties on ARC-AGI-1. I should say that we retested the best submission from last year, from the ARChitects, from Jan and Daniel, two awesome people, and that submission ended up scoring, I think, 3% on ARC-AGI-2. So ARC-AGI-2 scores quite a bit lower.
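Schematically, that second class of submission looks something like the sketch below. This is a minimal illustration of the per-task test-time-training loop Greg describes, not anyone's actual entry: the helper functions are trivial stand-ins (real submissions used DSL-based ARC task generators, augmentations, and genuine fine-tuning of a language model), and every name here is invented.

```python
import random

def generate_synthetic_variants(train_pairs, n=200):
    # Stand-in for DSL-based ARC task generators and augmentations (rotations,
    # reflections, color permutations, freshly generated grids in the same
    # "skill family"). Here we just re-sample the given pairs.
    return [random.choice(train_pairs) for _ in range(n)]

def finetune(base_model, examples):
    # Stand-in for fine-tuning a fresh copy of the base model on this one
    # task's examples; a real entry would update model weights here.
    return dict(base_model, task_examples=examples)

def predict(model, test_input):
    # Stand-in for the specialized model generating an output grid.
    return model["task_examples"][-1]["output"]

def solve_hidden_set(tasks, base_model):
    # One specialized model per hidden task, all offline (no internet on Kaggle).
    predictions = {}
    for task_id, task in tasks.items():
        synthetic = generate_synthetic_variants(task["train"])
        model = finetune(base_model, synthetic + task["train"])
        predictions[task_id] = [predict(model, t["input"]) for t in task["test"]]
    return predictions
```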
Then the third class of submission that was submitted this year, which is just super interesting, is from a fellow named Isaac; apologies, his last name is slipping me right now. He has a submission called ARC-AGI Without Pretraining. So there's no pretrained model at the very beginning, and he's doing it all at test time, and so it's a really cool thing. And not only that, I'm encouraging him to write a paper and submit it this year, because it's just so novel. And that's one of the really cool things that we believe: that progress towards AGI can likely come from independent researchers. It doesn't need to be a big lab. And what I like to tell people is that compute makes up for a lot of algorithmic inefficiencies. And so if you have a lot of compute, that allows you to be, the word isn't lazy, but the word I would use is that it [00:33:00] fuzzes out some of the noise on your dirty algorithms if you have more compute to go throw at it. So if you don't have a lot of compute, that's okay; you're just gonna have to make more of a breakthrough on the algorithmic side. But that even goes to the GPT-4.5 video that OpenAI just put out with a blog post where they're talking about it. Somebody had asked, or the question was posed: hey, humans are really language sample efficient; humans are really good at language. The question was asked on the podcast: how much more inefficient are current language models than humans, from a language efficiency perspective? And the representative, I think his name was Daniel, on there, he said current language models are about a hundred thousand times less efficient than humans at language, which just goes to show. And we can go really knee-deep on intelligence and my thoughts and kind of philosophies around this here, but that just goes to show how much more algorithmic progress you need to make. So that's one of the reasons why I agree with OpenAI and their compute ambitions, 'cause no matter what, we're [00:34:00] gonna need the compute, and the algorithms are just gonna get better. I do not see a future in which we're not gonna need as much compute as absolutely possible. So it's okay to go pedal to the metal on that right now. hugo: Yeah, and this is very related to what we were talking about with how these tasks are achievable by humans with few-shot learning, essentially. And I mean, children are great at one-shot learning. This is an obvious example that people talk about in the space, but show a kid one pony and it can point out lots of ponies and differentiate them from even horses, and that type of stuff, which is so important. So before going deeper into what you've seen recently and the future of ARC-AGI, I would love to just dive a bit deeper into your thoughts on intelligence and how we measure it and the future of being able to measure it. greg: Sure. So my starting point for this argument is we have one existence proof of general intelligence that we know of, and that is the human brain. That's what we know. What we can [00:35:00] derive from that is three things. We can derive the output, so we know what humans can do. We can see that they can learn new things. We can see how sample efficient they are at learning those new things. And so we have an output, which is good. And then we have two denominators, which are absolutely fascinating. We know how much energy the human brain takes in order to output what it does, right? So we can literally measure the calories, and calories converted to energy. We can do that. And that's fascinating, because we can compare the energy required by the human brain to the energy required by a computer chip.
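For a rough sense of scale on that comparison: the numbers below are not from the conversation. The 20-watt figure is a commonly cited ballpark for the human brain, and 700 watts is roughly the rated power of one current datacenter GPU; both are used purely as illustrative assumptions.

```python
# Illustrative only: a ballpark energy comparison of the kind Greg describes.
brain_watts = 20        # commonly cited ballpark for the human brain
gpu_watts = 700         # rated power of one modern datacenter GPU (assumption)

hours = 1.0
brain_wh = brain_watts * hours   # ~20 Wh to run a brain for an hour
gpu_wh = gpu_watts * hours       # ~700 Wh to run a single GPU for an hour

print(gpu_wh / brain_wh)         # ~35x for one chip, before counting whole clusters
```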
And yes, that says some things about hardware efficiency too, but it also talks about algorithmic efficiency that comes from the same place. Okay, great. The second denominator that goes into a human brain is the training data. And so we can make rough proxies as to the training data that goes into a human brain. Now, vision throws a really big curveball into this, because of the pixel data and how information-rich it is, but we can make rough proxies of how much training data goes into humans as well. So using that existence proof of general intelligence, the human brain, that's one of the main [00:36:00] reasons why we like to choose problems that are easy for humans and hard for AI for our ARC benchmark. Because if we can do that, then that identifies a gap against a known existence proof of general intelligence. So if you're going towards general intelligence, then using the human brain as a benchmark, or at least the starting point, is a really good place to start. Now, I do get a lot of pushback on this that says, hey Greg, the human brain isn't actually that efficient; don't you think the universally optimal algorithm of intelligence is gonna be a lot more efficient than humans? And of course it is. People like to gimme a hard time for that, but there are multiple routes to a problem. I do fully believe that once we find AGI, we're gonna look back and be like, wow, that was inefficient. So anyway, that's a starting point there. I'm happy to take it anywhere else. hugo: Yeah, I love that. I love the idea of just trying to quantify the amount of energy required for these tasks. I think my pushback there would be, I suppose we're talking about human adults doing these tasks, or [00:37:00] teenagers, and not toddlers, for example. And so there's a huge amount of energy that goes into, and I don't mean to be too reductive or pragmatic about this, but goes into clothing and feeding and educating children as they grow up. And then, I suppose, with few-shot learning, there's what's encoded genetically and historically and evolutionarily, and all the energetic costs of that also. greg: Yeah, you bring up a good point here. So when I was giving my example, I was talking about a single human in isolation. However, we are the product of, like, a lot of the things that we can do and the learning mechanisms we have is cultural intelligence. So it's culture that is built up over time. So it would be an interesting thought experiment to measure, just ballparking, all the energy that went into those thoughts, which then were transferred over to me, that I've been able to piggyback on, or that we've all been able to piggyback on top of. But yeah, it's a good push, good question. hugo: Yeah. And I'm interested, I think these are related in some way, and I presume you all have thought about this, and I know Francois does think about such things, but, and this is an abstract question in some ways, what's the role of curiosity in intelligence? [00:38:00] greg: An open question in my mind is how much of the intelligence that we have here is a biological artifact versus absolutely necessary for intelligence itself. There are open questions, and I won't claim to know the answers to these; of course I don't. There's an open question that people sometimes ask: will AGI be conscious or not? Okay, sure. What is consciousness? Let's go down the rabbit hole and let's go figure it out. Let's not get into that.
'cause that's a whole different question here. Then, similar to this question, is: what is curiosity, and is that necessary for intelligence, or is that a biological aspect? I would say that's TBD; that's much more in the world of the study of humans and how intelligence plays a part within humans, rather than intelligence in and of itself. I know a ton of people are gonna argue with me on it, but it's funny: once you start getting into this place, then you really need to start paying attention to definitions and words, because you'll often see that you may disagree on a downstream argument, but you can root that all the way back to disagreeing on the definition of a word in the beginning. hugo: Totally. And that's why it's so fundamental that Francois, in his paper, [00:39:00] which I've linked to in the chat and in the show notes, On the Measure of Intelligence, at least is defining it. You can disagree with this definition, but then we need to have another conversation. Let's at least define it. And I do agree on the consciousness question. I would push back: I don't know whether things, computers, software will be conscious. I don't know whether a lot of humans are. I don't know if I'm conscious half the time as well, to be honest. greg: Francois has an interesting take that consciousness is actually gained over the course of the life of a human. So another interesting question, and this is just more of a thought experiment, is: are babies AGI? Not BabyAGI, no, but like literally a newborn out of the womb. Is it AGI, yes or no? Is AGI the capacity to have AGI, 'cause it does grow up into a fully forming adult? Or is it a state? Does the potentiality define it, or does its current state define it? And if its current state defines it, it's like, dang, not really. It's not doing much. But then over the course of its life... even my son right now, he's 11 months old. I watched him figure out what his arms do, and how did he figure that out? [00:40:00] There are random neurons firing inside of his brain, his arms are flailing all over the place, and then all of a sudden he learns, wait a minute, these feelings that I have, I can start to control my arms. And so you can see this random process start to get a little bit more organization around it. So maybe there is AGI in that. In that case, it's just a much bigger rabbit hole. hugo: Totally. So going back to ARC, how close are we to people cracking it, do you think? greg: A key narrative switch that we made this year is that efficiency is a first-class citizen within our ARC reporting. So last year we just had a 1D benchmark, which is literally just a leaderboard with a score on it. In the world of inference-time compute, we have no choice but to report an input that corresponds to the output. Now, much along the lines of my theory earlier, I would love to have the input be direct energy usage, or direct training data, that gets you your ARC output. Because humans are not trained on the entire internet of data, yet they do really well on ARC, whereas LLMs are trained on the entire internet of data and they don't do that great. So training data is a key piece of this.
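As a concrete picture of that reporting shift, here is a tiny sketch of what a two-dimensional leaderboard entry looks like conceptually. The field names and numbers are illustrative, not ARC Prize's actual schema or figures; the None fields mark the denominators Greg says he would love to report but can't measure directly today.

```python
# Illustrative only: a score reported alongside the resources spent to get it.
leaderboard_entry = {
    "model": "example-model",        # hypothetical entry
    "arc_agi_2_score": 0.04,         # fraction of hidden tasks solved (illustrative)
    "cost_per_task_usd": 2.50,       # the efficiency axis reported today (illustrative)
    "energy_per_task_joules": None,  # desired denominator, not directly measurable yet
    "training_data_tokens": None,    # desired denominator, rough proxies at best
}
```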
[00:41:00] So to answer your question, where I'm going with this is: there are models that do very well on ARC. And so we saw o3 in December; the o3-preview model, the one that we saw, scored 87% on the public eval on the high-compute setting, which is insane. However, at what cost? There were six-figure, potentially seven-figure sums that were spent in order to achieve that. And intelligence being an efficiency game, we need to understand what we're looking at here. So to your point, if it's just a capabilities demonstration, we've seen some models do very well on ARC. If it's in terms of human-level efficiency on ARC, we're still a few orders of magnitude away from that. So that's that; we're not there yet. hugo: How can people get involved? People are listening and they're excited, but dunno where to start. How would you suggest people get started? And I know you have a Discord, which I'll link to in the notes as well, for people to chat. greg: Here's the thing, there's a lot of passionate people on the Discord, which is awesome. It's hard for me to keep up sometimes; there's a lot going on there. So there's that. arcprize.org is our website, and we literally write the website to be a first-time user experience. And so if [00:42:00] you want to know about ARC-AGI, go to the ARC-AGI tab and go start reading there. In fact, I might have authored that page myself, so I don't know if there's any typos; you could start there. And then not only that, there's a huge amount of energy on Twitter as well. So if you're into Twitter, go follow ARC Prize, myself, Mike, Francois, Bryan, and a lot of the folks working on it. If you're not into Twitter, then on our website we also have a resources page. So every single YouTube video that we can find, we throw it up on there. Every paper we can find, we throw it up on there, all the code and everything. So there's a lot of really awesome introduction-to-ARC resources on there. That would probably be good for the audience. hugo: For those familiar with kind of the broader ecosystem of benchmarks, and I think we talked about MMLM, I can never say it, say that 10 times in a row really fast, MMLU, HELM, BIG-bench, all of these types of things, all the great work that people at places like ALO AI are doing, right? How do you see ARC fitting into this broader ecosystem of benchmarks? greg: A core belief that I have is that you cannot tell the [00:43:00] whole story about a model just by one benchmark. You have to have a portfolio approach. And in some of the work that we do, our recommendations with government as well. In fact, we just submitted a recommendation to the Office of Science and Technology Policy, OSTP, with regards to the AI Action Plan and the strategy for that. And in there, one of the key things that we mentioned was you have to have a portfolio approach. So I think that there is a clear use case for a lot of different categories of benchmarks. Now, ARC, we place ourselves within an abstraction-and-reasoning or a logic-and-reasoning type of benchmark, because there are no words in our benchmark, there's no cultural knowledge, there won't be shapes that you recognize or anything like that. And so it's purely abstraction and reasoning. Now, we also pride ourselves on being a generalization benchmark. A lot of other benchmarks will tell you about the known knowns. You can know with certainty the amount of trivia that a certain model may know if you give it a SAT- or trivia-style exam.
If a model does well on ARC, you can be certain that there's a certain amount of [00:44:00] generalization happening in it. Now, generalization means that it's starting to go into the unknown unknowns, and it could generalize in a lot of places; we don't yet know which directions those models are gonna go and generalize from there. hugo: Sorry, I was just thinking, my mind just went a couple of places. You mentioned that there's a lot of chatter and building and stuff on Discord, 'cause I've got a Discord for the community I'm building, course and education stuff for the LLM-powered stuff that we've talked about before. And watching people just build and chat on Discord is incredible. It's totally overwhelming as well, right? And especially with the type of electricity we're seeing in the space at the moment. So where my mind went was, what's that experience like? But more generally, what have you learned so far, just watching people go after this benchmark in practice? greg: For the Discord noise, we have a Slack channel that just pings us when there's a new introduction in Discord, so we get a feeling for what everybody's background is. And what we should do is go hook up a classifier to say, what type of person is this? And for everybody who signs up for the newsletter, we actually [00:45:00] ask 'em, what's your background? We ran that through an LLM at one point, and we actually get a persona analysis for all the different people that are coming to us. So one of the things that I've learned from Discord and our newsletter is who the heck is participating in ARC, which is absolutely fascinating to see. And what else are we learning? There's a lot of energy around it. There are no better or greater verifiers than the public themselves. And so we just released ARC-AGI-2, and there were a few tasks in the training sets that ended up getting past our verification process, which is just so cool to see the community come around for. So really thankful that those folks are helping us out with that. But they're passionate and we want them to have a spot to vocalize, like a fire pit that they can all talk about ARC around. And so we're happy for it. hugo: Super cool. And are you at liberty to talk about the type of personas or the type of people represented? greg: Yeah, what we saw, I'll just say big buckets. Big bucket number one is ML researcher at a big lab, like whoever the big lab is, and they fill it out about ARC. Being in the benchmark game, there's a lot of optics around benchmarks as well, especially for big labs. There's a lot of money and there's a lot of investment and there's a [00:46:00] lot of decisions being made as to how good these models are, and people tell how good these models are by vibes, by just what people say on Twitter and things like that, but then also by benchmarks. That being said, there's a lot of optics and there's a lot of things to take into consideration. So they may not talk about ARC a lot, but through a lot of private conversations that we've had, there's a lot of big labs that are thinking about ARC, which they can't talk about publicly quite yet. So that's number one. Number two is gonna be ML researchers at non-big labs. So we had a few people from NVIDIA participate in ARC Prize last year.
A few independent researchers that are working at maybe a startup, and they're just passionate about ARC on the side and they want to go work on it, which is great. Group number three is gonna be professional Kagglers. So you look at their profiles and it says grandmaster, and they've participated in a bunch of competitions. So they may not have a deep ARC background, but they are good at competitions and good at data science, and so they want to go after it for that. And then, what's crazy, and one of my favorite personas that we have, is we know of at least 10 different startups that are using ARC as a North Star within their company. So they've literally [00:47:00] said, we're a small AGI research company, call it seven to ten people, and they choose ARC as a North Star to go and develop against, because they believe if they can come up with an in-house, closed-source solution that beats ARC, then they're gonna have something that is commercially monetizable. So they want to go after that. And that's another huge persona that's fun to talk to. hugo: Fascinating. To that point, I do wonder, you need to occupy a sweet spot, right? Where it's hard enough, but not impossible. So, and this probably comes down to initial design choices from Francois and then the collaborations, but how do you think about balancing the benchmark so it's hard enough to matter without making it just totally impossible? greg: ARC-AGI-1 was pretty much, quote unquote, impossible for the first five years. And that is to say, we have a slide on one of our videos that we did: it survived 2019 to 2022 or something with little to no progress on it. There were some interesting [00:48:00] things that happened with purpose-built solutions on Kaggle, but nothing from a generalized model that could do well on it. Then when ChatGPT came out, it scored maybe, I think, 0% when it first came out, like GPT-3.5 or DaVinci-3 or whatever; it scored low. And even GPT-4o, a couple months ago or whatever, was scoring something like 8% or something like that. So it was basically impossible for a long time. It wasn't until we saw reasoning models that scores on ARC-AGI started to jump up, and that's what's really interesting about a benchmark: one of the ways that your benchmark has validity is that it makes a capabilities assertion as to what's going on with the underlying models, which is one of the reasons why we love it. As we designed ARC-AGI-2, we held nothing back: as long as we could still authoritatively claim that humans could do these tasks but AI couldn't, we didn't care how hard the task was, or its perceived hardness, for AI. As you say, it's one of the craziest times, absolutely, to be in AI. It does not behoove us at all to make this test any easier than it already is. We [00:49:00] want to be able to make an authoritative claim that the thing that beats ARC-AGI-2 is extremely special, and so we make it as difficult as we can while still maintaining that humans can do this. So it's not necessarily moving the goalposts, it's like shifting the orientation of them. We believe that we have something special on our hands. Now, that brings us to the next point, which is: you came out with one, you came out with two, what's ARC-AGI-3? What's going on with that? And so we're starting to already think about ARC-AGI-3, and TBD on when that comes out.
It's not gonna come out any earlier than March of next year, March of 2026. But again, with that one there will be a sweet spot, but we're not gonna hold back on it. hugo: I really do love that you're still indexing on tasks that humans are able to do, as opposed to, to use your term, PhD-plus-plus problems, which most humans can't do 'cause they don't have the training or expertise, among other things, but which AI is, you know, well suited for. So then switching that calculus around and having it focused on that. greg: [00:50:00] To bring this point home even further, there was one benchmark out there where, I think, o3, I forget what it scored, I'm making up numbers here, but I think o3 scored 10%. Okay. Now, if you gave o3 internet access, it scored 20% on that benchmark. So what that tells you is that all that model needed was more access to the outside world. It actually didn't need to increase the amount of intelligence that it had; it just needed more information and cultural knowledge from the outside world. So that tells me that test doesn't tell you a ton about the underlying intelligence of the model, per se. It tells you about its access to the outside world, which is completely fine, to measure that scope of thing. What we are trying to do is specifically measure intelligence. So that's why we don't require any outside information. Like, you're not gonna need to go search the internet to do well on ARC. hugo: You mentioned this briefly with respect to Kaggle, people hacking or overfitting in some sense, which I've done on Kaggle, not for this, you know, back in the day. I'm [00:51:00] wondering if there are any other ways you've seen people try to overfit or, quote unquote, cheat, I'm using that term lazily, and how you think about those risks. greg: It's one of those things where it's like, trust but verify with people who come to us. What's funny is sometimes I'll get an email that says, Greg, we claim a hundred percent on ARC-AGI, just a hundred percent, it's already done. And my first thought is, you've clearly put the answers into your model or something, 'cause nobody's getting a hundred percent on ARC-AGI, or at least we haven't seen it quite yet. How do we think about other overfitting? We have some anti-overfitting techniques that we proactively apply to our dataset to help ensure that it doesn't happen, and I don't want to go into 'em because that would give away the secrets. But yeah, it's something that's top of mind for us. But here's the thing, there's a few ways to approach this, and I'm sure a security person would put vocabulary words on this and educate me on it, but we take proactive measures with regards to the dataset. Okay, cool. Now, with our hidden dataset, we also take reactive measures. So if somebody's gonna come to us, well, we've had to say no to verifying a lot of scores, actually. There've been people who come to us and say, hey Greg, [00:52:00] we are about to raise a fundraising round for our startup, we want to announce a really good score on ARC, can you please test our model and give us the report so we can take this to investors? We don't know them, we don't trust them, they're not claiming a SOTA score, it's a black box, and they want us to throw our data over the fence and basically just go give it to 'em. And so our recommendation to them is, look, go do it on the public eval. Go say anything you want about it, it's out there in the public. We can't test private data.
So, proactive and reactive measures. hugo: I'm prying a bit here, but I am interested, my ears just pricked up, and if you can't answer this, of course, don't. How many people know what's in the test set, and how do you think about security around that? Do you tell your partner over the Friday night drinks or whatever? greg: It'd be tough to even explain it; it's hard to describe these things in words. So I think that would be okay. I would say for ARC-AGI-1 there are, I don't know, six people with access to it. Francois is one of them, and Kaggle is another party. So there's not a ton. For ARC-AGI-2, a lot more people submitted tasks for [00:53:00] it, and so one of the security measures that we take is that somebody may submit a task, but they may not have access to the entire repo of tasks. So for ARC-AGI-2 it's a bit higher, but we try to keep the circle of people we have to trust as small as possible. hugo: As I said before, I really like that Francois came up with a relatively robust definition of intelligence. greg: To your point too, one of the crazy things is, why don't we have more actual definitions? I keep my ears out for any definition of intelligence, and all that I see is just scoring well on some test. But any single test is gonna be really limited, unless it's a meta kind of benchmark or meta test, like ARC-AGI is. I'm absolutely shocked that there are not more definitions of intelligence out there. hugo: Absolutely. And from many communities and a variety of people, there have been intelligence researchers in psychology and all over the place, but we still haven't landed on it. It seems like you just can't quite touch it in some ways. greg: It does seem elusive. I could see two scenarios playing out, and I'm not sure which one it is. Number one is, it can't be [00:54:00] defined; it's one of those things that just can't be defined. Or, number two: there's a recent YouTube video, so good, about the astronomical distance ladder and measuring distances. Back in the old days, when they wanted to measure the distance to the sun, they looked at the moon, and it was extremely crude, just a hack at how far it was. In fact, the first estimate was almost an order of magnitude off, like six times off, whatever it was. Measuring intelligence may be like that. So we're taking our best attempts at measuring the efficiency with which you learn new things. We have ARC-AGI, which is a very crude tool; like I said, it's very scoped and narrow in domain. But it may be like that, and maybe we just don't have the tools to measure it yet. hugo: I work a lot with LLMs and build a lot of AI systems, and with LLMs, like, you chat with LLMs in their conversational interfaces, right? And the APIs don't have memory in the way the products do, right? So when we're building systems, we think about adding memory, we think about retrieval, and RAG is one example of that, just getting information from somewhere. We think about tool usage and [00:55:00] augmenting them with being able to search. And that's going along the agentic continuum in a lot of ways, right? Agents are still a marketing term in a lot of respects, I think, to be honest. But I did want to lead the conversation towards what type of role you think this agentic continuum plays in solving ARC-style challenges.
And so I'm really thinking about retrieval and memory and tools, essentially, to be specific. greg: In the Kaggle competition, we don't care if you use tools, we don't care if you use memory, we don't care about any of that stuff. Mm, I hate to be a narc about it, but we have to go and define those words, like, what is memory, right? Is fine-tuning a hundred different models on a specific ARC task a form of memory? 'Cause you're encoding information about the task into your actual weights. I mean, I could see an argument for yes on that. So I think what we've seen is that base LLMs just outputting the answer do not do well on ARC. That's just an absolute fact that we see. Will a base LLM, like GPT-4o or GPT-8 or whatever, do well on it? Maybe. We're not there yet. Who knows? [00:56:00] So far we've seen you need reasoning to do well on ARC-AGI-1. Is reasoning a tool? I mean, it depends how we start to define tools. Is spinning up your own Python script within your thought chain, you know how o3 is doing Python calls within its chain of thought or whatever, is that using a tool, and is that gonna help you on ARC? Yeah, heck yeah, it'd help you on ARC for sure. So I'm being a little bit crude about it, but I think you need alternative approaches. Yes, you need tools in order to solve ARC, but that's the point about intelligence in the first place: you're not just gonna output the answer in and of itself. hugo: Yeah, makes sense. And have you seen people experiment with, like, significant multimodal models and that type of stuff? greg: Yeah, a lot of people say, oh, you just need a multimodal model to do really well and then it's gonna get it. Even with the rise of multimodal, we have not yet seen success on ARC, even with the vision models that come from that. It's another tool, it's another technique. At its core, the ARC data is, [00:57:00] well, humans like to look at it in terms of colors and grids and everything, but at its core it's just a JSON list of lists. So we don't care, and we are agnostic as to how a computer looks at it. If a computer wants to convert it to an image and go after it from there, cool. If you wanna just treat it like JSON, great. If you wanna do some higher order of mathematics on it, great. Whatever you wanna use. Because here's the thing: we're less concerned about how you actually implement your answer, and it's more about, can you reason about what the puzzle's actually trying to do? That's the part that we're excited about. hugo: Totally. Look, I'll believe AGI exists once I never have to look at a JSON again as well. greg: It's funny you say that, you haven't yet asked how I define AGI. hugo: Yes, please. I was getting there. Yeah. greg: Informally, and this is the informal definition, but I think it's a fun one: we roughly think that when we at ARC Prize can no longer come up with problems that humans can do but AI cannot do, for all intents and purposes we have AGI. So the way that we see it is, right now there's a gap, and that's between what humans can do and what AI can do. [00:58:00] And we know there's a gap because we literally have ARC-AGI-2 as a proof point that yes, there are problems that humans can do and AI can't do. So there's a gap. That gap will end up closing, 'cause AI's gonna get better.
It'll keep getting closer and closer, you know, closer and closer, and eventually there will not be a gap anymore. And at that point, as an organization, a well-funded organization with a lot of smart people trying to do their fricking best to make benchmarks for it, when we can no longer do it, for all intents and purposes we practically have AGI at that point. Now, is that a formal definition? No, but it's almost like one of those things, like, how do you tell if a pan is hot? There's a lot of different ways to do it, reactively, proactively, a lot of different ways to measure it. But yeah, exactly, that's one of the ways that we like to measure it. hugo: I love it, 'cause it also has a Turing test vibe as well, and it's eminently practical. It's like, how far can we get to convince ourselves of this? And it's inherently a very scientific approach as well, 'cause in science you can make hypotheses, you can validate them, but you can never [00:59:00] prove anything for certain. If we just can't create any more things that humans can do but it can't do, that's a really good heuristic, and a practical heuristic, I find. greg: It's like, there are one-off tasks, like, oh, count how many Rs are in strawberry. Okay, cool, that's one problem, and yeah, it has a hard time doing it. But to, number one, quantify that problem and then, number two, come up with at least 400 to 500 other problems in that similar domain, so you can actually start to get a measurement and a score that you can reduce the variance on, that's hard. That's the operational part of running a benchmark like that; that is the work. It's not good enough to just come up with an idea. Like Humanity's Last Exam, they had to organize, I forget how many people, three or 400 different people, to come up with questions, validate them, put 'em in their test, et cetera, et cetera. That is operational work that is difficult, and that's like half of the job of running ARC Prize, making sure we can do that. It's not just coming up with a good idea; that's the starting point, and then you actually need to go execute on it. hugo: Absolutely. We're gonna have to wrap up in a [01:00:00] minute, but I'm interested in looking ahead: what are you most excited about in this space in the coming weeks, months, or, say, a year? I know that's a bit bullish given everything that's happening in this space, but yeah, what's exciting you? greg: The truth is that better models are coming out. That's just the way it goes. It feels like the year just started, but we're already a quarter in, so we're already a little over 25%, well, probably even 30%, of the way through 2025. What's crazy is that, for folks that have been in this for a while, I remember Dario saying that he believes AGI is gonna be here by the end of 2026; that's 18 months away. So thinking about how to position ourselves to be able to back up that claim or deny that claim about whether or not AGI is actually here is something that's on our mind as we develop V3. It's a long project, but we're undertaking that right now, which is gonna be insane. And I would expect, we're seeing o4-mini, that was just launched. o4 will come out this year at some point. Will o4 Pro come out this year at some point? I don't know.
Is it gonna be fricking really good? Yeah, it's gonna be, [01:01:00] it's gonna be amazing. So, working with the labs and seeing how that goes and everything. But then not only that, there's also ARC Prize as a nonprofit. We fund ourselves through donations from the public, and we just did a fundraise back in February that helped us out for a good chunk of 2025. We're gonna do another fundraise either towards the end of this year or in 2026. And so bringing more people on board who want to join the mission, who want to get involved with this and who want to be on the leading edge of measuring intelligence and progress, is a key piece for us. hugo: We've linked to the ARC Prize website and the guide on how to get involved as well. But I think there are at least two ways people can get involved. The first is taking part in the competition: check out the website, check out the Discord, chat with people, check out the Kaggle competition if that's your thing. On another side, if you work for an organization or personally feel like getting involved in sponsorship, follow Greg and ARC on Twitter and join the Discord and those types of things as well. I am interested, can people get involved [01:02:00] on the developing-the-ARC-benchmark side as well? greg: I would say, much like what I was talking about with regards to the operational challenge with it, unfortunately it's a very small team that's running this entire thing, so we don't have open opportunities to take on everybody that we'd like to have come on to it. But there are a very select few folks who are interested in developing either V2 or continuing to push on special projects with us. One big cohort of users I love talking to is graduate-level or PhD-level folks who are writing papers, and they're figuring out, what do I go and spend my research time on this year? There's a lot of really cool ARC ideas that we have that we'd love to actually go do, and in fact, I think I'm gonna do a call for research with Francois, like a content piece later in the year, that's just like, hey, here are the cool ideas that we think are worthy to go do. Please let us know if you're doing them, and maybe we even fund some of 'em, like we'll give you some funding in order to go push on some of these. So I would say get in contact with us if you're excited about ARC. There's always plenty of stuff to do. hugo: Fantastic. I wanna thank everyone for joining. We've had over a hundred people join during the live stream, so thank you all for joining, and if you're watching afterwards, thanks for tuning [01:03:00] in. Greg, thank you, not only for all the amazing, really exciting, groundbreaking work you do, but for your time and generosity. You've got at least two children, with ARC and an 11-month-old as well. So really appreciate your time, generosity, and wisdom, man. greg: Awesome. This has been a lot of fun, Hugo. Thank you very much for having me. hugo: Thanks for tuning in, everybody, and thanks for sticking around to the end of the episode. I would honestly love to hear from you about what resonates with you in the show, what doesn't, and anybody you'd like to hear me speak with, along with topics you'd like to hear more about. The best way to let me know currently is on Twitter: at Vanishing Data is the podcast handle, and I'm at Hugo Bowne. See you in the next [01:04:00] episode.
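A short, illustrative aside on the ARC-AGI task format Greg describes in the conversation (grids that are, at their core, just JSON lists of lists): the sketch below shows roughly what a task looks like and how an exact-match check on a predicted grid works. This is a minimal sketch, not official ARC Prize code; the toy task, its swap-two-colors rule, and the candidate_solver function are invented for illustration, and the "train"/"test"/"input"/"output" field names are assumed to follow the publicly released ARC-AGI task files rather than any official spec.

# A minimal sketch of the ARC-AGI task format: grids encoded as JSON
# lists of lists of small integers (colors). Field names are assumed to
# mirror the public task files; everything else here is made up.
import json

# A toy task: the (hypothetical) rule is "swap colors 1 and 2".
task_json = """
{
  "train": [
    {"input": [[1, 1], [2, 2]], "output": [[2, 2], [1, 1]]},
    {"input": [[0, 1], [1, 2]], "output": [[0, 2], [2, 1]]}
  ],
  "test": [
    {"input": [[2, 0], [1, 1]], "output": [[1, 0], [2, 2]]}
  ]
}
"""

Grid = list[list[int]]

def candidate_solver(grid: Grid) -> Grid:
    """Stand-in for whatever approach a competitor uses (LLM, program
    search, vision model...). Here: the hand-written swap-1-and-2 rule."""
    swap = {1: 2, 2: 1}
    return [[swap.get(cell, cell) for cell in row] for row in grid]

def exact_match(predicted: Grid, expected: Grid) -> bool:
    """ARC-style scoring is all-or-nothing: the output grid must match
    exactly, cell for cell, including its dimensions."""
    return predicted == expected

task = json.loads(task_json)
for pair in task["test"]:
    prediction = candidate_solver(pair["input"])
    print("solved:", exact_match(prediction, pair["output"]))

The point mirrors Greg's: scoring only checks the output grid, so whether a solver treats the grids as JSON, renders them as images, or searches over programs is entirely up to the competitor.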