The following is a rough transcript which has not been revised by Vanishing Gradients or the guest. Please check with us before using any quotations from this transcript. Thank you. === hamel: [00:00:00] Evals are a systematic way of measuring an AI application. If you wanna improve on something, it's very difficult to do that if you're not measuring it. A lot of people have done evals incorrectly. They've gone into a place where they're writing evals prospectively without doing any data analysis and not being thoughtful about what evals they have, and have wasted a lot of time. And for those people, they've been bitten really hard by evals. If you've wasted a lot of time on something and not gotten any value, then yeah, you're gonna be upset. If you try to take a software engineering approach to evals, then you're gonna get burned. I think there needs to be a revenge of the data scientists in AI. hugo: That was machine learning engineer Hamel Husain, describing a frustration that's become all too common for teams building with AI. In this episode, we explore the common traps he's identified, from relying on generic off-the-shelf metrics to outsourcing data review and skipping the essential [00:01:00] work of looking at traces, many of which stem from trying to force a deterministic, traditional software paradigm where it no longer works. Hamel and I discuss why getting this right is so crucial, how these common mistakes lead teams to chase misleading metrics, waste valuable resources, and ultimately stall the development of reliable products. And then we get into the solutions, a practical data-centric playbook. We have a guest caller as well, Bryan Bischof from Theory Ventures, who shares his powerful failure-as-a-funnel analysis for diagnosing issues when building complex AI agents, among other things. On top of this, the next iteration of Hamel and Shreya Shankar's course, AI Evals for Engineers and Product Managers, starts on October 6th, and they've been kind enough to give Vanishing Gradients listeners 35% off; the link is in the show notes. This conversation was originally recorded as a live stream. I'll link to the full video there as well. I'm Hugo Bowne-Anderson, and welcome to Vanishing Gradients.[00:02:00] Hugo Bowne-Anderson here with the one, the only Hamel, the evals guy. What is up, Hamel? hamel: How's it going? You know, you have the best radio voice. I wish I could just have you announce me like all the time. hugo: Dude, if you wanna hire me at some point to be your hype guy, that's definitely on the table. I'm glad you said I have a voice for radio, not a face for [00:03:00] radio, which I've gotten before as well. But great to have you back on the show, Hamel. And before jumping in, I'd just like to welcome everybody who's tuning in live on YouTube. So welcome to a special edition of Vanishing Gradients, the AI, ML, and data science podcast, to talk all about evals. We know a lot of things Hamel loves about evaluation when it comes to AI apps, but today we'll be talking about 10 things that he hates about evals. Before we jump in, one of the things I'm excited about is Hamel and Shreya Shankar's upcoming course on AI Evals for Engineers and PMs, which covers many details of what we'll be chatting about today. If you're interested in that, the next cohort starts on October 6th, and I'll put a link, which will give you 30% off, in the chat there as well. hamel: One final word, it's 35% off, not 30. hugo: 35. Amazing.
It keeps going up; tomorrow it'll be 50. And a brief word from our other sponsor, which happens to be me. I teach a course which covers some evals, but it isn't [00:04:00] a deep dive into evals; it's on the entire AI software development life cycle, which I co-teach with our friend and colleague Stefan Krawczyk, who works on agent infrastructure at Salesforce. So I'll put a link to that in the chat as well. All of that aside, Hamel, maybe I'll say a few words by way of introduction. You are a machine learning engineer with over 20 years of experience. You've worked with lots of innovative companies such as Airbnb and GitHub, including early LLM research used by OpenAI for code understanding. All of those things are amazing. I also love that you've worked with a lot of companies that aren't necessarily big tech companies, helping bring tech to the long tail of people trying to build with data, ML, and AI over the years. hamel: Yeah. I think this is maybe the fifth time on your podcast. I'm not sure, but it's been quite fun. We've talked about some of these companies on your podcast, which I encourage other people to go check out. hugo: Absolutely. And yeah, at least five times, I think, [00:05:00] something like that. We joke that at one point you were like, hey, you wanna just make me co-host? That may make more sense. And in the show notes I'll link to an episode where we do talk about your early days in consulting, in fact, and when Hamel went to Vegas, which would make a good YouTube series. But enough riffing. I am really interested to jump into the importance of evaluation with you, and the challenges of evaluation when it comes to AI powered software. Something you and I have talked about a lot, written about a lot, consult about, and teach about is how AI powered software differs from traditional software, and the different skill sets that people coming in from classic software engineering and from data science bring. So I thought we could just start with a few quickfire questions. I'm wondering if you could tell us a bit about why we're having this conversation: how AI powered software actually differs from traditional software. hamel: Yeah, so AI powered software involves a lot of [00:06:00] stochastic outputs using an LLM. The outputs are unpredictable, and so how do you write tests for that? How do you measure if your application is behaving? You can't just apply your standard software engineering techniques of writing tests; you have to think about it a little bit differently. You have to bring some data literacy to bear, meaning you have to look at this data, you have to apply a bit of a statistical mindset. You don't have to become an expert per se, you can get started just with counting things, but you need to analyze and look at data and reason about the behavior of a system in a data-driven way rather than asserting tests. That's a key part of it, and that's a skill that a lot of data scientists and machine learning engineers already have. You have to tweak those techniques a little bit to apply to AI products. You don't have to tweak them a whole lot, [00:07:00] but okay, you need to tweak them a little bit. I think the reason why this is even a thing, or is even notable to talk about, is that in the first round or first era of AI engineering the focus was mostly on software engineering, rightfully so, because
you have to have a product first before you can optimize it. I think data scientists and machine learning engineers are good at optimizing systems. They're not always good at building products from zero to one, and so evals were probably the wrong thing to focus on in the early innings of AI, in this whole LLM boom. But as companies get more mature and as the products are already out there and you want to make them work better, then that's when you need to go and pay attention to evals. I'll just stop there for a moment. I could keep talking about it. hugo: Yeah, I love all of that. [00:08:00] And one thing I totally agree on: zero to one, you can get away without even necessarily many vibe checks; one to 80, you need to really start doing vibe checks and then introduce some form of robust evaluation. But then 80 to a hundred, it seems like evals are your moat, to be honest. I also love that you mentioned that data science and ML people are good at optimizing. We are generally very good at hill climbing, and we love jumping into data and we're curious about data, which classic software engineers may not be as much. And with the influx of people into something that came out of ML powered software who may not be as curious about data, we need to figure out how to get people curious about data as well. I love everything you spoke to with respect to how AI powered software differs from traditional software. You've pointed towards the fact that machine learning can tell us a lot about how to build AI powered applications, so I'm wondering if you can tell us why ML [00:09:00] and data science can help us here, but also what differences there are. What novel things have arisen? hamel: Oh, okay. It's good that you mentioned Bryan Bischof, because actually there's a lot more to what is different about AI versus traditional software. In fact, Bryan has a good talk about this, and the title of the talk is Stop Managing AI Like Traditional Software Projects, or something like that. If you Google that, you'll find a talk by Bryan on YouTube where he goes through and talks about how you need to optimize for running experiments and you need to think differently about your roadmaps. Bryan is probably the best person to talk about that. You should totally have him on the pod, by the way, to talk about this. So that's something to pay attention to as well. As far as what the unique things with LLMs are: okay, if we double click on evals, or dig into it a little bit, there are some meta evaluation problems where, okay, if you use an [00:10:00] LLM to evaluate the output of another LLM, how do you trust it? That's a meta evaluation problem, and it turns out you need to use some labeled data. That sounds a lot like machine learning, but you don't actually need to train a model; you just need to use the data for measurement. And it turns out you don't need as much data to do that versus training a model. So there's not one thing; I'm just giving you one example of how you have to shift a little bit what you're doing, the kind of practices. So instead of the train split being 40% of the data or 50% or 60%, whatever, probably your train split is like 10%.
Maybe the train split just means, oh, these are where you draw your few shot examples from. Hmm. hamel: Um, and so when you go through, there are lots of little examples along the way [00:11:00] where you can't just copy and paste data science. You have to think a little bit and see how your data science knowledge generalizes to LLMs, and you tweak things a bit. So for example, error analysis. Error analysis is commonly introduced when you're talking about classifying things, and it's an especially popular technique in something like computer vision. And it may not occur to every ML engineer, oh, let me do this with LLM outputs. And you can get stuck if you try, so you have to simplify things a bit. One trick for doing error analysis on traces is: look at the trace, stop at the first upstream error that you see, write that down, and then move on. Don't try to write down everything you see necessarily. You can, but we advise not to do that when you're starting out. There are all these little things that you do to make your life easier. [00:12:00] You're still using ML principles and a lot of the techniques; you're just adapting them, because a lot of them do transfer. You just can't transfer them blindly, if that makes sense. hugo: That makes a lot of sense. And what I'm hearing is that one of the big similarities is we have these software systems that are constantly changing, with data coming in and out, and a lot of the same tools and techniques apply. But I'm hearing that one distinction, or one important difference, is that in ML we had accuracy and F1 and precision and these types of things. Now, of course, tying those to actual business metrics has always been a challenge, but with LLMs it's even tougher because of the nature of not really knowing what inputs and outputs can be. The world has increased in surface area, although a lot of the techniques remain. hamel: Yeah, in some sense you still, you definitely have to think about your metrics and how those metrics tie to business outcomes in the very same way that you would as a data scientist, and you have to thread the needle all the way [00:13:00] through and say, hey, are my offline metrics good proxies for the user behavior that I want to affect, or the product behavior I want to affect, or whatever it is. And you just make sure you keep an eye on that. In the limit, you have to use the whole suite of your data science knowledge, including how to offline test, online test, A/B testing, you name it. It's all gonna be valuable. One thing I talk about a lot with Bryan is how important data analysis is. At some point, I don't know when this happened, but data analysis became really uncool. It became just the job title data analyst. That's how you can get poor, is to make that your job title. Everyone tried to graduate out of data analyst to data scientist and machine learning engineer 'cause it paid more. And it's those skills of, hey, how do you dig through data, analyze it, transform it, reason about it, hunt through data; having good [00:14:00] intuition, having a nose for how to answer questions, vague questions, or do the investigative journalism of why did this happen? All those skills that are quote data analysis skills, those are really effective when you're trying to debug an AI system.
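To make the "counting things" step Hamel describes concrete, here is a minimal sketch in Python. The file and column names are hypothetical, not something from the conversation: it assumes you have already read traces, written one short note per trace at the first upstream error, and collapsed those notes into a handful of failure modes.

```python
# Minimal sketch: tally the open-coded notes from an error analysis pass.
# Assumes a hypothetical error_analysis_notes.csv with one row per trace.
import pandas as pd

notes = pd.read_csv("error_analysis_notes.csv")  # columns: trace_id, note, failure_mode

# After collapsing free-form notes into categories (axial coding), a simple
# count already tells you where to spend your time.
counts = notes["failure_mode"].value_counts()
print(counts)
print((counts / counts.sum()).round(2))  # share of all observed failures per mode
```

The share column is the point: the biggest bucket bounds how much lift you can get by fixing it, which is the same prioritization logic as the cat-versus-dog example that comes up next.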
And I have a theory that this was always important, even in ML systems; it just was made uncool somehow. And people did talk about it. People have always said, look at the data, it's really effective. But I think it's even more apparent with LLMs. hugo: I totally agree. And actually Gabriela has a wonderful question in the Discord that I think we've spoken to. Gabriela has asked how different the evals we're talking about now are from classic stats and ML. Is it a new thing? Is it an old topic but reframed? I think we've spoken to a lot of this, and I'll try to provide one kind of cookie trail here. I suppose, [00:15:00] as we've said, it is rebranded, but a lot of the techniques, like error analysis, carry over. And I actually linked to a podcast we did about looking at your data where you show a clip from that wonderful Andrew Ng course. hamel: Yeah, the OG ML course. Exactly. From a decade ago. hugo: Yeah. It's him doing error analysis. And it's actually in one of your blog posts as well. Maybe it's the cat versus dog computer vision challenge or something. hamel: Cat versus dog. Yeah. hugo: Exactly. Yeah. And if you look at the amount of dogs mislabeled as cats, maybe it's 5%, I'm just picking that out of the air, and cats mislabeled as dogs are 90% of the errors. Even knowing that informs how you're gonna work in order to get lift on your model, 'cause if you focus on the things that are only 5% of the errors, that's the maximum lift you'll get. So it's probably good to focus on the other ones. hamel: What's really funny is I actually looked at that video earlier today. Those are the exact percentages. hugo: Amazing. Oh dude, the amount of things I remember, and the important things I forget as well. [00:16:00] But that's cool, man. And when it comes to failure analysis for AI powered apps, same vibe, but in complexity, of course, very different. So I always tell people, if you are looking through traces of a multi-turn conversation, or something with 15 possible tool calls, or something along those lines, the amount of complexity in actually looking at those traces and figuring out what's up is a different beast to a two-by-two confusion matrix. hamel: A little bit. It's the same kind of approach in general. This is error analysis: taking notes about what the failure mode is and then doing a very simple analysis of that, collapsing it and finding patterns. So I think it's not intuitive, I would say, that error analysis works. Let's say you're presented with this cat versus dog classifier, and you went to a non-data scientist and said, hey, it's not working, can you debug it? Most people would be totally lost. What do you mean? It's a cat versus dog [00:17:00] classifier. It doesn't work. What else do you want me to say? How do I even go about debugging this? And it's surprising: when you actually look at the data and start to take notes about what went wrong in the error analysis, the cats versus dogs, you'll notice things like, oh, pictures that are rainy, or have rain in them somehow, are messing up the classification. Or, oh, there was a lot of blur in these ones where it got messed up. Or, you know what, that actually does look like a cat, that's really confusing, I have no idea if this is a cat or a dog myself. Those kinds of things.
And that helps you figure out what to do. And similarly with LLMs, it's actually a lot easier in some ways to figure out what to do because, first of all, you're conversing with the computer in English. So you have some way to understand, not perfectly, because the LLM may not execute your intent faithfully. But a lot of times [00:18:00] there are clear prompting issues that you have, there are clear retrieval issues that you have, and you can find those patterns. Whereas in the cats and dogs thing, you need to know a little bit about machine learning. You need to say, okay, I guess I need to go get more rainy dog pictures, and I need to assemble a data set, and I need to think really carefully about how I would train this error out of the model without causing it to forget some other stuff, and how do I mix it into the training data? You have to get into machine learning quite extensively to go attack that problem. Whereas here, you don't have to do that. It's even more transparent to you, and it actually makes the error analysis way more important, because you get a lot more leverage out of it right away. hugo: Yeah, totally. So I do wanna jump into all the hates soon, and I'd love it if people put in the Discord what they love, but also hate, about evals as well. We've talked around the importance of evals, [00:19:00] how to think about them, particularly with respect to error analysis. I'm wondering, 'cause a lot of people struggle with this, if you could just give us a basic definition of what evals are and why they're important. hamel: Yeah. The definition: evals are a systematic way of measuring an AI application. That's the definition. hugo: Awesome. And the next part was why they're important. hamel: It seems obvious now. If you wanna improve on something, it's very difficult to do that if you're not measuring it. So it's hard to achieve really any goal if you don't have some kind of way to measure things, to know if you're progressing or not. hugo: And then for people who've built a prototype or something that's working, how would you suggest they take the first few steps of getting started with evals? hamel: Yeah, so the first step is error analysis, and we have some videos that walk you through how to do that. We probably don't have time on this podcast. The best way to learn how to do error analysis is to watch someone do it. And yeah, we just [00:20:00] did a live demo in a podcast that came out today, on Lenny Rachitsky's podcast. We did a live demo using spreadsheets, so I highly recommend checking that out if you want to see what that looks like. hugo: Great. I'll link to the demo in the show notes. Error analysis, and what next? hamel: Okay, so you do error analysis. Error analysis helps inform what to prioritize and what evals to write, or even whether you should write evals. And then based on that, you have to decide what kind of evals you wanna write. Do you wanna write code-based evals or LLM as a judge? If you can assert something, then by all means use code-based evals. If you can't assert something, it's very fuzzy and it requires judgment, you need an LLM as a judge. If you use an LLM as a judge, you have to go through a meta evaluation process of figuring out, can you trust this judge? Because if you use AI to judge AI, how do you know that? Why is that okay?
You don't know if that's valid or not. You need a way to trust it. [00:21:00] And so the way you do that is measuring your LLM as a judge against human labels. So that's more expensive, it takes some effort, and so you have to make a judgment call of whether or not you wanna build these evals. Once you have a set of automated evals, then you can put those evals in CI/CD, and there are ways to do that so it's not gonna kill you, so it's not expensive. You have tiers of evals: the more expensive they are, the less frequently you wanna run them. And then you can use them for production monitoring as well. You can run the evals against either all or a sample of your production traffic to understand what's happening. hugo: I love it. And before jumping into the 10 things you hate about evals, I wouldn't be doing my job if I didn't mention that we've recently had the great evals war of 2025 on X, the platform formerly known as Twitter, among other places. And I think [00:22:00] it's important to talk about this. There's been a bunch of hate on evals, and I think some of it's pointing in the wrong direction, some of it's misunderstanding, some of it's just terminology issues. But I'm wondering if you could just run us through what people's main objections to evals are currently. hamel: Yeah, it's a lot of different things. One is, a lot of people have done evals incorrectly. They've gone into a place where they're writing evals prospectively without doing any data analysis and not being thoughtful about what evals they have, and have wasted a lot of time. And for those people, they've been bitten really hard by evals. If you've wasted a lot of time on something and not gotten any value, then yeah, you're gonna be upset. If you try to take a software engineering approach to evals, then you're gonna get burned. And it's really easy to fall into this trap of taking a software engineering approach. A lot of vendors sell tools, different tooling around evals. [00:23:00] Part of what they offer is off-the-shelf evals: hey, plug in this library, run all your traces through it, and we'll tell you what's wrong. And they'll print out a dashboard for you that has hallucination score, toxicity score, whatever score, that are off-the-shelf metrics. And those don't work. They may find something, but ultimately they don't correlate. They never correlate with your most burning problems. The only way to find out what your problems are is that data analysis step, the thing that we're calling error analysis, and grounding your evals in that. So people that take that approach of what's being sold to them, either implicitly or explicitly by tools, they also get burned. So there's that. There's also a narrative that certain products have not used evals and only use vibes. The prominent example of that is Claude Code. The team there has gone on podcasts or whatnot and said, hey, we don't use [00:24:00] evals. I don't think so. There's a lot of problems with that. One is certain domains just don't generalize really well to other domains. Coding agents are very unique. That's a very special situation where the developer and the domain expert are collapsed into a single person. That's completely different. The feedback loop is completely different.
You can have a very tight feedback loop and you can skip some steps because, yeah, the developers are the domain experts and you are actually dogfooding the product extensively. And so there's that. There's also, I doubt that there isn't some flavor of evals going on. Upstream from model usage, I can tell you that those foundation models have been evaled extensively against coding tasks. So those evals have been done upstream, and you can get away with doing fewer evals, 'cause those are baked into the model so much that, in a way, [00:25:00] you can stay in this initial stage longer. You can rely on vibe checks. That's one thing. Secondly, I promise you that the Claude Code team is using a lot of analytics. They're computing a lot of statistics around the way people are using Claude Code. I'll be really surprised if they're not. What session lengths are people having? What are their context usage patterns? All kinds of different statistics to help them understand how people are using Claude Code, to find surprising behaviors and outliers. And that is part of evals. And so people may push back and say, hey, I'm expanding the definition of evals. I'm not. I've always said that looking at your data is the most important part of evals and you can't decouple it from evals. hugo: Mm-hmm. hamel: That's exactly the problem. And so whatever we need to do, you can call it something [00:26:00] else, you can invent a new term, I don't care. But you absolutely need to be looking at data. I would say looking at data is even more important than writing the test or having the eval itself. hugo: Now for a word from our sponsor, which is, well, me. I teach a course called Building LLM-Powered Software for Data Scientists and Software Engineers with my friend and colleague Stefan Krawczyk, who works on Agentforce and AI agent infrastructure at Salesforce. It's cohort based, we run it four times a year, and it's designed for people who want to go beyond prototypes and actually ship AI powered systems. The link's in the show notes. That makes a lot of sense. I do now wanna jump into the thing that we promised, the things you hate about evals. You've already spoken to one of them briefly: when we chatted about this, the first one you mentioned was generic metrics and off-the-shelf evals. Maybe you can tell us a bit more about what you hate about this and why teams fall back on [00:27:00] these. hamel: Yeah, so generic metrics: hallucination score, toxicity score, conciseness score, adherence score, whatever the hell, all these sorts of scores off the shelf. Ultimately these waste your time. These give you some kind of false security, and wasting your time is super expensive and very destructive, because you start chasing these illusory targets that don't actually make your system better, or, if they do make your system better, that's not actually the highest priority thing that you should be chasing. Yeah, I just think that it's harmful in many cases. There is a right way to use generic metrics, which is to apply these generic scoring systems to your traces and then use that as a data exploration technique.
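A minimal sketch of that exploration-only use, which Hamel elaborates on next. The CSV and column names are hypothetical; any off-the-shelf scorer just produces a column you can sort by and then read the extremes yourself.

```python
# Minimal sketch: treat an off-the-shelf score as a sorting key for exploration,
# not as a metric to report. Assumes a hypothetical traces.csv with a
# hallucination_score column produced by whatever scorer you plugged in.
import pandas as pd

traces = pd.read_csv("traces.csv")  # columns: trace_id, input, output, hallucination_score

# Read the worst-scoring traces yourself; the payoff is what you notice, not the number.
worst = traces.sort_values("hallucination_score", ascending=False).head(20)
for _, row in worst.iterrows():
    print(row["trace_id"], row["hallucination_score"])
    print(str(row["output"])[:500])
    print("---")
```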
For example, use an off-the-shelf hallucination grading thing, that's fine, but see what the score is, sort your traces by that [00:28:00] score, and see if anything interesting is there. Now, that is one way to use it. I will say I am very skeptical of the whole thing, because if you're using some off-the-shelf hallucination scoring thing, what is that? That's just a prompt. Someone has shipped a prompt to you, and you need to look at what the hell the prompt is. If not, that would be laziness, because there's a good chance the stock definition of hallucination can be tweaked to fit your use case a lot better. Just be wary of any off-the-shelf grading, and take a look at the prompt if you are gonna use it, but also don't report on the score. You can use it as a data exploration technique only. hugo: Absolutely. And I do think one thing you're speaking to is that we're building out the early days of an application layer on an incredibly horizontal technology like foundation models. We shouldn't even expect off-the-shelf metrics to align with what we really wanna measure in a business sense as well. The second thing you mentioned you hate about evals is completely outsourcing data review and leaving [00:29:00] out domain experts. Tell us about this one. hamel: Yeah. A lot of people building an AI application, in their minds they say, okay, the building and engineering of this AI application is fundamentally an engineering effort. So I'm just going to get a PowerPoint slide, I'm gonna say what I want, I'm gonna draw some diagrams, and I'm just gonna ship it to my engineering team. They're gonna build it and they are going to figure out whether it's good or not. So this whole eval discussion, in their minds, the people in charge are like, okay, that's an engineering thing, they're gonna do the evals. No. The domain experts are the ones who need to be intimately involved in the evals, because the developer doesn't have the context to know if the product is working the way it should. They'll try to, but oftentimes they don't have all the context. And so we need to bring the domain expert into the loop. And in the end, we also need to, when possible, have the domain experts write the prompt. You don't [00:30:00] want to disintermediate prompt writing through a developer. That's just a tragedy, because it's not necessary with the right tools. Ultimately you're communicating with the computer in English; you should close that loop. hugo: Totally. I've always wondered about how to structure teams around this type of work as well. And one thing I love, something I learned from you and Shreya in your course and otherwise, is also thinking, with respect to evals, about having a protector. I always used to call it a gatekeeper of some sort, which is a horrible term, but I think you've called it a benevolent dictator: someone who can act as a line of defense on these types of systems? hamel: Yeah, so this term that we love using, benevolent dictator, is just in response to the fact that when it comes to annotating and looking at data, a lot of companies get stuck by doing it by committee. And that's not necessary. You don't need to do everything by committee. You can usually delegate it to one person. Every time I've been involved with this, we were able to [00:31:00] find a single domain expert that we could delegate this task to, that everyone trusted.
And of course, if your company is bigger and you have different geographies and different products and things like that, you might have different domain experts looking at different data, but you don't always need a committee looking at the same data. At scale you can, and there are different situations where you may want that; I'm just talking about, for normies out there building AI products, you can unblock yourself. You need to make this process simple. So as we discussed before, people can get burned with evals really fast, 'cause you can overcomplicate it. It is very easy. So you need to stay super pragmatic the whole time, and part of that is this benevolent dictator thing. hugo: So number three on your list of hates was overeager eval automation. What's the problem with this? Where does it go wrong? hamel: So every time we teach evals, people want to go straight into automation. They're like, can I just have an LLM look at the data and tell [00:32:00] me what is wrong? Can I have an LLM do the open coding? Or they just take the output of the LLM, let's say it's categorizing different errors, and take those for granted. Also, you can try to do hill climbing algorithmically. There are a lot of different libraries and approaches that you can use to optimize a metric, like some eval metric. And those aren't necessarily bad per se, in a blanket sense, but you don't want to prematurely go there, because what you're leaving out if you try to be overeager with this automation is thinking. So it's not even an eval thing, it's about any kind of use of AI for anything. If you're writing a document, you're writing an email, you're doing really anything, you need to look at what the hell is happening and you need to have a human in the loop. Otherwise you might get slop, right? Yeah, you will likely get slop if you try to be too overeager with the automation. You have to use the automation [00:33:00] thoughtfully, and people get too carried away with this overeager automation and it produces bad results, and sometimes that's not intuitive to people. That's absolutely true. Now, just to zoom out a little bit: people often struggle with the idea of, why do I need to look at data? Why do I need to label things, et cetera. The reason is you need to infuse your taste and judgment into your AI product, and the AI cannot read your mind, right? You need to iterate and refine what you want and respond to what is happening. You need to keep refining your requirements until we get to AGI. Okay, so you can delegate to AI without checking it too much if you completely trust its judgment more generally, right? Hugo, if I work with you on a daily basis, [00:34:00] and let's say you're helping me in my consulting business, and every single time, for every different kind of task I give you, you nail it. Okay? You did this marketing project over here, you nailed it. You did this consulting project on evals over here, you nailed it. You built this agent over here, you did it yourself. Eventually you're like, okay, this guy Hugo, just give him whatever, I don't care. I trust him blindly, and until I hear something bad, I'm just gonna give it to Hugo. And that's what you do with good people, right? So that's AGI, right?
And so until we have AGI, you wouldn't just tell an LLM to eval your whole product without you in the loop. That's not gonna work. And hopefully we'll have AGI at some point. It depends on how you feel about AGI, but okay, you don't have to do evals if you have AGI; it depends what your definition of AGI is. But that would help. I was trying to define it. Yeah. hugo: Yeah. And I do think even with some form of AGI, we still need to explain to the AGI [00:35:00] what we're thinking and how it can mimic our choices and our desires and that type of stuff. I do wanna jump onto the next one, which really spins me out. The fourth thing you mentioned that you hate is when teams don't look at the data at all. Does that really happen? hamel: It happens almost a hundred percent of the time. hugo: Tell us about it. hamel: Every time I get contacted for any consulting project of, hey, we need to build evals, we are stuck, we don't know why our system isn't performing, the first thing I'll do is I'll say, okay, let's take a look at your traces. And within a few hours we always find very significant problems that are easily fixed, and people aren't looking at their data. And so this first step, this data analysis, is always skipped, and people are jumping straight into, hey, I bought a vendor of some kind. So they bought Braintrust, LangSmith, Arize, whatever. I'm not picking on them, I'm just saying these are some vendors. It's not that the vendor said to use off-the-shelf evals, they just have some off-the-shelf [00:36:00] evals. They created a dashboard, like, hey, we're doing evals. It's not really working. What do we do? So yeah, I think it takes some skill, though, to look at data. It sounds like we are trivializing it a bit by saying, hey, just look at the data, as though you just need to have a pulse and open your eyes and look at your data. But actually it takes a fair amount of skill to look at data, this whole data analysis regime of visualizing data, navigating it, et cetera. Bryan, do you have any thoughts? hugo: So just quickly, we got Bryan. Oh, am I muted? Oh, no, I'm not. We've got Bryan Bischof on the call from Theory Ventures, illustrious background in everything. What's up, Bryan? bryan: Hey, I was just calling in with a question, but yeah, we can let Hamel finish his answer. hamel: No, I would love to get Bryan's color in here. That would be great. bryan: Yeah, I'm just calling in with a question, Hamel. Okay. I'm obviously very enthusiastic about evals, very enthusiastic about this data analysis piece. [00:37:00] Everyone is talking about agentic; everything's getting way more agentic. We're getting tool calls all over the place, and we're getting agents that can take a problem, break it down into pieces, and fan out some of those pieces of logic. As a quick example, if you ask Claude Code to go and do something, it'll start by making a plan. It'll then start working through that plan step by step. Sometimes it'll make little tool calls to do things, it'll maybe run a command in a terminal, it'll make an edit, request a different tool. Sometimes it'll use an MCP, which is a really popular mechanism these days for accessing specific kinds of data. But one thing that I'm starting to find as we get more and more agentic is that I find myself wondering, where should I focus the evals?
Should I focus really tightly on this high level flow, or should I be more focused on the individual tools? How do I manage that trade-off? And then maybe, [00:38:00] as a sort of obvious follow-up there: have you seen anything that has been really inspiring to you around these more agentic tool use cases? Because it's something that is coming up for me every day lately. hamel: Yeah, I love this. I wish you could throw a softball over the screen so I could just hit it, virtually. That's what happened here. bryan: I didn't intend it to be a softball question. hamel: No, no, I'm just kidding. But my favorite thing around this is actually what Bryan talks about, and that's why I said this is a little bit funny. Bryan has this wonderful analytical technique that he's adapted from his data science skills: he has this failure funnel, and there are many different versions of it. One is a transition matrix where you have your different agents and you can see where the failures are occurring in the handoffs between different agents, and it allows you to focus on and see different hotspots, [00:39:00] and it goes pretty deep. There's no way we can go into it fully. Bryan, where can people find your talks on that? bryan: Yeah, you can check out Hamel's blog. I gave a talk with Hamel explaining that, and you can also Google Data Council, Bryan Bischof, failure as a funnel. So just a quick summary: basically the idea is that, ultimately, for each capability you have a target and a series of steps that's necessary to get there. Each of those steps is an opportunity for the agent to screw up, and if it has an opportunity to screw up, you could eval those individually. And if you just keep careful track, through things like error analysis, of where it screws up in the process, you can actually allocate how often a failure falls into one of those buckets. And that can give you a lot of direction on what to focus on next. If you think about some of the ways that Hamel and Shreya explain how to do error analysis and how to do axial coding, and to really build these categories of failure, it's a very similar concept. [00:40:00] Mine's a little bit more geared towards when there are tools in the mix. One thing that I am struggling with these days is to what degree I should be aggressively going after evaluating harder on the tool itself versus evaluating the agent's understanding of how to call that tool. I feel like there are so many ways that it can screw up calling that tool. Just today we released a new tool where one of our investors is getting really unfortunate results back from just the tool itself. And so that's where I'm thinking a lot about MCPs. hamel: Yeah, I would say also use error analysis there to help prioritize what to focus on. Also, wait, to go back a little bit, just to linger on this failure funnel and why I love it so much, 'cause it deserves a little bit more attention: okay, what problem is it solving? With agents, conceptually, the problem is you have this complicated DAG of different flows and this [00:41:00] stochastic system where the journey can be very variable in this DAG and go everywhere, and you can have failures in any step, and it cascades, and blah, blah, all that stuff, right?
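A minimal sketch of the failure-funnel and transition-matrix tally that Bryan describes above and Hamel unpacks next. It assumes hypothetical step names and CSV files built from your own error analysis notes; the real funnel would use whatever steps your agent actually has.

```python
# Minimal sketch of a failure funnel: from error analysis notes, tally where in
# the pipeline each failed trace first went wrong (hypothetical files/columns).
import pandas as pd

notes = pd.read_csv("agent_error_notes.csv")  # columns: trace_id, first_failed_step

steps = ["plan", "select_tool", "call_tool", "parse_result", "final_answer"]
funnel = notes["first_failed_step"].value_counts().reindex(steps, fill_value=0)
print(funnel)  # which step eats most of your failures, hence what to eval next

# Transition-matrix view: count handoffs between steps across all traces,
# split by whether the handoff succeeded (assumes a long-format events.csv).
events = pd.read_csv("agent_events.csv")  # columns: trace_id, from_step, to_step, ok
print(pd.crosstab([events["from_step"], events["ok"]], events["to_step"]))
```

Pivoting the same counts per capability is how you end up with the multiple funnels Bryan mentions later, one per kind of action.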
And if you step away from that and you say, okay, how do you as a human being analyze such a system? It sounds chaotic. You might think, oh, that sounds impossible, how are you even going to do that? A naive approach is: visualize the whole DAG and animate it and watch real-time streaming information with all these fancy colors and stuff. Oh, it's failing here, now it's failing here. No, that's not gonna work. That just looks cool, but it's not gonna work. And so it turns out, if you zoom out and you think, okay, how do you analyze a DAG, how do you analyze this stochastic system, a transition matrix is a kind of technique that allows you to make sense of that, even before this, not about AI, just in general. So what Bryan did, the reason why it's so [00:42:00] beautiful, is that Bryan generalized his analytical skills, his data science analytical skills. No one told him. He said, okay, I need to analyze this DAG, so I'm going to use a transition matrix. It makes total sense when you think about it, but he was only able to do that, I would argue, because he has a strong data science background, and he was able to basically see that right away. Am I mischaracterizing that, Bryan? bryan: Right away is a little generous, but otherwise I think we're aligned. I think it took me a couple of iterations of staring at this data in a table. Sometimes Hamel makes it a pivot table, which is a really nice presentation as well, and actually the very first way that I approached it was via a pivot table. I only later came to the transition matrix piece. But yeah, thanks. I still have a lot of challenge in just constantly trying to deal with the ever-complexifying [00:43:00] things that an AI system can be asked to do. It's just adding more and more agentic loops. And actually, as crazy as it sounds, I'm starting to get to the point where now I have multiple funnels in my eval setup for products. So I don't just have one funnel that's the whole thing, soup to nuts; it's actually individual parts. So I have a failure funnel for one kind of action and a different failure funnel for another. And so that's been a new emergent complexity. But yeah, still using failure funnels. hugo: Amazing, Bryan. I appreciate you've got to jump in a minute, but thanks for the great question. I've had you on the podcast before, but it'd be great to have you back sometime to talk about failure as a funnel. I do have one final question, if you do have 30 seconds or so, as we're here to love the hate and embrace the cactus and kiss the frog and all those things as well. We know what you love about evals; what's something that just really pisses you off about them? bryan: Okay. I'll tell you a quick one and then a more serious one. The quick one is, I hate that name. I think the [00:44:00] name is terrible. Hamel, if you could have fixed everybody's way of referring to these things, that would've been great for the community. I think unfortunately it's too late for that, and the SEO story is over. We're stuck with evals. I really just think that it's subtly confusing to people, 'cause they're like, what am I evaluating? And sometimes they don't even necessarily think of it as evaluation; they think of it as some distinct thing. And so that has always bothered me. I'm a mathematician,
and so if you think about evaluating a function, you're putting something in and seeing what comes out; that's evaluating a function. I don't love that for this. So that's one thing that always bothers me. It's a little bit of a trivial thing. But one that I do really dislike about evals, and this is gonna be related to some of what Hamel feels, is that I think it can be very easy to feel like they are an impediment to speed. I often hear, especially from executives: we wanna move fast, let's just [00:45:00] get this out there, let's just try it, we wanna see if it's working. And so there's this tension between speed and evaluation. And I think that's maybe one of my most frustrating things about evals: they get this bad rap that they are an impediment to speed. I don't actually believe that they're an impediment to speed. There's a very famous saying, slow is smooth and smooth is fast, and I think that's probably the best way to wrap your brain around why evals don't need to be an impediment to speed. But at first blush, I think a lot of executives hear that the team wants to build more evals and think, oh my god, my revenue's supposed to go up a hundred million dollars this quarter. So anyway, that's my little axe to grind. And yeah, thanks for giving me a couple minutes on the show. Always a pleasure. hamel: I agree. I agree about the name, man. I think Bryan and I may agree that it should be rebranded to data science for AI, because that's really what we're talking about. hugo: Totally. [00:46:00] The gods of marketing aren't a huge fan of that one, sadly. Although actually, data analysis for AI is even better. I think, DA for AI. bryan: Yeah, that's it. Yeah. Awesome. Thanks, man. hugo: Alright, thanks, Bryan. Have a great time. See you, man. Hey, so Hamel, this is great. We've got four things you hate already: one, generic metrics and off-the-shelf evals; two, outsourcing data review and skipping domain experts; three, overeager eval automation; four, people who don't even look at their data at all. We're scheduled until the end of the hour, but let's go slightly over to get through all of them, and let's do some rapid fire ones. Number five was not thinking deeply about prompts. Maybe you can tell us a bit about this and why it gets you. hamel: Yeah. Okay. You need to read your prompts. You should probably write your prompts. Honestly, there are too many lazy prompts out there. I can't tell you the number of times I look at a prompt and clearly no one has reviewed it. It's some slop, and it's obviously slop because it has tons of emojis in it or whatever, things that just don't make any sense. And I'll say, do you want emojis in your response? Why the hell do you have so many [00:47:00] goddamn emojis in your prompt? And I'll give you an example. I was working on an HR assistant for recruiting, and the prompt said, you are a data scientist, please write an email. I was like, wait a second, why are you telling it it's a data scientist? It's just that some lazy person who was a data scientist was writing the prompt and just copied and pasted it. And so a lot of people are not paying attention to the prompts. Also, if you're using a framework, you need to look at the prompt. So I have this blog post, the title is Fuck You, Show Me The Prompt. And I really meant those words. I was really angry when I wrote the post. It was very frustrating,
'cause there are a lot of frameworks, and you can Google it, that hide the prompt from you. And if you see what the prompt is, it will surprise you, the complexity that you are adopting for no reason. hugo: I remember when you drafted that post and you sent it to me and you were like, hey, do you think [00:48:00] it's cool to have this cuss word in the title? And I was like, I'm Australian, dude, so clearly it's cool. But also I was like, it's actually important. And were I American, I'd say it's very descriptive: it described how we were all feeling at that point in time, and the way you captured it was cool. But yeah, using frameworks can obscure prompts, and it can obscure all different types of things, so I appreciate that clarity. Then the next thing that you told me you hate about evals was dashboards that are just full of noisy metrics, more noise than signal. So maybe you can tell us a bit about this one. hamel: Yeah, so those are the dashboards full of the generic, off-the-shelf evals that don't mean anything. A lot of people's evals look like that, so don't do that. If you have that, you need to get help from somewhere. It doesn't have to be me, just get help from someone or somewhere, just to back yourself off of that. And then, what would you replace it with? I would replace it with customized [00:49:00] metrics that track your error rate, the actual errors that you are experiencing, that are informed by the error analysis and the data analysis that you did, instead of this vague idea of some generic problem. hugo: The next one on the list was getting stuck with annotation. Can you just remind us all what annotation is, what this issue looks like in practice, and why it's so dangerous? hamel: Yeah, so annotation is, hey, you need to label some data. You need to look at data, you need to write some notes about the problems that you see so that you can spot patterns, and people get stuck there in various ways. One is they try to do it by committee, like I mentioned; that's the whole benevolent dictator thing, so I don't have to rehash that. You have a benevolent dictator. But also, try to make it really easy. It's really easy to vibe code an app nowadays to help you with data annotation; AI is really good at displaying data on a web interface, like a simple data annotation app. So you need to unblock [00:50:00] yourself. Because looking at data is so important, you should try to remove all the friction. So you often want to create your own custom data annotation tool that can render the data in a way that maximizes the ease of looking at it. For example, if you're looking at an email, render it to look like an email; if it has widgets, pictures, whatever in it, render all those things. Render it exactly how the user sees it so that you can read it effectively, and do small things like hide things that you don't need to see by default, bring in external information and metadata, and render it in ways that you can see it all on one screen. So almost everyone that I've worked with has made their own data annotation tools. hugo: That kind of dovetails really nicely into the next one, 'cause some of the biggest value I've got from AI myself in my work is [00:51:00] spinning up, vibe coding, these types of tools as well.
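A minimal sketch of the kind of throwaway annotation tool being described here, reduced to a command-line labeling loop; the file and field names are hypothetical. In practice you would more likely vibe-code a small web app that renders each trace exactly the way the end user sees it, but the point is the same: remove friction from looking at and labeling data.

```python
# Minimal sketch of a throwaway annotation loop (hypothetical file/field names).
import csv
import json

with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

with open("labels.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["trace_id", "label", "note"])
    for t in traces:
        print("\n=== trace", t["trace_id"], "===")
        print("USER: ", t["input"][:800])
        print("MODEL:", t["output"][:800])
        label = input("pass/fail? ").strip().lower()
        note = input("note on the first upstream error (blank if pass): ").strip()
        writer.writerow([t["trace_id"], label, note])
```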
Number eight of the things you hate was endlessly trying tools, like vector DBs, agent frameworks, new models, those types of things, instead of doing something like error analysis. So maybe you can tell us a bit about how you see this play out. hamel: Yeah. People can get stuck in endless loops of, hey, my retrieval is not working, let me try a new vector database. Or, hey, my agents aren't working, let me switch from LangGraph to the OpenAI Agents SDK. Or let me just try some different models and hope; let me just keep trying different models. Now, you shouldn't do that. If your focus is on tools, not processes, you're gonna drown. It's okay to switch vector databases, and it's okay to switch models, but you should have a hypothesis, a hypothesis informed by data. There's no point switching models if your retrieval is garbage, right? [00:52:00] And there's no point switching vector DBs if your retrieval is fine. Or even if your retrieval isn't fine, if you do analysis you might find that the vector DB is not the problem. It's the way that you are indexing documents, it's the way you're representing them, or maybe certain sources are not even in the database, whatever. You need to look at the data and figure out what the problem is, not just churn tools. A lot of people just get in this endless cycle of churning tools. hugo: Then the ninth one is something we touched on when we talked about how people can get started with evals. You talked about automating too quickly as well, but we also talked about using automation and then aligning with humans. I'll share a link in the show notes to a podcast I did with Philip Carter about the work you did with him at Honeycomb in the early days, a wonderful demonstration of the process of how to align LLM-as-judges with humans. But a pet peeve of yours is people putting LLMs in the judge's seat without grounding them in human oversight, right? hamel: Yeah. So one mistake is just [00:53:00] writing a prompt for your LLM judge and saying you're done. It's appealing, right? Just prompt and then you're done. That's it. But the problem is you don't know if that LLM judge is any good or not, and so you need to measure it with human labels. A lot of people don't know how to do that. That's a bit of a machine learning, data science technique: creating different splits, measuring how good you are, and iterating. You're doing hill climbing, and you're doing hill climbing in a way where you don't want to overfit. That's exactly the skill set of data scientists. And that's one of the things we teach people how to do, without having to teach them all of data science and machine learning. hugo: The other thing is, out of the box, in my experience and that of a lot of my friends, including you, LLM-as-judges don't necessarily perform well immediately, yet you can align them pretty well pretty quickly if you do it correctly. And by correctly, I actually mean, once again, do it in spreadsheets, do error analysis, [00:54:00] have the judge grade blind while you give your own blind judgment, then look at your judgment versus the judge's judgment, and then iterate on your judge there. hamel: Yeah, exactly. You usually need to iterate at least three or four times to get good alignment, or if you're not, you're leaving something on the table. It's kind of a miracle if you get one-shot alignment with just prompting; maybe the problem is easy.
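A minimal sketch of the measurement step they are describing, assuming you already have human pass/fail labels and the judge's verdicts for the same traces in one table, with a held-out slice so you don't overfit the judge prompt while iterating. File and column names are hypothetical.

```python
# Minimal sketch: check an LLM judge against human labels (hypothetical columns).
import pandas as pd

df = pd.read_csv("judge_vs_human.csv")  # columns: trace_id, human_label, judge_label

# Reserve a slice you only look at after you stop tweaking the judge prompt.
cut = int(len(df) * 0.7)
dev, holdout = df.iloc[:cut], df.iloc[cut:]

def report(split: pd.DataFrame, name: str) -> None:
    agree = (split["human_label"] == split["judge_label"]).mean()
    # Of the traces humans called "fail", how many did the judge also call "fail"?
    fails = split[split["human_label"] == "fail"]
    fail_recall = (fails["judge_label"] == "fail").mean() if len(fails) else float("nan")
    print(f"{name}: agreement={agree:.2f}, fail-recall={fail_recall:.2f}, n={len(split)}")

report(dev, "dev")
report(holdout, "holdout")
```

Each iteration on the judge prompt is scored against the dev slice; the holdout only gets read at the end, which is the overfitting guard Hamel alludes to.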
I mean, you can, but yeah, usually you need to iterate a bit. Hmm. And so it's important to do that process, exactly. hugo: And this is a slight tangent from the 10 things you hate about AI evals with Hamel Husain, but I am interested: when writing judges and building LLM-as-judges, a lot of the time a bit of in-context learning and few-shot prompting can help, so giving it good positive examples and negative examples. I've also found that a not insignificant proportion of the time the judge can overfit to those, and giving it [00:55:00] heuristics can perform better. So is it that you just gotta try these out, or do you have any guidance on when to use few-shot examples and when not to? hamel: Yeah, so it's really helpful to anthropomorphize the language model to decide this. And what I mean by that is, if you're explaining to an employee how to do a task, you want to give them guidelines for sure, to say, hey, this is what I want done, this is what you should think about, et cetera. Then you might provide some examples. And what are the best examples? The best examples are edge cases and things that are tough to describe with your guidelines, or that are tricky. You want the examples to flesh out some of the guidelines or give more color in some way. Those are the best kinds of examples that you can provide, usually. And so that can help a lot with the overfitting, 'cause what you're trying to do is explain the whole [00:56:00] task in a way that allows the language model to generalize as much as possible. And so if the meat of what you're explaining is in the examples and not in the guidelines, then yeah, even a human would overfit to that. Because you might get confused, and if no one's explaining to you why, then you would just become a parrot, just like anything else in life. So that's exactly right, and maybe it's not that surprising. hugo: Yeah, it isn't that surprising. But knowing how it plays out, and hearing from the trenches what works and what doesn't, is a lot of the value I get from the work and conversations with you, from your course, and from all the content a lot of our friends put out: just seeing what's happening on the front lines. So that is super helpful. So another one of your pet peeves, which I find so fascinating and which may not be obvious to a lot of people, is not using enough [00:57:00] AI in daily work, so not having intuition. The 10th thing you hate is engineers not using AI enough themselves, so they lack intuition about its strengths and limits. Can you tell us a bit about this? hamel: Yeah. I often get questions like, hey, what model should I use for this task? Or, hey, I have this idea, do you think it's okay? And those questions are fine, but sometimes those questions illuminate that the person is not using AI themselves. Very quickly they illuminate: hey, are you using AI in your personal work, in your workflow? Or is AI only confined to this product you're trying to build at work, and then it doesn't exist anywhere else? It is very important to get intuition on what AI is good at and what it's not good at, because that trickles down into a lot of things. It trickles down into your intuition about scope, what to build.
hamel: It [00:58:00] also lets you gain the healthy skepticism that you need in the right places to motivate things like evals, because nothing motivates you to write evals more than using AI on a constant basis and watching it fail in jagged ways. The word jagged is important: it's excellent here and there, sometimes it's amazing, and sometimes it fails, and maybe you have some intuition about why, but you don't know. That's exactly the experience you need for this. And because I use it constantly, I would never ask what model I should use; I would just say, I have no idea, let's try it, let's try different ones, what's on the frontier right now? I have some intuition, like, hey, I like this model, I've used this model for writing, I've used that one for coding, and you may start there, but it's good to try things. If you're building a product, that's different, right? You want to see what's good. So yeah, I think it's super [00:59:00] important to do that; otherwise you're going to get left behind. hugo: Totally. And a piece of advice I give people who want to get started with AI, technical or otherwise, is a piece of advice you gave me several years ago: go for a long walk and have a chat with an AI system about things you're interested in. See what comes up, see what it's good at, see what it isn't. That isn't necessarily building with it immediately, but you get a lot of insight into its strengths and weaknesses just from having a longer conversation with it. hamel: Are you still doing that? hugo: Yeah. hamel: What kind of conversations do you have, out of curiosity? hugo: Either about books I'm reading, for example, or, since I've been traveling a lot recently, about the places I'm going and their history. I'm also interested in current AI tool-calling and web-search capabilities, so I even try to get [01:00:00] up-to-date information weekly on what's happening in the AI space. I find it really interesting because a lot of the time they're horrible at it as well. If I try to chat with ChatGPT about last week in AI, for example, and ask what the five big things were for developers, it'll give me three things that are absolutely unimportant, one that's for non-technical people, and then maybe one real thing. So it's actually super interesting to get that insight for our space as well. hamel: Yeah. By the way, what tools do you like using for this exploration? hugo: The main ones would be ChatGPT, Claude, and Gemini in AI Studio. hamel: Oh, okay, cool. I like using Perplexity for this too, and Gemini, but probably more and more Perplexity. I like the speed of Perplexity; it's amazing. Yeah, it's interesting. hugo: I've used Perplexity on and off, but it doesn't give me much signal into how the models I'll be using will work. hamel: Oh, I see. Okay. hugo: Google AI Studio, though, is incredible. Huge fan. [01:01:00] hamel: Oh yeah, I'm a huge fan of it. I write a lot, all the time, all day, and I find Gemini to be really good. It sounds less like slop than everything else. hugo: Yeah.
And actually I'll link to something in the show notes: a post, or rather a lightning lesson you did with, I think, Isaac and Eleanor, or one of them, about using AI to write content. It's a cool post they put out with some code for how to do that, which I think a lot of people will find interesting. So we've talked about the 10 things you hate, and there is one more I want to get to, a bonus 11th one, but first I'm interested: what other tools are exciting you at the moment? And I don't mean in the actual work you're doing, but in playing with AI and having fun with it. hamel: The most exciting thing is coding agents. The most fun you can have is building things and trying different coding agents, pushing them to the limits. It's the product category that is, I would say, the furthest along in the AI space, arguably, [01:02:00] and it's the most magical kind of AI experience in a way, because it's ahead for a lot of reasons. I use them all a lot, just to test them: I use Claude, I use Codex, I use the Gemini CLI, OpenHands, Amp. I'm using a lot of these things. hugo: All of these are super cool. But when we get to things like Amp (check out Amp, everyone: it's an agentic coding tool where you don't even pick the model; it routes things depending on your request, which I'm both anti and pro, and I don't necessarily want to get into that conversation right now), it feels like a higher level of abstraction. OG vibe coding was copying and pasting from Stack Overflow; then you have chatting with an LLM like ChatGPT; then completion in your editor; then tools that will just write the code. But Amp feels like it abstracts it to a whole new [01:03:00] level. Do you think so? hamel: Yeah, I feel like it does. I like it a lot. It doesn't work well on everything, I would say, but it works well for a lot of things. For example, I recently had to refactor my blog extensively. I have this long LLM evals FAQ with 40 FAQs in it, and for SEO purposes I wanted to make each FAQ its own page while also keeping everything on one page, without duplicating all this stuff, and I wanted a system to render it all, so I had to figure that out. Amp was able to figure out the whole machinery, which is very impressive: splitting it out into all these templates, how to stitch it together, how to render it on the fly. I also wanted to make a PDF out of it. This is all stuff that I would never do without AI. hugo: Totally. And it reminds me of Manus in some ways. If people haven't checked out Manus, definitely have a look. hamel: Are you [01:04:00] using Manus right now, still? hugo: Yeah, I haven't used it in the past week or two, and I don't use it a lot, but I do play around with it. How about you? hamel: I used it a bit, but I turned off of it; I stopped using it. hugo: What I want to check out now is their email stuff. So, everyone: you may know ChatGPT can do agentic stuff; it will do web searches and try to figure out what you want. Manus, in a lot of ways, is a higher layer of abstraction: it will turn things into subtasks and try to figure out a whole flow, or protocol, or recipe for what you want to do.
You can tag Manus in emails now and it will automate things for you, and I want to test out that functionality. The reason I brought it up, though, is that I feel like Amp for coding is like Manus in the sense that it takes an agentic approach: it will do a lot of the work to figure out what you're getting at and to figure out the steps in the flow, as opposed to you specifying, "these are the five things I want you to do." I feel like we should do another [01:05:00] podcast on all of this cool stuff, to be honest. Yeah, totally. hugo: Man. So the 11th thing that you hate, and you've told me it's the thing you hate the most, is convincing engineers to exercise data thinking. We've been talking around this, but when you try to convince software engineers that evals require data thinking, what does that mean in practice, and how is it different from treating evals like just another test suite? hamel: Yeah. I get a lot of pushback. Not from everyone, but a significant amount of pushback about doing data analysis and doing evals the way that we teach, as opposed to doing some automatic stuff. The pushback I get from software engineers is, "Hey, we don't have time for this. Can you just write some tests? Just give me the tests. There must be some tests I can run that just tell me pass or fail, because I'm building some software and I need that." And I say, no, you need to do this analysis, [01:06:00] and they really push back and say, "Hey, we don't have the resources for that. We don't have the resources to look at data or look at anything." It can take a lot of unlearning for folks to accept: okay, this is a stochastic system, and there really isn't a good way to evaluate it or test it without looking at data. bryan: Mm. hamel: There is really no shortcut out of it. And software engineers can be very resistant to this idea that we need a data-scientist type of thinking. In some sense, software engineers feel empowered: they're able to build products and get very far with building AI products, and they can be very confident that they already have everything they need to build AI. But I think this can be a blind spot, and this data-science-y blind spot is just something a lot of software engineers haven't been taught. Some of the rhetoric has been, "Hey, if you're a software engineer, you already have all the skills you need." hugo: I love that, and I think that's been a through line of this entire conversation. I do think a lot of people coming into AI from the data science side need to learn a lot more about the software and ops side, and learn to collaborate with people who have those skills; from the software side, it's really the data thinking that's so important. We've cheekily indexed heavily on things you hate today, with a good dose of cheekiness, but so as not to devolve into too much cynicism, I'm wondering if we could wrap up with you telling us the thing you love most about evals. hamel: Yeah. The thing I love most about it is that it's actually really fun. It's a blast. Anytime you look at data, within 30 minutes you can find all kinds of problems, and when you find these problems, it makes everyone really happy.
They're like, "Oh, Hamel came to me and he [01:08:00] found these X, Y, Z problems, and he did it so fast. I've been building this application for a year and I didn't know about this. That's amazing. What the hell? What did you do?" And then I'm like, okay, I'm not a magician, let me show you how to do it. And it's really great; it's so much fun to do that. You may have experienced a similar thrill as a data scientist: someone asks, "Hey, why is churn increasing? Why is our LTV going down? Why does X, Y, Z happen in our business?" and you're able, as a data scientist, to dig through the data, find out why, and come up with some valid hypothesis. That's a lot of fun and it's very impactful. And I do think we need a new term for evals; I think Bryan is right, because fundamentally, in the limit, we're trying to shove all of data science into this term "evals." But it's just a lot of fun. I think there needs to be a revenge of the data scientists in AI. hugo: Amazing. I [01:09:00] love it. I know what I'm going to be doing with Nano Banana straight after this live stream. I really do love you reminding us of how much fun it is to do evals and look at your data, especially when you're building products where you're deeply interested in how they work and what value they deliver. It's the scientific mindset, it's the hacker mentality, and it's all the things that got all of us into data science and ML in the first place. I definitely didn't get into it for the environment-variable issues. hamel: Yeah, for sure. hugo: Super cool, man. Once again, everyone, Hamel and Shreya's next cohort of their course starts on October 6th. Is that correct, Hamel? hamel: That's right. Yeah. hugo: If you're interested, I'm putting the link in Discord and on YouTube as well; you can sign up, and the link gives you not 30 but 35% off, which is super exciting. I have also put a link to my course for those interested in learning a lot more about the entire software development lifecycle. I am just wondering, [01:10:00] Hamel, if you could tell us a bit about this next iteration of your course. You've taught it several times, so what's new? hamel: Yeah, we've taught it two times already, and we've tried to make it significantly better each time. What we learned from the first two times is that it's better to separate the lessons from the live interaction. So we professionally recorded all the lectures and edited them so there's no wasted time; all the lectures are recorded, and then we have, at this point, 12 office hours, which is quite a lot, dedicated to live questions and answers. That's completely different. The next new thing is that we're teaching people about AI, and this is an AI course, so you should have an AI assistant to help you work through the material in addition to us. We also have a Discord [01:11:00] that you get lifetime access to, where you can ask us any question you want. On top of that, we have uploaded everything we know about evals: any podcast we've ever done, including this one after it's finished, any paper we've written, blog posts, talks, course lectures, all the Discord chats (I think there are over 25,000 questions from Discord that we've curated), all of that stuff.
We've uploaded all of that into basically a chatbot, and we're giving that to students for 10 months. So we're trying to give people all the tools they need to succeed and being very aggressive about that. All of that is new for this cohort. hugo: Super cool. Having taken your course twice, I'm excited to jump in again, check out some of the new content, and see how to learn about AI with AI, with the chatbot as well. Thank you once again, everyone, for joining and for all the great questions on Discord. Hope to see you next time: subscribe on Luma, on our YouTube, all of those things. [01:12:00] Most of all, though, Hamel, I appreciate your time and knowledge. It's always great to chat about these things. hamel: Sounds great. Thank you. hugo: Absolutely. Thanks for tuning in, everybody, and thanks for sticking around to the end of the episode. I would honestly love to hear from you about what resonates with you in the show, what doesn't, anybody you'd like to hear me speak with, and topics you'd like to hear more about. The best way to let me know currently is on Twitter: at Vanishing Data is the podcast handle, and I'm at Hugo Bowne. See you in the next [01:13:00] episode.